Development of diagnostic model of lung cancer based on multiple tumor markers and data mining

Objective To develop an early intelligent discriminative model of lung cancer and evaluate its diagnostic value. Methods Based on the genetic polymorphism profiles of CYP1A1-rs1048943, GSTM1, mEH-rs1051740, XRCC1-rs1799782 and XRCC1-rs25489, the methylation of the p16 and RASSF1A genes, and telomere length in peripheral blood from 200 lung cancer patients and 200 healthy persons, discriminative models were established using decision tree and artificial neural network (ANN) techniques. Results The AUC of the discriminative models based on multiple tumor markers increased by about 10%. The accuracy rates of the decision tree model and the ANN model on the testing set were 93.00% and 89.62%, respectively. ROC analysis showed that the AUC of the decision tree model was 0.929 (0.894∼0.964) and that of the ANN model was 0.894 (0.853∼0.935). In contrast, the classification accuracy rate and AUC of the Fisher discriminant analysis model were both about 0.7. Conclusion The early intelligent discriminative model of lung cancer based on multiple tumor markers and data mining techniques has a higher accuracy rate and might be useful for the early diagnosis of lung cancer.


INTRODUCTION
According to WHO data, cancer is the second leading cause of death and caused about one-sixth of all deaths worldwide (8.8 million) in 2015. Lung cancer is the leading cause of cancer death; it caused 1.69 million deaths, accounting for about 19% [1]. In China, approximately 600,000 people die of lung cancer. The morbidity and mortality of lung cancer are the highest among malignant tumors [2]. The 5-year survival rate of stage IA lung cancer is 70%, but the overall rate is only about 15%, and the standardized mortality rates are expected to continue rising [3]. Therefore, improving early diagnosis has great clinical significance for the prevention and treatment of lung cancer.
With the development of gene expression profiling technology and data mining technology, the early molecular events of lung cancer can be obtained and analyzed, which is expected to enable the secondary prevention of lung cancer [4]. To date, low-dose computed tomography (CT), autofluorescence bronchoscopy, and liquid-based or conventional cytology are used for the diagnosis of lung cancer. These methods have made some progress but still have limitations in sensitivity, specificity and applicability. Thus, starting from serum markers, finding susceptible and effective biomarkers has become a hot research topic. At present, many single nucleotide polymorphisms (SNPs) associated with lung cancer have been found by GWAS, TaqMan probe (TaqMan real-time PCR) assays, and DNA sequencing technology, such as CYFRA21-1 [5,6], NSE [7], CA19-9 [8], KDM4A [9,10], TP53 [11], KRT81 [12], etc. In the epigenetic field, methylation, histone modification, RNA-associated silencing, and telomeres are also related to the development of lung cancer [13][14][15][16]. Multiple tumor markers are usually combined to improve the detection of early lung cancer, because a single tumor marker is not reliable.
www.impactjournals.com/oncotarget/ Oncotarget, 2017, Vol. 8, (No. 55), pp: 94793-94804 Research Paper
In this study, we screened biomarkers related to genetic susceptibility and epigenetic modification of relevant genes in lung cancer, analyzed the relationship between these biomarkers and the occurrence of lung cancer, and established an early intelligent diagnostic model of lung cancer based on multiple tumor markers and data mining techniques. We also compared the classification performance of data mining with that of Fisher discriminant analysis and explored the application value of tumor markers in the early warning of lung cancer, in order to construct an early intelligent model for diagnosis.

Comparison of general data of the study subjects
The age difference between the case group and the control group was statistically significant (P<0.05), whereas the gender difference was not (P>0.05). The smoking rate of the lung cancer group was higher than that of the control group, and the difference was statistically significant (P<0.05), as seen in Table 1.

Correlation between the methylation of the p16 and RASSF1A genes and lung cancer
The lung cancer patient group and the control group were divided into four strata according to the quartiles of the methylation level of the two genes. The results showed that increasing methylation of the p16 and RASSF1A genes correlated with increasing risk of lung cancer (P trend <0.05). When the subjects were instead divided into two strata at the median methylation level of each gene, a methylation level higher than the median was associated with an increased risk of lung cancer, as seen in Table 3.

Analysis of the association between telomere relative length and lung cancer
The lung cancer patient group and the control group were divided into four strata according to the quartiles of telomere relative length. With the long telomere group as the reference group, the risk analysis showed that shortening of telomere relative length correlated with increasing risk of lung cancer (P trend <0.001). When the subjects were divided into two strata at the median, the risk of lung cancer in the short telomere group was 3.258 times that of the long telomere group, and the difference was statistically significant, as seen in Table 4.
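The median-split comparison above can be illustrated with an unadjusted odds ratio from a 2x2 table. The sketch below is only an illustration: the counts are hypothetical (they are not the study's data and do not reproduce the reported value of 3.258), the function name `odds_ratio_ci` is ours, and the study itself used logistic regression, which may include adjustment for covariates.

```python
import math

def odds_ratio_ci(exp_cases, unexp_cases, exp_controls, unexp_controls, z=1.96):
    """Unadjusted odds ratio with a Woolf (log-based) 95% confidence interval.

    'Exposed' here means short telomere length (below the median);
    'unexposed' means long telomere length (the reference group).
    """
    odds_ratio = (exp_cases * unexp_controls) / (unexp_cases * exp_controls)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(
        1 / exp_cases + 1 / unexp_cases + 1 / exp_controls + 1 / unexp_controls
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, for illustration only.
or_value, ci_low, ci_high = odds_ratio_ci(
    exp_cases=120, unexp_cases=80, exp_controls=80, unexp_controls=120
)
print(f"OR = {or_value:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

A confidence interval that excludes 1 corresponds to a statistically significant association at the 0.05 level under this unadjusted analysis.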

Evaluation of lung cancer discriminative model based on 5 genetic polymorphisms
ROC analysis of the diagnostic value of the three models showed that the area under the ROC curve (AUC) of Fisher discriminant analysis was less than 0.7, indicating lower diagnostic accuracy, whereas the AUCs of the decision tree and ANN were both close to 0.9, indicating better diagnostic accuracy. The model prediction results are shown in Table 5 and Figure 1.

Evaluation of lung cancer discriminant model based on the methylation of p16 gene and RASSF1A gene and the relative length of telomere
ROC analysis of the diagnostic value of the models showed that the area under the ROC curve (AUC) of Fisher discriminant analysis was less than 0.7, indicating lower diagnostic accuracy, whereas the AUCs of the decision tree and ANN models were more than 0.7, indicating moderate diagnostic accuracy and a better diagnostic value than Fisher discriminant analysis, as seen in Table 6 and Figure 2.

Evaluation of lung cancer discriminant model based on tumor markers
With 75% and 25% of the samples randomly drawn as the training set and the prediction set, the Fisher discriminant analysis model achieved classification accuracy rates of 72.15% and 70.59% after repeated training. However, the classification accuracy rate was 92.96% and [...] showing moderate diagnostic accuracy. The AUC of the decision tree was more than 0.9, showing better diagnostic accuracy. The AUC of the ANN was also more than 0.9, showing better diagnostic accuracy. Therefore, the two data mining models have better diagnostic value than the discriminant analysis model, as seen in Table 7 and Figure 3.

Classification of lung cancer in early stage (I+II stage) by using decision tree and ANN model
By combining the genetic polymorphisms of CYP1A1-rs1048943, GSTM1, mEH-rs1051740, XRCC1-rs1799782 and XRCC1-rs25489, the methylation of the p16 and RASSF1A genes, telomere length, smoking status and other factors, an early-stage classification model of lung cancer was established using decision tree and ANN techniques through repeated training. We then classified early-stage (stage I+II) lung cancer and evaluated the effectiveness and diagnostic value of the model. The results showed that the classification accuracy of the decision tree model was 96.36% and that of the ANN model was 89.09%, indicating good classification performance, as seen in Figures 4 and 5.

DISCUSSION
Recent studies have indicated that the occurrence of lung cancer is a multifactor and multistep process, and that it results from the interaction between genetic and environmental exposure factors [17]. Tumor [...]. Therefore, such tumor markers are likely to be useful tools for the early diagnosis, treatment and prognosis of tumors. Accordingly, the genetic polymorphisms of CYP1A1-rs1048943, GSTM1, GSTT1, mEH-rs1051740 and XRCC1 (rs1799782, rs25489), the methylation of the p16 and RASSF1A genes, and telomere length were analyzed in peripheral blood from both lung cancer patients and healthy controls to explore their correlation with lung cancer. The results showed that all indexes were correlated with lung cancer to different degrees. Smoking had the closest relationship to lung cancer, which is consistent with other research results [19][20][21][22][23][24][25]. Compared with the diagnostic models based on separate groups of tumor markers, the AUC of each discriminative model based on multiple tumor markers improved by about 10%, which indicates that the sensitivity and specificity of diagnosis can be substantially improved by combining different tumor markers rather than relying on an individual tumor marker. Therefore, a multiple tumor marker analysis system is more suitable for constructing the early intelligent discriminative model of lung cancer.
Data mining, also called Knowledge Discovery from Database, is a complex process that extracts unknown and valuable knowledge, such as models or regular patterns, from massive, incomplete, fuzzy, noisy and random data [26][27][28]. It combines recent technologies such as database technology, machine learning, artificial intelligence, statistics, information retrieval and data visualization [29]. Fisher discriminant analysis is a traditional statistical classification method; the principle of this method is [...] [30]. The sensitivity, specificity and accuracy of the lung cancer discrimination models based on data mining technology were higher than those of the Fisher discriminant analysis model. Based on multiple tumor markers, the AUCs of the decision tree model and the ANN model were 0.929 and 0.894, respectively, whereas the AUC of the Fisher discriminant analysis model was 0.722, which indicates that data mining technology is more suitable for the lung cancer discriminant model. The indexes lack significant correlation with one another, and the various factors have complicated nonlinear relationships with lung cancer. The Fisher discriminant analysis model is a linear model, which places higher requirements on the data and has great limitations in analyzing the variation of a nonlinear data system [31]. Data mining technology behaves more intelligently when dealing with complex nonlinear data for which no precise mathematical model exists: it identifies and exploits the relationships and potential information among indicators through automatic learning and can describe fuzzy evaluations, so it is less restricted by data type [32,33]. In terms of methodology, Fisher discriminant analysis classifies data according to the statistical attributes of the samples, whereas data mining technology is based on logic and belongs to the category of intelligent machine learning.
Further comparison of the two discriminant models showed that the sensitivity, specificity and accuracy of the decision tree model were 90.7%, 94.74% and 93%, and each index of the decision tree model was better than that of the ANN model. The probable reasons are as follows. Firstly, an ANN is a processing network for complex information, composed of many simple processing units that are widely connected [34]; it needs to transform discrete attributes into numerical attributes, so an ANN is more susceptible to data attributes than a decision tree [35]. Secondly, the neural network is better able to handle time-series data [36][37][38], but it requires more data. Moreover, it relies on rich experience in training: the training sample set should contain all the patterns, and the input data should be as uncorrelated with one another as possible. These conditions place higher requirements on the sample data, which means the neural network is more suitable for larger databases [39]. In addition, the classification result of the decision tree model has a simple, clear and intuitive structure [40][41][42], which gives it more advantages than the ANN model in explaining and analyzing the results. Finally, in this study, tumor markers from 55 patients diagnosed with clinical early-stage (I+II) lung cancer were used to evaluate the effectiveness and diagnostic value of the model. The accuracy rates of the decision tree and ANN models were 96.36% and 89.09%, respectively. The diagnostic efficiency of the decision tree model was better than that of the ANN model.
Limitations need to be considered when interpreting the temporal and causal relationship between the occurrence of molecular events and lung cancer. Although we tried to recruit more cases at clinical stages I and II, the inherent limitations of the case-control design remain. As a next step, funding and technology permitting, we will verify the efficiency of the diagnostic model by expanding the sample size and/or conducting prospective studies.
The early intelligent discriminative models of lung cancer based on multiple tumor markers and data mining techniques have a better diagnostic effect and are of considerable significance for diagnosing early-stage lung cancer.

Detection method for each index
Genomic DNA was extracted from 2 ml of blood according to the instructions of the QIAamp DNA Mini kit.
The methylation levels of p16 and RASSF1A were detected by real-time methylation-specific PCR [48][49]. The relative telomere length was detected by real-time fluorescence quantitative PCR. GAPDH was used as the reference gene [50].

General statistical analysis of data
General statistical analysis was performed with SPSS 21.0 software, and the description method was chosen according to the type of data: mean±standard deviation was used for normally distributed data, and median and inter-quartile range were used for data that were not normally distributed. Measurement data were compared between groups using Student's t test or the Wilcoxon rank sum test; count data were compared between groups using the chi-square test. The correlation between the indicators and lung cancer was determined using logistic regression. α=0.05.
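As an illustration of the chi-square comparison of count data described above, the following sketch computes the Pearson chi-square statistic and its P value for a 2x2 table. The counts are hypothetical, the function name `chi_square_2x2` is ours, and no continuity correction is applied; this is not the output of the SPSS analysis used in the study.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom) for the table [[a, b], [c, d]].

    No continuity correction is applied (an assumption of this sketch).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: smokers / non-smokers among cases and controls.
chi2, p_value = chi_square_2x2(a=120, b=80, c=80, d=120)
print(f"chi-square = {chi2:.2f}, P = {p_value:.2e}")
print("significant at alpha = 0.05" if p_value < 0.05 else "not significant")
```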

Data mining model establishment
All the data were normalized to [0, 1] with the min-max method.
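Min-max normalization maps each value x of an index to (x − min) / (max − min). A minimal sketch follows; the function name `min_max_normalize`, the example values, and the handling of a constant column are our assumptions, since the study performed this step in its own software.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; mapping it to 0.0 is a convention
        # chosen for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical relative telomere lengths (illustrative values only).
telomere = [0.5, 1.0, 1.5, 2.5]
print(min_max_normalize(telomere))  # [0.0, 0.25, 0.5, 1.0]
```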
The data were divided into a training set and a prediction set at a ratio of 3:1 using the random sampling function of SPSS Clementine software.
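A 3:1 random split of this kind can be sketched as follows. This is not the Clementine sampling routine: the function name `split_train_test`, the fixed seed, and the placeholder subject IDs are assumptions made for illustration, and the split is a simple random one that is not stratified by case or control status.

```python
import random

def split_train_test(records, train_fraction=0.75, seed=0):
    """Randomly partition records into a training set and a prediction set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder IDs standing in for 200 cases + 200 controls.
subjects = list(range(400))
train_set, prediction_set = split_train_test(subjects)
print(len(train_set), len(prediction_set))  # 300 100
```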
The diagnostic models of lung cancer were established using the Fisher discriminant analysis, decision tree C5.0 and BP neural network algorithms in SPSS Clementine 12 software.
The model was evaluated as a diagnostic test; the indexes included sensitivity, specificity, accuracy, area under the receiver operating characteristic curve (AUC), positive predictive value and negative predictive value. An AUC less than 0.5 indicates that the diagnosis has no significance; an AUC between 0.5~0.7 indicates lower diagnostic accuracy; an AUC between 0.7~0.9 indicates medium diagnostic accuracy; and an AUC of more than 0.9 indicates higher diagnostic accuracy.
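The evaluation indexes listed above can be computed from a confusion matrix, and the AUC can be computed as the probability that a randomly chosen case receives a higher model score than a randomly chosen control. The sketch below makes several assumptions: the function names are ours, the confusion-matrix counts are an inference chosen because they reproduce the sensitivity, specificity and accuracy reported for the decision tree model (they are not taken from the paper's tables), the scores are hypothetical, and the assignment of the boundary values 0.5, 0.7 and 0.9 to a grade is our choice.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic-test indexes from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "ppv": tp / (tp + fp),  # positive predictive value
        "npv": tn / (tn + fn),  # negative predictive value
    }

def auc_from_scores(case_scores, control_scores):
    """AUC as the fraction of case/control pairs ranked correctly (ties count 1/2)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

def grade_auc(auc):
    """Map an AUC to the grading scale described in the text."""
    if auc < 0.5:
        return "no diagnostic significance"
    if auc <= 0.7:
        return "lower accuracy"
    if auc <= 0.9:
        return "medium accuracy"
    return "higher accuracy"

# Inferred counts (assumption): 43 cases and 57 controls in a prediction set of 100.
metrics = diagnostic_metrics(tp=39, fn=4, tn=54, fp=3)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")

# Hypothetical model scores for three cases and three controls.
example_auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(f"example AUC = {example_auc:.3f}")

print(grade_auc(0.929))  # higher accuracy
print(grade_auc(0.722))  # medium accuracy
```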