Integrative microRNA and gene profiling data analysis reveals novel biomarkers and mechanisms for lung cancer

Background Studies on the accuracy of microRNAs (miRNAs) in diagnosing non-small cell lung cancer (NSCLC) have still controversial. Therefore, we conduct to systematically identify miRNAs related to NSCLC, and their target genes expression changes using microarray data sets. Methods We screened out five miRNAs and six genes microarray data sets that contained miRNAs and genes expression in NSCLC from Gene Expression Omnibus. Results Our analysis results indicated that fourteen miRNAs were significantly dysregulated in NSCLC. Five of them were up-regulated (miR-9, miR-708, miR-296-3p, miR-892b, miR-140-5P) while nine were down-regulated (miR-584, miR-218, miR-30b, miR-522, miR486-5P, miR-34c-3p, miR-34b, miR-516b, miR-592). The integrating diagnosis sensitivity (SE) and specificity (SP) were 82.6% and 89.9%, respectively. We also found that 4 target genes (p < 0.05, fold change > 2.0) were significant correlation with the 14 discovered miRNAs, and the classifiers we built from one training set predicted the validation set with higher accuracy (SE = 0.987, SP = 0.824). Conclusions Our results demonstrate that integrating miRNAs and target genes are valuable for identifying promising biomarkers, and provided a new insight on underlying mechanism of NSCLC. Further, our well-designed validation studies surely warrant the investigation of the role of target genes related to these 14 miRNAs in the prediction and development of NSCLC.


INTRODUCTION
Non-small cell lung cancer (NSCLC) remains one of the leading causes of cancer death, with a high mortality rate worldwide [1,2], accounting for over one quarter of cancer deaths in 2014 [1][2][3]. Recently, many studies have reported promising biomarkers for differential diagnosis of NSCLC [4 -13]. However, accurate biomarkers of NSCLC still remain largely unexplored.
Currently, the discovery of microRNAs (miRNAs), a class of small non-coding RNAs, has opened up a new perspective for cancer prediction and provides a novel approach for the initial screening of cancer, including NSCLC [14] [4]. Emerging evidence has reported that miRNAs are remarkably aberrant in tumors [15][16][17] [5][6][7], and may be involved in initiation and progression of NSCLC [18][19][20] [8][9][10]; in addition, due to their inherent nature, miRNAs seem to remain highly stable and provide more accurate prediction factors for clinical specimens [21,22] [11,12]. The above discovery shows that miRNAs are suitable as biomarkers for the diagnosis of NSCLC.
Unfortunately, several conflicting results are still present in independent studies [23,24] [13,14], which are often explained by different miRNA profiling systems and platforms. Although they separately have promising value for cancer differentiation, a systematic analysis of these collected data may be essential for further exploration of the applicability of miRNAs as biomarkers for the prediction of NSCLC.
Thus, our meta-analysis answers three questions: (1) whether some of the miRNAs could differentiate tissues as NSCLC or control, (2) whether there were relationships between promising miRNAs with target genes in functional annotation and pathways, and (3) whether genes targeted by these miRNAs are associated with NSCLC initiation and progression.

Regulation and predictive value of miRNA expression in lung cancer tissue
To determine whether the expression of miRNAs could be used to identify NSCLC and control cases, our initial search yielded 19 relevant data sets. After removing 3 duplicated data sets and 11 unqualified data sets ( Figure  1), three primary data sets(GSE15008, GSE36681, GSE29248) as a training cohort were further examined in this meta-analysis, which was comprised of a total of 263 cancer tissue samples and 236 control tissue samples. We received another two complete sets of miRNA data (GSE51853, GSE19945) as a validation cohort, which was composed of a total of 127 tissue samples. The five lung cancer microarray data sets was used to used to analyze the miRNA expression profiles of NSCLC tissues relative to their normal controls. The characteristics of these samples are shown ( Table 2, Table 4). Microarray data sets were normalized by control normalization algorithm using Agilent's GeneSpring 13.0. After normalization, batch effect was removed ( Figure 2).    (The threshold lines indicate 5% P-value. The bigger the -log (p-value) of pathway is, the more significantly the pathway is adjusted.) 470 overlapped miRNAs were differentially regulated by cancer cases with a cut off p-value of 0.05 and fold change of 1.5. Meanwhile we developed an algorithm based on the Weka tool to construct miRNAs predictive models by data mining; subsequently, we selected 14 miRNAs as a new integrated training set to construct a predictive model. The 14 miRNAs model performed stably in distinguishing between NSCLC and control cases, with a sensitivity of 82.6%, a specificity of 89.9%, a positive predictive value (PPV) of 87.5%, and a negative predictive value (NPV) of 85.8%. The AUC of the training set was 0.913. Up-regulated miRNAs (miR-9, miR-708, miR-296-3p, miR-892b, miR-140-5P) and nine down-regulated (miR-584, miR-218, miR-30b, miR-522, miR486-5P, miR-34c-3p, miR-34b, miR-516b, miR-592) in the 14 top miRNAs were significantly regulated by tumor cases (Table 1). Moreover, in validation, the 14 miRNAs model produced prediction sensitivity that increased continually and significantly: SE=88.14%, SP=91.18%, PPV=89.66%, NPV=89.86%, AUC=0.905 A study [25] reported that the target genes of multiple miRNAs play a crucial role in controlling stimulatory or inhibitory activity in tumorigenesis. Thus, the potential target genes of these miRNAs need to be identified.

MiRNA target prediction and functional analysis
In order to identify potential miRNA target genes, we first queried the three most popular computational databases MiRBase [26], PicTar [27], and Targetscan [28] to scan target genes on the principle of mutual recognition. 1743 overlapping target genes related to the top 14 miRNAs emerged as a particular group. Then physiological pathways of target genes were analyzed using the Ingenuity Pathway Analysis (IPA) tool.
Interestingly, the top 10 significant pathways which are shown in Figure 3 were enriched by the 1473 genes associated with cancer initiation and progression. Among them, Axonal Guidance Signaling Pathway, Insulin-like growth factor-1(IGF-1) Signaling, Integrin Signaling Pathway, and Ephrin Receptor Signaling Pathway were highly associated with NSCLC initiation and progression. The Axonal Guidance Signaling Pathway involves 77 target genes with NSCLC, the IGF-1 Signaling Pathway involves 22, the Integrin Pathway involves 35, and Ephrin Receptor Signaling Pathway involves 31. In general, these genes were regulated by each other either directly or indirectly. Yet the underlying values for these genes associated with pathways have not been clearly illuminated. Thus, it is necessary to investigate the interaction between target genes and the 14 significant miRNAs.
Validate the 14 significant miRNAs using gene microarray data, and explore the correlation with target genes In this meta-analysis, the predicted performance of miRNAs was further verified by gene expression data sets. After excluding those studies according to the previous including criteria, 6 studies remained. Gene expression data sets (GSE1987, GSE43458, GSE33532 (paired), GSE2514, GSE19804, GSE33532 (unpaired)) with 7254 common genes were extracted from Gene Expression Omnibus, which included 195 NSCLC and 178 normal cases. After normalizing the raw data, controlling sample quality, correcting background, and performing log2 transformation, the miRNAs were filtered according to a t-test p-value cut off of 0.05 and a 1.5 fold change cut off. A Bayesian statistical analysis with 5% false discovery rate (FDR) was selected as one of three criteria for significant variable value.
Statistical analysis identified 1263 differentially expressed genes (p < 0.05, FC > 1.5) in NSCLC versus normal cases. Moreover, we found 900 genes with FDR < 0.05 and FC > 1.5 targeted by 14 miRNAs, in which 100 genes had an FDR < 0.05 and FC > 2.0. Among them 71 genes were down-regulated and 29 genes were up-regulated in NSCLC cases. The 100 gene list of better FDR score were uploaded into the IPA tool. A gene network was computed (Figure 4). Nodes colored in red and green indicate up-regulated and down-regulated gene respectively. We could clearly see the gene interaction between the two regulation directions. The top 20 significant genes are listed in Table 3.
To further assess the prediction abilities of gene cross-validation, we divided six data sets which were retrieved from PubMed into two sets (training set, testing set) according to sample size. The training set was composed of 3 paired specimens (GSE19804, GSE33532, GSE43458) containing a total of 230 samples respectively. The remaining 3 data sets for testing were comprised of 3 unpaired specimens (GSE1987, GSE43458, GSE2514) containing a total of 143 samples. The characteristics of these samples are shown in Table 3.
Meanwhile we developed an algorithm based on the Weka tool to construct gene predictive models by data  (Table 5). In addition, hierarchical cluster analysis showed that the samples of training set and testing set were also clearly separated into 2 main classes ( Figure 5). This shows that these core genes can discriminate between NSCLC cases and normal cases.

Verification of microarray responses using real time QRT-PCR to verify the credibility of microarray and gene network modeling results
To verify our microarray meta-analysis results, we chose two cell lines, A549 lung adenocarcinoma cell lines and normal lung epithelial cells NL20, to conduct the experiments. We selected four miRNAs and all the four gene markers to perform real time quantitative PCR (QRT-PCR) in the two cell line. As illustrated in Figure  8, compared to normal control cell lines, has-miR-9, has-miR-296-3P, and the gene FAP were up-regulated whereas has-miR-522, has-miR-34b, the gene KIAA1462, the gene MMD and the gene CBX7 were down-regulated.

DISCUSSION
In our study (Figure 6), we focused primary on whether promising miRNAs could act as accurate biomarkers to discriminate NSCLC from normal cases by taking advantage of miRNA array data sets. We selected 5 microarray data sets and set out to systematically identify promising miRNAs that distinguish NSCLC and control.
In order to explore these interactions between miRNAs and target genes, we decided to perform a pathway analysis using the list of overlapped target genes referenced by the three computational databases. The top 10 significant pathways enriched 1473 genes associated with cancer initiation and progression.
In our following study of target genes, we took advantage of statistical computer tools to mine some available data for target genes, and then subsequently hold overlapping genes. We found that over half of the target genes with better FDR and higher FC were involved in NSCLC. The gene networks showed that many of these genes related to NSCLC, and interacted with each other.
To seek genes offering greater sensitivity and specificity, a statistical model based on six gene data sets was built. Finally, we selected the 4-gene index (KIAA, MMD, CBX7, and FAP) as a novel biomarker for diagnostic prediction of NSCLC. The index achieved 96.7% sensitivity, 88.1% specificity, 89.9% PPV, and 96.0% NPV in the training set, and with higher significance in testing set (SE = 98.7%, SP = 82.4%, PPV = 86.1%, NPV = 98.3%). Emerging reports [27,40,41] showed that MMD, CBX7, FAP played an important role in the proliferation of lung cancer [30,43,44]. It is noteworthy that the gene KIAA has never been reported as related to cancer, serving only as a risk factor in coronary artery disease [45] [42]. Our study showed KIAA was frequently directly or indirectly associated with the 14 promising miRNAs in NSCLC. Therefore, based on both miRNAs and target genes level, we generated a hypothetical model that can explain genetic and environmental factors that trigger NSCLC ( Figure  7). Genetic and environmental factors could affect the expression of miR-9, miR-584, miR-708, miR-218, miR-296-3p, miR-30b, miR-522, miR-486-5p, miR-34c-3p, miR-892b, miR-34b, miR-516b, miR-140, miR-592, then that of KIAA, MMD, CBX7, and FAP, and through interaction finally result in tumorigenesis. As illustrated in our experiment, has-miR-9, has-miR-296-3P, and the gene FAP were up-regulated whereas has-miR-522, has-miR-34b, the gene KIAA1462, the gene MMD and the gene CBX7 were down-regulated. The QRT-PCR results were consistent with the microarray meta-analysis results.
Overall, our results not only demonstrate that combining miRNAs and target genes improves our ability to identify promising biomarkers, but it also contributes to greater insight on new potential mechanisms and functions for predicting NSCLC.
Although we tried to avoid bias in our study, certain limitations still need to be considered while interpreting the result of our study. More research with experimental validation is clearly needed in order to find promising miRNAs and target genes using microarray platform to real-time RT-PCR assays, allowing broader accession and utilization in future clinical application. Third, further work is needed to investigate the relationship between miRNAs and genes.
Despite the above limitations, our study is the first meta-analysis to predict NSCLC from microarray data sets at both miRNA and gene level. This study also avoids distinguishing expression patterns in promising target genes that contradict those of the miRNAs. Detecting NSCLC using miRNAs and core genes still needs further validation as well.

Search strategy, eligibility and data extraction
Microarray data sets were extracted from NCBI and Gene Expression Omnibus by means of the MESH terms 'lung cancer/lung neoplasm/NSCLC' and 'microRNAs/ miRNA', in combination with the keyword 'lung tumor/ lung neoplasm/NSCLC' and 'gene expression/target gene', without restriction of language or publication.
Three reviewers (Ling Hu, Junmei Ai, and Hui Long) independently extracted the following data from all eligible studies. Eligible data sets had to meet the following criteria: all sample data sets (i) were from humans, (ii) focused on the diagnostic potential of miRNAs/genes for LC tissue, (iii) included microRNA array, (iv) came from raw data rather than matrix data/normalized data, and (v) were part of studies with included false discovery rate (FDR) and fold-change (FC) calculations. All data sets used in this study are summarized in Table 4.

MiRNAs and genes microarray data processing
The scale of miRNAs/genes expression in microarray data sets was consistently different due to different platforms and different batches [46] [15]. All statistical data sets were normalized and standardized to be approximately equal in scale and normally distributed. There is general agreement on the normalization of single miRNA/gene expression using the median value of expression of all miRNAs/genes of each data set [47,48] [16,17], and the expression of each case in each data set was compared with the respective control samples. We combined log 2 transformed data sets from different platforms into three: the miRNA data set, a paired gene data set, and an unpaired gene data set. 5% FDR in Bayesian statistical analysis was used to find statistically significant differential miRNAs between cancer and control cases.

Cell culture
A549 lung adenocarcinoma cell lines and normal lung epithelial cells (NL20) were purchased from the American Type Culture Collection (ATCC). The cells were cultured in minimum essential medium, Dulbecco's modified Eagle's medium (DMEM), and Ham's F12 medium supplemented with 10% fetal bovine serum (FBS) (Sigma Chemical Co., St.Louis, USA), penicillin (100 U/ mL) and streptomycin (100 µg/ mL) as antibiotics in a humidified atmosphere of 5% CO 2 at 37 o C.

RNA extraction
Total RNA was extracted using Qiagen miRNeasy kit (Qiagen, Valencia, CA) according to the manufacturer's protocol. In brief, the cell pellet was mixed with QIAzol Lysis Reagent and chloroform. After centrifugation at 12,000g at 4°C for 15 min, the aqueous phase was transferred into another tube, and 1.5 volumes of absolute ethanol were added. The mixture was then applied to miRNeasy Mini kit columns, following by washing with RWT and RPE buffers. The RNAs were finally eluted in 40 μl of RNase-free water.

Quantitative RT-PCR
MiRNAs and genes were measured using Taqman miRNA assay kits (Applied Biosystems, USA) according to the manufacturer's protocol. Briefly, about RNA was reverse transcribed with a TaqMan Reverse Transcription Kit (Applied Biosystems, USA). Expression levels of miRNAs and genes were quantified in triplicate by qRT-PCR using human TaqMan Assay Kits (Applied Biosystems, USA) on the ABI 7500 thermocycler (Applied Biosystems) according to the manufacturer's protocol. The expression value of miRNAs were normalized against an internal control (U6 RNA) and expression value of genes (mRNAs) were normalized using the internal control GAPDH.