Identification of lung adenocarcinoma specific dysregulated genes with diagnostic and prognostic value across 27 TCGA cancer types

As the most common histologic subtype of lung cancer, lung adenocarcinoma (LUAD) contributes to a majority of cancer-related deaths worldwide annually. In order to find specific biomarkers of LUAD that are able to distinguish LUAD from other types of cancer so as to improve the early diagnostic and prognostic power in LUAD, we analyzed 10098 tumor tissue samples across 27 TCGA cancer types and identified 112 specific expressed genes in LUAD. Meantime, 8240 LUAD dysregulated genes in tumor and normal samples were identified. Combining with the results of specific expressed genes and dysregulated genes in LUAD, we found there were 70 specific dysregulated genes in LUAD (LUAD-SDGs). Then ROC curve revealed six LUAD-SDGs that may be of strong diagnostic value to predict the existence of cancer (area under curve[AUC] > 95%). Kaplan-Meier survival analysis was performed to identify 6 LUAD-SDGs associated with patients’ prognosis (P-values < 0.001). Multivariate Cox proportional hazards regression was employed to demonstrate that the six LUAD-SDGs were independent prognostic factors. Then, we used the six overall survival (OS)-related LUAD-SDGs constructing a six-gene signature. Multivariate Cox regression analysis suggested that the six-gene signature was an independent prognostic factor of other clinical variables (hazard ratio [HR] = 1.5098, 95%CI = 1.2996-1.7538, P < 0.0001). Based on our findings, we first presented the LUAD-SDGs for LUAD diagnosis and prognosis. Our results may provide efficient biomarkers to clinical diagnostic and prognostic evaluation in LUAD.


INTRODUCTION
Lung cancer is one of the most frequently diagnosed cancers and contributes to the majority of cancer-related deaths. According to the latest statistics, there were about 1.8 million new lung cancer cases and 1.5 million people who died of lung cancer in the future [1]. As a subtype of lung cancer, lung adenocarcinoma (LUAD) is continuously growing in the proportion of lung cancer, which is presently the top diagnosed histological type in adult men and women [2]. Smoking is the leading cause of lung cancer, but LUAD is the histological type showing weakness associated with smoking, which occurs mainly to non-smokers and females [3,4]. Oncogene aberrations are now the most popular study factors that contribute to the carcinogenesis of LUAD [5,6]. Although there are lots of www.impactjournals.com/oncotarget/ Oncotarget, 2017, Vol. 8, (No. 50), pp: 87292-87306

Research Paper
contributions to individualized therapy of LUAD, patients with advanced-stage tumors have an overall survival of less than 2 years, which is mainly because they have lost the chance of surgery with a limiting selection of chemotherapy [7]. Besides, the molecular biomarkers that can effectively predict the outcome of LUAD patients have not been fully elucidated. Therefore it is of significant importance to figure out new LUAD special biomarkers for the diagnosis and prognosis of LUAD and to improve therapeutic effects.
In the genomics era, there emerge a lot of high throughput sequencing technologies and databases. They have contributed to the development of diagnostic and prognostic signatures of cancer. RNA-seq is a recently developed approach to deep-sequencing. It can identify unmapped genes, unrecognized non-coding RNAs and splice variants [8]. Iris H. Wei et al used RNA-seq data from international Cancer Genome Consortium and The Cancer Genome Atlas (TCGA) datasets conducting cancer-specific signatures. These signatures had high sensitivity and specificity and were able to identify metastatic cancer of unknown primary, including tumors originating from lung [9]. Besides, by using TCGA lung adenocarcinoma datasets, Anlin Feng et al found LUAD patients with high HMGB1 expression had a poor overall survival [10]. These indicate that these newly developed genome sequencing databases and methods can develop clinical biomarkers for LUAD. And there are some signatures which have been investigated to involve in the pathogenesis of lung cancer and have clinical associations with LUAD. Tony et al used a loss-offunction model to identify MALAT1 which was an active regulator of LUAD metastasis [11]. Ming et al. showed that decreased BANCER expression was associated with poor prognosis of lung cancer patients [12]. Gao et al found overexpression of serum miR-155 could be a diagnostic marker for the early detection of LUAD [13]. With the help of the next-generation sequencing biomarkers, we were able to detect the LUAD patients and predict the outcome of these patients in an early stage.
In this paper, in order to identify LUAD-SDGs, we analyzed large scale RNA-seq transcriptomes of 10098 tumor samples across 27 cancer types containing LUAD from TCGA. Meanwhile, dysregulated genes in LUAD were identified. Combining with the diagnostic and survival analysis, we found robust diagnostic and prognostic LUAD-SDGs. By utilizing the prognostic LUAD-SDGs, we conducted a six-gene signature that could effectively predict patients' survival.

Identification of specific dysregulated genes of lung adenocarcinoma (LUAD-SDGs)
To describe our study more clearly, we made a flow chart ( Figure 1). We first identified the LUAD-SDGs. We found there were 112 specific expressed genes of LUAD compared with other 26 different types of cancer. The principal components analysis (PCA) demonstrated separation between LUAD and other 26 cancer types ( Figure 2A). There were 8240 dysregulated expressed genes between LUAD tissues and normal tissues. PCA and hierarchical clustering showed clear separation and consistency in the expression profiles of the tumor and normal tissues of LUAD ( Figure 2B, 2C). The dysregulated expressed genes of LUAD were shown with volcano plot ( Figure 2D). We combined the results and demonstrated that there were 70 genes specifically dysregulated in LUAD ( Figure 2E).

Diagnostic value of specific dysregulated genes of lung adenocarcinoma (LUAD-SDGs)
Next we analyzed the possible diagnostic power of LUAD-SDGs in LUAD and tried to find the LUAD-SDGs with high diagnostic value. To evaluate the prediction performance of LUAD-SDGs, we performed the receiver operative curves (ROC). LUAD-SDGs with an AUC of more than 0.95 were selected as genes that may serve as a biomarker in the diagnosis of LUAD ( Figure 2F). As shown in Figure 3A, we demonstrated that there were great differences between LUAD patients and control groups. The sensitivity and specificity of MARCO, SFTPA2, SFPTA1, CHIPA2, SFTPC and RPL13AP17 were 0.9749, 0.9525, 0.9715, 0.9519, 0.9988 and 0.9673 respectively in the test group. We also showed the similar results in the validate group, whose sensitivity and specificity of MARCO, SFTPA2, SFPTA1, CHIPA2, SFTPC and RPL13AP17 were 0.9760, 0.9558, 0.9641, 0.9564, 0.9972 and 0.9803 respectively ( Figure 3B; Table  1). We found that SFTPC ranked top in terms of AUC in both test and validate group with average sensitivity of 100% and 97.6% specificity. These six LUAD-SDGs showed high diagnostic power in predicting LUAD, which indicated they may serve as important biomarkers in identification of LUAD. We validated the expression of the six LUAD-SDGs in the same TCGA datasets using t-test. We found that the expression of the six LUAD-SDGs was higher in LUAD than that of other tumors in 26 cancer types (log2FC > 1, P < 0.001). Compared with the expression of normal lung tissues, the expression of the six LUAD-SDGs was lower (log2FC < 1, P < 0.001) ( Figure 4A, 4B), which confirmed that these genes were specific dysregulated genes in LUAD and were able to serve as diagnostic biomarkers for the diagnosis of LUAD.

Prognostic role of specific dysregulated genes of lung adenocarcinoma (LUAD-SDGs)
Then we sought to investigate the correlation between all of the 70 LUAD-SDGs and overall survival of LUAD patients. We identified that there were six LUAD-SDGs whose expression were associated with the overall survival of LUAD patients, including RP11-513N24.11, RP11-206P5.2, SFTPB, CHIAP2, BMP5 and MS4A15 (Table 1). The Kaplan-Meier curves demonstrated that LUAD patients with higher expression of the six LUAD-SDGs had better survivals than those with low expression of the six LUAD-SDGs (log-Rank test, P < 0.001) ( Figure 5A-5F). In the univariate Cox regression analysis, we found that the six LUAD-SDGs were significantly associated with overall survival. Then we performed the multivariate Cox regression analysis to identify the independent prognostic role of the six LUAD-SDGs by adjusting other significant covariates including gender, stage, smoke and age. The results showed that the six LUAD-SDGs were independent prognostic factors in overall survival and were protective genes (HR < 1) for LUAD ( Figure 6A-6F; Table 2). Compared with other 26 types of cancer, the expression level of the six LUAD-SDGs in LUAD was the highest (log2FC > 1, P < 0.001). In non-cancer tissue of LUAD, the six LUAD-SDGs had higher expression level than the cancer tissue of LUAD. All the results suggested that the six LUAD-SDGs may act as special prognostic biomarkers for LUAD patients. (Figure 7A, 7B).

The predictive role of a six-gene signature in LUAD
We used the six OS-related LUAD-SDGs to construct a six-gene signature. A risk score was created based on a linear combination of the expression profiles of prognostic genes weighted by estimated regression coefficient, a risk score was created as follows: Risk score = (-0.013 * expression level of RP11-513N24.11) + (-0.066 * expression level of RP11-206P5.2) + (-0.029 * expression level of SFTPB) + (-0.028 * expression level of CHIAP2) + (-0.087 * expression level of BMP5) + (-0.029 * expression level of MS4A15). Then, patients were divided into high-risk group (n = 131) and low-risk group (n = 392) using the 75 th percentile risk score as the cutoff point to investigate the prognostic role of the six-gene signature in overall survival. We observed that patients in the high-risk group had lower expression of the six genes compared with low-risk group ( Figure 8A). This was consistent with the results of multivariate Cox regression analysis of the six genes, which demonstrated that all of them were protective genes ( Figure 6A-6F). The univariate analysis showed that the six-gene risk score was significantly related to prognosis of LUAD patients ( Table 2). The multivariate analysis indicated that the risk score statistically significantly stratified the patients for overall survival (HR =1.5129, 95%CI = 1.2988-1.7624, P < 0.0001), independent of age, gender, smoke and pathological stage. (Figure 8B, 8C; Table 2). We also evaluated the six-gene prognostic signature in patients at pathological stage I-II, pathological stage IV-III, N0 (without lymph node metastasis), M0(without distant metastasis). As a result, the stratification analysis showed that the six-gene signature could predict patients with different prognosis in different subgroups including pathological stage I-II, pathological stage IV-III, N0, M0 ( Figure 8D-8G).

DISCUSSION
In our study, we first compared the expression of genes in 27 cancer types including LUAD between tumor and normal. As a result, all 70 LUAD-SDGs were identified. As we know, there were several known cancerspecific biomarkers that had already been translated into clinical diagnostic and prognostic signatures. For example, compared with other malignancies, alpha-fetoprotein (AFP) was specific expressed in hepatocellular carcinoma [14]. Besides, FUMIO et al showed hepatocellular carcinoma patients with normal AFP values had a better survival than those with high values, which indicated that AFP was a useful prognostic signature for hepatocellular carcinoma patients [15]. Welch et al showed prostate specific antigen (PSA) screening increased diagnosis of prostate cancer in patients younger than 50 years in the USA from 1986 to 2005 [16]. Inspired by these cancerspecific biomarkers, we are interested in whether the LUAD-SDGs can serve as good biomarkers for patients' diagnosis and prognosis.
In order to explore the above questions, we conducted the ROC to evaluate the diagnostic role of LUAD-SDGs. The cutoff point was selected (AUC > 95%) and only six genes: MARCO, SFTPA1, SFTPA2, CHIPA2, SFTPC, and RPL13AP17 reached this condition.
Among the six LUAD-SDGs, surfactant protein A including SFTPA1 and SFTPA2 were reported closely associated with lung cancer. It has been reported that SFTPA1 mutation was associated with familial idiopathic interstitial pneumonia and lung cancer, which shed light on the key role of SFTPA1 in the pathogenesis of several chronic respiratory diseases, especially in lung cancer [17]. Genetic defects in SFTPA2 were demonstrated closely related to lung cancer [18]. Surfactant protein A plays an important role in the maintenance of normal lung function, and the abnormal surfactant protein A may lead to the disorder of lung innate immunity and cause the occurrence of lung cancer. What's more, SFTPA1, SFTPA2 and SFTPC , surfactant genes which have been reported to express exclusively in type II penumocytes, have showed their unique roles in LUAD [19]. Meanwhile, there are some reports approving that surfactant genes such as pro-surfactant protein B (pro-SFTPB) was a promising blood biomarker for non-small-cell lung cancer. Don D. Sin et al showed plasma levels of pro-SFTPB were associated with early-stage lung cancer, indicating it could be used in predicting early-stage NSCLC patients.    Besides, William R. Wikoff et al found combination of pro-SFTPB and N1, N12-diacetylspermine had a high sensitivity and specificity in diagnosing NSCLC [20,21]. It is a great inspiration to us, as some of the six LUAD-SDGs may have the potential to become the diagnostic biomarkers for LUAD or to enhance the accuracy when combined with pathological diagnosis. Then we explored the possible roles of LUAD-SDGs in LUAD patients' prognosis. Among the 70 LUAD-SDGs, six of them were identified significantly associated with patients' prognosis. In some studies, the expression of BMP5 demonstrated significantly lower in LUAD tissues than that in normal lung tissues, which was consistent with our result [22]. While there was no report about the prognostic function of BMP5, Pro-surfactant protein B (pro-SFTPB) was reported as a promising blood biomarker for non-small-cell lung cancer [20,21]. Ayumu et al demonstrated that increasing concentration of plasma pro-SFTPB was associated with higher lung cancer risk, which confirmed that SFTPB might be a valuable biomarker in lung cancer risk prediction models [20,23]. Besides, CHIAP2 was also reported associated with LUAD patients' prognosis. JING SUI et al identified that CHIAP2 was a cancer specific lncRNA and was positively correlated with OS of LUAD [24]. However, except for CHIAP2 and SFTPB, other four LUAD-SDGs did not have any report about the relationship between their expressions and LUAD patients' prognosis. Our study explored that the six LUAD-SDGs: BMP5, SFTPB, CHIAP2, RP11-513N24.1, RP11-206P5.2 and MS4A15 which may be positively associated with survival. In order to explore whether the six LUAD-SDGs can serve as independent prognostic factors for overall survival in LUAD, the multivariate Cox regression analysis was performed. In conclusion, the results showed that the six LUAD-SDGs were protective genes (HR < 1) for LUAD and played independent prognostic roles in LUAD.
Finally, we used the six survival-related LUAD-SDGs to construct a six-gene signature, which proved to be a prognostic biomarker for LUAD. We also performed Cox regression analyses to evaluate whether the prognostic power of the six-gene signature was independent of other clinical variables, such as tumor stage, smoking, age and gender. We demonstrated that the prognostic power of the six-gene signature was independent after taking other clinical variables into account. Besides, in the stratified analyses, patients were classified into high-risk and low-risk groups. Similar prognostic power was tested in the early patients (stage I-II), the advanced stage(III-IV), N0 (without lymph metastasis), M0 (without distant metastasis). In general, all the results suggested that the six-gene signature owned a good independent prognostic power even when other clinical variables are taken into account. Sudhanshu Shukla et al first presented the RNA-seq prognostic signature consisting of four genes for LUAD [25]. However, in our work, we first demonstrated that LUAD-SDGs could also act as good prognostic signatures for LUAD, which could provide insights into new prognostic biomarkers exploration.   In conclusion, by analyzing the large scale RNAseq of 10098 tumor tissues across 27 TCGA cancer types, we identified six LUAD-SDGs which showed diagnosis power. Meanwhile, other six LUAD-SDGs were significantly associated with LUAD patients' prognosis.
Moreover, the six-gene signature could effectively predict patients' prognosis and function as a new independent prognostic biomarker for LUAD. But we have to admit that there are limitations in our study. Firstly, the main limitation of the study is the lack of cross-validation cohort from other databases. Validating these genes in a larger cohort of LUAD patients can make the LUAD-SDGs for diagnosis and prognosis more convincing. Secondly, the six LUAD-SDGs with diagnosis power should be validated in serums samples to own their true diagnosis value. But these limitations are certain to be solved in future investigation. [26][27][28][29][30][31][32][33][34][35][36][37][38][39]

RNA-seq data collection and processing
Transcriptome data and clinical data were downloaded from The Cancer Genome Atlas (TCGA) data portal (https://cancergenome.nih.gov/). We chose 27 cancer types, all of which included over 100 tumor samples. We collected and analyzed data from 10098 tumor samples, which contained 535 tumor samples with related normal samples and clinical information in this study ( Table 3). The transcriptomic correlation of RNA expression was determined by RNA-seq (Fragments Per Kilobase Million [FPKM]) and the expression profiles were normalized by log2 transformed. To ensure the reliability of the detection, we removed genes whose FPKM values over 50% of the samples is equal to 0.

Analysis of specific dysregulated genes in LUAD
Firstly the dysregulated genes in LUAD between tumor and normal were identified using t-test (P value< 0.05, log2FC > 1 or <-1). Then the specific expressed genes of LUAD were acquired by comparing the expression of the genes between LUAD and other tumors in 26 cancer types one by one (P< 0.05, log2FC > 1). Specific dysregulated genes in LUAD were selected, which were specific expressed in LUAD among 27 cancer types and dysregulated expressed between tumor and normal of LUAD.

Diagnosis and survival analysis
All 535 LUAD patients data were randomly divided into test group (n=267) and validate group (n=268). The receiver operator characteristic (ROC) curves were used to evaluate the specificity and sensitivity of the LUAD-SDGs. The genes with high diagnostic power were selected (AUC > 95%). Meanwhile, for overall survival analysis, Kaplan-Meier survival and log-rank test were used to compare significant differences between subgroups with univariate analysis.

Construction of a prognostic signature
Univariate and multivariate Cox proportional hazards regression were used to assess the LUAD-SDGs whose expressions were significantly associated with patients' survival. Hazard ratios (HRs) from multivariable cox regression analysis were used to identify protective (HR < 1) or risky genes (HR > 1). Subsequently, a prognostic signature was constructed based on a linear combination of the expression profiles of prognostic genes weighted by the estimated regression coefficient [25,[40][41][42]. N was the number of prognostic genes, Expi was the expression of genes and Coei was the coefficient value. A risk score was constructed with the regression coefficients from this model and 75 th percentile was chosen as the threshold.

Statistical analysis
T-test was used to compare the dysregulated genes in LUAD between tumor and normal. Besides, it was also used to analyze specific expressed genes in LUAD compared with other tumors in 26 cancer types. The receiver operating characteristic (ROC) curve was performed using R package "pROC" [43]. The Kaplan-Meier method with a log-rank test was used to assess patients' survival using the R packages "survival" [44]. The univariate and multivariable Cox proportional hazards regression were performed using the R packages "BhGLM" [45]. Heatmaps were generated with z-score normalization with each column using R packages "gplots" [46]. All analyses were performed using R software (version 3.2.2). A statistically significant difference when P value < 0.05 was considered.