Relapse-related long non-coding RNA signature to improve prognosis prediction of lung adenocarcinoma

Increasing evidence has highlighted the important roles of dysregulated long non-coding RNA (lncRNA) expression in tumorigenesis, tumor progression and metastasis. However, lncRNA expression patterns and their prognostic value for tumor relapse in lung adenocarcinoma (LUAD) patients have not been systematically elucidated. In this study, we evaluated lncRNA expression profiles by repurposing the publicly available microarray expression profiles from a large cohort of LUAD patients and identified specific lncRNA signature closely associated with tumor relapse in LUAD from significantly altered lncRNAs using the weighted voting algorithm and cross-validation strategy, which was able to discriminate between relapsed and non-relapsed LUAD patients with sensitivity of 90.9% and specificity of 81.8%. From the discovery dataset, we developed a risk score model represented by the nine relapse-related lncRNAs for prognosis prediction, which classified patients into high-risk and low-risk subgroups with significantly different recurrence-free survival (HR=45.728, 95% CI=6.241-335.1; p=1.69e-04). The prognostic value of this relapse-related lncRNA signature was confirmed in the testing dataset and other two independent datasets. Multivariable Cox regression analysis and stratified analysis showed that the relapse-related lncRNA signature was independent of other clinical variables. Integrative in silico functional analysis suggested that these nine relapse-related lncRNAs revealed biological relevance to disease relapse, such as cell cycle, DNA repair and damage and cell death. Our study demonstrated that the relapse-related lncRNA signature may not only help to identify LUAD patients at high risk of relapse benefiting from adjuvant therapy but also could provide novel insights into the understanding of molecular mechanism of recurrent disease.


INTRODUCTION
Lung cancer, including small cell lung cancer (SCC) and non-small cell lung cancer (NSCLC), is one of the most common cancers that severely threaten human health.The number of deaths from lung cancer is increasing, and it is estimated that nearly one in four cancer-related deaths is due to lung cancer [ 1 ].Lung adenocarcinoma (LUAD) is the most frequent histological subtype of NSCLC and its incidence remains a rapidly increasing trend over the past few decades in China [ 2 ].Despite improvement in diagnosis and treatment, the overall fi ve-year survival rate for LUAD patients is only about 15% [ 3 ].Moreover, more than 30% of patients treated with surgical resection will have a relapse within fi ve years after surgery [ 4 ].
Recent advances in large-scale genomic analysis and high-throughput sequencing technologies have greatly increased our understanding of non-coding RNA (ncRNA) world.It has become increasingly apparent that a large proportion of human genome can be transcribed and produced a huge number of ncRNA molecules [ 5 ].NcRNAs are briefl y divided into two broad categories on the basis of their size: short ncRNAs and long ncRNAs.MicroRNA (miRNAs) is very important short ncRNAs, recently several predictors have been proposed to accurately predict miRNAs from other RNAs [ 6 -8 ], which are very useful for the studies of the ncRNAs.Recently a web server called repRNA [ 9 ] was established to extract various features from RNA sequences, which will benefi t the studies of RNAs.Long non-coding RNAs (lncRNAs), representing the largest class of ncRNAs, are mRNA-like transcripts and defi ned arbitrarily as ncRNAs of greater than 200 nucleotides in length [ 10 , 11 ].Evidence from growing publications has demonstrated that lncRNAs play important roles in various fundamental biological processes including development, differentiation and metabolism by executing functions as signals, decoys, guides and scaffolds [ 12 , 13 ].Furthermore, there is increasing evidence that lncRNAs are emerging as crucial components in the cancer paradigm [ 14 ].The fi ndings from transcriptome profi ling analysis have shown highly aberrant lncRNA expression pattern in various types of human cancer [ 15 , 16 ].These differently expressed lncRNAs may be associated with tumorigenesis, tumor progression and metastasis [ 17 , 18 ].Moreover, lncRNA expression tended to be cell-, tissueand cancer-type specifi c, thus making them attractive as independent biomarkers for diagnosis and prognosis [ 19 ].Some well-known lncRNAs, such as HOTAIR , MALAT1 , H19 , Xist , HULC and PTENP1 , have been found to possess oncogenic or/and tumor suppressor properties in various types of cancer [ 20 , 21 ].Several combinations of multiple lncRNAs were proposed as potential prognostic signature associated with overall survival in some cancers, including glioblastoma multiforme, colorectal cancer, breast cancer, oesophageal squamous cell carcinoma, non-small cell lung cancer and multiple myeloma [ 22 -29 ].Recent studies have shown the close relationship between cancer metastasis/ relapse and dysregulated lncRNA expression [ 28 , 30 -34 ], implying the potential of lncRNAs as biomarkers to predict the risk of cancer metastasis/relapse.However, lncRNA expression patterns and their prognostic value for LUAD relapse have not been systematically elucidated.
In this study, we performed a systematic analysis of lncRNA expression profi les across 403 LUAD patients who did or did not relapse by repurposing the publicly available microarray expression profi les to determine whether there is signifi cantly altered lncRNA expression pattern that could distinguish LUAD with relapse and without relapse.We aimed to detect potential lncRNA biomarkers closely correlated with LUAD relapse, and to develop novel lncRNA signature to identify LUAD patients who are at the higher risk for developing relapse.

Identifi cation of altered lncRNA expression associated with tumor relapse
Here, the Okayama dataset, which is the largest dataset in our study, contains 64 LUAD patients who developed relapse and 162 relapse-free LUAD patients [ 35 ].To identify relapse-related lncRNAs, we selected 88 favorable patients (alive > 5 years without any evidence of relapse) and 33 fatal samples (dead in 5 years with evidence of relapse) in the Okayama dataset to form a discovery dataset (n=121).The remaining patients in the Okayama dataset was considered as the internal testing dataset (n=105).Analysis of lncRNA expression profi les for LUAD patients in the discovery dataset revealed obvious differences and identifi ed a total of 25 differentially expressed lncRNAs (adjusted p-value <0.01 after Bonferroni correction) between LUAD patients who developed relapse and relapse-free LUAD patients (Supplementary File S1).Unsupervised hierarchical clustering of 121 LUAD patients in the discovery dataset according to the expression patterns of these 25 differentially expressed lncRNAs showed two distant patient clusters, which were highly correlated with the tumor relapse status (p=7.557e-12,chi-square test; Figure 1A ).Indeed, cluster I contained close to the majority of relapsed patients (n=30; 90.9%).Conversely, cluster II contained the majority of non-relapsed patients (n=70; 79.5%).Moreover, the Kaplan-Meier analysis and log-rank test revealed signifi cant difference in recurrencefree survival (RFS) between these two patients subgroups (p=4.93-e14,log-rank test; Figure 1B ).At three and fi ve years, the RFS rates of LUAD patients in cluster II were 95.9% and 95.9%, respectively, whereas the corresponding rates in the cluster I were 45.8% and 37.5%, respectively.The above results demonstrated that dysregulated lncRNAs might have a predictive power in the prognosis of LUAD patients.

Identifi cation of optimal relapse-related lncRNA set
To identify relapse-related lncRNA signature, we used the weighted voting classifi cation algorithm to predict outcome with the expression levels of these 25 differentially expressed lncRNAs as described in Materials and methods.These 25 differentially expressed lncRNAs were fi rstly ranked according to signal-to-noise metric.Then the average number of misclassifi ed patients of the 5-fold cross-validation in 100 permutations was calculated when increasing numbers of top ranked predictive lncRNAs ( Figure 2A ).As a result, nine lncRNAs were found to yield a balance between accuracy and the number of lnRNAs, and were identifi ed as optimal relapse-related lncRNA set.When choosing more than nine lncRNAs, there is a very slight increase in prediction accuracy ( Figure 2B ).With the selected nine lncRNAs and relapse status taken together, 121 LUAD patients were assigned as either relapse or relapse-free with accuracy of 84.3%.The classifi cation of 121 LUAD patients produced a receiver operating characteristic (ROC) curve with AUC of 0.923, sensitivity of 90.9%, and specifi city of 81.8% ( Figure 2C ).Furthermore, the Kaplan-Meier analysis for RFS demonstrated a signifi cant difference between the groups predicted to be relapse or relapse-free (p=3.44e-15,log-rank test; Figure 2D ).At three and fi ve years, the RFS rates of LUAD patients in the predicted relapse-free group www.impactjournals.com/oncotargetwere 96% and 96%, respectively, whereas the corresponding rates in the predicted relapse group were 45.7% and 34.8%, respectively.We clustered LUAD patients in the discovery dataset according to the expression levels of nine relapserelated lncRNAs by hierarchical clustering analysis and obtained two distinctive patient groups with signifi cantly different RFS (p=2.56e-07,log-rank test; Figure 3A and 3B ).These results revealed better performance of nine relapserelated lncRNAs in prognosis prediction.

Construction of relapse-related lncRNA signature from the discovery dataset
We applied univariate Cox proportional hazard regression to each of these nine relapse-related lncRNAs and found all of them signifi cantly correlated with patient's RFS ( Table 1 ).We then used these nine relapse-related lncRNAs to construct a signature by the risk score method as the classifi er for prognosis prediction.This risk score model was defi ned as a linear combination of the expression levels of the nine relapse-related lncRNAs and the multivariate Cox regression coeffi cient as the weight as follows: (-1.049×expression value of DACT3-AS1 ) + (0.027×expression value of CTD-2524L6.3 ) + (0.485×expression value of EFCAB14-AS1 ) + (0.93×expression value of AP000679.2 ) + (0.439×expression value of CTB-129P6.4 ) + (0.091× expression value of LRRC2-AS1 ) + (-0.357×expression value of RP11-517O13.3 ) + (0.489×expression value of AL133249.1 )+(0.015×expressionvalue of CTC-366B18.2 ).Each LUAD patient in the discovery dataset was assigned a risk score and was classifi ed into different prognostic groups (the high-risk group and low-risk group) according to the threshold of the median risk score (-0.054).The Kaplan-Meier analysis demonstrated a signifi cant difference in RFS between two patient groups predicted to have good or poor prognosis (median RFS > 8 years vs. 3.73 years, p=1.01e-10, log-rank test; Figure 4A ).The three-year and fi ve-year RFS rates of the high-risk group were 55% and 46.7%, respectively, whereas the corresponding rates in the low-risk group were 98.4% and 98.4%, respectively.The univariate analysis revealed a signifi cant association between the risk score and RFS, in which the hazard ratio (HR) of highrisk group versus low-risk group for RFS is 45.728 (95% confi dence interval (CI) 6.241-335.1;p=1.69e-04;Table 2 ).The time-dependent ROC curves analysis for the relapserelated lncRNA signature prognostic model achieved an AUC of 0.932 at fi ve years of RFS ( Figure 4B ).These results demonstrated that the relapse-related lncRNA signature has better performance in prognosis prediction of LUAD.
The distribution of prognostic risk scores, the relapse status and expression pattern of lncRNA signature of 121 LUAD patients in the discovery dataset was shown in Figure 4C .Of these nine relapse-related lncRNAs, fi ve were protective lncRNAs whose high expression were associated with low risk, and the remaining four were risky lncRNAs whose high expression were associated with high risk.

Validation of relapse-related lncRNA signature in the testing and entire Okayama dataset
To confi rm our fi ndings, the predictive ability of relapse-related lncRNA signature was validated in LUAD patients from the testing dataset and entire Okayama dataset.With the same risk score formula and cutoff value derived from the discovery dataset, patients of the testing dataset were classifi ed into the high-risk group (n=58) and low-risk group (n=47).As in the discovery dataset, the RFS time of patients in the high-risk group was signifi cantly shorter than that in the low-risk group (median RFS 4.01 years vs. > 5 years, p=8.14e-07, logrank test) ( Figure 5A ).The risk stratifi cation of the entire Okayama dataset (i.e.combined the discovery and testing series) also yielded similar result.This relapse-related lncRNA signature was able to separate 226 patients in the entire Okayama dataset into two groups with signifi cantly different RFS (median 3.73 years vs. > 8 years, p=1.11e-16, log-rank test) ( Figure 5B ).A signifi cant association between the relapse-related lncRNA signature and RFS in the univariate Cox regression analysis was observed both in the testing and entire Okayama datasets.The hazard ratios of the high-risk group versus the lowrisk group for RFS was 11.02 (p=7.96e-05;95% CI 3.346-36.3) in the testing dataset, and 20.66 (p=4.7e-09;95% CI 7.5-56.91) in the entire Okayama dataset ( Table 2 ).The distribution of prognostic risk scores, the relapse status and expression pattern of lncRNA signature of LUAD patients in the testing and entire Okayama datasets were shown in Figure 5C and 5D .Patients with high prognostic scores tended to express risky lncRNAs, whereas patients with low prognostic scores tended to express protective lncRNAs.

Further validation of relapse-related lncRNA signature with two additional independent datasets of LUAD patients
Further validation of the prognostic power of relapserelated lncRNA signature in LUAD patients was conducted using two additional completely independent cohorts of 124 and 53 LUAD patients obtained from Der's study [ 36 ] and Botling's study [ 37 ], which will be further referred to as the Der dataset and Botling dataset.The median cutoff value of risk score obtained from the discovery dataset was used for the Der dataset and Botling dataset to classify patients into either high-risk or low-risk groups.For the Der dataset, patients with high-risk scores had signifi cantly shorter RFS than those with low-risk scores (median 3.74 years vs. 7.61 years, p=2.42e-02, log-rank test) ( Figure 6A ).Among patients in the Botling dataset, the high-risk group and lowrisk group were marginally signifi cantly different in their RFS (median 1.38 years vs. 7.96 years, p=7.07e-02, logrank test) ( Figure 6B ).At three and fi ve years, the respective  2 ).The distribution of prognostic risk scores, the relapse status and expression pattern of lncRNA signature of LUAD patients in the two independent cohorts were consistent with those observed in the discovery, testing and entire Okayama datasets ( Figure 6C and 6D ).

Independence of prognostic value of relapse-related lncRNA signature from other clinical variables
To determine whether the prognostic value of the relapse-related lncRNA signature was independent of other clinical variables, we conducted a multivariate Cox regression analysis including lncRNA signature, age, gender, smoking status and tumor stage as covariates.The results of multivariable Cox regression analysis from fi ve LUAD patient datasets showed that the relapse-related lncRNA signature was still signifi cantly associated with RFS after adjusted by these clinical variables in each dataset ( Table 2 ).We also found that age and tumor stage was signifi cant in the multivariate analysis in some datasets.So we performed data stratifi cation analysis according to the age and tumor stage.All LUAD patients enrolled in this study were fi rst stratifi ed into either the younger stratum (age≤65) or the elder stratum (age>65).This analysis showed that within each age stratum, the relapse-related lncRNA signature could further classifi ed patients into the high-risk group and low-risk group with signifi cantly different RFS (median 4.73 years vs. > 8.68 years, p=2.55e-12 for younger stratum; median 2.94 years vs. 8.83 years, p=3.39e-05 for elder stratum; log-rank test) ( Figure 7A and 7B ).Next, the stratifi ed analysis was carried out in tumor stage, which stratifi ed patients into the stage I stratum and stage II stratum.For patients within stage I stratum, signifi cant differences for RFS between high-risk group and low-risk group were observed (median 4.68 years vs. >10 years, p=2.8e-13, log-rank test) ( Figure 7C ).Among stage II patients, RFS was also marginally signifi cantly different between the groups with high-risk and low-risk scores (median 2.26 years vs. 4.88 years, p=9.38e-02, log-rank test) ( Figure 7D ).Because of limited patient size, the stratifi ed analysis was not conducted for stage III (n=8) and IV (n=1) patients.Taken together, the results of multivariate Cox regression and stratifi cation analyses suggested that the relapse-related lncRNA signature is independent of other clinical features for prognosis prediction of LUAD patients.Moreover, the discrimination performance of relapse-related lncRNA signature measured by the C-index was much higher than that of other clinical variables in each dataset ( Table 3 ), demonstrating the better predictive ability to discriminate between LUAD patients who are or not likely to develop relapse.

Identifi cation of associated biological functions of relapse-related lncRNA signature
As an initial step toward gaining insights into the biological functions of relapse-related lncRNA signature, we fi rst applied GSEA to identify associated biological pathways and processes from gene expression profi les of LUAD patients in the high-risk and low-risk groups classifi ed by the relapse-related lncRNA signature in the Okayama dataset.The high-risk scores were associated with coordinated transcriptional up-regulation of multiple gene sets ( Figure 8A and 8B ) (Supplementary File S2), mainly involved in glucose metabolism and proteasome that have been reported to be involved in lung cancer [ 38 , 39 ].The low-risk score was accompanied with up-regulation of circadian clock and JNK-MAPK pathway ( Figure 8A and 8B ) (Supplementary File S2), both of which have inhibitory effects on the growth of lung cancer [ 40 , 41 ].Then we measured the co-expressed relationships between nine relapse-related lncRNAs and mRNAs by calculating the Pearson correlation coeffi cient of paired lncRNA and mRNA expression profi les and identifi ed 1414 mRNAs positively correlated (ranked top 1%) with at least one of nine relapse-related lncRNAs.The functional enrichment analysis of GO and KEGG pathway revealed that the co-expressed mRNAs were most signifi cantly enriched in 15 GO functional annotation clusters (mainly involved in cell cycle, DNA repair and damage, macromolecular complex assembly, RNA splicing and cell death) ( Figure 8C ) (Supplementary File S3), and 10 KEGG pathways including cell cycle, oocyte meiosis, progesterone-mediated oocyte maturation, DNA replication, insulin signaling pathway, spliceosome, p53 signaling pathway, One carbon pool by folate, gap junction and ErbB signaling pathway (p<0.05 and Fold Enrichment>2) ( Figure 8D ) (Supplementary File S4).This integrative functional analysis suggested that the dysregulated expression of relapse-related lncRNAs might affect the critical biological pathways and processes involved in tumor progression and recurrence.

DISCUSSION
LUAD, the most frequent type of NSCLC, remains to be the leading cause of cancer-related deaths in women and men.LUAD is a recurrent disease, and more than 30% of patients still faced relapse after surgical resection and treatment and ultimately die of relapse [ 4 ].In the past years, great efforts have been made to improve our understanding of the possible molecular mechanism of relapse process at protein, mRNA and microRNA (miRNA) levels, and some protein/mRNA/miRNA-based predictive signature were reported to identify patients at the high risk of relapse, which will enable them to benefi t from adjuvant therapy [ 35 , 42 -50 ].LncRNAs is a novel layer of gene regulation network and their aberrant expression has been demonstrated to be associated with tumorigenesis, tumor progression and metastasis [ 14 , 17 , 18 , 30 ]. Until now, several lncRNAs, including MALAT-1 , CCAT2 , HOTAIR , and ZXF1 , have been found to contribute to LUAD [ 51 -54 ].More recently, differentially  expressed lncRNAs were observed between LUAD tissues and normal tissues by microarray analysis [ 55 ].However, lncRNA expression patterns and their prognostic value for LUAD relapse have not been systematically investigated.
In the present study, we obtained lncRNA expression profi les in 403 LUAD patients by repurposing the publicly available microarray expression profi les, and performed comparison analysis between the groups of 121 LUAD patients that dead in 5 years with evidence of relapse and those that alive > 5 years without any evidence of relapse.Although Okayama et al. analyzed ALK-positive and triple negative LUAD patients [ 35 ], current study did not differentiate them.We found that two patient groups have signifi cantly different lncRNA expression patterns and identifi ed 25 differentially expressed lncRNAs.Hierarchical clustering analysis revealed that these differentially expressed lncRNAs were signifi cantly correlated with LUAD relapse.By using the lncRNA expression data and a weighted voting algorithm, we were able to generate an optimal set of nine lncRNAs that could clearly distinguish patients who developed relapse from those who did not relapse with high accuracy, demonstrating their potential clinically application to identify patients with higher risk of relapse and improve prognosis prediction of LUAD.In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling test, and jackknife test.However, of the three test methods, the jackknife test is deemed the least arbitrary that can always yield a unique result for a given benchmark dataset as elaborated in [ 56 ].Accordingly, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors [ 57 -61 ].However, to reduce the computational time, we adopted the 5-fold cross-validation in this study as done by many investigators with SVM as the prediction engine.Furthermore, the small subset of predictive lncRNAs perhaps not only has better transferability in the clinics but also reduce the possibilities of false positives in the selection of predictive lncRNAs.By focusing on relapserelated lncRNAs, we constructed a nine-lncRNA signature for prognosis by the risk score method based on the linear combination of expression data of relapse-related lncRNAs and weighted by the regression coeffi cients from multivariate Cox regression analysis, which effectively classifi ed patients of the discovery dataset into high-risk group and low-risk group with signifi cantly different RFS.Moreover, the prognostic power of this relapserelated lncRNA signature was further validated by the testing dataset and other two independent non-overlapping datasets, indicating good reproducibility and robustness of relapse-related lncRNA signature in patient prognosis.
The conventional indicators for making adjuvant treatment decisions for patients after surgical resection are based on some clinical factors, such as tumor stage, tumor size, margin status and so on [ 45 ].Therefore, we performed multivariate Cox regression analysis to assess the independence of the relapse-related lncRNA signature in prognosis prediction, and in these datasets, the signature maintained an independent correlation with RFS after adjusting for age, gender, stage and smoking status ( Table 2 ).However, it can be found that age and stage also were two important factors affecting RFS when accessed in the multivariate Cox regression analysis.So, we further stratifi ed patients according age and stage, and applied this lncRNA signature to classify patients within the same age stratum or the same stage into two subgroups with good and poor prognosis.The stratifi cation analysis showed that patients in the predicted good prognosis group tended to have longer RFS than those in the predicted poor prognosis group across these stratifi ed patient datasets, demonstrating the age-and stage-independent prognostic value of this lncRNA signature.Moreover, the discriminatory power of relapse-related lncRNA signature measured by the C-index value was better than the discrimination provided by the clinical variables in different datasets.Taken together, this relapse-related lncRNA signature was a signifi cant and independent prognostic marker in predicting relapse risk of patients with LUAD.
Although the number of identifi ed lncRNAs in human is continuously increasing, only a small proportion have been well functionally characterized to date and the functional study of lncRNAs remains to be in its infancy.For example, only 181 lncRNAs with functional evidence were recorded in lncRNAdb v2.0 database by manually literature mining [ 62 ].To gain functional insight into these nine relapse-related lncRNAs, we performed an integrative bioinformatics analysis to predict lncRNA function which could overcome the bias derived from the single prediction method.We found that the GO composition of co-expressed mRNAs with these lncRNAs revealed biological relevance to disease relapse, such as cell cycle, DNA repair and damage and cell death.Moreover, the pathway analysis also revealed a signifi cant enrichment of co-expressed mRNAs with these lncRNAs in lung cancer-related pathways.These in silico functional analysis based on the co-expressed mRNAs with lncRNAs suggested that expression variation of these relapse-related lncRNAs might affect some critical biological pathways and processes involved in tumor progression and recurrence.However, biological signifi cance of these relapse-related lncRNAs should be validated using wet experiments on cell lines and clinical samples in the future.As shown in a series of recent publications [ 63 -68 ] in developing or reporting new methods or fi ndings, user-friendly and publicly accessible web-servers will signifi cantly enhance their impacts [ 69 ], we shall make efforts in our future work to provide a web-server for the method presented in this article.
In summary, we investigated lncRNA expression patterns in LUAD relapse and their effects on patient outcome for the fi rst time.We identifi ed an optimal small set of nine lncRNAs whose expression patterns were able to discriminate between relapsed and non-relapsed LUAD patients with sensitivity of 90.9% and specifi city of 81.8%.A relapse-related lncRNA signature was developed by the risk model method that effectively classifi ed patients into good and poor prognosis groups across different datasets.Moreover, the prognostic power of this signature was independent of other clinical variables.The relapse-related lncRNA signature may not only help to identify LUAD patients at high risk of relapse benefi ting from adjuvant therapy but also could provide novel insights into the molecular mechanism of recurrent disease.

Acquisition and analysis of lncRNA expression profi les of LUAD patients
The raw array data (.CEL fi les) of 403 LUAD patients were retrieved from the GEO database and were uniformly pre-processed using the Robust Multichip Average (RMA) algorithm for background correction, quantile normalization and log2-transformation [ 70 ].To account for the heterogeneity of multiple microarray datasets in systematic measurement, each dataset was standardized independently by the Z-score transformation to scale expression intensities of each probe as follows [ 71 ]: Z score e e i = -( )/ d Where e i the raw intensity data of probe i , e is the overall average intensity of probes in a single experiment and δ is the standard deviation (SD) of all of the measured intensities.
The probe sequences of Affymetrix HG-U133 Plus 2.0 array were obtained from the Affymetrix website ( http:// www.affymetrix.com).LncRNA expression data of 403 LUAD patients were obtained by repurposing Affymetrix array probes as previous described [ 72 , 73 ].Briefl y, probe sets for Affymetrix HG-U133 Plus 2.0 array were reannotated to the human genome (GRCh38) and lncRNA genes based on the annotations from GENCODE (release 21) using SeqMap tool [ 74 ].Then those probes (or probe sets) that were uniquely mapped to the human genome and lncRNA genes with no mismatch were generated to represent the lncRNAs.For each lncRNA, all corresponding probe set signals were averaged to produce a single expression value.Finally, the expression data of 2313 lncRNAs was obtained.
The expression profi les of lncRNAs between relapse-free LUAD patients (alive > 5 years without any evidence of relapse) and LUAD patients who developed relapse (dead in 5 years with evidence of relapse) were compared and the differentially expressed lncRNAs were identifi ed using two-tailed T-test.Those lncRNAs with an adjusted p-value <0.01 after Bonferroni correction were considered as differentially expressed lncRNAs.The unsupervised hierarchical clustering of both LUAD patients and lncRNAs was performed with R software using the euclidean distance and complete linkage method.

Identifi cation of relapse-related lncRNA set
To identify optimal lncRNA set associated with relapse of LUAD patients, we used the weighted voting algorithm to develop a supervised classifi cation model, assessed using 5-fold cross-validation with 100 randomized permutations in the discovery dataset as follows: (i) patients of discovery dataset were divided into fi ve non-overlapping sets with equal quantity.(ii) With four of fi ve sample sets, the signal-to-noise statistic ( S lnc-i ) of each lncRNA was calculated as Finally, the top N which yielded the optimal numbers of learning errors was selected as the optimal number (OPN) of predictive lncRNAs.The frequencies of lncRNAs in 500 candidate lncRNAs ranking list according to their signal-to-noise statistic were ranked and top OPN of lncRNAs were identifi ed as optimal relapse-related lncRNA signature.

Statistical analysis for classifi cation and prediction
The association between expression levels of relapse-related lncRNAs and patients' RFS was assessed using the univariate Cox regression analysis.RFS was calculated as the time to tumor recurrence or death due to any cause, and was censored at the time of last followingup when recurrence has happened.To construct a prognostic model, the relapse-related lncRNAs were fi tted in the multivariate Cox regression model in the discovery dataset.Then we applied these relapse-related lncRNAs to build an expression signature by risk score method as follows [ 75 ]: Where N is the number of relapse-related lncRNAs, Exp i is the expression levels of ln cRNA i , and w i is the estimated regression coeffi cient of ln cRNA i derived from the above multivariable Cox regression analysis in the discovery dataset.This relapse-related lncRNA expression signature was established by taking into account the contribution of independent relapse-related lncRNA to patient's RFS.Finally, LUAD patients were assigned a risk score according to the relapse-related lncRNA expression signature, and were divided into high-risk and low-risk groups using the median of the risk score generated from the discovery dataset as the cutoff value.The LUAD patients with higher scores were considered to have high risk of poor outcome.The difference in RFS between high-risk group and lowrisk group was demonstrated by Kaplan-Meier survival plots, and the statistical signifi cance was assessed by two-sided log-rank test.Univariate and multivariate analyses with Cox proportional hazards regression were performed with RFS as the dependent variable and relapse-related lncRNA risk score and clinical features as explanatory variables in each dataset.Hazard ratio (HR) and 95% confi dence intervals (CI) was estimated by Cox proportional hazards regression model.The sensitivity and specifi city of relapse-related lncRNA risk score in RFS prediction was evaluated by analysis of the timedependent ROC curve within 5 years as the defi ning point.The Harrell's concordance index (C-index) was calculated to quantify the discriminatory power of relapse-related lncRNA risk score [ 76 ].A C-index of 1.0 indicates perfect prediction accuracy, whereas a C-index of 0.5 represents prognosis prediction is equivalent to random guessing.All statistical analyses were performed using R software and Bio-conductor.

Integrative prediction analysis of lncRNA function
In order to explore the potential biological roles of lncRNA, we performed function enrichment analysis by the integration of gene sets, Gene Ontology (GO) and Kyoto encyclopedia of genes and genomes (KEGG).Gene set enrichment analysis (GSEA) was performed by the JAVA program using MSigDB (c2.cp.v5.0, 1330 gene sets) to rank gene set associated with risk score by enrichment score [ 77 ].The gene sets with positive enrichment score (or negative enrichment score) and p-value <0.01 after performing 1000 permutations of the risk-phenotype labels were considered as signifi cant enriched gene sets in which most of the genes are up-regulated accompanied with highrisk scores (or low-risk scores).GO and KEGG enrichment analysis were carried out to assess over-representation of functional categories among a gene set of interest using DAVID Bioinformatics Tool (version 6.7) limited to GO terms in the "Biological Process"(GOTERM-BP-FAT) and KEGG pathway categories [ 78 ].Functional categories with p-value of <0.05 and an enrichment score of >2 using the whole human genome as background were considered signifi cant.Signifi cant functional annotations of GSEA and GO analysis were organized into an interaction network with similar functions using the Enrichment Map [ 79 ] plugin in Cytoscape 3.2.1 [ 80 ].

Figure 1 :
Figure 1: LncRNA expression patterns can distinguish patients who developed relapse from those who did not relapse in the discovery dataset.A. The unsupervised hierarchical clustering heatmap of 121 patients based on the 25 differentially expressed lncRNAs in the discovery dataset.B. Kaplan-Meier survival curve for RFS in the two lncRNA transcriptomic classifi cations.

Figure 2 :
Figure 2: Identifi cation of the relapse-related lncRNA signature in the discovery dataset.A. The learning errors for top N -lncRNA model using 5-fold cross-validation procedures with 100 random partitions of the discovery dataset.B. The variance rate of classifi cation accuracy when increasing numbers of the predictive lncRNAs.C. ROC analysis of the relapse-related lncRNA signature for relapse prediction within fi ve years as the defi ning point.D. Kaplan-Meier survival curve for RFS of patients with relapse or relapse-free according to nine-lncRNA model-based prediction.

Figure 3 :
Figure 3: The heatmap and survival analysis of hierarchical clustering based on relapse-related lncRNA signature in the discovery dataset.A. The unsupervised hierarchical clustering heatmap of 121 patients based on selected optimal nine lncRNAs in the discovery dataset.B. Kaplan-Meier survival curve for RFS in the two lncRNA transcriptomic classifi cations.

Figure 4 :
Figure 4: The relapse-related signature by risk score method in prognosis of RFS of patients with LUAD in the discovery dataset.A. Kaplan-Meier survival curve for RFS of patients with high-risk or low-risk scores.B. ROC analysis of the risk score model for prognosis prediction within fi ve years as the defi ning point.C. The distribution of risk scores, patients' relapse status and the heatmap of lncRNA expression profi les.

Figure 5 :
Figure 5: The relapse-related signature predicts RFS of patients with LUAD in the testing dataset and entire Okayama dataset.Kaplan-Meier survival curves of RFS between high-risk and low-risk patients in A. the testing dataset and B. entire Okayama dataset.The distribution of risk scores, patients' relapse status and the heatmap of lncRNA expression profi les in C. the testing dataset and D. entire Okayama dataset.

Figure 6 :
Figure 6: Independent validation of relapse-related signature for prognosis prediction in two additional independent datasets.Kaplan-Meier survival curves of RFS between high-risk and low-risk patients in A. the Der dataset and B. the Botling dataset.The distribution of risk scores, patients' relapse status and the heatmap of lncRNA expression profi les in C. the Der dataset and D. the Botling dataset.

Figure 7 :
Figure 7: Prognosis prediction in patients stratifi ed by age and tumor stage.Kaplan-Meier survival curves for younger patients A. and elder patients B. Kaplan-Meier survival curves for stage I patients C. and stage II patients D.

Figure 8 :
Figure 8: Functional analysis of the prognostic lncRNAs.A. The enrichment map of gene sets with each node represents a gene set and an edge represents the proportion of shared genes between connecting gene sets.B. The enriched biological pathways and processes associated with risk score.C. The functional enrichment map of GO terms with each node represents a GO term and an edge represents the proportion of shared genes between connecting GO terms.D. The enriched KEGG pathways ranked by −log 10 (p-value).

−
) is the mean value and standard deviation (SD) of expression level of lncRNA i in LUAD patients who developed relapse (relapse-free).In addition, the classifi cation boundary of two classes (relapse or non-relapse) for each lncRNA i was also calculated as e ln c-i is the expression level of lncRNA i in the corresponding patient.The votes of top N ranked lncRNAs were summed to determine the relapse or relapse-free class of sample.(v) repeat steps i-iii for each of the fi ve non-overlapping sets.(vi) The 5-fold cross-validation process was repeated 100 times.The average number of misclassifi ed patients of 100 randomized permutations for top N model as follows:

Table 4 : Clinical features of LUAD patients enrolled in this study
(iii) lncRNAs were ranked according to its signal-to-noise statistic with absolute value, and top N ranked lncRNA were selected to develop a supervised classifi cation model.The top N was initially set to top 1, and increase one lncRNA each time until N was equal to the number of candidate lncRNAs.(iv) The above classifi cation model was used to classify patients in the remaining one set into relapse or non-relapse based on voting rules: each lncRNA i in the top N model has a vote V ln c-i