A long non-coding RNA expression signature to predict survival of patients with colon adenocarcinoma

Colon cancer is the most common type of gastrointestinal cancer and is still the leading cause of cancer-related mortality worldwide. Long non-coding RNAs (lncRNAs) have been proved to be superior biomarkers in cancer diagnosis and prognosis than miRNAs and protein-coding genes. In the current study, our objective was to detect novel lncRNA biomarkers by analyzing lncRNA expression profiles and clinical data in a large cohort of patients with colon patients from The Cancer Genome Atlas (TCGA). By using Cox regression analysis, we identified two lncRNAs (SNHG6 and CTD-2354A18.1) which could be independent prognostic factors for predicting clinical outcome in colon adenocarcinoma. Then a linear combination of these two lncRNA biomarkers (SNHG6 and CTD-2354A18.1), termed two-lncRNA signature, was developed in the training set as a predictor for OS in patients with colon adenocarcinoma, and validated in the testing set and the entire patient set. This two-lncRNA signature demonstrated significant prognostic performance in both the testing set and the entire patient set which classified the patients into two groups with significantly different OS. The multivariate and stratified analysis suggested that the prognostic value of the two-lncRNA signature was independent of other traditional clinical variables. Functional analysis suggested that these two lncRNA biomarkers might be mainly involved in transcription/translation-related or DNA repair-related biological processes. In summary, our results warrant further studies on these lncRNAs that will improve our understanding of the mechanisms associated with pathogenesis and progression of colon adenocarcinoma.


INTRODUCTION
Colon cancer is the most common type of gastrointestinal cancer and remains the leading cause of cancer-related mortality worldwide [1]. Colon adenocarcinoma is the most common colon cancer type and accounts for 98% of newly diagnosed colon cancer cases. Surgery combined with other therapeutic options (including cryotherapy, radiofrequency ablation or chemotherapy) is the standard treatment strategy. Despite advances in diagnosis and treatment, the prognosis for patients with metastatic disease remains poor [2]. Hence, the identification of more exact molecular biomarkers for identifying high-risk patient subgroup in survival is evidently needed and would enable clinicians to choose more suitable treatment strategies thus leading to improved clinical outcomes The advent of full genome sequences and comprehensive analysis of functional elements have suggested that a large portion of the genome can be transcribed and led to the discovery of extensive transcription of large RNA transcripts termed long noncoding RNAs (lncRNAs) [3,4]. lncRNAs are arbitrarily defined as ncRNA genes larger than 200 bp

Research Paper
Oncotarget 101299 www.impactjournals.com/oncotarget which clearly distinguishes from small regulatory RNAs [5]. It is now well appreciated that lncRNAs exert a diverse spectrum of regulatory mechanisms in numerous biological processes at the transcriptional level, posttranscriptional level and epigenetic level [6,7]. Like miRNAs and protein-coding genes, lncRNA also revealed differential expression in carcinogenesis compared to normal tissues, some of which have been identified as a tumor suppressor or oncogenes associated with cancer pathogenesis and development [8][9][10][11]. Furthermore, accumulating reports of aberrant lncRNA expression have suggested that lncRNAs may potentially serve as novel independent biomarkers for early diagnosis, prognosis and metastasis prediction in various cancer types [12][13][14][15][16][17][18].
In the current study, our objective was to detect novel lncRNA biomarkers by analyzing lncRNA expression profiles and clinical data in a large cohort of patients with colon patients from The Cancer Genome Atlas (TCGA). Furthermore, we tire to identify a lncRNAbased expression signature to predict patient's clinical outcome. Finally, we performed bioinformatics analysis to infer the possible biological functions of lncRNA signature in colon adenocarcinoma.

Identification of prognostic lncRNAs in the training set
To detect the prognostic lncRNAs in colon adenocarcinoma patients, we first performed univariate Cox regression analyses to explore the association between expression levels of lncRNAs and the OS of the patients with colon adenocarcinoma, and identified 14 lncRNAs (p < 0.01) that are significantly associated with OS in the training set (Supplementary Table 1). Then we performed multivariate Cox regression analysis for these 14 lncRNAs and found that two of 14 lncRNAs (SNHG6 and CTD-2354A18.1) may be independent prognostic factors (p < 0.1) ( Table 1). Of them, we found that higher expression level of SNHG6 was associated with good outcome and a higher expression level of CTD-2354A18.1 was associated with poor outcome.

Development of lncRNA expression signature for survival prediction in the training set
Selected two prognostic lncRNAs were fitted in a multivariate Cox regression analysis to obtain their relative coefficient. Then a two-lncRNA expression signature was developed by constructing a mathematical formula using a linear combination of the expression values of two prognostic lncRNAs (SNHG6 and CTD-2354A18.1) and the multivariate Cox regression coefficient as the weight as follows: LncRNA signature-based risk score = (−1.2023*expression value of SNHG6) + (1.106* expression value of CTD-2354A18.1). Finally, each of the patients in the training set could get a risk score based on two-lncRNA expression signature and classified into highrisk group and low-risk group using the median risk score of the training set as the risk cutoff. As shown in Figure  1A, patients with low-risk score had a better OS than those with high-risk scores (log rank p < 0.001) ( Figure  1A). The area under the curve (AUC) of time-dependent ROC curves for the two-lncRNA expression signature was 0.881 at three years of OS ( Figure 1B). The hazard ratios of high-risk group versus low-risk group for OS were 2.718 (p < 0.001; 95% CI, 1.681-4.397) in the univariate analysis ( Table 2). Distribution of the lncRNA risk score, the survival status of the patients and the expression pattern of two prognostic lncRNAs was also shown in Figure 1C. We found that patients in the high-risk group tended to express high level of CTD-2354A18.1, whereas patients in the low-risk group tended to express high level of SNHG6 ( Figure 1C).

Further validation of lncRNA expression signature for survival prediction in the testing set and entire patient set
The two-lncRNA expression signature was then validated for its prognostic value in the testing set of 78 patients. By using the same lncRNA signature-based risk score formula, patients of the testing set were divided into the high-risk group (n = 33) and the low-risk group (n = 45) the median score of the training set as the cutoff value. In the consistent with the findings from the training set, patients in the high-risk group had significantly shorter median OS than those in the low-risk group (log rank p = 0.045) (Figure 2A). The AUC for the two-lncRNA expression signature was 0.656 at three years of OS ( Figure 1B). In the univariate analysis, the hazard ratios of high -risk group versus low-risk group for OS were 2.72 (p = 0.054; 95% CI, 0.984-7.516) ( Table 2). Distribution of the lncRNA risk score, the survival status of the patients and expression pattern of two prognostic lncRNAs was also shown in Figure 2C. We found that the expression pattern of two prognostic lncRNAs in patients with highrisk score or low-risk score is consistent with observations in the training set ( Figure 2C).
We further validated the two-lncRNA expression signature in the entire patient set (combined training set and testing set). With the two-lncRNA expression signature, patients of the entire patient set were divided www.impactjournals.com/oncotarget into a high-risk group (n = 72) or a low-risk group (n = 85) with significantly different OS (log rank p < 0.001), which was similar to those observed in the training set and testing set ( Figure 3A). The hazard ratios of high -risk group versus low-risk group for OS were 5.441 (p < 0.001; 95% CI, 2.196-13.478) in the univariate analysis ( Table 2). The AUC for the two-lncRNA expression signature was 0.723 at three years of OS ( Figure 3B). Distribution of the lncRNA risk score, the survival status of the patients and the expression pattern of two prognostic lncRNAs were also shown in Figure 3C.

Prognostic value of the lncRNA expression signature is independent of clinical variables
To examine whether the prognostic value of the lncRNA expression signature for survival prediction is independent of other clinical variables, we performed multivariate Cox regression analysis with age, gender, stage and lncRNA signature-based risk score as covariables in each of three patient sets. Results from multivariate Cox regression analysis suggested that the lncRNA expression signature, age and stage were   (Table 2). We next carried out a stratified analysis for age and stage to evaluate whether the lncRNA expression signature could predict OS of patients with the same age and stage features. All patients were stratified into a younger stratum (< 75) or an elder stratum. Results of survival analysis suggested that within each stratum, the two-lncRNA expression signature could further subdivide patients into the high-risk group and low-risk group with significantly different survival time (log rank p = 0.004 for younger stratum and log rank p = 0.005 for elder stratum) ( Figure  4A and 4B). Then all patients were early-stage stratum (I/II) and advanced-stage stratum (III/IV). With the two-lncRNA expression signature, patients with early-stage were classified into the high-risk group (n = 42) and lowrisk group (n = 55). The OS time of the high-risk group patients was significantly shorter than that of low-risk group patients (log rank p = 0.009) ( Figure 4C). Similar results were observed when patients with advanced-stage stratum were classified as high-risk (n = 30) or low-risk (n = 30) according to their two-lncRNA expression signature (log rank p = 0.006) ( Figure 4D).

Functional enrichment analysis of genes correlated with the prognostic lncRNAs
To investigate the potential functional roles of identified prognostic lncRNAs, we first examined the expression correlation between prognostic lncRNAs and protein-coding genes (PCGs) for all patients and identified 327 PCGs positively or negatively co-expressed with prognostic lncRNAs (top 5%). Then we performed functional enrichment analysis for 327 PCGs positively or negatively co-expressed with prognostic lncRNAs using DAVID web tool. Results of GO analysis suggested that these PCGs correlated with the prognostic lncRNAs were enriched in two functional clusters involved in transcription/translation-related or DNA repair-related biological processes (Table 3). KEGG functional enrichment analysis suggested that PCGs correlated with the prognostic lncRNAs were involved in five biological pathways, including Ribosome, Melanogenesis, Wnt signaling pathway, Glutamatergic synapse and Gastric acid secretion (Table 3). Moreover, 327 PCGs were found to be associated with INFECTION and CANCER disease classes through enrichment analysis in GAD_DISEASE_ CLASS (Table 3).

DISCUSSION
Colon cancer is a multifactorial disease with etiology encompassing genetic factors, environmental exposures and inflammatory conditions of the digestive tract. Although traditional prognostic and predictive factors for colorectal cancer such as age, tumor stage, surgical margins, number of affected local lymph nodes and tumor grade have produced significant improvements in patient clinical outcome [27], they reveals obvious limitations in distinguishing related cancer risk subgroup that have different clinical outcomes due to the molecular heterogeneity. Therefore, the prognostic potential of molecular markers has been systematically investigated in extensive clinical transcriptome research over the last decade. For example, a well-known gene signature, ColoPrint ® , was developed to predict disease relapse in  [29]. Another study reported prognostic signature comprising of 113 probe sets (termed CRC-113) to predict prognosis in patients with colorectal cancer [30]. With increasing knowledge in the field of ncRNA research, the development in the identification of molecular markers has been the shift in focus from protein-coding genes to ncRNAs [31,32]. Some miRNAbased signatures were firstly identified in recent studies [33,34]. Recently, lncRNAs have been discovered as a key component of genome regulatory network and contribute to the hallmark of cancer [5,35]. Accumulating evidence showed that lncRNA displays restricted tissue-specific and cancer-specific expression patterns distinguishing from miRNAs and protein-coding genes [8]. Moreover, tissue-specific and cancer-specific lncRNA expression was detectable and stable in body fluids, sputum and urine of cancer patients [8]. This specificity represents potentials of lncRNAs as superior biomarkers in cancer diagnosis and prognosis than miRNAs and protein-coding genes [36][37][38][39]. Several groups have initially investigated the prognostic value of lncRNA in colon cancers. Chen  [19]. In one lncRNA profiling study based on GEO data, a lncRNA expression signature containing six lncRNA was first identified to improve prognosis prediction of colorectal cancer [40]. A recent study confirmed eight novel lncRNAs as associated with the progression of colon cancer [26]. However, it should be noted that there was little overlap between lncRNAs identified as biomarkers in colorectal cancer in previous studies, indicating that seeking lncRNA biomarkers of colorectal cancer is presently in its infancy and the further comprehensive investigation is necessary.
In this study, we performed integrative analysis for lncRNA expression profiles and clinical data in 157 patients with colon adenocarcinoma from TCGA to access the prognostic value of lncRNA expression for OS in patients with colon adenocarcinoma. We first subjected the lncRNA expression profiles of 79 patients in the training set to Cox proportional regression analysis for evaluating the relationship between lncRNA expression and clinical outcome, and identified two lncRNAs (SNHG6 and CTD-2354A18.1) which could be considered as independent prognostic factors for predicting clinical outcome in colon adenocarcinoma. More important, using a sample-splitting approach, a linear combination of these two lncRNA biomarkers (SNHG6 and CTD-2354A18.1), termed two-lncRNA signature, was developed in the training set as a predictor for OS in patients with colon adenocarcinoma, and validated in the testing set and entire patient set. This two-lncRNA signature demonstrated significant prognostic performance in both the testing set and entire patient set which classified the patients into two groups with significantly different OS. Further analysis revealed that the prognostic value of the two-lncRNA signature was independent of other traditional clinical variables, such as age, gender, stage. Moreover, in the stratified analysis, the two-lncRNA signature showed prognostic value both in early-stage and advanced-stage patients, and both in younger and elder patients. These suggested that the two-lncRNA signature is an independent prognostic predictor in colon adenocarcinoma.
The two-lncRNA signature identified in this study included one lncRNAs (SNHG6) that was protective and another lncRNA (CTD-2354A18.1) that was risky with respect to their association between their expression and patients' survival. In the literature, lncRNA SNHG6 has been reported to be differentially expressed in hepatocellular carcinoma (HCC) and may be a potential biomarker for patients with HCC [41]. Yang's study found that SNHG6 is overexpressed in gastric cancer tissues and cell lines measured by quantitative real-time polymerase chain reaction (qRT-PCR) and might serve as a candidate prognostic biomarker for gastric cancer patients [42]. Another lncRNA biomarker, CTD-2354A18. 1, has also been shown to differentially expressed in gastric cancer tissue and normal tissue and may play a key role in the pathogenesis of gastric cancer [43]. However, the functional roles of these two lncRNA biomarkers in colon adenocarcinoma still remain to be explored. We infer potential functional roles of these two lncRNA biomarkers in colon adenocarcinoma performed by GO and KEGG pathway functional enrichment analysis, and found that these two lncRNA biomarkers may be mainly involved in transcription/translation-related or DNA repair-related biological processes, which is consistent with previous study [41].
In conclusion, our study identified a novel lncRNA expression signature comprising two lncRNAs (SNHG6 and CTD-2354A18.1), which can be used as an independent prognostic marker of OS for patients with colon adenocarcinoma. More importantly, this lncRNA signature could provide additional prognostic information beyond clinicopathological factors. Our results warrant further studies on these lncRNAs that will improve our understanding of the mechanisms associated with pathogenesis and progression of colon adenocarcinoma.

Patient dataset
Colon adenocarcinoma patients and corresponding clinical data used in this study were obtained from were obtained from The Cancer Genome Atlas (TCGA) data portal (https://cancergenome.nih.gov/). Then downloaded clinical data were matched to the lncRNA expression profile. Therefore, some patients without clinical information or lncRNA expression profile were excluded. Finally, a total of 157 colon adenocarcinoma patients were retained for further analysis. In this study, 157 colon adenocarcinoma patients were randomly assigned to a training set (n = 79) and a testing set (n = 78). The training set was used to detect prognostic lncRNA. The detailed clinical characteristics of the colon adenocarcinoma patients are summarized in Table 4.

LncRNA expression data procession
Expression levels of 12,727 lncRNA genes from 157 colon adenocarcinoma patients were downloaded from the TANRIC(The Atlas of ncRNA in Cancer, http://bioinformatics.mdanderson.org/) (version 1.0. 6) which is an open-access resource for expression profiles of lncRNAs in large patient cohorts of 20 cancer types including TCGA [44]. Briefly, 12,727 lncRNAs was obtained by examining the overlapping between lncRNA exons and any known coding genes based on the annotations of GENCODE and RefGene. Then, expression levels of 12,727 lncRNAs were quantified using reads per kilobase per million mapped reads (RPKM). Due to the very low expression level of lncRNAs Colon adenocarcinoma patients, we filtered out lncRNAs with expression value of 0 in more than 30% samples and obtained 1296 lncRNAs for further analysis. The expressive value of 1296 lncRNAs was log transformed using log2 (RPKM+0.001) and z-score normalized.

Statistical analysis
Univariate and multivariate Cox regression analysis was first performed to evaluate the association between Oncotarget 101306 www.impactjournals.com/oncotarget the expression levels of lncRNAs and the OS of patients with colon adenocarcinoma and to identify prognostic lncRNAs. A lncRNA expression signature was constructed by a linear combination of the expression values of prognostic lncRNAs and the multivariate Cox regression coefficient as the weight. By using the median risk score as the cutoff point, patients can be classified into highrisk and low-risk groups. The Kaplan-Meier curves and log rank test were performed to assess survival differences between the low-risk and high-risk groups using the R package "survival". Multivariate Cox regression analysis and data stratification analysis was performed to test whether the lncRNA expression signature was independent of clinical features. Time-dependent ROC curves were used to compare the sensitivity and specificity of the three-survival prediction based on the lncRNA expression signature using the R package "survivalROC" [45]. All analyses were performed using the R/BioConductor (version 3.0.2).