A long non-coding RNA signature for predicting survival in patients with colorectal cancer

Dysregulation of long non-coding RNAs (lncRNAs) plays important roles in cancer development and progression. In this work, we attempted to develop a lncRNA signature to improve prognosis prediction of colorectal cancer. We performed a comprehensive analysis of the lncRNA expression and corresponding clinical information of 344 colorectal cancer patients, based on data from The Cancer Genome Atlas (TCGA). We randomly divided the TCGA data into a training set (n = 172) and a testing set (n = 172). A four-lncRNA signature was established that was significantly associated with the overall survival of colorectal cancer patients. Based on the four-lncRNA signature, the training set could be classified into high-risk and low-risk groups with significantly different survival. This result was further validated in the testing dataset and in another independent dataset. Further analyses suggested that the prognostic power of the four-lncRNA signature was independent of other clinical variables. The identification of this lncRNA signature indicates that lncRNAs could be novel independent biomarkers for predicting survival in patients with colorectal cancer.


INTRODUCTION
Colorectal cancer (CRC) is the third most common malignancy and a major cause of cancer-related death worldwide [1,2]. The incidence of colorectal cancer is gradually increasing in developed areas. To date, surgery followed by adjuvant therapy is still the most common option for CRC patients. Despite an improved understanding of the molecular mechanism of CRC, the overall survival (OS) of CRC patients has not been dramatically improved and the four-year survival rate remains very low [3]. There is an urgent need to identify novel independent biomarkers for the diagnosis and prognosis of CRC.
With advances in transcriptome profiling, the roles of long non-coding RNAs (lncRNAs) have received great attention in human cancer research. LncRNAs are an important category of non-coding RNAs with little or no protein-coding capacity [4,5]. It has been documented that lncRNAs play important roles in regulating gene expression at the transcriptional, posttranscriptional and epigenetic levels [4,6-8]. Moreover, lncRNAs can participate in various biological processes and pathways, such as cell growth and immune response [7,9,10]. Recently, many lncRNAs have been shown to play critical oncogenic or tumor-suppressive roles in various types of cancer [11-14].
Here, we attempted to develop a lncRNA signature to improve prognosis prediction of CRC. We identified a four-lncRNA signature using the sample-splitting method. Our results demonstrate that the four-lncRNA signature can provide novel insight into the molecular mechanism underlying CRC.

RESULTS
Identification of prognostic lncRNAs from the training dataset
The 344 CRC patients were randomly divided into a training dataset (n = 172) and a testing dataset (n = 172). We first identified the prognostic lncRNAs from the training set. A univariate Cox regression analysis was performed to evaluate the association between lncRNA expression and overall survival of CRC patients. Based on the threshold of P-value < 0.01, four lncRNAs were identified as significantly correlated with overall survival of CRC patients. Detailed information on these four lncRNAs is shown in Table 1. A positive coefficient indicates that higher expression was associated with shorter overall survival (SPRY4-IT1), whereas a negative coefficient indicates that higher expression was associated with longer survival (LINC01133, Loc554202 and RP11-727F15.13).
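The univariate screen described above can be illustrated with a minimal Python sketch. The analyses in this work were performed in R, so this is not the original implementation: the function name `univariate_cox`, the Newton-Raphson fit with Breslow handling of ties, the Wald test, and the toy data are all assumptions made for illustration. In a screen of this kind, the fit would be repeated for each lncRNA and those with P-value < 0.01 retained.

```python
import math

def univariate_cox(times, events, x, max_iter=50, tol=1e-9):
    """Fit h(t | x) = h0(t) * exp(beta * x) by Newton-Raphson (Breslow ties).

    Returns (beta, standard_error, two_sided_wald_p_value).
    Assumes x is not constant, otherwise the information is zero.
    """
    beta = 0.0
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for t_i, d_i, x_i in zip(times, events, x):
            if not d_i:
                continue  # censored patients only appear in risk sets
            # Risk set: everyone still under observation at time t_i.
            risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
            w = [math.exp(beta * x_j) for x_j in risk]
            s0 = sum(w)
            s1 = sum(w_j * x_j for w_j, x_j in zip(w, risk))
            s2 = sum(w_j * x_j * x_j for w_j, x_j in zip(w, risk))
            score += x_i - s1 / s0              # gradient of the partial log-likelihood
            info += s2 / s0 - (s1 / s0) ** 2    # observed information
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    se = 1.0 / math.sqrt(info)  # information from the final iteration
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p_value

# Hypothetical toy data (not from TCGA): survival in months, 1 = death, 0 = censored.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 0]
expr = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
beta, se, p = univariate_cox(times, events, expr)
```

A positive `beta` means that higher expression goes with a higher hazard, that is, shorter survival, which matches the sign convention used for the coefficients in the text.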
A four-lncRNA signature for predicting overall survival of CRC patients
These four lncRNAs were analyzed using a multivariate Cox regression analysis to establish a lncRNA signature for predicting patients' overall survival. We constructed a risk-score formula by combining the lncRNA expression values with the corresponding regression coefficients estimated in the multivariate Cox regression analysis, as follows: Risk score = (0.322 × expression value of SPRY4-IT1) + (-0.134 × expression value of Loc554202) + (-0.336 × expression value of LINC01133) + (-0.231 × expression value of RP11-727F15.13). We calculated the four-lncRNA signature risk score for each CRC patient and ranked the patients by risk score. These 172 CRC patients were divided into a high-risk group (n = 90) and a low-risk group (n = 82) using the median risk score as the threshold.
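As a worked illustration of the risk-score formula, the following Python sketch computes the score and splits patients at a cutoff. The coefficients are those reported above; the expression values are hypothetical, and the rule that a score strictly above the cutoff counts as high-risk is an assumption, since the text does not state how ties with the median are handled.

```python
from statistics import median

# Coefficients as reported in the risk-score formula.
COEFFICIENTS = {
    "SPRY4-IT1": 0.322,
    "Loc554202": -0.134,
    "LINC01133": -0.336,
    "RP11-727F15.13": -0.231,
}

def risk_score(expression):
    """Weighted sum of the four lncRNA expression values."""
    return sum(coef * expression[name] for name, coef in COEFFICIENTS.items())

def split_by_cutoff(scores, cutoff):
    """Label a score 'high' if it exceeds the cutoff, else 'low' (assumed rule)."""
    return ["high" if s > cutoff else "low" for s in scores]

# Hypothetical expression values for three patients (not real data).
patients = [
    {"SPRY4-IT1": 2.0, "Loc554202": 1.0, "LINC01133": 0.5, "RP11-727F15.13": 1.0},
    {"SPRY4-IT1": 0.5, "Loc554202": 2.0, "LINC01133": 3.0, "RP11-727F15.13": 2.0},
    {"SPRY4-IT1": 1.0, "Loc554202": 1.0, "LINC01133": 1.0, "RP11-727F15.13": 1.0},
]
scores = [risk_score(p) for p in patients]
cutoff = median(scores)          # cutoff derived from this (training) cohort
groups = split_by_cutoff(scores, cutoff)
```

The same `cutoff` value can then be passed to `split_by_cutoff` for patients from another cohort, which mirrors reusing a training-derived threshold on new data.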
A significant difference in overall survival between the high-risk group and the low-risk group was observed (P-value = 1.74E-06; Figure 1A). CRC patients in the high-risk group had significantly shorter survival (median 18 months) than those in the low-risk group (median 24.5 months). The time-dependent ROC curve analysis achieved an AUC of 0.727 for overall survival at five years (Figure 1B), suggesting a competitive performance of the four-lncRNA signature for survival prediction. The lncRNA risk score was significantly associated with overall survival of CRC patients in the univariate Cox regression analysis (Table 2).
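A comparison of survival between two risk groups can be sketched as follows. The text does not name the test behind the reported P-values; the two-group log-rank test, a common companion of Kaplan-Meier curves, is assumed here. The function name `logrank_test` and the toy data are hypothetical.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi_square, p_value) with 1 df."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in pooled if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Patients still at risk just before time t, per group.
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)
        n_b = sum(1 for u, _, g in pooled if u >= t and g == 1)
        # Deaths occurring exactly at time t, per group.
        d_a = sum(1 for u, e, g in pooled if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in pooled if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical survival times in months; every patient has an observed death.
high_risk = ([1, 2, 3, 4], [1, 1, 1, 1])
low_risk = ([5, 6, 7, 8], [1, 1, 1, 1])
chi2, p = logrank_test(*high_risk, *low_risk)
```

The sketch assumes at least one event time at which both groups still have patients at risk; otherwise the variance is zero and the statistic is undefined.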

Validation of the four-lncRNA signature for survival prediction in the testing dataset and another independent dataset
We confirmed our results using the testing set. Using the same risk score formula, the 172 CRC patients were classified into a high-risk group (n = 77) and a low-risk group (n = 95) with the same cutoff point derived from the training dataset. There was a significant difference in overall survival between the high-risk group and the low-risk group (P-value = 0.00439, median 17.5 months vs. 23 months; Figure 2A). The AUC value in the testing set was 0.712 for overall survival at four years, and the lncRNA risk score was significantly associated with patients' overall survival (Table 2). Next, we performed the same analysis in the entire TCGA CRC dataset, and similar results were obtained. The lncRNA signature classified the 344 CRC patients into a high-risk group (n = 166) and a low-risk group (n = 178) with a significant difference in overall survival (P-value = 6.9E-05, median 16 months vs. 23 months; Figure 2B). The AUC value in the entire set was 0.721 for overall survival at four years. Further analysis indicated that the lncRNA risk score was significantly associated with CRC patients' overall survival in the entire TCGA CRC dataset (Table 2). We further validated our lncRNA signature in an independent CRC dataset (GSE14333). As shown in Figure 2C, the lncRNA signature can effectively predict overall survival in CRC patients. A significant difference in overall survival between the high-risk group (n = 125) and the low-risk group (n = 72) was observed (P-value = 0.0183, median 38.3 months vs.

Independence of the lncRNA signature for survival prediction from other clinical variables
We examined whether the prognostic power of the lncRNA signature was independent of other clinical variables, such as age, gender, subtype and tumor stage. Multivariate Cox regression analyses were performed, and the lncRNA risk score remained significantly associated with overall survival after adjustment for the other clinical variables (Table 2). Patient age and tumor stage were also significantly associated with overall survival. A series of stratified analyses was therefore performed according to age and tumor stage, respectively. First, all CRC patients were stratified into a younger group (n = 132, age < 65) and an elder group (n = 212, age ≥ 65). The lncRNA signature divided the younger group into a high-risk subgroup (n = 85) and a low-risk subgroup (n = 47) with a significant difference in survival (P-value = 0.00416, median 23 months vs. 50.85 months; Figure 3A). In the elder group, the four-lncRNA signature was also able to classify patients into a high-risk subgroup (n = 147) and a low-risk subgroup (n = 65) with significantly different survival (P-value = 0.00742, median 13.3 months vs. 20.1 months; Figure 3B). Next, all CRC patients were stratified by tumor stage into an early subgroup (stage I and II, n = 196) and a late subgroup (stage III and IV, n = 148). The stratified analysis showed effective prognostic power in both the early and the late subgroup. As shown in Figure 4A, patients in the early subgroup were divided into a high-risk group (n = 92) with shorter survival and a low-risk group (n = 104) with longer survival (P-value = 0.00189, median 26 months vs. 51.05 months). Similar results were obtained in the late subgroup (P-value = 2.48E-04, median 16 months vs. 24.5 months; Figure 4B).
These results demonstrated that the prognostic ability of the lncRNA signature is independent of other clinical variables for the prediction of survival in CRC patients.

Functional implications of the prognostic lncRNAs
We investigated the potential functional roles of the four prognostic lncRNAs in CRC. Spearman correlation coefficients were calculated between lncRNAs and protein-coding genes using the expression profiles of 344 CRC patients. A total of 732 protein-coding genes were positively correlated with at least one of the four lncRNAs (Spearman correlation coefficient > 0.6). Functional enrichment analyses indicated that these protein-coding genes were significantly enriched in 20 GO categories (P-value < 0.01, Figure 5). These functionally enriched GO categories included assembly and disassembly of protein and macromolecules, transcription, signal transduction and response to stimulus, cell apoptosis and
Derived from the univariate and multivariate Cox regression analyses in CRC patients of the training dataset.

DISCUSSION
Great efforts have been devoted to detecting prognostic biomarkers for CRC among protein-coding and non-coding genes [21,22,26,27]. Mounting evidence suggests that expression changes of lncRNAs are implicated in tumorigenesis, with lncRNAs acting as oncogenes or tumor suppressors [8,28]. Moreover, dysregulation of lncRNAs has been measured in various cancer types, highlighting their potential roles as novel independent biomarkers for cancer prognosis [10,29-32]. Some studies have identified potential prognostic lncRNA signatures to predict overall survival in several cancer types, such as glioblastoma and lung cancer [15,18]. However, the prognostic power of a lncRNA signature for predicting survival in patients with CRC has not yet been investigated.
Many lncRNAs have been discovered in humans over the past decades [33]. However, only a few of them are well characterized in human cancers. Among the four lncRNAs studied here, SPRY4-IT1 and LINC01133 have been reported to be prognostic factors in patients with CRC [34,35]. In this work, we identified four lncRNAs that are significantly associated with CRC patients' survival and established a four-lncRNA signature for the prediction of survival. The results suggested a competitive performance of the four-lncRNA signature for predicting survival. This finding was validated using the TCGA testing set and another independent dataset, which demonstrated the reliability and reproducibility of the four-lncRNA signature for predicting CRC patients' survival. Further analyses stratified by age and tumor stage showed that the prognostic power of the four-lncRNA signature was independent of other clinical variables for survival prediction of patients with CRC.
Previous studies documented that lncRNAs participated in biological processes by positively regulating protein-coding genes involved in the same processes. It is possible to predict lncRNA biological functions based on their co-expressed protein-coding genes [36][37][38]. Here, we performed GO enrichment analyses for lncRNA co-expressed protein-coding genes. The results demonstrated the important functional roles of the four prognostic lncRNAs in CRC tumorigenesis.
Taken together, we performed a comprehensive analysis of lncRNA expression profiles and corresponding clinical information in CRC patients. Our work identified four prognostic lncRNAs that were significantly associated with CRC patients' survival. A four-lncRNA signature was established to effectively predict patients' survival. The four-lncRNA signature might function as a novel independent biomarker for CRC prognosis. Our work provides insight into the molecular mechanism of CRC.
Figure 5. (A) The functional enrichment map of GO terms. Each node represents a GO category. An edge represents the overlap of the shared genes between connecting terms. Node size represents the number of genes in the GO term. Color intensity is proportional to enrichment significance.

MATERIALS AND METHODS
CRC datasets and clinical information
CRC lncRNA data and corresponding clinical information were downloaded from the TCGA data portal. A total of 344 CRC patients were included in this work after removal of patients without clear clinical information. The lncRNAs derived from TCGA were annotated based on the GENCODE database [39] to reduce redundancy. Expressed lncRNAs were defined as those with an average Fragments Per Kilobase of transcript per Million fragments mapped (FPKM) ≥ 0.1. The lncRNA expression profiles were normalized by log2 transformation. Finally, a total of 14,467 lncRNAs across 344 CRC patients were included.
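The expression filter and normalization described above can be sketched in Python as follows. The function name `filter_and_log2` and the toy values are hypothetical, and the pseudocount of 1 added before the log2 transform is an assumption (the text does not say how zero FPKM values were handled).

```python
import math

def filter_and_log2(fpkm_by_lncrna, min_mean_fpkm=0.1, pseudocount=1.0):
    """Keep lncRNAs whose mean FPKM meets the threshold, then log2-transform.

    The pseudocount is an assumption of this sketch; it keeps log2 defined at 0.
    """
    kept = {}
    for name, values in fpkm_by_lncrna.items():
        if sum(values) / len(values) >= min_mean_fpkm:
            kept[name] = [math.log2(v + pseudocount) for v in values]
    return kept

# Hypothetical FPKM values for two lncRNAs across three patients.
fpkm = {
    "lncA": [0.0, 1.0, 3.0],    # mean above 0.1, kept
    "lncB": [0.0, 0.05, 0.1],   # mean 0.05, dropped
}
profiles = filter_and_log2(fpkm)
```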

Identification of prognostic lncRNA signature
We randomly divided CRC patients into a training set (n = 172) and a testing set (n = 172). In the training set, the association between lncRNA expression and the overall survival of CRC patients was evaluated using a univariate Cox regression analysis. The lncRNAs significantly associated with the overall survival of CRC patients were identified based on the threshold of P-value < 0.01. Next, the selected lncRNAs were subjected to a multivariate Cox regression analysis. We established a risk score formula from the lncRNA expression values, weighted by the regression coefficients derived from the multivariate Cox regression analysis. CRC patients in the training set were then divided into high-risk and low-risk groups using the median risk score as the threshold.
The survival differences between the high-risk and low-risk groups in each dataset were evaluated by Kaplan-Meier analyses. Multivariate Cox regression and stratified analyses were carried out to evaluate whether the prognostic power of the four-lncRNA signature was independent of other clinical variables. Receiver operating characteristic (ROC) curve analyses were performed to evaluate the performance of overall survival prediction, and area under the ROC curve (AUC) values were calculated. All analyses were performed in R.
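For reference, the standard Cox proportional hazards model and the resulting linear risk score can be written as below. The notation is an assumption of this note rather than taken from the original analysis: $x_k$ is the expression value of the $k$-th selected lncRNA, $\hat{\beta}_k$ its estimated regression coefficient, $h_0(t)$ the unspecified baseline hazard, and $K$ the number of selected lncRNAs ($K = 4$ here).

```latex
h(t \mid x) = h_0(t)\,\exp\!\Big(\sum_{k=1}^{K} \beta_k x_k\Big),
\qquad
\text{Risk score} = \sum_{k=1}^{K} \hat{\beta}_k x_k
```

Under this model a higher risk score corresponds to a higher hazard at every time point, which is why a single cutoff on the score can separate patients into groups with different expected survival.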
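The Kaplan-Meier estimate used to compare groups can be sketched in a few lines of Python. The function names and the toy data are hypothetical; median survival is taken here as the first time at which the estimated survival probability falls to 0.5 or below.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    curve, surv = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - deaths / at_risk   # product-limit update
        curve.append((t, surv))
    return curve

def median_survival(curve):
    """First time at which survival drops to 0.5 or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical follow-up in months; 1 = death observed, 0 = censored.
times = [5, 8, 12, 20, 24]
events = [1, 1, 1, 0, 1]
curve = kaplan_meier(times, events)
median_months = median_survival(curve)
```

The patient censored at 20 months stays in the risk set up to that time but never triggers a drop in the curve, which is how censoring enters the estimate.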

Functional enrichment analyses
Since lncRNAs are often co-expressed with neighboring coding genes, we calculated Spearman correlation coefficients to evaluate co-expression relationships between lncRNAs and protein-coding genes. Functional enrichment analyses for the co-expressed protein-coding genes were performed using the DAVID software [40,41]. Gene Ontology (GO) categories with a P-value < 0.01 were considered significantly enriched functional annotations.
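The co-expression measure can be illustrated with a small Python sketch of the Spearman correlation, computed as the Pearson correlation of average ranks. The function names and expression values are hypothetical; the threshold of 0.6 is the one used in this work to call a gene positively co-expressed.

```python
import math

def rank(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(rank(a), rank(b))

# Hypothetical expression values across five patients.
lncrna = [1.0, 2.0, 3.0, 4.0, 5.0]
gene = [2.0, 4.1, 8.0, 16.5, 31.0]   # increases monotonically with the lncRNA
rho = spearman(lncrna, gene)
co_expressed = rho > 0.6
```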

Author contributions
MX, YW and DS conceived and designed the experiments. YW, JS, XW, TL analyzed the data and wrote the manuscript. All authors read and approved the final manuscript.