Comprehensive investigation of a novel differentially expressed lncRNA expression profile signature to assess the survival of patients with colorectal adenocarcinoma

Growing evidence has shown that long non-coding RNAs (lncRNAs) can serve as prospective markers for survival in patients with colorectal adenocarcinoma. However, most studies have explored a limited number of lncRNAs in a small number of cases. The objective of this study is to identify a panel of lncRNA signature that could evaluate the prognosis in colorectal adenocarcinoma based on the data from The Cancer Genome Atlas (TCGA). Altogether, 371 colon adenocarcinoma (COAD) patients with complete clinical data were included in our study as the test cohort. A total of 578 differentially expressed lncRNAs (DELs) were observed, among which 20 lncRNAs closely related to overall survival (OS) in COAD patients were identified using a Cox proportional regression model. A risk score formula was developed to assess the prognostic value of the lncRNA signature in COAD with four lncRNAs (LINC01555, RP11-610P16.1, RP11-108K3.1 and LINC01207), which were identified to possess the most remarkable correlation with OS in COAD patients. COAD patients with a high-risk score had poorer OS than those with a low-risk score. The multivariate Cox regression analyses confirmed that the four-lncRNA signature could function as an independent prognostic indicator for COAD patients, which was largely mirrored in the validating cohort with rectal adenocarcinoma (READ) containing 158 cases. In addition, the correlative genes of LINC01555 and LINC01207 were enriched in the cAMP signaling and mucin type O-Glycan biosynthesis pathways. With further validation in the future, our study indicates that the four-lncRNA signature could serve as an independent biomarker for survival of colorectal adenocarcinoma.


INTRODUCTION
Colon adenocarcinoma (COAD) is one of the most frequently diagnosed cancers and a top cause of cancer death globally. Approximately 1.2 million new cases are diagnosed, causing 0.6 million deaths per year all over the world. Cancer metastasis remains the key cause of COAD death [1][2][3][4][5][6]. The five-year overall survival (OS) rate of patients with primary COAD can be up to 80-90%, but it is reduced to 5-10% in patients with metastatic tumor [7][8][9][10][11][12]. Rectal adenocarcinoma (READ), which shares similar

Research Paper
Oncotarget 16812 www.impactjournals.com/oncotarget molecular mechanism with COAD, has a comparable high incidence and poor prognosis [1-4, 6-8, 10]. Therefore, the assessment of prognostic factors is pivotal for the management of unresectable colorectal cancer patients. Several underlying mechanisms have been described in the past decades, including multiple molecular alterations, which confer the tumorigenesis and progression of colorectal cancer [13][14][15][16][17][18]. However, the exact underlying molecular markers for survival assessment of colorectal cancer remain largely unknown.
Long non-coding RNAs (lncRNAs) are a class of non-protein coding RNAs of more than 200 nucleotides, which are broadly distributed in the genome and can modulate gene expression [19][20][21][22]. Recent accumulating evidence has demonstrated that lncRNA expression profiles are frequently changed in tumors compared to that of the adjacent non-tumorous tissues in numerous cancers [23][24][25][26][27]. The altered lncRNA expression profile has been proposed to correlate with the progression and survival in patients with various cancers, including colorectal cancer, which reveals the potential of lncRNAs to act as cancer biomarkers [28][29][30][31][32][33][34][35]. However, most of the previous studies explored a limited number of lncRNAs in a small number of cases [36][37][38][39].
Previous studies have verified that a lncRNA expression signature could be obtained from the database of The Cancer Genome Atlas (TCGA), which offers a platform for researchers to download and assess free public datasets [40][41][42][43][44][45]. Towards this, in the current study, the TCGA database was first applied to gather lncRNA gene expression profiles in COAD. By performing a comprehensive lncRNA expression profile assessment, we identified a lncRNA signature in COAD with four lncRNAs (LINC01555, RP11-610P16.1, RP11-108K3.1, and LINC01207), as a new candidate indicator with the potential to predict the OS in COAD patients. Furthermore, the prognostic value of this four-lncRNA-signature was validated in a READ cohort (Supplementary Figure 1).

Differentially expressed lncRNAs (DELs)
We initially performed differential expression analysis by comparing the expression of 7589 lncRNAs in COAD and normal colon tissue. The edgeR package identified 1430 differentially expressed lncRNAs (DELs, Figure 1A) and the DEseq package identified 584 DELs ( Figure 1B). We combined these two groups of DELs together and 578 DELs showed a consistent direction of differential expression across the two methods ( Figure 1C, Figure 2). Next, we excluded those cases without sufficient survival data, leaving 224 DELs that were selected for further survival analysis (Supplementary Table 1).
The COAD patients were divided into two groups of low-risk and high-risk based on the individual inflection point of the prognostic risk score (Figure 4). The risk score could largely predict the 5-year survival of COAD patients, as the area under ROC curve (AUC) was 0.706 ( Figure  5A). Furthermore, K-M curves confirmed that the survival time of patients in the high-risk group was 72.935 ± 7.398 months, predominantly shorter than that of the low-risk group (103.402 ± 8.679 months, P < 0.001, Figure 5B).
Meanwhile, the prognostic value of different clinical parameters was also compared to that of the risk score. The univariate Cox proportional hazards regression showed that a number of parameters could predict poorer survival of COAD (Table 2). However, when analyzed by multivariate Cox proportional hazards regression test, only neoplasm recurrence together with the risk score from the DELs, was independent prognostic indictor of COAD ( Table 2). The K-M curves of the above clinical features are shown in Figure 6.
We also assessed the relationship between the risk score based on the DELs signature and various clinical features, and the risk score showed moderate prognostic value for predicting the status of tumor stage, metastasis and lymphatic invasion ( Figure 7). The expression pattern of these four DELs in the low-and high-risk group is also presented in Figure 8.

Validation of the four-DEL-signature in READ
For validation, the READ patients were also divided into low-risk and high-risk groups on the basis of the prognostic risk score ( Figure 9A). Though the AUC of ROC to predict 5-year survival was slightly less than that in COAD patients ( Figure 9B), K-M curves did show a close relationship between the four-DEL-signature and survival (P = 0.014, Figure 9C). Additionally, both univariate (HR: 3.006, 95% CI: 1.192 -7.586, P = 0.020) and multivariate Cox proportional hazards regression tests (HR: 8.602, 95% CI: 1.159-63.839, P = 0.035) revealed that the risk score of four-DEL-signature www.impactjournals.com/oncotarget   Table 2).

Functional assessment of the prognostic DELs
Only correlative genes for LINC01555 and LINC01207 could be determined using the Multi-Experiment Matrix (MEM, Figure 10 and Figure 11). Forty genes were collected for LINC01555 and 78 genes for LINC01207. Using kyoto encyclopedia of genes and genomes (KEGG) pathway analysis, the correlative genes of LINC01555 were found to be enriched in the cAMP signaling pathway and the neuroactive ligand-receptor interaction pathways, while genes related to LINC01207 were enriched in mucin type O-Glycan biosynthesis pathway, the -lacto/neolacto-series glycosphingolipid biosynthesis pathway and metabolic pathway. Similarly, some Gene Ontology (GO) terms were also enriched   Figure 12). For construction of the proteinprotein interaction (PPI) network, there were 3 genes with more than 3 nodes for LINC01207 (GALNT4, GALNT7, MUC13), which were regarded as hub genes. However, we failed to observe similar hub genes for LINC01555.

DISCUSSION
In the current study, to define lncRNAs significantly related to OS, a univariate Cox proportional hazards regression with the significance level set at 0.05 was first performed on 7589 lncRNAs from 371 COAD patients according to the defined criterion in a large number of COAD patients based on the data downloaded from the TCGA database. A total of four lncRNAs were identified. We then developed a risk score by combining the four lncRNAs and found that this four-lncRNA signature could independently predict OS in COAD patients, which was further validated in READ patients. As far as we know, this is the first study to construct a risk score by mining TCGA data for the survival assessment of colorectal cancer patients. Colorectal cancer is one of the deadliest solid malignancies, and the involvement of noncoding RNAs in the development, diagnosis, and prognosis of colorectal cancer has been widely investigated [33][34][35]. In addition to the aggressive properties of colorectal cancer, the lack of specific biomarkers for its diagnosis, therapeutic effect monitoring and prognosis might also be responsible for the low survival rate. Hence, there is a critical need for reliable prognostic factors pinpointing a poor outcome.
Recent large-scale genomic analyses have made it possible to reveal a catalogue of molecular characteristics associated with colorectal cancer outcome [44,[46][47][48][49][50][51][52][53]. However, most of the existing studies have focused on mRNA and microRNA expression [54][55][56]. Knowledge is now rapidly emerging on the functional roles of lncRNAs in cancer initiation and progression, representing a significant untapped molecular resource for cancer prognosis as well.       The relationship between aberrant lncRNAs and survival of colorectal cancer has been studied in small samples using distinct approaches. Li et al. [57] analyzed the prognostic value of 21 lncRNAs by PCR array in 30 colorectal cancer patients and reported that higher levels of AFAP1-AS1, BCAR4, H19, HOXA-AS2, MALAT1 or PVT1 and a lower level of ADAMTS9-AS2 could predict a poor prognosis of colorectal cancer patients. Similarly, Wang et al. [58] studied lncRNA expression profiling using microarray in six cases of colorectal cancer patients. Multivariate Cox analysis revealed that lncRNA NR_029373 and NR_034119 were both Oncotarget 16822 www.impactjournals.com/oncotarget independently related to the disease-specific survival rate. Furthermore, Sun et al. [25] searched in GEO datasets and achieved five studies: GSE8671, GSE22598, GSE23878, GSE9348, and GSE37364, that studied lncRNAs in 150 cases of colorectal cancer patients. They found that one lncRNA, AK098081 could be considered an independent risk factor for colorectal cancer patients (HR = 1.896, 95% CI = 1.393-2.579, P < 0.001). Surprisingly, no consistent lncRNA has been verified by different groups, potentially due, at least in part, to the limited sample size and differing detection methods. Compared with previous studies, our study uses data from the TCGA database with high-throughput analysis of lncRNAs from a larger sample size. Herein, we report that expression of four novel lncRNAs (LINC01555, RP11-610P16.1, RP11-108K3.1 and LINC01207) could also become a new independent risk factor for colorectal cancer patients. Moreover, the risk score constructed from these four lncRNAs could be an indicator for the colorectal cancer patients in the clinical setting.
No study as of yet has investigated the function of the aforementioned four lncRNAs. Here, we performed MEM to gather the correlative genes of these four lncRNAs. However, the correlative genes were only found for LINC01555 and LINC01207 in this step. Interestingly, the correlative genes of LINC01555 were enriched in cAMP signaling pathway and neuroactive ligand-receptor interaction pathway, whereas the genes Oncotarget 16823 www.impactjournals.com/oncotarget correlative to LINC01207 were enriched in mucin type O-Glycan biosynthesis pathway, -the lacto/neolactoseries glycosphingolipid biosynthesis pathway and metabolic pathway, which are all classical signaling pathways closely related to the tumorigenesis and progression of malignancies. For instance, intracellular cAMP has been proposed to impact the biological behavior, namely to suppress the growth of colorectal cancer cells [59,60]. Most likely, LINC01555 could play substantial roles in the tumorigenesis and development of colorectal cancer via influencing cAMP signaling pathway. Therefore, the functional enrichment analysis may offer a clue for elucidating the role of LINC01555 and LINC01207 in carcinogenesis of colorectal cancer and the specific underlying molecular mechanisms. However, since the research on the clinical and biological function of LINC01555, RP11-610P16.1, RP11-108K3.1, and LINC01207 is still nonexistent in colorectal cancer patients, there is a lot of research that needs to be accomplished.
The findings of the current study may have substantial clinical significance or implications; however, some limitations should be considered. First, the mean time of follow-up was 29.375 months for COAD and 26.965 months for READ patients, and a study including a longer follow-up time is warranted to validate our findings in the future. Second, the data from TCGA were based on the RNA-seq technique; other experimental In conclusion, by analyzing the genome-wide lncRNA expression profiles in a large cohort from TCGA, we identified a four-lncRNA signature, which could act as an indicator for patient outcome and could be a potential independent biomarker for prognosis prediction of colorectal cancer. We will gather clinical samples and validated these findings experimentally in our future work.

Differentially expressed lncRNAs
RNA sequencing (RNA-Seq) data from 521 individuals with COAD were obtained from TCGA data portal (https://tcga-data.nci.nih.gov/docs/publications/ tcga/?), including data from 480 COAD tissue samples and 41 non-tumorous adjacent-normal colon tissue samples up to November 9, 2016. Since the data were provided by TCGA, additional approval by an ethics committee was not needed. Data processing was performed in line with the TCGA human subject protection and data Figure 12: GO and KEGG term analysis of potential genes related to LINC01555 and LINC01207. The Rich factor shows the degree of enrichment, which was calculated by the formula: (the number of selected genes in a term/total number of selected genes)/(the total number of genes in a term of the database/ the total number of genes in the database). The Node size represents the number of selected genes, and color represents the p-value of the enrichment analysis. CC, cellular component; MF, molecular function; BP, biological process.
Oncotarget 16825 www.impactjournals.com/oncotarget access policies. This dataset consisted of called gene counts for 60,244 mRNAs, which were assessed on the IlluminaHiSeq RNA-Seq platform. In the current paper, only lncRNAs with description from NCBI or Ensemble were selected for further study. Finally, we obtained the expression profiles of 7581 lncRNAs. We then filtered the differentially expressed lncRNAs (DELs) using two individual R packages: edgeR [61,62] and DEseq [63,64], with Padj < 0.05 and logFC > 1 of expression level between comparison of tumor and adjacent normal colon tissue. Since a strategy that combines edgeR and DESeq is proposed for large sample sizes from TCGA [65], here in the current study, this combination of edgeR and DESeq was adopted. The overlapping DELs obtained using both edgeR and DEseq were sent for further survival analysis. For validation, the relevant data including lncRNA levels and clinicopathological parameters were also downloaded for 158 READ tissues and 10 non-tumorous controls.

Construction of the DEL-based prognostic signature and statistical analysis
The DELs that were 0 in greater than 10% of all subjects were eliminated. The expression level of each DEL was log2-transformed for further analysis. Meanwhile, clinicopathological parameters and survival data were also downloaded from TCGA. The subjects without clinical data were excluded, which resulted in a 371-sample cohort with 224 DELs enrolled in the survival analysis (Supplementary Table 1). The end-point in our study was OS. The average follow-up time was 29.375 months in this COAD cohort and 26.965 months in READ cases.
The univariate Cox proportional hazards regression with a significance level set at 0.05 was performed to obtain the DELs that are closely correlated with the OS. A total of four DELs were identified. The multivariate cox regression model was further performed to test the prognostic value of the DELs. A prognosis risk score for predicting OS was established on the basis of a linear combination of the expression level multiplied by the regression coefficient derived from the multivariate cox regression model (β) with the following formula as previously reported: Risk score = exp DEL1 *β DEL1 + exp DEL2 *β DEL2 + … exp DELn *β DELn .
COAD and READ patients were divided into two groups: high-score and low-score, with the cut-off of the individual inflection point of the prognostic risk score [66]. Univariate Cox proportional hazards regression analyses were further conducted to investigate the effects of various clinical characteristics and the risk score on the OS of COAD and READ patients. Each predictor identified via the univariate analysis was further evaluated by a multivariate Cox proportional hazards regression analysis to determine whether the lncRNA prognostic model was independent of other clinical variables, adjusting for age, tumor stage, grade, surgical debulking status, and risk scores. Hazard ratio (HR) and 95% confidence intervals (CIs) were assessed. The time-dependent receiver operating characteristic (ROC) curve analysis within 5 years as the defining point was also performed using the R package "survivalROC", to evaluate the predictive accuracy of the prognostic model for time-dependent disease outcomes [67]. Kaplan-Meier survival curves were used to estimate the OS time for COAD and READ patients with predicted high-or low-risk scores, and the survival differences between the high-risk group and lowrisk group were assessed by a two-sided log-rank test using SPSS 22.0 software (SPSS Inc., Chicago, IL, USA) [68].
The relationship between the DEL signature and clinical features were examined using a Chi-square test. An ROC curve was drawn to assess the predictive significance of the risk score for the patient outcome after the first course of treatment. If a two-sided P-value was less than 0.05, statistical significance was determined. The statistical analyses were conducted with SPSS22.0 software (SPSS Inc., Chicago, IL, USA).

Functional assessment of prognostic DELs
The correlative genes of the DELs were collected using the Multi-Experiment Matrix (MEM) (http:// biit.cs.ut.ee/mem/index.cgi) [69]. Given a gene as an input, the MEM ranks other genes by similarity in each separable data set. In essence, this analysis is a new rank aggregation method that takes the individual rankings and determines a score of significance, and then, a ranking across all datasets at the same time. Functional enrichment analysis at the GO and KEGG pathway levels and PPI assessments were employed to infer the lncRNA function with the DAVID Bioinformatics Tool (https:// david.ncifcrf.gov/, version 6.7) and STRING database. The enriched GO terms and the KEGG pathways with p-value < 0.05 were regarded as potential function of the prognostic lncRNAs. The DELs and the correlative genes were visualized as a network using Cytoscape. A selection of protein pairs from the PPI assessment with an association score greater than 0.4 and a number of nodes beyond 3 were the final products of the correlative genes.