A five-long non-coding RNA signature to improve prognosis prediction of clear cell renal cell carcinoma

Recent works have reported that long non-coding RNAs (lncRNAs) play critical roles in tumorigenesis and prognosis of cancers, suggesting the potential utility of lncRNAs as cancer prognostic markers. However, lncRNA signatures in predicting the survival of patients with clear cell renal cell carcinoma (ccRCC) remain unknown. In this study, we attempted to identify lncRNA signatures and their prognostic values in ccRCC. Using lncRNA expression profiling data in 440 ccRCC tumors from The Cancer Genome Atlas (TCGA) data, a five-lncRNA signature (AC069513.4, AC003092.1, CTC-205M6.2, RP11-507K2.3, U91328.21) has been identified to be significantly associated with ccRCC patients’ overall survival in both training set and testing set. Based on the lncRNA signature, ccRCC patients could be divided into high-risk and low-risk group with significantly different survival rate. Further multivariable Cox regression analysis suggested that the prognostic value of this signature was independent of clinical factors. Functional enrichment analyses showed the potential functional roles of the five prognostic lncRNAs in ccRCC oncogenesis. These results indicated that this five-lncRNA signature could be used as an independent prognostic biomarker in the prediction of ccRCC patients’ survival.


INTRODUCTION
Renal cell carcinoma (RCC) is one of the most common renal malignancies worldwide, with an estimated 15,000 deaths every year [1]. Recent studies showed that incidence and mortality rates of RCC are increasing in the United States [2]. The vast majority of RCC subtypes are classified as clear cell renal cell carcinoma (ccRCC), which account for 70-80% of all RCCs. ccRCC has been reported to have the highest rate of progression and mortality [3,4]. The standard of care for ccRCC remains surgical excision, and many ccRCC patients will be cured by surgery. However, about 30% of ccRCC patients had metastases and would die following removal of a confined tumor [4,5]. To date, no widely accepted molecular biomarkers for ccRCC aggressiveness have been available. Great efforts to improve the early-stage detection of ccRCC are warranted.
Prognostic lncRNA signatures have been examined in many cancer types [18,[27][28][29][30][31]. In this work, we used a cohort of 440 ccRCC patients from The Cancer Genome Atlas (TCGA) data to detect the potential lncRNA signature in predicting the survival of ccRCC patients. We identified a five-lncRNA signature from the TCGA dataset, and determined its independence of clinical factors. The identification of prognostic lncRNAs suggested the potential roles of lncRNA in ccRCC pathogenesis.

Detecting the prognostic lncRNAs from the training set
The 440 TCGA ccRCC patients were randomly divided into a training (n = 220) set or a testing set (n = 220), respectively. Based on the training set, the lncRNAs were subjected to univariable Cox regression model, and a total of five lncRNAs were significantly correlated with the patients overall survival (P-value < 0.001; Table 1). Three of them (AC069513.4, AC003092.1, RP11-507K2.3) had positive coefficients, representing that the higher expression level was associated with shorter survival. The negative coefficients for the remaining two lncRNAs (CTC-205M6.2, U91328.21) suggest higher levels of expression were related to longer survival.

The five-lncRNA signature and patients' survival in the training set
Base on the expression level of five lncRNAs, we designed a risk-score formula for ccRCC patients' survival prediction. The risk score formula is as following: Risk score= (1.43 × expression level of AC069513.4) + (0.81 × expression level of AC003092.1) + (1.64 × expression level of RP11-507K2.3) + (-6.56 × expression level of CTC-205M6.2) + (-1.72 × expression level of U91328.21). Next, we calculated the lncRNA-based risk score for each ccRCC patient in the training set, and divided ccRCC patients into high-risk (n=110) and lowrisk groups (n=110) using the median risk score value as a threshold. The Kaplan-Meier curves showed that patients in the high-risk group suffered worse prognosis than the patients in the low-risk group (33.3 months vs. 40.1 months, P-value=3.2e-6; Figure 1A). The overall survival rate of the patients in the low-risk group was 80% in 3 years, 70% in 6 years and 60% in 9 years, whereas the survival rate in high-risk group was only 65% in 3 years, 45% in 6 years and 20% in 9 years, respectively. To evaluate the competitive performance of the five-lncRNA signature, time-dependent ROC curve analysis was measured, and the AUC score for the five-lncRNA signature was 0.703 ( Figure 1B), demonstrating the better performance of survival prediction in the training dataset. Univariate Cox regression analysis showed that the five-lncRNA risk score were significantly associated with patients' survival (P-value < 0.001, HR = 1.151, 95% CI = 1.1-1.2; Table 2). The distribution of the risk score, overall survival and expression profiles of five lncRNAs in samples of the training dataset were showed in Figure  1C, which were ranked according to the risk score value. Patients with high-risk scores had higher mortality than patients with low-risk scores. For patients with high risk scores, the expression profiles of lncRNAs (AC069513.4, AC003092.1 and RP11-507K2.3) were significantly upregulated, whereas the remaining two lncRNAs (CTC-205M6.2 and U91328.21) were down-regulated.
Validation of the five-lncRNA signature for the survival prediction in testing set and the entire TCGA data set We next validated our five-lncRNA signature in the testing set to confirm our findings. By calculating the risk score for each patient in the testing set based on the same risk score formula, we divided ccRCC patients into a highrisk group (n=94) and a low-risk group (n=126) using the same threshold. Consistent with the results in the training set, patients in the high-risk group had significantly shorter survival than those in the low-risk group (33.07 months vs. 36.55 months, log-rank test P-value=0.04; Figure 2A). The overall survival rate of the patients in the low-risk group was 55% in 3 years, 12% in 6 years and 2% in 9 years, whereas the survival rate in high-risk group was only 48% in 3 years, 8% in 6 years and 0% in 9 years, respectively. In the entire TCGA data set, similar result was observed that patients in the high-risk group had significantly shorter survival than those in the low-risk group (33.4 months vs. 37.4 months, P-value=5.09e-7; Figure 2B). Time dependent ROC curves analysis for the five-lncRNAs signature-based model achieved AUC score of 0.63 and 0.68 in the testing set and entire set, respectively.

Independence of the five-lncRNA signature and the other clinical variables
We evaluated whether the survival prediction based on five-lncRNA signature was independent of clinical factors. Multivariate Cox regression analysis was then performed, including lncRNA-based risk score and other clinical information, such as age, gender, tumor grade and AJCC tumor stage ( Table 2). The result showed that five-lncRNA risk score remained to be tightly associated with survival after adjusting the clinical factors. Moreover, we found that the age and AJCC stage were also significantly associated with overall survival. Then, stratified analysis was carried out, and the entire TCGA data set were divided into younger stratum (age ≤ 50, n=85) and older stratum (age > 50, n=355). The result showed that the five-lncRNA risk score can further divide ccRCC patients into high-risk and low-risk subgroup within each stratum ( Figure 3). These result suggested that prognostic value of five-lncRNA signature is independent of age. Similar results were obtained when the    stratification analysis of AJCC tumor stage ( Figure 4) and tumor grade ( Figure 5) were performed. These findings suggested that five-lncRNA risk score has a competitive performance for the survival prediction of ccRCC patients.

Functional characteristics of five prognostic lncRNAs
To explore the functional implication of five prognostic lncRNAs in ccRCC tumorigenesis, we performed functional category enrichment analysis to examine their function. The biological functions of lncRNAs are still largely unknown. Many lncRNAs can act as cis-regulators, and the expression of lncRNA is significantly correlated with their neighboring proteincoding genes. Here, we predicted their putative functions based on co-expression network. Spearman correlation coefficients were calculated between lncRNAs and protein-coding genes based on their expression values. The top 1% protein-coding genes were selected as co-expressed partner of five prognostic lncRNAs. At last, a total of 1960 protein-coding genes were significantly correlated with at least one prognostic lncRNAs. Functional enrichment analysis showed that lncRNA correlated protein-coding genes were significantly enriched in 128 GO terms and 11 KEGG pathways ( Figure 6). The functional categories  are mainly involved in four functional clusters, including proteasome, transcription regulation process, intracellular transport, GTPase activity and several pathways ( Figure 6A). This results suggested that the five prognostic lncRNAs might be involved in tumorigenesis process through regulating protein-coding genes to influence known cancer related pathways.

DISCUSSION
Considering the great importance of lncRNAs in tumor tumorigenesis and progression, lncRNA dysregulation may serve as an important indicator of the characteristics of tumors. It has been documented that altered lncRNAs can exist in many cancer types, and are tightly associated with the outcome of cancers [32,33]. Many works have focused on whether aberrant expression of specific lncRNAs in cancers can serve as independent markers for diagnosis and prognosis [17,18,31,34]. Recently, although some works have announced the great importance of lncRNA in ccRCC tumorigenesis [23,35], the comprehensive prognostic values of lncRNA in ccRCC have not been clarified clearly [36]. Therefore, a reliable prognostic biomarker is quite necessary.
In the present work, a five-lncRNA prognostic signature was identified based on the lncRNA expression  profiles of ccRCC patients. It was then confirmed to be an independent prognostic predictor for patients with ccRCC. This study determined the potential five-lncRNA signature to predict the prognosis of ccRCC. The performance of five-lncRNA signature was evaluated using ROC analysis, suggesting that the prognostic value of the five-lncRNA signature is competitive for survival prediction. To the best of our knowledge, these five lncRNAs have not been previously reported, and further functional annotation of these prognostic lncRNAs will increase our understanding of their biological implications in determining ccRCC prognosis.
The result suggested that the prognostic value of five-lncRNA signature was independent of other clinical factors in ccRCC. Actually, lncRNAs have been reported to have higher specificity than mRNA in some cancer types [30,37,38]. The present work may bring some clinical implications in the development of novel prognostic factors in ccRCC. Although these five prognostic lncRNAs have not been previously investigated in cancers, we speculate that these lncRNAs may be involved in ccRCC tumorigenesis and many works are needed in the future ccRCC studies. Previous works have reported some prognostic lncRNAs in ccRCC, such as TUG1 [39], TCL6 [40], H19 [41], MALAT-1 [42] and NBAT1 [35]. After measuring the prognosis of these lncRNAs using TCGA data in ccRCC, these lncRNAs are not involved in TCGA-based prognostic lncRNAs. We speculated the reasons why these reported prognostic lncRNAs cannot be validated in TCGA data. First, all these reported works are based on Chinese ccRCC patients, whereas the ccRCC patients in TCGA are Caucasian people, and the underlying molecular mechanisms might be different between populations. Second, these published works are all based on a small-scale ccRCC cases, which might draw a conclusion with deviation. Future works with more samples are necessary.
Several limitations of present work should be addressed. First, we only analyzed and validated the prognostic power of the five-lncRNA signature in the TCGA dataset, and no other ccRCC lncRNA expression data can be used for further validation. Although previously published microarray data can be used to identify some lncRNAs, these data only include a relatively small fraction of lncRNAs. Second, lncRNAs always play important regulatory roles in a wide range of biological processes through a complex regulatory network involving different kinds of cis-and trans-regulatory elements. Further integrated analysis may help us to predict the functional roles of the five prognostic lncRNAs in ccRCC more accurately. Third, no experimental data on the underlying mechanisms of lncRNAs have been performed, and future experimental studies on these lncRNAs can enhance our understanding of the functional role in ccRCC.
In this work, we reported a lncRNA signature in ccRCC patients to predict survival. Using large-scale independent expression profiles, we have demonstrated the prognostic values of lncRNAs in ccRCC patients. Our result has suggested that the five-lncRNA signature is helpful in predicting the clinical outcome, and may be an effective prognostic biomarker in the prediction of the survival of ccRCC patients.

The ccRCC patient information
The lncRNAs expression data and corresponding clinical information of ccRCC patients TCGA database. After excluding the data without complete survival information, a total of 440 ccRCC patients were enrolled in this work. We also downloaded the detailed clinical information of ccRCC patients, including age, gender, tumor grade, AJCC cancer stage, etc. Samples from TCGA data set were equally divided into training (n=220) and testing sets (n=220).

lncRNA expression profile
ccRCC RNA-seq data were downloaded from TCGA data portal (https://tcga-data.nci.nih.gov/tcga/). After alignment to the human genome (Ensembl database v72 assembly), the expression level of lncRNAs and mRNAs were determined by the value of Reads Per Kilobase of exon model per Million mapped reads (RPKM). We identified lncRNAs from TCGA dataset based on the following three criteria: 1) transcripts were not identified in any protein-coding region; 2) transcript sequences have been annotated in GENCODE project [7]; 3) transcripts were expressed in at least half of the ccRCC samples. The lncRNA expression profiles were defined as those with an average RPKM ≥ 0.1 across 440 ccRCC samples. At last, a total of 9669 lncRNAs in dataset were enrolled. We used edgeR [43] software to examine the expression difference.

Statistical analysis
Based on the training set, the association between the expression level of each lncRNA and patient's overall survival was calculated using a univariate Cox regression. Those lncRNAs were considered to be significant if their P-values were less than 0.001. Then, the selected lncRNAs were fitted in a multivariate Cox regression analysis in the training dataset. Risk scores were estimated by involving these selected lncRNAs, which were weighted by their estimated regression coefficients in the multivariable Cox regression model. The risk score can be calculated for each ccRCC patient based on prognostic five-lncRNA signature. Based on the risk score formula, ccRCC patients can be divided into high-risk and low-risk groups, respectively. Differences in patients' survival between these two groups can be evaluated by the Kaplan-Meier survival analyses. To further determine whether the prediction of the lncRNA signature was independent of other clinical variables, multivariate Cox regression and stratified analyses were carried out. The receiver operating characteristic (ROC) curve within 5 years were performed to compare the sensitivity and specificity of the survival prediction based on the risk score. All analyses were performed using R package (version 3.3.0).

Functional enrichment analyses
To evaluate the functional implication of lncRNAs, spearman correlation coefficients were computed between five lncRNAs and protein-coding genes. Functional enrichment analyses for those co-expressed protein-coding genes were performed using the DAVID Bioinformatics Tool (version 6.7) [44]. GO and KEGG category enrichments were based on the threshold of P-value < 0.05 and enrichment score > 1.0. Significant enrichment results were visualized using Cytoscape software (version 3.4.0) [45].

Author contributions
DD conceived and designed the experiments. DS, QQ, QC, YW, YG and DD analyzed the data and wrote the manuscript. All authors read and approved the final manuscript.