Identification of novel long non-coding RNA biomarkers for prognosis prediction of papillary thyroid cancer

Papillary thyroid carcinoma (PTC) is the most frequent type of malignant thyroid tumor. Several lncRNA signatures have been established for prognosis prediction in some cancers. However, the prognostic value of lncRNAs has not been investigated in PTC yet. In this study, we performed genome-wide analysis of lncRNA expression profiles in a large cohort of PTC patients from The Cancer Genome Atlas and identified 111 differentially expressed lncRNAs between tumor and non-tumor samples and between recurrent and recurrence-free samples. From the 111 differentially expressed lncRNAs, four independent lncRNA biomarkers associated with prognosis were identified and were integrated into a four-lncRNA signature which classified the patients of training dataset into the high-risk group and low-risk group with significantly different overall survival (p=0.016, log-rank test). The prognostic value of the four-lncRNA signature was validated in the independent testing dataset. Multivariate analysis indicated that the four-lncRNA signature was an independent prognostic predictor. Moreover, identified four lncRNA biomarkers demonstrates good performance in predicting disease recurrence with AUC of 0.833 using leave one out cross-validation. Our study not only highlighted the potential role for lncRNAs to improve clinical prognosis prediction in patients with PTC and but also provided alternative biomarkers and therapeutic targets for PTC patients.


INTRODUCTION
Thyroid cancer is the ninth most common cancer and is the most prevalent cancer among endocrine tumors [1]. The incidence of thyroid cancer has increased rapidly worldwide in the past few decades, representing 1%-2% of the total tumor [2]. Thyroid cancer is composed of three main pathological types of carcinomas: papillary thyroid carcinoma (PTC), follicular thyroid carcinoma (FTC) and anaplastic thyroid carcinoma (ATC) [3]. Papillary thyroid carcinoma (PTC) is the most frequent type of malignant thyroid tumor, comprising 80% of all thyroid cancer cases [3,4]. Although current treatment by surgery and radioiodine therapy is effective, the number of more advanced tumors and thyroid-cancer-associated mortality is increasing. Therefore, it is critical to identify novel disease biomarkers for improving PTC patient management and making tailored therapeutic decisions.
Recent advancements in the genome and transcriptome analysis revealed that protein-coding genes only comprised ~2% of the human genome sequences, whereas a substantial fraction of the human genome can be transcribed into noncoding RNAs (ncRNAs) [5]. Long non-coding RNAs, accounting for a large part of the newly discovered ncRNAs, are arbitrarily defined as non-coding transcripts ranging in size from about 200 bp to tens of thousands of bases [6]. Increasing evidence has suggested that lncRNAs are key players of genome regulatory network and play important functional roles in various fundamental biological processes www.impactjournals.com/oncotarget/ Oncotarget, 2017, Vol. 8, (No. 28), pp: 46136-46144 Research Paper at both the posttranscriptional and transcriptional level [7,8]. A handful of recent studies has identified have detected a number of dysregulated lncRNAs in cancer tissues or cell lines implying that lncRNAs may possess oncogenic or/and tumor-suppressor properties [9][10][11]. Like proteins, mRNAs and miRNAs, lncRNAs are emerging as novel molecular biomarkers for early diagnosis and prognosis prediction in various cancers [12][13][14][15][16][17][18][19][20]. Most recently, thousands of lncRNAs were found to be differentially expressed between PTC and adjacent noncancerous samples [21]. Another study performed by Lan et al. also identified thousands of significantly differentially expressed lncRNAs in PTC relative to noncancerous thyroid tissue [22]. These studies made lncRNAs attractive as valuable diagnostic and prognostic biomarkers in PTC. However, the prognostic value of lncRNAs has not been investigated in PTC yet.
The purpose of this study is to identify novel lncRNA biomarkers closely associated with the survival and recurrence of PTC patients from The Cancer Genome Atlas (TCGA), and to assess the prognostic value of novel lncRNA biomarkers for predicting survival and recurrence in patients with PTC.

Identification of lncRNA biomarkers associated with prognosis in PTC
We first performed differentially expressed analysis to identify potential lncRNAs associated with prognosis in PTC. Compared with the normal samples, we found that there were 77 differentially expressed lncRNAs (adjusted p-value <0.05). Among them, 22 lncRNAs were up-regulated and 55 lncRNAs were down-regulated in PTC. Hierarchical clustering of 246 PTC patients in the entire TCGA dataset according to the expression patterns of these 77 differentially expressed lncRNAs suggested that these up-regulated and down-regulated lncRNAs could effectively discriminate tumor and non-tumor samples (p<0.001, chi-square test). As showed in Figure 1, Cluster I contained close to the majority of tumor samples (n=355; 71.4%). Conversely, Cluster II contained the majority of normal samples (n=57; 96.6%). Analysis of lncRNA expression profiles in PTC with recurrence compared with those without recurrence identified a total of 34 differentially expressed lncRNAs (adjusted p-value <0.05).
These differentially expressed lncRNAs were subjected to univariate Cox proportional hazard regression analysis to identify lncRNAs, whose expression level could be significantly correlated with patient survival. We identified a set of six lncRNAs (p<0.05) that was significantly correlated with patient survival. We then selected four lncRNAs (RP11-536N17.1, RP11-508M8.1, AC026150.8 and CTD-2139B15.2) as independent lncRNA biomarkers that were correlated with prognosis by multivariate Cox proportional hazards regression analysis (P<0.05). Table 1 shows a list of these four lncRNAs

Development and validation of lncRNA signature in survival prediction of PTC in the training dataset
We then used these four independent lncRNA biomarkers to construct a lncRNA signature by the risk scoring method as the classifier for survival prediction according to the expression of the four lncRNA biomarkers and using the multivariate Cox regression coefficient as the weight, as follows: lncRNA signature scores= ((1.066 * expression value of RP11-536N17.1) + (0.967 * expression value of RP11-508M8.1) + (1.245 * expression value of AC026150.8)+ ( 1.171 * expression value of CTD-2139B15.2 )). Using the median lncRNA signature score as risk cutoff value, the lncRNA signature classified 246 PCT patients of training dataset into the high-risk group (n=123) and low-risk group (n=123) with significantly different survival time (p=0.016, logrank test) ( Figure 2A). Furthermore, univariate analysis revealed that the lncRNA signature was significantly associated with overall survival in PTC patients ( Table 2). The hazard ratios of high-risk group versus low-risk for overall survival was 1.094 (p=0.004, 95% confidence interval (CI) = 1.028-1.163) ( Table 2).

Validation of the lncRNA signature for survival prediction in the testing dataset
To confirm the survival prediction power of the lncRNA signature, we validated the predictive ability of the lncRNA signature in the testing dataset. By using the same risk score formula and risk cutoff value derived from the training dataset, the patients of the testing dataset were classified as high-risk (n=128) or low-risk (n=118) according to the lncRNA signature. Patients in the highrisk group had significantly shorter median survival time than those in the low-risk group (p=0.043, log-rank test) ( Figure 2B). In a univariate Cox regression analysis, the hazard ratios of high-risk group versus low-risk for overall   Table 2).

Survival prediction by the lncRNA signature is independent of clinical factors
To assess whether the prognostic ability of the lncRNA signature in survival prediction is independent of other clinical factors of the patients with PTC, we performed multivariate Cox regression analysis including lncRNA signature, age, stage and gender as covariables. The results from the training dataset showed that the lncRNA signature (HR=1.131, 95% CI=1.01-1.267, p=0.034) and gender (HR=1.151, 95% CI=1.076-1.231, p<0.001) was significantly correlated with overall survival of the patients with PTC (  Table 3). The results of the multivariable Cox regression analysis thus indicated that the prognostic value of the lncRNA signature is independent of other clinical factors for the survival of patients with PTC.

Evaluation of lncRNA biomarkers in predicting the risk of recurrence in PTC
We further assessed the capability of the four lncRNA biomarkers in predicting the risk of recurrence in PTC. Thus, we integrated these four lncRNA biomarkers  to develop a four-lncRNA classifier using SVM algorithm for measuring how successful the four-lncRNA classifier was in assigning samples to the correct class. The performance of four-lncRNA classifier in predicting recurrence was evaluated using the leave one out crossvalidation (LOOCV) procedure. Results of LOOCV procedure showed that the four-lncRNA classifier achieved an overall predictive accuracy of 91.7% with a sensitivity of 87.5% and a specificity of 92.3%. The discriminatory performance of the four-lncRNA classifier, evaluated by calculating the AUC and DOR, revealed that the AUC was 0.833 ( Figure 3) and the DOR was 83.909. These results demonstrated that the four-lncRNA classifier had the better predictive performance in predicting recurrence for PTC patients

DISCUSSION
Papillary carcinoma (PTC) is the most common form of well-differentiated thyroid cancer, and the most common form of thyroid cancer. Although the mean survival rate after 10 years is higher than 90% and most of PTC patients have favorable prognosis especially among patients up to 45 years of age, approximately 5~20% of patients faced the risk of disease recurrence and suffer aggressive and lethal outcomes. Hence, the use of prognostic biomarkers has become increasingly relevant with the potential goal of improving PTC patient management and making tailored therapeutic decisions. Previous studies establishing the utility of molecular biomarkers in the management of PTC have focused on protein-coding genes or miRNAs in order to identify highrisk patients for more aggressive management. In recent years, increasing evidence have shown that lncRNAs function as a big player in gene regulation and tend to be expressed in a more cell-and tissue-type manner than protein-coding genes, demonstrating the intrinsic advantages of lncRNAs as diagnostic and prognostic biomarkers in cancers [23]. Therefore, development of novel lncRNA biomarkers in prognosis prediction of PTC patients may improve future treatment decisions for PTC patients with regard to the aggressiveness and/or recurrence of the disease. Several groups have identified thousands of significantly differentially expressed lncRNAs in PTC relative to noncancerous thyroid tissue. In one lncRNA profiling study in three pairs of PTC and adjacent noncancerous samples, 675 lncRNAs were found to be differentially expressed and 8 randomly selected differentially expressed lncRNAs were validated using quantitative polymerase chain reaction (qPCR) [21]. Another study of sixty-two PTC tissue and paired adjacent noncancerous thyroid tissue specimens also identified thousands of significantly differentially expressed lncRNAs [22]. Still, the prognostic signature of lncRNAs in PTC has not been well-established. This study reports the first examination of the prognostic value of lncRNA in a large cohort of more than 100 patients with PTC. The lncRNA expression profiles of noncancerous samples and PTC patients associated with progression and metastasis of patients with PTC were systematically investigated. Differentially expressed lncRNAs between tumor and non-tumor samples and between recurrent samples and recurrence-free samples were identified as previous studies. In order to more effectively construct a specific lncRNA signature for prognosis prediction for PTC patients, we performed univariate and multivariate Cox proportional hazards regression analysis and identified four independent lncRNA biomarkers from most significantly altered lncRNAs. These four independent lncRNA biomarkers can potentially be used to identify patients who are at high-risk of developing PTC recurrence. Then a linear combination of four independent lncRNA biomarkers (RP11-536N17.1, RP11-508M8.1, AC026150.8 and CTD-2139B15.2) was constructed to form lncRNA signature which was closely related to survival of patients with PTC. The prognostic value of this signature was verified in the training dataset of 246 PTC patients and in an independent testing dataset of 246 PTC patients. Furthermore, the prognostic value of the four-lncRNA signature was independent of clinical factors by performing multivariable Cox regression analysis. We further evaluated the clinical relevance of the four independent lncRNA biomarkers as a predictor in the risk evaluation of recurrence using SVM algorithm. Our findings demonstrated the feasibility and potential power of the four independent lncRNA biomarkers in predicting the risk of recurrence.
Some limitations should be acknowledged for this study. First, roles of four lncRNA biomarkers in PTC pathogenesis are presently unclear. Additional experimental investigations of these four lncRNAs should be conducted to improve our understanding of the biological behavior of three four lncRNAs in PTC. Second, PTC patients used in this study were obtained from a single source (TCGA) and randomly assigned to training and testing datasets for the discovery and validation. So, prognostic lncRNA biomarkers identified here should be validated in independent external PTC datasets. Despite these drawbacks, our study highlighted the potential role for lncRNAs to improve clinical prognosis prediction in patients with PTC. As pointed and discussed in some recent papers [24], the web server of the proposed method is very important. In this regard, we will focus on establishing a web site of the proposed methods in our future studies.
In conclusion, the present study performed genome-wide analysis for lncRNA expression in a large cohort of PTC patients from TCGA and showed altered expression patterns between tumor and non-tumor samples and between recurrent samples and recurrencefree samples. A lncRNA signature comprising four lncRNAs (RP11-536N17.1, RP11-508M8.1, AC026150.8 and CTD-2139B15.2) were identified, which can be used an independent prognostic marker to robustly predict survival and recurrence of patients with PTC. With further independent validation studies in prospective cohorts, these identified lncRNAs might serve as alternative biomarkers and therapeutic targets for PTC patients.

Patients and clinicopathological data
Clinical and pathological data of 496 patients with PTC were retrieved from The Cancer Genome Atlas (TCGA) (https://cancergenome.nih.gov/). All PTC patients were randomly divided into two distinct sets of equal size: the 246-patient training dataset for discovery purpose and the 246-patient testing dataset for validation purpose. There was no significant difference in clinical and pathological characteristics between patients in the training dataset and those in the testing dataset. Detailed clinical and pathological of patients with PTC used in this study was shown in Table 4.

Genome-wide lncRNA expression data in normal and tumor samples
Genome-wide lncRNA expression data of 496 PTC samples and 59 normal samples with PTC were obtained from the TANRIC database (http://ibl.mdanderson.org/ tanric/_design/basic/download.html) [25]. Briefly, those lncRNAs from the GENCODE Resource (version 19) that overlapped with any known coding genes were filtered out which resulted in 12727 lncRNAs. Then expression levels of lncRNAs were quantized as reads per kilobase per million mapped reads (RPKM) values using TCGA RNA-sequencing data in the BAM file.

Differentially expressed analysis
A paired student t-test was used to identify differentially expressed lncRNAs between normal samples and PTC samples or between samples with recurrence and samples without recurrence. LncRNAs with an adjusted P-value <0.05 after Benjamini and Hochberg correction were considered as differentially expressed lncRNAs. Hierarchical clustering of the expression values of differentially expressed lncRNAs was performed with R software using the metric of euclidean distance and complete linkage.

Development of lncRNA signature in survival prediction
Univariate Cox analysis was used to investigate the relationship between the continuous expression level of each lncRNAs and survival. Those lncRNAs with p-value <0.05 were selected as candidate lncRNA biomarkers. Then those candidate lncRNA biomarkers were fitted in the multivariate Cox analysis to identify independent lncRNA biomarkers. Finally, a lncRNA signature was constructed by the linear combination of the expression levels of independent lncRNA biomarkers with the multivariate Cox regression coefficient as the weight. This lncRNA signature could calculate an expression-based risk score for each patient and classify patients into high-risk group and low-risk group using the median risk score from the training dataset

Survival analysis
Kaplan-Meier survival curves and log-rank tests were used to assess the differences in survival time between the high-risk and low-risk patients. Univariate and multivariate analyses with Cox proportional hazards regression for survival were performed on the individual clinical variables with and without the lncRNA signature in the training dataset and testing dataset. Hazard ratios (HR) and 95% confidence intervals (CI) were calculated.

Development of lncRNA-based classifier in recurrence prediction
For classification of samples with recurrence or samples without recurrence, independent lncRNA biomarkers were integrated to form classifier using support vector machine (SVM) with the sigmoid kernel [26]. An unbiased performance estimate of the lncRNA-based classifier was carried out using leave one out cross-validation (LOOCV). Diagnostic ability of the lncRNA-based classifier was evaluated by obtaining the area under a receiver operating characteristic (ROC) curve (AUC) and diagnostic odds ratio (DOR). The ROC curve was produced by plotting true positive rates (sensitivity) against false positive rates (1-specificity). The DORs were calculated as follows: