Identification of a RNA-Seq based prognostic signature with five lncRNAs for lung squamous cell carcinoma

Long non-coding RNAs (lncRNAs) expression profile signature for survival assessment in lung squamous cell carcinoma (LUSC) are largely inconsistent due to distinct detecting approaches and small sample size. Systematic and integrative investigation of RNA-Seq based data from The Cancer Genome Atlas (TCGA) herein was performed to determine candidate lncRNAs for prognosis evaluation of LUSC. A total of 60483 genes, including 7589 lncRNAs were assessed in a cohort including 478 LUSC cases with follow-up data. Firstly, 4225 differentially expressed lncRNAs were obtained via R packages. Next, univariate and multivariate Cox proportional hazards regression revealed that 41 lncRNAs were closely related to the survival of LUSC. Finally, lncRNA based prognosis index (PI) could predict overall survival of LUSC with high accuracy (AUC = 0.652, CI: 0.598, 0.705), PI = expCYP4F26P*βCYP4F26P+expRP11108M12.3*βRP11-108M12.3+expRP11-38M8.1*βRP11-38M8.1+expRP11-54H7.4*βRP11-54H7.4+expZNF503-AS1*βZNF503AS1. Furthermore, it was confirmed that the five-lncRNA signature could act as an independent prognostic indicator for LUSC (HR = 2.068, p < 0.001 with univariate analysis, HR = 1.928, p = 0.038 with multivariate). Besides, we constructed a weighted gene co-expression network analysis (WGCNA) of key lncRNA RP11-54H7.4 according to the p-value of related genes’ weight. This study provides a RNA-Seq based prognostic signature with five lncRNAs for further clinical application to LUSC patients.


INTRODUCTION
Lung cancer remains one of the most common causes of cancer-related deaths all over the world, with over 25% of all cancer-related deaths in both male and female [1][2][3][4][5][6]. Lung cancer includes two main histological categories: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC) . And nearly 80% of lung cancers are NSCLCs, which consists of different subtypes: lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC) and large cell lung carcinoma (LCLC). These different subtypes of lung cancer are related to distinct molecular profiling and are believed to originate from distinct cells [15][16][17][18]. Recent developments in molecular pathology detection and targeted therapies have markedly improved the overall survival (OS) of patients with LUAD. However, there are no specific biomarkers or comparable effective targeted molecular therapies have

Research Paper
Oncotarget 50762 www.impactjournals.com/oncotarget been available to screen and treat LUSC patients [19][20][21]. Thus, the survival rates of LUSC remain relatively low. In that regard, there is an urgent requirement for effective prognostic biomarkers and treatment options based on the current genomic approaches for LUSC.
Presently, the innovation of a particularly large number of non-coding RNAs conceptually transformed the field of cancer research, including long noncoding RNAs (lncRNAs). LncRNAs are non-protein coding transcripts of ≥ 200 nt, which are commonly distributed in the genome and can modulate gene expression [22][23][24][25][26]. Recent increasing evidence has revealed that lncRNA expression profiles are dysregulated in various cancers [27][28][29][30][31]. The aberrant lncRNA expression profile has been reported to correlate to the development and survival in patients with different kinds of cancers, including LUSC, which reveals the potential of lncRNAs as prognostic cancer biomarkers [32][33][34][35][36]. However, most of the previous studies focused on a single lncRNA based on small sample size [37][38][39][40].
We performed the current study using the LUSC dataset retrieved from The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) to construct a panel of lncRNA signature and obtain a specific prognosis index (PI), which could predict the prognosis in LUSC. These findings could also provide new insight into the molecular mechanisms based on lncRNAs for LUSC.

Differentially expressed lncRNAs in LUSC
Among the expression data of 60,483 mRNAs, 7589 lncRNA expression was extracted and calculated with R language (EdegR and DESeq). Differentially expressed lncRNAs (n = 4225, Figure 1) were achieved, which were sent for further investigation on prognostic value.

Prognostic evaluation of the differentially expressed lncRNAs and clinicopathological parameters
After we excluded the cases with insufficient survival data, we finally obtained 478 cases for the prognostic evaluation. If the expression data had 10% absence for an lncRNA, this lncRNA was also omitted from our prognosis evaluation. The univariate Cox proportional hazards regression method revealed that a total of 41 lncRNAs gained prognostic value for LUSC. Subsequently, multivariate Cox proportional hazards regression analysis was applied to verify the aforementioned results, and CYP4F26P, RP11-108M12.3, RP11-38M8.1, RP11-54H7.4 and ZNF503-AS1 were proved to be independent prognostic indicators for LUSC and their individual prognostic values were shown in Table 1. Next the prognosis index (PI) for predicting OS was calculated with the formula based on the five lncRNAs above: The "β" value is the estimated regression coefficient of lncRNA derived from the multivariate Cox stepwise regression analysis and "exp" indicates the expression profiles of lncRNA.
The LUSC patients were divided into two groups of low-risk and high-risk according to the median point of the prognostic risk score (Figure 2). We also assessed the difference of the expression levels in these five lncRNAs between high-and low-risk groups. Remarkably lower expression was noted for CYP4F26P and RP11-108M12.3 in high-risk groups, while higher expression was observed for RP11-38M8.1, RP11-54H7.4 and ZNF503-AS1 in high-risk groups ( Figure 3). Original expression of these five lncRNAs between LUSC and non-cancerous lung tissues were also evaluated. Remarkably higher expression was noted for CYP4F26P, RP11-108M12.3, RP11-38M8.1, RP11-54H7.4, while predominantly lower expression was observed for ZNF503-AS1 in LUSC ( Figure 4). The risk score could helpfully predict 5-year survival of LUSC patients, as the AUC of survival ROC reached 0.691 ( Figure 5A). Furthermore, K-M curves indicated that the median survival time of patients of high-risk group was 30.92 months, far shorter than that of low-risk group 66.67 months, p < 0.001, Figure 5B). Additionally, the HR of the PI generated by univariate Cox proportional hazards regression method was 2.068 (CI: 1.503, 2.847, p < 0.001) and multivariate Cox proportional hazards regression analysis revealed a consistent HR of 1.928 (CI: 1.037, 3.583, p = 0.038), which confirmed that the PI of five lncRNAs could function as an independent indicator for the survival of LUSC patients.
In the meantime, the prognostic value of diverse clinicopathological parameters was also investigated. The K-M approaches disclosed that the tumor status (cancer status in Figure 6A and the therapeutic efficiencies in Figure 6B) and new tumor event after initial treatment ( Figure 6C) could manifest the outcome between highand low-risk groups. Moreover, ROC showed that new tumor event (AUC = 0.6233, p = 0.01992, Figure 7A) and primary therapy outcome success (AUC = 0.5910, p = 0.01361, Figure 7B) gained certain value to estimate patients' survival, and the effect was weaker as compared to that from the PI. Several parameters were found to have some prognostic value with univariate analysis; however, only person neoplasm cancer status remained statistically significant with the confirmation by multivariate analysis (HR = 3.446, p = 0.029, Table 2).  Table 3). The co-expression networks of lncRNA RP11-54H7.4 was visualized by WGCNA in the Figure 8.

Validation of the expression of the five lncRNAs with GEO data
Finally, eight studies were screened out from GEO, including GSE19188, GSE30219, GSE33479, GSE37745, GSE50081, GSE73403, GSE74706 and GSE74777. However, only the expression of one key lncRNA, ZNF503-AS1, could be achieved in four datasets (GSE19188, GSE30219, GSE33479, GSE74706) provided ZNF503-AS1 expression data both in LUSC tissues and non-cancerous tissues, which showed the significantly lower level of ZNF503-AS1 in LUSC cases (Supplementary Table 1).

Validation based on in-house clinical samples of LUSC
Since only ZNF503-AS1 was observed in GEO data, we performed real time RT-qPCR to confirm its expression in the clinical sample in house. The primers were listed as follows: Forward-5′-TGAGCGAGTTCGTACAGTGC-3′, Reverse-5′-TAGCATGTTAGCGCAGCCTT-3′. Among these small size of patients, 11 cases showed lower expression of ZNF503-AS1 in LUSC tissue than noncancerous lung tissues and 1 cases showed higher expression. The mean expression level of ZNF503-AS1 was significantly lower than that of non-cancerous lung tissues (P = 0.0019, Figure 9A). Furthermore, the AUC of ZNF503-AS1 was 0.806 (P = 0.011, Figure 9B). However, due the small size of the patients, we could not assess the relationship between ZNF503-AS1 and the development of LUSC (P = 0.055, Figure 9C).

DISCUSSION
LUSC is one of the most lethal malignancies worldwide. Several principal mechanisms have been documented over the past decades, including various high expression and green indicates low expression. Black shows the lncRNA expression with both the log2FC < 1 and adjusted value < 0.05. The X axis represents an adjusted P value and the Y axis represents a log2FC. Differentially lncRNAs were calculated by DESeqR 2809 high expressed lncRNAs, and 1416 low expressed. This volcano plot was conducted by the ggplot2 package of R language.
Oncotarget 50764 www.impactjournals.com/oncotarget molecule alterations, which could confer the oncogenesis and progression of LUSC [17,[41][42][43][44][45]. However, unlike LUAD, none specific biomarkers have been discovered to display the therapeutic efficiency and prognosis of LUSC. Therefore, the investigation of prognostic factors is of paramount importance for the management of LUSC patients.
Most recently, Liu et al. [46] assessed the prognostic value of lncRNA alteration frequencies via cBioPortal in lung cancers, including LUSC and LUAD. They found in LUSC, 624 lncRNAs had alteration rates > 1% and 64 > 10%. In LUAD, 625 lncRNAs had alteration rates > 1% and 36 > 10%. Among those, 620 lncRNAs had alteration frequencies > 1% in both LUSC and LUAD, while 22 were LUSC-specific and 23 were LUAD-specific. Twenty lncRNAs had alteration frequencies > 10% in both LUSC and LUAD, while 44 were LUSC-specific and 16 were LUAD specific. Since the genetic alteration includes gene amplification, mRNA upregulation, mRNA downregulation and all types of mutations, the clinical value of these genetic alterations of lncRNAs are somehow too generalized for clinical setting.
In the current study, we mined the public lncRNA data from TCGA and searched for a novel prognostic  Oncotarget 50765 www.impactjournals.com/oncotarget marker for LUSC. Interestingly, we found that the PI constructed from five differentially expressed lncRNAs could gain a good prognostic value for LUSC. As far as we know, this is the first investigation to generate a PI with RNA-seq based TCGA data for the survival evaluation of LUSC patients with a large cohort of 502 cases.
TCGA data with large-scale genomic analyses have rendered it possible to assess the molecular features related to LUSC outcome [47]. Huang et al. [48] explored 129 cases of the LUSC samples from TCGA with gene expression, microRNA expression, DNA methylation, and CNV data in 2015. They built the whole genome integrative network using variance inflation factor regression and isolated the lung cancer subnetwork with the Bayesian method. However, no lncRNA was involved. Gao et al. [49] also reported that 12 in 133 aberrant miRNAs were related to OS of TCGA LUSC cohort. Furthermore, a linear prognostic model containing seven miRNAs was constructed, which could predict patients' survival with accuracy. These aforementioned studies provided a new and constructive study strategy of prognostic markers for LUSC. Nevertheless, no lncRNA was mentioned in the study.
By far, only two studies mined the lncRNA data from TCGA of LUSC. Wu et al. [50] investigated the differentially expressed genes (DEGs) based on the RNA-seq data of LUSC deposited in the TCGA database between LUSC samples and normal samples. Totally, they obtained 5162 DEGs, including seven upregulated lncRNAs (DIO3OS, HAR1A, FAM138B, BCYRN1, TERC, HCGA and PVT1). However, the relationship between these lncRNAs and survival was not established. Wei et al. [51] evaluated the altered lncRNA profiling in LUAD and LUSC with only the paired tissue samples of RNA sequencing or microarray data from TCGA and GEO. They observed that lncRNA expression pattern was distinct in LUAD and LUSC. Furthermore, a 6-lncRNA signature expression pattern (RP11-111M22.3, TOPORS-AS1, RP11-383C5.4, AC078883.3, AC007566.10 and AC011526.1) was observed to be notably related to the OS of LUSC. Meanwhile, another 6-lncRNA signature (CRNDE, CTA-292E10.6, CTD-3025N20.3, RP11-983P16.4, FLI1-AS1 and AC007879.5) was closely correlated to the PFS of LUSC. But this conclusion was based on the sequencing data of lncRNA with a small sample size of 16 paired LUSC and normal lung samples. We attempted to validate the prognostic value of these novel five-lncRNA signature with GEO data. Unfortunately, no sufficient survival data of these five lncRNAs was provided in theses GEO datasets, so the   Oncotarget 50767 www.impactjournals.com/oncotarget prognostic value of these five-lncRNA signature needs further validation with other methods in the future. This finding needs to be confirmed with a larger sample size.
In the current study, we first screened the differentially expressed lncRNAs, and next we selected those lncRNAs which tended to gain prognostic value which could be verified again by multivariate analysis. Finally, we constructed the PI with five novel lncRNAs: CYP4F26P, RP11-108M12.3, RP11-38M8.1, RP11-54H7.4 and ZNF503-AS1. And most importantly, this PI was proved to be independent prognostic indicator for LUSC.
We were also interested in the prospective molecular mechanisms of these five lncRNAs. Unfortunately, no publication was found of these lncRNAs and little was known on the functional mechanism of these five lncRNAs. To this end, we performed WGCNA to explore the related genes of the lncRNAs. Fourteen candidates were brought into the network, which could be the most correlative genes of these lncRNAs in LUSC.
Though we investigated the prognostic value of lncRNAs based on TCGA data in LUSC, other reports based on different techniques are also available. For instance, Zhou et al. [52] carried out an array-based transcriptional analysis of lncRNAs in 603 cases of NSCLC by repurposing microarray probes from three Gene Expression Omnibus (GEO) database (GSE37745, GSE31210 and GSE50081). They found that an expression pattern of eight lncRNAs was closely related to OS of NSCLC patients, including RP11-21L23.2, CTD-2358C21.4, RP11-94L15.2, GPR158-AS1, KCNK15-AS1, AC104134.2, RP11-701P16.5 and RP11-379F4.4. Multivariate regression and stratified analysis further suggested that the eight-lncRNA signature was   [53] mined lncRNA expression profiling in 739 lung cancer patients from GEO datasets. A risk score model was built up on the basis of the expression data of eight lncRNAs (AK021595, BC030759, AK000053, AK124307, BC020384, AK022024, CR615992 and AF085995) in the training dataset (GSE30219). They further validated the association between these lncRNAs and survival of lung cancer patients in another three independent testing sets (GSE31210, GSE37745 and GSE19188). However, both Zhou et al. [52] and Tu et al. [53] investigated the prognostic role of lncRNA in a general lung cancer on the whole, without separating LUAD from LUSC. Since LUAD and LUSC have absolutely distinct molecular mechanism, it is of great significance to perform the prognostic assessment separately. It is also unsurprising that the five lncRNAs in LUSC noted in the current study are not concordant with those reported from lung cancer [52,53]. Therefore, the five lncRNAs may be candidate prognostic biomarkers for LUSC patients with substantial clinical implications.
To sum up, we assessed the RNA-Seq based lncRNA data with 502 cases of LUSC patients from TCGA and managed to construct a five-lncRNA signature,  is the key lncRNA (blue node), surrounded with co-expressed genes (red nodes). The distance between the RP11-54H7.4 and the coexpressed genes is represented by co-expression weight. The network is conduct by Cytoscape 3.4 software.
Oncotarget 50769 www.impactjournals.com/oncotarget which could become a novel prognostic indicator for LUSC. However, the clinical role as well as the biological function of these five lncRNAs needs to be further verified with more experiments.

TCGA dataset and analysis of the differentially expressed lncRNAs
RNA sequencing (RNA-Seq) data from individuals with LUSC, which were calculated on IlluminaHiSeq RNASeq platform, were achieved from TCGA data portal (https://cancergenome.nih.gov/.), containing 502 LUSC tissues and 49 adjacent non-tumorous lung tissue samples up to November 9, 2016. The expression data of lncRNAs were displayed as reads per million (RPM) and the expression level of each lncRNA was normalizated by Deseq package of R language for further analysis. The data were from TCGA, which is a community resource project providing available data for community research. Approval by a local ethics committee was not required because the current study adhered to the TCGA publication guidelines and data access policies. The RNA-Seq data of LUSC covered 60,483 mRNAs containing 7589 lncRNAs, which have been described by NCBI (https://www.ncbi. nlm.nih.gov/) or Ensembl (http://asia.ensembl.org/). Next the differentially expressed lncRNAs were calculated by EdgeR and DEseq (Padj < 0.05 and the absolute log2 FC > 1), respectively. The final selected lncRNAs were the integration between the two approaches. Student's t test (SPSS Inc., Chicago, IL, USA) was used for the statistical analyses of the differentially expressed level of these five lncRNAs between LUSC and non-cancerous lung tissues.

Manufacture of prognostic signature with differentially expressed lncRNAs for LUSC
The differentially expressed IncRNAs of which the expression level was less than 1 in exceeded 10% of all subjects were removed from the prognostic analysis. In the meantime, clinicopathological features, including survival information, were also achieved from TCGA. Cases without sufficient clinical data were omitted., Finally, the prognostic analysis included a total of 478 samples with expression data from 4221 IncRNAs. The end-point of the LUSC patients in our study was set up with OS. The average follow-up period was 28 months for these LUSC patients being involved.
The prognostic value of lncRNAs, as well as the clinicopathological features, was firstly assessed by the univariate Cox proportional hazards regression (p < 0.05). The statistically significant indicators including lncRNAs and clinicopathological features were further confirmed with the multivariate cox regression model. An lncRNA based prognosis risk score was constructed on the basis of a linear combination of the expression level multiplied regression coefficient derived from the multivariate cox regression model (β) with the following formula as previously reported [54,55].
Prognosis Index (PI) = exp CYP4F26P *β CYP4F26P +exp RP11- Based on the cut-off of the median PI, LUSC patients were then divided into high-score and low-score groups [56]. Univariate and multivariate Cox proportional hazards regression analyses were further performed to investigate the prognostic effect of this prognosis risk score, and adjustments were made for tumor dimension, smoking status, distant metastasis, lymph node metastasis, tumor status, pathologic stage, new tumor event after initial treatment, person neoplasm cancer status, primary therapy outcome success and PI. Hazard ratio (HR) and 95% confidence intervals (CI) were calculated. The time-dependent receiver operating characteristic (ROC) curve analysis within 5 years as the defining point was also conducted with the R package "survivalROC", to assess the predictive accuracy of prognostic model for time dependent disease outcomes. Kaplan-Meier survival curves were drawn to evaluate the relationship between all between para-tumorous lung tissues (pT) and LUSC (RT-qPCR); (B) ROC curve; (C) The correlation of ZNF503-AS1 between paratumorous lung tissues (pT) and LUSC.pT: para-noncancerous tissues.
ROC curve was applied to evaluate the predictive value of the risk score for patients' outcome after first course of treatment. If two-sided P value was less than 0.05, statistical significance was considered. SPSS 22.0 software (SPSS Inc., Chicago, IL, USA) was selected for the statistical analyses.

Functional evaluation of lncRNAs with WGCNA
Incorporated networks for the DELs are analyzed with WGCNA, which is capable of describing the correlation patterns gene expression profiles [58]. The WGCNA R package was used to evaluate the significance of the five DELs and their module membership by the 'p.weighted', which negatively indicates the correlation between the DELs and co-expressed genes in LUSC. In the first step, WGCNA was conducted by the threshold of p.weighted > 0.5 in our study. Next, the pairwise Pearson correlation was used to evaluate the weighted co-expression between all the probe set subjects in an adjacency matrix. The adjacency was defined by a 'soft threshold', which raises the absolute value of the coexpression by a ij=|cor(xi, xj)|β. In this study, the soft threshold was set at β = 3 with the scale-free topology criterion [59]. Following the identification of weighted correlation, features of the network were presented by Cytoscape 3.4.0.

Analysis of the differentially expressed lncRNAs with GEO data
We also attempted to validate the findings from TCGA and the relevant mRNA-Seq datasets of LUSC from GEO was searched. The search strategy was as follows: (cancer OR carcinoma OR adenocarcinoma OR tumour OR tumor OR malignanc* OR neoplas*) AND (lung OR pulmonary OR respiratory OR respiration OR aspiration OR bronchi OR bronchioles OR alveoli OR pneumocytes OR "air way"). The expression level of each lncRNA was extracted for further analysis from all included data. Student's t test (SPSS 22.0 Inc., Chicago, IL, USA) was used for the statistical analyses of the differentially expressed level of these five lncRNAs between LUSC and non-cancerous lung tissues.

Validation based on clinical samples of LUSC
To further verify the data from GEO, we performed real time RT-qPCR to detect the level of some lncRNAs with clinical LUSC samples (n = 12) from the First Affiliated Hospital of Guangxi Medical University as previously reported [60]. The Ethical Committee of First Affiliated Hospital of Guangxi Medical University, China approved the present study. All participating patients provided informed consent and agreement for the research use of the clinical samples. GAPDH was used as internal reference with the primers as follows: Forward-5′-TGAACGGGAAGCTCACTGG-3′, Reverse-5′-TCCACCACCCTGTTGCTGTA-3′. Paired-samples t test was performed to compare the difference of lncRNAs between LUSC and non-cancerous lung tissues with SPSS 22.0. We also drew Receiver operating characteristic (ROC) curves to assess the effect of lncRNAs to discriminate the LUSC from non-cancerous lung tissue.