Discover prognostic RNAs through identifying lncRNA-miRNA-mRNA triplets in lung squamous cell carcinoma

Carcinogenesis involves multi-level RNA dysregulations, including long noncoding RNA, microRNA and mRNA. Therefore, only looking into one RNA level of cancer is not sufficient to uncover the intricate underlying mechanisms. In this study, an integrative strategy was conducted to systematically analyze the characterized molecular patterns of lung squamous cell carcinoma. Paired samples (cancer and adjacent normal tissues) in each RNA level were retrieved to identify differentially expressed RNAs. The regulatory triplets of lncRNA-microRNA-mRNA were identified after considering dysregulation pattern and evidence of publicly available databases. Thus, 32 significant mRNAs were retrieved by means of random walk algorithm. In four independent datasets, Kaplan–Meier survival and Cox regression analyses both confirmed that the expression of these genes was significantly associated with overall survival of LUSC patients.


INTRODUCTION
Lung cancer holds the largest mobility and mortality rate among all types of cancer worldwide, and more than 1.8 million new cases emerged every year (13% of the new cancer cases) [1]. The same ranking of lung cancer was also confirmed in China [2]. Lung squamous cell carcinoma (LUSC) accounts up to 30% of lung cancer [3]. Up to date, Cisplatin plus gemcitabine is still the first-line treatment of advanced LUSC [4], and therapeutic options for LUSC patients remain scarce because no molecular targets have ever been identified [5].
Carcinogenesis is an extremely complex process, of which the initiation and progression is closely associated with not only the protein coding mRNAs, but also noncoding RNAs composing up to ~ 98% of the human genome [6]. Long non-coding RNAs (lncRNAs), of which the length is greater than 200 nucleotides (nt) but appearing to lack protein-coding potential, are produced through complex transcription interwoven between and within protein-coding genes [7]. Emerging evidence repetitively addressed that lncRNA might wield substantial weight upon cancer transformation [8][9][10], for instance, in posttranscriptional regulation [11], tumorigenesis [12], and embryonic development [13]. microRNAs (miRNAs) are a class of small non-coding RNA (~ 22 nt) that regulate gene expression by binding to the 3'-untranslated region (3'-UTR) of target genes, leading to the degradation or protein translation inhibition of target genes [14], and predicted to be the regulator of more than 60% of all protein-coding genes in mammal [15] and wield its regulatory power almost across every aspect of cellular process [16,17]. The important role played by non-coding RNAs indicated that while looking into the dysregulation of mRNA expression in cancer, the aberrant patterns of multi-level events, including lncRNA and miRNA, should also be paid considerable attention. Therefore, the integrative analysis of LUSC multi-level RNA is urgently needed.
Among the hypotheses for generalized mechanisms of lncRNA function, the one receiving the most notable attention is the competitive endogenous RNA (ceRNA) hypothesis. The ceRNA hypothesis posits that specific lncRNAs can inhibit miRNA activity through sequestration, thereby up-regulating mRNA expression level [17]. For example, lncRNA HULC could reduce both miR-372 expression and activity leads to reducing translational repression of its target gene, PRKACB, which in turn induces phosphorylation of CREB in liver cancer [16]. In clear-cell renal cell carcinoma, lncRNA PTENP1 functions as a ceRNA of PTEN for miR-21 to suppress Oncotarget 2 www.impactjournals.com/oncotarget cancer progression, and the expression of PTENP1 and PTEN is positively correlated, and the expression of both is inversely correlated with miR-21 expression [18]. Therefore, the negative correlation between miRNA and lncRNA or mRNA could be used as a filtering rule to identify lncRNA-miRNA-mRNA regulatory triplet.

RESULTS
The schematic of this study is illustrated in Figure 1.

Identification of differentiated lncRNAs, miRNAs and mRNAs according to paired TCGA data
Differentially expressed RNAs (DERs) in LUSC were composed of 1,223 up-regulated and 1,177 downregulated lncRNAs, 228 up-regulated and 81 downregulated miRNAs, and 4,157 up-regulated and 5,073 down-regulated mRNAs (Figure 2A-2C). In addition, principle component analysis (PCA) of the TCGA paired data indicated that human lung carcinogenesis was characterized by substantial changes in transcriptomic features of three RNA levels ( Figure 2D-2F). The cancer samples and normal samples could be clustered into two distinct groups, respectively. Therefore, we collected DERs for further analysis to reduce signal noise. GO analysis for differentially expressed mRNAs was conducted via the DAVID bioinformatics tool (http://david.abcc.ncifcrf.gov/). The result of GO enrichment analysis indicated that the down-regulated mRNAs were functionally concentrated on "immune response", since the majority of the enriched GO terms were the offspring of this GO term (FDR<0.001, Figure 2G, Supplementary Table 2), while up-regulated mRNAs were functionally concentrated on "cell cycle" (FDR<0.001, Figure 2H, Supplementary Table 3), which was also confirmed by our previous research [19]. Step I: Using TCGA paired LUSC samples to identify differentially expressed RNA in three levels; Step II: Identification of lncRNA-miRNA-mRNA regulatory triplets according to several filtering rules; Step III: Identification of mRNAs significantly affected by the dysregulation of upstream lncRNAs and miRNAs, by applying random walk with restart algorithm in the merged network.
Step IV: Survival analysis of identified significant RNAs to evaluate their prognostic value.

Random walk in HPRD-KEGG merged biological network
We then projected these regulatory triplets onto HPRD-KEGG merged biological network, and used the biggest connected component (BCC, containing 540 mRNAs regulated by 41 lncRNAs and 41 miRNAs, Figure 3) as the walking graph for random walk. mRNAs with triplet regulation in the BCC were all used as source nodes until the possibility vector reached steadystate. Source nodes were weighted according to the aforementioned formula. The initial probability vector p 0 was obtained by normalizing the score vector (n = 540) so that the sum of the vector is equal to 1 (the input of random walk algorithm). When the steady-state was finally reached, all the genes in the BCC were scored with p ∞ (n = 540, output of random walk algorithm), and thus the mRNAs with significantly high score were mostly affected by upstream lncRNA and miRNA dysregulations. Therefore, 32 significant mRNAs (p < 0.02) in respect to steady-stage probability were collected through 10,000 permutations (Figure 4), and algorithmically these genes received the most influence imposed by upstream dysregulation of 34 lncRNAs and 27 miRNAs  Table 4). The Circos plot of these triplet regulations were illustrated in Figure 5. The heatmaps of upstream 34 lncRNAs, 27 miRNAs and 32 significant mRNAs were shown in Figure 6A with 158 TCGA LUSC samples, and corresponding clinicopathological factors were also illustrated.

Confirmation of the prognostic value of the three-level RNAs
The Cox proportional hazards regression model was used to evaluate the independence of the prognostic Round dots represented mRNAs, hexon represented miRNAs, and square represented lncRNAs. Up-regulated RNAs were colored red, and down-regulated ones were dark green.

DISCUSSION
The booming amount of high-throughput and multidimensional genomic data usher us into a new era, when the tremendously complicated molecular mechanism of carcinogenesis could be possibly perceived and dissected in a more integrative perspective. Alterations in the primary or secondary structure, and expression levels of lncRNAs as well as their cognate RNA-binding proteins are often associated with human diseases [20]. In the last decade, the hypothesis of ceRNA was putatively accepted as a novel layer of gene regulation. Emerging evidences have indicated that lncRNA functioned as miRNA sponges by decoying miRNAs from other target transcripts, and therefore wielded great weight upon the process of carcinogenesis. For instance, lncRNA loc285194 was proven as a p53-regulated tumor suppressor through reciprocal through repression the expression level of miR-211 [21]. Increased lncRNA CYP4Z2P and CYP4Z1-3ʹUTR expression promotes tumor angiogenesis in breast cancer partly via miRNA-dependent activation of PI3K/Akt and ERK1/2 by inhibiting several miRNAs Oncotarget 6 www.impactjournals.com/oncotarget [22]. However, our knowledge about the underlying mechanism of lncRNA acting as ceRNAs in LUSC is still limited. In this study, we systematically analyzed LUSC's RNA sequencing data, including lncRNA, miRNA and mRNA expression level, to discover novel and important molecular dysregulations in a more comprehensive manner, shedding significant light upon the novel layer of post-transcriptional regulation in human cancers.
In order to identify lncRNA-miRNA-mRNA regulatory triplet of great importance with as much accuracy as possible, we first identify differentially expressed RNAs in three levels based on TCGA paired data. According to the ceRNA hypothesis, under the most circumstances, lncRNA and mRNA should be negatively correlated with miRNA expression level, since lncRNAs may function as miRNA sponges, to downregulate the Oncotarget 7 www.impactjournals.com/oncotarget expression and activities of miRNAs, thereby modulating the de-repression of miRNA targets and imposing an additional level of post-transcriptional regulation [23]. Therefore, we searched for the regulatory pairs, of which source nodes had the reverse differentiation direction in comparison to target nodes, and were negatively correlated with target nodes. Starbase database contained invaluable information for deciphering lncRNA-miRNA interactions, miRNA-mRNA interactions and ceRNA networks based on 108 CLIP-Seq datasets, whereby the involvement of Starbase evidence, could greatly consolidate the credibility of the regulatory relation.
GO analysis indicated that differentially expressed mRNAs were involved in immune response and cell cycle processes. Tumor-associated cell cycle defects may induce aberrant proliferation as well as genomic and chromosomal instability, which are often mediated by alterations in cyclin-dependent kinase (CDK) activity [24]. Infection and chronic inflammation contribute to an estimated 25% of all cancers worldwide [25]. In developmental biology, the fetus, which in many ways behaves like an allogenic transplant, also evades maternal immune-surveillance through mechanisms similar to those observed in tumors [26]. Indeed, excessive proliferation (activation of cell cycle genes) and immune-surveillance evasion (suppression of immune genes) allow tumors to obtain territorial expansion advantages compared to normal cells. In this study, we adopted a simple and effective computational strategy to randomly walk through mRNAs with significantly dysregulated lncRNAs and miRNAs in HPRD and KEGG merged biological network. We hypothesized downstream target mRNAs may be conferred with copious prognostic information, and the same strategy was also adopted by previous researches [27,28]. Therefore, mRNAs with upstream lncRNA-miRNA regulation were collected, and the biggest connected component composed of these mRNAs were used as walking graph for random walk. Random walk with restart was adopted to decipher gene to disease association in prior-knowledge based network, whose performance was proven to be much more superior to other methods, such as neighborhood approaches [29][30][31]. The advantage of this strategy is that it subtly combines observed triplet regulation with knowledge-based regulatory network, tracing the information flow which would be greatly accumulated in significant genes. Nodes in the network were weighted with the mean of regulatory consistency, quantifying the authenticity the regulatory pair held. In this way, significant mRNAs identified by random walk through biological network contained specific genes that underwent the considerable dysregulation of upstream lncRNA-miRNA during carcinogenesis.
Many of these significant mRNAs obtained through random walk algorithm were closely related to the initiation and progression of LUSC. For example, JAK1 is responsible for STAT3 activation in non-small-cell lung cancer and inactivation of JAK1-STAT3 can inhibit lung cancer growth [32]. SRPK1 could promote a stem cell-like phenotype in non-small-cell lung cancer via Wnt/β-catenin signaling [33]. Overexpression of CRKL promoted cell invasion through upregulation of MMP9 expression and activation of ERK pathway in lung cancer [34]. The abundance of supporting researches further endorsed our study that the differentially expressed genes significantly influenced by the dysregulation of upstream lncRNAs and miRNAs might be extremely crucial during the process of lung carcinogenesis.
Since candidate genes were collected based on aberrant patterns in multi-level of TCGA RNA sequencing data, we used microarray data sets with OS information from GEO database in conjunction with TCGA to test the prognostic value of these significant genes and corresponding upstream regulators. The reason why only mRNAs and lncRNAs, rather than miRNAs in regulatory triplet, were evaluated as valuable in OS prediction is still unclear. Probably miRNAs might be functioned as a cushion mediating the proper expression level for both lncRNA and mRNA, and therefore the two-side tradeoff compromises the prognostic value of sandwiched miRNAs. The underlying mechanism needs further exploration. Moreover, smoking is no doubt the leading hazardous factor, especially for LUSC, confirmed Oncotarget 10 www.impactjournals.com/oncotarget by numerous researches [35]. However, at least in TCGA database and our microarray data, smoking index was not significantly correlated with LUSC patient's OS (Table 1). In Figure 6A, 158 TCGA LUSC samples were ordered by smoking index, and all the listed clinicopathological factors and the expression pattern of three RNA levels seemed randomly scattered, suggesting smoking works as the ignitor for LUSC formation, but probably not as the indicator for LUSC's prognosis and progression.

Data retrieval
The multi-dimensional data of LUSC associated datasets were retrieved from The Cancer Genome Atlas (TCGA) database (https://tcga-data.nci.nih.gov/ tcga/). Two levels of RNA sequencing data [RNASeq by Expectation Maximization (RSEM) normalized read counts] were downloaded, including 51 pairs of mRNA sequencing level 3 data (including 20,531 mRNAs) and 45 pairs of miRNA sequencing level 3 data (including 546 mature miRNAs processed by TCGA). LncRNA sequencing data of LUSC was downloaded from The Atlas of ncRNA in Cancer database (TANRIC, http:// ibl.mdanderson.org/tanric/_design/basic/index.html), including 16 pairs of LUSC samples and 12,727 lncRNAs. The RNAs with 0 read count in more than half of corresponding samples were excluded. The expression value was log2 transformed to approximate normal distribution. TCGA LUSC samples with overall survival (OS) information (n = 483) were used for prognostic evaluation. Datasets with OS information (GSE30219 and GSE11969, only LUSC samples were included) were downloaded from GEO database. All clinical information was extracted from the original publications.

Microarray expression profiling of LUSC samples
The global mRNA expression profile of 69 LUSC samples with OS information were conducted in our previous research [19] (Supplementary Table 1). After histopathological evaluation and RNA integrity analysis, all these samples were purified and analyzed using Agilent microarrays. Total RNA samples were labeled and hybridized to Agilent 4*44K Whole Human Genome Oligo Microarrays (G4112F). Normalized expression data were extracted with R package "limma" using cyclic loess method, and the ComBat algorithm was utilized to eliminate potential batch effects. The expression levels of 18,453 genes were obtained as the median value of all probes mapping to a particular gene. The raw and processed data have been deposited in GEO database with the series accession numbers GSE73403.

Establishment of lncRNA-miRNA-mRNA triplet regulatory relation
First, significantly differentiated lncRNAs, miRNAs and mRNAs were identified based on paired t test using LUSC RNA sequencing data, and the p value was also adjusted with false discovery rate (FDR) algorithm (FDR<0.05).
Several rules must be obeyed in order to establish lncRNA-miRNA-mRNA regulatory relation (Figure 1), several rules: (i) source nodes must have the reverse differentiation direction comparing with target nodes (i.e., up-regulated lncRNAs regulate down-regulated miRNAs, which regulate up-regulated mRNAs; or downregulated lncRNAs regulate up-regulated miRNAs, which regulate down-regulated mRNAs; (ii) source nodes must be negatively correlation with target nodes according to Spearman correlation algorithm within cancer samples (p < 0.05); (iii) lncRNA-miRNA and miRNA-mRNA regulation both required support from CLIP-Seq data in Starbase V2.0 database [36]; (iv) miRNA-mRNA regulatory pair was regarded as solid only if they satisfied at least two of three sequence algorithms (miRanda [37], TargetScan [38], PicTar [39]) as well.

Establish merged biological network
The protein-protein interaction network was downloaded from the Human Protein Reference Database (HPRD), and Kyoto Encyclopedia of Genes and Genomes (KEGG) network. Therefore, gene regulatory network was established by merging the two networks, including 10,340 nodes and 60,642 edges after eliminating selfloops and duplicated edges. Therefore, the regulatory network was composed by mRNAs with lncRNA-miRNA regulation and corresponding regulators.

Random walk in merged biological network
Taking advantage of knowledge-based network topology, random walk algorithm was utilized to identify genes algorithmically most affected by corresponding lncRNAs and miRNAs [40]. In the network, genes of interest were designated as information source (i.e., source nodes) and the remaining genes in the network as the information target (i.e., target nodes). Biggest connected component (BCC) of interest is used as the walking graph for random walk. mRNAs with triplet regulation in the BCC were all used as source nodes until the possibility vector reached steady-state. Source nodes were weighted according to following formula: Cor l-mi represented the Spearman correlation coefficient between lncRNA and its regulated miRNA, whereas Cor mi-m represented that between miRNA and regulated mRNA. Therefore, W denoted the regulatory consistency, i.e., the bigger W is, the more likely this triplet regulatory relation exists. The information flow originates from source nodes iteratively and randomly transmits to their neighbors with a probability proportional to their topological features. At each step, the information can flow back to the source nodes with the same probability. The final steady-state probability assigned to each gene in the network reflects the integrated influence imposed by source nodes combining network topology. Formally, the random walk with restart is defined as: where W is the column-normalized adjacency matrix of network, and p t is a vector in which the genes in the network holds probability in the iterative process up to step t. Source nodes were weighted with initial probability vector p 0 (the sum of its elements was equal to 1), and r represents restart probability (r = 0.7 in this study). All the genes in the network were ranked according to the values in the steady-state probability vector p ∞ . This was obtained at query time by performing the iteration until the difference between p t and p t+1 (measured by the L1 norm) was lower than 10 -10 . In order to obtain genes with significantly high steady-state probability, 10,000 permutations of node labels (with network topology remained the same) were conducted to calculate the null distribution of final probability for each gene. The p value was termed as the ratio of random values that were greater than the observed final probability. Genes with p < 0.02 were regarded as the genes significantly afflicted by upstream lncRNA and miRNA dysregulations.

Survival analysis
We clustered LUSC patients based on the expression profile of cancer patients used k-means algorithm. Kaplan-Meier survival analysis and the logrank test were used to evaluate prognostic differences between the two k-means assigned groups [41][42][43]. The Cox proportional hazards regression model was used to evaluate the independence of the prognostic factors in a stepwise manner. Samples in each dataset with complete patient age, sex, stage, smoking index and OS information were used for Cox analysis, and a value of p < 0.05 was regarded as significant.

Statistical analysis
All the data analyses were conducted using R programming (Version 3.3.1) and Matlab (Version 2015b) language.

CONFLICTS OF INTEREST
None.