Integrative analysis of protein-coding and non-coding RNAs identifies clinically relevant subtypes of clear cell renal cell carcinoma

Protein-coding genes and non-coding RNAs cooperate mutually in cells. Integrative analysis of protein-coding and non-coding RNAs may facilitate characterizing tumor heterogeneity. We introduced integrated consensus clustering (ICC) method to integrate mRNA, miRNA and lncRNA expression profiles of 431 primary clear cell renal cell carcinomas (ccRCCs). We identified one RCC subgroup easily misdiagnosed as ccRCC in clinic and four robust ccRCC subtypes associated with distinct clinicopathologic and molecular features. In subtype R1, AMPK signaling pathway is significantly upregulated, which may improve the oncologic-metabolic shift and partially account for its best prognosis. Subtype R2 has more chromosomal abnormities, higher expression of cell cycle genes and less expression of genes in various metabolism pathways, which may explain its more aggressive characteristic and the worst prognosis. Moreover, much more miRNAs and lncRNAs are significantly upregulated in R2 and R4 respectively, suggesting more important roles of miRNAs in R2 and lncRNAs in R4. Triple-color co-expression network analysis identified 28 differentially expressed modules, indicating the importance of cooperative regulation of mRNAs, miRNAs and lncRNAs in ccRCC. This study establishes an integrated transcriptomic classification which may contribute to understanding the heterogeneity and implicating the treatment of ccRCC.

Despite the common origin of proximal tubules, ccRCC, the most common RCC subtype, is not a single entity but a collection of heterogeneous tumors. Clinically, tumor pathologic stages and histologic grades can be used to stratify patients, infer prognosis and guide treatment [8]. However, due to intra-and inter-tumoral heterogeneity [9,10], stages and grades fail to explain molecular diversity in tumors and cannot satisfy requirements of precision therapy. So understanding the inherent molecular basis of these heterogeneities in ccRCC remains important and challenging.

Research Paper
Currently, some molecular characteristics of ccRCC have been uncovered, such as recurrent mutations of von Hippel-Lindau (VHL), PBRM1, BAP1 and SETD2 [11][12][13][14], and loss of chromosome 3p [11]. These genomic aberrations can potentially change the landscape of tumor transcriptomes by altering expression of global gene sets. For example, VHL mutations lead to imbalances of hypoxia inducible factors (HIF-1α and HIF-2α, or HIF1A and EPAS1) [15][16][17] and then dysregulation of cellular metabolism. Mutations in chromatin remodeling genes, including PBRM1, BAP1 and SETD2, may affect additional functional pathways through chromatin remodeling/histone methylation [12][13][14]. Metabolism shift and epigenetic reprogramming are critical for the development and progression of ccRCC [11]. Notably, noncoding RNAs, including miRNAs and long non-coding RNAs (lncRNAs), can be important epigenetic regulators of various biological processes through transcriptional and post-transcriptional processing and even chromatin remodeling. Recent evidences show critical roles for noncoding RNAs in tumorigenesis and some have been reported to have regulatory functions in crucial pathways in ccRCC such as miR-21, miR-92 and miR-210 [11,[18][19][20].
Transcriptome data, including mRNA, miRNA and long non-coding RNA (lncRNA) expression profiles, have been utilized individually for molecular subtyping of ccRCC [11,21,22]. Brannon and colleagues used gene expression microarray data to identify two robust expression subtypes (ccA and ccB) with distinct gene expression patterns and divergent biological pathway [21]. The Cancer Genome Atlas (TCGA) Research Network established four stable mRNA and four stable miRNA subgroups that correlated with survival [11]. Malouf's group identified four subclasses of ccRCC with distinct clinicopathological and genomic features based on lncRNA expression profiles [22]. These studies show that each transcriptome type of mRNAs, miRNAs and lncRNAs contributes much to ccRCC subtyping.
Actually, mRNAs, miRNAs and lncRNAs cooperate reciprocally and form regulation networks in tumor cells. Expression profiles of each data type reflect intrinsic molecular characteristics of tumors from a single perspective and contribute partially to tumor classification. Therefore, integrative analysis of multiple transcriptome data types including protein-coding genes and non-coding RNAs may facilitate capturing the heterogeneous nature of tumors and these data can be used to identify concordant tumor subtypes [23].
In this study, we introduced an integrative analytical method, termed integrated consensus clustering (ICC), to establish a classification based on multiple transcriptome data types. We then applied ICC to ccRCC with matched mRNA, miRNA and lncRNA expression data sets from TCGA. We identified one RCC subgroup easily misdiagnosed as ccRCC in clinic and four robust ccRCC subtypes associated with distinct clinicopathologic and molecular features. Furthermore, we constructed a triplecolor co-expression network and identified functional modules differentially expressed among subtypes and associated with prognosis. Our results suggest that integrative analysis of protein-coding and non-coding RNAs can effectively classify ccRCCs into subtypes with significantly different biological pathways and regulation mechanisms.

Integrated consensus clustering (ICC)
To integrate information from different data sources, we proposed ICC of three steps: (1) constructing a patient similarity matrix (PSM) from each data source; (2) integrating multiple PSMs into one fused PSM (fPSM); (3) obtaining a final clustering result ( Figure 1). First, the consensus matrix of patients clustering was generated using the Consensus Clustering algorithm [24] for each data set. Each element of the consensus matrix represented the proportion of the corresponding two patients classified into the same cluster in multiple iterative clustering. We referred to the consensus matrices as PSMs, which could mask differences of diverse platforms and hence made PSMs comparable across various data sources. Next, PSMs were merged into an fPSM. Each element of fPSM was calculated as the sum of the corresponding elements from different PSMs and represented fused patient similarity. In the third step, the final consensus matrix (FCM) and clustering result were obtained from consensus clustering on fPSM.
To validate the feasibility of ICC, we applied ICC to three cancer cohorts from TCGA, namely BRCA (Breast invasive carcinoma), LGG (Brain Lower Grade Glioma), and LUAD (Lung adenocarcinoma) (Supplementary Table  S1). Proportional area change under CDF was used for selection of the optimal K (Supplementary Figure S1A, S1C, S1E). After identification of stable clusters for each cancer type, overall survival analysis was performed. The results show that the overall survivals are significantly different among the subtypes in each of the three cancers (Supplementary Figure S1B, S1D, S1F). These results demonstrate that our ICC method can identify cancer subtypes with significant clinical relevance.

Identification of stable clusters in ccRCC by integrated transcriptomic analysis
We applied ICC to 431 primary ccRCC tumors with matched mRNA, miRNA, and lncRNA expression profiles from TCGA KIRC cohort to classify patients into clusters. To select the optimal cluster number, we assessed clustering stability using the FCM produced by ICC from k=2 to 12. For each cluster number k, an FCM was produced ( Figure  2A) and empirical cumulative distribution (CDF) of each FCM was calculated ( Figure 2B). To avoid subjective judgments, the optimal cluster number was determined by a combination of proportional area change under CDF (Δ(K), Figure 2C), proportion of ambiguous clustering [25] (PAC, Figure 2D) and average silhouette width (ASW, Figure 2E). According to Δ(K), the optimal cluster number was 5 because clustering stability increased for k=2 to 5 but almost not for k > 5 ( Figure 2B, 2C). And according to PAC and ASW, we prefer to k=5 as the optimal cluster number in consideration of the major change before and after k=5, although the lowest PAC scores and the largest ASW values appeared at k=2 and 5 ( Figure 2D, 2E). Therefore, the optimal cluster number was determined to be 5 and then 431 ccRCCs were classified into five robust clusters: R1 (n = 105, 24.4%), R2 (n=127, 29.5%), R3 (n = 83, 19.3%), R4 (n = 92, 21.3%) and R5 (n = 24, 5.6%) ( Figure 2F and Supplementary Table S3).
The integrated transcriptomic classification shows high association with those classifications based on the single data type, such as ccA and ccB expression subtypes [21,26] (p<1e-5), TCGA mRNA-based subtypes [11] (p<1e-5), TCGA miRNA-based subtypes [11] (p<1e-5) and the lncRNA-based subtypes [22] (p<1e-5) (Supplementary Table S4, Fishers' exact test). Interestingly, when compared to ccA and ccB subtypes, our clusters R1 and R4 divide ccA into two subgroups with significantly different survivals, clusters R2 and R5 correspond to ccB, and cluster R3 consists of almost equal numbers of ccA and ccB tumors which may account for roughly 20% unclassified tumors in the ccA/ccB classification scheme [21] (Figure 3A, 3B). The result suggests that our integrated transcriptomic classification may be a step forward and could divide ccRCCs into more elaborate subgroups.

Figure 2: Identification of stable clusters of clear cell Renal Cell Carcinoma by using ICC. A. Consensus matrixes of 431
TCGA ccRCC samples for each k (k=2 to 7), displaying the clustering stability using 1000 iterations of hierarchical clustering. B. Plot of cumulative distribution function (CDF) from consensus matrix for each k (k= 2 to 12). C. The Δ(k) vs k plot, indicating the optimal cluster number is k=5 where the 'elbow' occurs. D. Proportion of Ambiguous Clustering (PAC) vs k plot, allowing inference of the optimal k by the lowest PAC, that is 2 or 5. E. Average silhouette width vs k plot, showing the optimal cluster number k= 5 with respect to the major increase from k=3 to k=5 and the subsequent drop-off. F. Silhouette width profile calculated from the consensus matrix when k=5, which was selected as the optimal k. www.impactjournals.com/oncotarget

Clinicopathological characteristics and survival analysis of four ccRCC subtypes
We surveyed clinicopathological characteristics of four ccRCC subtypes (Table 1). Subtype R2 is enriched for more distant metastasis and higher histologic grade and pathologic stage, suggesting more aggressive behaviors of tumors in R2 ( Figure 5A). Overall survival analysis reveals significantly different survival outcomes among the four integrated transcriptomic subtypes ( Figure 5B). Furthermore, multivariate analysis demonstrates that our four integrated transcriptomic subtypes are independent prognostic factors after adjusting for clinical indices including gender, age, pathologic stage and histologic grade ( Table 2).

Molecular expression patterns and pathway analysis
RNAs with subtype-specific expression may play important functions in individual molecular subtypes. So we identified the mRNAs, miRNAs and lncRNAs with significantly high expression (fold change > 1.5 and adjusted p value < 0.05) among four integrated transcriptomic subtypes by comparing each subtype with the rest using a moderated t-test [28] (Figure 6A-6C). Interestingly, the numbers of subtype-specific highly expressed mRNAs and lncRNAs in R2 and R4 are much larger than those in R1 and R3, implying that tumors in R2 and R4 are more active at transcriptomic level. In addition, highly expressed miRNAs in R2 are far more than those in other subtypes and highly expressed lncRNAs in R4 account for more than half (6693/12727=52.6%) of all the lncRNAs evaluated in this study. These results suggest that non-coding RNAs are important in ccRCC subtyping and that miRNAs and lncRNAs may play critical roles in R2 and R4, respectively.
To reveal biological functions related to each subtype, we performed gene ontology enrichment analysis on the top 500 overexpressed mRNAs using R package topGO [29] and then the top 50 most significantly enriched biological processes were summarized using the REVIGO webserver [30]. Four subtypes show distinct enriched biological processes ( Figure 6A and Supplementary Figure S2). Consistent with previous reports [21,31], upregulated metabolism-related genes, which are enriched in R1, correlate with good prognosis of ccRCC, and overexpression of genes involved in mitotic cell cycle, which are enriched in R2, are associated with worse prognosis. Moreover, collagen is the most abundant protein in the extracellular matrix (ECM). Upregulation of genes in collagen metabolic process may help cancer cells breach the ECM and escape, which may account for more invasive and metastatic features of tumors in R2 [32]. In addition, upregulated genes involved in RNA splicing in R4 implies that aberrant splicing events may contribute to disease progression of tumors in R4.
Next, we investigated the expression of pathways among four subtypes by pairwise comparisons of four subtypes and adjacent normal samples using gene set analysis [33] and KEGG pathway genes. Downregulated expression of various metabolic pathways are observed in all four ccRCC subtypes when compared to adjacent normal samples (Supplementary Figure S3), which is consistent with the central feature of oncologic-metabolic shift in ccRCC [11]. The widespread differentially expressed  pathways are observed among the four ccRCC subtypes, which delineates the distinct pathway characteristics of subtypes. Notably, AMPK signaling pathway is more highly expressed in R1 compared to any other subtype. Activated AMPK signaling pathway can lead to inhibition of biosynthetic pathways and activation of catabolic pathways [34] and hence improve the oncologic-metabolic shift of ccRCC, which may explain good prognosis of R1. Subtype R2 has the highest expression of genes involved in cell cycle, p53 signaling pathway and ECM-receptor interaction and the lowest expression of genes involved in multiple metabolism pathways ( Figure 6D), which may promote aggressive behavior of ccRCC and account for poor prognosis of R2.

Triple-color co-expression network analysis
To systematically understand the potential regulation relationships among mRNAs, miRNAs and lncRNAs in ccRCC, we constructed a triple-color coexpression network of mRNAs, miRNAs and lncRNAs using weighted gene co-expression network analysis (WGCNA) [35] on a cohort of 407 ccRCCs included in R1-R4. Using unsupervised hierarchical clustering analysis and a dynamic hybrid tree cut algorithm, we identified 31 distinct triple-color co-expression modules (Supplementary Figure S4A, Supplementary Tables S8-S9). This result suggests that synergistic and regulatory effects among protein-coding and non-coding RNAs may In addition, we used the module eigengene (ME) to summarize expression of each module and to assess whether modules are associated with subtype phenotypes (See Methods). We observed significant expression differences in 28 modules among subtypes (Supplementary Figure S4B). We further identified 10 prognosis modules (M2-7, M10, M17-18, M28) ( Figure 7A) with significant correlation with survival and enrichment of prognosis genes (See Methods). These results suggest that these triple-color modules, especially prognosis modules, may play important roles in tumorigenesis of ccRCC.
To explore biological functions of modules, we performed gene ontology enrichment analysis on mRNAs of each module using topGO [29]. Then we summarized the top 50 most significant biological processes using REVIGO [30] and defined the most significant summarized biological process as the function of the module. These module functions include diverse biological processes, such as mitotic cell cycle, carboxylic acid metabolism, blood vessel morphogenesis, RNA processing and extracellular matrix organization (Supplementary Figures  S4B). Notably, among the 10 prognosis modules ( Figure  7A), the functions of M5 (blood vessel morphogenesis), M6 (carboxylic acid metabolism) and M10 (mitotic cell cycle) have been reported to be critical in ccRCC [11].
Other prognosis modules such as M4 (mitochondrion translation), M7 (histone lysine methylation), M17 (extracellular matrix organization) and M18 (positive regulation of autophagy) may also exert important functions in tumorigenesis of ccRCC although they are less well-studied for the moment. Interestingly, the worst prognosis subtype R2 shows nearly completely opposite expression patterns of prognosis modules compared to the best prognosis subtype R1 ( Figure 7A), suggesting that these prognosis modules together may characterize the molecular features of subtypes and correspond to their disparate survival outcomes. This result also shows that low expression of M6 (carboxylic acid metabolism) and M18 (positive regulation of autophagy) and high expression of M10 (mitotic cell cycle), M17 (extracellular matrix organization) and M28 (lymphocyte activation) may be associated with poor prognosis in ccRCC (absolute Pearson's correlation coefficient > 0.5 in R2). In addition, subtype R4 shows very close expression patterns with R1 except M7 (histone lysine methylation), which has significantly high expression in R4 but low expression in R1. The result implies that high expression of M7 (histone lysine methylation) may partially account for the worse prognosis of R4 when compared to R1.
Moreover, in prognosis modules, miRNAs and lncRNAs may play important regulation functions in associated biological processes. For example, in module M10 (mitotic cell cycle) ( Figure 7B), miR-30a [36], let-7c [37,38], miR-101 [39] and LINC00511 [40] ( Figure  7C) have been reported to play regulation functions in tumor cell growth and proliferation. The rest miRNAs and lncRNAs in M10 may also participate in regulation of genes associated with cell cycle since they are highly co-expressed. Interestingly, DEPDC1, the antisense gene of lncRNA DEPDC1-AS1, plays a pivotal role in the regulation of mitotic progression [41]. DEPDC1 and DEPDC1-AS1 are both included in M10 and highly coexpressed. The result suggests that DEPDC1-AS1 may be involved in cell cycle as a cis-acting regulator of DEPDC1. Although detailed molecular mechanism need further research, our results still suggest that the prognosis modules may represent critical functional modules in ccRCC and those miRNAs and lncRNAs in modules may play important regulation functions in associated biological processes.

DISCUSSION
In this study, we introduced ICC to integrate mRNA, miRNA and lncRNA expression profiles for ccRCC subtyping. We identified one RCC subclass easily misdiagnosed as ccRCCs in clinic and four robust ccRCC subtypes that were associated with different clinicopathologic features, genomic aberrations and molecular expression patterns. Survival analysis showed that our classification could separate ccRCC tumors with different overall survivals. Functional analysis of four ccRCC subtypes characterized distinct features of biological processes and pathways. In addition, triplecolor co-expression network analysis depicted coexpression relationships among mRNAs, miRNAs and lncRNAs in ccRCC and identified 10 prognosis modules characterizing molecular features of four subtypes. Our results show that the integrative analysis of protein-coding genes and non-coding RNAs may contribute to delineate molecular heterogeneity within ccRCC and may help guiding treatment of ccRCC.
Our ICC approach integrates multiple data types by transforming the information of each data type into a PSM and merging PSMs into an fPSM. The final clustering result is determined by consensus clustering on fPSM. One advantage of the method is that it does not require normalization of data across multiple types or platforms prior to integrating them. Any data type can be integrated into analysis, including gene mutations, CNA, DNA methylation and gene/miRNA/protein expression profiles. We used ICC to integrate mRNA, miRNA and  lncRNA expression profiles of ccRCCs and successfully identified five robust clusters including one RCC subclass misdiagnosed as ccRCCs and four clinically relevant subtypes. This integrative analysis framework can also be applied to other cancer types.
Among the five clusters identified with ICC, cluster R5 shows significantly different somatic alterations patterns with other clusters. A recent report shows that 22 ccRCC samples in TCGA are misdiagnosed as ccRCCs and should be a mixture of chRCC and ccpRCC [27]. Cluster R5 accurately recognized 19 phenotypes. Each module is represented by its module eigengene and the most significant biological processes for each module are shown on the right. B. Heatmap of genes belonging to prognosis module 10 (mitotic cell cyle) (top) and corresponding module eigengene values across samples (bottom). C. Visualization of the potential regulatory network of miRNAs in module 10. Yellow circle, red rhombus and purple rhomboid indicate mRNAs, miRNAs and lncRNAs, respectively. We predicted the targets (including mRNAs and lncRNAs) of miRNAs in module 10 using miRWalk2.0 webserver with default settings. Then those predicted targets included in module 10 were retained together with miRNAs for network visualization. out of 20 misdiagnosed samples included in our study (True Positive Rate (TPR) = 86.4%, higher than TPR = 64.5% by lncRNA-based classification [22]). The result demonstrates the ability of integrative analysis of protein coding genes and non-coding RNAs to distinguish other RCCs from ccRCCs.
Our integrated transcriptomic classification identified four robust ccRCC subtypes. When compared to previously established gene expression ccA/ccB classification, our subtyping system can further divide ccA into two subgroups R1 and R4 with significantly different survival. Moreover, more than half (52.6%) of lncRNAs are significantly highly expressed in R4, much more than those in other subtypes, which makes R4 a potential lncRNA-dependent subtype. These results suggest that integrating information from both protein coding and non-coding RNAs may contribute to capturing the heterogeneity of ccRCC and could help dividing ccRCCs into more elaborate subgroups.
To our knowledge, this is the first integrative analysis of protein-coding genes and non-coding RNAs in cancer subtyping. Our four integrated transcriptomic subtypes significantly correlate with clinicopathologic and molecular features. Although this result should be verified in a larger sample size from multiple centers, our study demonstrates that integrative analysis of protein-coding genes and non-coding RNAs are valuable for ccRCC subtyping.

The cancer genome atlas (TCGA) data
All ccRCC patients are from the kidney renal clear cell carcinoma (KIRC) cohort of The Cancer Genome Atlas (TCGA) project [11]. Multi-omics datasets of a cohort of 431 ccRCC patients with matched mRNA, miRNA and lncRNA expression profiles were used in this study (Supplementary Tables S2-S3). Level 3 mRNAseq RSEM data (n = 533) and Level 3 miRNA-seq RPM data (n = 516) were obtained from the Broad Institute GDAC FireBrowse (TCGA data version 20150601, http:// firebrowse.org/). LncRNA RPKM data (n = 448) was downloaded from TANRIC database [42]. Gene mutation and copy number alteration data were archived from Mutation MutSig2.0 Analyses results and CopyNumber Gistic2 Analyses results on the Broad Institute GDAC FireBrowse (TCGA data version 20150401). Clinical information data were downloaded from UCSC cancer browser [43].

Transcriptome data preprocessing
The same preprocessing flow was applied to mRNA, miRNA and lncRNA datasets independently. Given a dataset, the RNA expression matrix was obtained by substituting the smallest non-zero value for all zero or NA values in the dataset and taking a base-2 logarithmic transformation. Before clustering, RNA expression matrix was median centered by RNA and the most variable RNAs was selected using median absolute deviation. Finally, the most variable 2,500 mRNAs, 1,500 lncRNAs and 300 miRNAs were used for the unsupervised clustering analysis.

Integration of multiple transcriptome data types using consensus clustering
For the first step of the ICC method, we employed consensus clustering algorithm [24] to integrate multiple transcriptome data types, including mRNA expression, miRNA expression and lncRNA expression profiles. First, we constructed the patient similarity matrix (PSM) represented by a consensus matrix for each data type independently ( Figure 1A, 1B). Due to the comparability of the consensus matrix, we then combined three PSMs into a fused PSM (fPSM) by matrix addition ( Figure  1C). Specifically, we initially performed consensus clustering on each filtered RNA expression matrix to construct the consensus matrix using the R package ConsensusClusterPlus. To ensure comparability of clustering results based on different data types, the same clustering parameters were used for different data types and different cluster numbers. For any given data type and cluster number k, we conducted 1,000 runs (reps) of agglomerative hierarchical clustering algorithm (clusterAlg) using the resampled data with 80% patient resampling (pItem) and 80% RNA resampling (pFeature). The distance measurement was set as Pearson correlation (distance) and linkage function was set as "ward.D" (both innerLinkage and finalLinkage). Then, a single combined consensus matrix was achieved by summation of three consensus matrices derived from different data types. We consider the combined consensus matrix as a fPSM capturing a combination of multiple patient similarity measures derived from different transcriptome data types.

Assessment of clustering stability
Given multiple transcriptome data sets and a specific cluster number k, we can now achieve a fPSM for clustering. Furthermore, to obtain stable clustering results, we used consensus clustering to the fPSM to assess clustering stability ( Figure 1D). Considering the specific properties of the PSM (patients × patients), we used a custom implementation of consensus clustering to resample patients identically from rows and columns of the fPSM. We carried out 1,000 runs of "ward.D" linkage hierarchical clustering using the resampled data with 80% patient resampling and 1-Spearman correlation as distance. Then a final consensus matrix was calculated. www.impactjournals.com/oncotarget Finally, we performed hierarchical clustering to achieve the cluster membership result with 1-the final consensus matrix as a distance matrix and "ward.D" as the linkage function.

Selection of optimal cluster number
We selected a maximum cluster number k of 12 for assessment of optimal cluster number. To intuitively inspect clustering results, we used the R package pheatmap for visualization of the final consensus matrix for each cluster number. We also calculated empirical cumulative distribution (CDF) for the final consensus matrix and used the proportional area change under CDF (Δ(k)) for selection of the optimal k. According to the Δ(k) vs k plot, the optimal k is the k for which the Δ(k) value will be close to zero. To avoid a subjective judgement, we also investigated the proportion of ambiguous clustering (PAC) score and average silhouette width for each cluster number k, and the k for which the lowest PAC score or highest average silhouette width (ASW) appears can be considered to be the optimal K. The optimal number of clusters was determined by a combination of Δ(K), PAC and ASW.

Gene-set analysis
We performed gene-set analysis among ccRCC subtypes and adjacent normal sample group using the R package PIANO [33] and KEGG pathway genes set from NCBI BioSystems [44]. The p-value for each gene calculated for differential expression analysis was used as gene-wise statistics. Fisher's combined probability test was used to aggregate data into a gene-set p-value. The "geneSampling" method was used for significance assessment of gene sets. The significant up(/down)regulated pathways were filtered and only the dominantly up-or down-regulated pathways were reported.

Weighted gene co-expression network analysis
A signed weighted co-expression network was constructed using the R package WGCNA [35]. mRNAs, miRNAs and lncRNAs expressed in more than 70% samples of 407 ccRCCs (patients in R1-R4) were independently filtered. Then 27,543 genes (18,288 mRNAs, 566 miRNAs and 8,689 lncRNAs) were included for network construction. In order to seize the negative regulation functions of miRNAs, we multiplied miRNA expression values by -1 so that miRNAs could be positive correlation with their potential targets and clustered together in the signed network. First, a matrix of pairwise correlations between all pairs of RNAs across 407 ccRCC samples was calculated. Then, an adjacency matrix was computed by raising the 0.5*(1+correlation matrix) to the power of 6, which is the suggested value for more than 60 samples in a signed hybrid network. Based on the adjacency matrix, a topological overlap matrix was constructed. Using 1-topological overlap matrix as a dissimilarity matrix, the average linkage hierarchical clustering was performed. Finally, robustly defined modules were identified by cutting the hierarchical clustering tree using the dynamic hybrid tree cut algorithm. We summarized each module by the module eigengene (ME) represented by the first principal component of the scaled module expression profiles, which explains the major variation of module expression. Expression values of MEs were used for assessing correlations of modules with survival and five subtype phenotypes. Here, the subtype phenotype was introduced as a binary trait variable across all samples. In this case, phenotypic trait is the subtype category, that is R1-R4. To define prognosis modules, we also identified 2,822 prognosis genes (2,494 mRNAs, 27 miRNAs and 301 lncRNAs) (Supplementary Table S9) which were significantly correlated with survival after adjusting by FDR among all 27,543 genes (Adjusted p value < 1e-4).

Statistical analysis
All statistical analysis was performed using R language [45]. The R package survival was used for survival analysis. Kaplan-Meier methods and a log-rank test were used to assess differences between survival distributions. Univariate and multivariate models were calculated using Cox proportional hazards regression. The R package limma [28] was employed for differential expression analysis. A Fisher's exact test was used to compute p-values from a contingency table. The Benjamini and Hochberg method was used for multiple testing adjustment and 0.05 was considered statistically significant, unless otherwise noted.