Transcriptome sequencing identified hub genes for hepatocellular carcinoma by weighted-gene co-expression analysis

Hepatocellular carcinoma (HCC) is one of the most common malignancies worldwide, and it remains a challenge to understand the genetic mechanisms underlying hepatocarcinogenesis. A global gene network of differential expression profiles in HCC has yet to be fully characterized. In the present study, we performed transcriptome sequencing (mRNA and lncRNA) in liver cancer and cirrhotic tissues of nine HCC patients. We identified differentially expressed genes (DEGs) and constructed a weighted gene co-expression network for the DEGs. In total, 755 DEGs (747 mRNA and eight lncRNA) were identified, and several co-expression modules were significantly associated with HCC clinical traits, including tumor location, tumor grade, and the α-fetoprotein (AFP) level. Of note, we identified 15 hub genes in the module associated with AFP level, and three (SPX, AFP and ADGRE1) of four hub genes were validated in an independent HCC cohort (n=78). Identification of hub genes for HCC clinical traits has implications for further understanding of the molecular genetic basis of HCC.


AbstrAct
Hepatocellular carcinoma (HCC) is one of the most common malignancies worldwide, and it remains a challenge to understand the genetic mechanisms underlying hepatocarcinogenesis. A global gene network of differential expression profiles in HCC has yet to be fully characterized. In the present study, we performed transcriptome sequencing (mRNA and lncRNA) in liver cancer and cirrhotic tissues of nine HCC patients. We identified differentially expressed genes (DEGs) and constructed a weighted gene co-expression network for the DEGs. In total, 755 DEGs (747 mRNA and eight lncRNA) were identified, and several co-expression modules were significantly associated with HCC clinical traits, including tumor location, tumor grade, and the α-fetoprotein (AFP) level. Of note, we identified 15 hub genes in the module associated with AFP level, and three (SPX, AFP and ADGRE1) of four hub genes were validated in an independent HCC cohort (n=78). Identification of hub genes for HCC clinical traits has implications for further understanding of the molecular genetic basis of HCC.

IntroductIon
Hepatocellular carcinoma (HCC) is one of the most common malignancies worldwide, with the highest incidences occurring in East Asia and sub-Saharan Africa [1]. In China, HCC is the second leading cause of cancer deaths. Infection with chronic hepatitis B virus (HBV) remains the major etiological factor of HCC globally with more than one half of HCC patients being chronic HBV carriers [2]. Due to high mortality and poor 5-year survival rates [1], a better understanding of the genetic basis of HCC based on a more comprehensive approach will potentially provide novel strategies for its prevention and treatment.
In the present study, we performed a weighted gene co-expression network analysis (WGCNA) for HCC using mRNA-and lncRNA-seq data to investigate the association between differentially expressed genes (DEGs) and HCC clinical traits, e.g., tumor grade and the α-fetoprotein (AFP) level. We had the following aims: 1) investigate transcriptional patterns in liver cancer and cirrhotic tissues; 2) identify gene modules associated with HCC clinic traits and critical intramodular genes (i.e., hub genes); and 3) complement the characteristics of biological networks in the hepatocarcinogenesis.

rEsuLts summary of rnA-seq data
We performed lncRNA and mRNA sequencing on nine pairs of liver cancer and adjacent cirrhotic tissues of hepatitis B virus-associated HCCs using the Illumina HiSeq ™ 2000 (San Diego, CA) platform. RNA-seq generated 90 bp paired-end sequences and resulted in an output of a total 497 GB of raw sequence (approximately 13 GB per sample). On average, 54 and 57 million raw sequencing reads were obtained in lncRNA and mRNA sequencing, and approximately 94.1% and 93.9% of these raw reads were aligned to the transcribed database (http:// genome.ucsc.edu/, hg19) (Table S1 and S2).

Identification of lncRNAs in the liver cancer and cirrhotic tissues
We used a pipeline for lncRNA annotation from RNAseq data (PLAR) to identify lncRNAs [23] ( Figure 1A). The pipeline predicted 83,796 distinct transcript models that overlapped > 52% of the protein-coding genes (based on the RefSeq annotation in NCBI). After filtering predicted protein-coding transcripts and transcripts near coding genes, we finally obtained 2,799 non-coding RNA transcripts. The number of different types of non-coding RNA transcripts are shown in Figure 1B. In total, eight significantly differentially expressed lncRNA transcripts were identified, and none of these lncRNAs had been reported previously ( Figure 1C).

Significantly differentially expressed mRNA and lncRNA
A total of 23,367 of genes were identified in nine liver cancer and cirrhotic tissues (Figure 2A). Using FPKM [24], we identified 747 significantly differentially expressed genes (DEGs) (fold change (FC) > 2, and false discover rate (FDR) < 0.05) between the liver cancer and cirrhotic tissues including 334 up-regulated and 413 down-regulated genes. For eight differentially expressed lncRNA transcripts, six up-regulated and two downregulated genes were identified.
We estimated the statistical power for detecting significantly DEGs using 'RnaSeqSampleSize' [25] and 'ssizeRNA' [26]. The genes with minimun read counts > 10 across all individuals (n = 11,076) genes was used. We also estimated the distributions of gene read count and dispersions using our RNA-seq data as reference. Given a minimal FC of 2 (i.e., the effect size) and a FDR < 0.05, the statistical power to reject the null hypothesis that the population means of the two groups are equal is 0.697 (by an exact test [27]) using 'RnaSeqSampleSize' [25]. Given the same effect size and significance level, the achieved statistical power is 0.265 in nine pairs HCCs (by a paired t-test) using 'ssizeRNA' [26].
Cluster analysis of the DEGs produced six significant clusters ( Figure 2B). Functional annotation for each cluster according to the gene ontology (GO) terms using DAVID [28] indicated that there were 57 significant GO terms (p < 0.05 and FDR < 0.05) including 22 cellular component terms, 30 biological process terms and five molecular function terms (Table S3). The top 10 GO terms for each cluster are shown in Figure 2C.
A gene set enrichment analysis (GSEA) for RNAseq was performed to determine whether a set of genes defined a priori showed statistically significant, concordant differences between liver cancer and cirrhotic tissues. Five significant gene signatures (FDR < 0.05) were enriched in oncogenic signatures (Table S4). The β-catenin and Yesassociated protein conserved signatures were well known functional pathways involved in HCC [29,30].
The enrichment map analysis overcomes the limitation of redundancy in the GO system [31]. A functional map was constructed using 47 enriched gene sets (30 up-and 17 down-regulated gene-sets) ( Figure 2D). We noted that the up-regulated gene set was enriched in 'nucleic acid metabolic process' and 'biopolymer metabolic process' genes and that the down-regulated gene set was enriched in 'signal transduction' genes. We then tested whether the enriched gene sets were associated with the known liver cancer gene set (http:// zldev.ccbr.utoronto.ca/~ddong/diseaseHub/index.html), which integrates data from multiple sources including OMIM (Online Mendelian Inheritance in Man), GAD (Genetic Association Database), HGMD (Human Gene Mutation Database), PharmGKB (Pharmacogenomics Knowledge Base), CGP (Cancer Genome Project) and GWAS (Genome Wide Association Studies). We found that the 'immune response' genes enriched in the known liver cancer gene set and two known liver cancer genes (CCL5 and CXCL12) were associated with the enriched gene set (p < 10 −4 , Fisher's Exact Test).

Gene co-expression network analysis of DEGs
We decided to construct gene co-expression networks using the weighted gene co-expression network analysis (WGCNA). We extracted modules containing at least 15 genes by combining modules with eigengenes. The co-expression network contained 11 modules ( Figure 3A), and the module sizes ranged from 21 to 94. However, 263 genes were not similarly co-expressed with other genes in the network (MEgrey), including eight lnc-RNA genes. The association between the modules and HCC clinical traits (i.e., sex, age, tumor location, tumor grade and the AFP level) were identified ( Figure 3B). The correlation coefficient (r) of MEblack indicated that it was positively correlated with tumor grade (r = 0.78, p = 0.01) and that MEtan and MEblue were negatively correlated with the AFP level (r = −0.95, p = 8e−05) and the AFP high/low trait (r = −0.77, p = 0.02), where AFP > 25 ng/ml was considered to be high, respectively.
To verify the correlation between MEtan and the AFP level, a measure of module significance -the average gene significance of all of the genes in the module -was calculated. The distribution of the gene significance in all modules associated with the AFP level showed that MEtan had the highest mean gene significance (0.81) ( Figure 3C), indicating that genes in MEtan may play an important role in affecting AFP level. Using Ingenuity Pathway Analysis (IPA ® , http://www.ingenuity.com/products/ipa), we found that MEtan was enriched in genes involved in cell morphology (p = 0.03), embryonic development (p = 0.03) and hematopoiesis (p = 0.01) for molecular and cellular functions, and in the canonical eukaryotic pathways of CMP-N-acetylneuraminate biosynthesis I, phosphatidylcholine biosynthesis I, and mismatch repair. Hub genes represent a series of genes that is significantly connected to a relevant module [32]. We identified the hub genes for the AFP level in MEtan based on the network property (the intramodular connectivity (IMC) and module membership (MM)) ( Figure 3D). The top 15 hub genes were shown in Table 1 (FDR-adjusted p < 0.05), which explained 71% of the total variation in the module eigengene. Of the 15 hub genes, SPX was the gene most significantly associated with the AFP level (q.weighted value = 0.0025), and EXO1 and ADGRE1 exhibited the highest IMC (7.92) and MM (0.95), respectively. We also noted that the well-known HCC biomarker -AFP [33] -was associated with the AFP level (IMC = 3.18, MM = −0.84, and q.weighted = 0.025).
A network visualization of the genes in MEtan associated with the AFP level is shown in Figure 4. A high interconnectivity among hub genes implies that the processes in which they are involved are potentially co-regulated. The most interconnected hub genes, with 13 strong connections (TOM > 0.1), were SPX, EGR3 and ADGRE1. For example, SPX has strong interactions with EXO1, ADGRE1, EGR3 and GPR88.

Network preservation analysis and consensus analysis
To validate the modules identified in the training data (i.e., our RNA-seq data), we assessed the preservation (b) Cluster analysis of significantly DEGs between nine liver cancer and cirrhotic tissues (fold change > 2 and FDR < 0.05). Rows represent genes, and columns represent samples. The dendrogram was generated from unsupervised cluster analysis of DEGs based on complete linkage and Pearson distances. The tree was cut by using the programme cutree for the hcluster at the h = max(hcluster$height)/1.2. Different coloured bars indicate different clusters. (c) GO terms for each cluster shown in B. (d) Network of enriched GO terms derived from the 747 DEGs between the liver cancer and cirrhotic tissues. Red nodes represent up-regulated terms, and green nodes represent down-regulated terms. The blue triangle represents the known liver cancer genes from the DiseaseHub database (http://zldev.ccbr.utoronto. ca/~ddong/diseaseHub/index.html). The color intensity in each node is proportional to enrichment significance. Purple edges indicate overlap between the liver cancer signature and the enriched gene sets, tan-color edges indicate overlap between two gene sets, and the edge width is proportional to the overlap size between the two nodes. of modules in another two HCC RNA-seq data sets: a RNA-seq data set for 12 HCC patients (Zhang_testing_ data, n = 12, accession no.: GSE63863) [34] and TCGA database (TCGA_testing_data, n = 50). We used a measure of intramodular connectivity preservation (i.e., Zsummary) to assess preservation [35]. In Zhang_testing_ data, we found that MEbrown were moderately preserved (|Zsummary|>2), and the remainings modules were weakly preserved (|Zsummary|<2) ( Figure 5A). In TCGA_ testing_data, the modules showed highly preserved (|Zsummary|>8) (Figure 5C), e.g., Zsummary for MEtan related to AFP level was 8.04. These results suggested that the modules identified in the training data were reproducible in independent testing networks and there was no significant change in intramodular connectivity patterns. In addition, consensus analysis of associating the training data with two testing data sets respectively showed that most of the training set-specific modules have a consensus counterpart ( Figure 5B and 5D). Our validation results suggested a similar module structure between the training and testing data.
We then used quantitative real-time PCR (qRT-PCR) to experimentally validate the expression levels of the hub genes in an independent HCC cohort (n = 78). As shown in Figure 5E, three out of the four selected genes (SPX, AFP, EXO1, and ADGRE1) were validated. The qRT-PCR and RNA-seq results were similar.

dIscussIon
In the present study, we conducted an expression profile analysis for nine liver cancer tissues and their    matched cirrhotic tissues using transcriptome sequencing. We identified 747 DEGs and eight novel significantly differentially expressed lncRNA transcripts. The weighted gene co-expression network analysis identified a module (MEtan) related to AFP level and detected hub genes in MEtan. The identified modules were validated by preservation and consensus analysis in another two HCC RNA-seq datasets. We validated at least three hub genes (SPX, AFP and ADGRE1) in the module associated with the AFP level. To our knowledge, this study is the first one to integrate mRNA-and lncRNA-seq data to identify the hub genes related to HCC clinical traits.
We identified 747 DEGs in the liver cancer and cirrhotic tissues, which were categorized into six clusters. The DEGs were enriched in five oncogenic signatures. For example, the signatures of 'BCAT_GDS748_DN' and 'BCAT.100_ UP.V1_DN' both source from β-catenin, a major effector of the canonical Wnt signaling pathway. The 'CORDENONSI_ YAP_CONSERVED_SIGNATURE' is a YAP (Yes-associated protein)-conserved signature; YAP is a driving oncogene in HCC [30]. However, no significant consistency between the clusters and gene signatures was noted.
WGCNA is a powerful approach for investigating the mechanisms underlying the tumorigenesis because co-expressed genes are likely to be jointly involved in carcinogenesis [36]. The down-regulated of eigengenes in MEtan suggested that AFP is partially modulated by specific mRNAs. MEtan was predominantly enriched with cell cycle progression genes, playing an important role in hepatocyte proliferation. WGCNA identified lncRNA co-expression network to be associated with actively transcribed enhancers, which is involved in cell cycle deregulation and liver metabolism during HCC development [37].
We identified 15 highly connected hub genes in MEtan, including ABCA8, AFP, EGR3, EXO1, HMGA1, MT1X and VARS, which play roles as major regulators in cell-cycle regulation and cancer development. ABCA8 is responsible for the transport of a variety of inflammatory mediators and lipids that have direct relevance to tumor progression in ovarian cancer [38,39]. EGR3 is the bona fide target for ESR and involved in the estrogensignaling pathway in breast cancer cells [40]. HMGA1 is involved in the carcinogenesis and invasiveness of HCC, which may be a potential prognostic marker [41]. By interacting with β-catenin, HMGA1 positively regulates Wnt/ β-catenin signaling, leading to an increased formation of the β-catenin-TCF4 complex [42]. EXO1 is an important nuclease involved in mismatch repair system that contributes to maintain genomic stability, modulate DNA recombination and mediate cell cycle arrest. A polyporphism in EXO1 (K589E) was associated with increased risk of HCC development by influencing the activity of Exo1 protein [43].
We validated three of four selected hub genes in 15 hub genes ( Figure 5E). Rucinski et al. [44] ascertained the role of spexin (SPX) in the regulation of cell proliferation (e.g., in adrenal gland cortex). The reexpression of AFP occurs in 50 to 80% of HCC patients during tumor progression, and the serum AFP levels play an important role in HCC diagnosis and the monitoring of responses to treatment [33]. Chen et al. [45] found several polymorphisms in the AFP promoter region may be pathologically significant in HCC. ADGRE1 is an eosinophil receptor and a highly specific marker for eosinophils in humans [46]. Eosinophils reduce chronic inflammation linked to liver diseases in adipose tissue [47]. The association of SPX and ADGRE1 with AFP level and HCC has not been reported previously. The underlying interactions between these hub genes affect AFP levels needs to be explored further.
Microarray data have been widely used in WGCNA for the identification of modules associated with cancer or intermediate traits [48][49][50]. However, several intrinsic limitations of microarray data should be noted; for example, each array contains only its own known genes, and microarray experiments are less reproducible [51]. In the present study, we performed WGCNA for HCC clinical traits using RNA-seq data, which has been suggested to be more reproducible [51].
There are several limitations in the present study. First, the number of samples used for RNA-seq was small, and a large sample size is needed to demonstrate the reliability of the results. However, preservation and consensus analysis of the identified modules in two independent HCC RNA-seq data sets indicated that modules in the training data were preserved in the testing data, especially in TCGA_testing_data. In addition, we validated at least three hub genes (of four) by qRT-PCR. Second, although we have identified a specific module (MEtan) that is associated with AFP level and have analyzed the possible molecular/cellular functions and canonical pathways in the module, there may still be other functions and pathways that were overlooked. Third, in the present study, we assumed one etiologic class (i.e., homogeneity) using HBV-associated HCC patients in RNA-seq. We cannot exclude the possibility that these hub genes may be involved in non-HBV associated HCC. Finally, we acknowledged that there is lack of functional validation of the hub genes.
In conclusion, our study identified differentially expressed mRNA and lncRNA between the liver cancer and cirrhotic tissues. A weighted gene co-expression network based on mRNA and lncRNA identified a module that was significantly associated with AFP level. Hub genes within the module may have crucial roles in HCC progression and may therefore be candidates for functional studies. Our study provided evidence that data mining of www.impactjournals.com/oncotarget DEGs is an effective approach for the identification of novel genes associated with HCC clinical traits.

Patients and tissue specimens
A total of 232 paired fresh-frozen tissue samples (the liver cancer and cirrhotic tissues) were collected from HBV-related HCC patients undergoing surgery in four hospitals ( Table 2). All of the samples were immediately frozen in liquid nitrogen after surgical resection. Diagnosis of HCC for all of the cases was histologically confirmed by two independent pathologists at Peking Union Medical College Hospital. All of the tumour tissues were assessed by hematoxylin and eosin (HE) staining, and only tumour tissues with the percentage of tumor cells > 70% were used for analyses. Afterwards, the liver cancer and cirrhotic tissues of nine HCC patients were selected for RNA-seq. After pathological confirmation, 87 patients were used in the present study including nine pairs RNA-seq HCCs (Table 3) and 78 pairs HCCs used as an independent cohort for validation (Table S5). The

cDNA library preparation and RNA-sequencing
Total RNA was extracted from nine HCC and cirrhotic tissues. We performed whole transcriptome sequencing for mRNA and lncRNA, as described in our previous study [52]. All sequencing was carried out at the Beijing Genomic Institute at Shenzhen (BGI-Shenzhen, Shenzhen, China). A detail description of the library preparation and sequencing is provided in the Supplementary materials.

RNA-seq data analysis
We used a previously described protocol [53] to perform differential gene expression analysis. We used Tophat [54] to map the RNA-seq reads to the genome. Alignments were used as input for Cufflinks [55] for transcriptome reconstruction. The reconstructed transcriptome from all of the samples were merged using CuffMerge. Expression levels in each sample in Fragments Per Kilobase per Million reads (FPKM) units were quantified using CuffDiff [55]. Finally, we used cummeRbund [56] for further analysis. All programs were used with default parameters.

Identification of lncRNA
We used a Pipeline for lncRNA annotation from RNAseq data (PLAR) to identify lncRNAs [23] (Figure 1A). A detailed description was shown in Supplementary materials.

Gene set enrichment analysis (GSEA) and enrichment map analysis
GSEA [57], which utilizes the gene rank derived from differential expression, is a computational method that determines whether a priori defined set of genes shows statistically significant, concordant differences between two biological states. An enrichment map organizes gene sets in a more intuitive way and is implemented in Cytoscape network analysis environment [31,58]. Gene-sets derived from DEGs were enriched and filtered for significance (p < 0.05, FDR < 0.05). An enrichment map places similar significant gene sets near each other, resulting in a more concise global view of enriched biological functions. Overlap between significant gene sets is computed according to the overlap coefficient. The overlap between the DEGs found in the present study and the known liver cancer gene set was scored (Fisher's exact test, nominal p < 10 −4 ) (http://baderlab.org/Software/ EnrichmentMap).

Weighted gene co-expression network analysis (WGCNA)
WGCNA is a statistical approach for the construction of gene modules within a network based on correlations between RNA expression profiles [36]. Considering that the WGCNA was nearly a scale-free topology, the weighted coefficient β was selected based on the scale-free topology criteria, allowing for a maximal correlation coefficient. The adjacency coefficient α was computed using a power function (α mn = power (S mn β) = 1S mn 1 β ), which measures correlation strength between two genes. The adjacency matrix was created based on α, which was subsequently transformed into a topological overlap matrix (see Supplementary materials). A topological overlap measure (TOM) was calculated, which assessed gene interconnectedness.

Identification of clinical significant modules
Genes were hierarchically clustered using the dissimilarity coefficient as the distance measure. We assigned modules containing at least 15 genes by using a mixed dynamic tree-cutting algorithm criterion, which was used to identify modules whose expression profiles were similar and then and merge them into new modules defined as merged dynamic modules. Module eigengenes associated with clinic traits were then used to calculate a correlation coefficient (r) between each module and an HCC clinic trait. The module significance is defined as the average gene significance in a module, and the gene significance is defined as log(p), where p denotes the significance from the t-test for the identification of differential expression between two groups. High module significance values denote strong associations with a clinic trait.

Hub gene analysis
Genes with the highest degree of connectivity within a module (i.e., centrally located genes of co-expressed genes) are termed as 'hub genes' and are expected to be drivers required for signaling pathway of essential cellular function [36,59]: where k i = intramodular connectivity of gene i, and a ij = adjacency between genes i and j [35]. Alternatively, it may also be defined as genes with high module membership [35,60]: MM i (q) = cor (x i , E (q) ), where MM i = module memebership of gene i (in module q), x i = expression profile of gene i, and E (q) = module eigengene of module q.
Both definitions were used to identify the hub genes of module associated with AFP level.

Network preservation and consensus analysis
Module preservation between our HCC data (the training data) and the testing data [34] was assessed using network preservation statistics [35]. Module densitybased and connectivity-based statistics were used to assess module reproducibility [35]. In this comparison, Zsummary, a Z statistic representing a weighted summary of module density and connectivity measures was computed for each module. The Zsummary score was used to evaluate module preservation; >8 indicating strong preservation, and 2-6 means moderate preservation [35]. In addition, we performed consensus analysis of the training and testing data, which related training data modules to the consensus modules and calculated the overlaps of each pair of training-consensus modules.

Quantitative real-time PCR (qRT-PCR)
We used real-time quantitative PCR (Bio-Rad ® , Hercules CA) to validate four hub genes that affected AFP level. The genes were selected by their significance levels and functional relevance. Paired t-tests were used to test for significance. A detailed description of the qRT-PCR method is presented in the Supplementary materials.

Data submission
The sequencing data in this manuscript were submitted to the BioSample database (hosted by the NCBI) (http://www.ncbi.nlm.nih.gov/biosample), and the BioSample accessions is SUB1305112.