A seven-gene signature predicts overall survival of patients with colorectal cancer

Colorectal cancer (CRC) is a major cause of global cancer mortality. Gene expression profiles can help predict prognosis of patients with CRC. In most of previous studies, disease recurrence was analyzed as the survival endpoint. Thus we aim to build a robust gene signature for prediction of overall survival (OS) in patients with CRC. Fresh frozen CRC tissues from 64 patients were analyzed using Affymetrix HG-U133plus 2.0 gene arrays. By performing univariate survival analysis, 6487 genes were found to be associated with the OS in our cohort. KEGG analysis revealed that these genes were mainly involved in pathways such as endocytosis, axon guidance, spliceosome, Wnt signalling and ubiquitin mediated proteolysis. A seven-gene signature was further selected by a robust likelihood-based survival modelling approach. The prognostic model of seven-gene signature (NHLRC3, ZDHHC21, PRR14L, CCBL1, PTPRB, PNPO, and PPIP5K2) was constructed and weighted by regression coefficient, which divided patients into high- and low-risk groups. The OS for patients in high-risk group was significantly poorer compared with patients in low-risk group. Moreover, all seven genes were found to be differentially expressed in CRC tissues as compared with adjacent normal tissues, indicating their potential role in CRC initiation and progression. This seven-gene signature was further validated as an independent prognostic marker for OS prediction in patients with CRC in other two independent cohorts. In short, we developed a robust seven-gene signature that can predict the OS for CRC patients, providing new insights into identification of CRC patients with high risk of mortality.


INTRODUCTION
Colorectal cancer (CRC) is one of the most common malignancy with high morbidity and mortality worldwide, accounting for almost 1.4 million new cases and 0.7 million deaths in 2012 [1]. In China, the estimated new cases of CRC were 376,300, with 191,000 deaths occurring in 2015 [2]. Despite of improvements in diagnosis and treatment for CRC, its annual new incidence and mortality is still growing.
CRC is a heterogeneous disease with complex multi-pathways. Most of CRC follows the chromosomal instability (CIN) pathway which is the most well characterized pathway type [3]. The Wnt signaling pathway [4], RAS pathway [5], p53 system [6], and other pathways involved in CIN [7,8] are frequently present to be dysregulated during the initiation, progression and metastasis of CRC. A number of patients have quite different treatment responses and prognoses although their tumor types are histologically identical. www.impactjournals.com/oncotarget Therefore it is necessary to look deeply into those abnormal molecular pathways and develop individual strategies for CRC diagnosis and therapy. To date, a lot of efforts have been put to identify molecular markers, however, with limited success achieved when focusing on single protein or gene mutation [9]. Gene expression profiling has been verified as a promising tool to classify tumors and predict the prognosis of cancer [10]. Identification of molecular subtypes [11,12], discovery of progression markers [13,14], and construction of different prognostic models [15][16][17][18] in CRC have been processed and confirmed to potentially improve the diagnosis and therapy of CRC. However, all the developed prognostic signatures are still difficult to apply commonly because of the heterogeneity of CRC.
For most of these studies, disease free survival (DFS) or relapse free survival (RFS) was used as the endpoint to develop molecular markers or to evaluate the validity of prognosis. Considering that overall survival (OS) is traditionally regarded as the ultimate measure of treatment benefits, we would like to investigate whether it is possible to build a robust gene signature to predict the OS in CRC patients. By performing univariate survival analysis, 6487 genes associated with OS were discovered in patients with CRC from a Chinese cohort. Of them, a seven-gene signature was developed using a robust likelihood-based survival modelling approach, and further trained by BRB-array tool to generate a prognostic model. Importantly, the prognostic value of this seven-gene signature was validated in other two independent cohorts, indicating its potential use for identifying CRC patients with high risk of mortality.

RESULTS
The overall flowchart of this work was summarized in Figure 1. We employed Cox proportional hazard regression model and forward selection method to identify a seven-gene signature which can predict OS for colorectal cancer patients based on our gene microarray datasets of Chinese patients. To further evaluate the performance of this gene signature, two cohorts on different platforms from other countries were subsequently validated.

Identification of genes associated with OS
Our data included expression values of 54675 genes and 64 samples. Each sample had observed (survival or censoring) time and censoring status. We first selected an initial set of genes by performing univariate survival analysis using Cox proportional hazard regression model, with the threshold of p-value set as 0.05. Total 6487 genes associated with the overall survival were initially identified.
To investigate the key pathways that were associated with patient survival, we next performed KEGG pathway enrichment analysis for the 6487 genes. These genes were found to be enriched in the signalling pathways such as endocytosis, axon guidance, spliceosome, Wnt signalling and ubiquitin mediated proteolysis (Figure 2A).

Screening of seven-gene signature
We next screened the optimal survival-associated signature genes based on a partial likelihood of the Cox proportional hazard regression model [21]. Considering larger variability of the data, a cross-validation technique was applied by separating the samples into training and validation sets (See Materials and Methods). A forward selection was employed to generate a series of gene models and the optimal model was then selected by using the criterion of minimal AIC (Table 1). Seven genes (NHLRC3, ZDHHC21, PRR14L, CCBL1, PTPRB, PNPO and PPIP5K2) were selected as signature genes that can optimally predict the OS of patients with CRC. With the selected gene signature, unsupervised hierarchical clustering analysis was performed, and the patient population was classified into two sub-classes: Cluster 1 and Cluster 2 ( Figure 2B). Compared with Cluster 1 patients, all Cluster 2 patients in this study were still alive during follow-up ( Figure 2C). Therefore this seven-gene signature may have important application in predicting the OS for patients with CRC.

Differential expressions of seven-gene signature in colorectal cancer and adjacent normal tissue
We further compared the expressions of these seven genes between CRC tissues and adjacent normal tissues in two independent data sets. In our cohort, the mRNA expression of CCBL1, NHLRC3 and PNPO were significantly down-regulated in 26 CRC as compared with paired adjacent normal tissues ( Figure 2D). On the other hand, PRR14L, PPIP5K2, PTPRB and ZDHHC21 were determined to be over-expressed in CRC ( Figure 2D). Consistent result was obtained using TCGA cohort that contained 623 CRC and 51 adjacent normal tissues ( Figure 2E). Therefore, these seven genes may play important roles as oncogenes or tumor suppressor genes during the development of CRC.

Construction of survival risk score system based on seven-gene signature
We subjected these seven genes to BRB-arrayTools to construct a survival risk score system by using the 64 training samples. The regression coefficient for each gene was generated. The survival risk score was calculated as follows: risk score = (-1.004× expression level of PPIP5K2) + (0.823× expression level of CCBL1) + (-0.715× expression level of PNPO) + (-0.008× expression level of PTPRB) + (-1.077× expression level www.impactjournals.com/oncotarget of PRR14L) + (-0.133× expression level of NHLRC3) + (-0.781× expression level of ZDHHC21). Higher score indicated greater mortality risk for patient with CRC. By performing cross-validated time-dependent ROC curves, the area under the respective ROC Curves (AUC) was 0.814 ( Figure 3A), confirming the prediction accuracy of this model. 64 patients were divided into high-and lowrisk groups. The OS for patients in high-risk group was significantly poorer compared with patients in low-risk group (hazard ratio (HR) = 25.79, P < 0.001) ( Figure 3B).

External validation of seven-gene signature
To evaluate the robustness and effectiveness of the seven-gene signature, we used two independent sets of CRC patients with OS information and gene expressions on different platforms. The first one was from the Vanderbilt Medical Center (GSE17537, N = 55) using the same gene expression array platform as we did. The second one, TCGA dataset (N = 584), used Illumina RNA Sequencing method. The survival risk score of Next a robust likelihood-based survival modelling approach was used to select the optimal gene signature for prognosis prediction. The survival risk score system was built based on seven-gene signature (NHLRC3, ZDHHC21, PRR14L, CCBL1, PTPRB, PNPO, and PPIP5K2), which divided patients into high-and low-risk groups. Finally, this seven-gene signature was validated in two independent cohorts on different platforms.  each patient was calculated based on expressions of seven-gene signature. ROC curve analyses demonstrated that this seven-gene signature was capable of predicting OS in patients with CRC (for cohort I, AUC = 0.715, p = 0.011; for cohort II, AUC = 0.585, p = 0.009) ( Figure 4A and 4C). We further divided the patients into two risk groups, based on optimal cut-off risk scores ( Figure 4A and 4C). For the first validation data set, 38 (69.1%) and 17 (30.9%) patients were distinguished as the low and high risk groups, respectively. For the second validation data set, 418 (71.6%) and 166 (28.4%) patients were classified as low and high risk groups, respectively. Kaplan-Meier plots indicated significant differences between 5-year OS of two groups in both two validation data sets: GSE17537 (p < 0.001, Figure 4B), and TCGA (p = 0.002, Figure 4D). By univariate ( Table 2) and multivariate analyses (Table 3), this seven-gene signature significantly predicted the 5-year OS of patients. In multivariate analyses, this seven-gene signature showed prognostic significance for 5-year OS risk in both two validation data sets: GSE17537 (p = 0.002) and TCGA (p = 0.005). Therefore, the seven-gene signature was an independent prognostic factor in predicting the OS of patients with CRC.
To evaluate this seven-gene signature with respect to the prognosis among patients in TNM stage II and III, we further investigated the association between risk scores and OS in 382 CRC patients from TCGA. Stage III tumors were associated with a higher rate of 5-year mortality than stage II tumors (HR = 2.160, p = 0.004) ( Figure 5A). Meanwhile, patients in high risk group demonstrated poorer survival as compared with patients in low risk group (HR = 2.918, p < 0.001) ( Figure 5B). Moreover, this increased 5-year mortality in high risk group was marked in both stage II (HR = 3.124, p = 0.005) ( Figure 5C) and stage III (HR = 2.415, p = 0.012) ( Figure 5D). Therefore, this seven-gene signature might be able to help predict prognosis for patients with stage II and stage III CRC.

DISCUSSION
In this study, we presented the development and validation of a robust seven-gene signature which was able to predict the OS for CRC patients. The signature identified patients with high risk of mortality who may need potential interventions and individual therapies.
A Chinese cohort containing 64 CRC patients was used and total 6487 genes were identified to be associated with the overall survival. Of them, several genes have been reported to present prognostic value for CRC patients, such as CDX2 [19], EPHA2 [20], SLC2A1 [21], CDKN2A [22] and ITGA3 [23]. By performing KEGG analysis, these genes were enriched in the signalling pathways such as endocytosis, axon guidance, spliceosome, Wnt signalling and ubiquitin mediated proteolysis (Figure 2A). The activation of the Wnt signalling pathway is a critical event which frequently occur during CRC development, making it as therapeutic targets for cancer therapy [24]. The ubiquitinproteasome system is important for cell growth and apoptosis regulation, making it also a potential molecular target for cancer treatment and prevention [25]. We further narrowed down the genes size and selected an optimal seven-gene signature (NHLRC3, ZDHHC21, PRR14L, CCBL1, PTPRB, PNPO and PPIP5K2) for prognosis prediction. By searching NCBI (http://www.ncbi.nlm. nih.gov/gene/), none of these seven genes have been reported previously in cancer study. Of note, NHLRC3 was predicted to play role in protein modification process   through ubiquitination; CCBL1, PNPO, PPIP5K2 and ZDHHC21 were possibly involved in cellular metabolism regulation or acting as cell signaling molecules; PTPRB is a member of the protein tyrosine phosphatase (PTP) family which is known to regulate a variety of cellular processes including oncogenic transformation [26]; the function of PRR14L remains unknown. Interestingly, we found that these genes were differentially expressed in CRC tissues compared with adjacent normal tissues in two different cohorts ( Figure 2D and 2E). This suggests that these genes might be novel oncogenes or tumor suppressor genes, and that their functions need further investigation. The prognostic signature was validated in two independent patient sets on different platforms from different countries, with further validation study from China currently underway. Indeed, By ROC curve analyses, this seven-gene signature was capable of predicting OS in patients with CRC from these two datasets with an AUC of 0.715 and 0.585, respectively ( Figure 4A and 4C). By multivariate analyses, this seven-gene signature was validated as an independent prognostic factor for predicting 5-year OS of CRC patients (Table 3). Comparing with other studies that constructed different prognostic models [15][16][17][18], our study obtained a signature with few genes (only seven) to successfully predict the OS in CRC patients. This is important to develop a RT-PCR assay for clinical practice. In our future study, the high-risk patients identified by seven-gene signature will be further studied for their responses to different therapies in an independent Chinese cohort.
The American Joint Committee on Cancer (AJCC) TNM staging system is a golden standard to determine the treatment and prognosis for patients with CRC. However, the limitation of TNM staging system to accurately predict prognosis has been also demonstrated in clinical practice. For example, the 5-year survival rate of stage IIIA patients (85.4%) was higher than stage II patients (79.2%) [27]. Increasing attention has been focused on patients with stage II and stage III CRC, as the use of adjuvant chemotherapy to a proportion of these patients may cause potential under treatment or overtreatment [28][29][30]. To date, there are several tests (such as Oncotype Dx [31], a 12-gene RT-PCR assay, and Coloprint, an 18-gene microarray [32]) being developed in clinical laboratories, which are intending to predict the prognosis of patients with stage II CRC and help guide the use of chemotherapy. Still, all of them are under evaluation in multiple validation cohorts. Our seven-gene signature is able to predict the OS for patients in stage II and III in TCGA cohort ( Figure 5). However, the sample size is limited. Hence large samples are required to be collected in future to further validate the prognostic value of this seven-gene signature on CRC patients in stage II and III. Moreover, we will study whether the seven-gene signature could benefit selection of adjuvant chemotherapy for individual patient in stage II and III.
In conclusion, our seven-gene signature provides new promising biomarkers for CRC prognosis and potential therapeutic targets for CRC treatment.

Human CRC Samples
Frozen tumor specimens (n = 64) and paired adjacent non-tumor tissue (n = 26) of 64 patients with colorectal cancer were retrieved from the Second Affiliated Hospital of Zhejiang University, College of Medicine, China. All patients who received surgical operation and were diagnosed as colon or rectal adenocarcinoma were included in this study. All pathologic information and follow-up data were obtained by reviewing the hospital records [33]. Notably, 19 of the 64 patients died of cancer during follow-up. The clinical data of patients was presented in Table 4. This study was approved by the ethics committee of Zhejiang University.

Gene expression microarray and data analysis
Fresh CRC samples were immediately snap frozen in liquid nitrogen and stored at -80°C until further use. Total RNA was extracted using Trizol reagent (Invitrogen, Carlsbad, California, USA) and sent out to company (Capitalbio Corporation, Beijing, China) for RNA quality control (QC) performance. Only high quality RNA (OD260/280 = 1.8~2.2; 28S:18S ≥ 1.5; RIN (RNA Integrity Number) ≥ 8) were used for Affymetrix HG-U133plus 2.0 gene array. Statistical analyses of microarray data were performed in the R language environment (http://www.rproject.org) [34]. The data was normalized and transformed to expression values by using justRMA( ) function in "affy" library. Signal values for all genes were transformed to the log base 2. Quantile normalization was applied to obtain equal distributions of the probe signal intensities.

Gene signature identification
A robust likelihood-based survival modeling approach [35] was used to select the gene signature. We implemented our analysis by using the "rbsurv" package in R. The detailed algorithm is summarized as follows: (i). We randomly divided the samples into the training set with N*(1 − p) samples and the validation set with N*p samples, with p = 1/3. We fitted a gene to the training set of samples and obtained the parameter estimate for this gene. Then we evaluated log likelihood with the parameter estimate and the validation set of samples. This evaluation was repeated for each gene.
(ii). We repeated the above procedure 10 times, thus obtaining 10 log likelihoods for each gene. The best gene, g(1), with the largest mean log likelihood was selected.
(iii). We searched the next best gene by evaluating every two-gene model and selected an optimal one with the largest mean log likelihood.
(iv). We continued this forward gene selection procedure, resulting in a series of models. Akaike information criterions (AICs) for all the candidate models were computed and an optimal model with the smallest AIC was selected finally.

Unsupervised hierarchical clustering and Kaplan-Meier analysis
Unsupervised hierarchical clustering analysis [36] was performed in R using "hclust" function with Euclidean distance. Kaplan-Meier curves for two distinct groups of patients were plotted using "survfit" function in survival package. P value from log rank test was computed using "survdiff" function.

Development and validation of 7-gene survival risk score system
We subjected the seven genes to BRB-array Tools (http://linus.nci.nih.gov/BRB-ArrayTools.html), using survival risk group prediction tool [37], to calculate the regression coefficient for each gene using the 64 training samples from Zhejiang University. The Survival Risk Score is the sum of the product of the expression level of a gene and its corresponding regression coefficient. The patients were divided into two groups at high and low risk using the 50th percentile. Leave-one-out cross-validation method was performed to determine the robustness.
Gene expression data from the Vanderbilt Medical Center (GSE17537, HG-U133_Plus_2 platform; N = 55) with overall survival information was normalized by justRMA( ) and used as the first validation data set [14]. The CRC dataset (N = 623 for primary CRC tumors and N = 51 for adjacent non-tumor tissues) from The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/; Illumina GA or Hiseq platform) was analyzed (publicly available TCGA pre-processed data was downloaded from Broad GDAC Firehose (https://gdac.broadinstitute. org/)), and those with overall survival information (N = 584) were used as the second validation set. For each independent cohort, the risk score of patients was calculated using the coefficient derived from the training data set. Receiver operating characteristic (ROC) curve analyses were generated for the 3-and 5-year survival predictions to evaluate the specificity and sensitivity of the gene signature and to estimate discriminatory power of the prognostic gene expression signatures. The area under the curve (AUC) was calculated and a bootstrap method was used to calculate the 95% confident internal (CI) for AUC.
Optimal cut-off value based on ROC curve was obtained to divide the patients into low and high risk groups [38]. Kaplan-Meier curves for the two groups of patients were plotted using "survfit" function in survival package.

Enrichment analysis of kyoto encyclopedia of genes and genomes pathways
Genes that were associated with the overall survival of patients evaluated by Cox proportional hazard regression model (P < 0.05) were included in the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis using the Gene Set Analysis Toolkit V2 (http:// bioinfo.vanderbilt.edu/webgestalt/). The hypergeometric test statistical method and the BH multiple test adjustment method were used. All genes from human beings were used as reference. Top 10 pathways with at least 10 genes involved were considered significantly enriched.

Statistical analysis
Mann-Whitney U test and Wilcoxon matchedpairs test were performed to compare the gene expression between CRC and adjacent normal tissues. All statistical tests were performed using Graphpad Prism 5.0 (GraphPad Software Inc, San Diego, CA, USA), and a 2-tailed P value of less than 0.05 was considered statistically significant