Identification of a gene signature associated with radiotherapy and prognosis in gliomas

Glioma is one of the most common primary brain tumors with poor prognosis. Although radiotherapy is an important treatment method for gliomas, the efficacy is still limited by the high occurrence of radioresistance and the underlying molecular mechanism is unclear. Here, we performed a data mining work based on four glioma expression datasets. These datasets were classified into training set and validation set. Radiotherapy-induced differential expressed genes and prognosis-associated genes were screened using different classifiers. The Kaplan-Meier curves along with the two-sided Log Rank (Mantel-Cox) test were used to evaluate overall survival. We found the gene expression profiles of gliomas between those patients received radiotherapy and those patients without received radiotherapy were quite different. A 20-gene signature was identified, which was associated with radiotherapy.Furthermore, a novel 5-gene signature (HOXC10, LOC101928747, CYB561D2, RPL36A and RPS4XP2) as an independent predictor of glioma patients’ prognosis was further derived from the 20-gene signature. These findings provided a new insight into the molecular mechanism of radioresistance in gliomas. The 5-gene signature might represent therapeutic target for gliomas.


INTRODUCTION
Glioma is one of the most common primary brain tumors in adults and malignant gliomas, accounting for approximately 70% of malignant primary brain tumors [1,2]. According to the World Health Organization (WHO) classification based on four main features: nuclear atypia, mitoses, microvascular proliferation, and necrosis, gliomas are classified as: grade I (pilocytic astrocytomas, PA), grade II (low grade), grade III (anaplastic) and grade IV (glioblastoma, GBM) [3]. Recently, there have been important advances in understanding the molecular pathogenesis of malignant gliomas [4] and significant progress in its treatment [5]. However, the overall survival www.impactjournals.com/oncotarget/ Oncotarget, 2017, Vol. 8, (No.51), pp: 88974-88987

Research Paper
of gliomas remains poor. The median survival time is only 12 to 15 months for patients with GBM and 2 to 5 years for patients with anaplastic gliomas [1]. Radiotherapy with ionizing radiation (IR) is used for the treatment of low grade gliomas [6] and GBM [7]. However, its efficacy is often limited by the occurrence of radioresistance [8] and the heterogeneity of gliomas with different histological subtypes and grades [9]. Furthermore, the molecular mechanism responsible for the radioresistance of human gliomas is still unclear. Exploration of the molecular alterations after radiotherapy may provide comprehensive understanding of radioresistance in gliomas. In this study, we attempt to find a gene signature associated with radiotherapy and prognosis in gliomas. After downloading microarray data sets of gliomas from the Gene Expression Omnibus (GEO) database and TCGA database, and analyzing the differentially expressed genes (DEGs) with different classifiers between glioma samples that received radiotherapy and that did not receive radiotherapy, we successfully obtained a 20-gene signature that was associated to radiotherapy. Then we further identified a 5-gene signature from the 20-gene signature which was predictive for the prognosis of glioma patients in different data sets.

A 20-gene signature associated with radiotherapy in gliomas
To explore gene markers associated with radiotherapy in gliomas, data mining was conducted. Three data sets were divided into a training set (GSE13041) including 218 patients and 2 validation sets (GSE7696 and TCGA cohort) including 628 patients. First, we used 5 different classifiers to re-classify the clinical samples into radiation group and no radiation group in the training set. As a result, we identified a 20-gene signature (ANAPC1,  BTBD7, CA11, CYB561D2, DRD5, FKBP6, HOXC10,  LAMB4, LOC101928747, PADI1, PAX3, PF4, PYGM,  QPCTL, RPL36A, RPS4XP2, SLC18A1, TP53TG3,  USB1, ZNF280A in Supplementary Table 1) that was associated with radiotherapy in gliomas. In detail, the 20gene signature could re-classify the two groups with high accuracy between 78% and 87%, high specificity between 0.796 and 0.928, and high negative predictive value (NPV) between 0.851 and 0.917 in different classifiers (Table 1), indicating relative high efficiency of this gene signature to distinguish glioma patients receiving radiotherapy from patients not receiving radiotherapy. When the hierarchical clustering analysis was conducted, we also found different expression pattern of the 20 genes between radiation group and no radiation group ( Figure 1A). Furthermore, we used the receiver operating characteristic (ROC) curves to evaluate the comprehensive ability of this gene signature in the 2 linear classifiers (Compound Covariate classifier and DLDA classifier) to separate these two groups. As a result, the 20-gene signature could separate these two groups with AUC value of 0.773 in Compound Covariate classifier ( Figure 2A) and 0.753 in DLDA classifier ( Figure 2B), respectively. This result further indicated moderate ability of this gene signature to separate these two groups in the training set. Afterwards, we used the validation sets to verify the result derived from the training set. We also found high accuracy between 66% and 88% in different classifiers in the TCGA cohort, and high accuracy between 66% and 99% in different classifiers in the GSE7696, respectively (Table 2). In the hierarchical clustering analysis, we found similar differential expression pattern of the 20 genes between radiation group and no radiation group in the TCGA cohort as that in the training set ( Figure 1B). Also, moderate ability of the 20-gene signature to separate these two groups was detected in the TCGA cohort with AUC value of 0.749 in Compound Covariate classifier ( Figure 2C) and 0.790 in DLDA classifier ( Figure 2D), respectively.

Identification of a 5-gene signature related to prognosis of glioma patients
A 20-gene signature associated with radiotherapy in gliomas has been identified, suggesting that the expression changes of these genes in gliomas might be induced by radiotherapy. We further hypothesize that some of them might be associated with radioresistance in gliomas and thus could influence the prognosis of the patients. Next, we tried to screen genes which were related to the prognosis of glioma patients from the 20-gene signature. In order to achieve this goal, GSE13041 was also used as the training set while GSE7696, GSE16011 and the TCGA cohort were used as the validation sets. First, when univariable Cox proportional hazards regression analysis was used in the training set, we obtained 5 genes (HOXC10, LOC101928747, CYB561D2, RPL36A and RPS4XP2) which were highly associated with patients' prognosis from the 20-gene signature ( Table 3). The random survival forests algorithm further validated that all the 5 genes were important for survival of glioma patients when the cut value of relative importance was set as 0.1 ( Figure 3 and Table 3). Then we successfully constructed a risk score model according to the expression levels of these 5 genes as follows: Risk score = 0.469×CYB561D2 + 0.197×HOXC10 -0.066×RPS4XP2 -0.506×RPL36A -0.645×LOC101928747. Next, all patients in the training set and the validation sets were divided into the high-risk group and the low-risk group according to the median risk score. The distribution of risk scores and the survival status of all the patients in the 4 data sets were showed in Figure  4A, 4B. We found that the radio of alive patients over dead patients at the endpoint of follow-up in the low-risk www.impactjournals.com/oncotarget  group was significantly higher than that in the high-risk group (Likelihood ratio test, p=9.138e-05). When Kaplan-Meier curves were used to further evaluate the difference of overall survival (OS) between the two groups, we found patients in the high-risk group had significantly shorter OS than those in the low-risk group (Log Rank test, P=0.011 in GSE13041, P<0.0001 in GSE7696, GSE16011, and the TCGA cohort) ( Figure 5A, 5B). These results indicated that the 5-gene signature was indeed associated with the prognosis of glioma patients.

Prognosis prediction by the 5-gene signature is independent of clinical and pathological factors
To assess whether the prognosis prediction ability of the 5-gene signature is independent of other  Table 4, univariable and multivariable Cox regression analysis both indicated that the risk score was significantly associated with poor prognosis of glioma patients in most data sets (GSE7696, GSE16011 and the TCGA cohort) (for GSE7696, HR=13. 20 These results indicated that the risk score based on the 5-gene signature might be an independent predictor of glioma patients' survival.

Identification of the 5-gene signature associated biological pathways and processes by GSEA
To identify the 5-gene signature associated biological pathways and processes, Gene Set Enrichment Analysis (GSEA) was performed in the GSE13041 cohort. The gene expression profile in the high-risk group and low-risk group were compared. As a result, several cancer related pathways or processes such as p53 signaling pathway and peroxisome were enriched in the high-risk group, while cancer related pathways or processes such as hedgehog signaling pathway and retinol metabolism were enriched in the low-risk group ( Figure 6).

DISCUSSION
In this study, we examined the gene profiles of glioma tissues from patients receiving or not receiving radiotherapy and identified a 20-gene signature associated with radiotherapy in gliomas. Furthermore, a 5-gene signature associated with the prognosis of glioma patients was identified from the 20-gene signature. This 5-gene signature is also an independent predictor of glioma patients' survival. Malignant tumors show massive molecular alterations including gene mutations and abnormal gene expression. The differentially expressed genes between tumors and normal tissues could be used as biomarkers of malignant tumors. However, singlegene biomarkers may result in low reproducibility across different data sets, while gene signature with a panel of genes may be superior to single-gene biomarkers [10].
In fact, the potential of gene signatures as biomarkers of malignant tumors have been widely explored since the pioneering study of molecular classification in acute myeloid leukemia (AML) [11]. In the following studies, gene signatures as classification markers [12,13], diagnosis markers [14,15], prognosis predictors [15][16][17][18][19][20] and markers of treatment response [21][22][23] were identified in different kinds of malignant tumors. As for gliomas, gene signatures were also identified for classification [24] and prognosis prediction [25]. However, few biomarkers are specific to radiotherapy or can indicate response to radiotherapy for glioma patients. In this study, we obtained a radiotherapy-specific 20-gene signature in gliomas, and further identified a 5-gene signature with prognostic value from the 20-gene signature, which may be used as new biomarkers for glioma patients receiving radiotherapy. As previously reported [10], the criteria to establish a gene signature as a marker of a particular treatment method or a prognosis predictor are as follows. First, the gene signature shows specific association with this treatment or patients' prognosis. Second, the accuracy and reproducibility of the gene signature are demonstrated in independent data sets. Third, the gene signature is independent of other clinical factors in a multivariate analysis. Here, we first identified a 20-gene signature associated with radiotherapy in glioma patients. Then a 5-gene signature associated with patients' survival was generated from the 20-gene signature. Both the association between the 20-gene signature and radiotherapy, and the prognostic value of the 5-gene signature were validated in training set and several validation sets. Moreover, the risk score based on the 5-gene signature was still significantly associated with poor prognosis of glioma patients in most data sets by in univariable and multivariable Cox regression analysis.  Therefore, we identified promising and stringent gene signatures which could be used as reliable biomarkers in gliomas. Besides, single bioinformatics model is usually prone to false-positive candidate genes lacking real biological relevance or less clinical utility [26]. Also, the FDR (False Discovery Rate) can be very high when the study is based on small-size samples (often less than 50 patients) [26]. To improve the reliability of our results, we chose data sets with more than 50 patients (in particular, the largest data set, the TCGA cohort includes a total of 548 patients), used 5 different classifiers during the data mining and carefully validated the results in different data sets. The 5-gene signature consists of CYB561D2, HOXC10, RPL36A, RPS4XP2 and LOC101928747. Among them, CYB561D2 is a member of the cytochrome b561 family, being a hydrophobic, transmembrane heme protein. It is capable of oxidationreduction reaction and is a candidate tumor suppressor gene [27,28]. HOXC10 belongs to the homeobox family which encodes a highly conserved family of transcription factors that play an important role in morphogenesis, cell differentiation and proliferation. HOXC10 dys-function is found in thyroid cancer [29], breast cancer [30] and cervical squamous cell carcinomas [31]. RPL36A encodes a ribosomal protein that is a component of the 60S subunit of cytoplasmic ribosomes. The protein, which shares sequence similarity with yeast ribosomal protein L44, belongs to the L44E (L36AE) family of ribosomal proteins. RPL36A overexpression is found in hepatocellular carcinoma [32]. For RPS4XP2 and LOC101928747, their function is almost unknown. The biological function of these 5 genes in gliomas might be of great importance for understanding the molecular mechanisms of radioresistance and potential biomarkers in predicting prognosis in gliomas. Taken together, it suggests that both tumor suppressors and oncogenes may affect the prognosis of gliomas. In summary, we have shown that the gene expression profiles of glioma tissues are different between patients that received radiotherapy and patients that didn't received radiotherapy. Furthermore, we obtained a 20gene signature associated with radiotherapy in gliomas and a 5-gene signature as an independent predictor of glioma patients' prognosis. One limitation of this study is that treatment response was not evaluated because this information was not available in most cases. The potential of the 5-gene signature as a biomarker for radioresistance in gliomas deserves validation in the future study.

Data sets of gliomas
A total of 4 independent data sets of gliomas including GSE13041 [33], GSE7696 [34], GSE16011 [35] and the TCGA cohort [36] were downloaded and analyzed. Among them, GSE13041, GSE7696 and GSE16011 were downloaded from the GEO database (Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/), and the TCGA cohort was downloaded from the TCGA database. GSE13041 has 2 subsets. One subset with 23 patients receiving radiotherapy and 168 patients not receiving radiotherapy was profiled with Affymetrix Human Genome U133A Array

Data processing
After the CEL file of each data set was downloaded, the background was corrected. The raw probe intensities were normalized with the Robust Multichip Average (RMA) [37] method and converted into standardized expression data. Then, we found 13238 common genes among the two platforms and they were used to screen markers of gliomas in subsequent analysis. For genes with more than one probe, the average probe intensity of the same gene was used to calculate its expression value. In order to avoid the systematic error between different platforms, each data set was standardized independently by transforming the expression of each gene to a mean of 0 and SD of 1. The expression profiles were pooled together and then considered them as a single data set [38].
Identification of a gene signature associated with radiotherapy GSE13041 was defined as the training set, while GSE7696 and the TCGA cohort were defined as the validation sets. First, 5 different classifiers including Compound Covariate classifier [39], Diagonal Linear Discriminant Analysis (DLDA) classifier [40], Bayesian CCP classifier, Nearest Neighbor classifier (1-Nearest Neighbor & 1-Nearest Neighbor) [41] and Nearest Centroid classifier, were used to re-classify patients receiving radiotherapy (radiation group) and patients not receiving radiotherapy (no radiation group) for exploring specific gene markers that could efficiently separate radiation group from the no radiation group. Among the 5 classifiers, Compound Covariate classifier and DLDA are linear classifiers. During this process, "leave one out cross validation" was used to increase the accuracy and stability of the results. With this method, a total of 20 genes with classification error rate less than 0.16 were identified as genes that were associated with radiotherapy in gliomas in the training set. Then, the accuracy, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) of the 20gene signature in separating the radiation group from the no radiation group in different classifiers were calculated. The hierarchical clustering analysis [42] was performed to visually evaluate the expression of the 20-gene signature between these two groups. Hierarchical clustering analysis of gene expression profiles was done based on centered correlation metric and average linkage method. Also, to evaluate the comprehensive ability to separate these two groups, the receiver operating characteristic (ROC) curves were graphed and area under the curve (AUC) was calculated in these two linear classifiers. Next, two validation sets were used to validate the results in the training set. Moreover, the ability of the 20-gene signature in separating these two groups was evaluated by calculating the accuracy in different classifiers, hierarchical clustering analysis and ROC curves.

Identification of a gene signature associated with prognosis of glioma patients
A total of 4 datasets were divided into the training set (GSE13041) and validation sets (GSE7696, GSE16011, and the TCGA cohort). The training set was used to detect a gene signature associated with prognosis of glioma patients, and the validation sets were used to verify the reliability of this gene signature. In the training set, univariable Cox proportional hazards regression analysis [43] was used. When random permutation test was used and genes with P values less than 0.001 were selected, we obtained a 5-gene signature from the above 20-gene signature. Then, the random survival forests algorithm [44][45][46] was performed to evaluate the relative importance of each gene to further screen genes associated with the survival of the patients from the 5-gene signature. In this process, number of trees (N tree) was set as 1000, and genes with relative importance more than 0.1 were selected. In fact, the 5 genes were all confirmed to have relative importance more than 0.1. Thus, all the 5 genes were included for subsequent analysis. Then, a risk score model as described previously [46] was constructed using a multivariable Cox regression model based on the 5-gene signature. Risk score of each patient in the training and validation sets was calculated. Patients in the training set and the validation sets were divided into highrisk and low-risk groups using the median risk score as the cutoff. Then the Kaplan-Meier curves were used to further evaluate the difference of overall survival between the two groups, and the hierarchical clustering analysis was performed to visually evaluate the expression of the 5-gene signature between these two groups. Differences in survival time between the low-risk and high-risk groups in each data set were then compared using the two-sided Log Rank (Mantel-Cox) test. Finally, the risk score, together with other clinicopathological parameters were analyzed in univariate and multivariable Cox regression model to verify whether the risk score based on the 5-gene signature is an independent predictor of glioma patients' prognosis in the training set and the validation sets.

Gene set enrichment analysis (GSEA)
GSEA was performed using MSigDB C2 CP: Canonical pathways gene set collection. Biological pathways and processes with relative high NES values were considered to be significantly enriched. Enrichment Map was used for visualizing the GSEA results.

Statistical analysis
The data mining was performed with R software, while other statistical analysis was performed by SPSS (version 17.0). The ROC curves were used to evaluate the ability of the gene signature to separate the radiotherapy group from the no radiotherapy group and AUC of each curve was calculated. The Kaplan-Meier curves were used to evaluate overall survival of the high-risk group and the low-risk group, along with the two-sided Log Rank (Mantel-Cox) test to determine if the difference between the two groups was significant. Other statistical methods included the Cox proportional hazard models, univariate and multivariable Cox regression model. In this study, all statistical tests were two-tailed and differences were considered statistically significant if P-values<0.05.

Author contributions
SL, HG, YY, QC, ZZ, XW, BL, LM, JZ and PZ performed the data mining and analysis. HH, BT and SL conceived the study and wrote the manuscript. All authors read and approved the final manuscript.

CONFLICTS OF INTEREST
The authors declare that they have no competing interests.

FUNDING
This work was supported by by Anhui Provincial College Key Foundation for Outstanding young talent (gxyqZD2016172) and The Provincial College Quality Project for Anhui Province (2016jxtd127). This work was also supported by NSFC81302187 and CWS14C063.