A bioinformatics-to-clinic sequential approach to analysis of prostate cancer biomarkers using TCGA datasets and clinical samples: a new method for precision oncology?

Biomarker-driven cancer therapy has met with significant clinical success. Identification of a biomarker implicated in a malignant phenotype and linked to poor clinical outcome is required if we are to develop these types of therapies. A subset of prostate adenocarcinoma (PACa) cases are treatment-resistant, making them an attractive target for such an approach. To identify target molecules implicated in shorter survival of patients with PACa, we established a bioinformatics-to-clinic sequential analysis approach, beginning with 2-step in silico analysis of a TCGA dataset for localized PACa. The effect of candidate genes identified by in silico analysis on survival was then assessed using biopsy specimens taken at the time of initial diagnosis of localized and metastatic PACa. We identified PEG10 as a candidate biomarker. Data from clinical samples suggested that increased expression of PEG10 at the time of initial diagnosis was linked to shorter survival time. Interestingly, PEG10 overexpression also correlated with expression of chromogranin A and synaptophysin, markers for neuroendocrine prostate cancer, a type of treatment-resistant prostate cancer. These results indicate that PEG10 is a novel biomarker for shorter survival of patients with PACa. Also, PEG10 expression at the time of initial diagnosis may predict focal neuroendocrine differentiation of PACa. Thus, PEG10 may be an attractive target for biomarker-driven cancer therapy. Thus, bioinformatics-to-clinic sequential analysis is a valid tool for identifying targets for precision oncology.


INTRODUCTION
Precision oncology, also called biomarker-driven therapy, has greatly improved clinical outcomes in recent years. Biomarker-driven therapies such as trastuzumab for HER2-positive breast cancer and imatinib for chronic myeloid leukemia [1,2] highlight the efficacy of targeting biomarkers associated with a poor prognosis and illustrate the importance of identifying those biomarkers involved in poor clinical outcomes for malignancies.
Prostate adenocarcinoma (PACa) is one of the most common cancers in men. In the U.S., the 5 year relative survival rate of early stage PACa is >99% [3]; yet PACa is the second leading cause of cancer-related death in the U.S. This suggests that a subset of PACa is treatment-resistant [3]. To determine the prognosis of a www.impactjournals.com/oncotarget/ Oncotarget, 2017, Vol. 8, (No. 59), pp: 99601-99611

Research Paper
PACa patient, physicians measure levels of prostatespecific antigen (PSA), and use clinical staging and the Gleason score, which is a grading system based on the architectural pattern of tissue from a PACa biopsy [4][5][6]. In addition to these factors, castration-resistant prostate cancer (CRPC), a transformed prostate cancer mainly caused by continuous androgen deprivation, is linked to a poor prognosis due to limited therapeutic options [7]. Neuroendocrine prostate cancer is a class of CRPC showing neuroendocrine differentiation [8]. Because of its poor prognosis and treatment resistance, there is significant unmet need for new neuroendocrine prostate cancer treatments. However, the best molecular targets for neuroendocrine prostate cancer therapies have not yet been identified [8].
To identify biomarkers of poor prognosis in PACa, we developed a bioinformatics-to-clinic sequential analysis approach and used it to identify a candidate biomarker linked to shorter survival. To evaluate the predictive power of our analytical approach, we examined biopsy samples to see if expression of our identified gene, PEG10, at the time of initial diagnosis affected clinical outcome. PEG10 was also of interest because a recent report implicates it in neuroendocrine differentiation of PACa [9]. Therefore, we also investigated the link between PEG10 expression and neuroendocrine differentiation in clinical samples.

Overview of the bioinformatics-to-clinic sequential analysis method
To identify biomarkers that predict shorter survival of patients with PACa, we used two analytical procedures based on the TCGA dataset for localized PACa, followed by validation in clinical samples (Table 1). First, we extracted PACa cases with a Gleason score ≥8; this is because higher Gleason scores are linked to poor clinical outcomes [4]. Next, differentially expressed genes (DEGs) were identified based on relapse-free survival (RFS). DEGs that were highly expressed in the cohort with shorter survival were extracted. Second, the RFS hazard ratio (HR) for all overexpressed genes was calculated to identify genes linked to shorter RFS. Genes identified in both analyses were designated final candidate genes. Finally, the effect of the candidate gene expression on survival was validated using clinical samples ( Figure 1).

Calculation of the RFS HR
We calculated the RFS HR for each gene that was overexpressed (z-score ≥1) using a Cox proportional hazard model. Before calculating the HR, the number of cases showing overexpression of each gene was examined. If a gene was overexpressed in less than 5% of the total cohort, it was excluded from further analysis because a HR calculated from a disproportionate distribution is not reliable. Thus, 2527 genes (13.3% of the identified genes) were excluded. Next, we identified genes responsible for shorter RFS (HR >1) among genes with a significant p-value for RFS HR ( Figure 1 and Table 3). The number of genes responsible for shorter RFS was 630.

Final candidate molecules identified by screening
PEG10 was identified by both analyses (HR = 3.0844, 95% CI: 1.397-6.812; p-value = 0.0053). Thus, increased PEG10 expression was chosen as the candidate marker for shorter RFS in those with localized PACa with a Gleason score ≥8.
The definition of overexpression in bioinformatics analysis is complex. It is not known whether z-score ≥1 is the most appropriate definition of overexpression. Additionally, we excluded genes overexpressed in less than 5% of the population. However, this exclusion may also result in inaccurate estimations. Therefore, we investigated whether increased expression of PEG10 is linked to RFS using a different definition of increased expression: a z-score ≥1, ≥1.5, or ≥2, without the exclusion of genes overexpressed in less than 5% of the population. Again, increased expression of PEG10 was identified as a biomarker for shorter RFS when using the dataset based on localized PACa (HR = 3.0844; 95% CI, 1.397-6.812; p = 0.0053, for z-score ≥1; HR = 8.6811; 95% CI, 2.557-29.47; p = 0.0005, for z-score ≥1.5; and HR = 8.812; 95% CI, 2.01-38.63; p = 0.0039, for z-score ≥2).
To investigate whether expression of PEG10 protein in biopsy samples taken at the time of initial diagnosis was also linked to RFS of those with localized or metastatic PACa, we collected PACa biopsy samples from patients       true when we examined total (local plus metastatic) PACa (HR = 3.396; 95% CI, 1.871-5.927; p < 0.0001) ( Figure 3G). Furthermore, the effect of PEG10 expression remained significant in multivariate analysis using a Cox proportional hazard model adjusted for age, initial PSA level, main treatment (hormonal therapy or others), and stage (Table 6).

Link between PEG10 expression and neuroendocrine differentiation
A recent report suggests that PEG10 expression is increased in PACa with focal neuroendocrine differentiation [9]. Thus, we stained clinical samples with chromogranin A (CGA), a marker for neuroendocrine differentiation [11], to explore the link between increased PEG10 expression and neuroendocrine differentiation. Because expression of CGA in PACa tissue is negligible, the effect of CGA expression on clinical outcome is a matter for debate [12,13]. Recent studies suggest that a scoring system based on the population of CGA-positive cells (0: <1% positive cells; 1: 1-10% positive cells; 2: >10% positive cells) in biopsy samples taken at the time of initial diagnosis shows a direct link with clinical outcome [13][14][15]. Therefore, we stained samples for CGA and examined the relationship between CGA expression at the time of diagnosis and clinical outcome. We found that increased expression of CGA was indeed linked to shorter RFS ( Figure 4A-4C). Additionally, we assessed the association between PEG10 expression and CGA expression using the H-score. Interestingly, increased expression of PEG10 was positively associated with expression of CGA (Spearman r = 0.2021; p = 0.0334) ( Figure 4D). Furthermore, expression of PEG10 was positively associated with expression of synaptophysin, another representative marker of neuroendocrine differentiation ( Figure 4E-4G). This suggests that increased expression of PEG10 at the time of initial diagnosis may predict neuroendocrine differentiation.

DISCUSSION
To apply biomarker-driven cancer therapy to PACa, we developed a method called bioinformaticsto-clinic sequential analysis. A bioinformatic approach can be a useful tool; however, it has several limitations. One such limitation is that data that fit the study model perfectly are often not available. In the current study, the dataset required for our model needed to include not only localized, but also metastatic, PACa. A dataset including both localized and metastatic PACa was available, but it contained only 21 cases. Thus, we used a TCGA dataset that included only localized PACa; however, the sample number (201 cases) was acceptable for bioinformatics analysis. Next, we used the clinical samples that included both localized and metastatic PACa to ascertain whether biomarkers of a poor prognosis identified by the bioinformatics approach (based on the localized PACa dataset) were applicable to the clinical samples from patients with localized or metastatic PACa.
When identifying DEGs (the first step in the in silico analysis), the definition of "shorter survival" is complex. To identify DEGs in this study, we divided the dataset according to 1, 3, or 5 year survival. It is still not known if these criteria are the most appropriate. This is an important consideration because extraction of genes that emerge in the differential settings may avoid underor over-diagnosis. Another issue related to DEGs is that DEGs are identified by a comparison of the average value for each group. If a particular gene has a strong link to survival when overexpressed, but overexpression in the population is low, the average value of the gene's expression in a cohort of shorter survival will not be high. Thus, identification of DEGs may not always give the most accurate estimation of genes relevant to survival time. The second step, a calculation of HR, may also be problematic. In this study, we calculated HR in several definitions of increased expression. Even in the different settings, increased expression of PEG10 was linked to shorter survival. However, it is still not known which definition is appropriate. As described above, there are problems with in silico analysis; therefore, to compensate for these we developed the bioinformatics-to-clinic sequential analysis approach to check for concordance between different analytical procedures. Indeed, our results indicate that our approach is reliable and provides a high-quality outcome. The transcription factor c-Myc induces production of PEG10 mRNA [16]. Due to the instability of c-Myc mRNA, it is not possible to investigate the relationship between c-Myc mRNA and PEG10 mRNA expression using bioinformatics [16]. However, if overexpression of PEG10 is associated with up-regulation of c-Myc in PACa,   c-Myc inhibition may be a therapeutic option for the subset of PACa patients with a poor prognosis and overexpression of PEG10. Currently, several c-Myc inhibitors are being tested in clinical trials [17]. The present study may indicate an additional therapeutic application for c-Myc inhibitors in a subset of PACa patients that have a poor prognosis, and possibly in patients with neuroendocrine prostate cancer.

Bioinformatic analysis
Gene count data from PACa TCGA samples (RNA sequencing) were downloaded from the Genomic Data Commons Data Portal (https://gdc-portal.nci.nih.gov). To identify DEGs, raw counts were entered into the TCC package in R [18]. The definition of DEGs is a false discovery rate (FDR) of <0.05 and a logFC ≥1 or ≤-1. Z-scores (RNA seq V2 RSEM) available in cBioPortal (http://www.cbioportal.org) were used to calculate the HR [19]. Overexpression was defined as a z-score ≥1, ≥1.5 or ≥2. The HR of genes was calculated using Cox proportional hazard models and the coxph function located in the survival library in R.

Statistical analysis
HRs were calculated using the Cox proportional hazard model, and significance was assessed using the log-rank test. The Kaplan-Meier method was used for survival analysis. The link between PEG10 and CGA or synaptophysin expression was assessed using the Spearman's Rank-Order Correlation test. For multivariate analysis, Cox proportional hazard models were performed using the coxph function in the survival library in R. All other analyses were performed with GraphPad prism. Differences were considered statistically significant when the two-tailed p-value was <0.05.

Patients
Patients diagnosed with PACa by biopsy at St. Marianna University Hospital from January 2003 to August 2014 were considered for inclusion in this study. Patients without exact information regarding their case and patients without standard therapy due to advanced age or severe comorbidities were excluded. Of the remaining patients, 121 had a Gleason score ≥8 and became the cohort assessed in this study.

Immunohistochemistry (IHC) and measurement of protein expression
Immunohistochemistry (IHC) was performed as described previously [20]. Briefly, for PEG10 and synaptophysin staining, antigen retrieval was performed using Antigen Retrieval Solution (pH 9.0) (Nichirei bioscience) at 95°C in a steamer for 40 min, followed by incubation with an anti-PEG10 or an antisynaptophysin antibody for 60 min. For CGA staining, the sections were incubated with an anti-CGA antibody for 60 min without antigen retrieval. For protein expression determinations, more than 500 cells in five high-resolution fields were evaluated and their H-score was calculated. Increased expression was defined as above median.