Biomarkers identified for prostate cancer patients through genome-scale screening

Prostate cancer is a threat to men and usually occurs in aged males. Though prostate specific antigen level and Gleason score are utilized for evaluation of the prostate cancer in clinic, the biomarkers for this malignancy have not been widely recognized. Furthermore, the outcome varies across individuals receiving comparable treatment regimens and the underlying mechanism is still unclear. We supposed that genetic feature may be responsible for, at least in part, this process and conducted a two-cohort study to compare the genetic difference in tumorous and normal tissues of prostate cancer patients. The Gene Expression Omnibus dataset were used and a total of 41 genes were found significantly differently expressed in tumor tissues as compared with normal prostate tissues. Four genes (SPOCK3, SPON1, PTN and TGFB3) were selected for further evaluation after Gene Ontology analysis, Kyoto Encyclopedia of Genes and Genomes pathway analysis and clinical association analysis. MIR1908 was also found decreased expression level in prostate cancer whose target genes were found expressing in both prostate tumor and normal tissues. These results indicated that these potential biomarkers deserve attention in prostate cancer patients and the underlying mechanism should be further investigated.


INTRODUCTION
Prostate cancer is still a serious threat to men in the world, with a top 2 ranked mortality rate in males, second to that of lung cancer [1,2]. Despite the wide application of radical prostatectomy, radiation therapy and androgen deprivation therapy in clinic [3], treatment outcome varies significantly across patients and is unpredictable [4]. Although prostate specific antigen (PSA) level and Gleason score were used for screening and evaluation of prostate cancer [5,6], the genetic feature, critical for diagnosis and prognosis for prostate cancer patients, is still not widely recognized [7,8].
Genome-scale screening is necessary to facilitate the understanding about the inner cause and progression of prostate cancer.
Genome-scale microarrays are used in cancer study as a powerful technology for years. Its high-throughput screening ability renders it a more ideal platform than traditional methods for researchers [9]. Based on microarray analysis, many genes were found related to prostate cancer, such as PIM1, AMACR, H3K27me3, and IL-15 [10][11][12][13]. However, poor reproducibility of results is still a potential problem and a comprehensive study integrating microarray data generated from different labs is necessary [14]. We utilized six independent genome-scale research datasets of prostate cancer patients and performed a twostage study to search and verify the candidate biomarkers for this disease. A total of 41 genes were founded with consistent lower expression level in tumor tissues than in normal tissues in our discovery stage research. Four genes (SPOCK3, SPON1, PTN and TGFB3) and a micro-RNA (MIR1908) were selected for exploring their potential roles in the progression of prostate cancer. We found that SPOCK3, SPON1, PTN and TGFB3 were significantly correlated with the progression-free survival (PFS) status of prostate cancer patients. The target genes of MIR1908 were predicted and were found transcribed actively in prostate tumor tissues and normal prostate tissues. Our study indicated that these five potential biomarker genes may play important roles in prostate cancer and underlying mechanisms could be further studied in the future.

Screening for candidate biomarkers in prostate tumorous and normal tissues
We chose four GEO datasets: GSE26910, GSE32448, GSE46602 and GSE55945 as our discovery-cohort to identify the difference of gene expressionprofile between prostate tumor tissues and normal prostate tissues (Supplementary Table 1). A total of 94  tumor samples and 63 normal prostate samples were  included in. GSE32448 contained 40 pairs of T-N tissue  from Rockville and the other three datasets consisted  of unmatched tumorous and normal tissues. The clinical  information of patients was listed in Supplementary  Table 2. We first explored the profile of differentially expressed genes among these four datasets and probes were defined as our measurement index. 1006 decreased expression level probes and 1466 increased expression level probes showing significantly different expression level were found in GSE26910. 1522 decreased expression level probes and 1168 increased expression level probes were considered as significantly differentially expressed probes in GSE32448. 15844 decreased expression level probes and 44 increased expression level probes in GSE46602, 7832 decreased expression level probes and 175 increased expression level probes in GSE55945 were also identified ( Figure  1A-1D). All these probes were divided into two groups depending on if they were up-regulated or downregulated. The intersection of two groups were regarded as candidate biomarkers in this step ( Figure 1E-1F).  Figure 1). There were 48 candidates in the down-regulated group and 0 candidate in the up-regulated group after our rigorous screening, all of which were converted to corresponding gene symbols in the next step.

Functional analysis of candidate biomarker genes
In order to explore the roles of these candidates, we first searched the known relationship between these genes and prostate cancer in PubMed. As mentioned above, a total of 41 genes were identified in the previous stage. We listed all the genes that were studied before in prostate cancer among our results (Supplementary Table 3). The expression level of candidate genes in our discovery cohort and their chromosomal locations were presented in Figure 2A.
Gene Ontology (GO) function cluster analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were used for further research. KEGG pathway analysis indicated that some candidates involved in Glutathione metabolism related pathway and Drug metabolism related pathway. Of note, Chemical carcinogenesis related pathway was significantly enriched according to the p-value in our analysis ( Figure 2B), indicating that these candidate genes may play important roles in the process of cancer. Enriched GO terms indicated that these candidate genes played important roles in metabolic process and enzyme related function such as nitrobenzene metabolic process and glutathione derivative biosynthetic process ( Figure 2C). We presented

Biomarker genes' expression level may indicate the prognosis for prostate cancer patients
Based on our screening criteria, all candidates showed lower levels in prostate cancer tissues than in normal prostate tissues and most of them were enriched in certain terms as indicated by GO analysis. This phenomenon is interesting and we then used an independent cohort to explore the relationship between these potential biomarkers' expression levels and prognosis in clinic. Data of 436 prostate cancer patients from TCGA database were used in this section, the detail clinical information of patients was listed in Table 1.
We found that SPOCK3 and SPON1 were significantly associated with prostate cancer patients' PFS. Patients with lower expression level of SPOCK3 showed worse PFS than those with higher SPOCK3 level ( Figure 3A, p-value < 0.0001, HR = 3.345, 95% CI: 1.787 -6.261). Patients with lower level of SPON1 had shorter PFS than patients with higher level of SPON1 ( Figure 3B These results indicated that SPOCK3, SPON1, PTN and TGFB3 may be chosen as potential prognostic biomarkers for prostate cancer patients in clinic.

Low expression level of MIR1908 was found in prostate cancer patients
MIR1908 showed lower level in prostate tumor tissues in our discovery cohort and the distribution plot were provided ( Figure 4A-4D). To our knowledge, the role of this microRNA in prostate cancer has not been reported. The structure of miR1908 was shown ( Figure 4E) and its potential targets were predicted (Supplementary Table  4). We found that many target genes of this microRNA showed expression in both prostate normal tissues and tumor tissues ( Figure 4F). These results indicated that underlying mechanisms about how this microRNA regulates its targets could be further studied.
A validation cohort for confirming results in prostate cancer patients SPOCK3, SPON1 and MIR1908 were found to be potential biomarkers in our discovery stage. We further used a validation cohort in this section to ensure reliability of these results. PTN and TGFB3 were considered as control biomarkers as previously reported. A dataset with 5 paired tumor-normal tissues of prostate patients examined in the same platform were included in this   stage. This dataset was published in an independent study (Supplementary Figure 6). Consistent with results from the discovery cohorts, all these potential biomarkers were downregulated in tumor tissues ( Figure 5A-5E). Another two datasets generated from a different platform were included in this stage, too. A consistent result was observed ( Supplementary Figure 7-8). These results indicated that SPOCK3, SPON1, MIR1908, PTN and TGFB3 may be treated as biomarkers for prostate cancer patients.

DISCUSSION
We conducted a genome-scale research to seek candidate biomarker genes of prostate cancer patients. The datasets of microarray were utilized in this study for it's high-throughput capability [15]. We considered genes with a consistent expression trend in the four discovery datasets as potential biomarkers and 41 genes were picked out finally. There were 24 genes that had been reported correlating with prostate cancer, among which TGFB3 [16], PTN [17], ID4 [18] were widely studied. These results proved that our methods were valid in another aspect. To our knowledge, a panel of genes, including SPOCK3 and SPON1, were found downregulated significantly in prostate cancer patients for the first time. It was interesting to note that SPOCK3, SPON1, PTN and TGFB3 played important roles, as suggested by function enrichment analysis. Downregulation of these three genes in prostate cancer patients may affect extracellular margin and space as well as growth factor activity. GSTM1, GSTM3 and ABCC6 were metabolism related genes whose abnormal regulation may contribute to prostate cancer progression.
To investigate the prognostic significance of these genes in patients with prostate cancer, we chose an independent cohort to explore the association between them and PFS of prostate cancer patients. These results demonstrated that SPOCK3, SPON1 were correlated with PFS of prostate cancer patients in this cohort. PTN and TGFB3 were also found to influence the PFS of patients with prostate cancer. However, GSTM1, GSTM3 and ABCC6 showed no significant effects on prostate patients' PFS (data not shown).
SPOCK3 (SPARC/osteonectin, cwcv and kazal like domains proteoglycan 3), also known as testican3, is a member of novel calcium-binding proteoglycan proteins family. Research indicated that the protein encoded by SPOCK3 might inhibit the activity of membrane-type matrix metalloproteinases [19] and was identified as a suppressor of tumor invasion [20]. SPON1 (spondin-1), also known as Vascular smooth muscle cell growthpromoting factor (VSGP), was found to be an inhibitor of angiogenesis [21]. The micro-vessel density of tumor may be affected by SPON1 according to a previous study [21]. PTN (Pleiotrophin) could restrain the differentiation of epithelial cells in vivo [17] and TGFB3 (Transforming growth factor-β 3) was shown to have tumor-suppressing function [16]. We found that MIR1908, which was found as a proliferation suppressor in NSCLC, [22], showed a lower expression mode in tumor tissues than in normal tissues.
There are some limitations in this study. First, we focused on screening of candidate biomarkers and the molecular functions of these biomarkers were not included in this study. Instead, the related Gene Ontology (GO) function analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were carried out and some possible mechanisms of these five biomarkers which we focused on were discussed in the discussion section. Moreover, the platforms of our validation stage were not exactly as same as those of the discovery stage and the sample size of this study was limited. However, we successfully validated the candidate biomarkers from our discovery stage despite the different platforms and these results were consistent in all datasets generated from different labs, suggesting the reliability of our conclusion.
In summary, we conducted a multiple-cohort research to seek the candidate biomarker genes for prostate cancer patients in a genome-wide scale. Fortyone candidates were all down-regulated in the first cohort of them, SPOCK3, SPON1, PTN and TGFB3 were associated with the prognosis of prostate cancer patients. Besides, MIR1908 was a potential biomarker for prostate cancer patients according to our result. The underlying mechanism would be investigated in the future.

Pre-process of data
We downloaded data files with probe values (.CEL files) from the GEO database. The raw data was read and pre-processed by AFFY package of R [23,24]. Mas5 algorithm combined with detection calls of multiple Perfect-Match (PM) and Mismatch (MM) probes were utilized in the process of background correction, quantile normalization as well as calculation of the expression level of each probe [25].

Candidate biomarkers screening
We performed t-test to evaluate the statistical significance of gene expression difference between tumor tissues and normal tissues [26]. A set of significant differential expression level probes (p-value < 0.05) were divided into tumor high expression group (HE) and tumor low expression group (LE). Intersection of HE or LE in the four discovery datasets was picked out as candidate biomarkers. Cluster analysis of these biomarkers in each dataset were performed afterward. The t-test analysis and presentation of our results were performed by basic package and gplots package of R respectively.

Function analysis biomarkers screening
We annotated candidate biomarkers and performed GO function cluster analysis as well as KEGG pathway analysis by Annotation, Visualization, and Integrated Discovery (DAVID, https://david.ncifcrf.gov/) database [27]. Circos software and ggplot2 package of R were used for results presentation [28].

Statistical analysis
SPSS software (version 18.0, SPSS, Chicago, ILand), Mathematica software (version 10.0, Mathematica, Chicago, Champaign) and GraphPad Prism (version 5, GraphPad Software Inc, San Diego, CA) were utilized in this study for statistical analysis and results presentation. Student's t-tests were used in comparing the difference between two groups. Kaplan-Meier analysis were utilized to analyze the PFS of two groups and Log-rank tests were