Integrated analyses for genetic markers of polycystic ovary syndrome with 9 case-control studies of gene expression profiles

Due to genetic heterogeneity and variable diagnostic criteria, genetic studies of polycystic ovary syndrome are particularly challenging. Furthermore, lack of sufficiently large cohorts limits the identification of susceptibility genes contributing to polycystic ovary syndrome. Here, we carried out a systematic search of studies deposited in the Gene Expression Omnibus database through August 31, 2016. The present analyses included studies with: 1) patients with polycystic ovary syndrome and normal controls, 2) gene expression profiling of messenger RNA, and 3) sufficient data for our analysis. Ultimately, a total of 9 studies with 13 datasets met the inclusion criteria and were performed for the subsequent integrated analyses. Through comprehensive analyses, there were 13 genetic factors overlapped in all datasets and identified as significant specific genes for polycystic ovary syndrome. After quality control assessment, there were six datasets remained. Further gene ontology enrichment and pathway analyses suggested that differentially expressed genes mainly enriched in oocyte pathways. These findings provide potential molecular markers for diagnosis and prognosis of polycystic ovary syndrome, and need in-depth studies on the exact function and mechanism in polycystic ovary syndrome.


INTRODUCTION
Polycystic ovary syndrome (PCOS), as a highly complex endocrine disorder, is usually comprised of phenotypical and heterogeneous reproductive effects [1] as well as metabolic symptoms [2][3][4][5].Thus, there is great genetic heterogeneity and different pathophysiological mechanisms of various PCOS phenotypes.The genetic heterogeneity, combined with the pronounced variability in the diagnostic criteria, makes the genetic study and susceptibility gene identification of PCOS particularly difficult.Although PCOS has been studied for decades, the genetic contributions to this disorder are not fully understood.Furthermore, lack of sufficiently large cohorts also reduces the power to identify specific genes of PCOS [6].
It has been reported that candidate genes associated with PCOS contribute to different biological processes and phenotypes [7].For example, as the association between obesity and PCOS [8], gene variants affecting fat mass, such as FTO [9], have been found to be involved in PCOS.The effect of common variants in TCF7L2 and KCNJ11 is likely to be mediated by the impairment of insulin secretion from β-cells, as a well-established pathogenic pathway in PCOS [10].The hyperandrogenemia of PCOS is most commonly characterized by increased testosterone levels, resulting from enhanced ovarian biosynthesis [11].Several genes (CYP11A, CYP19) involved in steroid biosynthesis pathways are also potential candidate genes for PCOS.However, the candidate genetic markers significantly contributing to the diagnosis and prognosis of PCOS needs to be further explored.

Research Paper
Recently, gene expression profiling with microarray and RNA sequencing has been used to discover gene markers and signaling pathways associated with various complex diseases [12].Gene expression datasets from PCOS patients and normal controls have been collected for the successful identification of gene expression signatures [13].However, as the extreme heterogeneity and small sample size among studies, the reproducibility of various studies is very low [14].If a larger sample size is not attainable, integrated analyses of multi-center collaborative studies become very useful [15].
In this study, to investigate the candidate diagnostic and prognostic genetic markers for PCOS, we performed comprehensive and statistical analyses on 9 case-control studies with 13 gene expression profiling datasets.Finally, we identified 13 genetic markers may be potential molecular factors for the diagnosis and prognosis of PCOS patients.Further functional studies of these candidate genes may improve the understanding and treatment of PCOS diseases.

Correlation of gene expression in various PCOS profiles
The expressing pattern for each dataset was obtained by the methods of calculating the expression variation score (EVS) [23].By clustering with Pearson, Spearman and Kendall correlations, the relationships for the 13 datasets were shown in Figure 2A, 2B and 2C, respectively.The coefficience values suggested that two datasets (GSE6798 and GSE8157) from PCOS muscles [20,21] were extremely consistent with each other and obviously apart from the other 11 datasets.The high consistency may be resulting from the specificity of muscle tissues and the same laboratory.Thus, we integrated these two datasets as Muscle2 and performed further analysis.Among the remained 11 PCOS datasets (Figure 2A-2C), there only three datasets from GSE48301 showed reasonable similarities.As heterogeneity of sampling tissues and low quality of datasets, the quality control should be performed before meta-analysis.

Quality control (QC) assessment in 11 datasets
To identify datasets with high quality and consistency, we carried out the quality assessment for the 11 PCOS datasets by the R package MetaQC [24].Six QC assessments including homogeneity of coexpression structure, accuracy and consistency of biomarkers detection with or without pathway information were calculated.As shown in Table 2 and Figure 3 with PCA biplots, the top five datasets (GSE43264, GSE48301_ eSF, GSE1615, GSE10946_obese and GSE48301_ eMSC) performed relatively well in most criteria, while GSE48301_eEP as borderline case.After excluded the bottom five datasets (GSE34526, GSE48301_eEN, GSE5090, GSE5850 and GSE10946_lean) with low quality assessment, there were 6 datasets, named as PCOS6, remained for the following analysis.

Identification of differentially expressed genes (DEG) by MetaDE
Further analyses for both groups of Muscle2 and PCOS6 were performed by the MetaDE package.The datasets in two groups were re-merged and filtered, respectively.From the detection competency curves (Figure 4A, 4B), most methods of meta-analysis were useful to detected common DEGs among different datasets, especially for the Fisher method.Under the criteria of q value less than 0.05, there were 869 and 287 DEGs were identified in Muscle2 and PCOS6, respectively (Supplementary Table S1, S2).The expression patterns of these DEGs were shown in the heatmaps (Figure 4C, 4D), suggesting that the common DEGs were almost consistency in each dataset.

Functional enrichment and signaling pathway analyses
Functional enrichment analyses were performed with the DAVID web server.We found that the 869 DEGs in Muscle2 datasets were particularly enriched in the muscle system, metabolic process and activity (Figure 5A, Supplementary Table S3), agreed with previous reports [20,21].Meanwhile, the 287 DEGs in PCOS6 datasets mainly focused on the response to stimulus from endogenous, hormone and steroid hormone (Figure 5B, Supplementary Table S4).
Further signaling pathway analyses showed that the DEGs in Muscle2 almost enriched in several types of cancers, junction pathway, and signaling pathways including B cell receptor, VEGF, Calcium and MAPK (Figure 5C, Supplementary Table S5).Then, the PCOS6 DEGs directly focused on the oocyte meiosis and oocyte maturation, and also enriched in cancer pathway, apoptosis, adhesion, adipocytokine, neurotrophin, mTOR and p53 signaling pathways (Figure 5D, Supplementary Table S6).
that four genes (SIAE, ICAM1, FN1, and FGF7) were significantly correlated with disease-free survival, while the other three genes (SIAE, FGF7, and PDGFRA) were significantly correlated with overall survival (Table 4).These findings further confirmed that the critical role of these candidate genes in PCOS.

DISCUSSION
PCOS is a highly complex endocrine disorder and affected by phenotypically heterogeneous.As its phenotypic heterogeneity and studies with small sample sizes, the power is low to identify specific genes for PCOS.
To increase the sample size and make powerful analysis, we performed the integrated and meta-analyses to improve the quality of gene association studies.Then, we identified 13 genetic markers may be potential molecular factors for the diagnosis and prognosis of PCOS patients.
Recently, microarray analysis of gene expression profiles has been widely used to identify genes and biological pathways associated with various complex diseases, including PCOS.However, previous studies have sampled different tissues from PCOS patients.The pathological factors and mechanisms in various tissues of PCOS patients may be similar.In this study, to identify the common PCOS-associated genes in multiple types of tissues, we carried out the integrated analysis and identified 13 common DEGs for PCOS.Taking ICAM-1 for an example, two SNPs of ICAM-1, G241R and K469E, have been found to be risk factors for PCOS [25].Moreover, ICAM1 K469E is associated with obesity and PCOS, according to serum triglyceride levels [26].As evidenced by gene ontology and pathway analyses, these DEGs in Muscle2 datasets directly involved in muscle mysion complex, myofibril, muscle system and muscle contraction.The DEGs in PCOS6 datasets mainly focused on oocyte pathways, including oocyte meiosis (PPP2R1B, RPS6KA6, MAD2L1, SGOL1, CAMK2D, IGF1, and ANAPC11) and progesteronemediated oocyte maturation (RPS6KA6, MAD2L1, IGF1, ANAPC11, PIK3R3, and CCNA2).In-depth functional studies of these candidate genes and signaling pathways may improve the understanding of PCOS.Along with next generation sequencing (NGS), increasing numbers of studies explored differences in ncRNA and epigenetic modifications (such as DNA methylation) between PCOS patients and normal controls.It has been reported that serum miR-21 is markedly increased in PCOS patients.Through targeting LATS1, miR-21 may promote PCOS progression and act as a novel non-invasive biomarker for the diagnosis of PCOS [27].lncRNA SRA has been found to be associated with PCOS, and may be an important mediator of adiposity-related processes in individual susceptible to PCOS [28].The global methylation of peripheral leukocyte DNA has no difference between PCOS patients and controls [29].The methods for genome-wide DNA methylation analysis at a single base pair resolution are evolving quickly [30].Therefore, it will be more feasible to carry out genome-wide studies with sufficient samples in different tissue types.
In conclusion, this meta-analysis successfully integrated gene expression datasets for PCOS.This process stabilized the effects of the studies' extreme heterogeneity and small sample sizes, and uncovered genes and biological pathways associated with PCOS.Finally, we identified 13 common genes among various PCOS tissues, providing novel and potential molecular markers for the diagnosis and prognosis of PCOS diseases.Further functional studies on these genes may improve understanding of the pathological processes of PCOS.

Collection and inclusion criteria of studies
Using the Gene Expression Omnibus (GEO) Microarray Search Tool (http://gbnci.abcc.ncifcrf.gov/geo/) [31] , we searched the GEO database for publicly available studies from its inception to August 31, 2016, using the following keywords: "Homo sapiens" (organism), "PCOS" or "Polycystic ovary syndrome" (study keywords), and "RNA" (sample type).After a systematic review, 22 GSE studies were collected.The inclusion criteria for studies were as follows: 1) patients diagnosed with PCOS diseases and normal controls; 2) gene expression profiling of mRNA; and 3) sufficient information to perform the analysis.Then, 13 datasets from 9 studies were collected for subsequent analysis (Table 1).Figure 1A provides details of the process of data collection and study selection.The workflow of data processing and analysis is illustrated in Figure 1B.

Dataset preparing
Thirteen gene expression datasets were downloaded from the GEO database, which were completed in five different platforms.Because different probes were used to detect gene expression in different platforms, the number of detectable genes varied across platforms (Table 1).
To merge these datasets from different studies, we used three functions in the R package MetaDE: MetaDE.match,MetaDE.merge, and MetaDE.filtering[15].First, the probe with largest interquartile range (IQR) among all probes annotated to the same gene was selected to represent the expression level of the gene.Second, we merged the gene expression profiles to maintain the commonly profiled genes across the 13 datasets.Finally, either un-expressed (10% genes with small mean intensity) or un-informative genes (10% small standard deviation) were filtered, and the remaining 8470 genes were retained for further analysis.

Statistical analyses
The ind.analysis function in the MetaDE package was used to compare gene expression between PCOS patients and normal controls in each dataset.We used moderated t statistics to evaluate statistical significance.
The transcriptome of each dataset was evaluated with the expression variation score (EVS), introduced by Marina Sirota [23].In each dataset, the EVS for the gene j is defined as EVS j =-sign(t j )log(p j ), where t j is the moderated t statistic for the gene j and p j is the p-value of the test.The scale and sign of the EVS represent the degree of significance and the 'direction' of the association, respectively.Then, we used the EVS vector to represent the pattern of the gene expression profile for each dataset.The consistency of the transcriptome tendency was estimated by the correlation coefficients of the EVS vectors between datasets [23].

Quality control
Before meta-analysis, the MetaQC package [24] was used to determine the inclusion/exclusion criteria for meta-analysis.The six quantitative quality control (QC) measures were calculated by MetaQC, and the principal component analysis (PCA) biplots and standardized mean ranks (SMR) were helpful to identify and exclude study with low quality before further meta-analysis.

Meta-analysis
The meta-analysis in each group was completed by MetaDE package.The q-values less than 0.05 for Fisher method, which were subjected to multiple-testing correction with the Benjamini-Hochberg method [32], were the critical to DEGs.The expression patterns in datasets of all DEGs were shown on heatmap plot.

Gene ontology and pathway enrichment analyses
The gene ontology and pathway enrichment analyses of interested gene sets were completed by DAVID web servers (https://david.ncifcrf.gov)[33].

Figure 1 :
Figure 1: The process of data collection, selection, processing, and analysis.(A) The process of data collection and selection; (B) The process of data processing and further analyses.

Figure 2 :
Figure 2: Correlation analyses of the relationships among 13 datasets based on gene expression variation profiles.(A) Clustering with Pearson; (B) Clustering with Spearman; (C) Clustering with Kendall correlation coefficience.The positive and negative correlations between pairs of datasets are shown as red and green respectively, and the size of node index the strength of correlation.

Figure 4 :
Figure 4: Results of DEGs by metaDE.(A) The detection competency curves for Muscle2; (B) The detection competency curves for PCOS6; (C) The heatmap plot of DEG expression profiles for Muscle2; (D) The heatmap plot of DEG expression profiles for PCOS6.The column with 0 on top stands for normal samples, while 1 stands for PCOS patients.

Figure 5 :
Figure 5: The enrichment analysis of the DEGs by DAVID.(A) The barplot of gene ontology enrichment for Muscle2 (green for molecular function, blue for cellular component, red for biological process); (B) The barplot of gene ontology enrichment for PCOS6; (C) Rich factor plot of pathway enrichment for Muscle2; (D) Rich factor plot of pathway enrichment for PCOS6.

Table 2 : Results of quality control measures and SMRs for 11 datasets
*P value not significant after Bonferroni correction (i.e.P > 0.05/# of studies).

Table 4 : Cox proportional hazard analysis of the 13 common DEGs for DFS and OS among patients with ovarian serous cystadenocarcinoma
Abbreviations: DFS, disease-free survival; OS, overall survival; PHA test, proportional hazards assumption test; HR, hazard ratio.* !PHA test P < 0.05 violates the hazards assumption.†Significant.