Systematic assessment of cervical cancer initiation and progression uncovers genetic panels for deep learning-based early diagnosis and proposes novel diagnostic and prognostic biomarkers

Although many outstanding achievements in the management of cervical cancer (CxCa) have obtained, it still imposes a major burden which has prompted scientists to discover and validate new CxCa biomarkers to improve the diagnostic and prognostic assessment of CxCa. In this study, eight different gene expression data sets containing 202 cancer, 115 cervical intraepithelial neoplasia (CIN), and 105 normal samples were utilized for an integrative systems biology assessment in a multi-stage carcinogenesis manner. Deep learning-based diagnostic models were established based on the genetic panels of intrinsic genes of cervical carcinogenesis as well as on the unbiased variable selection approach. Survival analysis was also conducted to explore the potential biomarker candidates for prognostic assessment. Our results showed that cell cycle, RNA transport, mRNA surveillance, and one carbon pool by folate were the key regulatory mechanisms involved in the initiation, progression, and metastasis of CxCa. Various genetic panels combined with machine learning algorithms successfully differentiated CxCa from CIN and normalcy in cross-study normalized data sets. In particular, the 168-gene deep learning model for the differentiation of cancer from normalcy achieved an externally validated accuracy of 97.96% (99.01% sensitivity and 95.65% specificity). Survival analysis revealed that ZNF281 and EPHB6 were the two most promising prognostic genetic markers for CxCa among others. Our findings open new opportunities to enhance current understanding of the characteristics of CxCa pathobiology. In addition, the combination of transcriptomics-based signatures and deep learning classification may become an important approach to improve CxCa diagnosis and management in clinical practice.


INTRODUCTION
Cervical cancer (CxCa) is currently the fourth most commonly diagnosed form of cancer and one of the leading causes of cancer-related mortality in women around the world [1,2]. Human papillomavirus (HPV) has been involved in the initiation and progression of approximately 99% of cervical tumors, in which HPV 16 and HPV 18 contribute approximately 70% [3][4][5]. In recent decades, thanks to vaccination and screening tests, CxCa-associated www.impactjournals.com/oncotarget/ Oncotarget, 2017, Vol. 8, (No. 65), pp: 109436-109456 Research Paper death rate has declined significantly. However, the great burden of CxCa still remains a critical issue, especially in developing countries [6]. Current treatment strategies have certain limitations and induce a wide range of side effects on CxCa patients [7,8]. Moreover, patients with cervical precancerous lesions have a five-year survival rate of nearly 100% [9], encouraging scientists to focus on studying the early-stage of CxCa carcinogenesis.
For a long time, microscopic biopsy image analysis has been the backbone of screening and diagnostic processes of CxCa [10]. However, this technique may not be a reliable measure since it is based on subjective observations. Although novel biomarkers have been increasingly studied to complement the limitations of standard cytological evaluations, the lack of suitable biomarkers for monitoring cancer progression of cervical dysplasia has remained a challenging issue and usually results in the use of improper treatments [11,12]. These challenges may stem partially from the lack of profound understanding of the mechanisms of cancer initiation and progression [13]. In addition, the regulations of many CxCa driver genes remain comparatively unknown. Clinically accurate biomarkers or gene expression signatures as well as more systematic approaches, such as supervised learning meta-analysis based frameworks, are still needed. Eventually, it is important to place more effort on discovering and validating potential biomarker candidates, individuals or panels, to improve diagnostic and prognostic assessment.
Meta-analysis and cross-study normalization are two fundamental approaches to integrating data from different microarray data sets. Meta-analysis combines data at the "late stage", while cross-study normalization combines data at the "early stage" [14]. Cross-study transformation and normalization merges data from multiple microarray studies of a common organism and phenotype, removes non-biological differences, also known as batch effects, and then increases the sample size and retains high prediction accuracy of the machine learning-based class assignment analysis [15]. In addition to cross-study normalization, meta-analysis provides a flexible and powerful approach by integrating different microarray data sets and platforms to increase the statistical power, reliability, and generalizability of the results. Therefore, the meta-analysis of genomewide gene expression to identify corresponding molecular mechanisms of cervical carcinogenesis has the potential to improve the prediction of risk, diagnostic decision, prognostic evaluation, and treatment. This method also generates a reasonably complete picture of differentially expressed genes and pathways involved in cancer [16].
In the current study, we integrated available Affymetrix-based microarray data of CxCa patients and conducted a comprehensive meta-analysis on differentially expressed genes, pathway enrichment analysis, and network analysis. Genetic panels that were associated with the initiation and progression of CxCa and their impacts on CxCa diagnosis were proposed and examined. The diagnostic analysis was conducted using a state-of-the-art feed forward deep learning technique. The impact of individual genes on the survival of CxCa patients was further examined using survival analysis of available data sources. Finally, novel candidates were introduced for further investigations of CxCa pathobiology.

Gene expression microarray data set identification and selection
The overall search workflow, data processing, and data analysis of this study are presented in Figure 1. After searching in Gene Expression Omnibus (GEO) and ArrayExpress, we retrieved 640 and 264 records, respectively. We then removed 182 duplicate records between these two databases and obtained a final total of 722 records for screening. Seven hundred fourteen records were excluded because they met one or more predefined exclusion criteria. It is worth noting that GSE39001 and 29570 were excluded because they overlapped with GSE52903. We assessed full-text articles of the remaining eight data sets to evaluate the eligibility of each data set. As a result, GSE75132 was excluded from further investigation because the study combined CIN 3 and cancer samples into one group (CIN 3+). We also conducted a manual search of the reference lists of included data sets, and GSE42764 was found to qualify our inclusion criteria. Therefore, we finally assessed eight data sets to carry out the meta-analysis.
Among the eight data sets, we classified and arranged the samples into three main groups: Cancer versus Normalcy, CIN versus Normalcy, and Cancer versus CIN, which contained eight data sets, three data sets, and two data sets, respectively. Cancer versus Normalcy implied the "overall" alteration in genomewide gene expression, and we observed multistep carcinogenesis mainly by adding CIN versus Normalcy and Cancer versus CIN groups. Cancer versus Normalcy was the group containing the largest number of samples, 93 normal and 202 cancer samples, followed by CIN versus Normalcy group, containing 46 normal samples and 115 CIN samples. The number of samples of Cancer versus CIN group was smaller than that of the two above groups, containing 83 CIN and 49 cancer samples, respectively. The characteristics of the included data sets are shown in Table 1. Finally, the gene expression of the high-grade squamous intraepithelial lesion (HSIL) versus low-grade squamous intraepithelial lesion (LSIL) in CIN group was also investigated.

Meta-analysis on differentially expressed genes in cervical cancer
We identified 5,679 significantly differentially expressed genes (DE genes) in the meta-analysis across eight data sets from Cancer versus Normalcy group. Among the DE genes, 2,877 genes were upregulated (125 genes having the combined effect size > 2) and 2,802 genes were downregulated ( Since a network-based analysis approach can be applied for discovering biomarker candidates and potential therapeutic targets [17], we conducted a network-based meta-analysis to identify the most potential hub genes that may be considered as important genes in CxCa pathobiology. According to the analysis, HDAC1 was considered as the most potential hub gene for Cancer versus Normalcy group. Its degree of centrality (DC) and betweenness centrality (BC) were 209 and 582,340.

Pathway enrichment analysis for detecting biologically meaningful processes of CxCa carcinogenesis
To further identify the biologically meaningful pathways that were involved in CxCa from the DE genes, we performed pathway enrichment analysis. The analysis was performed for upregulated and downregulated DE genes separately for each comparison group. The significantly enriched pathways were considered if they met the qualification criteria of P-value < 0.05 and false discovery rate (FDR) < 0.2. In the upregulated gene groups, Cancer versus Normalcy group exhibited 21 significantly enriched pathways. The CIN versus Normalcy group consisted of 10 significantly enriched pathways, while only four enriched pathways were found in Cancer versus CIN group. Interestingly, cell cycle appeared to be the common and most significantly enriched pathway in all comparison groups. Notably, the cell cycle genes of the three different comparison groups were not completely overlapping. This result may come from the heterogeneity in different stages of the disease. In addition, RNA transport and mRNA surveillance pathways were two others common significant pathways among the three groups. Although one carbon pool by folate, pyrimidine metabolism, and p53 signaling pathways were only significant in Cancer versus Normalcy and CIN versus Normalcy comparison groups, these dominant pathways may also be important in CxCa carcinogenesis. Finally, when considering the aberrantly expressed genes in CIN 2 & CIN 3 (HSIL) relative to CIN 1 (LSIL) using available data from GSE27678 and GSE63514, we discovered the upregulation of cell cycle genes (CDC6, CDC7, CDK4, CDC20, CDKN2C, MCM6, MCM5, MCM3, MCM2, PRKDC, BUB3, and PCNA) and one carbon pool by folate-related genes (TYMS, SHMT2, and DHFR). This observation, once again, emphasized the important roles of the cell cycle and one carbon metabolism in CxCa initiation and progression. It is important to note that there were many genes that are known to be associated with cancer, as shown in the "pathways in cancer" enriched pathway. In the same pattern, we investigated enriched pathways from downregulated genes in the three groups. We obtained 15 enriched pathways in Cancer versus Normalcy group, including several well-known pathways, such as the calcium signaling pathway, Wnt signaling pathway, Hedgehog signaling pathway, and linoleic acid metabolism, as well as gap junction and focal adhesion pathways. Surprisingly, our results indicated a poor number of enriched pathways in CIN versus Normalcy and Cancer versus CIN groups. Furthermore, the Wnt signaling pathway and Hedgehog signaling pathway seem enriched in CIN versus Normalcy and Cancer versus CIN groups although its FDR was greater than 0.2. The mammalian circadian rhythm pathway was the only pathway significantly enriched in CIN versus Normalcy group while the amoebiasis pathway was the unique pathway in Cancer versus CIN group. Table 2 and Supplementary File 3a and 3b show major enriched pathways and their corresponding genes in corresponding comparison groups.

Functional analysis of individual genes that are consistently up-and downregulated
The total number of significantly upregulated genes from the meta-analysis of Cancer versus Normalcy group was 2,877, that of CIN versus Normalcy group was 1,153, and that of Cancer versus CIN group was 994. Using the filter provided by InteractiVenn [18], we obtained 248 overlapping DE genes, which were always upregulated in more aggressive subgroups (i.e., Cancer subgroup in Cancer versus Normalcy group, CIN in CIN versus Normalcy group, and Cancer in Cancer versus CIN group). The total number of significantly downregulated genes of Cancer versus Normalcy, CIN versus Normalcy, and Cancer versus CIN groups were 2,802, 836, and 992, respectively. Using the same approach, 122 overlapped genes were found to be always downregulated in more aggressive subgroups.  To reduce "false positive" candidates and obtain a more reliable list of up-and downregulated candidate genes, we performed an additional validation for the above overlapping DE genes and DE genes from the metaanalysis of two independent data sets. In this validation, we utilized the DE genes in the comparison between 10 CxCa cell lines (N = 21) and normal tissue from two data sets (GSE29216 and GSE9750) in which the DE genes were identified using the same statistical approach. Finally, 113 upregulated genes were confirmed to be consistently upregulated ( Figure 2a). Similarly, 55 consistently downregulated genes were also observed ( Figure 2b). Those genes may be considered the intrinsic genetic characteristics for determining cervical carcinogenesis, and the panel can be seen in Table 3.
The protein-protein network of consistently upregulated genes as well as known and predicted interactions between the nodes curated by STRING are shown in Figure 2c (113 nodes and 401 edges). There was a strong connection of known and predicted interactions arising from most proteins in the network. In addition, the top 10 enriched GO biological processes and the first enriched KEGG pathway were closely associated with cell cycle. Enriched GO terms were mitotic cell cycle (GO:0000278), nuclear division (GO:0000280), and DNA repair (GO0006281), among others. Ten cell cycle related proteins annotated in KEGG were CDC25B, DBF4, E2F3, ORC6, MCM3, BUB1, RAD21, MAD2L1, CDC23, and BUB3.
This observation further emphasized the importance of the cell cycle in CxCa carcinogenesis. Another enriched KEGG pathway was progesterone-mediated oocyte maturation. The protein-protein network of consistently downregulated genes was simpler than that of upregulated genes (54 nodes and 14 edges). HPGD, CRYL1, ECHDC2, ACAA1, ALDH2, and FHIT were strongly connected, and these proteins are associated with metabolic processes (Figure 2d). Small molecule metabolic process (GO:0044281), cellular lipid metabolic process (GO:0044255), single-organism biosynthetic process (GO:0044711), and singleorganism cellular process (GO:0044763) were enriched, which demonstrated that there might be a decrease of some metabolic processes, especially lipid metabolism, in CxCa initiation and progression.

Text mining analysis for the interpretation of the selected genetic panel
To determine whether our obtained candidates were frequently reported together by previous cancerrelated studies, we utilized 113 upregulated genes and 55 downregulated genes as inputs for the CCancer text mining database. For upregulated genes, our input list was significantly enriched in previous cancer-related studies, illustrated by the fact that many genes on the list were reported to be significantly associated with several cancers, including CxCa (P-value < 0.05). Twenty genes were reported in two previous investigations on CxCa. Their corresponding functions were extracted, and known functions in CxCa were reviewed (Table  4). BUB1, DBF4, DKC1, LSG1, NUP155, POLR2H, and ZNF473 are novel targets for further studies since their roles in CxCa initiation and progression are unknown. On the other hand, we did not achieve any significantly downregulated genes that were associated with CxCa. In terms of other tumors, the text mining results indicated that our provided upregulated genes were also correlated with, but not limited to, breast, skin, colorectal, thyroid, ovarian cancer, lymphoma, myeloid leukemia, neuroblastoma, and retinoblastoma. Contrarily, our downregulated genes were associated with melanoma and "wound healing in the elderly". The detailed information of text mining results and the matched genes for each condition can be found in Supplementary File 4. These results may provide better insights into the essential molecular mechanisms of not only CxCa but also multi-cancer interactions.   FOXM1, RFC4, ORC6, NCAPG2, KIF23, TRIP13, CENPN, KIF14, TYMS, AARS, DONSON, E2F3, RNPS1, BRIX1,  TMEM194A, MELK, MCM3, MOCOS, BUB1, ECT2, CDC25B, DEPDC1, FBXO5, POLR2H, RAD1, CDK1,  RAD54L, BUB3, SUPV3L1, DBF4, NETO2, CENPI, SLC7A6, CMC2, MEST, BLM, ZNF281, ACOT9, MAD2L1,  DDX11, PDIA5, ELAVL1, NUP155, RUVBL1, WARS, SS18L1, MTHFD2, LRRC8D, MSH6, IMMT, LHX2, RAD21,  EIF2S2, MDC1, NUP85, C5orf22, CDC23, UCK2, PTDSS1, UBA2, ZNF473, BRCA1, TMEM22, ADAR, NUP160,  ACP1, WBP11, DIEXF, C9orf91, NVL, PTPLAD1, C3orf37, WASF1, HPRT1, ACTL6A, PPAT, DKC1, POGK,  MTFR1, ELF4, HAUS6, TEX10, USP18, PRKCI, TNPO1, ARL6IP1, KIAA1598, NFATC2IP, KIAA0947, PARP12,  NUCKS1, PARPBP, TAF5, CRKL, GOLT1B, CEP152, SLC25A17, HSP90AA1, HERC5, NSL1, FN3KRP, IFI30,  LSG1, PALB2, MTMR4, PSMA6, SFMBT1, TTC13, DAP3, TRIM45 In Cancer versus Normalcy group, the random search for hyperparameters revealed that an optimal model had five hidden layers, l1 regularization of 1.71E-4, and l2 regularization of 1.75E-4 (5, 1.71E-4, 1.75E-4). Similarly, the optimal models of Cancer versus CIN and CIN versus Normalcy were determined to be (5, 6.27E-4, 3.78E-4) and (2, 7.17E-4, 7.75E-4), respectively. Receiver operating characteristic (ROC) curves of the diagnostic ability of the three above two-class classifications are illustrated in Figure 4. The results indicated that deep learning models could distinctly differentiate the benign from malignant conditions. However, it can be argued that the use of the 168-gene signature for classification may show overoptimistic results since the signature was extracted by the meta-analysis using partly overlapping data sets. Hence, to search for a generalized and data-driven classification signature, we first performed the variable important measurement using the area under the curve permutation random forest variable importance measurement (AUC-RF VIM) on the training sets, selected the top 30 genes with the highest importance scores, and then built three new deep learning models. Supplementary File 5 contains the list of selected genes for the classification experiments, and they are markedly different among the three comparison groups. As expected, the new models achieved results comparable with the models derived from the 168-gene signature, and these classification models accurately discriminated the Cancer from CIN, Cancer from Normalcy, and CIN from Normalcy groups. The accuracy, sensitivity, and specificity of training sets with 10-fold cross-validation (combined holdout predictions, calculated from the confusion matrix) and test sets can be found in Table 5.
Finally, for a quick assessment of the class prediction of the ensemble model that combined four other powerful classifiers (random forest, support vector machine, prediction analysis for microarrays, and k-nearest neighbor algorithms), we performed the classification analyses for the three comparison groups (Cancer versus Normalcy, Cancer versus CIN, and CIN versus Normalcy) with whole transcriptome data. Regarding Cancer versus Normalcy group, we possessed a model from 40 frequently selected genes for distinguishing cancer patients from normal people (sensitivity = 88.2% and specificity = 95%), with an average accuracy of 92.8%. For discriminating Cancer from CIN patients, we achieved a model from 68 frequently selected genes (sensitivity = 85.5% and specificity = 71.4%), with an average accuracy of 80.3%. Finally, the differentiation between CIN and Normalcy group was practical since the prediction model from 54 frequently selected genes achieved a sensitivity = 73.9%, specificity = 84.3%, and average accuracy = 81.4%. Supplementary File 6 contains the list of selected genes, PCA of gene expression data, and heatmap visualization of top-ranked genes and their corresponding z-scores. Collectively, our results noted that the classification between the benign and the malignant conditions using small subsets of genes were practical. Cyclin B1, a regulatory subunit of CDK1 and a crucial protein for the transition from G2 phase to mitosis of the cell cycle, is found to be overexpressed in invasive cervical cancer cells [78]. The Human Papillomavirus E6 oncoprotein abolishes cell cycle checkpoints, inducing polyploidy, an early step in the carcinogenesis of cervical cancer. And the upregulation of CDK1 was observed in this process [79].

DBF4
DBF4 zinc finger 1.99 1.34 0.91 DBF4 has not been described to be associated with cervical cancer.

ECT2
Epithelial cell transforming 2 3.10 1.46 1.40 The high expression of ECT2 in the region 3q may be implicated in cervical oncogenesis [80]. High levels of FOXM1 expression were observed in cervical cancer. Its overexpression was correlated with tumor aggressiveness and the presence of cell proliferation indicator Ki67 [82,83]. The overexpression of FOXM1 was associated with the progression and agression of cervical squamous cell carcinomas by enhancing cell proliferation [82], promoting malignant cell migration and invasion [84]. The upregulation of RFC4 was observed in cervical cancer [91]. A study found that RFC4 may be used as a predictor of cancer relapse and survival rate in patients with cervical carcinoma [94].

KIF14
ZNF473 * Zinc finger protein 473 1.28 0.90 0.62 ZNF473 has not been described to be associated with cervical cancer. * The gene of which its expression pattern is associated with the prognosis in TCGA cohort (P-value < 0.05).

Prognostic assessment of the consistently up-and downregulated genes
From the list of 113 consistently upregulated and 55 consistently downregulated genes, we performed a survival analysis to find individual genes that may be associated with the prognosis of CxCa patients. The prognostic assessment was conducted using TCGA cohort. It turned out that the higher gene expression of six genes, ZNF281, DIEXF, POGK, TNPO1, GOLT1B, and COG2, among consistently upregulated genes and the lower gene expression of three genes, SYNGR1, FGFR2, and EPHB6, among consistently downregulated genes, were associated with poor patient prognosis ( Figure 5). These mentioned genes may be considered to be associated with the advanced stage of CxCa, as well as potential biomarker candidates to improve the prognostic assessment. In addition, Cox regression analysis revealed that ZNF281 was the only candidate gene that appeared to be related to poor CxCa prognosis (Cox coefficient = 0.53, adjusted P-value < 0.1). Collectively, ZNF281 may be a key regulator of CxCa initiation and progression. It also exhibits a good potential for both diagnostic and prognostic assessment and as a therapeutic target. Therefore, we further assessed the protein expression level of ZNF281 using tissue microarray samples of an independent cohort of CxCa patients and normal controls. As shown in Figure  6, the protein expression level of ZNF is significantly higher in CxCa samples than that of normal samples (two-tailed P-value < 0.0001). On the other hand, EPHB6 gene expression level may become the potential candidate for good CxCa prognosis (Cox coefficient = -0.45, adjusted P-value < 0.1). www.impactjournals.com/oncotarget DISCUSSION One of the main purposes of this study was to investigate the relevant biological processes specifically related to the initiation, invasion, and progression of CxCa with better confidence compared to individual studies through an integrative systems biology approach. Therefore, we first performed a random effect modelbased meta-analysis to detect DE genes with a higher robust level. Functional analysis of the DE genes of each comparison group revealed that there were many significantly enriched pathways when a normal cell becomes cancerous. Interestingly, cell cycle, mRNA surveillance, and RNA transport pathways were always significantly enriched in the more malignant stages of CxCa carcinogenesis (from Normalcy to CIN and from CIN to Cancer). These results suggest that these pathways perturbations are strongly correlated with CxCa initiation and progression. Notably, cell cycle has a known central role in cancer carcinogenesis and metastasis [19,20]. Cell cycle-related genes, including CDKN2A and MCM2, are overexpressed in HPV (+) CxCa. p16 INK4a (a product protein of CDKN2A) overexpression has been known to be associated with high grade precancerous and CxCa lesions and used as a potential biomarker for identifying low-grade lesions associated with high risk carcinomagenesis [21,22]. However, a recent metaanalysis showed that the overexpression of p16 INK4a was associated with better prognosis of CxCa patients [23]. Other studies demonstrated that HPV infection alters the cell cycle and promotes cervical oncogenesis [24][25][26][27]. It is also worth mentioning that cell cycle-associated genes are significantly upregulated more in HPV (+) CxCa than in HPV (-) CxCa [28]. The upregulation of a cell cycle subnetwork with highly frequent alterations of its regulatory genes in CxCa has also been described [27]. Additionally, lncRNA Hox transcript antisense intergenic RNA (HOTAIR) promotes metastasis by increasing cell proliferation, migration, and the invasion of cancer cells, and the CxCa patients with high levels of HOTAIR often have a poorer prognosis. The pathway analysis of the DE genes (with logarithm fold change > 2) using GEO2R of GSE67522 revealed that the cell cycle is also significantly enriched by the contribution of CDC6, RBX1, SKP1, CCNE1, CDK2, CUL1, and MYC. Using network-based meta-analysis, we identified HDAC1 of the cell cycle pathway as the prominent hub gene in Cancer versus Normalcy and CIN versus Normalcy groups, while CREBBP was the prominent hub gene, among others, in Cancer versus CIN group. HDAC1 was reported to be  more upregulated in cervical dysplasia and carcinoma than in normal cells [29]. The overexpression of HDAC1 was also associated with cell proliferation in breast cancer, cancer stem cells, and prostate cancer [30][31][32][33]. Mutation of CREBBP was described in relapsed acute lymphoblastic leukemia [34]. However, our analysis revealed that the gene expression levels of HDAC and CREBBP were not correlated with the prognosis of CxCa patients in TCGA cohort.
Recently, the association between cell cycle and metabolism in the survival mechanism of cancer cells has been demonstrated [35]. Moreover, the one carbon pool by folate pathway, which was upregulated in Cancer versus Normalcy and CIN versus Normalcy groups in our study, has been noted as a predominant pathway in cancer cell survival and progression for many years [36]. TYMS, SHMT2, and DHFR genes of one carbon metabolism are proven to contribute to genome instability and cancer development [37]. MTHFD2, another gene associated with one carbon metabolism, is considered to be an important player for various types of cancer and correlated with mitochondrial folate metabolism [38]. Moreover, the important roles of MTHFD2 in cell proliferation, DNA synthesis control, and cell migration in cancer have been confirmed [39,40]. Thus, the connection between cell cycle and one carbon metabolism in CxCa is an attractive target for prospective studies. RNA transport and mRNA surveillance are the two other major pathways of CxCa pathobiology in our study. The association between these pathways and viral genome replication in HPV (+) CxCa has been documented [41,42]. The viral oncogenes and their proteins expressions (E5, E6, and E7, for example) increase the cell proliferation, inhibit apoptosis resulting from DNA damage, and enhance malignant transformation [43]. However, the fundamental regulatory mechanisms of RNA transport and mRNA surveillance associated genes remain to be explored. Although the use of high-throughput gene expression profiling for disease diagnosis is promising, the clinical and translational potential of this approach has been limited due to the failure of cross-study validation [44]. Nevertheless, we extended the capacity of gene expression profiling on CxCa and CIN diagnosis by combining independent data sets using the Empirical Bayes crossstudy normalization method and conducting deep learning classifications on the relatively large sample size of transcriptome data of CxCa, CIN, and normal samples. The results demonstrate that the patients with CxCa can be accurately diagnosed by genetic classification panels using deep learning techniques and by the ensemble supervised learning tools. Noticeably, the benign and the malignant conditions are differentiated just by a small subset of gene expression information. This differentiation may help improve the utility of the transcriptome-based diagnosis in clinical practice. In addition, gene expression patterns from tumors may be used for better patient management, as is the case, for example, in breast cancer. Further studies are guaranteed to improve the classification performance by either increasing the sample size, and not limiting the study to Affymetrix-based microarray data, or improving the predictive power of the supervised learning models.
In the prognostic assessment, ZNF281, DIEXF, POGK, TNPO1, GOLT1B, COG2, SYNGR1, FGFR2, and EPHB6 are found to be associated with the prognosis of CxCa patients. Notably, those genes are consistently upand downregulated in a multi-stage manner; thus, they can be considered to be associated with tumor development and aggression. Nevertheless, the precise roles of these genes in regulating CxCa remain unknown. DIEXF is known as an alternative polyadenylation-indicator gene in non-small cell lung cancer, and its expression is correlated with cell proliferation [45]. A study indicated the presence of a TNPO1-IKBKB (IKK-beta) fusion in prostate cancer but not in benign tissue [46]. The shortening of 3′ UTRs of SYNGR1 was associated with poorer prognosis in triple-negative breast cancer [47]. FGFR2 was involved in increased risk of breast cancer [48]. To the best of our knowledge, the roles of POGK, GOLT1B, and COG2 in carcinogenesis have not been identified.
Among the candidates, the dysregulation of ZNF281 at the protein level in CxCa was confirmed in our study.
Zinc finger proteins play a central role in diverse biological processes, including monitoring gene expression. The regulation mechanisms of those proteins in cancer progression vary among different types of cancer or even in the same cancer at different stages [49]. ZNF281, a zinc finger transcription factor, belongs to the C2H2-type zinc finger motif, a subfamily of zinc finger proteins. Although the known functions of ZNF281 in cancer biology have been limited, recent studies have provided new insights into the function of ZNF281 in EMT and its association with the WNT signaling pathway [50,51]. ZNF281 can serve as an EMT-inducting transcription factor (EMT-TF), along with SNAIL, SLUG, TWIST1/2, and ZEB1/2, which induces EMT transcriptional changes. Furthermore, the expression of ZNF281 is directed by SNAIL, an EMT inducer, and in turn, required for the transcription of SNAIL. In addition, ZNF281 itself regulates a number of EMT-related effector genes [52]. The abovementioned evidence combined with our findings suggest that ZNF281 may be considered as a new prognostic and therapeutic target for the management of CxCa. However, the precise role of ZNF281 in CxCa initiation and progression remains to be elucidated.
Our study has several limitations. Firstly, the distinct gene expression biosignatures of cervical adenocarcinoma (five samples) and cervical adenosquamous carcinoma (one sample) were unable to be explored. Secondly, we did not examine the effects of HPV to CxCa carcinogenesis, but instead, focused on the general process in the multi-stage manner. Finally, the combination of CIN1, CIN2, and CIN3 on behalf of cervical precancerous lesions may introduce biases since they are associated with different subtypes of HPV (CIN1 and CIN2 mainly with intermediate risk HPV genotypes while CIN 3 mainly with HPV16).
Better strategies for diagnosis and prognostic assessment of CxCa can be achieved by improving our understanding of the formation and development of the disease. Although the mechanisms of CxCa carcinogenesis, especially HPV integration-driven cervical carcinogenesis have been studied, the alterations of genes in the dysregulated processes and their effects have not been intensively scrutinized. In the current study, we conducted a comprehensive analysis and suggest new evidence on the mechanisms of cervical oncogenesis and metastasis. Our findings also indicated that the overexpression of cell cycle, one carbon pool by folate, RNA transport, and mRNA surveillance pathways play critical roles in the multi-stage development of CxCa. Genetic panels and a deep learning classification approach, which can be used to improve the diagnosis of CxCa, were proposed. We also introduced ZNF281 as a novel biomarker for the prognostic assessment of CxCa and a new potential therapeutic target. Other promising genes that may be significantly important to CxCa progression were also identified. In conclusion, the current study provides opportunities to improve current understanding of CxCa pathogenesis and hopefully helps improve CxCa patient outcome.

Study design, search strategy, and selection criteria of the meta-analysis
A systematic search of Gene Expression Omnibus (GEO) and ArrayExpress was performed using the following query structure: "("cervical" OR "cervix") AND ("cancer" OR "cancers" OR "cancerous" OR "neoplasm" OR "neoplasms" OR "neoplastic" OR "tumor" OR "tumors" OR "tumour" OR "tumours" OR "tumorous" OR "tumourous")". Our results covered all available data sets published up to December 2016. Identified data sets were evaluated for eligibility by at least two authors. Disagreements among reviewers were resolved by consensus with other reviewers. A data set was included if it contained Affymetrix-based microarray data and the study design met the following criteria: the data set provided clear definitions of analyzed samples; the data set contained at least one comparison among Cancer versus Normalcy, Cancer versus CIN, or CIN versus Normalcy groups. We excluded data sets for the following reasons: no relevant item found; the study was carried out from a cohort without a control group; the results contained overlapping data in other data set(s); and the studied subject was not human. Only the largest data set was included among overlapping data sets. We also manually searched the reference lists of the original studies of included data sets to find other relevant data sets.

Data pre-processing, gene expression metaanalysis, and pathway enrichment analysis
In this study, data preprocessing, gene expression analysis and meta-analysis, and functional analysis were conducted by the published protocol of NetworkAnalyst with minor modifications in data set production and data normalization processes [17]. To normalize gene expression data of included data sets, we applied a robust multi-array analysis method using the Bioconductor Affy package [53,54]. In addition, Database for Annotation, Visualization, and Integrated Discovery (DAVID) online software was used to convert the gene labels to Entrez IDs for unsupported probe platforms from NetworkAnalyst prior to the analysis [55]. A random effects model was utilized for the meta-analysis. Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation was applied for pathway enrichment analysis to detect corresponding biologically enriched pathways. Potential hub genes in network analysis were detected by betweenness centrality (BC) and degree centrality (DC) of the first-order network. Protein-protein networks of the consistently up-or downregulated genes were derived from the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database version 10.5 [56]. In the protein-protein networks, known interactions from curated databases and experiments were presented along with predicted interactions (minimum required interaction score = 0.4). Functional enrichments were detected using KEGG and Gene Ontology (GO) biological process where applicable (false discovery rate < 0.05).

Text mining for gene list interpretation
CCancer web-based software was applied to examine whether our reported gene list was significantly enriched in terms of the intersection with previously reported gene lists for other biological processes [57]. This approach is based on the hypothesis that significant similarities among gene/protein lists may indicate potentially be a good indicator of similarity in molecular mechanisms among corresponding biological processes.

Data exploration and visualization
Principal component analysis (PCA) is an orthogonal transformation analysis that aims to reduce the dimension of the data but at the same time minimizes information loss. In this study, we applied PCA to examine the tendency of separation among samples that belong to different groups. A heatmap was also used to visualize the data of selected features. The analysis and visualization were performed using FactoMineR, factoextra 1.0.4.999, and MetaboAnalyst 3.0 [58][59][60]. A Venn diagram was produced using InteractiVenn [18].

Variable importance measurement and class assignment analysis
The heterogeneity of the data sets from different platforms was adjusted using the Combat method to integrate different data sets covering the same biological condition into one unique data set [61]. The corresponding data sets were later split into training and test sets using the caret package version 6.0-76 [62]. For Cancer versus Normalcy group, the splitting ratio was 50:50 while the splitting ratio for Cancer versus CIN and CIN versus Normalcy groups was 70:30. Variable importance measurements of the training sets were conducted using the area under the curve permutation random forest variable importance measurement (AUC-RF VIM) of party package version 1.2-3 [63]. The top 30 genes of each data set were introduced to the deep learning-based classification independently with the 168-gene panel.
The deep learning analyses were executed using H2O package version 3.10.3.6 in R environment version 3.3.3 [64]. For training the predictive models, the adaptive learning rate method for stochastic gradient descent optimization was used as a default. We carried out the random hyperparameter search for the number of hidden layers, l1 regularization, and l2 regularization. The number of layers was set from two to five, and each layer contained 200 neurons. The l1 regularization and l2 regularization searches were conducted using the sequence function containing the values from 0.00 to 1.00E-3 by 1.00E-6. The search criteria strategy was set as "Random Discrete" for searching all the combinations of the number of layers, l1 regularization, and l2 regularization. Other parameters were set empirically or by the default settings. The number of epochs was adjusted at 100. The Rectifier linear activation function was applied for a fast and accurate training process. Additional regularization methods, including dropout (hidden dropout ratios = 0.5) and early stopping (logloss stopping metric, stopping tolerance = 0.01, and stopping round = 5) were employed. The training sets were used for hyperparameter tuning processes via 10-fold crossvalidation. The performances of the fine-tuned models with the optimal hyperparameters were then measured on the test sets. The cut-off of the classification was the corresponding value that optimize F1-score at default.
Ensemble class assignment analyses on the corresponding data sets of the whole transcriptome information were conducted using ArrayMining online software [65]. In brief, the data set was introduced to the Class Assignment Module with the settings of the ensemble method for feature selection. The ensemble method is a combination of three different univariate filters (Pearson correlation filter, signal-to-noise-ratio filter, and F-score filter) to make the ensemble feature ranking. An ensemble prediction method that combines random forest, support vector machine, prediction analysis for microarrays, and k-nearest neighbor algorithms were applied. This combination is supposed to provide a more robust classification analysis. For model evaluation, we performed 10-fold cross-validation. The maximum feature subset size was set at the default (30 features). The performance of the prediction models was appraised using accuracy, sensitivity, and specificity.

Prognostic analysis
We investigated the prognostic ability of selected genes using Cancer Genome Atlas (TCGA) data set. mRNA-based Cox regression analysis of a cohort of 264 TCGA patients were extracted using Oncolnc [66]. Cox regression factors on survival were set at default: "coxph(Surv(times, died) ~ gene + grade1 + grade2 + grade3 + grade4 + age)". In addition, we exploited the Kaplan-Meier method with log-rank test for comparing survival curves in two groups with the divided option of upper 50 percent mRNA expression and lower 50 percent mRNA expression patients. A log-rank P-value less than 0.05 (Kaplan-Meier) and adjusted P-value less than 0.1 (Cox regression) were considered to be the statistically significant thresholds.

Tissue microarray (TMA) staining and analyses
Human cervical tissue microarray slides (CR6161) were purchased from the US Biomax Inc. (Rockville, MD, USA). Rabbit anti-ZNF281 (1:30, Atlas Antibodies, HPA051228) antibody was used for the experiment. Immunohistochemistry (IHC) was performed for human normal tissues and cervical cancer tissues using the protocol previously described with minor modifications [67]. Images were obtained from Aperio Scanscope digital slide scanners and analyzed by the vendor's software (Leica, Wetzlar, Germany). The staining score was given as 0, 1, 2, and 3 for the tissues with stainless cells, <10% stained cells, 10-50% stained cells, and > 50% stained cells, respectively. Tissue loss cores were excluded from the analysis. Only squamous cell carcinoma, adjacent normal, and normal tissues were considered in the final statistical analysis (256 cancer and 26 normal tissues). A Mann-Whitney U test was conducted to compare differences between the normal group and CxCa group using GraphPad Prism 6 (San Diego, CA). IHC scoring visualization was performed using lattice package version 0.20-35 in R environment version 3.3.3 [68]. The significance level was set at 0.05.