Optimal subset of feature miRNAs consisting of 7 miRNAs that can serve as a novel diagnostic and prognostic predictor for the progression of cervical cancer

Objective: Cervical cancer is the second most commonly diagnosed cancer in women. Novel prognostic biomarkers are needed to predict the progression of cervical cancer. Results: In total, 44 significantly differentially expressed miRNAs were identified. An optimal subset of 7 feature miRNAs was identified, including hsa-miR-144, hsamiR-147b, hsa-miR-218-2, hsa-miR-425, hsa-miR-451, hsa-miR-483, and hsamiR-486. The feature miRNAs were used to construct an SVM classifier and showed a good performance in predicting pathologic stages of samples. SVM classification was found to be an independent prognostic factor. Functional enrichment analysis indicated that these feature miRNAs are involved in tumorigenesis. Materials and Methods: Cervical cancer expression data were obtained from the Cancer Genome Atlas database. MicroRNAs (miRNAs) significantly differentially expressed between earlyand advanced-stage samples were identified by expression analysis. An optimal subset of feature miRNAs for pathologic stage prediction was delineated using the random forest algorithm and was used for the construction of a cervical cancer-specific support vector machine (SVM) classifier. The roles of feature miRNAs in cervical cancer were analyzed by functional annotation. Conclusions: The subset of feature miRNAs could potentially serve as a novel diagnostic and prognostic predictor for cervical cancer.


INTRODUCTION
As one of the most frequently diagnosed cancers in women worldwide, it is estimated that cervical cancer accounts for more than 500,000 new cases and 250,000 deaths each year [1].HPV (human papillomavirus) infection is recognized as the most significant risk factor that presents in most cervical cancers [2,3].Integration of HPV into the cellular genome causes genome instability, transcriptional variations, and epigenetic alterations [4,5].However, HPV alone is not sufficient to induce malignant transformation [6].Therefore, additional cancer-causing genetic variations may underlie the development and progression of cervical cancer.
MicroRNAs (miRNAs) are small, noncoding RNA molecules of approximately 22 nucleotides in length, functioning through targeting mRNAs to regulate gene expression at the post-transcriptional level [7,8].MiRNAs are involved in diverse biological processes, including cell cycle, differentiation, and metabolism [9].Increasing evidence highlights the involvement of altered miRNA expression in cervical cancer [10].Examples include hsa-miR-21 [11], hsa-miR-196a [12], and hsa-miR-497 [13].Hsa-miR-21 is an oncogene overexpressed in cervical cancer, the inhibition of which upregulates the tumor suppressor, programmed cell death 4, and suppresses cell proliferation [11].Upregulation of hsa-miR-196a has also been detected in cervical cancer tissues, in which it promotes cancer cell proliferation [12].Unlike hsa-miR-21 and hsa-miR-196a, hsa-miR-497 is a tumor suppressor for cervical cancer and suppresses cancer cell migration and invasion by targeting insulin-like growth factor 1 receptor [13].
Potential prognostic miRNA signatures of cervical cancer have been identified.A miRNA signature for clinical response consisting of hsa-miR-200a and hsa-miR-9 has been previously identified by the expression analysis of candidate miRNAs [14].According to the expression levels of hsa-miR-200a and hsa-miR-9, cervical cancer samples could be divided into low-and high-risk groups.Functional analysis indicated that both hsa-miR-200a and hsa-miR-9 are likely to play important roles in cervical cancer metastasis.However, only a limited number of miRNAs were analyzed in this study [14], and miRNA expression differences between pathologic stages have not yet been examined.
To identify novel outcome predictors of cervical cancer, we comprehensively analyzed the expression levels of miRNAs using data from The Cancer Genome Atlas (TCGA).miRNAs significantly differentially expressed between early (I and II) and advanced (III and IV) pathologic stages were identified, followed by the identification of an optimal subset of feature miRNAs.The subset of feature miRNAs showed good performance in progression prediction and may serve as a promising prognostic predictor of cervical cancer in clinical practice.

Differential miRNA expression between earlyand advanced-stage groups
Among the 143 samples in the training dataset, 90 were early-stage and 53 were advanced-stage samples.
miRNAs with low expression (median expression value < 1.0) were removed, and the remaining 318 miRNAs were used for further analysis.In total, 51 miRNAs were identified to be significantly differentially expressed between early-and advanced-stage groups by t-test, whereas 49 miRNAs were identified by Wilcoxon ranksum test (Supplementary Figure 1).In total, 44 miRNAs were identified to be significantly differentially expressed by both t-test and Wilcoxon rank-sum test.
Two-way hierarchical clustering analysis was carried out based on the expression levels of the 44 miRNAs and separated the training dataset into 2 distinct clusters, designated as cluster 1 and cluster 2 (Figure 1A).Cluster 1 mainly consisted of early-stage samples, including 86 early-and 15 advanced-stage samples.Cluster 2 mainly consisted of advanced-stage samples, including 38 advanced-and 4 early-stage samples.The overall accuracy of hierarchical clustering in classifying samples was 86.71% (124 out of 143 samples).Moreover, the 2 clusters correlated significantly with cancer progression status (χ 2 = 69.5245,p < 2.2e-16).Kaplan-Meier analysis showed that samples in cluster 1 were related with a significantly better prognosis than those in cluster 2 (Figure 1B; logRank p = 1.179e-05), which was consistent with the fact that cluster 1 mainly consisted of early-stage samples, whereas cluster 2 mainly consisted of advanced-stage samples.

Cervical cancer-related optimal subset of feature miRNAs
The random forest algorithm was used to identify an optimal subset of feature miRNAs from the 44 significantly differentially expressed miRNAs, using the training dataset.The results showed that the OOB error reached a minimum (0.181) when 7 miRNAs were used for fitting (Figure 2A).These 7 miRNAs, including miRNAs were hsa-miR-144, hsa-miR-147b, hsa-miR-218-2, hsa-miR-425, hsa-miR-451, hsa-miR-483, and hsa-miR-486, were selected as optimal miRNAs and are summarized in Table 1.According to the expression levels of these 7 miRNAs in the training dataset, hsa-miR-147b and hsa-miR-218-2 showed a significantly higher expression in the early than in the advanced stage, whereas the remaining 5 miRNAs showed a significantly lower expression level in the early stage (Figure 2B).
Hierarchical clustering analysis based on the expression levels of the feature miRNAs showed that samples in the training dataset could be separated into 2 clusters, (Figure 2C).Similar to the hierarchical clustering results based on the expression levels of all 44 significantly differentially expressed miRNAs, cluster 1 mainly consisted of early-stage samples (74 early-vs.10 advanced-stage samples) whereas cluster 2 mainly consisted of advanced-stage samples (16 early-vs.43 advanced-stage samples).The overall classification accuracy was 81.82% (117 out of 143 samples).Additionally, cluster 1 was related with significantly better prognosis than cluster 2 (Figure 2D; logRank p = 0.01175).

A cervical cancer-specific SVM classifier for pathologic stage prediction
A cervical cancer-specific SVM classifier was constructed based on the expression levels of feature miRNAs in the optimal subset.Samples in the training dataset were classified as early-stage-like or advancedstage-like using the SVM classifier.The results showed that the SVM classifier could classify samples in the training dataset with an overall accuracy of 85.31% (122 out of 143 samples; sensitivity = 79.81%,specificity = 94.44%,positive prediction value (PPV) = 88.095%,negative prediction value (NPV) = 84.16%,and area under the receiver operating characteristic (ROC) curve (AUC) = 0.897) (Figure 2E).Kaplan-Meier survival analysis showed that the predicted early-stage-like group had a significantly better prognosis than the advanced-stage-like group (Figure 2F; logRank p = 0.004715).

Validation of the performance of the feature miRNAs in pathologic stage prediction
The validation dataset was used to validate the performance of the 7 feature miRNAs in predicting pathologic stage.The validation samples were first classified into cluster 1 and cluster 2 by hierarchical clustering (Figure 3A).Similar to the results for the training dataset, cluster 1 mainly consisted of early-stage samples (78 early-vs.15 advanced-stage samples) and cluster 2 mainly consisted of advanced-stage samples (37 early-vs.12 advanced-stage samples).The overall accuracy was 80.99% (115 out of 142 samples).Prognosis in the 2 clusters was analyzed using Kaplan-Meier survival curve analysis.In consistence with the findings based on the training dataset (Figure 2D), classification in cluster 1 indicated significantly better prognosis (Figure 3B).
Then, we classified the validation samples into early-stage-like and advanced-stage-like groups using the SVM classifier.Consistent with the training dataset findings, the validation samples were classified with an accuracy of 80.98% (115 out of 142 samples; sensitivity = 73.46%,specificity = 91.11%,PPV = 80.49%, NPV = 81.19%,and AUC = 0.857) (Figure 3C).Kaplan-Meier survival curve analysis showed that the early-stage-like samples corresponded with significantly better prognosis than the advanced-stage-like group (Figure 3D; logRank p = 0.006246).

Independence of SVM-predicted group in progression prediction
Correlations between clinical variables and prognosis were analyzed via Cox regression.Both univariate and multivariate Cox regression showed that SVM-predicted group, pathologic T, and new tumor were independent prognostic factors, as they correlated significantly with prognosis (p < 0.05) in both the training and the validation dataset (Table 2).Kaplan-Meier survival curve analysis for both the training and the validation dataset showed that samples under pathologic T0 and T1 categories were related to a significantly better prognosis than those under pathologic T2 and T3 categories, and samples without new tumor correlated with a better prognosis than those with new tumor (Supplementary Figure 2).
Stratified analysis was performed to evaluate the independence of SVM-predicted group as a prognostic factor.Samples in both the training and the validation dataset were stratified according to age, pathologic M, pathologic N, pathologic T, smoking, new tumor, radiation therapy, and targeted molecular therapy.Univariate Cox regression showed that SVM-predicted group correlated significantly with prognosis for elder patients (> 45 years of age) in both the training and the validation dataset (Table 3).Similar results were found for patients without new tumor and radiation therapy (Table 3).Additionally, SVM-predicted group correlated significantly with pathologic N1 patients in the training dataset and pathologic N0 patients in the validation dataset (Table 3).Kaplan-Meier survival curves for samples stratified by age, pathologic N, new tumor, and radiation therapy are shown in Figure 4.  differentially expressed mRNAs were identified.Target mRNAs of the feature miRNAs were predicted using miRanda [15].As a result, we found that 63, 86, 59, 14, 33, and 3 significantly differentially expressed mRNAs were targeted by hsa-miR-218, hsa-miR-144, hsa-miR-425, hsa-miR-451, hsa-miR-483, and hsa-miR-486, respectively, while none were targeted by hsa-miR-147b.We constructed a cervical cancer-related miRNA-mRNA regulation network, consisting of feature miRNAs and significantly differentially expressed mRNAs targeted by these feature miRNAs (Figure 5A).The network consisted of 207 nodes, containing 6 feature miRNAs and 201 significantly differentially expressed target mRNAs.

Functional annotation of the 7 feature miRNAs
To interpret the potential roles of the feature miRNAs in cervical cancer, functional enrichment analysis was conducted using DAVID [16].The results showed that mRNAs in the network were significantly enriched for cancer-related GO biological processes and KEGG pathways (Figure 5B and 5C).The GO terms included cell-cell signaling, epithelium development, ion transport, and adhesion (Figure 5B).The KEGG terms included pathways in cancer, calcium signaling pathway, basal cell carcinoma, Hedgehog signaling pathway, Wnt signaling pathway, heparan sulfate biosynthesis, and TGF-beta signaling pathway (Figure 5C).

DISCUSSION
Although efforts have been made to identify cervical cancer-related miRNAs [10][11][12][13][14], few studies have focused on miRNAs related to the progression of cervical cancer, which is essential for the development of novel diagnostic biomarkers for this disease.In the present study, we comprehensively analyzed cervical cancer-related miRNA expression profiles and identified 44 miRNAs significantly differentially expressed between early-and advanced-stage samples, using a training dataset.From these miRNAs, an optimal subset of feature miRNAs was selected, including hsa-miR-144, hsa-miR-147b, hsa-miR-218-2, hsa-miR-425, hsa-miR-451, hsa-miR-483, and hsa-miR-486.Both hierarchical clustering and SVM prediction results demonstrated the high performance of these feature miRNAs in the prediction of cervical cancer progression.The high performance of this subset of feature miRNAs was validated using a validation dataset.Similar to the findings for the training dataset, both hierarchical clustering and SVM prediction could classify the samples in the validation dataset with high accuracy.Additionally, Kaplan-Meier survival analysis showed that the SVM-predicted early-stage-like group showed significant better prognosis than the advanced-stage-like group for both the training and the validation dataset.We also examined the independence of SVM prediction as a prognostic factor.Both univariate and multivariate Cox regression analyses confirmed the independence of SVM-predicted group in outcome prediction for the training and validation datasets.The prognostic power of SVM prediction was additionally evaluated using stratified analysis.Elder patients (> 45 years of age) in both datasets could be classified into early-stage-like and advanced-stage-like groups with significant differences in prognosis.Similar results were acquired for patients with no new tumor and patients that had received no radiation therapy in both datasets.Taken together, the results demonstrated that SVM prediction was a prognostic factor independent of other clinical factors, including age, new tumor, and radiation therapy.
MiRNAs function by regulating the expression of target mRNAs [8,9].To interpret the functions of the feature miRNAs in cervical cancer, potential target mRNAs were identified and subjected to GO and KEGG functional enrichment analyses.Cancer-related GO biological processes and KEGG pathways were overrepresented among these mRNAs, which supported the importance of the feature miRNAs in cervical cancer.Two KEGG terms, Hedgehog signaling pathway and Wnt signaling pathway, deserve special attention.Both pathways are essential for cancer cell proliferation, migration, and invasion in various cancers, and thus, represent promising targets for cancer therapy [17,18].
The high predictive performance of the feature miRNAs may be explained by their close relation to cervical cancer and other cancers.hsa-miR-218 reportedly is tumor suppressor that is downregulated in cervical cancer [19] and functions though suppressing migration, invasion, and metastasis by targeting survivin [20].Additionally, lower-than-normal expression of hsa-miR-218 is related to infection with high-risk, rather than low-risk, HPV [21].Consistent with the tumorsuppressive role of hsa-miR-218, we found that hsa-miR-218-2 expression is reduced in advanced-stage patients.hsa-miR-486, hsa-miR-425, and hsa-miR-144 are also reportedly engaged in cervical cancer [22][23][24].A variant of hsa-miR-486, hsa-miR-486-3p, acts as a tumor suppressor through inhibiting cell growth and metastasis by targeting ECM1 [22].hsa-miR-425-5p, a variant of hsa-miR-425, is significantly upregulated in cervical cancer and may serve as a prognostic indicator [23].hsa-miR-144 is significantly downregulated in metastatic cervical cancer [24], though its role remains elusive.The remaining 3 feature miRNAs, while not directly associated with cervical cancer, have been associated with other cancer types.Dysregulation of hsa-miR-483 has been found in various cancers, including adrenocortical cancer [25], prostate cancer [26], and lung adenocarcinoma [27].hsa-miR-451 has been reported to be a tumor suppressor in different cancers [28,29] and a protective effect of hsa-miR-147b has been found in ovarian cancer [30].Considering the involvement of these feature miRNAs in diverse cancer types, they may play important roles in the development and progression of cervical cancer and may serve as potential biomarkers of this disease.However, further experimental and functional studies are needed to reveal the specific roles of these feature miRNAs in cervical cancer.
In summary, we identified 7 feature miRNAs of cervical cancer by a comprehensive miRNA expression analysis.The feature miRNAs were significantly associated with the progression of cervical cancer and were used for the construction of a cervical cancerspecific SVM classifier.The SVM is a promising predictor of progression and outcome.Meanwhile, the feature miRNAs may be novel therapeutic targets in future clinical practice.

Data source
mRNA and miRNA expression profiles (Illumina HiSeq 2000 RNA sequencing data) of cervical cancer were downloaded from TCGA (https://gdc-portal.nci.nih.gov/)database.Samples for which both mRNA and miRNA data as well as information on pathologic stage and survival were available, were selected for our study.In total, data of 285 samples were collected, which were randomly divided into training (143 samples) and validation (142 samples) datasets.Clinical characteristics of these samples are summarized in Table 4.

Screening of significantly differentially expressed miRNAs
Samples in the training dataset were divided into early (I and II) and advanced (III and IV) pathologic stage groups.miRNA expression profiles were compared between the two groups using the t test (http://127.0.0.1:26738/library/stats/html/t.test.html)and Wilcoxon rank-sum test (http://127.0.0.1:26738/library/ stats/html/wilcox.test.html)under R3.1.0.FDR (false discovery rate) < 0.05 and |logFC (fold change)| > 0.263 were set as thresholds for the selection of significantly differentially expressed miRNAs for both methods.Significantly differentially expressed miRNAs derived from both the t test and the Wilcoxon rank-sum test were used for further analysis.
Two-way hierarchical clustering analysis based on centered Pearson correlation [31] was performed using the pheatmap package (https://cran.r-project.org/package = pheatmap) in R. Correlation between clusters and pathologic stages were analyzed by the Chi-square

Selection and validation of an optimal subset of feature miRNAs
Feature miRNAs were selected from differentially expressed miRNAs using the bootstrap algorithm [32]  of the random forest package (https://cran.r-project.org/package=randomForest) [33] in R. The optimal subset of feature miRNAs was the one yielding the minimum outof-bag (OOB) error.
Based on the expression values of the optimal subset of miRNAs, two-way hierarchical clustering analysis was conducted for both the training and the validation dataset, followed by Kaplan-Meier survival analysis of different clusters.
A cervical cancer-specific support vector machine (SVM) classifier was constructed based on the expression values of the optimal subset of miRNAs, using the SVM function (core function: Sigmoid kernel, cross: 10-fold cross validation) in the e1701 package in R [34].The SVM classifier was used to predict the pathologic stages of samples.Based on the predictions, samples in both the training and the validation dataset were classified into either an early-stage-like or an advanced-stage-like group.The prognosis of different groups was analyzed using Kaplan-Meier survival curve analysis.

Independence analysis of SVM-predicted group as a prognostic factor
Clinical information on age, pathologic M, pathologic N, pathologic T, smoking, new tumor, radiation therapy, and targeted molecular therapy was extracted from both the training and the validation dataset.Correlations between clinical variables and prognoses were analyzed by univariate and multivariate Cox regressions, using the survival package under R3.1.0.Clinical variables with p < 0.05 were considered to be significant and independent prognostic factors.
Samples in both the training and the validation dataset were further stratified according to different clinical variables.The relation between SVM-predicted group and prognosis of cervical cancer was analyzed using univariate Cox regression and Kaplan-Meier survival curve analysis for each stratum, with p < 0.05 considered significant.Functional analysis of the optimal subset of feature miRNAs mRNAs significantly differentially expressed between early and advanced stages were screened as described above for miRNAs.mRNAs targeted by the optimal subset of feature miRNAs were predicted using miRanda (http://www.microrna.org/microrna/home.do)[15], which were further intersected with the significantly differentially expressed mRNAs.Thus, the selected mRNAs were used for the construction of a regulatory network of the optimal subset of feature miRNAs.mRNAs in the regulatory network were functionally annotated using DAVID (https://david.ncifcrf.gov/)[16] and significantly enriched Gene Ontology (GO) biological processes, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were retrieved, with p < 0.05 set as a threshold for significantly enriched terms.

Figure 1 :
Figure 1: Hierarchical clustering and survival analyses of samples in the training dataset.(A) Two-way hierarchical clustering of samples in the training dataset based on the expression levels of the 44 significantly differentially expressed miRNAs yielded 2 clusters, labeled as cluster 1 and cluster 2. Early-and advanced-stage samples are indicated in red and black in the horizontal bar.(B) Kaplan-Meier survival curves of the 2 clusters in (A).Survival curves for cluster 1 and 2 are indicated in green and red, respectively.

Figure 2 :
Figure 2: Optimal subset of feature miRNAs and its performance in sample classification.(A) OOB error curve for feature miRNA selection using the training dataset.The OOB error is plotted against the number of feature miRNAs used.The red vertical dashed line indicates the minimum OOB error (0.181), where the number of selected feature miRNAs is 7. (B) Expression levels of the feature miRNAs of the optimal subset in early-(red) and advanced (black)-stage groups.* p < 0.05, ** p < 0.01, *** p < 0.005.(C) Two-way hierarchical clustering of samples in the training dataset based on the expression levels of the 7 feature miRNAs yielded 2 clusters, which are labeled as cluster 1 and cluster 2. Early-and advanced-stage samples are indicated in red and black in the horizontal bar, respectively.(D) Kaplan-Meier survival curves of the 2 clusters in (C).Survival curves for clusters 1 and 2 are indicated in green and red, respectively.(E) ROC curves of the training dataset generated using the SVM classifier.The AUC was calculated to be 0.897.(F) Kaplan-Meier survival curves for early-stage-like (red curve) and advanced-stage-like (black curve) groups as predicted by the SVM classifier.

Figure 3 :
Figure 3: Validation of the optimal subset of feature miRNAs using the validation dataset.(A) Two-way hierarchical clustering of samples in the validation dataset based on the expression levels of the 7 feature miRNAs yielded 2 clusters, which are labeled as cluster 1 and cluster 2. Early-and advanced-stage samples are indicated in red and black in the horizontal bar, respectively.(B) Kaplan-Meier survival curves for clusters 1 (green curve) and 2 (red curve) in (A).(C) ROC curves of the validation dataset generated using the SVM classifier.The AUC was calculated to be 0.897.(D) Kaplan-Meier survival curves of early-stage-like (red curve) and advanced-stagelike (black curve) groups classified by the SVM classifier.

Figure 4 :
Figure 4: Stratified Kaplan-Meier survival curve analysis.Samples were stratified by age (A and B), pathologic N (C and D), new tumor (E and F), and radiation therapy (G and H) for both the training and the validation dataset.Survival curves for early-stage-like and advanced-stage-like groups in each stratum are shown in green and red, respectively.

Figure 5 :
Figure 5: Functional enrichment analysis of mRNAs targeted by the optimal subset of feature miRNAs.(A) Regulatory network consisting of feature miRNAs and their targeting mRNAs.Up-and downregulated feature miRNAs in advanced-stage samples are shown as orange triangles and diamonds, respectively.Up-and downregulated mRNAs are shown as purple triangles and green diamonds, respectively.Potential regulatory relationships between feature miRNAs and targeting mRNAs are indicated as gray lines.(B and C) GO (B) and KEGG (C) term enrichment analysis of mRNAs in (A).Gene numbers (vertical axis) are plotted against GO and KEGG terms (horizontal axis).-log(p-values) are indicated as red dots and lines.

Table 2 : Cox regression analysis of the correlations between clinical variables and prognosis for both the training and the validation dataset
a Hazard ratio; b Confidence interval.

Table 3 : Stratified prognosis analysis of patients in both the training and the validation dataset
a Hazard ratio; b Confidence interval.