Computational development of a molecular-based approach to improve risk stratification of endometrial cancer patients

Histological classification and staging are the gold standard for the prognosis of endometrial cancer (EC). However, in morphologically intermediate and doubtful cases this approach is largely insufficient, which highlights the need for better classification criteria. In this work we developed an algorithm that, based on EC genetic alterations and in combination with the current histological classification, improves the prognostic stratification of EC patients, particularly in doubtful cases. A panel of 26 cancer-related genes was analyzed in 89 EC patients, and somatic functional mutations were investigated in association with histology and outcome. An unsupervised hierarchical clustering analysis revealed that two groups of patients with different tumor grade and different prognosis can be distinguished by mutational profile. In particular, the mutational status of APC, CTNNB1, PIK3CA, PTEN, SMAD4 and TP53 emerged as the principal driver of prognostic clustering. Consistently, a decision tree generated by a data mining approach summarizes the sequential molecular criteria for the prognostic stratification of patients. The model proposed in this work provides the clinician with a tool that can support the prognosis of EC patients and, consequently, guide the choice of the most appropriate therapeutic strategy and follow-up.


INTRODUCTION
Endometrial cancer (EC) is the most common gynecological cancer in industrialized countries. About 142,000 new cases of EC are diagnosed every year worldwide, and about 42,000 women die every year from EC [1]. Most ECs are diagnosed after the menopause, with the highest incidence around the seventh decade of life [1]. The early onset of symptoms explains why, at the time of diagnosis, 70% of patients present with early-stage disease, resulting in a favorable prognosis with a 77% 5-year overall survival (OS) rate. In contrast, women with advanced or recurrent disease show low response rates to conventional chemotherapy and extremely poor outcomes [2].
Traditionally, EC is classified into two types according to the Bokhman model, which is based on clinical-pathologic features [3]. Type 1 ECs are endometrioid cancers, associated with hyperestrogenism and typically preceded by endometrial hyperplasia. They are often diagnosed at an early stage and have a good prognosis. Type 2 EC includes non-endometrioid cancers such as serous, clear cell, mixed cell and undifferentiated carcinomas and carcinosarcoma. These neoplasms are not estrogen-related, often occur in the presence of an atrophic endometrium and have a poor prognosis. The 5-year OS rate of patients with endometrioid adenocarcinoma (type 1) ranges from 75% to 86%, in contrast to 50% to 60% for patients with non-endometrioid cancer (type 2).
Research Paper www.oncotarget.com
Genetically, type 1 endometrioid ECs present a high percentage of mutations in PTEN, KRAS, ARID1A and CTNNB1, as well as defects in DNA mismatch repair. Type 2 non-endometrioid ECs frequently show aneuploidy, p53 mutations and HER2 amplification. PIK3CA mutations are frequent in both EC histotypes [4]. Well-known prognostic factors are age, International Federation of Gynaecology and Obstetrics (FIGO) stage, depth of myometrial invasion, tumor differentiation grade, tumor type and lymphovascular space invasion (LVSI) [5,6]. Moreover, new prognostic factors have been investigated [7][8][9] to identify tumors with poor outcome. Although more than one risk-based classification of EC has been proposed, numerous EC cases, in particular those with intermediate phenotype and grading (e.g. endometrioid tumor G2), still have an uncertain prognosis.
Recently, The Cancer Genome Atlas Research Network (TCGA) reported a comprehensive genomic and transcriptomic analysis of EC based on next-generation sequencing (NGS) technologies, analysis of DNA methylation, reverse-phase protein array, and microsatellite instability [10]. The study categorized the most common histotypes into four genomic classes: ultra-mutated tumors (POLE) with a favorable prognosis; microsatellite-instable tumors (microsatellite hypermutated) and low copy number tumors (microsatellite-stable), both with an intermediate prognosis; and high copy number tumors (serous-like) with a poor outcome. Moreover, the TCGA study also revealed that a subset of ECs diagnosed as high-grade endometrioid carcinomas harbored copy number and mutational profiles more similar to those of serous ECs, and in general no mutations (excluding POLE) were identified as unique to any of the four genomic classes. In view of the substantial genetic and morphological heterogeneity of EC, these data suggested that the current histopathology-based classification approach requires a revision that also takes into account the complex molecular profiles of EC [4]. While offering a complete overview of the EC genetic and molecular landscape, the TCGA classification was only partially associated with prognosis, giving results that seem to contrast with literature data and need further investigation. Furthermore, the use of this type of screening in clinical routine, where rapid prognostic prediction and treatment choice are needed, still appears unfeasible because its high complexity makes it too expensive in terms of time, cost and interpretation of the results.
In this study we investigated a novel molecular-based approach to predict prognosis in EC. The model, based on DNA sequencing of a few genes, subdivides EC into "good prognosis" and "bad prognosis" groups. It can be applied to the investigation of ambiguous cases and can support Bokhman's model and histological grading when the canonical approach is not sufficient to predict tumor outcome.

Study population
In this study 89 EC patients were analyzed. Clinical and pathological features of the patients enrolled in this study are shown in Supplementary

Next generation sequencing analysis revealed a high frequency of genetic mutations in the endometrial cancer population
A next generation sequencing approach was applied to investigate mutations in a panel of 26 oncogenes and tumor suppressor genes. In order to outline a molecular profile with prognostic potential, we considered only genetic variants known to have a frequency lower than 1% in the general population and predicted to have an effect on the encoded protein. Supplementary Table 2 summarizes all 893 genetic alterations identified by sequencing. Variants classified as synonymous, intronic, non-coding, polymorphic or located in 3'UTR regions (608) were excluded, while variants classified as missense, frameshift, stop-gain or affecting splice sites (285) were included in further analysis. Seventy-six of 89 ECs (85.4%) presented at least one somatic mutation in one of the 26 genes analyzed, while 13 ECs (14.6%) did not present any somatic mutation in the considered genes. PTEN and PIK3CA were the most frequently mutated genes: 56/89 (62.9%) patients presented at least one mutation in PTEN and 37/89 (41.6%) in PIK3CA. Twenty patients had more than one somatic mutation in PTEN and 12 had more than one in PIK3CA. Up to 4 mutations per gene in the same patient were identified for PIK3CA and PTEN.
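The selection step described above can be sketched as a simple filter. This is only an illustration under stated assumptions, not the pipeline used in the study: the record fields `consequence` and `population_frequency`, the consequence labels and the toy records are all hypothetical, and treating a missing population frequency as "rare" is a choice made here for the example.

```python
# Hypothetical consequence labels; real annotation tools use their own vocabularies.
INCLUDED_CONSEQUENCES = {"missense", "frameshift", "stop_gain", "splice_site"}
MAX_POPULATION_FREQUENCY = 0.01  # keep variants rarer than 1% in the general population


def keep_variant(variant):
    """Return True if a variant is rare and belongs to an included consequence class."""
    freq = variant.get("population_frequency")
    # Assumption: a variant with no recorded population frequency is treated as rare.
    is_rare = freq is None or freq < MAX_POPULATION_FREQUENCY
    return is_rare and variant["consequence"] in INCLUDED_CONSEQUENCES


# Toy records for illustration only, not data from the study.
variants = [
    {"gene": "PTEN", "consequence": "frameshift", "population_frequency": None},
    {"gene": "PIK3CA", "consequence": "missense", "population_frequency": 0.0002},
    {"gene": "TP53", "consequence": "synonymous", "population_frequency": 0.0001},
    {"gene": "APC", "consequence": "missense", "population_frequency": 0.12},
]

selected = [v for v in variants if keep_variant(v)]
print([v["gene"] for v in selected])
```

In this toy input the synonymous TP53 variant is dropped because of its class and the APC variant because it is common in the population.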

Unsupervised hierarchical clustering based on patients' genetic profile distinguished tumors with different grading
A hierarchical clustering analysis based on the Euclidean distance between samples and the Ward agglomerative procedure was applied to perform an unsupervised subdivision of the EC cohort, taking into account only genetic characteristics. The variables considered were expressed as the number of mutations occurring in each gene (range 0 to 4).
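As an illustration of how each tumor can be turned into the count variables described above, the sketch below builds one vector per patient holding the number of retained mutations in each gene. The function name, the six-gene panel subset and the example patient are hypothetical; the study used a 26-gene panel.

```python
from collections import Counter

# Hypothetical subset of the panel, used only to keep the example short.
PANEL = ["APC", "CTNNB1", "KRAS", "PIK3CA", "PTEN", "TP53"]


def mutation_count_vector(mutated_genes, panel=PANEL):
    """Count retained mutations per gene, in the fixed order given by `panel`.

    `mutated_genes` lists one gene symbol per retained variant, so a gene
    carrying two variants appears twice.
    """
    counts = Counter(mutated_genes)
    return [counts.get(gene, 0) for gene in panel]


# Toy patient with two PTEN variants and one PIK3CA variant.
vector = mutation_count_vector(["PTEN", "PIK3CA", "PTEN"])
print(dict(zip(PANEL, vector)))
```

Genes outside the panel are simply ignored, and genes without variants get a count of 0.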
Two clusters were derived from this analysis: Cluster 1 with 23 EC samples and Cluster 2 with 66 (Figure 1A). Table 1 summarizes the frequencies of clinical features within the two clusters. Intriguingly, a strong association between clusters and tumor grading (P value < 0.001) was observed.
In particular, the molecular model efficiently identifies type 1 G1 tumors, positioning them all in cluster 2. The Lax Kurman histological classification was likewise significantly associated with cluster subdivision; about 85% of the low grade tumors were grouped in cluster 2. By contrast, no differences in age, BMI, FIGO stage or lymph node positivity were observed between the two clusters.

Molecular based clustering distinguished two groups of patients with different trend of survival
Next, we investigated whether molecular clustering could effectively distinguish patients with different prognosis. For this purpose, a Cox proportional hazards model was applied to compare overall survival and disease-free survival between the two clusters.
First, we performed the analysis on all 89 EC patients. Table 2 summarizes the number of deaths and recurrences registered in the total population and reports the hazard ratio between the two clusters. The differences between the two groups were not statistically significant (Figure 1B). However, given that deaths and recurrences are quite rare in this tumor type, the change in recurrence between the two clusters (from 22% in cluster 1 to 14% in cluster 2) is nonetheless clinically relevant.
The same analysis was then performed on a selected subset composed of the 82 type 1 ECs included in the genetic profiling (Figure 1C). When restricted to a histologically homogeneous cohort, the Cox proportional hazards model demonstrated that the molecular clustering, inferred on the basis of the genetic profile, correlates significantly with patients' disease-specific survival (log-rank P value = 0.033). In particular, cluster 2 presented an approximately 4 times lower risk of death from the tumor (HR = 0.26) (Table 2). By contrast, disease-free survival probability was not significantly different between the two clusters (log-rank P value = 0.108).
Overall, these data suggest that the molecular-based clustering proposed in this model is suitable to distinguish "poor prognosis" EC patients (cluster 1) from "good prognosis" EC patients (cluster 2).
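To make the log-rank comparison quoted above more concrete, the following is a minimal sketch of a two-group log-rank test written in plain Python. It is not the authors' analysis (which was run in R and also fitted a Cox model, not reproduced here), and the follow-up times and event indicators are toy values invented for the example.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value), 1 degree of freedom.

    times_*  : follow-up times
    events_* : 1 if the event (e.g. death) was observed, 0 if the patient was censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk in group A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk in group B
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy follow-up data (months), invented for illustration only.
cluster1_times, cluster1_events = [12, 20, 31, 45, 60, 72], [1, 1, 1, 0, 1, 0]
cluster2_times, cluster2_events = [30, 48, 66, 80, 90, 96], [0, 0, 1, 0, 0, 0]

chi2, p = logrank_test(cluster1_times, cluster1_events, cluster2_times, cluster2_events)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
```

The statistic compares, at every event time, the deaths observed in one group with the number expected if both groups shared the same hazard.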

Mutational status of a small group of genes influences tumor grading and patients prognostic classification
In order to investigate which genes were the most relevant in this model and for the clustering of EC patients, we generated a heatmap representing the number of mutations occurring in each gene for every single case (Figure 2).
Interestingly, the heatmap shows that no mutations in the APC gene were found in cluster 2 patients, while 9 mutations (in 5 out of 23 patients, 21.7%) were observed in cluster 1. By contrast, no mutations in KRAS were observed in cluster 1, while 14 patients in cluster 2 presented at least one KRAS mutation (14/66, 21.2%). Mutations in PIK3CA were observed in 23/23 (100%) patients in cluster 1 and 14/66 (21.2%) in cluster 2, and all tumors presenting more than one PIK3CA variant were located in cluster 1. In cluster 1, 19/23 (82.6%) patients presented both PIK3CA and PTEN mutations. In cluster 2, the coexistence of these mutated genes was observed in only 9/66 (13.6%) cases.
In order to statistically investigate these observations, we analyzed the frequencies of mutations of each gene in the two clusters (Table 3). Univariate statistical analysis confirmed the significantly different distribution of APC, CTNNB1, KRAS, PIK3CA and PTEN, as observed in the heatmap. Furthermore, multivariate analysis confirmed a significantly different distribution of CTNNB1 and PIK3CA mutations, suggesting a possible role of the mutational profiles of these genes as drivers of cluster generation. In addition, the total mutational load (calculated considering all 26 TruSight Tumor genes) was found to be statistically different in the two clusters: while in the "bad prognosis" cluster 1 a mean of 6 mutations per patient was observed, in cluster 2 the mean mutational load was only 2, suggesting, as expected, that the coexistence of a larger number of mutations could influence the development of a worse tumor phenotype. Finally, we explored the correlation between gene mutations and EC tumor grade (Table 3). In a univariate analysis, KRAS, PIK3CA and TP53 mutations presented a frequency distribution significantly different among the distinct tumor grades. In particular, KRAS mutations were more frequent in low grade type 1 EC (p = 0.043), while single or double mutations of PIK3CA and TP53 occurred with higher frequency in high grade EC (p < 0.001). The

A data mining tool based on few genes mutation analysis could support the prognosis of EC patients
Taken together, these data indicate that mutation analysis of a limited number of genes could generate a model to improve risk-based stratification of EC patients. In order to provide the clinician with an easy and useful tool for the prognostic classification of EC patients, we used a data mining approach to define a sequence of rules and to generate a schematic representation of the model. Figure 3A shows a classification tree that was created to summarize the principal rules that drove patient clustering. The mutational status of PIK3CA, PTEN and CTNNB1 appears to be the main driver of cluster generation. In particular, patients presenting more than one mutation in PIK3CA are predicted to have a bad prognosis (cluster 1), while patients with no mutation in PIK3CA can be automatically classified in the good prognosis group (cluster 2). In patients with only one mutation in PIK3CA, instead, the evaluation of PTEN and CTNNB1 is necessary: the coexistence of mutations in PIK3CA, PTEN and CTNNB1 can be considered a marker of bad prognosis (cluster 1), while women with one mutation in PIK3CA but no mutation in PTEN are predicted to have better survival (cluster 2). A 10-fold cross-validation was used to evaluate this data mining method. The decision tree proposed above had 90% classification accuracy, a Matthews correlation coefficient of 0.76, 74% sensitivity and 97% specificity.
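The rules stated in the paragraph above can be written down as a small function. This is a sketch of the rules as worded in the text, not a transcription of Figure 3A: the function name and return convention are hypothetical, and the combination that the text does not spell out (one PIK3CA mutation, PTEN mutated, CTNNB1 not mutated) is deliberately left undecided.

```python
def predict_cluster(pik3ca, pten, ctnnb1):
    """Apply the rules stated in the text to per-gene mutation counts.

    Returns 1 (cluster 1, "bad prognosis"), 2 (cluster 2, "good prognosis"),
    or None when the text gives no rule for the combination.
    """
    if pik3ca > 1:
        return 1  # more than one PIK3CA mutation
    if pik3ca == 0:
        return 2  # no PIK3CA mutation
    # Exactly one PIK3CA mutation: PTEN and CTNNB1 must be evaluated.
    if pten == 0:
        return 2  # one PIK3CA mutation but no PTEN mutation
    if ctnnb1 > 0:
        return 1  # PIK3CA, PTEN and CTNNB1 all mutated
    # One PIK3CA mutation with PTEN mutated and CTNNB1 not mutated is not
    # described in the text, so no prediction is made here.
    return None


for counts in [(2, 0, 0), (0, 3, 1), (1, 0, 1), (1, 1, 1), (1, 1, 0)]:
    print(counts, "->", predict_cluster(*counts))
```

Returning `None` for the unspecified branch keeps the sketch from guessing at a rule that only the figure could confirm.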
However, 14 out of 89 patients were not classified in accordance with the proposed model, suggesting the

DISCUSSION
To date, histological characterization and staging are the gold standard for EC prognosis. Different histological criteria, such as Bokhman typing, FIGO stage, grading and the Lax Kurman binary classification [3-5, 11, 12], can be used to predict EC outcome. Nevertheless, in some morphologically intermediate and doubtful cases, anatomopathological classification and risk-based stratification turn out to be insufficient and inefficient.
In this study we explored the mutational profile of a selected cohort of ECs with the aim of developing a simple genetic-based tool to improve the accuracy of the current stratification methods for EC patients. We investigated the occurrence of mutations in a panel of 26 cancer-related genes in a population of 89 ECs with different histological characteristics and different outcomes. An unsupervised hierarchical clustering analysis demonstrated that the mutational profiles obtained from this analysis effectively separate endometrial tumors into two groups characterized by different tumor grades and different prognosis.
Statistical analyses were performed to define which of the genes investigated could be considered principal drivers of the prognostic clustering: APC, CTNNB1, PIK3CA, PTEN, SMAD4 and TP53 emerged as the most influential mutated genes. Moreover, the derived rules indicate that not only the presence or absence of somatic and damaging mutations in these genes, but also the number of variants occurring in the same gene in each sample, can be decisive in predicting patient outcome. Finally, a data mining strategy based on the generation of a decision tree was used to summarize a sequential list of classification rules that can be applied to perform EC risk stratification based only on molecular data (Figure 3B).
Activation of the PI3K pathway regulates key aspects of cancer biology, including metabolism, cellular growth, survival and resistance to apoptosis [13]. PTEN counteracts the activation of the PI3K pathway by hydrolyzing and inactivating phosphatidylinositol 3,4,5-trisphosphate (PIP3), the molecule responsible for the activation of the signalling cascade [14]. The PI3K/AKT/mTOR pathway is also involved in cross-talk with other signalling pathways, including the RAS/RAF/MEK [15] and estrogen receptor (ER) [16,17] pathways. Data from the literature indicate that constitutive activation of the PI3K/AKT pathway in EC occurs mostly through mutational inactivation of PTEN or mutational activation of PIK3CA [18]. A high frequency of PIK3CA and PTEN mutations, often coexisting in the same tumor, has already been described in EC [19]. Interestingly, in our analysis, PIK3CA and PTEN mutations were identified as the principal determinants of patients' prognostic clustering, further highlighting the fundamental role of this pathway in EC. Moreover, we showed that two PIK3CA mutations or the coexistence of PIK3CA and PTEN mutations are needed to influence endometrial cancer prognosis. Our observations, in accordance with data presented by Oda et al. [19], who described the lack of influence of a single PIK3CA mutation on EC, indicate that in EC more than one mutational event in PI3K/AKT pathway genes is necessary to functionally influence this pathway and to
Given the frequency of abnormalities in the PI3K/AKT pathway, this signaling pathway represents one of the most promising targets for EC therapy. Thus, the identification of genetic mutations within key genes of this pathway could provide valuable markers for patient selection and therapy response monitoring.
The third gene involved in the proposed prediction model was CTNNB1. Because of its high mutation frequency, the role of CTNNB1 mutations has often been investigated in association with EC. In particular, mutations in CTNNB1 exon 3, as in our cases, have been associated with nuclear accumulation of β-catenin [20] and with a consequent activation of the Wnt/β-catenin pathway, which has already been associated with worse survival in type 1 EC [21].
The same observations on CTNNB1 were reported in a recent work that showed an association between CTNNB1 mutations in low grade EC patients and a higher risk of tumor recurrence [22]. The method described in this work will need to be corroborated in separate sets of ECs and in a larger cohort of patients to strengthen the prognostic differences obtained with our model. These data suggest that the approach described in this work could become a dual-function tool. First, it represents an easy and relatively inexpensive molecular profiling of EC that could be combined with histological classification to establish patient prognosis, in particular in doubtful and intermediate cases.
Secondly, the development of a small NGS panel based on the mutational analysis of the few genes that emerged from this model could represent a rapid method to investigate the genes that are considered the most promising molecular targets for EC therapy. Currently, in the most recent guidelines, some management recommendations for EC patients, such as the use of adjuvant therapy, are still based on scant evidence [23]. The genetic-based model proposed in our study could help provide a more appropriate and tailored treatment to patients diagnosed with EC. In particular, this clustering strategy could help to identify a genetic subgroup of patients who would benefit from adjuvant therapy and closer follow-up but who, based on the current classification, remain undertreated. Moreover, the application of this tool could help spare low-risk EC patients from aggressive therapy and intensive follow-up.

Patient selection
Clinical annotations of EC patients treated and followed at the Azienda USL-IRCCS di Reggio Emilia (Italy) from 2000 to 2016 were reviewed for cohort selection. Patients with a histological diagnosis of type I or type II EC who underwent surgery were included in the study protocol. Exclusion criteria were: inadequate EC management according to internal and international guidelines [24,25], neoadjuvant chemotherapy performed before surgery, age under 18 years, non-Caucasian ancestry, inadequate follow-up according to internal guidelines, absence of written informed consent, diagnosis of a previous or concurrent cancer(s) and unavailable follow-up data. Follow-up was defined as "adequate" in case of adherence to the following schedule: for type I EC at stage IA and grading G1/G2, physical and gynecological examination and transvaginal ultrasound every 6 months for the first 2 years, then every 12 months for at least 3 years; for type I EC at stage IB and/or any grading G3 and for any type II tumor, physical and gynecological examination and transvaginal ultrasound every 6 months for the first 5 years. Further investigations such as abdominal ultrasound, chest X-ray, computed tomography scan, and serum CA 125 levels were performed if clinically indicated. A total of 105 patients met the inclusion criteria and were considered for the study. Eighty-nine of the 105 patients had FFPE tumor tissue suitable for genetic analysis.
Clinical, pathological and genetic data of all patients remained anonymous and were recorded in a password-protected electronic database. The study was approved by the Local Ethical Committee and all patients provided written informed consent to take part in the study.

Next generation sequencing
DNA was extracted from formalin-fixed paraffin-embedded (FFPE) EC tissues using a Maxwell nucleic acid extractor (Promega) and then quantified and quality-checked using the Kapa SYBR Fast qPCR kit.
MiSeq Reporter software was used to process MiSeq raw data and produce fastq and vcf files. VariantStudio software (Illumina) was used to visualize the list of mutations occurring in each sample, annotate them and apply selection filters. Mutations were considered reliable if they had a minimum frequency of 5% and a minimum coverage of 500×.
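The reliability thresholds stated above amount to a two-condition check. The sketch below assumes that "frequency" means the fraction of reads supporting the variant and that both thresholds are inclusive; the function and constant names are hypothetical, and the actual filtering in the study was applied within the Illumina software.

```python
MIN_VARIANT_FREQUENCY = 0.05  # 5% of reads supporting the variant (assumed meaning)
MIN_COVERAGE = 500            # 500x read depth at the variant position


def is_reliable(variant_frequency, coverage):
    """Return True if a call meets both the frequency and the coverage threshold."""
    return variant_frequency >= MIN_VARIANT_FREQUENCY and coverage >= MIN_COVERAGE


print(is_reliable(0.12, 850))   # passes both thresholds
print(is_reliable(0.03, 2000))  # frequency too low
print(is_reliable(0.30, 420))   # coverage too low
```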

Statistical analysis
All analyses performed in this study were carried out using R software.
To generate the unsupervised hierarchical clustering, the number of non-silent mutations occurring in each gene was taken into account for each patient. Variables were expressed as ordinal values ranging from 0 (no mutation) to 4. Only data obtained from sequencing were used as attributes in the analysis; no clinical variables were included. Euclidean distance was used to compute distance measures between samples. The Ward agglomerative hierarchical clustering procedure was applied. A two-cluster subdivision was chosen to obtain numerically comparable groups.
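The clustering procedure described above can be illustrated with a naive pure-Python version of Ward's method on Euclidean distances, cut at two clusters. This is a sketch under assumptions, not the R code used in the study: it implements the classical minimum-variance criterion (the source does not say which Ward variant was used in R), and the three-gene toy profiles are invented.

```python
def ward_clusters(points, n_clusters=2):
    """Naive agglomerative clustering with Ward's minimum-variance criterion.

    Returns a sorted list of clusters, each a sorted list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def centroid(members):
        dim = len(points[0])
        return [sum(points[i][k] for i in members) / len(members) for k in range(dim)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if clusters a and b are merged.
        ca, cb = centroid(a), centroid(b)
        squared_distance = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * squared_distance

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)


# Toy mutation-count profiles (hypothetical columns: PIK3CA, PTEN, CTNNB1).
profiles = [
    [2, 1, 1],
    [3, 2, 0],
    [2, 2, 1],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
]

clusters = ward_clusters(profiles, n_clusters=2)
print(clusters)
```

At each step the two clusters whose merge increases the within-cluster sum of squares the least are joined, which is what distinguishes Ward's procedure from single or complete linkage.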
Analyses of association between clusters, clinical characteristics and gene mutations were performed using the Fisher test and generalized linear models. Survival analysis was conducted by applying the Cox proportional hazards model, and Kaplan-Meier curves were generated. Associations and differences were considered statistically significant if they presented a P value lower than 0.05.
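As a worked illustration of the Fisher test mentioned above, the sketch below computes a two-sided exact p-value for a 2x2 table in plain Python. The example table uses the APC counts reported in the Results (5 of 23 cluster 1 patients mutated versus 0 of 66 cluster 2 patients); the function name is hypothetical, and the value obtained here is not claimed to match the one in Table 3.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Rows: cluster 1, cluster 2. Columns: APC mutated, APC not mutated.
p_apc = fisher_exact_two_sided(5, 18, 0, 66)
print(f"p = {p_apc:.6f}")
```

The small tolerance factor guards against floating-point ties when deciding which tables count as at least as extreme as the observed one.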

Classification tree generation
Orange Canvas software [26] was used to generate the classification tree. The mutational statuses of the genes were the only attributes considered in the analysis, and "cluster 1" and "cluster 2" were the two decision classes. All attributes were defined as continuous. Gain ratio was used as the attribute selection criterion. For pre-pruning, a minimum of 5 instances per leaf was set. For post-pruning, recursive merging of leaves with the same majority class was performed and the m parameter was set to 1. Classification accuracy, sensitivity and specificity of this method were calculated after a 10-fold cross-validation.
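For reference, the evaluation measures named above can all be derived from a 2x2 confusion matrix, as in the sketch below. The counts are hypothetical values chosen for illustration; the source does not report the confusion matrix, and treating cluster 1 as the positive class is an assumption made here.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts, not taken from the study (positive class assumed to be cluster 1).
metrics = classification_metrics(tp=17, fn=6, tn=64, fp=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the table, which makes it informative when the two classes differ in size, as the two clusters do here.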

Author contributions
FT participated in study design, performed experiments and bioinformatics data analysis and wrote the manuscript. DN participated in study design, performed experiments and revised the manuscript. RB supported study design and bioinformatics data analysis and revised the manuscript. AC supported study design and participated in manuscript writing. EF participated in study design and revised the manuscript. VM managed the patient database. RZ supported experiments and revised the manuscript. GBLS supported study design and recruited patients. BC supported study design and revised the manuscript. VDM participated in study design, recruited patients and wrote the manuscript.