MiR-192, miR-200c and miR-17 are fibroblast-mediated inhibitors of colorectal cancer invasion

Colorectal cancer remains a leading cause of cancer-related death worldwide. A previous transcriptomics based study characterized molecular subgroups of which the stromal subgroup was associated with the worst clinical outcome. Micro-RNAs (miRNAs) are well-known regulators of gene expression and can follow a non-linear repression mechanism. We set up a model combining piecewise linear and linear regression and applied this combined regression model to a comprehensive colon adenocarcinoma dataset. We identified miRNAs involved in regulating characteristic gene sets, particularly extracellular matrix remodeling in the stromal subgroup. Comparison of expression data from separated (epithelial) cancer cells and stroma cells or fibroblasts associate these regulatory interactions with infiltrating stromal or tumor-associated fibroblasts. MiR-200c, miR-17 and miR-192 were identified as the most promising candidates regulating genes crucial for extracellular matrix remodeling. We validated our computational findings by in vitro assays. Enforced expression of either miR-200c, miR-17 or miR-192 in untransformed human colon fibroblasts down-regulated 85% of all predicted target genes. Expressing these miRNAs singly or in combination in human colon fibroblasts co-cultured with colon cancer cells considerably reduced cancer cell invasion validating these miRNAs as cancer cell infiltration suppressors in tumor associated fibroblasts.

These z parameters form a special ordered set of type 2 (SOS2) [1], where at most two adjacent members can be non-zero and the sum of all members equal one. For every quantile, an additional parameter (β) was introduced to define the term (eff): The z parameters are scalar and can be determined directly from the experimental data, whereas the β parameters are optimized by the solver. The term eff k represents the predicted gene expression of sample k. Monotony is enforced by introducing a binary auxiliary variable (x) defined as: with M = 1000 always fulfilling both inequalities in the corresponding dataset.
The optimization criterion of the model was to minimize the absolute difference between the measured gene expression (g k )and the predicted gene expression (eff k ): As this model is a minimization problem, the absolute value term can be solved by introducing an auxiliary variable (e k ) and transferring into two inequalities: The MILP solver Gurobi (http://www.gurobi.com) minimizes the sum of error terms across all samples by optimizing the β parameters within the constraints defined by equations 3 to 6. For each defined quantile of miRNA expression, a corresponding β coefficient is fitted against the measured gene expression for the target gene under investigation, yielding a step-wise approximation of target gene expression depending on miRNA expression.
Gene-and miRNA expression profiles were pre-processed and miRNA -target gene pairs were extracted from TarBase [2]. The piecewise linear model was established for every miRNA -target gene pair individually. The sample set was split into 5 equally sized partitions for 5-fold cross-validation, in which the parameters β and x were optimized in the model using samples from only 4 partitions (training partitions). The 5 β parameters were subsequently utilized to predict gene expression for the samples in the remaining fifth partition (validation partition). Predictive performance was estimated by calculating Pearson's correlation of measured and predicted gene expression across the validation partitions from all cross-validations. The binary parameter, x, indicates whether the model fit is strictly monotonically increasing or decreasing in the training partitions used. Only miRNA -target gene pairs with predictive performances of r ≥ 0.25 and a strictly monotonically decreasing model fit for the majority of partitions (at least 3 out of 5) were considered for further analysis.

The linear model
In the linear model, the predicted expression (eff k ) of a gene in sample (k) was defined as: in which the variable β and the offset variable β 0 were the optimization parameters and the variable exp k represented the miRNA expression in sample k. The optimization criterion for the linear model was analogous to the piecewise linear model as described in the previous section. Gene-and miRNA expression data as well as miRNA -target gene pairs were utilized as described for the piecewise linear model. We performed 5-fold crossvalidation by splitting the sample set in 5 equally sized partitions, ran the linear model using the samples of 4 partitions and evaluated the optimized parameters β 0 and β. Subsequently the parameters β 0 and β were utilized to predict the gene expression for the samples of the left-out fifth partition. The predictive performance was estimated by calculating Pearson's correlation of measured and predicted gene expression across validation samples of all cross-validations. Only miRNA -target gene pairs with predictive performances of r ≥ 0.25 were considered for further analysis.

Enrichment analysis in miRNA transfection experiments
We used normalized miRNA expression data as provided by the authors of the original studies. If miRNA transfection experiments were performed in replicates, we combined the expression values by using the median expression value per gene. If not directly provided in the dataset, we calculated the log ratio of gene expression of the test conditions (transfection with miRNA mimics or antagomiRs) and control conditions (mock or miRNA scrambled control) to observe differential gene expression. Enrichment tests were performed using the "geneSetTest" method from the limma package [3] depending on the experimental design: negative enrichment in case of miRNA over-expression (when using miRNA mimics) and positive enrichment in case of miRNA inhibition (when using antagomiRs). Due to the small number of experiments (n=41 for colon cancer, n=36 for prostate cancer), we refrained from multiple testing correction and treated p-values of the enrichment tests smaller or equal to 0.05 as significant. It is to note that a pre-miRNA measured in the transfection experiments under different experimental conditions could have been assigned to more than one stem-loop identifier in the TCGA datasets.
These stem-loop identifiers may show distinct expression profiles and share a highly overlapping, yet different set of target genes and therefore may have yielded different prediction results from the models. For instance, the pre-miRNA hsa-miR-7 matches 3 stemloop identifiers used in the TCGA dataset: hsa-mir-7-1, hsa-mir-7-2 and hsa-mir-7-3. Therefore, we evaluated individual combinations of pre-miRNA and experimental condition from the transfection experiments and modeling data of the corresponding stem-loop miRNAs yielding 60 combinations for the colon-and 42 combinations for the prostate adenocarcinoma dataset.

Differential expression and gene set enrichment analysis of the predicted target genes Identifying differentially expressed miRNA and genes
We tested all miRNAs and genes for differential expression in each molecular colorectal subgroup by a pairwise comparison to each of all other subgroups using a two-sided Student's t-test followed by multiple testing correction with the Benjamini-Hochberg method [4]. Any tested gene or miRNA with an adjusted p-value of at most 0.05 was considered to be differentially expressed. For each of the 4 subgroups, we selected genes and miRNAs which were differentially expressed between the subgroup under investigation and at least two other subgroups.

Gene set enrichment analysis
To elucidate the biological relevance of the predicted target genes in general and specifically in the molecular subgroups of colorectal cancer defined by Guinney et al. [5], we performed gene set enrichment analysis using TopGO [6]. TopGO functions as a wrapper package to facilitate enrichment analysis using gene set definitions from Gene Ontology (GO) [7]. We restrained our analysis to GO terms related to "biological process" and performed enrichment tests on the set of target genes predicted by the combined model for each miRNA individually. As global background we included all miRNA target genes provided by the TarBase database and chose "elim" as algorithm, "fisher" as the test statistics and tested only GO terms with a minimum number of 5 assigned genes neglecting very specific gene sets. Because multiple testing correction of p-values using standard techniques would come up with only very few significant gene sets and the tests performed by the "elim" method are not independent as described in the TopGO vignette (https://www.bioconductor.org/ packages/release/bioc/vignettes/topGO/inst/doc/topGO. pdf, section 6.2), we set the significance threshold to a rather low value of 0.005 to reduce the false positive rate but did not correct for multiple testing. Typically, such enrichment tests yield large, difficult to manage list of gene sets. To overcome this, redundant GO terms were filtered using linear optimization. Redundancy between two GO terms was quantified using Jaccard similarity coefficients. First, GO term-pairs with a high degree of overlap were included in the model. In the end, at most one GO term from a pair was chosen in such a way that the overall number of non-redundant GO terms was maximized. In addition, we assigned each term to one out of 18 broader categories according to their biological function: "angiogenesis", "apoptosis", "cell adhesion", "cell cycle", "cell differentiation", "cell migration", "developmental process", "DNA repair", "extracellular matrix process", "immune system process", "localization/ transport", "metabolic process", "nervous system process", "receptor signaling", "response to stimuli", "transcription" and "translation". Terms that did not fit in any of those categories were assigned to the category "miscellaneous". To identify subgroup specific miRNAs and enriched gene sets, we filtered the identified gene sets having at least 3 genes being either up-regulated in the respective subgroup and the corresponding miRNA being down-regulated in the same molecular subgroup or vice versa.

Functional grouping of genes related to extracellular matrix remodeling
According to their molecular function in the context of extracellular matrix (ECM) remodeling, candidate genes were grouped into one of the following categories: "ECM component", "ECM degradation", "ECM synthesis", "ECM -cell signaling" and "Others". Genes that code for proteins assembling essential parts of ECM structures like collagen, laminin or fibrillin were assigned to the category "ECM component". "ECM degradation" summarizes candidate genes like matrix metallopeptidases that are directly or indirectly involved in ECM decay. Target genes that are involved in the synthesis of ECM structures or maintain its integrity like lysyl oxidase were assigned to the category "ECM synthesis". Finally, all genes that are part of direct ECM -cell interactions like integrins or are involved in ECM -cell communication like TGF-β were assigned to the category "ECM -cell signaling". Genes that did not clearly fit in one of the above categories were combined in the category "Others". It is to note that even though genes could functionally belong to more than one group, we restricted their functional grouping to the most relevant category according to the literature.

Analysis of tumor-and stromal cell content
Based on the clinical data from the TCGA colon adenocarcinoma dataset provided by the TCGA consortium [8], we analyzed tumor-and stromal cell content among the defined molecular subgroups. In detail, we used the attributes "percent_stromal_cells_ BOTTOM", "percent_stromal_cells_TOP", "percent_ tumor_cells_BOTTOM" and "percent_tumor_cells_TOP", all of which were determined by histopathologists after individual image analysis. "BOTTOM" and "TOP" refer to the bottom and top section of the tissue which has been imaged. We compared samples of one molecular colorectal subgroup with the combination of all samples from all other molecular subgroups and performed a significance test using a two-sided Student's t-test.

Correlation analysis of miRNA expression and target gene protein expression
We tested the identified miRNA -target gene pairs for potential inverse correlation of miRNA expression and protein abundance. We employed a protein data set generated by the Clinical Proteomic Tumour Analysis Consortium (CPTAC) [9] comprising quantification of 5561 proteins in 95 colon and rectal samples from the TCGA cohort whereof 49 samples were present in the TCGA colon miRNA expression data set. Normalized spectral counts were used for protein. We computed Pearson's correlation of miRNA expression and protein spectral counts for all miRNA -target gene pairs where the data was available. Statistical significance for each miRNA -target gene pair was calculated using a rankbased permutation test (n = 1,000,000). Briefly, for both data matrices the sample labels as well as the feature labels were shuffled and the Pearson' correlation coefficient was calculated for each miRNA -target gene pair. This process was repeated 1,000,000 times.

Evaluating model predictions using miRNA transfection experiments TCGA colon adenocarcinoma dataset
To confirm predictions and evaluate both models, we assembled expression data from 60 miRNA or antagomiR transfection experiments from the Gene Expression Omnibus (GEO), and tested for significant enrichment of the predicted target genes of the respective miRNA among the genes that were down-regulated by miRNA transfection (details in Supplementary 1.3 and workflow in Supplementary Figure 1). The piecewise linear model predicted miRNA target genes, which had better enrichment in experimental datasets from 26 transfection experiments, whereas the linear model predicted miRNA target genes, for which enrichment was only better in 20 experimental datasets. Combining both the piecewise linear and linear models using the union of predicted target genes, improved enrichment in 43 experimental datasets above that achieved by either model alone. No miRNA targets predicted by either model or the combination of both were significantly enriched in datasets from 14 experiments. Supplementary Table 7A gives an overview of these results together with the accession numbers from the datasets in the GEO database, used cell lines, transfected miRNAs, conditions and the p-values for each model and the combination of both models.

TCGA prostate adenocarcinoma dataset
We performed enrichment tests on target genes that were predicted by the linear model, the piecewise linear model and the combination of both using the TCGA prostate adenocarcinoma dataset. We investigated publicly available expression datasets of 42

Investigating the biological relevance of target gene predictions
To benchmark the model predictions for biological relevance, we performed gene set enrichment analysis using the predicted target genes for each miRNA from either the linear or the combined model, both performed using the TCGA colon adenocarcinoma dataset. Target gene predictions from the combined model identified 212 significantly enriched gene sets beyond those identified using the linear model alone. Among these additionally identified gene sets were "transforming growth factor beta receptor signaling pathway", "positive regulation of epithelial cell migration" and "endothelial cell migration" as potentially regulated by miR-29b-2 which has been shown in breast cancer [10], prostate cancer [11] and multiple myeloma [12].

Functional relevance of experimentally validated target genes
We briefly describe the results of our literature analysis concerning the functional relevance (in the context of extracellular matrix remodeling, tumor development and progression, metastasis and clinical outcome) of the experimentally validated target genes in colorectal cancer and other tumor entities.

Validated target genes of miR-192
FBN1 is an extracellular glycoprotein secreted by fibroblasts that forms long, elastic microfibrils as important components of the extracellular matrix. It has been shown to promote ovarian tumorigenesis and metastasis in mouse models [13]. When highly expressed, FBN1 indicates poor overall survival in ovarian cancer [13]. Inhibition of E-cadherin and β-catenin and stimulation of matrix metallopeptidase 2, 9 and 13 are further characteristics of FBN1 as described by Wang et al. [13]. PLOD1 is another important collagen crosslinking enzyme necessary for the biogenesis and stability of collagens [14]. Gilkes and colleagues reported an increased expression of PLOD1 and another gene family member, PLOD2, in breast cancer tissue compared to normal breast tissue [14]. The cell surface peroxidase PXDN is, when secreted into the extracellular space, involved in ECM formation by crosslinking collagen IV. It has been shown that silencing of PXDN leads to significant reduction of cancer cell invasion in melanoma [15]. The growth factor FGF2 is part of many cellular processes and particularly known as an inducer of angiogenesis. Knuchel and co-workers identified fibroblast-bound FGF2 as part of the FGFR-SRC-ITGAV/ITGB5 cascade that induces tumor cell adhesion to fibroblasts and tumor cell motility which was shown in 2D and 3D models of colorectal cancer cells [16]. Interestingly, Musumeci and co-workers showed an direct relation between the down-regulation of miR-15 and miR-16 and the up-regulation of their direct target genes FGF2 and FGFR1 in tumor-associated stroma cells of prostate cancer [17]. The linker protein DST is a component of hemidesmosomes [18] and connects intermediate filaments to the actin cytoskeleton and helps to form actin bundles around the nucleus.

Validated target genes of miR-200c
In leiomyoma, a benign smooth muscle cell tumor occurring in the uterus, TIMP2 and FBLN5 were identified as direct targets of miR-200c [19]. The extracellular glycoprotein FBLN5 is involved in the formation of elastic fibers and antagonizes fibronectin-mediated signaling by binding to the same integrin receptors (reviewed in [20]). FBLN5 was found to be induced by TGF-β signaling in fibroblasts and endothelial cells [21] and plays a role in initiating and enhancing epithelial-to-mesenchymal transition in normal and malignant mammary epithelial cells [22]. In the same study, FBLN5 was shown to promote the expression and activity of MMP2 and MMP9 and the growth of human 4T1 breast cancer cells in mice [22]. Contrary to its tumor promoting role, it was observed that FBLN5 can suppress the formation of metastasis in lung and liver [23]. The glycoprotein FN1 was found to be significantly down-regulated after re-expression of miR-200c in human Hec50 endometrial cancer cells [24]. By binding to the α5β1 integrin transmembrane receptor dimer, FN1 mediates cell-matrix adhesion and is directly involved in cell migration and invasion [25]. Overexpression of FN1 was observed in myofibroblasts and in epithelial cells in samples of colorectal cancer patients [26]. Elevated epithelial expression of FN1 can be an indicator of lymph node metastasis in primary colorectal cancers [27]. SERPINH1 (also referred as HSP47) is involved in the synthesis and deposition of collagen as a chaperone present in the endoplasmatic reticulum. SERPINH1 is linked to the progression of breast cancer [28], cell migration and invasion of cancer cells in cervical squamous cell carcinoma [29], and tumor growth and invasion in glioma [30]. The transcription factor ETS1 plays a role in numerous tumor types and is linked to tumor progression, invasion, metastasis and angiogenesis. It is active in cancer cells, cancer-associated fibroblasts and endothelial cells (reviewed in [31]). In particular in colorectal cancer, stromal protein levels of ETS1 were significantly associated with the formation of lung metastasis [32]. It was reported to be a direct target of miR-200c in mouse embryonic stem cells [33]. Of note, ETS1 is known to repress the transcription of the miR-192 host gene [34] and can induce the expression of KDR, another potential target gene of miR-200c.

Validated target genes of miR-17
The cytokine transforming growth factor beta 1 (TGFB1) is a well-known ligand and driver of TGF-β signaling which is central for tumor development and progression, angiogenesis, epithelial-to-mesenchymal transition and metastasis (reviewed in [35][36][37]). Tumor cell-derived TGF-β is actively promoting the transdifferentiation of fibroblasts to a myofibroblastor tumor-associated fibroblast phenotype [38]. In the course of colorectal cancer progression, the interaction of tumor cells and tumor-associated fibroblasts leads to a hyperactivation of TGF-β signaling in tumor-associated fibroblasts and an elevated secretion of TGF-β into the extracellular space whereas TGF-β signaling is lost in most cancer cells [39]. Besides other known target genes, TGF-β is known to increase its own expression [39] and the expression of MMP2 [40]. LAMC1 is a member of a large family of extracellular glycoproteins which form large polymers with other laminin isoforms and form a major constituent of the basement membrane besides collagen. LAMC1 is essential during embryonic development [41]. In the context of tumor development, it is known that LAMC1 promotes cancer cell migration and invasion in prostate cancer [11] and enhances tumor cell motility and invasion in uterine and ovarian carcinomas [42]. Furthermore, it was shown that LAMC1 is a direct target of miR-22 and miR-29a/b/c in prostate cancer [11,43] and of miR-205 in receptor triple-negative breast cancer [44].