In-depth characterization of breast cancer tumor-promoting cell transcriptome by RNA sequencing and microarrays.

Numerous studies have reported the existence of tumor-promoting cells (TPC) with self-renewal potential and a relevant role in drug resistance. However, pathways and modifications involved in the maintenance of such tumor subpopulations are still only partially understood. Sequencing-based approaches offer the opportunity for a detailed study of TPC including their transcriptome modulation. Using microarrays and RNA sequencing approaches, we compared the transcriptional profiles of parental MCF7 breast cancer cells with MCF7-derived TPC (i.e. MCFS). Data were explored using different bioinformatic approaches, and major findings were experimentally validated. The different analytical pipelines (Lifescope and Cufflinks based) yielded similar although not identical results. RNA sequencing data partially overlapped microarray results and displayed a higher dynamic range, although overall the two approaches concordantly predicted pathway modifications. Several biological functions were altered in TPC, ranging from production of inflammatory cytokines (i.e., IL-8 and MCP-1) to proliferation and response to steroid hormones. More than 300 non-coding RNAs were defined as differentially expressed, and 2,471 potential splicing events were identified. A consensus signature of genes up-regulated in TPC was derived and was found to be significantly associated with insensitivity to fulvestrant in a public breast cancer patient dataset. Overall, we obtained a detailed portrait of the transcriptome of a breast cancer TPC line, highlighted the role of non-coding RNAs and differential splicing, and identified a gene signature with a potential as a context-specific biomarker in patients receiving endocrine treatment.


INTRODUCTION
There is substantial evidence to support the presence of a subpopulation of tumor-promoting cells (TPC) in both hematologic and solid tumors with self-renewal and asymmetric division capabilities. The proposed model looks at TPC as responsible for treatment failure due to their resistance to anticancer drugs and due to the inability of the presently employed drugs to specifically target the TPC subpopulation [1][2][3][4].
Identification and enumeration of such cells is difficult, their phenotypes are poorly defined and no specific biomarker allows a clear distinction of TPC, although several markers have been proposed [5]. The assay largely recognized as the gold standard to define a population of cells with tumor-initiating ability consists of xenotransplantation of serially diluted number of cells in immunocompromised mice (i.e., NOD/SCID or NOD-scid IL2Rg null mice) [6], and an optimal tool for isolating breast TPC from clinical tumors is an in vitro functional approach (i.e., sphere formation) [7]. www.impactjournals.com/oncotarget In breast and other tumor types, much effort has been made to identify the pathways involved in maintenance of the TPC phenotype and to tackle possible TPC-specific targets with therapeutic potential. Among others, Notch [8,9] and Hedgehog pathways [10] have been suggested as central pathways for TPC maintenance. More recently, a role for NF-κB NF-kappaB-related genes [11,12] and for inflammatory cytokines [13,14] has been proposed, also linking stemness with epithelialmesenchymal transition [15,16].
Accumulating evidence in other malignancies suggests that also poorly characterized non-coding RNAs (ncRNAs) could have a role in cancer [17] and in the maintenance of a stem-like phenotype [18]. In addition, the isoform composition of the coding transcript population has been demonstrated to be important in stem cell biology [19,20] and cancer [21]. Massive RNA sequencing (RNA-seq) allows an in-depth transcriptome analysis, which includes the annotation and evaluation of differential expression for both the coding and noncoding transcripts and the identification and quantitative evaluation of alternative splicing events. This type of analysis proved to extend biological knowledge and to identify additional biomarkers [22].
We previously reported the isolation and in vitro propagation of highly tumorigenic mammospheres isolated from the MCF7 breast cancer cell line (commonly defined as MCFS) [23]. In the present study, we obtained gene expression profiles of MCFS and parental MCF7 cell lines using Illumina microarrays and SOLiD RNAseq. Different analytical approaches for RNA-seq were used and the results compared. Differentially expressed coding and non-coding RNAs, deregulated pathways and alternative splicing events were identified by specific bioinformatic approaches and validated in vitro. Finally, the significance of the TPC gene signature derived from our model was confirmed in a cohort of endocrine therapytreated breast cancer patients.

Comparison of RNA-seq and microarray signals
Transcriptome analysis of tumor-promoting mammospheres (MCFS) and of the parental breast cancer cell line MCF7 was run in triplicate on microarrays and as a single experiment using RNA-seq after linear isothermal DNA amplification. In microarray data, 9,283 probes, corresponding to the same number of genes, were retained after normalization and filtering.
Since using only one method for the analysis of RNA-seq datasets can result in a suboptimal analysis, especially when working with a cancer transcriptome, we decided to apply two different methods for the absolute quantification of gene expression after the genome mapping step, i.e., Lifetech Lifescope 2.5.1 pipeline and the TopHat/Cufflinks method (version 1.0.2), using as a common reference gene annotation the UCSC RefSeq dataset. Supplementary Table S1 summarizes the number of genes with non-zero quantification for expression values processed according to the two pipelines. Ninetyfour percent of the genes detected by Cufflinks was also identified as expressed by Lifescope, whereas the latter globally identified a higher number of genes. For each library, gene expression levels, measured as RPKM or FPKM (Reads/Fragments Per Kilobase of exon per Million fragments mapped) for each of the two experiments, were correlated, displaying good correlation coefficients (R = 0.97/0.96) ( Figure 1A).
We next correlated fold changes (FC) between MCFS and MCF7 cells obtained with the two analytical pipelines. As for RPKM and FPKM values, a good correlation was found, but the correlation coefficient (R = 0.89) was affected by large expression differences present between the two cell lines (up to 1000-fold changes). Usually, fold changes higher than 2 are already considered to be biologically relevant, but, as can be seen in Figure 1B, such an extent of modulations seems to be less reproducible when using two different processing procedures, even though starting from the same raw sequence data.
Fold changes obtained from RNA-seq were then correlated with the results elaborated from microarray data for common genes. As expected, lower correlations were obtained with RNA-seq data, independently of the pipeline used, clearly showing a higher dynamic range, in fold change terms, than microarrays ( Figure 1C). A similar platform comparison, run on CD44 + /CD24cancer stem isolated from primary ER-positive breast cancer cells [24] reported a good match between next generation transcriptome sequencing and microarrays, but no details were given on the analytical pipeline used for RNA-seq data.

Identification of enriched gene sets and functional validation of data
Microarray as well as RNA-seq expression data were subjected to a Gene Set Enrichment Analysis (GSEA) in order to provide a robust way to compare elaborated gene expression data sets obtained with different platforms and to highlight biologically meaningful pathways modulated in MCFS compared to MCF7 cells. For RNA-seq data, both data analysis procedures were considered in order to investigate their reliability by comparison with array data. For all data, genes were ordered according to fold changes and array data were ordered according to GSEA statistics. Data analyzed with Cufflinks, besides fold changes, were ordered according to the ranking suggested by the Cuffdiff statistics. Enrichment for functionally related genes was tested across a collection of 4,850 curated gene sets (C2 collection), and a summary of obtained results is reported in Supplementary Table S2. Similar results were obtained for array data ranked using either fold change or GSEA statistics. Gene quantification using Cufflinks and ranking using Cuffdiff statistics slightly outperformed the other methods in terms of number of significantly enriched gene sets and concordance with array data. Consequently, we focused on GSEA statistics ranked array data and Cuffdiff ranked RNA-seq data to elaborate a biological interpretation of the coding genes modulated in MCFS TPC.
To obtain a meaningful interpretation of the findings, our approach was to distribute the gene sets in different categories, linking them to a possible biological function. Such biofunctions, which were differentially modulated in MCFS compared with MCF7 cells are listed in Table 1, together with information of the single gene sets supporting them. The main biofunctions are reported below, together with results of validation experiments.

Proliferation
As expected, based on the growth kinetics of the two cell types, genes associated to cell cycle progression, DNA replication and proliferation were expressed at lower levels in MCFS cells. Whereas the low proliferative potential of TPC is well known, no enrichments in proliferation related -genes has been reported in other studies comparing gene expression profiles of CD44 + /CD24breast cancer stem cells with the remaining bulk tumor cells [24] suggesting that the experimental conditions chosen for MCFS enrichment may play a role.

Endocrine therapy sensitivity
Gene sets up-and down-regulated after treatment of MCF7 cells with 17ß-estradiol [25,26] were found significantly enriched in the MCF7 and MCFS phenotype, respectively (Table 1). It is noteworthy that despite a distinct regulation of genes associated with estrogenic stimulation in the parental MCF7 cell line and in the derived TPC cells ( Figure 2C), the estrogen receptor (ER) itself was expressed at comparable levels ( Figure 2A). Such a gene expression pattern suggested the acquirement of an estrogen-insensitive phenotype in the MCFS, a hypothesis that was experimentally verified. Estrogen insensitivity has already been reported in the literature for CD44 + /CD24cells purified from human tumors, which were described as ER-negative also when deriving from ER-positive tumors [27]. Treatment with 10 −8 M 17ß-estradiol for 6 days caused almost a doubling in the MCF7 cell growth rate compared to untreated cells (P = 0.033), whereas as expected based on gene expression data, estradiol had no significant effect on MCFS cell growth ( Figure 2B). Consistent with the loss of estrogen sensitivity in the MCFS cells, also treatment with the pure antiestrogen fulvestrant displayed a higher cytostatic effect in MCF7 cells than in MCFS (80% vs 30% growth inhibition, respectively). Such results suggest an insensitivity of MCFS cells to estrogenic stimulations and a limited response to treatment with antiestrogen, in agreement with impairment on estrogenic response in MCFS cells.
In order to provide a further confirmation of the impairment in ER-mediated response to estrogens in MCFS cells, we evaluated the expression levels of typically ER-related genes after exposure of the cells to MCF7 < 1e-4 < 1e-4 Endocrine Therapy Sensitivity

Epigenetic Control
Gene sets involved in altered methylation, acetylation status as a possible epigenetic mechanism of selection during tumorigenesis

Cholesterol Metabolism
Gene sets involved in cholesterol metabolism pathways

Undifferentiation
Gene sets associated with cells undifferentiation status estradiol. In agreement with the proliferative behavior of these cells in response to estrogens, also induction of the estrogen-regulated genes GREB1, PGR, CSD and TFF1 was stronger in MCF7 cells than in MCFS, with a more than two-fold difference depending on the considered gene ( Figure 2C).
In accord with literature data demonstrating that TPCs are intrinsically resistant to conventional chemotherapeutic agents and to radiotherapy [4,28,29], we provided evidence that such cells are also less sensitive to competitive ER antagonists, such as selective estrogen receptor down regulators, suggesting that the outgrowth of a subpopulation of cells with tumor-promoting properties might be responsible for hormone therapy resistance in breast cancer. The presence of ERα, the main ligandmediated transcriptional factor responsible for estrogenic effects in breast cancer, still guides the choice of endocrine treatments, although it is known to represent a better predictor for endocrine insensitivity (if negative) than an optimal sensitivity biomarker [30]. In fact, acquired, but also de novo resistance to endocrine therapy is often observed in tumors defined as ER+. Here we show new data suggesting that treatment with fulvestrant might fail due to the presence of cells with tumor-promoting characteristics such as our MCFS cells.
It is also worth to mention that paradoxically the TPC population (defined as enriched for expression of ESA + CD44 + CD24 low cells), increases in response to estradiol treatment due to paracrine stimulation by non-TPC ER-positive cells mainly through EGFR [31].

Immune response
Genes associated with immune response were expressed at higher levels by MCFS cells. In keeping with previous reports even in different TPC models [11,12], our result suggests a central role for NF-κB signaling in MCFS cells, as many pathways and genes regulated by this transcription factor were found up-regulated. Of particular note, CXCL8, whose expression is regulated by NF-κB and which is involved in self-renewal of mammospheres [32], showed higher expression in MCFS than in MCF7 cells. Therefore, using an ELISA assay we explored at the protein level release of the interleukin in the culture medium. We also validated in vitro, by the ELISA assay, the production of other cytokines (TNF and MCP-1), confirming the gene expression data ( Figure 2D).

Other biological functions
Differentially enriched gene sets also suggested a role of epigenetic mechanisms, cholesterol metabolism and growth factor (mainly epidermal growth factor) response in the maintenance of "stemness". Finally, several gene sets supported the undifferentiated state of MCFS cells (Table 1).

Identification and validation of differentially expressed ncRNAs
To identify differentially expressed IncRNAs, we compared transcripts derived from Cufflinks analysis with ncRNAs represented in the RefSeq

Biofunction
Description n° Gene sets associated Gene Sets Enriched in: FDR Array

Growth factor response
Gene sets involved in respose to different growth factors database (June 2013 release) and with those annotated in the ENCODE/GENCODE v.7 lncRNA catalog (June 2013) dataset. This way a total of 331 and 398 noncoding transcripts were respectively identified and were manually annotated in subclasses (Table 2 and  Supplementary Table S3).
Five ncRNAs were selected for experimental validation: four of them were down-regulated in MCFS compared to MCF7 cells (SNHG3, PVT1, RMST, and LINC00673), whereas the remaining one (LINC01278) was up-regulated. In addition, we chose to validate the differential expression of the long ncRNA MALAT1, Figure 2: MCFS cell are less sensitive to E 2 and fulvestrant stimulation and secrete higher quantities of IL-8 and MCP-1 compared to than MCF7 cells. A. Western blotting analysis for ERα expression. ERα protein levels were normalized to a loading control. Numbers reported below gel images were obtained by densitometric analysis and represent the relative expression level of ERα normalized to β-actin expression level. B. Effect of 17β-estradiol (top) and fulvestrant (bottom) on cell growth. Cells were exposed to 10 −8 M 17β-estradiol or fulvestrant for 6 days, and their growth was evaluated by direct cell counting. Cell growth of treated cells was expressed as percentage of that of untreated cells. Bar charts represent a mean ± CV% of 3 experimental replicates (*P < 0.05 by twotailed Student's t test). C. Relative expression of ER-related genes in response to 17β-estradiol exposure. The relative expression levels of ER-related genes were measured after 48 h of exposure to 10 −8 M 17β-estradiol by quantitative real-time PCR analysis and normalized to RPL13A expression level. Relative fold changes calculated by the ΔΔCt method refer to control cells. Bars represent mean ± SD relative fold changes, derived from 3 technical replicates. D. Quantification of secreted IL-8 (top) and MCP-1 (bottom) in conditioned media. The absolute quantity of cytokines secreted in culture media was measured by ELISA kit assays and normalized to the number of viable cells. Bars represent mean ± SD of cytokine levels (picograms per 10 5 cells) derived from 3 technical replicates. www.impactjournals.com/oncotarget previously reported to be highly expressed also in breast cancer tissue [33], as it showed a two-fold increased expression in MCFS compared to MCF7 cells by RNA-seq, but without reaching statistical significance ( Figure 3A). Real-time RT-PCR assays substantially confirmed the RNAseq results in all cases except SNHG3 ( Figure 3B), which did not show a statistically significant differential expression between the two cell lines. This might be due, at least in part, to the design of the corresponding RT-PCR assay, which specifically amplified only one transcriptional isoform of SNHG3 (i.e., NR_036473). On the contrary, the differential expression of MALAT1 was more marked in the real-time assay (about a 2.7-fold increase in MCFS compared to MCF7 cells) than in RNA-seq data.

Identification and validation of splicing events
Alternatively spliced genes were identified using a reciprocal junction analysis as implemented in the AltAnalyze software [34], starting from the exon level expression quantification and exon junction files generated by the Lifescope pipeline. A total of 2,471 reciprocal junctions showed a significant splicing score (ASPIRE > 0.2 or < 0.2). A filtering step was subsequently applied in order to identify the most robust and biologically meaningful splicing events (Supplementary Figure S1). A complete list of such filtered events is reported in Supplementary Table S4.
Among the genes that were consistently detected as displaying significant differential alternative splicing events, we selected 3 for experimental validation: the myoferlin (MYOF) exon 18 alternative splicing, the inclusion/skipping of SRFS10 exon 3, and a fusion between the VMP1 and RPS6KB1 genes. In particular, AltAnalyze predictions evidenced a significant decrease of both MYOF exon 18 and SRSF10 exon 3 inclusion events in MCFS compared to MCF7 cells, as well as an increase in fusion events between multiple RPS6KB1 exons and VMP1 exon 12 (Supplementary Table S4).
To verify the predictions, a specific measurement of the relative abundance of transcripts including or excluding the MYOF/SRSF10 candidate exon was obtained by fluorescent competitive RT-PCR (see Supplementary Methods for details).
We confirmed that inclusion of the in-frame 39-base pair (bp) exon 18 was reduced in MCFS compared to MCF7 cells (from 42% to 21%), thus generating an imbalance in the levels of the two alternatively spliced isoforms (NM_013451 -myoferlin isoform a, including exon 18; NM_133337 -myoferlin isoform b, excluding exon 18) ( Figure 4A). Structurally, myoferlin contains 6 tandem C2 domains, designated as (C2A-C2F), a central DysF domain, and a single C-terminal transmembrane region. Each C2 domain folds into an 8-stranded beta-sandwich and usually contains a calcium-binding region. The two myoferlin isoforms, a and b, differ for 13 amino acids located within  [35]. We also validated a significant reduction of SRSF10 expression in MCFS and a concomitant imbalance in the inclusion of the in-frame 363-bp exon 3, which decreased from 41% to 16% ( Figure 4B). Splicing factor SRSF10 is an atypical member of the serine/arginine-rich family of proteins that can function as a sequence-dependent splicing activator [36]. These regulatory proteins can modulate both exon activation and repression in vivo, which is likely dependent on  their binding location within the pre-mRNA. Using this strategy, the RNA-binding protein regulates splicing in the cell [37,38].
In parallel, to better characterize the predicted RPS6KB1-VMP1 fusion events, we performed RT-PCR using a reverse primer in VMP1 exon 12 and a forward primer located either in RPS6KB1 exon 1 or in exon 4. Using the forward primer in RPS6KB1 exon 4, we obtained a unique amplification product of about 150 bp, whereas using the forward primer in exon 1 we amplified two different products of about 430 and 340 bp, respectively. The direct sequencing of individual RT-PCR products allowed us to identify two different fusion transcripts, one including RPS6KB1 exons 1 to 4 and VMP1 exon 12 (fusion A), and the other connecting RPS6KB1 exons 1 and 2 to VMP1 exon 11 (fusion B)( Figure 5A-5B). The molecular model structure predictions derived from the two open reading frames reconstructed from the fusions, performed with the I-TASSER software [39], suggest that the first fusion could maintain an activity of interaction with ATP and an A-kinase activity, whereas the second fusion could result in a more homogenous multiple alphahelix structure which maintains only a generic protein amino acid binding function, as identified from GO term association ( Figure 5C-5D).
RPS6KB1 is a serine/threonine-protein kinase, an important regulator of cell size control, protein translation, and cell proliferation [40] and it is one of the best characterized downstream targets of mTOR.
Vacuole membrane protein 1 (VMP1) is a plasma membrane protein and an essential component of initial cell-cell contacts and tight junction formation. It has been described as a cancer-relevant cell cycle modulator, but the function of the protein and its mode of action in tumor progression are still unknown. Its high expression is correlated with noninvasive breast cancer cell lines [41].
Our RT-PCR data suggest that the fusion transcript (RPS6KB1-VMP1) is expressed more in MCF7 cells than in MCFS, as suggested by the AltAnalyze analysis.
Interestingly, Inaki et al. [42] found the expression of transcript fusion between RPS6KB1 and VMP1 in 70 breast primary tumors from Singaporean patients and they found that such fusion was expressed in 30% of breast cancers [42]. They found additional fusions points, including the same fusions points we found between RPS6KB1 exon 2 and VMP1 exon 11 and RPS6KB1 exon 4 and VMP1 exon 12. Such gene fusion is caused by tandem duplication of 17q23. In fact, in breast cancer this chromosomal region is frequently amplified.

Genes overexpressed in MCFS are associated with resistance to endocrine therapy in patients
One of the hypotheses suggested by the interpretation of differentially modulated coding genes was a reduced sensitivity to endocrine treatment of MCFS. To evaluate whether the gene expression modulation observed in this model could also be traced in clinical samples of breast cancer, a consensus list of 77 genes (Supplementary  Table S5) overexpressed in MCFS in both array and RNAseq data was derived. The 77-gene list was evaluated in a public dataset (GSE48905) of gene expression profiles of breast cancers from patients enrolled in the NEWEST (Neoadjuvant Endocrine Therapy for Women with Estrogen-Sensitive Tumours) trial comparing the clinical and biological activity of fulvestrant, 500 mg vs 250 mg, in the neoadjuvant setting [43]. Interestingly, our signature of genes overexpressed in MCFS was enriched in resistant tumors (stable or progressive disease) compared to responders ( Figure 6). Such data, combined with our in vitro findings, are clinically relevant since fulvestrant has been approved as a second-line therapy for patients experiencing recurrence after a tamoxifen regimen, as it lacks cross-resistance with other antiestrogens, has no agonistic activity and accelerates degradation of ERα. Our results question the utility of fulvestrant as secondline endocrine therapy and suggest that treatment with this pure antiestrogen might fail due to the outgrowth of TPC cell subpopulations.

Distinguishing between cell-line dependent and culture-condition dependent differences
Finally we asked whether the particular condition for MCFS isolation could have influenced the gene expression pattern. In fact, most studies trying to tackle transcriptome variations in TPCs compared to parental cells have employed enrichments based on the expression of selected markers [24,27,31], whereas we chose another largely used approach that is to enrich for TPC by modifying the growth conditions. As a consequence MCFS cells might be regarded as 3D cultures since they form mammospheres whereas MCF7 cells as 2D cultures growing in adhesion conditions. To evaluate the extent of this possible bias, we took advantage from Kenny and colleagues study correlating morphologies and gene expression profiles of 24 breast cancer cell lines grown in 2D versus 3D conditions [44]. We extrapolated from the study 3D versus 2D gene expression profiles for MCF7 cells only and 545 unique genes which showed a fold change larger than 2 (271 up-regulated and 274 down-regulated, Supplementary Table S7). Of our 77-gene signature (all up-regulated genes), 15 genes were up-regulated also in Kenny's study (19% overlap) and 3 were downregulated, suggesting that 3D culture might have affected gene expression, but that it was not the main driver for the differences identified in our study. Moreover, none of the key genes investigated in this study was differentially expressed in Kenny's study. When considering all the 25 cell lines included in their study, Kenny and colleagues identified 41 genes differentially expressed in 2D versus www.impactjournals.com/oncotarget 3D cells, and a statistically significant overexpression of 1 out of 8 identified Gene Ontology classes, i.e 'signal transducer activity'. These results suggest that in 3D cultures the main differences relate to regulation of signal transduction, whereas we identified many more biological functions enriched in MCFS compared to MCF7 cells. In our analysis 'Signal transduction' is not even among the main enriched biological functions and only appears to be limited to 'growth factor response' (Table 1). This is again an indirect evidence for the fact that the distinct MCFS gene expression pattern is cell line-rather than culture condition-dependent. Unfortunately, no external data are available for a similar evaluation of lncRNAs and alternative splicing findings, although a similar pattern is expected.
We also would like to emphasize that the lack of growth of MCFS cells upon stimulation with 10 -8 M E 2 is not a simple result of culture conditions since in the presence of a serum -free medium supplement (2% XerumFree TM XF205, TNC Bio BV, Eindhoven, NL), estradiol was still stimulatory for MCF7 (Supplementary Figure S2). This is a further direct evidence for the fact that it is not the culture condition that dictates the biological function, but that our TPC enrichments method does in fact select a different cell subpopulation.
In an interesting paper on primary glioblastoma [45], single-cell RNA-seq was used to collect clues on intratumoral heterogeneity in clinical tumors. In parallel, subpopulations of stem like cells (GSCs) were modeled in vitro as spherogenic cultures that initiate tumors in mice and their transcriptomes were compared to more differentiated cells expanded as adherent monolayers in serum (i.e. exactly our approach). The derived stemness consensus signature when applied to the single cell transcriptional profiles, revealed a gradient of stemness among the single cells, suggesting that in vitro models do in fact give important clues, although limiting to the phenotypic extremes of cellular states. We have not yet proven such a concept in breast, but results on bulk tumors go in this direction.

Cell cultures
The MCF7 breast cancer cell line was purchased from the American Type Culture Collection (ATCC; Manassas, VA, USA) and cultured in DMEM/F-12 (Lonza, Slough, UK) medium supplemented with 10% fetal bovine serum (Lonza). MCFS were derived from the MCF7 cell line [21] and propagated as floating mammospheres in mammary epithelial cell growth medium, an appropriate growth medium composed of MEBM (Lonza) supplemented with B27 without vitamin A (Life Technologies, Foster City, CA, USA), heparin 0.6 U/ml (Eparina Vister 5,000 U/ml), human recombinant epidermal growth factor, 20 ng/ml (Peprotech, NJ, USA) and human recombinant basic fibroblast growth factor, 10 ng/ml (Peprotech). Both cell lines were cultured at 37°C in humidified 5% CO 2 atmosphere. Cell vitality was assessed by the trypan blue exclusion assay (at least 95%) before starting any experiment. Authentication of cell lines by short tandem repeat DNA profiling analysis was performed at the Niguarda Ca' Granda Hospital in Milan.

Total RNA extraction
Cell pellets (~10 6 cells) were immediately put on ice and homogenized in 1 ml of TRIzol ® Reagent (Life Technologies); lysates were stored at −80°C for not more than 2 weeks. Total RNA was extracted following the manufacturer's instructions. Contaminating DNA was digested using recombinant DNase I (Life Technologies), and RNA samples were purified using the RNeasy MinElute Cleanup Kit (Qiagen, Germantown, Maryland, USA). Total RNA was quantified by a NanoDrop spectrophotometer and assessed for quality by the Agilent RNA 6000 Nano kit (Agilent Technologies, Santa Clara, CA, USA) using the Agilent 2100 Bioanalyzer (Agilent Technologies).
For RNA-seq experiments, total RNA from 5000 MCF7 cells and MCFS were extracted following Agencourt ® RNAdvance ® Cell v2 (Beckman Coulter, Danvers, MA, USA) protocol using Agencourt's patented SPRI ® paramagnetic bead available in the kit. After Proteinase K cell lysis and protein digestions, total RNA was separated from contaminants exploiting its binding to the magnetic particles. Manufacturer instructions were followed with the exception of the DNase treatment step, which was carried on with the Ambion ® TURBO DNA-free™ (Life Technologies). Fifty μl of a solution containing 1 μl of TURBO DNase (2 Units/μl), 5 μl of 10X TURBO DNase Buffer, and 44 μl of nuclease-free water was prepared, added directly to the bead-linked RNA and incubated at 37°C for 30 min. The addition of aqueous DNase released DNA and RNA from the beads; while DNA was digested, RNA was re-bound to the beads by adding 250 μl of Agencourt Wash Buffer. Thereafter, as in the original Agencourt protocol, RNA was eluted in 40 μl of nuclease-free water and quantified using Qubit™ RNA Assay Kits (Life Technologies); its quality was checked using the Agilent ® RNA 6000 Pico kit (Agilent Technologies, Santa Clara, CA, USA).

Microarray hybridization and analysis
Array experiments with the Illumina platform were run by the Functional Genomics Core Facility at the Fondazione IRCCS Istituto Nazionale Tumori, as previously described [46]. Briefly, RNA derived from frozen samples was processed for microarray hybridization using the Illumina BeadChips HumanRef-8 V3 kit. Total RNA (800 ng) was reverse transcribed, labeled with biotin and amplified overnight using the Illumina RNA TotalPrep Amplification kit (Life Technologies) according to the manufacturer's protocol. The biotinylated cRNA sample (1 μg) was mixed with the Hyb E1 hybridization buffer containing 37.5% (w/w) formamide and then hybridized at 58°C overnight to the Illumina BeadChips HumanRef-8 V3 (Illumina, San Diego, CA, USA). Array chips were washed with the manufacturer's E1BC solution, stained with 1 μg/ml Cy3-streptavidine (Amersham Biosciences, Buckinghamshire, UK), and eventually scanned with the Illumina BeadArray Reader.
Three replicates were profiled for both cell types. Microarray raw data where generated using the Illumina BeadStudio 3.8 software and processed using the lumi package [47] of Bioconductor. After quality control, the Robust Spline Normalization was applied, and probes with a detection P < 0.01 in at least 2 of 6 samples were selected (11406 probes). When multiple probes per gene were present, only the probe with the highest detection rate or highest interquartile range was retained (9,283 probes). Raw and processed data were deposited at the Gene Expression Omnibus data repository (GSE58383).

SOLiD library construction and sequencing
Total RNA (90 ng) was amplified, prior to transcriptome sequencing, using Ribo-SPIA ® technology developed by NuGEN following Ovation RNA-seq (2009, NuGEN ® Technologies, San Carlos, CA, USA). Briefly, first-strand cDNA was prepared from total RNA using unique primers that hybridize either to the 5′ portion of the poly(A) sequence or randomly across the transcript. The resulting mRNA within the cDNA/mRNA complex was fragmented, and DNA was amplified using a linear isothermal DNA process developed by NuGEN named SPIA ® . The post-SPIA modification process completed the amplification step, producing targets appropriate for SOLiD library preparation. The amplified cDNA was purified prior to subsequent processing for library construction using RNA clean purification magnetic beads (Beckman Coulter Genomics), as suggested in the protocol. Double-stranded DNA was quantified with Qubit™ DNA BR Assay Kits (Life Technologies), and its size distribution was checked using the Agilent ® RNA Nano kit (Agilent) following the manufacturer's instructions.
DNA libraries, one for each sample, were constructed following the SOLiD™ 3 Plus System Library Preparation Guide (Life Technologies) manufacturer's instructions. One μg of cDNA was diluted in 100 μl in 1X low TE buffer and transferred in a Covaris™ microTUBE. DNA was sheared using the following Covaris S2 System conditions (Covaris, Woburn, MA, USA): 10 cycles 60 s each, 10% duty cycle, 5 intensity and 100 cycles/burst. Sheared samples were first end-repaired, columns purified and ligated to specific Solid adaptors containing P1 and P2 sequences. DNA was size-selected using a SOLiD™ Library Size Selection gel run in the E-Gel ® iBase™ system: the 150-200 bp region was collected.
The nick-translated adaptor-ligated DNA was amplified using Library PCR Primer P1 and P2 with 6 cycles, and after SOLiD™ Library Column purification, the yield and size distribution of the libraries were checked using the Agilent ® DNA 1000 Kit (Agilent Technologies). Four PCR emulsions, two for each library, were manually performed following the Applied Biosystems SOLiD™ 3 Plus System Templated Bead Preparation Guide, according to the manufacturer's instructions (Life Technologies).
The library and emulsion qualities were checked in a workflow analysis run following the SOLiD™ 3 Plus System Instrument Operation Guide (Life Technologies).
Sequencing was done using standard fragment settings on the SOLiD™ Systems V3 plus according to SOLiD™ 3 Plus System Instrument Operation Guide protocol (Life Technologies). At least 150 M of tags for each library, 50 bp long, distributed in 2 quad each, were sequenced.

RNA-seq data processing
Sequence reads in SOLiD Color Space format were mapped to the UCSC repeat-masked hg19 reference genome and analyzed for transcript representation in RPKM and splice site-fusion representation with the Whole Transcriptome Analysis module of the Lifetech Lifescope 2.5.1 analysis software and ad hoc created perl scripts. The resulting genome alignment files in .bam format were used to originate the fastq files corresponding to the aligned sequence reads and for further analysis with the TopHat/Cufflink/Cuffdiff whole transcriptome analysis pipeline.

Gene set enrichment analysis
Enrichment analysis in mRNA expression data was performed using GSEA (v. 2.0) [48]. The C2 collection (v. 3.1) containing 4,850 gene sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts, was tested for enrichment on both microarray and RNA-seq data. Microarray data were ranked according to default signal to noise GSEA metrics or according to fold change. RNAseq data were ranked according to fold change or, for the Seqsolve processed data, according with the Cuffdiff statistics. Only gene sets for which a number of genes > 15 and < 300 was found in the data were tested. Gene sets with a false discovery rate of < 1% were considered significantly enriched.

Alternative splicing analysis
To identify splicing events in MCFS compared with the parental MCF7 cells, AltAnalyze software [34] was used, starting from the exon-level quantification and exon junction tables generated by the Lifescope pipeline. A reciprocal junction analysis, which identifies pairs of exon-exon junctions differentially expressed in opposite directions in the two cell lines, was performed. Such a method is reported to be very accurate at detecting true alternative splicing events. Statistical significance was assessed using the ASPIRE score, and events with ASPIRE > 0.2 or < 0.2 were considered significant. A combination of filtering criteria was applied to the more than 2000 significant events in order to select a more limited number of biologically relevant elements. The filtering procedure is summarized in Supplementary Figure S1.
First of all, we selected events occurring in genes expressed at relatively high levels, i.e., RPKM > 5 in MCFS and MCF7 cells. Then, we distinguished between events with positive or negative ASPIRE values. In exon junctions, positive ASPIRE values correspond to a down-regulation of junction 1 and upregulation of junction 2, and vice versa for ASPIRE negative values. The subsequent criterion was to have a value higher than 15 in the sample were the junction was up-regulated. For events in unchanging genes, we also selected events where fold change value was higher than 3.

CONCLUSIONS
We performed a detailed study of TPC transcriptome using microarrays and RNA-seq. Informative RNAseq data were derived starting from small input RNA, making the approach applicable to various scenarios where a limited amount of material is available. Some of our findings were already reported as being crucial in TPC, although using different approaches to isolate such a tumor subpopulation, therefore supporting the validity of our model. Of note was the up-regulation of the NF-κB pathway, IL-8 and other inflammatory cytokines. Our TPC also showed less responsiveness to endocrine treatment, and, interestingly, genes over-expressed in MCFS cells were found to be up-regulated also in clinical tumors resistant to fulvestrant treatment, suggesting that they might represent a new putative predictive marker of hormone-treatment resistance. Finally, RNA-seq analysis suggested an involvement of several ncRNAs and differential splicing events in maintenance of the TPC phenotype.
We are aware that it is not yet clear if the putative stem cell components derived from established cell lines is a valid model for TPC and that experimental conditions might be very critical, however, i) the confirmation of some earlier literature data [27,31], ii) the in vitro validation of the hypothesis generated by the data and iii) the successful identification of a gene signature predicting response to fulvestrant obtained in this study, give additional strength to other still to validate findings and to the model itself.