Single-cell transcriptomes reveal the mechanism for a breast cancer prognostic gene panel

The clinical benefits of the MammaPrint® signature for breast cancer is well documented; however, how these genes are related to cell cycle perturbation have not been well determined. Our single-cell transcriptome mapping (algorithm) provides details into the fine perturbation of all individual genes during a cell cycle, providing a view of the cell-cycle-phase specific landscape of any given human genes. Specifically, we identified that 38 out of the 70 (54%) MammaPrint® signature genes are perturbated to a specific phase of the cell cycle. The MammaPrint® signature panel derived its clinical prognosis power from measuring the cell cycle activity of specific breast cancer samples. Such cell cycle phase index of the MammaPrint® signature suggested that measurement of the cell cycle index from tumors could be developed into a prognosis tool for various types of cancer beyond breast cancer, potentially improving therapy through targeting a specific phase of the cell cycle of cancer cells.


INTRODUCTION
Breast cancer remains one of the most devastating diseases worldwide. Traditionally, estrogen receptor (ER), progesterone receptor (PR) and Her2 have been used to classify breast cancer into three subtypes: positive for hormonal receptors (ER+), Her2+, and triple negative (TNBC) without these receptors. While ER+ tumors are treated with selective ER modulators or aromatase inhibitors and Her2+ tumors are treated with antibodies Research Paper www.oncotarget.com against this receptor [1], TNBC tumors do not have a specific treatment, prompting the search for innovative approaches. A genome-wide association study (GWAS) revealed the tumor necrosis factor superfamily member 13B (TNFSF13B) gene, suggesting new target-specific therapies [2]. Translation of these transcriptome results to clinical practice is underway. The clinical significance of a breast-cancer-associated, 70-gene signature panel (a.k.a., MammaPrint ® signature) was first reported as a clinical tool to help refine prognosis in the NEJM fourteen years ago [2], as implemented by Agendia BV, Amsterdam, The Netherlands for scale-up clinical trials. Initially, van 't Veer and colleagues screened a total of 25,000 genes of using microarray technology for supervised clustering on mRNA expression of 70 genes and van 't Veer et al. validated for either a low or high risk of patients. The 70-genes assay was adopted in the US for clinical practice since van 't Veer moved to University of California, San Francisco. Based on the 70-gene MammaPrint ® signature, the same group reported a follow-up NEJM article, showing that "no chemotherapy led to a 5-year rate of survival without distant metastasis that was 1.5% lower than the rate with chemotherapy", with 1550 patients (23.2%) at high clinical risk and low genomic risk for recurrence, out of a randomized Phase 3 study with 6693 enrolled early-stage breast cancer patients [3]. This suggests that approximately 46% of women at high clinical risk may not need chemotherapy. Monitoring the MammaPrint ® 70-gene signature can guide the treatment. However, these genes were selected empirically from breast cancer cases through time. It is not clear why these genes have predictive power and whether such a panel can be applied to other types of cancers.
Here, we report a new algorithm to cluster genes that share the same cell cycle phase (i.e., G 0 , G 1 , S, or G 2 ) based on a spectrum of single-cell transcriptomes from a cell-cycle model system. This algorithm allows cells to be sorted into subpopulations of sharing the same cellcycle phases. We inferred a possible mechanism by which predictive power of MammaPrint ® signature predicts its clinical outcomes for breast cancer.

RESULTS
We defined phase-specific, cell-cycle-dependent single-cell transcriptomes using the model system -Fucci cells, which have fluorescent cell-cycle phase-specific indicators. We obtained single-cell transcriptomes from these Fucci cells with our microfluidic platform with nanoliter reactors [5]. Combining these two technologies allowed for the characterization of a cell cycle phasespecific map using a similarity matrix (algorithm) based on known cell cycle genes (GO:0022402). We used this algorithm to create a novel cell cycle map of known cell cycle genes in the corresponding sequential order ( Figure 1). As expected, known cell cycle genes had expression perturbation profiles that agreed with previously reported studies of physical cell lysates. In addition to known cell cycle genes, genes indicated by the Self-Organizing Map (SOM) analysis were also plotted onto the cell cycle map to identify novel candidate cell cycle genes, termed cell cycle index.
We applied this algorithm to assess the cell cycle activity of the MammaPrint ® 70-gene signature [4] to create a cell-cycle index for cell-cycle-phase-specific mapping as generated from single-cell transcriptomes. In addition to the previously reported 15 cell cycle-related genes [5,6], our strategy revealed 23 additional cell cycleassociated genes among the 70 MammaPrint ® genes. Among the 23 newly identified cell cycle-related genes, we identified 15 genes regulating G1 phase ( Figure 2B), 5 genes regulating S-phase ( Figure 2C), and 3 genes regulating G2 phase (Figure 2A). More importantly, these cell cycle specific genes are associated with clinical outcomes, as judged with current database of breast cancer patients' consequences in multiple reports and clinical trials, including cancer recurrence (Table 1), cancer pathological stage (Table 2), and primary versus metastatic disease (Table 3).

DISCUSSION
Current commercially available assays include the MammaPrint ® , OncotypeDX (the 21-gene Recurrence Score, with reverse transcriptase PCR (RT-PCR), PAM50/ Prosigna (on the expression of 46 genes using quantitative PCR (qPCR), by NanoString Technologies, Seattle, WA), Endopredict (use the expression of 8 cancer-related and 3 reference genes determined with RT-PCR, by Myriad Genetics Inc, Salt Lake City, UT), of which four criteria is considered: "assay development and methodology, clinical validation, clinical utility and economic value" [7]. Although level IA clinical trial results show MammaPrint ® and OncotypeDX for prognostic information, clinical utility studies show a higher reduction in chemotherapy was achieved only by OncotypeDX. The inconsistency may be derived from inter-tumor and intra-tumor heterogeneity as these results were based on assays on bulky tumors. We have recently shown that a cancer relapse signaling pathway can be detected only in a single-cell transcriptome, not in a bulky tumor analysis, Figure 1: Sequential perturbations of cell-cycle-specific genes in a single-cell model system. After organizing single-cell transcriptomes by similarity into a sequencing order, expression levels of various cell-cycle-specific genes were plotted to visualize the sequential perturbation of individual genes during the cell cycle. Cell cycle phases were defined and colored based on the cell cycle molecular map. As expected, G0/G1-specific genes had higher expression levels in the G0/G1 phase (A) and G2/M-specific genes had high expression levels in the G2/M phase (B). G2/M-specific genes had high expression levels in the G2/M phase and the early G0/G1 phase (C). Note: the numbers along the outside circle (#1 -29) represent the cell cycle phase: #1-#15 for G1-phase; #16-#22, S-phase; #23-#29, G2/M-phase. The number on the vertical scale radiating from the center represents the level of gene expression with the center representing 0, the lowest, scaling up to the outer circle, the highest. as bulky tumor transcriptomic analyses obscure the low signal of certain biomarkers [8]. We rationalized that single-cell transcriptomic analysis can enhance signalto-noise ratio over bulky tumor transcriptomic analysis. Here, we discovered that a total of 38 genes (23 of them were revealed within the present work) measured with the MammaPrint ® assay are cell cycle regulated. We conclude that the MammaPrint ® assay would average out perturbations in gene expression when applied to a mixture of cells and thus remain undetected when using bulky tumors. We further illustrate that the advantage of our finding would make it possible (1) to attack the respective gene products using a combination therapy and (2) to deliver the right drugs to cells of the respective cell cycle stage.
More specifically, the results of these singlecell transcriptomes provide details relating to define perturbation of critical genes during the cell cycle. Such concentrating on cell cycle related genes reveals a possible mechanism for the greatest prognostic power of MammaPrint ® . Our results, which have identified an additional 23 cell cycle-related genes, bring the total  We propose that these results can be applied to focus potential strategies to create more efficacious, targeted combinations of drugs, specifically addressing the appropriate cells in the correct cell cycle stage [9,10]. We suggest this cell cycle stage index (i.e., mapping out a spectrum of a specific phase of the cell cycle for a particular single cell) can be applied in cancer treatment ( Figure 3) by monitoring subclonal growth through determination of switchboard signals (i.e., cytokines and chemokines) which may transform cells from dormancy to dominating subclones [11]. Clinical applications of cocktails of targeting a specific phase of the cell cycle can be applied to modulate the subclonal evolution for prognosis, thereby guiding the timing of therapeutic intervention. The 38-cycle-related genes in the MammaPrint ® 70-gene panel, as confirmed by clinical specimens (Tables 1, 2, 3), shed new light on subclonal evolution of common cancer treatment resistance, as certain cell-cycle stages, such G0, show resistance to treatment. This suggests that management of cell cycle stages can be therapeutic effective as it is tumor-subclone-specific, which is cell-cycle-dependent ( Figure 3). Such single-cell knowledge of categorizing distinct subpopulations of an organ may serve as a starting point to reconstruct the origins for the different subtypes of any organ-specific disease [12]. Single-cell identity may serve as a resource to map out the defined changes occurring during disease onset, potential to improve methods of detection and treatment. Next generation sequencing technology enables single-cell mRNA sequencing (scRNAseq) to create a high-resolution "molecular census of recognition" that was previously unseen cellular differences [13]. Isolated from prostate tumor and associated cells, for example, Calcinotto and colleagues identified two IL-23-regulated proteins (STAT3 and RORγ) that support androgen-receptorassociated tumor growth [14] in myeloid-derived suppressor cells (MDSCs) (including monocytes and neutrophils) at an immature state [15]. From insights into this immature state, novel therapies of either antibody (blocks IL-23) or enzalutamide (androgen-receptor inhibitor) were employed to shrink the tumors [16].
To secure efficacy of cell cycle-based therapy, physicians need to track down the spatiotemporal changes of each cell within breast tissue at molecular, cellular and tissue levels. Technologies (next generation sequencing technology, single-cell mRNA sequencing for singlecell gene signatures, and positioning imaging system for biomarkers [17]) can capture the spatiotemporal information that defines the cell-by-cell origins of breast cancer, in its earliest phases that acquire genetic alterations, potential for early cancer detection, leading to cancer prevention, slowing cancer progression or early management for co-survival with cancer [11], before it turns into a life-threatening cancer burden [18]. Such non- invasive imaging-based analysis of spatiotemporal geneexpression patterns [17] in single cells may shed new light onto the developmental processes that lead to "wait-andwatch" approaches [18], by defining "therapeutic windows" [19], with multiple check points of the full spectrum of cellular diversity that exists in the human body, in cancer initiation, progression and metastasis.
Cell cycle gene regulation plays a critical role not only in repair after tissue injury (e.g., stem cell-mediated regeneration), but also in cancer surveillance, and more recently has been employed as a tool in immunotherapybased "wait-and-watch" approaches to cancer [18,20]. Cell cycle gene regulation is tightly regulated by essential kinases such as members of the CDK family and their effectors. Monitoring cell cycle activity-including phenotyping immune cell and cancer stem cell subsets, tracking cell proliferation, and measuring cytokine production-can provide insights into the overall status of cancer in patients, and identify those cell populations which manage to survive cancer treatments. However, conventional gene expression profiling, which relies on population averaging of millions of heterogenous cells (malignant vs. normal cells), cannot realistically be expected to detect fine perturbations in gene expression that are precisely regulated on both temporal and spatial scales. Consequently, expression perturbations of many cell-cycle-related genes may remain masked in crude cell lysate analysis. The expression patterns of these cell-cycle genes as predicted by this algorithm remain to be validated in clinical trials for prognosis of other types of cancers. Conclusion: With single-cell transcriptome analysis, we reveal that the previously unknown predictive power of MammaPrint ® signature panel arises from the cell cycle genes it profiles, and that the expression patterns of these cell cycle genes can be used for prognosis of other types of cancers.

MATERIALS AND METHODS
Cell culture and single-cell cDNA synthesis H9 human embryonic stem cells (WA09, WiCell Research Institute, Inc.) were maintained with a feederfree protocol as previously described [21,22]. HeLa F.
(RIKEN Cell Bank RCB2812) cells (Fucci cells) [23] were grown under standard conditions and cultured in Advanced DMEM (Invitrogen, Carlsbad, California), 1% Fetal Bovine Serum (HyClone, South Logan, Utah) and 0.5% penicillin-streptomycin (Invitrogen, Carlsbad, California). Cells were trypsinized using TrypLE (GIBCO) and resuspended in 1X phosphate-based buffer (PBS). Singlecell encapsulation was performed through hydrodynamic flow controlled by syringe pumps connected to a customfabricated microfluidic device as described previously [24]. The single-cell encapsulation, cDNA synthesis and amplifications in a nanoliter scale were carried out as previously reported [25,26] Briefly, single cells were encapsulated into 600 pico-liter droplets with T-junction microfluidic devices or laser cavitation devices [27]. These droplets then were manipulated into 10 nano-liter After organizing single-cell transcriptomes by similarity into a sequential order (centerclustering), expression levels of various cell-cycle-phase-specific genes were plotted to visualize the sequential perturbation of individual genes during the cell cycle, a virtual time series. Expression levels were scaled from 0 (undetectable) to 1 (maximum expression). Cell cycle phases were defined and colored. As expected, G0/G1-specific genes had higher expression levels in the G0/G1 phase and an S-specific gene was mainly expressed within the S phase. G2/M-specific genes had high expression levels in G2/M phase and early G0/G1 phase. The sequential expression order suggests that mRNAs of many G2/M-specific genes are not degraded until early in G0/G1 phase after cell division. (B, C) Cancer subclones are defined by single-cell transcriptome-clustered cell cycle gene clustering (all cells in a subclone share and stay at one cell-cycle stage, which is used to guide treatment. The effective therapeutics drive the number of tumor subclones decrease while the number of tumor subclones at relapse increase. reactors for reverse transcription reaction and diluted to 10 μl. The resulting products were then stored at −20° C until they were used for quantitative PCR or microarray gene expression profiling (Illumina). Samples were prepared for PCR amplification and subsequent microarray analysis by adding preamplification master mix (Illumina, San Diego, CA) as described previously [25]. For comparison, a Single-Cell Preamplification Kit (Life-Technologies, CA) was also used for cDNA synthesis and subsequently for qPCR verification.

RNA-seq analysis
For RNA-seq samples, trimmed reads were mapped onto human genome hg38 using Tophat 2.0.8 as implemented in Flow with default settings, using Gencode 20 annotation (www.gencodegenes.org) as guidance. Gencode 20 annotation was used to quantify aligned reads to genes/transcripts using the Partek E/M method. [28] Read counts per gene in all samples were normalized using Upper Quartile normalization [29] and analyzed for differential expression using Partek's Gene Specific Analysis method (genes with less than 10 reads in any sample were excluded). To generate a list of significantly differentially expressed genes among different tissues of the same patient, a cutoff of FDR adjusted p < 0.05 (Poisson regression) and fold change >|2| was applied.

Gene expression normalization and QC
Illumina gene expression data from each sample was processed and normalized independently of the other tissue/sample types. Analyses were restricted to genes significantly expressed in all single HeLa cells and lysates at a nominal p-value of 0.01, yielding 2,181 significant expression features out of 29,377 annotated probes on the array. For assessing agreement between single-cell data and lysate data, both were processed using a log 2 transformation followed by quantile normalization. To maximize the signal-to-noise ratio for fine mapping of single cells to the cell cycle, a more comprehensive approach was taken. The Lumi R package [30] was used to employ a variance-stabilizing transformation [31] that utilizes technical replicates on the array followed by Robust Spline Normalization. Extreme outliers across single cells were identified on a probe-by-probe basis and the values were set to 'missing' if they were more extreme than three interquartile ranges away from the first or third quartiles. However, no more than one outlier exceeded this criterion per probe. Missing data were then inputted using a nearest-neighbor averaging method.

Statistical analysis
In cells that are approximately homogeneous with respect to factors other than their cell cycle stages, expression of cell cycle-related genes were used to determine the position in the cell cycle of each cell in relation to others. That is, similarity between cells in expression data was used to reflect similarity in cell cycle stage. We included in this analysis not only expressed genes (as described above), but also genes for which existing literature provided us with some evidence of their involvement in the cell cycle. Specifically, the inclusion criteria required that genes be classified in the GO category cell cycle process (GO:0022402). Among the many approaches that have been applied to clustering genes based on microarray expression patterns, Ray et al. [32] applied a traveling salesman problem (TSP) solver called Concorde [33] to obtain a one-dimensional ordering of genes within clusters. We used Concorde to estimate the shortest Hamiltonian path (a variation of the TSP where the path does not end at the starting position) based on Euclidean distance through the single cells (rather than genes). This approach required the construction of an adjacency matrix for single cells, which was generated using a network-based approach usually applied to the identification of co-expressed modules of genes [34].
The cell cycle status of individual cells forms a perturbation map, i.e., a diagram of how expression of a gene is perturbed as a cell passes through the cell cycle. This perturbation map was used with the Self-Organizing Maps (SOM) [35] approach to identify clusters of genes with the same perturbation signature. Genes with perturbation patterns similar to known cell cycle genes were inferred to be new candidate cell cycle genes.

Author contributions
SCL and JFZ conceived of the study and drafted the manuscript, and all other authors participated in its experimentation and revision of the manuscript. All authors read and approved the final manuscript.