Challenges in using liquid biopsies for gene expression profiling

Circulating tumor cells (CTCs) have potential utility as a surrogate biomarker of tumor biology via a liquid biopsy. The aim of this study was to evaluate if the nCounter NanoString assay could be used for accurate gene expression profiling of CTCs using the PAM50 research-use-only CodeSet. Analysis was performed on CTCs isolated by the ANGLE Parsortix system from healthy blood spiked with the breast cancer cell lines Hs578T, SkBr3, MDA-MB-231 or MCF7. Using cell lines as gold standard positive controls and Parsortix processed blood without spiking (unspiked) as negative controls, we found an average of 12 significantly differentially expressed genes among spiked samples versus unspiked controls. We validated our findings with the NanoStringDiff differential expression statistical method. The NanoString recommended targeted pre-amplification introduced false positive results due to pre-amplification bias, and the amplification of non-cancer genes from normal leukocytes confounded gene expression profiling of CTCs. Pre-amplification bias is a concern for other similar assays that may be used as discovery tools or target validation of transcripts of interest in gene expression profiling of CTCs. We recommend the use of an unspiked negative control when evaluating CTC technologies regarding gene expression profiling. Given that the molecular profiling of CTCs as a liquid biopsy may have clinical ramifications for potential treatment selection in future clinical trials, our study emphasizes cautious consideration of pre-analytical variables such as amplification bias in the context of liquid biopsy studies.


Oncotarget, 2018, Vol. 9, (No. 6), pp: 7036-7053. Research Paper. www.impactjournals.com/oncotarget/

INTRODUCTION
Over the last several years, research on circulating tumor cells (CTCs) has captured great attention as a potential "liquid biopsy" with recognized potential utility in cancer diagnosis, prognosis, and personalized treatment of patients with a variety of solid tumors [1]. Since CTCs are constantly released from the primary tumor and the metastatic sites into the blood circulation, their enumeration and molecular characterization provide useful "real time" information about tumor biology in the context of disease progression [2,3]. Furthermore, published studies have shown the scope of molecular characterization of CTCs at different levels (DNA, RNA and protein), and of functional in vitro and in vivo assays, in understanding tumor biology and the metastatic process. Similarly, CTC-based clinical studies have provided information about micrometastatic disease in early stages of cancer, patient prognosis, acquisition of resistance to therapy, risk of metastasis and disease progression. Additionally, studies also support the clinical relevance of monitoring CTC counts during treatment to evaluate treatment efficacy, and of patient stratification to guide new therapy decisions and treatment management [1,2,4-6].
Given the low number of CTCs, cell enrichment strategies are necessary to select and capture rare CTCs from a background of millions of blood cells per milliliter. Regardless of the enrichment method, major obstacles such as CTC purity and CTC rarity pose additional challenges for CTC identification and CTC molecular characterization based on next generation sequencing, targeted multi-marker gene expression analysis by qPCR, and targeted multi-marker hybridization techniques such as the nCounter NanoString assay [7]. The CellSearch CTC assay, the only currently United States Food and Drug Administration approved system, enriches CTCs based on the detection and identification of the epithelial cell adhesion molecule (EpCAM) and pan-cytokeratins in the tumor cell membrane and cytoplasm. However, this system has limited utility enriching CTCs with low EpCAM expression, and CTC populations are recovered along with considerable amounts of contaminating leukocytes [8]. Various blood-cell-depletion techniques based on antibody selection against CD45 and cell filtration have been tested for research applications [9], but the extensive processing time and laborious sample preparations result in low CTC recovery, possible changes in gene expression due to microenvironmental conditions or assay manipulations, and reduced cell viability [10][11][12]. Furthermore, newly developed laboratory platforms based on cell size and cell deformability permit capturing CTCs even if cells have undergone epithelial-to-mesenchymal transition (EMT), while collecting CTCs suitable for both molecular and in vivo studies. A promising technology, the ANGLE Parsortix cell separation system is an epitope-independent microfluidic platform that captures and recovers viable, intact CTCs from clinical samples. The Parsortix system pumps the blood sample through a disposable cassette formed by a stepped gradient that narrows to a critical gap of 10μm. 
CTCs are captured at this critical gap, but smaller cells (erythrocytes and most leukocytes) pass through. Harvesting is achieved by applying a gentle reverse flow that allows for the recovery of CTCs for analysis. Various studies have shown the Parsortix system's versatility in capturing and harvesting CTCs suitable for in vitro culture and molecular analysis by techniques such as genomic hybridization, qPCR and RNA-seq [13][14][15][16].
The rapid decline in sequencing costs, and the advent of multiplexed gene expression assays have made it possible to perform targeted and comprehensive molecular characterization of CTCs by various molecular assays [7,13,17,18]. The nCounter NanoString gene expression assay captures and counts individual mRNA transcripts using a highly sensitive multiplexed approach [19]. The Prosigna assay (NanoString Technologies, Seattle, WA) is an in vitro diagnostic platform utilizing the NanoString technology which has been accepted into certain international treatment guidelines based on clinical validation of the prognostic significance of this gene expression based assay [20]. The NanoString PAM50 assay has been optimized to classify tumor material from formalin-fixed, paraffin embedded (FFPE) breast cancer as intrinsic subtypes that can serve as a prognostic indicator for late distant recurrence-free events (>5 years after treatment) in early stage hormone positive breast cancer patients [21,22]. The nCounter NanoString PAM50 analysis system additionally allows the direct detection, in a single reaction, of the quantity of transcripts from 50 specific endogenous genes and from 8 different reference genes using a pair of mRNA target-specific multiplexed CodeSet probes (capture and reporter probes). After hybridizing the target sample with the probe pairs, the abundance of mRNA transcripts is digitally counted by the number of times a color-coded probe is detected. Counts are subsequently used to measure gene expression levels [19,23].
Although this multiplexed 50-gene test has been validated on FFPE breast tumor tissue, in the present study we aimed to evaluate the feasibility and accuracy of the NanoString nCounter platform for CTC gene expression analysis using the PAM50 Research-Use-Only CodeSet following the standard nCounter single cell gene expression protocol.

ANGLE Parsortix system allows CTC recovery
As a proof-of-principle experiment, we determined the CTC enrichment capability of the ANGLE Parsortix system by measuring the CTC recovery rate, that is, the number of harvested CTCs after processing SkBr3 spiked blood samples (these samples were not subjected to NanoString PAM50 analysis). The CTC recovery rate was calculated from the enumeration of EpCAM+ SkBr3 cells by flow cytometry. Potential leukocyte contamination was also assessed by quantitating the number of endogenous CD45+ white blood cells (WBCs) carried over into the harvest. FACS results indicated that the Parsortix system recovered an average of 5 EpCAM+ cancer cells (20-35%) from blood samples initially spiked with 20 SkBr3 cells (Figure 1). Although it has been previously shown that the Parsortix system recovers about 40-80% of CTCs [8,14], the relatively low number of identified EpCAM+ captured cells may be explained by heterogeneity in EpCAM expression among SkBr3 cells, as well as cell loss during the antibody staining, washing and FACS process. Average leukocyte carryover was 2342 CD45+ cells (range 542-5174 cells) among all spiked samples and unspiked controls. Furthermore, we also spiked MDA-MB-231 GFP expressing cells in healthy donor blood, and found that the CTC capture efficiency of the Parsortix system (measured as the number of GFP positive cells trapped in the critical gap of the Parsortix cell separation cassette) ranged between 61-75% (data not shown).
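The recovery and carryover figures above reduce to simple ratios. The following is a minimal sketch of that arithmetic; the function names are illustrative and not part of the study, and the purity figure simply combines the average recovered cell number with the average CD45+ carryover reported above.

```python
def recovery_rate(recovered_cells: int, spiked_cells: int) -> float:
    """Percentage of spiked tumor cells recovered after enrichment."""
    if spiked_cells <= 0:
        raise ValueError("spiked_cells must be positive")
    return 100.0 * recovered_cells / spiked_cells


def harvest_purity(tumor_cells: int, leukocytes: int) -> float:
    """Percentage of tumor cells among all harvested cells."""
    total = tumor_cells + leukocytes
    if total <= 0:
        raise ValueError("harvest contains no cells")
    return 100.0 * tumor_cells / total


# Averages reported in the text: 5 EpCAM+ cells recovered from 20 spiked
# SkBr3 cells, with an average carryover of 2342 CD45+ leukocytes.
print(recovery_rate(5, 20))                # 25.0
print(round(harvest_purity(5, 2342), 2))   # 0.21
```

Under these averages the harvested material would be well under 1% tumor cells, which illustrates why leukocyte background matters for the downstream expression analysis.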

Targeted gene pre-amplification is required for NanoString PAM50 assay
Next, we sought to test if multiple target enrichment (MTE) is a necessary step when performing NanoString analysis on Parsortix harvested material. Therefore, we created five scenarios in which different ratios of FACS sorted SkBr3 EpCAM+ cells and CD45+ cells were mixed and lysed. Gene expression analysis by NanoString PAM50 was performed on both pre-amplified (MTE) and non-pre-amplified (non-MTE) products from the same cell lysate. Results in Figure 2 show very low numbers of transcript copies, below average background cut-off level (14 counts, range between 10-22 counts) on non-MTE samples compared to MTE samples (average background cut-off of 32 counts, range between 18-54 counts). These results indicate that in order to obtain robust PAM50 gene signal on Parsortix CTC harvests, the recommended target pre-amplification is required. Therefore, we performed MTE amplification on the Parsortix harvests from all unspiked controls and from all spiked samples included in this study.

nSolver gene expression analysis of CTC mimic samples using NanoString PAM50 assays
To identify a specific CTC PAM50 gene signature among blood samples spiked with breast cancer cell lines, we performed a differential NanoString PAM50 gene expression analysis to discriminate the gene expression profiles of paired blood samples: spiked and unspiked.
nCounter NanoString PAM50 data were normalized using nSolver Data Analysis Software. Intriguingly, we observed differences in the expression of reference genes among spiked samples and unspiked controls. Among the 7 NanoString PAM50 reference genes included in the MTE primer pool (Table 1), only PSMC4, RPLP0, SF3A1 and MRPL19 were constitutively expressed among all spiked samples, unspiked controls and cell line controls. Therefore, for further differential gene expression analysis we re-normalized NanoString PAM50 counts against those 4 reference genes. By performing a hierarchical clustering of the re-normalized transcript counts, we found poor distinction between spiked samples and unspiked controls, but clear distinction between unspiked samples and bulk peripheral blood (PB) control samples (Figure 3A). Similarly, the principal component analysis shown in Figure 3B revealed a wide distribution among spiked samples and unspiked controls, but clear correlation among PB controls. These results suggest that the NanoString PAM50 CodeSet panel targets not only cancer specific genes, but also gene transcripts expressed by normal blood cells, which confounds the gene expression of enriched CTCs when compared to control samples.
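The re-normalization against the four stably expressed reference genes can be sketched as follows. This is an illustration only: it assumes a geometric-mean scaling factor per sample, which is a common approach for nCounter reference-gene normalization but is not stated in the text as the exact nSolver calculation, and all counts below are hypothetical.

```python
import math

# The four reference genes reported as constitutively expressed.
STABLE_REFERENCE_GENES = ["PSMC4", "RPLP0", "SF3A1", "MRPL19"]


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive counts")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalize_to_reference(samples, reference_genes=STABLE_REFERENCE_GENES):
    """Scale each sample so its reference-gene geometric mean matches
    the average geometric mean across all samples (an assumed scheme).

    samples: dict mapping sample name -> dict of gene -> raw count.
    """
    ref_means = {
        name: geometric_mean([counts[g] for g in reference_genes])
        for name, counts in samples.items()
    }
    target = sum(ref_means.values()) / len(ref_means)
    return {
        name: {g: c * target / ref_means[name] for g, c in counts.items()}
        for name, counts in samples.items()
    }


# Hypothetical counts: sample "b" has twice the input of sample "a".
raw = {
    "a": {"PSMC4": 100, "RPLP0": 100, "SF3A1": 100, "MRPL19": 100, "ERBB2": 50},
    "b": {"PSMC4": 200, "RPLP0": 200, "SF3A1": 200, "MRPL19": 200, "ERBB2": 100},
}
normalized = normalize_to_reference(raw)
print(normalized["a"]["ERBB2"], normalized["b"]["ERBB2"])
```

After scaling, the two hypothetical samples report the same ERBB2 level, which is the intended effect of correcting for sample input with reference genes.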
Additionally, we compared gene expression between bulk PB controls (RNA extracted from blood, but not subjected to MTE) and unspiked blood controls (subjected to MTE). We observed relatively high expression of some genes in unspiked controls compared to bulk PB controls (Figure 4). Moreover, genes such as ANLN, BIRC5, CCNE1, CDH3, EGFR, KRT5, MAPT, and MMP11 were expressed in unspiked samples but not in bulk PB controls, suggesting that pre-amplification during MTE cycles may produce false positive gene expression results. Since the MTE step was performed only on the NanoString input material, the performance of the internal negative controls used to establish the background expression threshold was not altered. In fact, similar cut-off values were obtained, as shown in Figure 4. On ANOVA testing, by comparing the mean log of normalized counts among all bulk PB controls versus all unspiked samples, we found that 41/49 (83.7%) of the PAM50 genes were significantly different (p<0.001).
Furthermore, transcript counts for individual genes between bulk PB controls and unspiked samples among different donors did not reflect the linear unbiased target pre-amplification expected after the MTE process. We observed inconsistent transcript-count fold changes between unspiked controls and bulk PB controls among genes and among healthy donors (Table 2). For example, by comparing fold change values for reference genes among all donors, we observed that some of them, such as PSMC4, PUM1, RPLP0 and SF3A1, were over-amplified in donor 6 compared to those in other donor samples (fold changes ranging from 0 to 102). We noticed similar discordant fold-change results among the 8 genes with opposite expression between unspiked samples and bulk PB controls (Table 2). These discrepancies not only support our finding of inconsistent, biased amplification during the MTE step, but also reflect normal patient-to-patient heterogeneity. Furthermore, a non-template control (PBS) subjected to MTE in three technical replicates showed expression of two genes: KRT17 and MYC.
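The fold-change comparison described above can be illustrated with a short sketch. It assumes that fold change is the ratio of the unspiked (MTE) count to the bulk PB (non-MTE) count for the same gene and donor; the helper names and the counts are hypothetical. Under unbiased amplification, fold changes would be similar across genes, so a large spread is one simple indicator of bias.

```python
def fold_changes(unspiked_counts, bulk_pb_counts):
    """Per-gene ratio of unspiked (MTE) to bulk PB (non-MTE) counts.

    Genes with a zero or missing bulk PB count are skipped because the
    ratio is undefined for them.
    """
    return {
        gene: unspiked_counts[gene] / bulk_pb_counts[gene]
        for gene in unspiked_counts
        if bulk_pb_counts.get(gene, 0) > 0
    }


def fold_change_spread(changes):
    """Ratio of the largest to the smallest non-zero fold change."""
    values = [v for v in changes.values() if v > 0]
    if not values:
        raise ValueError("no non-zero fold changes to compare")
    return max(values) / min(values)


# Hypothetical counts for one donor (not taken from Table 2).
unspiked = {"GENE_A": 200, "GENE_B": 50, "GENE_C": 10}
bulk_pb = {"GENE_A": 100, "GENE_B": 100, "GENE_C": 0}

changes = fold_changes(unspiked, bulk_pb)
print(changes)                      # {'GENE_A': 2.0, 'GENE_B': 0.5}
print(fold_change_spread(changes))  # 4.0
```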
Conversely, when performing a global comparison of NanoString PAM50 gene expression among spiked samples with the corresponding cell line control ( samples only MIA were found to be similarly underexpressed among spiked samples and cell line controls. In contrast, we found differences in gene expression in almost half of the PAM50 genes between spiked samples and controls (Figure 5). In particular, genes such as CDCA1, CDH3, ESR1, EXO1, FOXA1, GPR160, KNTC2, KRT17, MLPH, MYBL2, PGR, PTTG1, RRM2, SLC39A6, TYMS, and UBE2C showed opposite calls (expression versus no expression signal) between samples spiked with each cell line and the corresponding bulk cell line. These findings, in conjunction with the difficulty in obtaining ultra-pure CTC enriched populations, raise the concern that amplified expression from background leukocytes may affect nCounter transcript counts, and they underscore the importance of selecting genes that are not expressed in normal peripheral blood cells.

Differential gene expression in breast cancer cell-spiked samples
To determine a unique set of genes differentially expressed among spiked samples that could be pinpointed as a tumor specific gene signature, we performed a gene expression selection based on the background cut-off value, using the corresponding unspiked samples and cell lines as negative and positive controls, respectively. As shown in Figure 6, genes were classified in three main groups: (i) genes expressed only in spiked samples and in cell line controls but not in unspiked controls (green dots); (ii) genes expressed in all spiked samples, unspiked controls and cell lines, but with stronger expression in spiked samples compared to the unspiked control (purple dots), where stronger expression was defined as 20 or more transcript counts over the unspiked control; and (iii) genes expressed in neither spiked samples nor unspiked and cell line controls (no dots). Based on this classification, we found an average of 12 genes either uniquely expressed or highly expressed in ≥50% of the 6 samples spiked with the different cell lines. Specifically, Figure 6 shows that 18, 9, 13, and 12 genes (gray bars) for Hs578T, SkBr3, MDA-MB-231 and MCF-7, respectively, were considered as potential cell line specific genes. This demonstrated that there is high variability among donors in NanoString gene expression of both Parsortix CTC mimic samples (spiked samples) and Parsortix processed negative expression controls (unspiked).
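The three-group classification and the ≥50% selection rule can be expressed as a small sketch. It assumes that "expressed" means a count strictly above the background cut-off, which the text does not spell out; the function names, labels and counts are hypothetical.

```python
from collections import Counter


def classify_gene(spiked, unspiked, cell_line, background, min_excess=20):
    """Assign a gene to one of the groups described in the text."""
    spiked_on = spiked > background
    unspiked_on = unspiked > background
    cell_line_on = cell_line > background
    if spiked_on and cell_line_on and not unspiked_on:
        return "unique"          # group (i)
    if spiked_on and unspiked_on and cell_line_on \
            and spiked - unspiked >= min_excess:
        return "stronger"        # group (ii): >= 20 counts over unspiked
    if not (spiked_on or unspiked_on or cell_line_on):
        return "not_expressed"   # group (iii)
    return "unclassified"        # pattern outside the three groups


def candidate_genes(calls_per_sample, min_fraction=0.5):
    """Genes called 'unique' or 'stronger' in at least min_fraction of samples."""
    hits = Counter()
    for calls in calls_per_sample:
        for gene, label in calls.items():
            if label in ("unique", "stronger"):
                hits[gene] += 1
    needed = min_fraction * len(calls_per_sample)
    return sorted(g for g, n in hits.items() if n >= needed)


# Hypothetical calls for 6 spiked samples of one cell line.
calls = [
    {"GENE_A": "unique", "GENE_B": "stronger"},
    {"GENE_A": "stronger", "GENE_B": "unclassified"},
    {"GENE_A": "unique", "GENE_B": "stronger"},
    {"GENE_A": "unclassified", "GENE_B": "not_expressed"},
    {"GENE_A": "unclassified", "GENE_B": "unclassified"},
    {"GENE_A": "not_expressed", "GENE_B": "unclassified"},
]
print(candidate_genes(calls))  # ['GENE_A']
```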
Since those differentially expressed genes correspond to a very small fraction of the genes included in the NanoString PAM50 panel, our results indicated that the data normalization and the background correction performed by the nSolver Analysis software during gene expression analysis of nCounter data may have limited utility pinpointing differentially expressed genes that correspond to unique tumor cell gene signature in CTC mimic samples versus leukocyte background.
Furthermore, we evaluated the potential diagnostic accuracy of the NanoString PAM50 on our breast cancer CTC mimic model for future use as a validation technique for other assays such as RNA-seq and microarrays. By comparing PAM50 gene expression results from spiked samples to those from the corresponding cell line gold standard control, we were able to test the ability of the NanoString PAM50 to predict and to classify Parsortix spiked samples as true CTC specimens. The low sensitivity (<50%) and specificity (<75%) in detecting cell line gene signatures in spiked samples, in conjunction with the relatively high false positive rate (60-80%) shown in Table 3, indicate that leukocyte gene amplification and the over-amplification of some genes after MTE confound CTC gene expression, posing major obstacles for using NanoString PAM50 in Parsortix CTC enriched populations to monitor tumor biology in breast cancer. Additionally, we performed pairwise correlation and linear regression analyses on the PAM50 gene expression between cell line controls and bulk cell line RNA. Since bulk cell line RNA was not subjected to MTE (100ng of extracted RNA as NanoString input material), as shown in Figure 7, we used its NanoString PAM50 gene signature as a gene expression 'gold standard'. As indicated by the scatter plots and the coefficients of determination in Figure 8, gene expression correlated poorly between MTE cell line controls and non-MTE 'gold standard' cell line RNA, with R² no higher than 0.62 (R² = 0.49, 0.46, 0.62, and 0.38 for Hs578T, SkBr3, MDA-231, and MCF-7 controls, respectively).
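For reference, the accuracy measures named above can be computed from per-gene expression calls as in the sketch below. It uses the standard textbook definitions of sensitivity, specificity and false positive rate, which are assumed here and may differ from the exact calculation behind Table 3; the gene calls are hypothetical.

```python
def diagnostic_metrics(sample_calls, gold_calls):
    """Compare per-gene expressed/not-expressed calls with a gold standard.

    Both arguments map gene -> bool (True means called as expressed).
    Returns sensitivity, specificity and false positive rate as fractions,
    or NaN when the corresponding denominator is zero.
    """
    tp = fp = tn = fn = 0
    for gene, truth in gold_calls.items():
        called = sample_calls.get(gene, False)
        if truth and called:
            tp += 1
        elif truth and not called:
            fn += 1
        elif not truth and called:
            fp += 1
        else:
            tn += 1
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else nan,
    }


# Hypothetical calls: gold standard = bulk cell line, sample = spiked harvest.
gold = {"G1": True, "G2": True, "G3": True, "G4": True, "G5": False, "G6": False}
spiked = {"G1": True, "G2": False, "G3": False, "G4": False, "G5": True, "G6": False}

metrics = diagnostic_metrics(spiked, gold)
print(metrics)
```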

Differential gene expression validation using NanoStringDiff statistical method
To validate our differential gene expression analysis, we used the open source statistical platform NanoStringDiff [24]. Following the described protocol, we used the NanoStringDiff script to identify and/or validate differentially expressed genes among spiked samples. Briefly, this method assumes a negative binomial-based model to fit the discrete nature of the nCounter data and corrects for platform source of variation, sample content variation and background noise. Additionally, q-values are determined as a refined level of In Table 4, we show all p and q-values for the NanoStringDiff comparison analysis between each pair of spiked sample and unspiked control per donor. We found that only 8, 2, 21, and 17 genes (bold font) were statistically significantly expressed in Hs578T, SkBr3, MDA-231, and MCF-7 spiked samples, respectively. By performing the NanoStringDiff statistical method, we could validate some of the differentially expressed genes previously found after nCounter normalization and gene expression analysis. From the differentially expressed genes shown in Figure 6, NanoStringDiff validated 5 out of 18 genes among Hs578T spiked samples; 11 out of 13 genes in MDA-231 spiked samples; and 7 out of 12 of the genes among MCF-7 samples. Differentially expressed genes in spiked samples confirmed by both nCounter and NanoStringDiff analyses are highlighted by the gray boxes in Table 4. Although no standard method has been proposed for differential gene expression analysis on nCounter data from liquid biopsies, NanoStringDiff may be considered as an alternative statistical method allowing for multivariate analysis to select independent genes that define the outcome of sample type. However, we emphasize the need for a standardized methodology for gene expression analysis of nCounter NanoString PAM50 data when used for targeted transcriptome profiling of CTC or as validation technique of more comprehensive analysis such as whole transcriptome sequencing RNA-seq. 
tumor cells served as CTC mimics (spiked samples, n=24; blood from donors who donated more than 2 tubes of blood was spiked with more than one cell line), and other samples without added tumor cells (unspiked controls, n=24) were used as CTC negative controls. RNA from cultured cell lines (0.1ng of extracted RNA) was used as a positive control for gene expression for spiked MTE samples, and bulk RNA from the same cell lines (100ng) was used as the 'gold standard' for NanoString PAM50 gene expression analysis. Bulk RNA from peripheral blood controls (no Parsortix processing) was also used as a negative control for gene expression.

Circulating tumor cells have enormous potential as a liquid biopsy of the real-time status of tumor biology without a priori knowledge of what tumor markers may be present. This very fact allows CTCs to serve as a potential target discovery tool but leaves the approach vulnerable to misinterpretation due to confounding by background peripheral blood leukocytes. The objective of our study was to evaluate the feasibility and accuracy of using the NanoString PAM50 Research-Use-Only code set for gene expression profiling of breast cancer CTCs isolated using the ANGLE Parsortix system. While we have optimized a workflow for transcriptomic profiling of rare CTCs using the Parsortix system, the data in this report indicate that the NanoString PAM50 panel has limited utility in differentiating CTC mimics from leukocyte background. In our hands, capture rates for spiked cell line CTC mimics ranged from 61-75% with the Parsortix system, which is similar to other investigators' reports [8,14]. Our group previously reported that CTC recovery rates using EpCAM based selection with FACS sorting were highly dependent on the intrinsic subtype of the cell line tested, with claudin-low cells showing the lowest recovery rates [17,18]. Based on those findings and the lack of a priori knowledge of specific surface markers when processing breast cancer liquid biopsies, we sought to use an epitope independent CTC enrichment platform to overcome the limitation in isolating claudin-low and/or EpCAM negative cells. The Parsortix system is advantageous over FACS in that it does not rely on affinity-based selection and permits harvesting isolated CTCs within 90 minutes of blood draw via a bench top assay, which is advantageous both logistically and in terms of RNA quality of harvested cells. The Parsortix isolates could then be used to obtain single cells using other technologies; however, this would add hours to the process and would no doubt degrade the RNA.
The Parsortix harvests are not ultrapure populations of CTCs; therefore, it is important to consider the possibility of overlap in gene expression between cancer cells and leukocytes for genes of interest empirically known to be relevant to breast cancer. Therefore, when working with enriched but not ultra-pure CTC samples, amplified gene expression of background leukocytes may influence  Background subtraction or normalization to PB controls is not possible in the context of NanoString PAM50 MTE since it produces inconsistent false positive results. The NanoString PAM50 code set has been shown by others to be useful for classifying breast cancers into intrinsic subtypes and in generating risk of recurrence (ROR) scores [23,25]. The NanoString single cell protocol calls for a multiple target enrichment step that is probe specific and involves limited amplification of cDNA (fewer than 16 cycles) to attempt to avoid amplification bias and allow for multiplexed hybridization based profiling of even single cells. The target cDNA molecules bind to barcode and capture probes, resulting in detectable fluorescent spots [26]. The number and type of each fluorophore barcode is counted [27]. NanoString assays without MTE were previously reported to have an assay sensitivity of less than one copy of mRNA per cell, which is superior to microarrays and on par with qPCR in sensitivity [28]. One of the main advantages of NanoString in general is that it does not typically require conversion of RNA to cDNA or MTE; however, those steps are essential when performing rare cell profiling such as for characterizing CTCs. Although PCR based amplification prior to barcode capture is necessary to obtain robust read counts, our findings indicate that it may introduce false positive results and errors that generate sequence changes not present in the original sample (primer dimer, DNA polymerase error, etc.) [29].
Therefore we emphasize that additional calibration/normalization steps and correction factors for pre-amplification bias are essential for single cell gene expression analysis of nCounter NanoString data. Moreover, as demonstrated in this paper, the inconsistent non-linear target amplification during MTE steps produced false positive results hindering the use of unspiked control gene expression as background (to eliminate leukocyte gene signal) that could be subtracted from spiked samples when evaluating for a unique CTC gene expression signature based on the spiked samples. This has relevance to potential future use in clinical trials to ensure that patients are not incorrectly classified by gene expression given the observed biases.
In this report, we utilized two types of negative controls: bulk, unsorted peripheral blood from healthy donors, and a second tube of blood from each donor that was processed on the Parsortix to control for the effect of size based microfluidic selection on leukocyte background gene expression (unspiked samples). CTC spike-in mimics were also compared to a positive control of known cell lines. Had we not included the unspiked control samples in our study, we would have erroneously concluded that our principal component analysis of NanoString PAM50 data provided conclusive evidence that CTCs could be separated from peripheral blood in terms of gene expression profiling with the PAM50 (Figure 3). Spiked and unspiked Parsortix processed specimens clustered more closely together than with either peripheral blood or cell line controls. Multiple target enrichment clearly influenced transcript count results by consistently yielding false positive results (Figures 4 and 5). We did not observe these same issues with false positives in our ongoing RNA-seq studies of breast cancer samples (data not shown), which argues against the alternate hypothesis that it is the Parsortix system's microfluidic filtration that introduces bias to gene expression. Moreover, the lack of correlation in gene expression between the positive controls (cell line and 'gold standard' controls) cautions against performing targeted PCR amplification prior to gene expression analysis unless a method is validated not to introduce high false positive calls. This study emphasizes the importance of either selecting genes that are not expressed at all in peripheral blood or performing a careful background subtraction or normalization procedure that considers the peripheral blood gene expression signature unique to each patient, given that considerable heterogeneity is present both in CTCs and in peripheral blood specimens from patient to patient.
In actual cancer patients, no positive control exists to reliably distinguish signal from noise.
We conclude that the NanoString recommended targeted pre-amplification of the PAM50 Research-Use-Only code set introduced false positive results due to pre-amplification bias, and that the amplification of non-cancer genes from normal leukocytes confounded gene expression profiling of CTCs. Pre-amplification bias is a concern for other similar assays that may be used as discovery tools or for target validation of transcripts of interest in gene expression profiling of CTCs. We recommend the use of an unspiked negative control when evaluating CTC technologies regarding gene expression profiling.
The ANGLE Parsortix has broad potential applications in CTC research given that it is capable of rapidly capturing CTCs without a requirement for affinity based marker selection. Our workflow for whole transcriptome RNA-seq profiling with the Parsortix is the basis for ongoing clinical trials at our institution. However, the current study suggests that target validation with the NanoString PAM50 is not a viable option for CTC research.

Blood specimens and cell lines
By standard venipuncture procedure, two blood samples of 7.5ml each from 19 different cancer-free female donors were collected in EDTA tubes (Becton Dickinson, Franklin Lakes, NJ). Donors were selected based on the inclusion criteria established in the approved IRB protocol. Adhering to institutional HIPAA regulations, female donors between 25 and 50 years of age signed informed consent, and a research numerical identifier was assigned to each participant to keep demographic information blinded from the researchers.
One tube of blood was used as a CTC mimic by spiking in the basal-like/triple-negative breast cancer cell lines Hs578T or MDA-MB-231, the HER2 amplified cell line SkBr3, or the luminal A cell line MCF7. Equal numbers of blood samples (n=6) were spiked with each cell line. Some donors consented to repeated blood draws; therefore, blood from some donors was spiked with more than one cell line. These CTC mimic specimens were termed spiked samples. The spiking-in process was performed within 20 minutes after blood collection. The other blood specimens were used as non-CTC or negative controls, termed unspiked controls (Figure 7).
All cell lines were tested for Mycoplasma contamination using the MycoAlert Mycoplasma detection kit (Lonza, Walkersville, MD), and subjected to short tandem repeat genotyping for cell line authentication at the University of Arizona Genetics Core.

CTC enrichment
Both spiked samples and unspiked controls were subjected to CTC enrichment using the ANGLE Parsortix System (ANGLE plc, Guildford, UK). Parsortix cell harvests were recovered in 200μl of 2% bovine serum albumin (BSA) (Sigma-Aldrich, St. Louis, MO) in PBS. Recovered cells were centrifuged at 400 x g for 5 minutes at ambient temperature, and the cell pellet was lysed in 5μl of the Prelude Direct Lysis Module (NuGEN, San Carlos, CA). Cell lysates were stored at -80°C.

Enumeration of harvested CTCs by flow cytometry
To 7.5 ml of peripheral blood we spiked 20 SkBr3 cells and performed CTC enrichment using the Parsortix system as described above. CTC harvested material was incubated with 1μl of FITC mouse anti-human EpCAM IgG (BD Biosciences, San Jose, CA) and 1μl of PE-Cy7 mouse anti-human CD45 IgG (BD Biosciences, San Jose, CA) on ice for 30 minutes. Unbound antibodies were removed by washing twice with 2ml of wash buffer (2% BSA in PBS). Cells were resuspended in 200μl of wash buffer and analyzed using a FACS Aria II Cytometer (BD Biosciences, San Jose, CA). Cells were sorted and collected in 5μl of cell lysis buffer, and stored at -80°C.

NanoString PAM50 nCounter single cell gene expression assay
NanoString nCounter gene expression assay has been previously described [19,23]. Briefly, NanoString PAM50 nCounter assay is a hybridization technique that quantitatively measures the number of mRNA transcripts using a pair of target-specific multiplexed CodeSet probes (capture and reporter probes) (NanoString Technologies, Seattle, WA). The capture probe consists of a 5' to 3', 30-50-base target specific sequence following two 15-base sequences common to all capture probes, and a biotin affinity tag. The reporter probe consists of a 3' to 5', 30-50-base target specific sequence that hybridizes near to the capture probe-complementary-site, followed by 4 tandem repeats of 15 bases common to all reporter probes, and a color-coded barcode DNA/RNA hybrid molecule labeled with a fluorescent dye. After hybridizing the capture and reporter probes to the target mRNA, those mRNA/reporter/capture probe complexes were captured and immobilized via a streptavidin-biotin linkage to the surface of an nCounter cartridge. Cartridges were transferred to the nCounter Digital Analyzer (NanoString Technologies, Seattle, WA) for imaging and data collection. The expression level of a gene was measured by counting the number of times a specific fluorescent barcode molecule was detected. Counts were tabulated, exported and analyzed using internal hybridization negative and positive controls.
NanoString PAM50 targeted single cell gene expression assays were performed on CTC-enriched cell lysates from both spiked samples and unspiked controls. The NanoString standard protocol for single cell gene expression analysis consisted of a two-step process: cDNA conversion and multiplexed target enrichment (MTE). Initially, input RNA material (5μl of cell lysate) was converted to cDNA by reverse transcription using 1.5μl of the SuperScript VILO Master Mix (Invitrogen, Carlsbad, CA). The reaction was incubated at 25°C for 10 minutes, followed by 42°C for 60 minutes and 85°C for 5 minutes in a T-100 Thermal Cycler (BioRad, Hercules, CA). The resulting cDNA was PCR amplified using 7.5μl of TaqMan PreAmp Master Mix (Applied Biosystems, Foster City, CA) and 1μl of a pool of PAM50 MTE primers (Table 2). The amplification program consisted of a denaturation step at 94°C for 10 minutes, followed by 14 MTE cycles of 94°C for 15 seconds and 60°C for 4 minutes. The MTE primer pool allowed specific linear amplification of the 56 genes included in the NanoString PAM50 CodeSet.
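For readers who want to program or audit the thermal cycler steps, the two incubation programs described above can be written down as plain data. The following is only a minimal sketch: the variable and function names are hypothetical, and the totals count hold times only, on the assumption that ramp times between temperatures are ignored.

```python
# Thermal-cycler programs transcribed from the protocol text.
# Each step is (temperature in degrees C, hold time in seconds).
# All names here are illustrative, not part of any instrument software.
RT_STEPS = [(25, 10 * 60), (42, 60 * 60), (85, 5 * 60)]  # cDNA conversion
MTE_DENATURE = (94, 10 * 60)                             # initial denaturation
MTE_CYCLE = [(94, 15), (60, 4 * 60)]                     # one MTE cycle
MTE_CYCLES = 14


def hold_seconds(steps):
    """Sum the hold times of a list of (temperature, seconds) steps."""
    return sum(seconds for _, seconds in steps)


# Hold time only; instrument ramp times are NOT included (assumption).
rt_total = hold_seconds(RT_STEPS)
mte_total = MTE_DENATURE[1] + MTE_CYCLES * hold_seconds(MTE_CYCLE)

print(f"Reverse transcription: {rt_total / 60:.1f} min")
print(f"MTE pre-amplification: {mte_total / 60:.1f} min")
```

Under these assumptions the reverse transcription holds sum to 75 minutes and the MTE program to 69.5 minutes.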

Positive and negative controls for NanoString PAM50 gene expression
As NanoString PAM50 single cell gene expression positive controls, RNA from all cell lines was included. The cell line RNA extraction process included cell lysis using TRIzol (Invitrogen, Carlsbad, CA), RNA precipitation with isopropanol (Sigma-Aldrich, St. Louis, MO), and RNA wash with 70% ethanol (Sigma-Aldrich, St. Louis, MO). Isolated RNA was eluted in RNase-free water, and both quantity and purity were assessed by spectrophotometry using a NanoDrop 2000 (Thermo Scientific, Carlsbad, CA). A total of 0.1ng of cell line RNA was used for the NanoString PAM50 steps of cDNA synthesis and MTE, under conditions similar to those used for the spiked samples and the unspiked controls. Additionally, as a 'gold standard' gene expression positive control, we used 100ng of RNA from the cell lines. Neither the cDNA conversion nor the MTE step was performed on the 'gold standard' control.
Peripheral blood (PB) RNA extracted from each unspiked blood specimen was included as negative gene expression control. Immediately after blood draw, 200μl of peripheral blood was stabilized by adding 1ml of RNAlater solution (Ambion, Carlsbad, CA), and stored at -80°C. PB RNA was isolated using the RiboPure-Blood Kit (Ambion, Carlsbad, CA), and 100ng of it was used as input material for the NanoString PAM50 assay. These specimens were termed bulk peripheral blood controls (bulk PB controls) (Figure 7).
Due to the high sensitivity and high specificity of the NanoString PAM50 platform in FFPE, assays on both samples and controls were performed a single time, without technical replicates, per manufacturer's recommendations even for rare samples.

Data normalization and differential gene expression analysis
nCounter data was analyzed using NanoString nSolver Analysis Software v2.5. Following NanoString's recommended protocol, raw count data normalization included: (i) quality control measurements for imaging quality and sample binding saturation using hybridization positive controls, (ii) elimination of experimental variability using reference genes, and (iii) per-sample background correction by setting a gene expression threshold or cut-off value of 2 standard deviations over the mean of the internal hybridization negative controls. Log2-transformed data was used for ANOVA testing to compare mean gene expression between all unspiked controls and all peripheral blood RNA samples. Differential gene expression analysis between spiked samples and unspiked controls was initially performed by manual comparison of the background-corrected Log2 gene expression values per gene. Results were subsequently validated using the NanoStringDiff statistical method for gene expression analysis described by Wang et al. [24]. ANOVA testing was done using GraphPad Prism 7 (La Jolla, CA).
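The per-sample background correction in step (iii) and the subsequent Log2 transformation can be illustrated with a short sketch. This is not the nSolver implementation. It assumes the sample standard deviation is used and that counts below the cut-off are raised to the cut-off rather than subtracted or discarded. The function names and all counts are hypothetical.

```python
import math
import statistics


def background_threshold(neg_counts, n_sd=2.0):
    """Cut-off = mean + n_sd * SD of one sample's negative-control counts.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    return statistics.mean(neg_counts) + n_sd * statistics.stdev(neg_counts)


def background_correct_log2(gene_counts, neg_counts, n_sd=2.0):
    """Floor counts at the background cut-off, then Log2-transform.

    Assumption: counts below the cut-off are set to the cut-off itself.
    The floor is kept at >= 1 so the logarithm is always defined.
    """
    threshold = background_threshold(neg_counts, n_sd)
    floor = max(threshold, 1.0)
    log2_values = {
        gene: math.log2(max(count, floor))
        for gene, count in gene_counts.items()
    }
    return log2_values, threshold


# Hypothetical counts for a single sample (illustration only, not study data).
negative_controls = [4, 6, 8, 10, 12, 8]
genes = {"ERBB2": 512, "ESR1": 5}

log2_expr, cutoff = background_correct_log2(genes, negative_controls)
print(f"cut-off = {cutoff:.2f}")
for gene, value in log2_expr.items():
    print(f"{gene}: {value:.2f}")
```

In this toy example a gene whose count falls below the cut-off takes the Log2 of the cut-off itself, so a spiked sample and an unspiked control that are both at background give identical values for that gene in a manual comparison.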