Integrated landscape of copy number variation and RNA expression associated with nodal metastasis in invasive ductal breast carcinoma

Background Lymph node metastasis (NM) in breast cancer is a clinical predictor of patient outcomes, but how its genetic underpinnings contribute to aggressive phenotypes is unclear. Our objective was to create the first landscape analysis of CNV-associated NM in ductal breast cancer. To assess the role of copy number variations (CNVs) in NM, we compared CNVs and/or associated mRNA expression in primary tumors of patients with NM to those without metastasis. Results We found CNV loss in chromosomes 1, 3, 9, 18, and 19 and gains in chromosomes 5, 8, 12, 14, 16-17, and 20 that were associated with NM and replicated in both databases. In primary tumors, per-gene CNVs associated with NM were ten times more frequent than mRNA expression; however, there were few CNV-driven changes in mRNA expression that differed by nodal status. Overlapping regions of CNV changes and mRNA expression were evident for the CTAGE5 gene. In 8q12, 11q13-14, 20q1, and 17q14-24 regions, there were gene-specific gains in CNV-driven mRNA expression associated with NM. Methods Data on CNV and mRNA expression from the TCGA and the METABRIC consortium of breast ductal carcinoma were utilized to identify CNV-based features associated with NM. Within each dataset, associations were compared across omic platforms to identify CNV-driven variations in gene expression. Only replications across both datasets were considered as determinants of NM. Conclusions Gains in CTAGE5, NDUFC2, EIF4EBP1, and PSCA genes and their expression may aid in early diagnosis of metastatic breast carcinoma and have potential as therapeutic targets.

In order to examine the effect of CNV on difference in lymph node metastasis, we employed a case-control design. A case was defined as a patient having invasive ductal carcinoma (IDC) and lymph node metastasis at the time of sample collection and diagnosis. All controls were IDC patients with metastasis-free lymph nodes at the time of diagnosis. All statistical analysis was carried out using a set of patients (n = 772) from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC). The same approach was then carried out in a second large set (n = 650) from The Cancer Genome Atlas (TCGA).
Clinical and omic samples for METABRIC were collected only from breast cancer patients. Data comes from 5 separate sources in the EU and Canada; Cambridge Breast Unit at Addenbrooke's Hospital (Cambridge), Guy's Hospital (London) and Nottingham University City Hospital, the Tumor Bank of British Columbia (Vancouver) and the Manitoba Tumor Bank. METABRIC's sample omics feature copy number variation (Affymetrix SNP 6.0) and expression (Illumina HT 12 array) platforms. METABRIC data was provided by Synapse training dataset in the Breast Cancer Challenge (https://www.synapse. org/#!Synapse:syn1688369/wiki/27311). A total of 772 METABRIC patient samples were included in this analysis. TCGA data is comprised of 650 tumor samples from women with invasive breast carcinoma. Omic information used from TCGA included CNV (Affymetrix SNP 6.0) and mRNA (Illumina HiSeq). Level 3 TCGA RNA Seq files used (rsem.genes.results) measure raw expression signal for a gene. Copy number files used (nocnv.seg) were normalized to remove germline CNV. Since METABRIC array data uses human genome 18 (hg18) as annotation, all downloads were hg18. Institutional Review Board approval was obtained for METABRIC Synapse data. TCGA level 3 data is publicly available.
Inclusion criteria and variable selection: In order to qualify for analysis, samples needed to be exclusively invasive ductal carcinoma. This means that not only were all other histologies (e.g. lobular, colloid, tubular) excluded, but also all non-invasive in-situ tumors were left out. Since METABRIC includes a large portion of stage 0 (in-situ), all samples with missing stage were also excluded as a precaution against non-invasive samples.
The European Society of Medical Oncology (ESMO) staging [1] uses tumor size, nodal involvement, and presence of distant metastasis (TNM classification [2]) to categorize breast tumors. ESMO stage 0 indicates a non-invasive, localized yet still malignant tissue sample. All patients were female, with no history of prior malignancy or neoadjuvant treatment. Samples with missing information on NM outcome were excluded from analysis. Unless indicated otherwise, variables with over 10% missing values were also excluded from analysis. Since the METABRIC variable of stage was only available for half of the dataset, we chose to exclude all unstaged participants in an effort to avoid possible misclassification of noninvasive tumors.

Clinical variables
The response variable of lymph node metastasis (NM) was defined for both datasets using the previously mentioned TNM pathologic staging for lymph nodes (N). All TNM N values of 0 were considered controls (NM 0). Any N values greater than 0 were considered cases of NM.
Non-omic data from METABRIC includes age at diagnosis, tumor size at largest dimension, grade, stage, histological type, treatment received, menopausal status, lymph nodes positive, total lymph nodes removed cellularity, and Nottingham Prognostic Index (NPI). In addition, receptor status for estrogen, progesterone, and HER2 was assayed using two methods: immunohistochemistry (for 40%-60% of samples) and expression (100% of samples) [3]. Stage and TNM staging variables were recorded according to the AJCC 7th Edition guidelines [4]. Other clinical and pathologic data include: history of previous malignancy, neoadjuvant therapy given, method of diagnosis, surgical procedure, total lymph nodes examined, total lymph nodes positive for metastasis, histologic diagnosis, menopause status, and age at diagnosis. Receptor status was measured by immunohistochemistry for estrogen, progesterone and HER2 receptors. See Table 1 in paper for a full overview of clinical variables shared between both datasets.

Copy number measures
METABRIC and TCGA share the same CNV platform; Affymetrix Genome-Wide SNP array 6.0. Somatic CNV segments in tumors were identified using The HMM-Dosage method [5]. For CNV alterations on a gene-centric level, annotation files used are Ensembl 54 (hg18) for protein-only probes in Illumina HT-12v3 array. It is important to note that, in METABRIC CNV segment identification; there was not exact matching of tumor to normal pairs. In the discovery set of 997 tumors, 473 normal samples were available for use in a pooled approach. The workflow (not done in this paper) to summarize data in normalized Log2 intensities was accomplished using probe level modelling and normalization with SNP-RMA and aroma-affymetrix software [3]. CNV data for this study comes from Level 3 TCGA Affymetrix Genome-Wide SNP array 6.0. Processing pipeline full details are documented in a Broad Institute GenePattern pipeline [6]. Circular binary segmentation (CBS) [7] was used to create copy number segments, which were then assigned mean log-ratios per segment. This research uses CBS files for each patient to follow the "gene-centric" analysis of CNV used in METABRIC. Annotation files used are Ensembl 54 (hg18) for protein-only probes in Illumina HT-12v3 array.

RNA measures
The Illumina HT-12v3 platform was used in gene expression analysis. Similar to CNV identification, there were 997 tumors matched to 144 normal samples. The resulting workflow included spatial artifact correction, summarization, and normalization of Log2 intensities using beadarray and BASH R packages [8,9]. In TCGA, Normalized mRNA expression counts are derived from the TCGA Level 3 RNAseqV2 expression data. Illumina HiSeq 2000 was the platform used to create the data, and it was processed by the University of North Carolina to produce counts using Map Splice [10] for alignment and RSEM [11] for quantification.

Statistical analysis
Analysis of association between clinical predictor variables and response (NM) was carried out using a chi square test, or F-test when appropriate. Molecular tests of association were covariate adjusted to account for confounding from variables associated with both exposure and outcome. In METABRIC, final covariates were tumor grade, tumor size, patient age at diagnosis, and race. TCGA covariates included receptor subtype, age at diagnosis, tumor size, and race. These covariates are used in both CNV and RNA association tests for all data.
Genome-wide CNV association tests used logistic regression for each gene in the genome. The logistic model allows for a dichotomous response variable, nodal metastasis (NM, positive or negative) and multiple predictor variables. We corrected for multiple testing in the GWAS analyses using false discovery rate (FDR) methods with a type 1 error set at 0.05. CNV data is given in METABRIC as a gene-by-gene summary of normalized segment means. In TCGA, CNV data is available as a perpatient file of segment means with a start/end location on each chromosome. We summarized the segment intensity data for each gene with the average intensity of each gene using the annotation package GenomicRanges [12]. These files were then merged using the "summarizeOverlaps" function to give a METABRIC-like matrix of patient-bygene segment means.
RNA data is available in METABRIC as a probe normalized expression value. We converted from probe to gene level using the "CollapseRows" function in the WGCNA package [13]. For each gene in the genome, we fitted a linear model with the R package limma [14]. Empirical Bayes [15] shrinkage was used in calculating a t-statistic for each gene. Multiple comparisons were corrected for using the Benjamini-Hochberg approach. In TCGA, raw counts of RNA per-gene were compiled across the genome of each patient and then assembled per gene matrix using the edgeR R package. [16] Normalization factors for the raw data matrix were calculated, as well as common dispersion values.
Gene lists from CNV and RNA association tests were then reviewed for any significant overlaps, both between datasets and within them. The statistical testing of this involved a contingency table of two gene lists, the whole set of possible genes (entire genome) and Fisher's exact test. The R package GeneOverlap [17] was used to indicate replication of important by-gene CNV or RNA associations. As mentioned in the approach, further evaluation of consistency in direction of effect was also done.
A copy number association analysis [18,19] was done to examine the effect of per-gene, CNV-related gain/loss upon RNA within the same tumor. To account for differences across NM status, both datasets were split by NM, and the following tests were performed with the iGC Bioconductor package [20]. Gene expression driven by CNV was identified first by grouping all per-gene CNVs as copy gains (log2 ratio ≤ 0.4), copy losses (log2 ratio ≥ -0.4), and between-threshold values as diploid/ neutral. The variations in gene expression between CNVgain genes and diploid normals and CNV-loss genes and diploid normals were measured with an unequal variance Student's t-test. Filtering of results was based on the false discovery rate (FDR) corrected p-value (α = 0.1) and consistent direction of CNV-to-RNA association. A relaxed p value threshold was selected to avoid losing genes that could be false negatives in a stringent testing by the cost of accepting more false positives. CNV-driven gene transcripts unique to NM status were found for both METABRIC and TCGA. In a final step of replication, intersecting genes were then identified across datasets within each NM group.
Two additional validation steps of 1) enrichment analysis and 2) CNV-driven methylation and protein changes in TCGA were performed. Enrichment analysis was performed on all top result CNVs associated with NM, Fischer exact testing was used to indicate the probability of a gene occurring in any set of ontology genes [21]. We utilized the cBIO Portal [22] for additional analysis validating or CNV-driven mRNA results. Using the same TCGA samples in our research, we compared CNV-driven changes in protein and methylation for our CNVs of interest. Omic data was available for four genes; CRELD1, EIF4EBP1, PSMD3, and STARD3. Pearson coefficients were used to measure correlation between CNV and methylation or protein massspectrometry for the same sample.