Identification of copy number alterations in colon cancer from analysis of amplicon-based next generation sequencing data

The objective of this study was to determine the feasibility to detect copy number alterations in colon cancer samples using Next Generation Sequencing data and to elucidate the association between copy number alterations in specific genes and the development of cancer in different colon segments. We report the successful detection of somatic changes in gene copy number in 37 colon cancer patients by analysis of sequencing data through Amplicon CNA Algorithm. Overall, we have found a total of 748 significant copy number alterations in 230 significant genes, of which 143 showed CN losses and 87 showed CN gains. Validation of results was performed on 20 representative genes by quantitative qPCR and/or immunostaining. By this analysis, we have identified 4 genes that were subjected to copy number alterations in tumors arising in all colon segments (defined “common genes”) and the presence of copy number alterations in 14 genes that were significantly associated to one specific site (defined “site-associated genes”). Finally, copy number alterations in ASXL1, TSC1 and IL7R turned out to be clinically relevant since the loss of TSC1 and IL7R was associated with advanced stages and/or reduced survival whereas copy number gain of ASXL1 was associated with good prognosis.


INTRODUCTION
The development of cancer is driven by the acquisition of somatic genetic alterations that include single nucleotide variations (SNVs), gene fusions and copy-number alterations (CNAs). CNAs are somatic changes that cause the gain or loss of DNA fragments [1][2][3], and represent the most common alterations of cancer cells [4][5][6][7][8][9]. They contribute to both onset and progression of cancer by inappropriate activation of proto-oncogenes and/or inactivation of tumor suppressor genes [4,[10][11][12][13][14][15]. Characterization of CNAs in tumors have helped in the identification of relevant oncogenes including ERBB2 and EGFR, as well as tumor suppressors such as pRB and TP53 [16], resulting into better diagnostics and more appropriate therapy [17][18][19].
Previous studies of CNAs detection in CC were carried out by quantitative PCR, fluorescence in situ hybridization (FISH), whole-genome array comparative genomic hybridization (aCGH) or single-nucleotide polymorphism (SNP) arrays [47][48][49]. The recent advent of high-throughput sequencing techniques and the subsequent development of ad hoc algorithms have made available CNAs identification [50,51]. Recently, Grasso and co-workers have developed an algorithm for assessing CNAs from Next Generation Sequencing (NGS) data generated by amplicon-based DNA libraries derived from Formalin Fixed Paraffin-Embedded (FFPE) tumors [52].
In the present study, we applied this algorithm to DNA sequencing data to determine the feasibility to detect CNAs using NGS and elucidate the association between specific CNAs and cancer originating from different anatomical colon segments.

Identification of CNAs in colon cancer by analysis of amplicon-based NGS data
In this manuscript we have applied the Amplicon CNA Algorithm, previously described by Grasso and coworkers [52] to identify somatic CNAs from ampliconbased NGS data generated using the Ion AmpliSeq™ Comprehensive Cancer Panel (CCP), which provides complete whole exon coverage of the 409 most important cancer-associated genes. As described in a parallel manuscript (Oliveira et al., under revision), we have selected 37 patients among those who underwent surgery for CC at the General Surgery Unit of University Magna Graecia of Catanzaro in the years 2013-2015. The samples were resected from multiple anatomical segments of the colon: ascending colon (7 patients), descending colon (7 patients), hepatic flexure (8 patients), splenic flexure (5 patients), transverse colon (4 patients) and cecum (6 patients) (Supplementary Figure 1) Figure 1 illustrates the pipeline analysis of NGS data implemented in R statistical environment [53]. The input for the Amplicon CNA Algorithm was the Binary Alignment Multifasta (BAM) file. The use of pooled normal samples as reference has been reported to be a valid alternative to the one-by-one match between tumors and the corresponding normal tissues [52]. The algorithm output consisted in a list of all CNAs identified for each sample analyzed. Overall, 1904 CNA calls were identified. Copy Number (CN) gains were defined as alterations showing log2 CN ratio ≥ 0.1 and CN losses were defined as alterations showing log2 CN ratio ≤ -0.1. Benjamini-Hochberg procedure was used to reduce false discovery rate and CNAs were considered significant when the q-value was ≤ 0.05 (Supplementary File 1).
On the basis of these parameters we found a total of 785 significant CNA calls, of which 328 (41.7%) were CN gains and 457 (58.2%) were CN losses, distributed along 243 different genes. Finally, for the sake of clarity, we considered significant only the alteration calls that showed concordance in ≥65% of the samples. Using this threshold, we found that the majority of genes with CNAs were concordant (230/243). The remaining 13 genes (DICER1, FGFR1, HOOK3, IGF2, IKBKB,  MARK1, NFKB1, NF1, PTPRD, SMAD2, SYNE1,  TAF1L, TRIP11) presented discordant calls and were not considered in the rest of the manuscript. We will consider only the 230 genes that showed concordant calls. Among the concordant genes, 143 presented CN losses and 87 presented CN gains (Supplementary Table 2). The Circos plot shown in Figure 2 summarizes all detected CNAs. The median number of genes per tumor showing CN gains was 12 (range 1-37) whereas the median number of genes per tumor showing CN losses was 20 (range 1-41). CN gains and losses involved genes located on all chromosomes except chromosome 23. The heatmap in Figure 3 shows all significant CNAs ordered by cytobands (see also Supplementary Table 3), with CN gains (red) and CN losses (green).
In Table 2 are listed the genes with the highest values of CN changes (columns 1-8). Among these, there were CCND1 with a mean CN ratio of 6.0, CCNE1 with a mean CN ratio of 4, MAF with a mean CN ratio of 3.30, BCL2L1 with a mean CN ratio of 2.19 and CDKN2A with a mean CN ratio of 0.2.
Expectedly, most genes showing alterations had already been reported to be associated with the development of colon cancer [22]. However, among the 230 genes that presented significant CN changes in colon cancer we identified at least 10 whose alterations had not been previously associated to colon cancer. Of these 4 genes were subjected to CN gains (DST, KLF6, FANCA, CSMD3) and 6 were subjected to CN losses (TGM7, NKX2-1, RHOH, RNF213, ERG, and CRBN).

Validation of NGS results by Q-PCR analysis and immunohistochemistry analysis
Subsequently, we used Q-PCR to validate the results obtained through the bioinformatics analysis of NGS data. In Figure 4 we reported representative Q-PCR analysis relative to 2 genes showing CN gains (MAF and BCL2L1) and 2 genes with CN losses (SMAD4, CDKN2A). In Supplementary Figure 2 we reported representative Q-PCR relative to additional genes showing significant CNVs. Overall Q-PCR results were consistent with bioinformatics analysis of NGS data, also in those genes that resulted discordant (Supplementary Figure 2). Immunostaining analysis demonstrated that the observed CNA in the gene encoding CCND1 resulted into cyclin D1 protein overexpression in CC27 (see Figure 5).

Comparison of the results with Colon Adenocarcinoma dataset present in public repositories
We validated the results obtained in this study relative to the association of genes with CNAs and localization of tumors in colon segments by analysing the Colon Adenocarcinoma dataset within The Cancer Genome Atlas database (COAD-TCGA). Within this dataset, 358 tumor samples arising from different colon segments were present. Eighty tumors derived from ascending colon, 146 from descending colon, 6 from splenic flexure, 16 from hepatic flexure, 84 from cecum and 27 from transverse colon.
We further validated the analysis of CNAs by investigating the correlation between the 230 genes showing CNAs identified in this study and the corresponding mRNA expression reported in the COAD-TCGA dataset. The analysis was performed by a linear regression model, which allowed the identification of genes presenting direct association between CN changes (i.e. amplification/loss) and RNA expression (i.e. overexpression/under-expression). Among the identified 230 genes with significant CNAs, 69 genes (30%) showed a statistically significant Pearson correlation value higher than 0.45 (R 2 > 0.2, p-value < 0.01). See Supplementary Table 4 for further details. Notably, some among the 69 genes with a significant correlation between CN changes and RNA expression are involved in the development of CC, such as SMAD2, ASXL1, PLCG1, UBR5, TOP1 and MBD1. In Figure 6 are reported two of the genes that showed the most significant correlation between CNAs and mRNA expression (ASXL1 and MBD1).

Clinical-pathological correlations
For clinical analysis we selected those genes that presented concordant CNAs in at least 5% of the samples. We found that 60, among 230 significant genes, presented CNAs in ≥ 5% of patients and correlated them with clinical and pathological parameters such as node status (N), stage, tumor size (T) and/or presence of metastasis (M1) (p-value < 0.05). See Supplementary Table 5. Expectedly, Univariate Cox Regression analysis demonstrated that the parameters T, M and stage were predictors of overall survival (OS) (see Table 3).
Subsequently, the genetic status of CC patients (n = 35) was correlated with OS and the gain of ASXL1, loss of TSC1 or loss of IL7R predicted poor prognosis, as shown by the corresponding Kaplan-Meier curves (see Figure 7).
In particular, the average 4-year survival rate of all CC patients was 86%. Upon stratification for the status of However, none of the covariates that resulted significant in Univariate analysis turned out to be independent prognostic factors by Multivariate Cox Regression analysis.

Genes showing CNAs in colon cancer arising in different anatomical segments
In order to identify a specific pattern of genetic alterations based on tumor site, we stratified genes with CN alterations according to the anatomical localization of the tumor (ascending colon, descending colon, transverse colon, hepatic flexure, splenic flexure and cecum). Genes that presented gains or losses in all colon segments were defined "common genes". Genes that presented gains or losses significantly associated to one anatomic colon segment were defined "site-associated genes". "Common genes" and "site-associated genes" are listed in Supplementary Tables 6 and Table 4    Among "common genes", CN gains were observed in PLCG1 and ASXL1 genes, whereas CN losses were observed in NLRP1 and WHSC1 genes.
"Site-associated genes" were identified by use of a chisquare test with a threshold of significance set at p-value ≤ 0.05. As listed in Table 4, "site-associated genes" that showed significant association with tumors arising in specific colon sites were 14. Among these, CN losses were identified in APC, DCC, MBD1, NOTCH2, PDGFB, PKHD1, PIK3R1, RET, RNF213, SMAD4 and WRN whereas CN gains were identified in BCL2L1, RB1 and UBR5.
Among the 14 "site-associated genes" that showed the most significant association with colon segments, PDGFB, SMAD4, RB1, BCL2L1 showed anatomical position dependency for only one site with highly significant p-values (≤ 0.01). CN loss of PDGFB was observed only in tumors from splenic flexure, CN loss of SMAD4 was observed only in tumors from descending colon, whereas CN gain of RB1 was observed only in tumors from descending colon and CN gain of BCL2L1 was observed only in tumors from splenic flexure.
Given the limited number of samples analysed in the cohort of patients under study here, the significance of the anatomical position dependency shown for the 14 "site-associated genes" described above was investigated using a larger dataset of CC (COAD-TCGA). Notably, we found a significant association for 7 of the "site-associated genes" reported in this study also for samples present in COAD-TCGA (n = 358). In particular, CNAs in MBD1, SMAD4, PIK3R1, DCC, WRN, RB1 were significantly associated with tumors originating in descending colon whereas CNAs in NOTCH2 was associated with tumors originating in splenic flexure (p-value < 0.05). These findings confirmed the significance of the anatomical position dependency shown for at least 7 out of 14 "site-associated genes" reported in this study, indicating that tumors arising in different colon segments may be caused by alterations that occur in different genes.

DISCUSSION
In this manuscript, we have applied a previously described Amplicon CNA Algorithm, to investigate the presence of somatic CNAs in tumors originating in different colon sites [52]. This analysis was performed on NGS data generated using the Ion AmpliSeq™ CCP, which  CNAs in genes such as ASXL1, TSC1 and IL7R, iii) the identification of CNAs in 4 genes in tumors originating from all colon segments ("common genes") and iv) the detection of CNAs in 14 genes associated preferentially to a specific colon site ("site-associated genes"). The main characteristic of the approach described by Grasso et al., and applied here, was to use read counts/ amplicon to identify CNAs from NGS data. Prediction of gene amplification/deletion is possible if sufficient number of amplicons is analyzed [54]. Overall, we have found a total of 785 significant CNAs in 243 different genes, of which 328 were CN gains and 457 were CN losses. The results obtained in silico with the Amplicon CNA Algorithm were validated by quantitative Q-PCR and immunostaining. Further control of our results was performed by combining information from COAD-TCGA database and sequencing data presented in this manuscript. From this analysis it appears that almost 30% of the 230 genes with significant CNAs in CC showed a statistically significant correlation with mRNA expression, at difference with a similar analysis using the COAD-TCGA dataset, in which 20% of genes (3542 out of 17630) presented a positive correlation between CNAs and mRNA expression (R 2 > 0.2; p ≤ 0.01).
Previous studies have reported that in colon cancer chr20 was most frequently subjected to CN gains and chr18 was most frequently subjected to CN losses [55]. In agreement with these previous reports, the genes most frequently subjected to CN gains observed in this study were located on chromosome 20 including ASXL1, PLCG1, TOP1 and PTPRT whereas the genes most frequently subjected to CN losses were located on chromosome 18 and include MBD1, DCC and CDH2. Most of the genes that presented CNAs identified in this manuscript, such as TOP1, ASXL1, PTPRD, DCC, NLRP1 and CDH2 have already been directly associated with the development of CC. In particular, gene amplification and/ or overexpression of TOP1 has been detected in metastatic colon cancer whereas loss of ASXL1 occurs in CC with microsatellite instability [56]. Loss of PTPRD expression was observed in highly invasive cancers and correlated with patient survival [57].  Some of the genes that presented the highest values of CNAs have already been associated with CC. In particular, gene amplification and/or overexpression of CCND1 have been associated with poor prognosis and reduced overall survival in CC patients whereas BCL2L1 has been shown to play a role in the adenomato-carcinoma progression [58,59]. Less clear is the role of MAF having been described either as an oncogene or as a tumor suppressor, depending on the cell context [60].
On the other hand, at least 10, among the 230 genes showing CNAs, were not known to be associated to CC. Among these TGM7, NKX2-1, RHOH, RNF213, ERG, and CRBN presented significant CN losses and thus were potential tumor suppressor genes whereas DST, KLF6, FANCA, CSMD3 presented significant CN gains and can be considered potential oncogenes.
Notably, we observed that CN gain of ASXL1 was associated with an improvement in OS, whereas CN loss of TSC1 and IL7R predicted significantly reduced OS.
An important aim of this study was to determine whether tumors arising in different colon segments presented specific molecular alterations. Overall we have identified 4 "common genes" subjected to CNAs in tumors originating from all colon segments and 14 "site-associated genes" whose alterations are associated to tumors arising in specific colon segments. Among the 4 "common genes" we found CN gains in ASXL1 and PLGC1, which suggest that they act as oncogenes in CC. However previous studies showing the involvement of ASXL1 and PLGC1 in the development of CC were inconsistent. In fact both ASXL1 and PLGC1 have been shown to act either as tumor suppressor genes or oncogenes [56,61]. Among the "common genes" presenting CN losses identified in this study is NLRP1, a protein whose function is apparently involved in gastrointestinal inflammation and tumorigenesis [62].
Among the 14 "site-associated genes" that showed highly significant association with a specific colon segment, PDGFB, SMAD4, RB1, BCL2L1 showed anatomical position dependency for only one colon segment, WRN, NOTCH2, APC and PIK3R1 showed anatomical position dependency for two colon segments whereas the remaining genes PKHD1, RET, MDB1 RNF213, UBR5 and DCC were associated to 3 or more segments. Moreover, a further support to the significance of the association reported in this study, was the finding that 7 out of the 14 "site-associated genes" reported here showed a significant position dependency also in the cohort of tumors present within the COAD-TCGA database. In particular, CNAs in MBD1, SMAD4, PIK3R1, DCC, WRN, RB1 were significantly associated with tumors originating in descending colon whereas CNAs in NOTCH2 was associated with tumors originating in splenic flexure.
It is also of note that loss of "site-associated" genes such as WRN, NOTCH2, MBD1 and PI3KR1 had already been associated to development of human cancer. In particular, WRN has been reported to be frequently deleted in CC, and its deficiency apparently predisposes to various types of cancer [22]. In the present study WNR was found to be lost preferentially in descending colon tumors. Similarly, in agreement with the existing literature [63], we found that NOTCH2 was deleted in CC (preferentially in splenic flexure tumors), which suggested a role as tumor suppressor in this subset of CC. On the other hand, we have found that MBD1 loss is a frequent event in colon carcinogenesis, being associated with descending colon tumors in different studies. This observation is in agreement with previous studies reporting frequent deletion of 18q21 (where MBD1 maps) in CC [64]. Notably, the results described in the present study show a significant association between MBD1 loss and late stage of disease. Finally, loss of PIK3R1 was preferentially observed in descending colon tumors. PIK3R1 represents the p85 regulatory subunit of heterodimeric enzymes, the PI3Ks, which also include a p110 catalytic subunit [65]. PI3Ks are downstream effectors of tyrosine kinase and G-protein-coupled receptors, which coordinate multiple cell functions including proliferation, migration and survival [66,67]. PIK3R1 plays an important role in restraining cell migration. Loss of PIK3R1 was observed in patients with stage III disease (5/12 patients) but in none of the patients with stage I or II disease (0/20), suggesting that its down-regulation promotes aberrant activation of PI3K signalling in colon cancer cells, which would lead to invasion of adjacent tissues and/or regional dissemination.
In conclusion, in this study we report the successful detection of somatic CNAs in 230 genes using NGS data relative to 37 CC samples. Expectedly, most genes showing CNAs had already been reported to be associated with CC [22] whereas at least 10 among the 230 altered genes had not apparently been associated to CC yet. Notably, the analysis reported in this study indicated that CN changes in at least 3 genes (ASXL1, TSC1 and IL7R) were clinically relevant, being their alteration associated with survival. Finally, the analysis of the distribution of genes with CNAs relative to the site of origin of cancer led to the identification of 4 "common genes" that were subjected to CNAs in tumors arising in all 6 colon segments and 14 "site-associated genes" whose CNAs occurs preferentially in tumors originating only in certain colon segments.

Tumor samples
Tumor samples, matched normal mucosa and peripheral blood lymphocytes (PBL) were obtained from patients referring to General Surgery unit of the AOU Mater Domini/University Magna Graecia (Catanzaro, Italy), who underwent surgical resection for colon cancer since January 2013. Biopsies were immediately snap frozen and stored at -80° C. Hematoxylin and eosinstained tissue sections were reviewed by an expert pathologist to confirm diagnosis.

Patients' demographics
General demographic information, histo-patological and clinical parameters, surgical treatment and follow-up data were collected prospectively and are also reported in Oliveira et al. (submitted). However, for sake of clarity we summarize below the clinical characteristics of the patients included in the study.
Among the 37 patients, 13 were women and 24 were males. Mean age of patients was 68.35 years old (range 47-84). Stage was known for 36 of the 37 patients: 7 patients had stage I disease, 13 patients had stage II disease, 12 patients had stage III disease and 4 patients had stage IV disease. Grade was known for 35 out of 37 patients: 1 patient had tumor that was graded G1, 25 patients had tumors that were graded G2 and 9 patients had tumors that were graded G3. Of the patients included in the present study, four presented distant metastasis. None of the patients received chemotherapy or radiation therapy prior to surgery.

Bioinformatic analysis for CNV detection
DNA extraction, library preparation using the Ion AmpliSeq ™ Comprehensive Cancer Panel on the Ion Torrent platform (Thermofisher, MA, USA), sequencing and NGS primary analysis were carried out as described (Oliveira et al., submitted).
To identify CNAs in NGS amplicon-based dataset we replaced average coverage of exon pull-down regions with read counts per amplicon. All reads were aligned to the human reference genome (hg19). For each sample, normalization was performed by dividing the number of reads of each amplicon by the total number of reads. Subsequently, the normalized reads obtained as described from tumor samples were divided by the normalized reads from pool made of blood samples from 13 patients, set as reference. The resulting Log2 values (raw copy number ratios) were corrected for the GC content in each amplicon and the Poisson model was applied using the CNA amplicon algorithm described in Grasso et al. [52] to identify CNAs. Genes were defined significant when the q-value was ≤ 0.05 after the Benjamini-Hochberg correction [68]. CN gains were defined as genes showing log2 CN ratio ≥ 0.1 and CN losses were defined as genes showing log2 CN ratio ≤ −0.1.

Association and survival studies
The association between genes showing CNAs and the clinical-pathological parameters was evaluated by Fisher's exact test and χ 2 test. Overall survival (OS) was calculated from the day of surgery to death or end of follow-up. Kaplan-Meier curves were used for analysis of OS.
Univariate and multivariate survival analyses with calculation of hazard ratios (HR) were performed using Cox's proportional-hazards model. R software was used for statistical analysis and a p-value ≤ 0.05 was considered significant.
Regarding the correlation of CNA with mRNA expression, a gene-level table of copy number values and gene expression data were downloaded from the COAD-TCGA dataset. Using as input CNA and miRNA expression, we describe the relationship between these variables by linear regression analysis setting as significant Pearson correlation value >0.45 (R 2 > 0.2, p-value ≤ 0.01)

Quantitative real-time (Q-PCR)
To validate bioinformatic analysis of CN alterations we performed real-time PCR in selected genes. We used GAPDH to normalize the data and PBLs as reference samples. Reactions were performed using SYBR Green I PCR Master Mix (Thermofisher), which includes the internal reference (ROX). Each qPCR reaction comprised 10 μl 2× SYBR Green PCR Master Mix, forward and reverse primers at final concentration of 500 nM. QPCR reactions were performed using the Quantstudio 12 K Flex (Thermofisher). The reaction profile was: initial step, 50° C for 2 min, denaturation, 95° C for 10 min, then 40 cycles of denaturing at 95° C for 15 sec and combined annealing and extension at 60° C for 60 sec. Each qPCR experiment contained triplicates of the no-template-controls and test samples for all of the primers tested. Three independent experiments were conducted for each analysis. Statistical analysis was performed with one-way ANOVA and Dunnett's multiple comparisons test using Graphpad software. * p < 0.05, ** p < 0.01, *** p < 0.001 and **** p < 0.0001.

Immunoistochemistry
Immunostaining was performed with standard protocols using Bond ™ Polymer Refine Detection Kit (Leica Biosystem, Buffalo Grove, IL) according to the manufacturer's instructions, on formalin-fixed, paraffinembedded tissues. Sections (5 μm) were mounted on slides and stained with hematoxylin and eosin to be evaluated by light microscopy. The antibody used in immunostaining for CCND1 (#3642, DAKO, Carpinteria, CA, USA). www.oncotarget.com

Author contributions
Duarte Mendes Oliveira: Participated in the conception, design and execution of the study; performed the NGS sequencing runs and validation experiments; writing and editing of the manuscript; Gianluca Santamaria: Participated in the analysis and interpretation of data; drafting and editing of the manuscript; Carmelo Laudanna: Participated in bioinformatic analysis and intrepertation of NGS data; Pietro Zoppoli: Participated in bioinformatic analysis and intrepertation of NGS data; Simona Migliozzi: Participated in bioinformatic analysis and intrepertation of NGS data; Michael Quist: Provided bionformatic tools and revised the draft of the manuscript; Catie Grasso: Provided bionformatic tools and revised the draft of the manuscript; Chiara Mignogna: Revision of pathologic slides; Laura Elia: Participated in patient recruitment and collected clinical information; Maria Concetta Faniello: Participated in data collection; Cinzia Marinaro: Participated in data collection; Rosario Sacco: Performed the operations; Donatella Malanga: Participated in data collection, execution of the study and the analysis and interpretation of data; assisted extensively in writing the manuscript and critical revision; Francesco Corcione: Assisted in writing the manuscript and critical revision; Antonia Rizzuto: Performed the operations; participated in the design of the study; collected clinical data of patients; helped in data interpretation; Giuseppe Viglietto: Supervised development of work, helped in data interpretation, assisted extensively in writing the manuscript and critical revision.

CONFLICTS OF INTEREST
The authors declare no conflicts of interest.

GRANT SUPPORT
This work was supported by MIUR (PON01_02782; PON03_0475) and by Centro Interdipartimentale di Servizi of the University Magna Graecia of Catanzaro.