The landscape of somatic mutation in sporadic Chinese colorectal cancer

Colorectal cancer is the fifth prevalent cancer in China. Nevertheless, a large-scale characterization of Chinese colorectal cancer mutation spectrum has not been carried out. In this study, we have performed whole exome-sequencing analysis of 98 patients’ tumor samples with matched pairs of normal colon tissues using Illumina and Complete Genomics high-throughput sequencing platforms. Canonical CRC somatic gene mutations with high prevalence (>10%) have been verified, including TP53, APC, KRAS, SMAD4, FBXW7 and PIK3CA. PEG3 is identified as a novel frequently mutated gene (10.6%). APC and Wnt signaling exhibit significantly lower mutation frequencies than those in TCGA data. Analysis with clinical characteristics indicates that APC gene and Wnt signaling display lower mutation rate in lymph node positive cancer than negative ones, which are not observed in TCGA data. APC gene and Wnt signaling are considered as the key molecule and pathway for colorectal cancer initiation, and these findings greatly undermine their importance in tumor progression for Chinese patients. Taken together, the application of next-generation sequencing has led to the determination of novel somatic mutations and alternative disease mechanisms in colorectal cancer progression, which may be useful for understanding disease mechanism and personalizing treatment for Chinese patients.


INTRODUCTION
Colorectal cancer (CRC) is the third most prevalent type of cancer worldwide [1,2]. CRC has nearly 4 millions new cases and caused 2 millions death in China each year [1,2]. In the last decade, its incidence has increased constantly in China due to the diet and living habit change [1]. CRC patients usually attain promising clinical outcomes of early diagnosis, however, most of them failed to detect the tumor until a late stage. Around 60% of CRC reside on rectum in Chinese patients, while 40% occurs on rectum in Caucasian patients [3].
Currently, various genomic approaches have been devoted to investigate the molecular mechanisms of CRC initiation and progression. A pioneering study identified several somatic mutations in colorectal cancer, such as TP53, KRAS, APC, PIK3CA and FBXW7 [4]. A later TCGA study has shed light on utilizing genomic data to elucidate mutation landscape of human CRC, and several novel somatic mutations were identified as well [5]. A follow-up study revealed several novel somatic mutations, such as TCF7L2, TET2, TET3 and ERBB3, and illustrated possible treatment plan for colorectal cancer [6]. Whole exome sequencing (WES) study on American-African patients discovered significant different somatic mutation genes, indicating alternative disease mechanisms in patients with different ethnic background [7]. Moreover, investigations on Iranian and Japanese patients uncover different somatic gene mutations and alternative mutation frequencies than Caucasian counterparts [8,9]. Therefore, a whole exome sequencing of Chinese patients is essential for novel somatic gene mutation spectrum characterization, which may consequently change our understanding of disease etiology and precision medicine management.
The most relevant study of Chinese CRC using exome sequencing has used whole exome sequencing for the first CRC cancer prognostic study [10]. A few novel somatic mutation genes, such as CDH10, FAT4 and DOCK2, are reported to be susceptible driver mutations. Notwithstanding, due to its limited sample size (22 samples) and low sequencing coverage (<60X) strategy, its power for novel Chinese CRC gene discovery and mutation frequency characterization is limited for this high immune and stromal cell infiltration tumor [11]. Up to now, a genome-wide somatic gene mutation and frequency data are largely unknown for Chinese patients. Also, little was discussed towards analysis of the association between gene mutations and clinical characteristics for Chinese patients.
In this study, we used whole exome sequencing technology to study sporadic Chinese CRC, a single type of CRC taking for around 65-85% of the total CRC patients. WES of 98 sporadic Chinese CRC patients' samples with matched controls is sequenced by Illumina and Complete Genomics platforms. First, we compared the somatic mutations and mutation frequency with TCGA data. Second, we carried out pathway analysis and compared them with TCGA data. Finally, we analyzed clinical characteristics by their association with somatic mutated genes.

Sequencing statistics
To identify genetic variants involved in CRC, we performed whole-exome sequencing of the tumor and matched controls on genomic DNA from 40 (Second affiliated Hospital of Nanchang University) and 58 patients (Wenzhou Medical University) by Illumina HiSeq2000 (ILMN) and Complete Genomics (CG) Blackbird sequencers, respectively. The Illumina and CG platforms attained 117X and 168X average coverage for exon captured regions as shown in Figure 1A. All of the samples achieved 10X and 20X coverage rate more than 95% and 90% of the genome for Illumina and CG platforms. A description of sequencing and mutation statistics can be found in Supplementary Table 2.
It was reported that CG platform has high concordance rate for SNV detection with Illumina platform [12]. However, it is necessary to evaluate system bias from two sequencers by their ability to discover somatic mutations. This was verified by no segregation of mutation numbers in coding region from two platforms ( Figure 1B). Also, no obvious differences were observed between the mutation counts (p = 0.29, Wilcoxon rank test). Moreover, mutation count of each type of SNV (non-synonymous SNV, synonymous SNV, stop gain SNV, splicing SNV) and InDel (non-frame shift deletion InDel, non-frame shift insertion InDel, frame shift deletion InDel, frame shift insertion InDel) does not show statistical differences between the two platforms. Taken together, it is reasonable to merge these two datasets for further study. Mutation rates of tumors are around 3/Mb for each sample ( Figure 1C), which is consistent with previous western and Chinese CRC whole exome sequencing results [5,6,10].
Overall, 13 patients have considerably higher mutation rate ( Figure 1B

Somatic mutational spectrum
In general, dominant somatic SNVs in colorectal cancer are *CpG->T mutations as shown in Figure 2A, consistent with that of TGCA ( Figure 2B). To investigate the origin of somatic mutations, we clustered samples into subgroups based on the six mutation types (Supplementary Appendix A). We stratified the patients into hypermutated and regularly mutated subgroups for Chinese data. Similar results are also derived from TCGA data, indicating the consistency of the clustering results. The differences between the mutation spectra of the two subgroups are quite obvious. Mutations in regularly mutated samples are mainly *CpG->T, while hypermutated samples are dominated by three mutation peaks at TCT>TAT, TTT>TGT and TCG>TTG.

Somatic mutation
We performed somatic gene mutation analysis of the hypermutated samples due to its unique tumorigenisis process and clinical outcome. It is known in the previous research that most of the microsatellite instability (MSI) tumors are hypermutated [5]. This is also validated in our study, and 8 out of 11 the hypermuated samples are MSI. The rest 3 hyper-mutated and microsatellite stability (MSS) samples displays mutations on MSH4 and POLE. Mutations on MSH4 and POLE may interrupt DNA binding and DNA replication, both of which will induce ultra-high mutation numbers. Moreover, canonical CRC genes, including APC, PIK3CA, MSH6 and FAT4 are observed in mutation status. It is interesting to discover TP53 in lower mutation prevalence (2 out of 13) for hypermutated samples, which may indicate alternative disease mechanisms for hypermutated samples. A report of the prevalently mutated genes is shown in Supplementary Table 5.
We then discuss the somatic mutations in nonhypermutated samples. In general, identification of driver mutations from passenger mutations is a challenging task due to the following reasons: a) CRC is a type of cancer with high immune and stromal cells infiltration and somatic mutation signal will be diminished [11]. b) Patients may have sub-clone with different somatic mutated genes, and a mixture of the two or more sub-clone will further decrease somatic mutation signal [13,14]. c) Disease etiology for a fraction of patients may be induced by different somatic mutations and a long tail of genes was postulated to explain the CRC initiation and progression of entire population [11]. d) Consistency between different significantly mutated genes algorithms are not perfect, and genes shown up as statistical significant results from one algorithm may disappear in another one [15,16].
In this study, we used the following rules to discuss the susceptible driver somatic mutations. 1) We used MutsigCV, one of the most widely used algorithms, for significantly mutated gene (SMG) detection [17]. 2) We reported prevalently mutated somatic genes (>5% of the entire population) due to its possibility as diagnostic and treatment targets on population level. 3) We discussed the mutated genes supported by additional literatures. 4) We used two gene expression data (GSE50760 and GSE18105) from GEO database as gene expression evidences. Genes with log fold change>1 and adjusted p-value < 0.05 between tumors and matched controls are considered as differentially expressed genes [18,19]. A report of these genes was presented in Supplementary  Table 6 and Figure 3A. It should be mentioned that the two platforms achieve high consistency in terms of significant somatic mutated gene identification and gene mutation frequency characterization.
Seven significant mutated genes were identified, including 6 classical CRC genes (TP53, APC, KRAS, FBXW7, PIK3CA and SMAD4 [5,6]) and a novel CRC gene (PEG3). PEG3 accounts for 9 (10.6%) of the patients compared with 2.7% mutation prevalence in TCGA data. Both of Illumina and Complete Genomics platforms are able to identify the somatic mutations of PEG3 gene with similar mutation frequencies (3/29 and 6/56) for regularly mutated samples. Illustrations of the mutation sites are also shown in Supplementary Figure 1. Its aberrations on multiple levels, including gene expression, methylation and mutation have been reported in different types of tumors, including myeloma [20], ovarian cancers [21] and cholangiocarcinoma [22]. Moreover, expression of PEG3 in colorectal patients shows statistically significant lower gene expression than matched controls from two independent studies using RNA-Seq and microarray platforms [18,19]. Its reduced expression is also reported to be statistically significant associated with patients' survival in various types of tumors [23]. PEG3 acts as a tumor suppressor gene by binding and promoting the degradation of β-catenin, which will consequently inhibit Wnt Signaling [24]. Its molecular role is to interact with SIAH1 and induce TP53 mediated apoptosis or bind with TRAF2 to initiate NFKB and MAPK pathways [25,26].
Two genes display significantly different mutation frequencies between Chinese and TCGA data. The mutation rate of APC gene is 0.435 and 0.786 in Chinese and TCGA data ( Figure 3C). This result is validated by previous Chinese exome-sequencing results (0.56 mutation frequency, p = 0.0001 from Fisher' exact test) [10]. APC gene carries a hotspot mutation Q1429 in Chinese population, differs from the hotspot mutation (R1450) in TCGA data ( Figure 3D). Two novel hotspot mutations, namely R273C/H, R282W on TP53 gene and R265C/H on SMAD4 gene were shown in this dataset (Supplementary Figure 2). Additionally, KRAS and PIK3CA genes are less frequently mutated but not statistically significant in Chinese patients were also observed in this dataset.
Lastly, we used 1000 Genome Project and an in-house Chinese control dataset (1500 whole exome sequencing) to evaluate the possible systematic effects. It is shown that only 5.4% of somatic SNVs from our data were annotated in 1000 Genome Project, which is highly concordant with the TCGA results (6.8%). Only 2% of the somatic SNVs are found in the in-house data set, and most of them achieve very low mutation frequencies (Supplementary Table 7). This indicates the possible pathogenic nature of these variants.

Pathway analysis of mutated genes
We also investigate the mutation pattern on pathway level by mapping the somatic mutated genes onto the predefined cancer pathways [31,32]. The method reported www.oncotarget.com previously is used to characterize the pathway mutation frequencies [33]. Significantly mutated pathways (SMP) in Chinese population agree with TCGA results, including canonical pathways, such as Wnt/beta-catenin signaling, Cell Cycle/Apoptosis, MAPK signaling, TGF-beta signaling and PI(3)K signaling [5] as shown in Figure 4A [5]. It should be noted that APC pathway is identified as significantly mutated pathway in TCGA data, but not in our data. This is consistent with gene mutation frequency alteration results. Different pathway mutation frequency is observed, such as lower mutations rate on Wnt/betacatenin signaling (  56.8%). Illustration of the pathway mutation frequency and associated genes are shown in Figure 4B and a report of the significantly mutated pathways is shown in Supplementary Table 8. These two platforms achieve quite high consistency in terms of the significant mutated pathways identification.

DISCUSSION
We investigated the Chinese CRC tumorigenesis process by a whole exome sequencing in this study. We validated well-known somatic mutations in CRC, such as TP53, APC, KRAS and discovered a few high prevalent novel somatic mutations, including PEG3 and PTPRT.
Their mutation frequencies were also compared with that in TCGA data, and significant alternative frequency in APC or PEG3 was observed. Pathway analysis of somatic mutations indicates a higher mutation frequency in cell cycle and lower mutation frequency in Wnt or MAPK signaling pathway.
The clinical characteristics association analyses were taken for genes with >5% mutation prevalence in regularly mutated samples, and the association results of TP53, APC, KRAS, PEG3 or PIK3CA genes are shown in Table 1. RP1L1 and SYNE1 genes were significantly enriched in male and female patients. PEG3 and RYR2 displayed a significant higher association with early colorectal cancer (onset age ≤ 45). Of note, mutations on TP53 had a nearly two-fold preference in rectum (70.5%) and left colon (62.5%) than left colon (35.3%) (p-value = 0.044). APC mutation frequency decreased with the increment of TNM stages, I+II (60.5%), III (26.3%) and IV (25%) (p-value = 0.004), and this pattern still persists when combining TNM III and TNM IV for this analysis. In contrast, TCGA data showed consistent mutation frequencies among TNM I+II (80%), III (76%) and IV (78.1%) patients (Supplementary Table 9). Our findings on alterations of the mutation frequency of APC in Chinese patients are consistent with the previous published results [34,35]. Our identification on mutational pathways shows rates of Wnt/beta-catenin signaling in Chinese patients decrease from 0.651 (TNM I+II) to 0.262 (TNM III+IV), while mutational analysis of pathways of DNA damage control, genome integrity and cell Cycle/Apoptosis did not show a significant change. Meanwhile, our analysis of the mutation rate of TCGA data showed a consistence between TNM I+II (0.829) and TNM III+IV (0.793) stages for Wnt/beta-catenin signaling.
It is widely accepted that sequential mutations on APC->KRAS->TP53 is a key CRC development pathway The overall mutation frequencies for each pathway are described on the right of the pathway, coming with that of the Chinese and TCGA cohorts respectively. www.oncotarget.com [36][37][38]. The reduction of APC mutations rate in TNM III+IV patients undermine its importance in Chinese patients tumor progression. This will greatly shape our understanding about tumor progression, and may have direct clinical impact on Chinese CRC precision medicine management, for example, APC has been recommended as prognostic gene for survival prediction in Caucasian patients and our finding of lower mutation rate of APC may diminish its utility in Chinese patients [39]. The possible reason for lower APC mutation frequency in TNM III+IV tumor could be due to complementary mechanisms like PEG3, which has been reported to inhibit Wnt signaling by degrading β-catenin [24]. Another possible reason of decreased mutation frequency of APC and Wnt signaling in lymph node positive cancer may due to tumor cell eradication by immune cell, if these mutated genes are not essential for CRC maintenance.
In conclusion, we carried out the first large-scale (~100 samples) whole exome-sequencing project in Chinese patients. PEG3 has been identified as novel somatic mutated gene, and alternative mutation frequencies from TCGA data have been revealed (APC gene and WNT signaling). These newly identified somatic mutations on the canonical CRC genes worth further analysis. Taken together, the above study will be the genomics reference for the further colorectal cancer research in Chinese patients, and the raw sequences deposited at BIGD database (ID: CRA000626) will make that possible. The further updates on gene structural annotation, especially identification of novel exons, may change the results and derive interesting finding of the variants' pathogenic feature [40,41]. We are interested to see the results derived from the joint analysis of this study with other genomics data.

Sample collection and preparation
During this study, 40

Illumina exome capture and sequencing
Human genomic DNA was extracted from frozen tumor or matched colon tissue with QIAampDNA Mini kits (Qiagen). Covaris system is used to obtain the fragmented genomic DNA between 150 to 200 bp. Adaptors were ligated to both ends of the fragments and adaptor-ligated templates were then purified using AgencourtAMPure SPRI beads. Whole-exome enrichment was performed by the SureSelect Human All Exon 51 M kit (Agilent). Captured DNA libraries are sequenced by

Complete genomics exome capture and sequencing
Whole-exome sequencing was performed by a sequencing-by-ligation method [42]. Briefly, fragmented genomic DNA was generated by Covaris system with length between 200-400 bp. After ligation and PCR amplification, the whole exome is enriched by Exome 59 M kit (BGI). Then, a second round PCR amplification is performed. The resulted DNA fragments are prepared for Complete Genomics Black Bird Sequencer, and 30-35 base paired-end reads are finally obtained.

Illumina variant detection
The following pipeline was carried out with default parameters unless explicitly described. First, clean reads were obtained by removing adapter reads and low quality reads (≥10% of the bases are N, or ≥50% of the bases with Phred score ≤5). Burrows-Wheeler Aligner (BWA) was used to align the clean reads to the human reference genome (UCSC Genome Browser hg19) with parameters "-o 1 -e 50 -m 100000 -t 4 -i 15 -q 10 -I" [43]. SAMtools was used to convert the SAM-formatted alignment results to BAM-formatted alignment files. Local read alignment re-calibration was performed by Genome Analysis Toolkit (GATK IndelRealigner) [44]. Finally, Picard toolkit was used to mark duplicates. [45]. Somatic SNV detection: Mutect v1.1.4 was used to compare tumor BAM files against their matched control BAM files for somatic single nucleotide variants (SNVs) identification [46]. They are filtered with the following requirements: minimum coverage ≥ 10X, mutation allele fraction ≥10% and ≥5 reads. Somatic Indel detection: GATK was used to detect somatic InDels with following parameters: minCoverage = 6, minNormalCoverage = 4, minFraction = 0.3. False positive InDels were finally removed by in-house scripts.

Complete Genomics variant detection
The resulting mate-paired reads were aligned to the reference genome (hg19) and variants are called by the reported methods [47].

Statistical analysis
Qualitative variables were compared by Fisher's exact test. T-test was used for normal distributed data comparison, and Wilcoxon rank test was used for nonnormal distributed data. All of the statistical analyses were performed in R or Bioconductor environment.

Author contributions
JD and XDF designed the project, MY collected the patient samples, ethnical approval documents and clinical information. CY, ZL, YZ, XCL and WL performed the bioinformatics analysis; ZL, YC, YZ, XDF and JD wrote the paper. All of the authors agree the manuscript.

ACKNOWLEDGMENTS
Thanks are given to Rongfa Yuan, Jianghua Shao and Chunmei Piao who have helped us to collect the patient samples, and Prasanth Bhatt who proofread this manuscript.