Mutation-profile-based methods for understanding selection forces in cancer somatic mutations: a comparative analysis

Human genes exhibit different effects on fitness in cancer and normal cells. Here, we present an evolutionary approach to measure the selection pressure on human genes, using the well-known ratio of the nonsynonymous to synonymous substitution rate in both cancer genomes (CN/CS) and normal populations (pN/pS). A new mutation-profile-based method that adopts sample-specific mutation rate profiles instead of conventional substitution models was developed. We found that cancer-specific selection pressure is quite different from the selection pressure at the species and population levels. Both the relaxation of purifying selection on passenger mutations and the positive selection of driver mutations may contribute to the increased CN/CS values of human genes in cancer genomes compared with the pN/pS values in human populations. The CN/CS values also contribute to the improved classification of cancer genes and a better understanding of the onco-functionalization of cancer genes during oncogenesis. The use of our computational pipeline to identify cancer-specific positively and negatively selected genes may provide useful information for understanding the evolution of cancers and identifying possible targets for therapeutic intervention.


INTRODUCTION
Since the pioneering work of Cairns and Nowell [1,2], the evolutionary concept of cancer progression has been widely accepted [3][4][5][6][7]. In this model, cancer cells evolve through random somatic mutations and epigenetic changes that may alter several crucial pathways, a process that is followed by clonal selection of the resulting cells. Consequently, cancer cells can survive and proliferate under deleterious circumstances [8,9]. Therefore, knowledge of evolutionary dynamics will benefit our understanding of cancer initiation and progression.
For example, there are two types of somatic mutations in cancer genomes: driver mutations and passenger mutations [10,11]. Driver mutations are those that confer a selective advantage on cancer cells, as indicated by statistical evidence of positive selection. Passenger mutations do not confer a clonal growth advantage and are usually considered neutral in cancer. However, some passenger mutations in protein-coding regions that would have potentially deleterious effects on cancer cells may be under negative selection in cancer [12,13].
Cancer somatic mutations, especially driver mutations, promote the cancer specific functionalization Research Paper www.impactjournals.com/oncotarget of cancer-associated genes, i.e., onco-functionalization. Onco-functionalization of cancer-associated genes would promote cancer initiation and progression. For example, oncogenes may gain new functions during carcinogenesis, which could be considered cancer-specific neo-functionalization [14]. By contrast, the mutation of tumor suppressor genes to cause a loss or reduction of their function could be considered cancer-specific nonfunctionalization [15].
Analyses of large-scale cancer somatic mutation data have revealed that the effects of positive selection are much stronger on cancer cells than on germline cells [16,17]. Given that many of the positively selected genes in tumor development act as the driving force behind tumor initiation and development and are thus considered "driver genes", it is understandable that almost all previous studies have focused on positively selected genes in cancer genomes [3,[18][19][20][21]. Nevertheless, we have realized that an alternative approach, i.e., identifying cancer-constrained genes that are highly conserved in tumor cell populations (under purifying selection), is also valuable. For example, TP73, a homolog of TP53, is rarely mutated but frequently overexpressed in tumor cells. TP73 has been reported to activate the expression of glucose-6phosphate dehydrogenase and support the proliferation of human cancer cells [22]. As essential genes are crucial for carcinogenesis, progression and metastasis, this idea may be advantageous in addressing issues related to drug resistance in cancer therapies, especially in cancers with high intratumor heterogeneity.
Many previous studies have used the ratio of nonsynonymous to synonymous substitution rates to identify genes that might be under strong positive selection both in organismal evolution and carcinogenesis [11,16,17,[23][24][25][26]. However, most of these studies applied conventional methods, which are usually based on simple nucleotide mutation/substitution models, e.g., the simplest equal-rate model assuming that every mutation or substitution pattern has the same probability [27]. Unfortunately, this may not be a realistic biological model because many recent cancer genomics studies have shown that mutation profiles vary greatly between different cancer samples [17,28]. In addition, context-dependent mutation bias (i.e., base-substitution profiles that are influenced by the flanking 5′ and 3′ bases of each mutated base) should also be considered [28,29].
In this study, we describe a mutation-profile-based method to estimate the selective constraint for each gene in pan-cancer samples and human populations. In brief, the new method discards an unrealistic assumption inherent in the equal-rate model that every mutation or substitution pattern has the same probability [27]. This assumption can lead to nontrivial biased estimations when it is significantly violated. By contrast, our method implements an empirical nucleotide mutation model that simultaneously considers account several factors, including single-base mutation patterns, local-specific effects of surrounding DNA regions, and tissue/cancer types. Using simple somatic mutations from 9,155 tumornormal paired whole-exome/genome sequences (ICGC Release 20), as well as rare germline substitutions from 6,500 exome sequences from the National Heart, Lung, and Blood Institute (NHLBI) Grant Opportunity (GO) Exome Sequencing Project (ESP), as references, we used this mutation-profile-based method to identify selective constraints on human genes, especially cancer-associated genes, in cancer cells. Our results may provide useful information for the precise classification of known cancerassociated genes and for an improved understanding of the evolution of cancers.

The mutation rate profiles in cancer genomes and human populations differ
Estimating evolutionary selective pressure on human genes is a practical method for inferring the functional importance of genes in a specific population. By comparing selective pressures, we may be able to identify different functional and fitness effects of human genes in cancer and normal cells. The conventional method for measuring selective pressure is to calculate the ratio of nonsynonymous to synonymous substitution rates using the equal-rate method [27], which assumes equal substitution rates among different nucleotides. In this study, we used the cancer somatic mutations from 9,155 tumor-normal pairs from ICGC (Release 20) as well as rare variants (minor allele frequency < 0.01%) from 6,500 exome sequences from ESP as a reference. We used these data to compare the empirical mutation rate profiles of cancer somatic mutations and germline substitutions using 96 substitution classifications [28,29]. The empirical mutation rate profiles reveal the prevalence of each substitution pattern for point mutations and present not only the substitution types but also the sequence context (see Methods). The exonic mutation profiles of cancer somatic mutations and germline substitutions are both enriched in C-to-T transitions ( Figure 1). The mutation rates for each trinucleotide context differ from each other, and the ratio of transition to transversion for each trinucleotide context is much greater than 1:2 for both cancer somatic mutations (ratio = 2.70 ± 0.47) and germline rare variants (ratio = 3.28 ± 0.53) (Supplementary Figure 1). These different mutation profiles may lead to different biological progressions in carcinogenesis, as depicted in several publications [19,28]. For example, the mutation profiles of melanoma are highly enriched in C-to-T transitions, indicating a direct mutagenic role of ultraviolet (UV) light in melanoma pathogenesis [30]. Thus, it is inappropriate to use conventional methods such as the equal-rate model to measure selective pressure www.impactjournals.com/oncotarget because this approach ignores the mutation bias of different nucleotide substitution types.

Measuring selective pressure on human genes in cancer and germline cells using the mutationprofile-based method
We therefore formulated an evolutionary approach that was designed specifically to estimate the selective pressure imposed on human genes in cancer cells and then identify genes that had undergone positive and purifying selection in cancer cells compared with in normal cells (see Figure 2 for an illustration). In cancer genomics, distinguishing synonymous from nonsynonymous somatic mutations is straightforward. We developed the mutation-profile-based method to estimate the C N /C S ratio of each human gene based on the mutation profiles of cancer somatic mutations and the p N /p S ratio for germline substitutions. In contrast to the equal-rate method [27], our method considers differences in substitution rates and uses the overall mutation rate profile as the weight matrix ( Figure 1).
We calculated the expected number of nonsynonymous and synonymous sites based on the exonic mutation rate profiles. We then counted the number of nonsynonymous and synonymous substitutions in the protein-coding region of each human gene for all cancer somatic mutations or germline substitutions. A χ 2 test was performed to identify the genes whose C N /C S values were either significantly greater than one or less than one, which indicates positive or negative (purifying) selection, respectively. Of the 16,953 genes with at least one germline substitution and cancer somatic mutation, the overall C N /C S value for cancer somatic mutations (mean ± s.e. = 1.199 ± 0.008) was much greater than the overall p N /p S of germline substitutions (mean ± s.e. = 0.738 ± 0.005) (Wilcoxon test, p < 2.2×10 -16 ) (Table 1A,  Supplementary Table 1). In the cancer genomes, 365 genes had C N /C S values significantly greater than one, and 923 genes had C N /C S values significantly less than one (χ 2 test, p < 0.01, FDR < 0.1). By contrast, germline substitutions included only 24 genes with p N /p S values significantly greater than one, whereas 4,897 genes had p N /p S values significantly less than one (χ 2 test, p < 0.01, FDR < 0.1). Of these 365 cancer positively selected genes, only one gene (RSRC1) also exhibited positive selection whereas 117 genes exhibited negative selection in germline substitutions. Additionally, 500 cancer negatively selected genes did not exhibit significant negative selection in germline substitutions. These genes may therefore be Previous studies have attributed elevated C N /C S values to the relaxation of purifying selection [16] or increased positive selection of globally expressed genes [17]. Our results show that the number of genes under positive selection increased, whereas the number of genes under negative selection decreased, in cancer genomes compared with germline genomes. This result indicates that both the relaxation of purifying selection on passenger mutations and the positive selection of driver mutations may contribute to the increased C N /C S values of human genes in cancer genomes.

Selection pressures on cancer-associated genes
The Cancer Gene Census (CGC) [31, 32] contains more than 500 cancer-associated genes that have been reported in the literature to exhibit mutations and that are causally implicated in cancer development. Of those genes, 553 were included in the 16,953 genes that we tested. These known cancer genes have significantly greater C N /C S values (Wilcoxon test, p = 2.9 × 10 -10 ) for cancer somatic mutations but significantly lower p N /p S values for germline substitutions (Wilcoxon test, p < 2.2 × 10 -16 ) than other genes (Table 1A). For selection over longer evolutionary time scales, we extracted the d N /d S values between human-mouse orthologs from the Ensembl database (Release 75) [33]. The known cancer genes have significantly lower human-mouse d N /d S values than other human genes (Wilcoxon test, p < 2.2 × 10 -16 ). These results support the work of Thomas et al. [34], who showed that known cancer genes may be more constrained and more important than other genes at the species and population levels, especially for oncogenes. By contrast, known cancer genes are more likely to gain oncofunctional somatic mutations in cancer than other genes.
Among the 365 cancer positively selected genes, 45 (12.3%) genes are known cancer genes, indicating that cancer genes are significantly enriched in cancer positively selected genes (Fisher's Exact Test, p = 6.7 × 10 -15 ). When we choose a more stringent cutoff of p < 10 -5 , 17 of the 29 (58.6%) positively selected genes are known cancer genes, according to the CGC, and the work of Lawrence et al. [20] and Kandoth et al. [35], such as the well-known cancer drivers TP53, KRAS, PIK3CA, and BRAF. (Supplementary Table 2). In addition, the 29 strong positively selected genes are significantly enriched in biological processes related to cancer, according to the functional analysis using DAVID  Table 3). Some cancer genes also show negative selection in cancer genomes, such as the oncogene MLLT3 (C N /C S = 0.11, p = 3.14 × 10 -44 , FDR = 5.52 × 10 -41 ). The MLL-MLLT3 gene fusion is the main mutation type of MLLT3 that drives tumorigenesis in acute leukemia [37]. Interestingly, MLLT3 has recurrent synonymous mutations at amino acid positions 166 to 168 (S166S, 8/9155; S167S, 33/9155; S168S, 23/9155).
Using the C N /C S values, we classified known cancer genes according to the selection pressure on these genes in cancer cells, as well as their onco-functionalization in oncogenesis ( Table 2). The most important two classes are oncogenes and tumor suppressor genes that are under strong positive selection, such as TP53, the most famous tumor suppressor gene [38], which shows strong positive selection pressure (C N /C S =32.57, p = 1.06 × 10 -159 , FDR = 6.55 × 10 -156 ). The non-synonymous mutations of TP53 with onco-nonfunctionalization are distributed in a wide range of cancers. The oncogene KRAS [39] also showed a strong positive selection pressure (C N /C S = 45.88, p = 4.25 × 10 -87 , FDR = 1.74 × 10 -83 ). Recurrent nonsynonymous mutation with onco-neofunctionalization of KRAS are highly enriched in codons 12 and 13; mutations in these codons represent 79.4% and 8.0% of all nonsynonymous mutations of KRAS.
We also observed 12 cancer positively selected genes (p < 10 -5 ) that have not been reported as cancerassociated genes. These genes are recurrently mutated in several tumor types and are potential cancer driver genes. According to the mouse insertional mutagenesis experiments [40], three of these genes (DMD, MYO9A, and COL5A2) have been identified as cancer-causing genes [41][42][43][44].
When we chose a more stringent cut-off of p < 10 -5 for cancer negatively selected genes, we found 112 genes that showed an enrichment in the Notch signaling pathway (Supplementary Table 3). Fortyseven of the 112 negatively selected genes showed more stringent selective constraint in cancer cells than in normal cells (p N /p S > C N /C S , p > 0.05 for p N /p S ). It would be quite valuable to uncover the roles of these evolutionarily conserved genes in cancer cells. Out of the 47 genes, 14 genes showed a significantly increased expression level in cancers than in normal tissues (fold change > 2, p < 10 -4 ) (Supplementary Table 4). For example, SPRR3, a member of the small proline-rich protein family, is under purifying selection in cancer cells (C N /C S = 0.27, p = 5.73 × 10 -11 , FDR = 1.91 × 10 -8 ) and neutral selection in germline cells (p N /p S = 0.88, p = 0.75, FDR = 0.37). It has been reported that SPRR3 is overexpressed in several tumor types, and is associated with tumor cell proliferation and invasion. Therefore, SPRR3 could be a potential biomarker and novel therapeutic target [45][46][47].
We also examined essential genes during human development and cancer development. We extracted 2,452 human orthologs of mouse essential genes from DEG10 (the Database of Essential Genes) [48]. These genes, which are human orthologs of known essential genes in mice [49], are critical for cell survival and are therefore more conserved than other genes at the species and population levels. Here, we found that human orthologs of mouse essential genes have significantly lower d N /d S values (measured between human-mouse orthologs) and lower p N /p S values for germline substitutions but similar C N /C S values for cancer somatic mutations compared with the values for non-essential genes (Table 1A). Human orthologs of mouse essential genes are also enriched among cancer positively selected genes. Eighteen of the twenty-nine (62.1%) positively selected genes (p < 10 -5 ) are human orthologs of mouse essential genes (Supplementary Table 2). We also used the human orthologs of mouse essential genes from OGEE (the database of Online GEne Essentiality) [50] to confirm these results (Supplementary Table 2).
Cancer essential genes were identified by performing genome-scale pooled RNAi screens. RNAi screens with the 45k shRNA pool in 12 cancer cell lines, including small-cell lung cancer, non-small-cell lung cancer, glioblastoma, chronic myelogenous leukemia, and lymphocytic leukemia, revealed 268 common essential genes [51]. Compared to other human genes, these cancer essential genes have significantly lower d N /d S values and lower p N /p S values for germline substitutions and greater C N /C S values for cancer somatic mutations (Table 1A), suggesting a functional shift of these genes in human populations and cancer cells.
We further tested the correlations of the d N /d S , p N /p S and C N /C S values of human genes for human-mouse orthologs, germline substitutions and cancer somatic mutations to compare selective pressures among species, populations and cancer cells (Table 1B). For different gene sets, the d N /d S values show a weak positive correlation with the p N /p S values, but no significant correlation with C N /C S values. The p N /p S values and C N /C S values also do not have significant correlation for different gene sets. These results indicate that the cancer-specific selection pressure is quite different from the selection pressure at the species and population levels.

Selection pressure among different cancer types
As cancer is highly heterogeneous, we further analyzed the selection pressure of human genes in different cancer types. The 9,155 tumor samples from the ICGC database could be classified as 20 cancer types according to the primary site. The overall C N /C S values for the cancer somatic mutations in the different cancer types ranged from 1.078 ± 0.022 to 1.827 ± 0.013 (mean ± s.e., Table 3). The detected positively and negatively selected genes (χ 2 test, p < 0.01) varied in the different cancer types (Supplementary Table 5). Due to the limited number of tumor samples and somatic mutations for each cancer type, particularly in the www.impactjournals.com/oncotarget cancer types with low mutation rates, our method might not be sensitive enough to detect the selection pressure for each gene. For example, only one positively selected gene was detected in bone cancer (IDH1) and nervous system cancer (ALK), respectively. There were also three genes (TP53, PIK3CA and KRAS) that showed positive selection in more than five cancer types. In particular, TP53 showed positive selection in 15 cancer types. On the other hand, more genes (164/188, 87.2%) were under positive selection in only one cancer type. We also found that six genes (TBP, EP400, DSPP, MUC21, MLLT3, and MUC2) were under negative selection in more than five cancer types. These genes also showed negative selection at the species and population levels. Furthermore, 85.8% (2,417/2,817) of genes showed negative selection in only one cancer type. These results indicate the divergence of selection pressure in different cancer types.

Comparison of the equal-rate model and empirical mutation profile model
Considering that different nucleotide substitution models might provide varying estimates, we used the equal-rate method [27] as the simplest model to calculate     Figure 3A). Consequently, the number of genes with C N /C S values > 1 (χ 2 test, p < 0.01, FDR < 0.1) is much lower than those calculated using the exonic mutation profiles (37 versus 365), whereas the number of genes with C N /C S values < 1 (χ 2 test, p < 0.01, FDR < 0.1) is much greater (2851 versus 923) ( Figure 3B and 3C). We also used the intergenic mutation rate profile from 2,900 tumor-normal whole genome sequences, which are included in the 9,155 cancer samples of ICGC database, to calculate the C N /C S value for cancer somatic mutations. The overall C N /C S value (mean ± s.e. = 1.503 ± 0.010) is greater than that calculated from the exonic mutation rate profile (mean ± s.e.=1.199 ± 0.008) (Wilcoxon test, p < 2.2 × 10 -16 ), resulting in more positively selected genes (1526 versus 365) and fewer negatively selected genes (298 versus 923) ( Figure 3B and 3C).

d N /d S vs p N /p S d N /d S vs C N /C S p N /p S vs C N /C S r p-value r p-value r p-value
The equal-rate method ignores the mutation rate bias between different substitution types, especially the ratio of transition to transversion, leading to underestimation of the C N /C S ratio. Therefore, the equal-rate method is strict for positive selection detection but relaxed for the detection of negative selection [52]. In contrast, the mutation-profilebased method considers the mutation bias, which can be depicted as the internal variance between mutation rates of different substitution types. Thus, the mutation-profilebased method can correct the underestimation of the C N /C S ratio estimated by the equal-rate method. Furthermore, the mutation-profile-based method would also increase the false-positive results for detecting positively selected genes but be more conservative in detecting negatively selected genes. The mutation bias may simulate the detection of genes under strong selection pressure but may suppress the detection of genes under weak selection pressure.

DISCUSSION
A key goal of cancer research is to identify cancer-associated genes, such as oncogenes and tumor suppressor genes, that might promote tumor occurrence and progression when mutated [28]. Instead of searching for cancer-causing genes with multiple driver mutations, an alternative approach is to identify cancer essential genes in tumor cell populations because they are crucial for carcinogenesis, progression and metastasis. Cancer essential genes are important for the growth and survival of cancer cells [51] and are expected to be highly conserved in cancer cells. In this study, we aimed to detect both cancer-specific positively and negatively selected genes using a molecular evolution approach. Based on analyses of large-scale cancer somatic mutation data derived from The Cancer Genome Atlas (TCGA) or International Cancer Genome Consortium (ICGC), previous studies identified important differences between the evolutionary dynamics of cancer somatic cells and whole organisms [6,16,18]. However, these studies applied canonical nucleotide substitution models to identify the molecular signatures of natural selection in cancer cells or human populations and neglected the apparently different mutation profiles of these cell types.
Here, we developed a new mutation-profile-based method to calculate the C N /C S values of human genes for cancer somatic mutations. In our results, a large number of known cancer genes did not show significant positive selection according to our analysis. One possible reason for this finding suggests that positive selection for driver mutations is obscured by the relaxed purifying selection of passenger mutations. Additionally, among the strong positively selected genes, more than half are known cancer genes. Another possible reason might be that the main mutation type of more than 300 cancer-associated genes is translocation or copy number variation, rather than point mutation. Furthermore, some of the positively selected genes might also be related to cancer, such as DMD, MYO9A, and COL5A2, which have been identified as cancer-causing genes based on mouse insertional mutagenesis experiments [40]. Two prerequisites are crucial to properly apply the mutation-profile-based method. First, a large number of samples with similar mutation profiles are necessary to increase the power of selection pressure detection. Second, a subset of nucleotide substitutions should be chosen to represent the background neutral mutation profiles of the samples. In this study, because of the limited number of cancer samples, especially the number of whole-genome sequenced tumor-normal tissue pairs, we pooled all samples to analyze pan-cancer-level selection pressures. However, cancer somatic mutation profiles are well known to be heterogeneous among different cancer types, even for samples with the same tissue origin [19,20,28,35]. As the number of sequenced cancer genomes increases, we will be able to classify cancer samples by their specific mutation profiles and infer evolutionarily selective pressures more precisely using the mutation-profile-based method.
Background neutral mutation profiles can be calculated based on intergenic regions from the corresponding samples. In this study, we assumed that most of the exonic somatic mutations in the cancer samples do not have significant effects on the fitness of cancer cells. Under this assumption, we can apply the mutation profiles of coding regions to approximate the background. The exonic mutation profiles used in our mutation-profile-based method considered the weight of the 96 substitution classifications within the cancer exomes, which may reflect the mutation bias of different substitution types within the protein-coding regions. This method would correct the underestimation of the C N /C S value that occurs with the equal-rate method [52]. The mutation-profile-based method is more sensitive for the detection of positive selection but more conservative for the detection of negative selection compared with the equal-rate method. As more tumor-normal whole genome sequence data become available, it would be better to choose suitable mutation profiles for the mutation-profilebased method. With the expansion of these data in the future, we may apply more precise methods to identify neutral background mutation properties.
Human gene sequences and annotations were extracted from the Ensembl database (GRCh37, Release 75) [33]. For each gene, only the longest transcript was selected for the subsequent analyses. The d N /d S values between human-mouse orthologs were extracted from the Ensembl database. The HGNC (HUGO Gene Nomenclature Committee) database [53] (http://www.genenames.org/) and the Genecards database [54] (http://www.genecards. org) were used to map the gene IDs from different datasets. DAVID v6.7 was utilized for the functional annotation analysis [36]. ANNOVAR was utilized to perform biological and functional annotations of the cancer somatic mutations and germline substitutions [55]. The Oncomine database [56] (https://www.oncomine.org) was used to compare the gene expression level of negatively selected genes between cancer and normal tissues. The human orthologs of mouse essential genes were extracted from the DEG 10 [48] and the OGEE v2 [50] databases.

Statistical measure for gene-specific selection pressure in cancer evolution (C N /C S )
In cancer genomics, distinguishing synonymous from nonsynonymous somatic mutations is straightforward. Thus, given a set of independent cancer samples, the ratio of nonsynonymous counts (N) to synonymous counts (S) of a gene, denoted by N/S, is simply given by the sum over all samples, under the assumption of no double mutations at the same nucleotide site (e.g., for the observed mutation A > C, the mutation path A > G > C is almost impossible in cancers). To further explore the underlying mechanisms, the N/S ratio must be normalized by L N /L S , that is, where L N is the number of expected nonsynonymous sites and L S is the number of expected synonymous sites. Note that C N /C S is specific for cancer somatic mutations, to avoid notation confusions with d N /d S in molecular evolution and p N /p S in population genetics. To avoid a calculation error for the small sample size, 0.5 was added to each parameter for the calculation of C N /C S if N or S was equal to zero.
The calculation of L N and L S from the nucleotide sequence is not a trivial task. For instance, in the codon TTT (coding for amino acid Phe), the first two positions are counted as nonsynonymous sites because no synonymous changes can occur at these positions. At the third position, the transition change (T > C) is synonymous, whereas the remaining two transversion changes (T > A and T > G) are nonsynonymous. Apparently, the weight of the third position of codon TTT as synonymous (w S ) or nonsynonymous (w N ) depends on the pattern of somatic mutations. At one extreme, if the transition mutation is dominant, this position should nearly be counted as a synonymous site (w S = 1); at the other extreme (transversion dominant), this position would be counted as a nonsynonymous site (w S = 0).

Equal-rate model
The weight of a nucleotide as synonymous (w S ) is simple when the rate of base change is the same. Let I S be the number of possible synonymous changes at a site. This is counted as w S =I S /3 synonymous and (1-I S /3) nonsynonymous. For instance, in the codon TTT (Phe), the first two positions are counted as nonsynonymous sites because no synonymous changes can occur at these positions (w S = 0). The third position of codon TTT is then counted as one third of a synonymous site (w S = 1/3) and two-thirds of a nonsynonymous site (w N = 2/3) because only one of the three possible changes is synonymous. It is then straightforward to calculate the numbers of synonymous and nonsynonymous sites.

Empirical mutation profile model
Substantial evidence has demonstrated that the rate of somatic mutations in cancer depends on not only the nucleotide site (e.g., synonymous or nonsynonymous sites) and the mutation type (e.g., transition or transversion) but also on the sequence context of each mutated site, i.e., the effects of near-by nucleotides on somatic mutations are nontrivial. Recent studies [28,29,57] proposed an empirical mutation profile of any position with base P, considering two immediate neighbor nucleotides (x, y) of a trinucleotide string denoted by xPy. Since base P has six base-change patterns (under Watson-Crick pairing) and both x and y have four types of bases, there are a total of 4 × 6 × 4 = 96 substitution classifications, with the empirical profile denoted by M(xPy > xP i *y), where P i * (i = 1, 2, 3) for the other three bases instead of P.
To determine the probability of the mutation type (xPy > xP i *y), we divided the number of mutations in that trinucleotide context (xPy > xP i *y) by the number of occurrences of the trinucleotide (xPy). Our computational pipeline is illustrated by the following example.
In the encoding sequence with two codons … TTT-ATG…., we consider the third position of codon TTT (Phe). Under the trinucleotide TTA for the mutation profile (not the codon), the corresponding three substitution configurations are given by M(TTA > TCA), M(TTA > TAA) and M(TTA > TGA), respectively, and the number of occurrences of TTA is M(TTA). Next, we consider codon TTT. Because TTT and TTC are synonymous codons but TTA and TTG are not, the probabilities that this site will be synonymous and nonsynonymous are simply given by the following: We counted all somatic mutations in the proteincoding regions of the 9,155 tumor-normal paired cancer samples, as well as all the rare protein-coding variants of the ESP6500 dataset. The mutation profiles were depicted as the mutation rate of each mutation type according to the 96 substitution classifications.
The ratio of transition to transversion for each trinucleotide context was calculated based on the mutation rate of transitions and transversions. For example, the ratio of transition to transversion for ACA = M(ACA > ATA)/ (M(ACA > AAA) + M(ACA > AGA)).

Detection of positive and negative selections
The χ 2 test was used to compare the number of nonsynonymous and synonymous substitutions to the number of nonsynonymous and synonymous sites for each gene to test the statistical significance of the difference between the C N /C S values and one. Genes with C N /C S values significantly greater than one were classified as under positive selection in tumors, whereas genes with C N /C S values significantly less than one were classified as under negative, or purifying, selection. The falsediscovery rate was estimated using the qvalue package from Bioconductor [58]. The software tool R was used for statistical analysis (http://www.r-project.org/).