Distinct distributions of genomic features of the 5’ and 3’ partners of coding somatic cancer gene fusions: arising mechanisms and functional implications

The genomic features and arising mechanisms of coding cancer somatic gene fusions (CSGFs) largely remain elusive. In this study, we show the gene origin stratification pattern of CSGF partners that fusion partners in human cancers are significantly enriched for genes with the gene age ofEuteleostomes and with the gene family age of Bilateria. GC skew (a measurement of G, C nucleotide content bias, (G-C)/(G+C)) is a useful measurement to indicate the DNA leading strand, lagging strand, replication origin, and replication terminal and DNA-RNA R-loop formation. We find that GC skew bias at the 5 prime (5′) but not the 3 prime (3’) partners of CSGFs, coincident with the polarity feature of gene expression breadth that the 5’ partners are more ubiquitous while the 3’ fusion partners are more tissue specific in general. We reveal distinct length and composition distributions of 5’ and 3’ of CSGFs, including sequence features corresponded to the 5’ untranslated regions (UTRs), 3’ UTRs, and the N-terminal sequences of the encoded proteins. Oncogenic somatic gene fusions are most enriched for the 5’ and 3’ genes’ somatic amplification alongside a substantial proportion of other types of combinations. At the function level, 5’ partners of CSGFs appear more likely to be tumour suppressor genes while many 3’ partners appear to be proto-oncogene. Such distinct polarities of CSGFs at the evolutionary, structural, genomic and functional levels indicate the heterogeneous arsing mechanisms of CSGFs including R-loops and suggest potential novel targeted therapeutics specific to CSGF functional categories.


INTRODUCTION
Somatic gene fusions are a common feature of many human cancers and have been found prevalent in leukemia, sarcoma, as well as epithelial cancers such as thyroid, prostate, colorectal, and breast cancer [1,2]. CSGF data have been well-curated and publically available through the Mitelman Database of Chromosome Aberrations and Gene Fusions in cancer [2], the COSMIC database [3], the TICdb [4], as well as in publications [5][6][7]. Genome-wide studies of CSGFs, especially in hematological malignant diseases, have considered the involvement of chromatin structure [8] and timing during replication [9] as putative genomic susceptibilities for the arising of CSGFs. Recently, CSGFs have been systematically identified in tumor samples of The Cancer Genome Atlas (TCGA) project [6] and in cancer cell lines [5]. These highly heterogeneous data have been offering an unprecedented opportunity to study the arising mechanisms and function of CSGFs.
Cancers evolve by a reiterative somatic process of clonal expansion, genetic diversification and clonal selection within the adaptive landscapes of tissue ecosystems. [10,11], relying on the somatic dynamics of genomic features and functions. Individual oncogenic fusions can also have inherent tumor suppression property [12]. Meanwhile, the natural history of genes links to diseases, especially for cancers [13,14]. It appears that arsing mechanisms and genomic functions of CSGFs are highly heterogeneous [15][16][17][18][19][20][21]. Meanwhile, little known is whether there are distinct phylogenomic and genomic features of CSGFs involved genes. In this study, we sought to link the phylogenomic and genomic features of CSGFs involved genes to potential arsing mechanisms and function categories of CSGFs by leveraging the rich resources of omics data.

CSGF data sets and annotation settings
We collected six curated datasets of CSGFs (Table 1), including the COSMIC [22], the cancer cell line fusion data (Genetech) [5], TICdb [4], the ensemble cancer landscape fusion data [7], Mitelman fusion data (Mitelman) [23], and the TCGA fusion data (TCGA) [6]. Toward a comprehensive annotation of CSGFs involved genes, we also collected a number of functional genomic data sets (Table 1), including the reference human genomic ENSEMBL data [24], protein-protein interactions [25], the FANTOM expression data [26,27], GTEX data [28], the standard gene expression data set [29], Illumina's body map [30], the draft human proteome [31,32], the gene and gene family age curation [14], the human gene expression atlas [33], as well as the TCGA data [34][35][36]. Of note, these data sets have been well recognized for their quality. Many are standard setting yet highly heterogeneous, raising substantial challenge for signals capturing in terms of the following analysis. Thus, if significant signal captured, it may have much implications.

Genome wide preference pattern of genes as 5ʹ partner (5ʹFG) or 3ʹ partner (3ʹFG) of CSGFs
Little is known about the genome wide preference pattern of being 5ʹFG or 3ʹFG of CSGFs. Of the 12,849 unique CSGFs collected from the resources listed in Table 1, we found 5,980 genes at 5ʹ (or N terminal at the protein coding level) and 7,150 genes at 3ʹ (or C-terminal at the protein coding level), as well as 3,357 genes observed at both directions. The top 20 most frequent CSGFs involved genes have distinct preferences of being involved in 5ʹFG or 3ʹFG ( Figure 1A). For examples, ALK is frequently fused at the 3′ end while RARA is more frequently fused at the 5′ end ( Figure 1A). The genome-wide distribution is shown in the Bland-Altman plot, illustrating the global asymmetry (n = 12,848, Kolmogorov-Smirnov test D = 0.54, p-value < 2.2e-16) ( Figure 1B). Thus, it appears that being as 5ʹFG or 3ʹFG is highly asymmetric at both the global and individual gene levels, indicating a potential link of gene features to the arising of CSGFs.

Euteleostomes genes and Bilateria gene families are more likely to be involved in CSGFs
To exploit phylogenomic profile of CSGFs involved genes, we collected three phylogenomic profile data sets, including the gene age database Protein historian [37], as well as two additional curated data sets [14,38]. As shown Table 2, the origins of most recurrent fusions involved genes are almost of a vertebrate origin. CSGF involved genes are significantly enriched of the Bilateria origin in terms of gene family ( Figure 2A) and Euteleostomes origin ( Figure 2B) of gene age. The 3′ portion of driver CSGFs set (COC3, from the COSMIC data set) is among the highest enriched set of genes. Consistently, CSGFs involved genes are more related to whole genome duplication events ( Figure 3), but not small scale genome duplication events ( Figure 4). These phylogenomic data appear to support a view that cancer is more likely a vertebrate-specific disease [39].

The asymmetric sequence pattern of CSGFs involved genes
Notably, there are at least four mechanisms by which how translocations arise: synthesis mediated end joining, breakage-fusion-bridge cycles, RAG-mediated translocation and AID-mediated translocation [40]. To gain insights into the arising mechanisms of CSGFs, we carried out global analysis of genomic sequence features of CSGFs involved genes, including the GC content defined as (G+C)/(T+G+C+A), the GC skew defined as the ratio of (G−C)/(G+C), and the AT skew defined as the ratio of (A−T)/(A+T)), as well as the S defined as the summary of GC skew and AT skew. An emerging asymmetric pattern of these basic sequence measurements can be utilized for prediction of R-loop formation [41]. R loops are threestranded nucleic acid structures comprised of nascent RNA hybridized with DNA template while leaving the non-template DNA single-stranded [42]. R loops are considered more unstable intermediates of RNA-DNA structure and are preferentially formed when the nontemplate strand is G rich [43].
It appears to be AT rich in the 5′ partners of the CSGF driver set (COC, the COSMIC CSGF drivers) and GC rich in the 3′ partners ( Figure 5A, scaled heatmap plot). Due to the highly heterogeneous nature of our collected data, the GC and AT skew features of CSGFs involved genes are highly heterogeneous ( Figure 5B-5D). www.impactjournals.com/oncotarget These data include six gene fusion sets and thirteen annotation data settings. However, distinct strand asymmetric patterns of fusion genes were still observed (kernel density plot, Figure 5E) by using three parameters for measuring the strand asymmetries, including GC skew, AT skew and S composition. There is a significant GC skew towards the 5′ fusion partners ( Figure 5E, P < 2.2e-16). Moreover, the AT-skew distribution shows that all fusion-involved genes are more AT asymmetric when compared to the background coding sequences (P < 2.2e-16, Figure 5E).
Consistently, the S composition shows a significant asymmetric distribution of CSGFs involved genes (P < 2.2e-16, Figure 5E). Since the DNA breakpoint sequence of CSGFs can reveal distinct translocation mechanisms, we set out to examine the motif patterns of known CSGFs breakpoints. The breakpoint sequences were downloaded from TICdb (Release 3.3, August 2013). Of note, we first analysis the global pattern of these breakpoints (Supplementary Figure S1), indicating an asymmetric pattern of the tetranucleotide distribution. Further, we performed detained key CSGFs analysis. As shown in Figure 6A, there is a direct correspondence between R-loop signature (GGG) with IgH class switch recombination signature (WGCW) (Spearman correlation coefficient ρ = 0.51, p = 1.57e-90). However, it is only a subset of CSGFs characterized by R-loops and an IgH switch signature ( Figure 6B). Related CSGFs were listed, including Ig-MYC ( Figure 6C). Thus, given these signals, our sequence analysis results support the potential of R-loops involved in arising mechanism in a subset of CSGFs.

Distinct genomic features including gene expression breath of CSGFs involved in genes
The polarity is also found in other genomic events, such as alternative splicing, disordered region, and the human-mouse non-synonymous evolution rate (dN) ( Figure 7D). Gene expression breadth is a measurement of the number of tissues in which a given gene is expressed. Based on the Unigene reference dataset comprised of 45 body sites, we found that fusion genes' 5′ partners have greater gene expression breadth than their 3′ partners, independent of tissue selectivity or specificity ( Figure 8A-8C). Therefore, fusion genes' 3′ partners are more tissue -specific.
We further examined the asymmetric pattern at the promoter, 5′ UTR and 3′ UTR of transcripts, as well as encoded protein levels of the CSFGs. Core promoter regions, the 1.   Figure S2C), and C4 (CCCC) counts (Supplementary Figure S2D). The significant sequence features between CSGFs involved 5′ genes and 3′ genes (Two-sample Kolmogorov-Smirnov test p-value < 2.2e- 16) indicate that 5′ sequence features might be related to the formation of CSGFs. Indeed, 5′ partners in CSGFs show lower overall GC content ( Figure 8D, Wilcoxon rank sum test, W = 257,110, p-value = 1.95e-05) but 3ʹUTR partners in CSGFs doesn't show significant difference in GC content ( Figure 8E, Wilcoxon rank sum test with continuity correction, W = 129,190, p-value = 0.53) and promoters ( Figure 8F, W = 18,896, p-value = 0.94). Intriguingly, protein stability related motif of the N-terminal sequences of FG5 and FG3 reveal an asymmetric patterns the second to the 10 amino acid of CSGFs (COSMIC driver fusions) 5ʹ partners (A) (Figure 9). Since the N-terminal amino acids are crucial protein degradation [44], indicating protein stability might be another critical level of cancer gene fusion functions in turn cancer somatic evolution.

The connection of somatic copy number alterations and CSGFs
Somatic copy number alterations (SCNAs) play critical roles in activating oncogenes and inactivating tumour suppressors via affecting a larger fraction of the genome in cancers than any other type of somatic genetic alterations [45][46][47]. However, little is known about the connection between SCNAs and CSGFs.
Here we intersect the SCNAs and CSGFs at individual  Table 1 while the '5' or '3' postfix indicates the 5ʹ or 3ʹ involved genes respectively. The "Genome" category represents the average level of genes of the human genome. primary tumour sample level of the 13 type of cancers in TCGA. Intriguingly, among the nine potential combinations, the amplification-amplification (A-A) type is significantly enriched (Figure 10, around 3.8-fold overrepresentation, P = 0, binomial test. Details are presented in Supplementary Table S1). However, the amplificationdeletion (A-D) and deletion-amplification (D-A) types are significantly under-represented with 0.11 (BH adjusted P = 7.2e-222) and 0.12 (BH adjusted P = 8e-217) fold of the expectations, respectively (Figure 10). The result is detailed in Supplementary Table S1. The finding suggests that CSGFs are dynamically evolved in amplification. Hence, we conclude that a large number of CSGFs with amplification are selected during somatic evolution of tumours.

Functional exploration of CSGFs and CSGFs involved genes
We further examined the asymmetric patterns of functional combinations in CSGFs. We interrogated the TCGA 13-cancer data sets with known kinase (KI) genes [48] and transcription factors (TF) [49]. As shown in the Supplementary Table S2, there are a substantial number of combinations, yet the TF-KI combination is more enriched that is KI-TF (with 3.3-and 2.1-fold enrichment, respectively, and adjusted BH p values 5.5e-22 and 3.3e-7, successively). Since most kinases are related to protoncogene function while TFs are enriched in tumor suppressors, it is most likely that 5ʹ partners have higher propensity to be tumour suppressor while 3ʹ partners have a larger likelihood to be proto-oncogene. Intriguingly, we further examined the asymmetric pattern of FG5 and FG3 at other levels, including intrinsic protein disorder and cancer signalling pathways. As shown in Supplementary Figure S3, the F5Gs (cancer somatic gene fusions involved 5ʹ genes) have a higher gene expression breadth and higher intrinsic disorder region score while the F5Gs have more transcripts and longer gene length. Meanwhile F5Gs and F3Gs (cancer somatic gene fusions involved 3ʹ genes) were asymmetrically mapped to the key cancer gene pathways (Supplementary Figure S4).  Table 1 while the '5' or '3' postfix indicates the 5ʹ or 3ʹ involved genes respectively.
Thus, it appears to be a functional stratification of CSGFs in terms of F5Gs and F3Gs.
Given that most cancer deaths are due to the development of metastases [50,51], we examined whether CSGFs involved genes are enriched metastasis genes. Based on PubMed search and collected gene settings, we curated a tumour metastasis gene set [52,53]. As shown in Supplementary Figure S7, there is a 3.7-fold enrichment of the metastasis genes in the fusion genes setting collected in the COSMIC database (P < 5.4e-20, hypergeometric test). The 63 CSGFs involved genes are also listed in the Supplementary Table S3. It appears that CSGFs may have critical role during tumour metastasis.

Network properties of CSGF involved genes
Next we explored the network properties of CSGF involved genes in the human protein interaction networks. The centrality distribution of cancer related genes, including SMG (somatically mutated cancer driver genes), CPG (cancer predisposition genes), GCG (GWAS cancer associated genes), GMG (HGMD cancer genes), and WGG (human genome genes), are shown in Supplementary Figure S5. CSGF involved genes have a similar distribution pattern compared to putative cancer genes (Supplementary Figure S5). Meanwhile, there is an asymmetric pattern of the centrality of 5ʹ and 3ʹ partners (Supplementary Figure S6) and the 5ʹ partners have less degree that their 3ʹ counterparts (Supplementary Figure S6, Kolmogorov-Smirnov test, n = 5155, D^= 0.5, p-value < 2.2e-16). It is likely that those physical interacted pairs are susceptible to become gene fusion pairs. Thus, CSGFs favor gene pairs with strong physical interactions, suggesting a mechanism similar to evolution by "tinkering" [54].

DISCUSSION
Distinct phylogenomic and genomic features identified through evolutionary studies of the emergence The abbreviation of each category is described to Table 1 while the '5' or '3' postfix indicates the 5ʹ or 3ʹ involved genes respectively. www.impactjournals.com/oncotarget (G) A plot of the distribution of S composition measurement (S = GC_skew +AT_skew), show significant asymmetric distribution of CSFiGs (P < 2.2E-16). genome: the human genome, common: coding sequences involved as both 5ʹ partner and 3ʹ partner in the CSGF data sets, F3Gu: coding sequences involved as the 3ʹ partner, F5Gu: coding sequences involved as the 5ʹ partner. The vertical intersected lines indicate the median value of the variables of each category. www.impactjournals.com/oncotarget of cancer genes have provided mechanistic insights into the complexity of cancer progression in human. However, less is known about CSGFs involved genes at the genomewide level since most existing studies focus on discovering novel fusions and testing individual molecular mechanisms [2,40]. In this study, we show that a substantial number of CSGF involved genes arise from vertebrate whole genome duplications while the involved gene families are of bilateral origin. Consistently, conserved protein domain combinations of the cancer somatic fusion genes appear to be vertebrate specific. Our results provide molecular evidence supporting the hypothesis that the earliest cancer occurred in vertebrates [39]. Indeed, the earliest known unequivocal neoplastic case was found on the partial skeleton of a North American lower carboniferous (about 300 million years BP) fossil fish, Phanerosteon mirabile [55].
Distinct genomic features of CSGFs may have both theoretical and practical implications. There are clear patterns of the polarity of CSGFs: i) F5Gs appear to be less globally expressed in contrast to a recent     To extract a set of high confidence CNVs, a threshold of 0.2 in segment mean value for amplifications and -0.2 for deletions were employed. The observed and expected nine types of combinations (A stands for amplification; N, normal; D, deletion) were plotted. Chisquared (X 2 ) test, X 2 = 4375, df = 8, P < 2.2e-16. The amplification-amplification (A-A) type is significantly enriched (around 3.8 fold ovepresentation, P = 0, binomial test). Meanwhile, both the amplification-deletion (A-D) and deletion-amplification (D-A) types are significantly underrepresented with BH adjusted P = 7.22e-222 and P = 7.98e-217, respectively. study which showed more global expression of cancer genes; ii) the selection signatures of 3′ UTRs of 5′ UTRs are linked to an miRNA regulation network [9,56]; iii) at the protein level, there are functional selection footprints of N-terminal degradation rules [44]; iv) there are intrinsic disordered signatures at the joint site of CSGFs [56]. Such observed structure and function relationships not only have the potential to enable us to better understand the functions of CSGFs and more accurately predict CSGFs but also implicate novel therapeutics including targeting cancer somatic fused tyrosine kinase (TK) fusions and activating multiple CSGFs simultaneously [12].
DNA breakpoint signatures provide new insights into the mechanisms underlying CSGFs. However, most studies focused upon germline translocation breakpoints, involving four distinct mechanisms including nonhomologous end joining (NHEJ), non-allelic homologous recombination (NAHR), transposition, DNA replication mechanisms [40]. For somatic cases, especially for those in carcinomas, arising mechanisms have been largely unknown (26,28). These overrepresented motifs further highlight the key role of inflammation process that might cause genome instability. Given that genome instability can trigger inflammation, there may be a positive feedback loop between two hallmarks of cancer, genome instability and immune disorder [57]. Our study also revealed the connection of CSGFs to promoters, enhancers, H3K4me3, replication time, as well as fragile sites susceptible to rearrangements and translocations. Another important finding from our study is that a significant number of CSGFs are involved in metastasis, the core of cancer mortality.
Taken together, our findings underscore the prerequisites, causes, and consequences of CSGFs and further our knowledge of CSGFs at the evolutionary, structural and functional levels. The striking features uncovered by integration of phylogenomic, functional genomic, protein interaction data have both theoretical and clinical implications for further testing.

CSGF data sets and annotations
CSGFs datasets (Table 1), including the COSMIC [22], Genetech cell line fusions [5], TICdb [4], Cancer landscape fusions [7], Mitelman fusions [23], and a recent curation of the TCGA fusions [6], were collected. Specially, the cancer landscape fusions and COSMIC fusions are focused on driver fusions, while other data settings are mixture of a variety of different somatic fusions found in cancer.
Functional genomic data sets (Table 1) including the reference human genomic ENSEMBL data [24], refined human protein coding genes (general and tissue-specific protein-protein interactions [25], FANTOM expression data [26,27], GTEX data, the standard gene expression data set [29], Illumina's body map, draft human proteome, gene age curation [14], human gene expression atlas [33], as well as the TCGA data [34,35], were also collected. Gene expression data including both mRNA level and protein level as well as protein interaction network data were tabulated in Table 1. Evolutionary data including gene and gene family age settings have also been listed.
The TCGA SCNAs data set was downloaded from the TCGA site. For a set of high-quality and robust CNVs extracting, the thresholds of 0.2 and -0.2 for segment mean value were used to determine amplifications and deletions, respectively. For CSGFs breakpoint and TGCA SNCAs overlapping, the Granges Bioconductor class was applied in the hg19 reference genome [58].

Asymmetry statistics
Statistical comparisons were carried out with a nonparameter model with R functions including Wilcoxon Signed Rank or Kruskal Wallis test. The Bland-Altman plot was employed for asymmetry analysis and signal capture [59]. The two-sample Kolmogorov-Smirnov test was used for statistically significant test with the function 'ks.test'.

DNA breakpoint sequence motifs analysis
Pan-cancer breakpoint sequences were downloaded from the TICdb (release 3.3, August 2013, http://www.unav.es/genetica/TICdb/) [4]. Meanwhile, a set of breakpoint sequences was got from literatures [5,6,60,61], and sequences were mapped to the hg19 human genome with GenomicAlignment [34]. Motif analysis library was carried out by using the method described in the literature [62]. The null model of overrepresentation was based on a Fisher's exact test by using the R script.

Protein-protein interaction networks
Gene-gene connection data was downloaded from HumanNet V.1 [63], a probabilistic functional gene network of 18,714 validated protein-encoding genes of Homo sapiens (by NCBI March 2007). HumanNet was constructed by a modified Bayesian integration of 21 types of 'omics' data from multiple organisms, with each data type weighted according to how well it links genes that are known to function together in H. Sapiens. Protein-protein interaction data and tissue specific protein interaction data networks were downloaded from the related web sites [25,[64][65][66][67][68][69][70]. The "igraph" package [71] was used to compute network properties including, degree distribution,