Transcriptome-wide identification and study of cancer-specific splicing events across multiple tumors.

Dysregulation of alternative splicing (AS) is one of the molecular hallmarks of cancer, with splicing alteration of numerous genes in cancer patients. However, studying splicing mis-regulation in cancer is complicated by the large noise generated from tissue-specific splicing. To obtain a global picture of cancer-specific splicing, we analyzed transcriptome sequencing data from 1149 patients in The Cancer Genome Atlas project, producing a core set of AS events significantly altered across multiple cancer types. These cancer-specific AS events are highly conserved, are more likely to maintain protein reading frame, and mainly function in cell cycle, cell adhesion/migration, and insulin signaling pathways. Furthermore, these events can serve as new molecular biomarkers to distinguish cancer from normal tissues, to separate cancer subtypes, and to predict patient survival. We also found that most genes whose expression is closely associated with cancer-specific splicing are key regulators of the cell cycle. This study uncovers a common set of cancer-specific AS events altered across multiple cancers, providing mechanistic insight into how splicing is mis-regulated in cancers.


Analyses of protein-protein-interaction among cancer-specific AS events
The genes containing cancer-specific AS events (or genes whose expression is associated with cancer-specific AS events) were obtained and submitted to the STRING database [4,5] (http://string-db.org/) for protein-protein interactions (PPI) analysis. We used the combined score of 0.4 as a cutoff and included five white nodes for network continuity. We used Cytoscape [6] to visualize the PPI network and the MCODE algorithm [7] to identify highly connected clusters within the network. See supplementary Table 2 and 3 for detailed parameters.

Calculation of evolutionary score
Sequence evolutionary score was downloaded from UCSC phastCons100 (http://hgdownload.cse.ucsc. edu/goldenPath/hg19/phastCons100way/) [8]. Based on multiple sequence alignments of 100 vertebrate species, each nucleotide was given an evolutionary conservation score ranging from 0 to 1. Highly conserved regions are assigned with a higher score. PhastCons estimates the probability that each nucleotide belongs to a conserved element based on multiple alignments using a hidden Markov model. For each SE event, we extracted sequences from different regions near the alternative exon to calculate average conservation score in a sliding window of 8 nt across all cancer-specific SE events and control events.

Motif enrichment analysis
To analyze the enriched sequence motifs near the splice sites of the 163 cancerspecific AS events, we first obtained nucleotide sequences from three splicing regulatory regions: upstream intron (300 nt), exon and downstream intron (300 nt) as shown in Figure S3. When obtaining the sequences, we excluded the first 25 nucleotides right upstream of the skipped exon, the first 10 nucleotides right downstream of the skipped exon and the first and the last two nucleotides within the exon. We then calculated the frequency and Z-score of each 5-mer sequence from all 163 sequences in three regulatory regions using methods described in [9]. All 5-mer sequences with Z-score larger than 2.5 were then clustered by sequence similarity and multiply aligned by using CLUSTALW to identify candidate motifs. At a cutoff dissimilarity score of 2.65, 2.7 and 2.7, we obtained 5, 7 and 5 clusters of at least four sequences in each cluster for upstream intron, exon and downstream intron respectively. Finally, we plotted the consensus sequence for each cluster for all three regulatory regions (Fig. S3).

Principal component analysis (PCA)
PCA is a data analysis technique commonly applied for dimension reduction, exploratory analysis and feature selection. PSI values of the 163 cancer-specific AS events were used to form the data vector for PCA. For each cancer type, the PSI vectors across all normal and tumor samples were then combined and used as the input data matrix to perform PCA using the prcomp() function in R. We also conducted PCA by combining the PSI values across all samples from three cancer types. The distribution of normal and cancer samples across the first two components were plotted.

Survival analysis for breast cancer patient
We obtained the overall survival data of breast cancer patients from the UCSC Cancer Browser (727 patients). If a patient deceased (event happened), the "days_to_death" was used as the time variable; if a patient is still living, the time variable is the maximum of "days_ to_last_known_alive" and "days_to_last_followup". The patient samples were split into two groups according

SUPPLEMENTARY METHODS
www.impactjournals.com/oncotarget/ to the top or bottom quantile of PSI values for each of the 163 cancer-specific events. The resulted two patient groups are compared for their probability of survival using a Kaplan-Meier survival plot and the logrank P values are calculated. This process was repeated for every cancerspecific event.

Correlation between gene expression and AS
Correlations between genes and AS events were calculated using two matrices. One matrix consists of the PSI values of 163 cancer-specific AS events across 1319 cancer and normal samples. Another matrix contains the expression level of every gene across 1319 samples. We computed the spearman rank correlation, ρ (rho), between every two vectors from the two matrices using cor.test() in R. Each pair with |ρ| > = 0.4 and p < = 0.005 was considered as a highly correlated event-gene pair. We considered genes that are highly correlated with more than 30 cancer-specific AS events as potential regulators through a direct or indirect regulation. We then used STRING database [4,5] (http://string-db.org/) to extract PPIs between these potential regulators (304 genes), and Cytoscape and MCODE to visualize and cluster the interaction networks.