Ranking novel cancer driving synthetic lethal gene pairs using TCGA data

Synthetic lethality (SL) has emerged as a promising approach to cancer therapy. In contrast to the costly and labour-intensive genome-wide siRNA or CRISPR-based human cell line screening approaches, computational approaches to prioritize potential synthetic lethality pairs for further experimental validation represent an attractive alternative. In this study, we propose an efficient and comprehensive in-silico pipeline to rank novel SL gene pairs by mining vast amounts of accumulated tumor high-throughput sequencing data in The Cancer Genome Atlas (TCGA), coupled with other protein interaction networks and cell line information. Our pipeline integrates three significant features, including mutation coverage in TCGA, driver mutation probability and the quantified cancer network information centrality, into a ranking model for SL gene pair identification, which is presented as the first learning-based method for SL identification. As a result, 107 potential SL gene pairs were obtained from the top 10 results covering 11 cancers. Functional analysis of these genes indicated that several promising pathways were identified, including the DNA repair related Fanconi Anemia pathway and HIF-1 signaling pathway. In addition, 4 SL pairs, mTOR-TP53, VEGFR2-TP53, EGFR-TP53, ATM-PRKCA, were validated using drug sensitivity information in the cancer cell line databases CCLE or NCI60. Interestingly, significant differences in the cell growth of mTOR siRNA or EGFR siRNA knock-down were detected between cancer cells with wild type TP53 and mutant TP53. Our study indicates that the pre-screening of potential SL gene pairs based on the large genomics data repertoire of tumor tissues and cancer cell lines could substantially expedite the identification of synthetic lethal gene pairs for cancer therapy.


INTRODUCTION
Synthetic lethality describes the genetic interaction by which the combination of two separately non-lethal mutations results in lethality. The phenomenon was first described by Calvin Bridges in 1922 [1], who noticed that some combinations of mutations in the model organism Drosophila melanogaster confer lethality. Generally, the ablation of two genes located in parallel pathways (leading to cell survival or a common essential product) is one of the important patterns causing synthetic lethality (SL) [2].
Cancer is fundamentally a genetic disease with numerous gene mutations involved. Some of these genetic mutations serve as biomarkers in cancers. In particular, notable advances have been made in cancer therapy for example, with the discovery of Herceptin to treat breast cancer patients with HER2 amplification, and with Iressa for the treatment of non-small cell lung cancer patients with an EGFR mutation. However, developing drugs that selectively kill cancer cells without harming normal cells remains a big challenge in oncology therapy. Given that genetic mutations underpin differences

Research Paper
between cancer cells and healthy cells, Hartwell [3] was the first to suggest the use of chemical and genetic synthetic lethality screening for cancer therapy. Since then, this approach has attracted great attention from cancer biologists as it provides a promising perspective for oncology medicine discovery [4,5]. For example, targeting the PARP-1 enzyme using Olaparib in ovarian cancer patients carrying a tumor BRCA1/2 mutation achieved milestone success in this area [6]. Ultimately, siRNA and CRISPR screenings are the most reliable methods for detecting SL gene pairs. However, compared to model genetic systems (such as yeast or fruit flies), human cell systems hold greater challenges for genomewide siRNA or CRISPR screening. For this reason, several computational approaches have been proposed to facilitate the systematic detection of SL gene pairs in cancer. Briefly, these methods can be divided into three categories according to their targeted data resources: (i) inferring human ortholog gene pairs from yeast SL genes [7]; (ii) using the robustness features to evaluate the importance of gene pairs in the cancer PPI network [8]; (iii) calculating mutual exclusivity using statistical models from gene mutation/transcriptional expression data [9][10][11][12]. More recently, Livnat et al. [13] proposed DAISY to identify SL gene pairs. This approach combines somatic copy number alteration, siRNA screening as well as cell survival and gene co-expression information. Derived from comprehensive in-house data, this approach achieved a promising performance in data-driving SL gene pair identification. Nevertheless, we comprehensively compared the four available predicted SL data sets from previously developed methods (including DAISY) [8,10,12,13] on SL gene pair prediction. The concordance of predicted SL gene pairs among those different methods is extremely low (see details in Discussions). This inconsistency across different methods may indicate that the in silico SL gene pair identification methods are far from mature. In addition, none of the previous methods was learning-based, that is, SL gene pair identification was based on the screening of certain criteria rather than training and prediction. We noticed that a portion of known SL gene pairs have been accumulated, and the investigation of the characteristics of these SL gene pairs are expected to derive significant features which can quantitatively depict the common mechanisms of the SL. Therefore, in our study, we designed a learning-based pipeline to rank novel SL gene pairs based on the known SL gene pairs, together with other unknown ones. By mining the accumulated TCGA mutation and gene expression data, as well as the gene properties in the protein-protein interaction network, our pipeline can be treated as an integration of the traditional strategies, and ranked a list of potential SL gene pairs. In contrast to the lack of experimental validation in most previous methods, we implemented further siRNA knockdown experiments to evaluate our results.

Brief results of 10 times 5-fold cross validation
We evaluated the ranking performance in 11 cancers through 10 times 5-fold cross validation, the other cancers in TCGA failed due to the limited number of overlapping samples between mutation data and the expression data or limited coverage of positive SL pairs. The brief results were listed in Table 1. Herein we didn't intend to describe the details of each pair, Kidney renal clear cell carcinoma (KIRC) would be picked out as an example for illustration (see details in Supplementary Table S1). Currently, TCGA mutation data contains 417 KIRC patients. In addition, the gene-expression data on 400 KIRC patients is available in the TCGA dataset. At first, 528 genes with mutation rate >= 1% were selected from the gene mutation data. Following the workflow described in methods, 1014 candidate SL gene pairs were generated with the chisquare test p value <=0.05 and mutation exclusivity ≥0.8. Three features (Gene pair mutation coverage, Driver mutation probability, Network information centrality) were subsequently calculated. The same calculation process was used on the 119 positive SL gene pairs covered by KIRC mutation and expression data and the cancer network. During the 10 times 5-fold cross validation procedure, the test set contained 23 or 24 SL pairs. Then the top 25 results were used for NDCG calculation and enrichment evaluation, respectively. alpha optimization is a critical process in data manifolds ranking algorithm, which can directly influence the ranking performance.
As it was shown in Figure 1, NDCG@25 was gradually increased from 0 to 0.9768, as the increase of alpha; while the enrichment p value was decreased. This means that the ranking performance was better at a larger alpha. Finally, the optimized alpha = 0.84 was achieved when NDCG@25 reached the peak value. Then, after all of the positive SL pairs were imported as the training set with the optimized alpha, we generated a ranking list for the 1014 candidate pairs according to their relevance to positive pairs. The same process was implemented on 10 other cancer types. Finally, we generated a SL network in Figure  2, which is comprised of 107 predicted SL gene pairs from the top 10 results in the 11 cancer types. (See all of the ranking results in Supplementary Table S2) Function analysis of the genes in the predicted novel SL pairs Pathway enrichment was utilized to decipher the biological functions of the gene list. Specifically, 73/107 genes were mapped to KEGG pathways. From Table 2, we can see that these genes are mainly enriched in 15 pathways (covering 61.64% of the mapped genes) involved in six biological process categories, namely, replication and repair, signal transduction, the endocrine system, cell growth and death, cellular immunity and development. In particular, the Fanconi Anemia pathway ranked No.1 as a potential pathway for identifying novel anticancer therapies by exploiting synthetic lethal relationships [14]. The most famous example of this is the synthetic lethality relationship between the BRCA1/2 gene in the Fanconi Anemia pathways and PARP [15]. Recently, the first-in-class PARP inhibitor, Olaparib, was approved by the U.S. FDA for use in advanced ovarian cancer patients with BRCA mutations [16]. Based on our findings, several other biological pathways for synthetic lethality exploration were identified. For example, the HIF-1 signalling pathway (which activates the transcription of genes involved in angiogenesis, cell survival, glucose metabolism and invasion), was used as a screening resource in discovering synthetic lethal gene pairs [17]. RAS signalling [18], P53 signalling [19], PI3K-AKT signalling [20], are also widely considered to be promising  pathways for synthetic lethal pair identification, and have previously attracted considerable research interest.

In-vitro drug sensitivity of cell lines in which one of the genes in each synthetic lethality pair is targeted
In the process of validating the SL gene pairs in the cell lines, two strict criteria are applied in the cell response information for each SL gene pair: 1. One of the genes in a SL pair should be the target of a drug in the database. 2. Several cancer cell lines treated by the drug must possess the mutation of the other gene in the SL pair. Finally, only 37 predicted SL gene pairs could be well annotated by the drug sensitivity data in CCLE [21] and NCI60 [22] databases (see Table 3 and Table 4). We compared the drug sensitivity of the two types of cancer cell lines, namely, the cell lines with the other gene mutation and the cell lines without the other gene mutation in the SL pair. We found 4 pairs: mTOR-TP53, VEGFR2-TP53, EGFR-TP53, ATM-PRKCA which showed significantly higher drug sensitivity (targeting one of the genes in the SL pairs) in the cell lines with the other gene mutations (P value <= 0.05), than the cell lines without the other gene mutation in these predicted SL pairs.

Possible molecular mechanisms of the 4 positive pairs
According to Kaelin [23]'s synthetic lethality model, synthetic lethality occurs via 4 different mechanisms: The cellular organizational units may be uniquely redundant and their roles are essential (type A), subunits of an essential multi-protein complex (type B), interconnected components in an essential linear pathway (type C), or they may participate in parallel pathways that are together essential (type D). The 4 pairs were consistent with either type D or type C.

mTOR-TP53
mTOR can integrate nutrient and mitogen signals to activate cell growth (increase cell mass and cell size) and cell division [24,25], whilst one of the most important functions of TP53 is its ability to activate apoptosis [26]. Cell growth and apoptosis may provide parallel functions in cancer pathology. mTOR and TP53 may be considered synthetic lethality targets.

VEGFR2-TP53, EGFR-TP53
EGFR is a hot target for cancer therapy with many currently FDA approved drugs, and can activate at least 4 major downstream signalling cascades including; RAS-RAF-MEK-ERK, PI3 kinase-AKT, PLCgamma-PKC and STATs modules. Those signalling cascades can ultimately lead to a series of cellular events such as cell proliferation, inhibition of apoptosis, angiogenesis, migration, adhesion and invasion [27]. Furthermore, VEGF's can specifically induce blood and lymphatic vessel development and homeostasis [28]. Inhibition of EGFR can block the angiogenesis process. Double knock out VEGFR2 and TP53 may lead to synthetic lethality through angiogenesis and apoptosis.

ATM-PRKCA
ATM [29] is a key regulator of multiple signaling cascades that respond to DNA damage. These responses involve the activation of cell cycle checkpoint factors, DNA repair and apoptosis. PRKCA has long been recognized to participate in activating tumour growth and development across different cancers [30]. In addition, PRKCA activation can result in increased cell motility in several in vivo and in vitro cancer models, the effect of which may be reversed with PRKCA inhibition [31,32].
Hence, ATM and PRKCA knock-out, coupled with loss of function of apoptosis and the cell migration process, may generate synthetic lethality.

Validation through siRNA knock-down in cancer cell lines
Regarding the mutation information in the CancerDR database [33], three cancer cell lines were selected (Table 5). MCF-7 is a breast cancer cell line, carrying wild type TP53 and PRKCA. FaDu is a human   Figure 3.

siRNA knock-down validation on TP53-mTOR
According to the results of mRNA expression detection, there was no significant difference in the mTOR knock-down rate between MCF7 and FaDu (average mTOR knock-down rate: MCF7 62.26%; FaDu 60.67%). As shown in Figure 3A, the cell growth inhibitory effect of mTOR knock-down was observed in both MCF7 and FaDu. 96 hours after mTOR siRNA knock-down, the relative cell growth of FaDu slightly decreased from 1.1596 to 0.8178, while the relative cell growth of MCF7 dramatically decreased from 0.9509 to 0.4581. 24 hours after mTOR knock-down, the relative cell growth of MCF7 was always significantly lower than FaDu with a t-test P value of less than 0.01. The lowest inhibition of cell growth (54.19%) was achieved after 96 hours of mTOR knockdown in the MCF7 cells. This may indicate that wild type TP53 in the MCF7 line can possibly strongly enhance the cell growth inhibition effects of mTOR knock-down, compared to the mutant TP53 in FaDu.

siRNA knock-down validation on TP53-EGFR
Also, no significant difference in the EGFR siRNA knock-down rate was detected between MCF7 and FaDu. The optimal knock-down rates were 82.81% and 88.87% in FaDu and MCF7, respectively. Figure 3B displays the relative cell growth of MCF7 and FaDu cells after EGFR siRNA knock-down within 96 hours. No inhibition of cell growth using EGFR knock-down was observed in FaDu cells. The average relative cell growth was maintained at around 0.9325 ~ 1.0300 upon EGFR knock-down, whilst the relative cell growth of MCF7 strongly decreased from 0.9595 to 0.6884 following EGFR knock-down. 24 hours after EGFR knock-down, the relative cell growth of the MCF7 line was always significantly lower than the FaDu line with a t-test P value of less than 0.01. The wild type TP53 in MCF7 cells with EGFR knock-down could lead to cell growth inhibition.    Figure 3C shows the relative cell growth of MCF7 and SW48 cells with ATM siRNA knockdown. The relative cell growth of MCF cells decreased slightly from 1.030 to 0.9208. Considering the high ATM siRNA knock-down efficiency, this suggests that ATM knock-down has very limited inhibition effects on MCF cell growth. For SW48 cells, the relative cell growth decreased from 0.9936 to 0.8495. 96 hours after ATM knock-down, the relative cell growth of SW48 was significantly lower than MCF7 (0.8495 ± 0.0209 vs. 0.9208 ±0.0636, t test P value = 0.01296). This may partly indicate that the mutant PRKCA in SW48 cells can enhance the inhibition of cell growth on ATM knock-down, compared to the wild type PRKCA in MCF7. The slight inhibition observed on SW48 cell growth may be caused by the low ATM siRNA knockdown rate in the cell. It may be a novel SL gene pair with further rigorous validation.

DISCUSSION
Regarding the predicted SL gene pairs: mTOR-TP53 and EGFR-TP53, we found that mTOR knock-down or EGFR knock-down could cause much stronger cell growth inhibition in cell lines with wild type TP53, compared to mutant TP53. It seems that the relationships between mTOR-TP53 and EGFR-TP53 are exactly opposed to the concept of synthetic lethality. Indeed, the multifunctional nature of mutant TP53 needs to be better understood. A series of earlier studies [34][35][36][37] had suggested that mutant TP53 not only represents the equivalent of wild type TP53 functional loss, but also acquires new functions in driving cell migration, invasion and metastasis. The significant differences in cell growth between cancer cell lines with wild type TP53 and mutant TP53 could partly suggest that there is a special relationship between TP53-mTOR and TP53-EGFR. Notably, TP53 may be a promising biomarker in the development of cancer drugs targeting the mTOR and EGFR pathways in precision medicine. siRNA knock-down is limited by both the expression level of the gene in the cell and the knockdown efficiency of the siRNA. In our study, we failed to validate VEGFR2-TP53 due to the low expression levels of VEGFR2 in the FaDu cell line. The low knock-down rate (45%) of ATM may weaken the inhibitory effects on cell growth. In response to these issues, the latest CRISPR technology [38,39], which can provide considerable gene editing power, may provide a more reliable approach to further validate SL gene pairs.
Detecting SL gene pairs in humans is a challenging problem due to the highly evolved, complex and redundant signalling pathways within human cells. The influence of a loss of function caused by gene mutation can often be complemented by parallel pathway signalling. Various computational methods can provide potential SL gene pairs from different perspectives, such as the correlation of gene expression with mutation, robustness in the cancer network or gene co-expression in related biological processes. In this study, we compared the 107 predicted SL pairs with the results of four previous methods (see details in Supplementary Table S3). As shown in Figure 4, 12.15% (13 pairs) of our predicted results overlapped with Wang's [10] or Kranthi's [8] prediction. Importantly, TP53-mTOR and EGFR-TP53 pairs validated by drug sensitivity data were included in the overlapping pairs. This may suggest that overlapping predictions from different methods may provide more reliable results. Interestingly, we also found that no overlap occurred between Livnat's [13] predictions or any of the other four methods. The original input data may be one of the important factors in influencing the final predictions. Kranthi's method [8] started with the human protein-protein interaction database HPRD [40] as well as CancerGenes [22]. Wang's prediction [10] was based on the profiling data of glioblastoma multiforme from TCGA as well as the p53 mutation information from the Trust Sanger Institute. Srihari [12] used the copy-number and gene-expression profiling data of four cancers (breast, prostate, ovarian and uterine) in TCGA as input data for their method. The different features of these input data across the methods may generate bias in SL gene pair predictions. Since there was no priori knowledge of cancer targets in the NCI-60 database, CancerGenes or Metacore were used to filter the input data in Livnat's model [13], the potential SL gene pairs from the cancer lines may lead to distinct prediction results from others. In addition, due to the low concordance of results between different methods, further efforts to explore such complex SL relationships in a human system may be required.

CONCLUSIONS
In this study, we proposed a semi-supervised ranking pipeline to rank novel SL gene pairs based on the vast amounts of accumulated TCGA data. 107 novel potential SL gene pairs were predicted from the top 10 results covering 11 cancers. In particular, 4 SL pairs: mTOR-TP53, VEGFR2-TP53, EGFR-TP53, ATM-PRKCA, could be validated using drug sensitivity information in the cancer cell line databases CCLE or NCI60. Furthermore, the results of siRNA knock-down experiments indicated that significant differences in the cell growth of mTOR or EGFR siRNA knock-down were detected between the cancer cells with wild type TP53 and mutant TP53. The TP53 mutation may serve as a biomarker for cancer therapy in drugs targeting mTOR or EGFR. More promisingly, a recent study [41] has proposed P53 as a biomarker for predicting the progression free survival (PFS) of pancreatic cancer patients being treated with erlotinib (EGFR inhibitor). Taken together, these data underscore the potential of investigating the role of P53 as a predictive biomarker in other cancer types.

SL gene pair prediction pipeline
In this study, we designed a semi-supervised learning model [42] to rank the similarities between positive SL gene pairs and candidate SL gene pairs, mainly using 3 defined features namely, gene pair mutation coverage, driver mutation probability and the quantified network information centrality. More specifically, we used three features to describe both the known SL gene pairs and candidate SL gene pairs. Then the semi-supervised method could rank the candidate SL gene pairs according to the similarity of these features with the known SL gene pairs. Herein, gene pair mutation coverage was defined as the percentage of samples containing at least one gene mutation in the pair. Furthermore, in order to get more reliable results from TCGA mutation data, the mutations of genes in candidate SL pairs should be covered by a certain number of samples. Driver mutations play vital roles in cancer development.
Regarding the cancer specific SL pairs, we hypothesised that the mutation of genes playing an important role in cancer progression are more likely to be driver mutations. Last, the network information centrality helps to identify the potential nodes, which are crucial for the proper functioning of the system. Since simultaneously mutating two genes in a SL gene pair could dramatically influence the cellular process and cause cell death, network information centrality was used to calculate the influence of knocking-out a node pair on system stability. This approach inherently mimics the synthetic lethality mechanism well.
The brief workflow of the SL prediction pipeline is shown in Figure 5. In the first step, cancer biomarkers were collected from COSMIC [43] and MetaCore [44], which were used as a filter to select raw cancer related SL pairs. Next, the positive SL pairs were generated from yeast SL pairs, followed by homolog gene transformation, cancer biomarker filtering as well as the application of evidence in human cell lines obtained from literature mining. The candidate genes were selected from TCGA mutation data. The raw candidate SL pairs were then composed based on a candidate gene and a gene within a cancer network. Then, a Chi-square test (implemented by chi2_contingency in python package Scipy) was used to evaluate whether the mutations of the two genes is an independent event in each raw candidate SL pair. In addition, the mutation exclusivity was also calculated, which was defined as the percentage of samples carrying one of the mutant genes in the SL gene pair [9]. Only those independent gene mutations with high mutation exclusivity were selected as candidate SL pairs for further calculation. Subsequently, three features of both candidate SL pairs and positive SL pairs were calculated and normalized before being exported into a learning model. Finally, the novel SL pairs were detected with an optimized parameter which was obtained from 10 times 5-fold cross validation.

TCGA mutation and expression data processing
We downloaded TCGA mutation and expression profiling data from the UCSC Cancer Genomics Browser (https://genome-cancer.ucsc.edu), which provides wellannotated and interactive visualizations of TCGA genomic, phenotypic, and clinical data [45]. We then obtained two matrices. Each row of the matrices represented a gene, and each column indicated a sample. The values in the cells represented the expression and mutation status in the gene expression matrix and the gene mutation matrix, respectively. Finally, data from 11 cancers, containing both the gene expression matrix and the corresponding gene mutation matrix, were used in our study.

Positive synthetic lethality gene pairs
The collective data on yeast SL (synthetic lethal) genes based on high throughput genetic screening is available at BioGRID [46]. However, no curated database of human SL gene pairs has been established yet. In this study, BioGRID was used as the primary resource to retrieve human cancer related SL gene pairs. The phylogenetic inference from yeast to human genes was obtained from the Ensemble database (http://useast. ensembl.org/). Then, homolog human SL pairs were filtered by cancer biomarkers in MetaCore (https://portal. genego.com/) and driver genes in COSMIC [43]. Only homolog human SL gene pairs with both of the genes covered by cancer biomarkers or driver genes were kept for downstream analysis. In order to reduce the false positive rate as much as possible, for each homolog human SL gene pair, we checked the evidence available in the PubMed literature. Finally, 399 positive SL pairs were identified with the evidence of synthetic lethality in human cell lines or animal models in the literature (see Supplementary Table S4).

Cancer network
307,066 protein-protein interactions were downloaded from HPRD [40]. Then, we used cancer biomarkers from MetaCore and COSMIC [43] to filter them. In details, we searched the keywords 'cancer, tumor, Figure 5: Workflow of the SL prediction pipeline. The three types of original data (gene mutation and expression, cancer biomarker, SL-pairs in yeast) were downloaded from TCGA, COSMIC and MetaCore as well as BIOGRID, respectively. The cancer network was built from the interaction of cancer biomarkers in the protein-protein interaction database HPRD. Each raw candidate SL pair was composed of a highly mutated gene and another gene in the cancer network. Then a chi-square test on the mutation exclusivity with p value <= 0.05 was utilized to generate the candidate SL pairs, while the positive SL pairs were derived from the yeasts' SL pairs followed by homology transformation, cancer biomarker filtration and literature evidence identification on human cell lines. For each pair in of candidate SL and positive SL data, three features were generated for them. The first feature was calculated from the mutation coverage of each SL gene pair in the TCGA mutation data. The driver mutation probability was calculated by R package DriverNet. The third feature was defined to evaluate the influence on stability of the cancer network, after removing the two genes from an SL pair. Then normalized features of each SL pair were imported into a manifolds ranking model to generate a ranking list of potential SL pairs. carcinoma' in MetaCore and retrieved 4,296 cancer related biomarkers. At the same time, we also downloaded the 507 driver mutation genes collected in the Cancer Gene Census from the website of COSMIC. All of these gene mutations in Cancer Gene Census have been proved to causally implicate in cancer. Then, for each proteinprotein interaction, only if both proteins are included in MetaCore cancer biomarkers or Cancer Gene Census in COSMIC, would the protein-protein interaction be kept. Finally, we obtained 11,925 protein-protein interaction pairs, corresponding to 2,869 individual proteins. The cancer network could be built with edge presented by the protein-protein interaction, as well as the node displayed by a protein.

Candidate SL pairs generation
We calculated the mutation rate of each gene among the samples in the TCGA mutation data. Herein, 1% was utilized as the cut-off threshold to select the candidate genes. Each raw candidate SL gene pair was generated by selecting a candidate gene as well as the other gene from the cancer network. Subsequently, we tested whether gene A mutation and gene B mutation are independent events based on the mutation data. In detail, the null hypothesis is that gene A mutation and gene B mutation are independent of each other. A Chisquare test was implemented on a 2×2 contingency table (see Table 6). M represents the number of samples carrying both gene A and gene B mutations; N represents the number of samples carrying the gene A mutation without the gene B mutation; X represents the number of samples carrying the gene B mutation without the gene A mutation; Y is the number of samples that containing both wild type gene A and wild type gene B.
The raw candidate SL gene pairs with Chi-square test p value <=0.05 means the mutation of gene A and gene B are not independent. Maybe some relationships exist between mutation of gene A and gene B. In addition, the mutation exclusivity of gene A and gene B could be calculated as (X+N)/(M+N+X). The higher mutation exclusivity indicates the gene A and gene B are more likely to be mutually exclusive mutations. Herein, only candidate SL pairs with both Chi-square test P value ≤ 0.05 and mutation exclusivity ≥ 0.8 were selected for the downstream processing.

Features calculation Gene pair mutation coverage
It was defined as the percentage of samples containing at least one gene mutation in the pair. For example, gene A is mutated in samples s1,s3,s6, gene B is mutated in samples s3,s8,s9. n is the total number of samples. The mutation coverage of the pair (gene A, gene B) is 5/n.

Driver mutation probability
Herein, we utilized the R package DriverNet [47] to evaluate the driver mutation probability of genes based on the relationship between mutation and consequent changes in gene expression. The input data of DriverNet comes from two matrices, namely a mutation matrix and its corresponding gene expression matrix. Each column of the two matrices is a sample, whilst each row represents the mutation status or expression level of a gene among the samples. The output of DriverNet is the P value of each gene that will likely be a driver of gene mutation. The smaller P value of the two genes from a SL pair was transformed to a negative log10 (P value) indicating the strength of the driver mutation for the pair.

Network information centrality
If G refers to the cancer network mentioned above, and G' refers to the cancer network after removing gene A and gene B, then the network information centrality of gene A and gene B could be defined as formula I: Where E(G) is the efficiency of the network. It could be calculated in the formula II: Herein, if gene i could reach gene j in cancer network, d ij is the length of the shortest path between the gene i and gene j (calculated by shortest_path_length in python package networkx), otherwise, d ij is equal to D(G) + 1. D(G) represents the diameter of cancer network, which is defined as the largest distance across all of the shortest paths in the cancer network (calculated through diameter in python package networkx). Finally, normalization of the three features was taken to transform the values of each feature between 0-1 in formula III. x is the original value of a feature. x' is the normalized value.

A semi-supervised ranking model
The principle of our ranking model, which is referred to as a manifold ranking algorithm [42,48] can be intuitively explained: the problem is defined in two datasets, a true sample set and an unknown sample set (background); and the goal is to rank the individual members of the unknown sample set according to their relevance to the true samples. This model is well suited to address our problem scenario, which is that we only have few known SL pairs in hand (known positive data samples), and we want to prioritize the largest possible Δ gene pair combinations based on their possibility to be the true SL pairs. In detail, we used three features to describe each SL pair. Then 1-cosine angle distance was calculated to represent the relevance between candidate SL pairs and true SL pairs.
Input: A set of points X = (x 1 ... x q , x q + 1 ... x n ) representing the SL pairs. The first q points are true SL pairs, while, the others are candidate SL pairs. The initial score y was defined as (1…1,0 …0). (The true SL pairs are corresponding to 1, candidate SL pairs are assigned as 0.) Define f 0 = y; α is a parameter of the algorithm.
Output: A ranked list of X, where higher ranked gene pairs are more likely to be SL gene pairs.
1. Define the similarity matrix W ij = 1-cosine(i,j) and W ii = 0. 4. Let f * be the converged function f t ; and rank all the points X in the decreasing order of their f * values.
It has been shown [42] that f * could be calculated as formula IV.

Evaluation test design 10 times 5-fold cross validation
For each case, the positive SL pairs were divided into five segments. Four of them were used as training sets, while the rest of the segments were used for evaluation. Next, positive SL pairs were shuffled 10 times, the overall performance was determined by the average results of these 10 shuffling events.

Ranking performance evaluation
Normalized discounted cumulative gain (NDCG) [49] was originally used to evaluate web search engine algorithms in the field of information retrieval. It can measure the usefulness of a document based on its position in the result list. Here we used NDCG to measure the effectiveness of ranking performance for each case's predicted results (see formula V ). In the top 5 results, ( 1 log (2 1)   1 log (3 1) ) 0.4907 2 2 In addition, the positive enrichment of SL pairs in the top n ranking position are also used to evaluate our prediction performance. Herein, a hypergeometric test is

Comparison of drug sensitivities between two groups of cancer cell lines
The original drug sensitivity data, drug targets as well as the mutation backgrounds of cancer cell lines in CCLE [21], NCI60 [22] were downloaded from broadinstitute.org/ccle/home and discover.nci.nih.gov/ cellminer/, respectively. Regarding a SL gene pair, if one of the genes in the SL pair is targeted by a drug, we compared the drug sensitivities on the cell lines carrying the mutation of the other genes in the pair and the cell lines containing the wild type of the other genes. The lower GI50 or IC50 value means higher drug sensitivity.

Further validation through siRNA knock down on cell lines
In order to get more reliable validation of the predicted SL gene pairs, we conducted siRNA knockdown experiments on cancer cell lines. The influence on cell growth of different genetic background cell lines would indicate SL relationships. For example, regarding a SL gene pair gene agene b, two cancer cell lines were selected. The first cell line carried mutant gene a, while the wild type gene a was carried in the other cell line. Then, the siRNA of gene b was transfected into the two cell lines. We recorded the cell growth of two cell lines at the time points of 0h, 24h, 48h, 72h and 96h on two different treatments: placebo, siRNA-knockdown of gene b, respectively. Herein, we did 8 parallel experiments at each time point. The relative cell growth was calculated through the formula VII.
= VII Relative cell growth Growth of cell with siRNA treatment Growth of cell with placebo treatment ( )