Predicting cancer-relevant proteins using an improved molecular similarity ensemble approach

In this study, we proposed an improved algorithm for identifying proteins relevant to cancer. The algorithm was named the two-layer molecular similarity ensemble approach (TL-SEA). We applied TL-SEA to analyze the correlation between anticancer compounds (against cell lines K562, MCF7 and A549) and active compounds against separate target proteins listed in BindingDB. Several associations between cancer types and related proteins were revealed using this chemoinformatics approach. An analysis of the literature showed that 26 of 35 predicted proteins were correlated with cancer cell proliferation, apoptosis or differentiation. Additionally, interactions between proteins in BindingDB and anticancer chemicals were predicted. We discuss the roles of the most important predicted proteins in cancer biology and conclude that TL-SEA could be a useful tool for inferring novel proteins involved in cancer and revealing underlying molecular mechanisms.


INTRODUCTION
In 1990, approximately 6 million people died of cancer globally, while in 2010 the number had risen to about 8 million [1]. Moreover, the incidence of cancer is increasing rapidly [2]. Current cancer treatments include surgical therapy, radiation therapy, and chemotherapy; however, these treatments also damage normal tissues. Many researchers are focused on developing targeted molecular therapies that specifically damage tumor tissues with little damage to normal tissue [3]. Comprehensive understanding of the molecular mechanisms underlying cancer is necessary for designing efficacious drugs.
The molecular mechanisms of cancer development are very complicated, and include sustained proliferation, malfunction of cell death programs, induction of angiogenesis, metastasis of cancer cells, changes in cellular energy metabolism, and evasion of immune destruction [4]. This complexity, along with a lack of reliable methods for the large-scale identification of cancer-related proteins, causes cancer progression to remain a puzzle and greatly hampers the development of effective therapies. Traditional experimental methods are not suitable for the large-scale identification of cancer-related proteins because they are both time-consuming and inefficient. Computational methods rely on systematic comparisons between the genomes of cancer cells and normal cells, using correlation analysis to search for mutated genes associated with tumorigenesis [5]. However, many of these methods only utilize DNA sequence information [6][7][8], and overlook protein information that is more closely related to biological pathways and phenotype.
In 2015, Chien-Hung Huang et al. developed a prediction model of cancer proteins based on proteomic data [9]. They demonstrated the accuracy of this method on two independent datasets of lung cancer and lung cancer microarray. Their prediction achieved hit ratios of 89.4% and 72.8%, respectively. Two other groups predicted cancer-related proteins as new drug targets for cancer treatment with network analyses [10,11]. These predictions were mainly based on protein interaction networks, protein sequences, or motif information. Importantly, no cancer-related small-molecule activity data were considered in these studies. Taking into consideration the limitations of these approaches, there is a great demand for new methods to infer key proteins in cancer. www.impactjournals.com/oncotarget With the open-source drug discovery campaign, massive bioactivity data have been accumulated using assays ranging from phenotypes to enzymes. For example, the National Cancer Institute (NCI) has launched many programs for screening compounds against cancer cell lines and has acquired a large list of active anticancer chemicals [12]. On the other hand, BindingDB catalogues an enormous number of active compounds that act on various proteins [13].
Chemoinformatics approaches can be applied to compare anticancer compounds from the NCI database with bioactive compounds against different proteins from BindingDB, and thereby infer proteins that are involved in the genesis and proliferation of cancer cells. The underlying assumption, known as the similarity principle or general neighborhood behavior, is that structurally similar molecules are likely to possess similar biological activities [14][15][16][17][18]. If anticancer chemical a is very similar to chemical b, which is active against a protein, it is possible that a also changes the state of this protein and that the anticancer effects of a depend on the same protein. The more similar molecule pairs are found, the stronger the support for this assumption. Therefore, the association between these proteins and cancer development can be inferred through a systematic statistical analysis of massive activity data on the phenotype and the proteins.
However, the similarity between the ligand set of a protein and that of a phenotype is very weak, and protein-phenotype relationships can be inferred only with very careful statistical analysis. In 2007, Michael J Keiser et al. proposed a Similarity Ensemble Approach (SEA) to infer relationships among receptors [19]. The SEA algorithm can be applied to two compound sets at the same level [20,21], but is not suitable for systems at two different levels, such as active compounds against cancer cell lines and against proteins. Therefore, in this work, we proposed a modified association algorithm, named two-layer SEA (TL-SEA), and applied the algorithm to the analysis of the activity data from the NCI database and BindingDB. Three cell lines, K562, MCF7 and A549, were used as example systems. The K562 cell line was derived from the blastic phase of chronic myelogenous leukemia. It also has some characteristics of chronic leukemia and acute leukemia [22]. MCF7 and A549 were derived from human breast cancer and human lung cancer, respectively. Using TL-SEA, we attempted to infer which proteins play roles in the genesis and proliferation of these cancer cells.

Prediction of cancer-related proteins
Important cancer-related proteins were successfully predicted with our algorithm (TL-SEA) based on the existing active compounds against the three cancer cell lines and BindingDB proteins. Proteins with a smaller association value (AS score) were more likely to affect the development of cancer. In this study, we selected the proteins with AS scores smaller than 0.03 for further analysis, resulting in a total of 35 cancer-related proteins (31, 35, and 28 proteins for the K562, MCF7 and A549 cell lines, respectively; Table 1). There were 25 common proteins in the three systems. Most of the predicted proteins were human proteins or their close homologs, except luciferin 4-monooxygenase of firefly. According to previously published literature, 26 of the 35 proteins are relevant to the proliferation, apoptosis, or differentiation of cancer cells. The references are listed in the last column of Table 1.
Among the 26 proteins, melatonin receptor type 1B occurs twice. One of them is from chicken (ranked first in all the cell lines) and the other one is from human (ranked 22nd, 24th, and 22nd in the 3 cell lines, respectively). BLASTP showed that these two proteins were very similar, with E-value = 7e-150, sequence identity = 71% and sequence coverage = 100%. Melatonin receptors play an important role in cancer development [23][24][25][26][27], and have anticancer functions through binding with melatonin [26]. Melatonin is involved in redox processes of cells, augments natural killer cell activity, stimulates cytokine production (IL-2 and IL-6), and protects hematopoietic precursors from the toxic effect of chemotherapy and radiotherapy [27]. Studies revealed that breast cancer cell differentiation is regulated by the MT-1 signaling pathway [28,29], while the anticancer function of melatonin is mediated by the MT-1 receptor and G protein-coupled signal transduction in liver cancer cells [30]. Clinical data also showed that high MT-1 expression is associated with cancer resistance in people with lower melatonin levels [31]. Melatonin may also protect against gastric cancer in mice by up-regulation of membrane receptor MT-1 and MT-2 expression [32].
The second and the third proteins in the predicted list are tubulin beta-1 chain and tissue factor (TF). Tubulin beta-1 chain is the primary component of microtubules.
Microtubules play a key role in the process of mitosis [33], which is necessary for cancer cell proliferation. Thus, disruption of cell mitosis can block the increase in cancer cells. As early as 2004, there was research on microtubules as targets for anticancer drugs [34]. Similarly, TF expression on the cell surface accelerates tumor progression [35,36]. TF accelerates malignant tumor growth, invasion, and metastasis mainly by promoting vascular endothelial growth factor (VEGF) release to regulate tumor cell angiogenesis [37]. Interestingly, the VEGF receptor 2 is ranked 23rd in the predicted protein list. Reduced TF expression can decrease cancer cell growth, and selective reduction of TF expression with mRNAi in colorectal cancer cells reduced tumor growth in mice [38]. These results have been replicated in vitro [39], and higher TF expression was found in primary carcinoma of the rectum, breast cancer and pancreatic cancer. Thus, TF expression is related to the invasiveness of cancer [40], and multiple experimental models have demonstrated that increasing TF expression promotes tumor growth [41].
For those proteins without direct evidence regarding their involvement in cancer development, there is a good chance that they also play important roles in cancer-related cellular pathways. Of course, this hypothesis remains to be confirmed with further studies. We analyzed protein Q8CA95 (cAMP and cAMP-inhibited cGMP 3′,5′-cyclic phosphodiesterase 10A), which is ranked 5th among all the predicted proteins and first among the proteins without direct proof in the literature. The protein hydrolyzes both cAMP and cGMP, regulating the intracellular concentration of cyclic nucleotides in the striatum [42]. As a target for signal transduction regulation, it has not been reported to have anticancer effects; however, cAMP mediates the conversion of cancer cells into healthy cells [43,44].
In our algorithm, significance was calculated twice: P_Z for the anticancer compound-protein association (derived from the initial score, I) and P_O for the cell line-protein association (derived from the original score, O). Because I was summed over different lengths (m, the number of active compounds against a specific protein), it was translated into a comparable Z score with formula 4. The constants (a, b, k) were obtained by fitting the initial score and the number of active compounds with formulae 2 and 3 (Figure 1). The results showed a linear correlation between the initial score I and the compound number m, indicating the feasibility of the SEA algorithm in such a system. As mentioned in the methods section, it is unnecessary to standardize the original score, O.

Analysis of chemical-protein matrices
In the process of inferring cancer-related proteins, three association matrices (P_Z) between anticancer compounds and BindingDB proteins emerged. The matrices contained the significance scores (P_Z) between the active chemicals targeting the cancer cell lines and the BindingDB proteins. P_Z can be used to deduce whether a compound can interact with a protein: a smaller P_Z indicates a higher possibility of interaction. By retaining the matrix elements with P_Z less than 0.0001, the matrices were translated into three networks as shown in Figure 2 and Supplementary Figure S1. The nodes in the networks represent proteins or chemicals, while the edges denote their association. The three networks are presented with the same layout. The position of the nodes was optimized with forces according to the reciprocal of P_Z. Nodes that were missing in a cell line were faded to gray, and the other nodes were highlighted with distinguishable colors. In this way, the differences between the cell lines can be analyzed visually. For example, the major prion protein (node P23097, highlighted with the red rectangle in Figure 2A) was absent from the K562 network but present in the networks of the other two cell lines. Experimental studies showed that the overexpression of P23097 failed to protect against DNA fragmentation in a leukemia cell line but converted TNF-sensitive cells into TNF-resistant cells in the MCF7 breast cancer cell line [45][46]. Moreover, the expression of major prion protein was associated with increased lung colonization [47]. These results are consistent with our predictions.
Some of the predicted interactions in Figure 2 have been reported previously. For example, the P_Z score between anticancer compound thapsia villosa (NCI_ID 299934) and sarcoplasmic/endoplasmic reticulum calcium ATPase 1 (Uniprot_ID P04191, SERCA 1) was 1.500 × 10^-7, ranking first in the MCF7 dataset (see Supplementary information Table S2). This compound is indeed a potential inhibitor of the SERCA pump [48]. In addition, the P_Z score between compound GW805758X (NCI_ID 756364) and protein O14920 (inhibitor of nuclear factor kappa-B kinase subunit beta) was 1.015 × 10^-4. This interaction can be found by searching ChEMBL (http://ebi.ac.uk/chembl, ChEMBL Assay ID: CHEMBL2007663).
We analyzed the proteins linked to more than 15 compounds in the three networks (Table 2). Protein Q61614 (Endothelin-1 receptor) was linked to 59 compounds in the MCF7 dataset, ranking first among all the proteins. The AS score (P_O) between this protein and MCF7 is 0.0033, indicating that this protein is a potential target in MCF7 cells. Kusuhara M et al. found that breast cancer cells can release Endothelin-1 [49]. The binding of Endothelin-1 (ET-1) to the ET-1 receptor can stimulate growth of breast cancer cells by autocrine and paracrine signaling, and increased expression of ET-1, Endothelin A receptor (ETAR), and Endothelin B receptor (ETBR) in breast cancer patients lowers disease-free survival time and overall survival [50]. Among the 27 proteins in Table 2, only 4 proteins (Q9H4B7, P41586, Q61614 and O02747) were associated with the cell lines with an AS score (P_O) less than 0.03. This result was caused by the different thresholds used in the original score calculation and in Figure 2, which illustrates the protein-compound interactions as a visual network. Only connections with P_Z < 0.0001 were retained to simplify the networks. For the P_O calculation, more information was needed for association analysis; thus, we used a threshold of P_Z < 0.01. If the threshold of the connections was changed to P_Z < 0.001, all the 11 proteins with more than 70 connections were associated with the cell lines (P_O < 0.03). The results demonstrate the capability of our algorithm for deep data mining. That is, the association score was deduced from large numbers of weak similarities between the active compounds of the cell lines and the proteins instead of fewer but stronger similarities.

Table 1 (fragment; last recovered row): P48974, Vasopressin V1b receptor, Rat, 0.0299 (31). Table 1 notes: The list was sorted by K562 significance (AS score), and then by MCF7. References for the proteins related to proliferation, apoptosis, or differentiation of cancer cells are listed in the last column. * Uniprot ID of the proteins [77]. - AS score larger than 0.03.
We also analyzed two sub-networks (Figure 2C', 2C'') extracted from the interaction network of MCF7 active compounds and the proteins (Figure 2C). Figure 2C' shows the predicted interactions between an MCF7 active compound (NCI_ID 727680) and 7 proteins, while Figure 2C'' shows the interactions between a BindingDB protein (Uniprot_ID P05227) and 17 anticancer compounds for the MCF7 cell line. Detailed information about the sub-networks, including proteins, compounds and the P_Z between them, can be found in Supplementary Table S1. Full information on the P_Z < 0.0001 data in MCF7 can be found in Table S2. We randomly selected 4 active compounds (NCI_ID 353, 8591, 695939, 743862) from the 17 anticancer compounds in Figure 2C'' and calculated the similarity between these NCI compounds and the active compounds against the BindingDB protein (Uniprot_ID P05227). Most of the similarities were around 0.2, with very few high similarity scores (Supplementary Table S3). This result is consistent with Keiser's research, which found that for most ligand pairs the similarity was low, ranging from 0.2 to 0.3 [19]. This result also indicates the necessity of using strict statistical algorithms in such systems and confirms our previous deductions.

DISCUSSION
Prior methods for large-scale identification of cancer-related genes or proteins were primarily based on the discrepancies between the genomes of cancer cells and normal cells, and rarely took into consideration ligand-protein interactions. Our research employed the activity data of the chemicals targeting proteins or cancer cells in existing databases, enabling us to analyze the mechanisms underlying tumorigenesis from the perspective of chemistry. A chemoinformatics approach (TL-SEA) was proposed to compare anticancer compounds with active chemicals binding to a particular protein target. By this means, possible associations between cancer cell lines and proteins were predicted if the two groups of chemicals showed similarity. Literature searches showed that most of the high-ranked proteins were related to proliferation, apoptosis, or differentiation of cancer cells.
Additionally, a chemical-protein interaction matrix was produced, which can help explain the mechanism of the anticancer drugs and also boost the repurposing of anticancer drugs to other fields. In fact, the drugs active against the predicted proteins are also potential active compounds against cancer. This can be quantitatively measured using the TL-SEA algorithm in reverse, to calculate the association between active compounds against the predicted protein targets and NCI cell lines. The above results confirmed the effectiveness of our algorithm. Of course, further laboratory experiments are needed to validate the predicted associations/interactions. Pathway analysis and systems biology simulation can also be performed to interpret the roles of the proteins in tumorigenesis.
The applications of this chemoinformatics approach can be expanded to elucidate the molecular mechanisms of other diseases. For instance, by comparing the active compounds against a bacterium with those targeting a variety of proteins, it is possible to deduce important proteins for the growth of this bacterium. The primary mission in the post-genomic era is to illuminate the relationships among genes, proteins, diseases, pathways and chemicals at an -omics level. It is impossible to finish this work using traditional methods. Currently, a large amount of small-molecule activity data is becoming available to the public through databases such as ChEMBL and PubChem [51,52]. These datasets include the results of high-throughput screening at the molecular level and all kinds of phenotypic activity. The relationship between the phenotypes (diseases) and proteins can be inferred using the TL-SEA algorithm proposed here. Compared with traditional approaches, our approach is economically feasible and fast, and therefore suitable for rapid preliminary screening before further validation. Compared with conventional genome correlation analysis, TL-SEA utilizes the activity data directly, reflecting protein function in the organism. Thus, the results of TL-SEA can be interpreted more easily with pathway analysis. However, the limitations of this approach should not be ignored. The method relies on the activity data of small molecules and only applies to the diseases or proteins whose active molecules are known.

Table 2 (fragment; the column headers and the protein ID of the first row were not recovered):
(ID missing)   -    44   35
O02667         -    31   22
P23526         -    26   23
P05227         -    19   22
P15328         -    18   20
P28647         -    18   -
O00142         -    15   -
P41148         -    15   -
Q62645         -    15   17
P48544         -    -    20
P48549         -    -    19
Q01782         -    -    15
Table 2 notes: - with less than 15 linked compounds. * predicted as a cancer related protein.

General study protocol
The overall protocol of this study is illustrated in Figure 3. Activity data and structures of small molecules against cancer cell lines and against a variety of proteins were collected from the NCI database and BindingDB. Physicochemical properties and activity thresholds were used to filter out inactive or non-druglike compounds. Then the similarity matrix (target similarity matrix) formed by these two groups of active compounds was calculated with the ECFP_4 molecular fingerprint and the Tanimoto coefficient. At the same time, large numbers of chemicals satisfying the aforementioned physicochemical properties were randomly sampled from BindingDB. The similarity matrix (random similarity matrix) formed by these random BindingDB chemicals and the NCI active compounds of the corresponding cell line was calculated in the same manner. Finally, the TL-SEA algorithm was employed to compare the target similarity matrix with the random similarity matrix, and thereby give the association score (AS) between each protein and the cell line. The AS score was used to infer whether a given protein plays a role in the growth of cancer cells. Details about the association algorithm are described below.

NCI database
Activity and structural data of cancer cell line inhibitors were downloaded from the NCI website (https://wiki.nci.nih.gov/display/NCIDTPdata/Chemical+Data). The NCI database uses GI50 (growth inhibition of 50%) as an endpoint, which is the drug concentration giving a 50% reduction in cancer cell proliferation. According to the distribution of GI50 values of the cell lines, leukemia cell lines were generally more sensitive than other cell lines. Therefore, the K562 leukemia cell line (September 2012 release) was selected as the study system and its activity data were used for further analysis. We also selected two other cancer cell lines, MCF7 and A549 (September 2014 release), to confirm the stability and effectiveness of our algorithm. The three cancer cell lines were analyzed separately.
Activity data sets of the K562, MCF7, and A549 cell lines contained 47,497; 36,801; and 51,170 entries, respectively. 2D structures of the corresponding compounds were extracted and linked with the activity values. The distribution of the compounds' GI50 values, ranging from micromolar to nanomolar, was analyzed with a cumulative frequency plot (Supplementary Figure S2). 90% of the compounds possessed a pGI50 (the negative base-10 logarithm of the GI50 value) less than 6 (GI50 > 10^-6 mol/L). Therefore, compounds with pGI50 over 6 were defined as active, which included 3658, 3744 and 4646 compounds in the three data sets, respectively. Inactive compounds were discarded.
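The activity cutoff described above can be illustrated with a minimal sketch. It is not the Pipeline Pilot protocol used in this study; the function names, compound names and GI50 values are hypothetical and serve only to show the pGI50 > 6 rule.

```python
import math


def pgi50(gi50_molar: float) -> float:
    """Return pGI50 = -log10(GI50), with GI50 given in mol/L."""
    if gi50_molar <= 0:
        raise ValueError("GI50 must be a positive concentration")
    return -math.log10(gi50_molar)


def is_active(gi50_molar: float, threshold: float = 6.0) -> bool:
    """A compound is treated as active when its pGI50 exceeds the threshold."""
    return pgi50(gi50_molar) > threshold


# Hypothetical GI50 values (mol/L), for illustration only.
example = {"cmpd_a": 5e-8, "cmpd_b": 2e-6, "cmpd_c": 3e-7}
actives = [name for name, gi50 in example.items() if is_active(gi50)]
```

In this toy set, the two sub-micromolar compounds are retained and the micromolar one is discarded.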
The distributions of molecular weight and AlogP (a calculated octanol-water partition coefficient) of the anticancer active compounds were compared and analyzed (Supplementary Figure S3). AlogP thresholds were set to [2,7]

BindingDB database
Activity data against proteins were obtained from BindingDB (http://bindingdb.org/bind/index.jsp, accessed on 2 March 2014). The binding data and 2D structures of small molecules were collected. Four types of endpoints are used in BindingDB, i.e. Ki, IC50, Kd and EC50. Compounds were defined as active when any of these values was smaller than 10^-6 mol/L. To ensure the consistency of physical and chemical properties, BindingDB compounds were also filtered with the property criteria discussed above. The final BindingDB active ligand set contained 505,600 compounds.
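The "any endpoint below 10^-6 mol/L" rule can be sketched as follows. This is an illustrative assumption about how a record might be represented (a mapping from endpoint name to a value in mol/L, with missing measurements stored as None); it does not reflect the actual BindingDB file layout.

```python
from typing import Mapping, Optional

ENDPOINTS = ("Ki", "IC50", "Kd", "EC50")
ACTIVITY_CUTOFF_MOLAR = 1e-6  # 10^-6 mol/L, as stated in the text


def is_binding_active(record: Mapping[str, Optional[float]]) -> bool:
    """Return True when at least one reported endpoint is below the cutoff."""
    return any(
        record.get(name) is not None and record[name] < ACTIVITY_CUTOFF_MOLAR
        for name in ENDPOINTS
    )
```

A record with Ki = 2e-8 mol/L would count as active, whereas a record whose only reported value is IC50 = 5e-6 mol/L would not.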

Generation of the similarity matrices
A similarity matrix (M) was generated by calculating the similarities between the NCI and BindingDB active compounds. Each column of the matrix corresponds to an NCI active compound, while each row corresponds to a BindingDB active compound. The similarity was calculated with the ECFP_4 molecular fingerprint and the Tanimoto coefficient. ECFP stands for Extended-Connectivity Fingerprints, which are based on the Morgan algorithm [78]. It is a circular topological fingerprint designed for molecular characterization, similarity calculation, and virtual screening. The diameter of a circular atom neighborhood is set to 4. The Tanimoto coefficient (S_t) is one of the most widely used similarity indices and is defined as S_t = C / (A + B - C). Here, A and B are the numbers of fingerprint bits of molecules A and B, and C is the number of bits present in both molecules.
Figure 4: Schematic representation of the TL-SEA algorithm. First, the target protein similarity matrix (M_t) was extracted from the overall NCI-BindingDB similarity matrix (M). Then, the matrix was translated into an initial score vector. Next, the initial score vector was normalized to the Z score vector through random column sampling. Finally, the association score (AS) was calculated based on the Z score vector and another random sampling of random similarity matrices. Here, n is the number of the active compounds of an NCI cell line. S and S' are the similarity values between an NCI compound and a BindingDB compound. I is the sum of the similarity values over 0.15 in the corresponding column. Refer to the text for a detailed description.
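The Tanimoto coefficient used for the similarity matrices can be sketched as below. The sketch assumes that each fingerprint is represented as the set of its "on" bit positions; the fingerprints in the usage note are made up, and generating real ECFP_4 fingerprints (done with Pipeline Pilot in this study) is outside its scope.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient S_t = C / (A + B - C) for two sets of on-bits."""
    c = len(bits_a & bits_b)                 # bits present in both molecules
    denom = len(bits_a) + len(bits_b) - c    # A + B - C
    return c / denom if denom else 0.0       # two empty fingerprints give 0.0
```

For example, fingerprints {1, 2, 3, 4} and {3, 4, 5} share two bits, giving S_t = 2 / (4 + 3 - 2) = 0.4.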
As background sampling, around 50,000 compounds were randomly selected from BindingDB. The compounds were filtered using physicochemical properties as described above. Similarity matrices of the randomly selected BindingDB compounds against all NCI active compounds of each cell line (M_r) were also calculated. Three random compound sets (similarity matrices) were prepared with different property thresholds for the three NCI cell lines.

Calculating association scores using the TL-SEA algorithm
The similarity between two unrelated compound sets was usually very weak. Therefore, a sensitive association recognition algorithm with careful statistical inference was required to predict cancer-related proteins. This algorithm needed to be able to find out the implicit association of cancer cells with particular proteins using their active molecules. Based on the SEA algorithm, which was originally proposed by Keiser [19], we proposed an improved two-layer approach (TL-SEA). The protocol of this algorithm is described as follows (Figure 4).

Extract the row vectors corresponding to a BindingDB protein active set
To analyze the association between the NCI cell line and a certain protein, the row similarity vectors corresponding to the protein's active compounds were extracted from M. The extracted row vectors composed a target sub-matrix (M_t). Proteins with fewer than 10 active compounds (vectors) were discarded.
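The extraction step can be sketched as follows, assuming that M is held as a list of rows and that a parallel list gives the BindingDB compound identifier of each row. The function and argument names are hypothetical.

```python
from typing import List, Optional, Sequence

MIN_ACTIVES = 10  # proteins with fewer active compounds are discarded


def target_submatrix(
    m: Sequence[Sequence[float]],
    row_ids: Sequence[str],
    protein_ligands: Sequence[str],
) -> Optional[List[List[float]]]:
    """Collect the rows of M that belong to one protein's active compounds.

    Returns None when the protein has fewer than MIN_ACTIVES rows in M.
    """
    wanted = set(protein_ligands)
    rows = [list(row) for rid, row in zip(row_ids, m) if rid in wanted]
    return rows if len(rows) >= MIN_ACTIVES else None
```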

Translate the similarity matrix M t into an initial score (I) vector by adding up all the similarity values over a threshold in each column
The frequency histograms of the random similarity values (in M_r) were analyzed and are shown in Supplementary Figure S4. By analyzing the distribution of random similarity values in the random similarity matrices, we chose 0.15 as the threshold and used it to filter out weak similarity noise.
Then, by summing up all the similarity values over 0.15 in each column (NCI active compound i), the similarity matrix (M_t) was translated into a row vector. Each element of the vector was defined as the initial score (I) between the protein and a particular anticancer compound:

$$I_i = \sum_{j:\, S_{ji} > 0.15} S_{ji}$$

Here, S_{ji} is the similarity between the jth active compound against the protein and the ith anticancer compound, and the summation was made over all the similarities between the active compounds against the protein (j) and the anticancer compound.
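A minimal sketch of the initial score calculation is given below, with M_t stored as a list of rows (BindingDB actives) by columns (NCI actives). The function name and the small example matrix are illustrative assumptions.

```python
from typing import List, Sequence


def initial_scores(
    m_t: Sequence[Sequence[float]], threshold: float = 0.15
) -> List[float]:
    """Sum, for each column, the similarity values strictly above the threshold."""
    if not m_t:
        return []
    n_cols = len(m_t[0])
    return [
        sum(row[i] for row in m_t if row[i] > threshold)
        for i in range(n_cols)
    ]
```

For a three-compound protein with rows [0.10, 0.30], [0.20, 0.15] and [0.50, 0.05], the first column keeps 0.20 and 0.50 and the second keeps only 0.30, since 0.15 does not exceed the threshold.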
Translate the initial score vector into significance score (P_Z)
The initial score reflected the association between each protein and the corresponding anticancer compound. However, the initial score varied when the number of active compounds changed. Proteins with more active compounds generally had higher initial scores than proteins with fewer active compounds. Therefore, the initial score was translated into a comparable standardized score (Z) and significance score (P_Z). This was achieved with row sampling and linear regression.
a) Random sampling was conducted against all columns in the random similarity matrix (M_r). For each randomly selected column, the sampling was performed with 2000 different lengths (m). The lengths were randomly selected from 1 to 10,000. The operation was repeated 5000 times, which produced 10 million random compound sets.
b) As in step 2, for each sampling, an initial score was obtained by summing up the similarity values larger than the threshold (0.15). Then, the distribution of the initial score over different sampling lengths (m) was obtained through analysis of the 10 million similarity sets (2000 × 5000). Here we used the same protocol as Keiser et al. [19]. First, the constants (a, b and k) were obtained by fitting the distribution pattern over sampling length with formulae 2 and 3, in which μ_m and σ_m are the mean and the standard deviation of I in each group with sampling length m, and a, b and k are the constants of the distribution of I over different lengths m. Then, a standardized score (Z) for each initial score was calculated with formula 4, in which â, b̂ and k̂ are the estimated values of a, b and k. The standardized score (Z) was comparable between proteins with different numbers of active compounds. Finally, the standardized score was translated into a significance score (P_Z) with the formula

$$P_Z = \frac{N(Z_r \ge Z_i)}{N(Z_r)}$$

Here, N(Z_r) is the total number of Z scores of the random sampling, which is equal to 10^7. Z_i is the standardized score of the ith column in the target similarity matrix of the protein (M_t). N(Z_r ≥ Z_i) is the number of Z_r values that exceed or equal Z_i. P_Z is the statistical significance of the ith column, which is equal to the probability of getting a Z score no smaller than Z_i from random sampling. Therefore, the association between a cell line active compound and a certain protein can be estimated with P_Z.
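The standardization and significance steps can be illustrated with a short sketch. It assumes, for illustration only, that the fitted background has mean a·m and standard deviation b·m^k, a form in the spirit of the SEA protocol; this functional form, the function names and any constants passed in are assumptions rather than values from this study. The significance score is computed as the empirical tail probability N(Z_r ≥ Z_i) / N(Z_r).

```python
from bisect import bisect_left
from typing import Sequence


def z_score(initial: float, m: int, a_hat: float, b_hat: float, k_hat: float) -> float:
    """Standardize an initial score, assuming mean a*m and s.d. b*m**k."""
    mu = a_hat * m
    sigma = b_hat * m ** k_hat
    return (initial - mu) / sigma


def empirical_p(value: float, sorted_background: Sequence[float]) -> float:
    """Fraction of background values that are >= value.

    The background must be sorted in ascending order; bisect_left returns
    the index of the first element >= value.
    """
    n_ge = len(sorted_background) - bisect_left(sorted_background, value)
    return n_ge / len(sorted_background)
```

Sorting the background once keeps each lookup logarithmic in the number of random Z scores, which matters when the background holds millions of values.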
Translate the standardized score (Z) vector into the association score (P_O)
In the above steps, we compared the active compounds against a specific protein with each of the cell line's active compounds, producing a Z score that rates the relationship of this protein to every anticancer compound. To analyze the association between the protein and the cell line, the Z score vectors were merged into a comparable association score based on random row sampling.
First, a threshold (c) of Z corresponding to an acceptable confidence level (probability P_Z = 0.01) was determined. By summing up the Z values not less than c, the original association score (O) of a certain protein was obtained:

$$O = \sum_{i:\, Z_i \ge c} Z_i$$
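The summation that defines the original score can be sketched as follows; the cutoff c is taken as an input here, and the function name and example values are hypothetical.

```python
from typing import Iterable


def original_score(z_scores: Iterable[float], c: float) -> float:
    """Sum the Z scores that are not less than the cutoff c."""
    return sum(z for z in z_scores if z >= c)
```

With Z scores 0.5, 2.5, 3.0 and 1.9 and c = 2.0, only 2.5 and 3.0 contribute, giving O = 5.5.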
Then, random rows were extracted from the random similarity matrix (M_r). As in the last step, the matrix sampling was performed with 2000 different numbers of rows and repeated 5000 times for each number. This sampling formed 10 million sub-matrices in total. For each matrix, the initial score and Z score of each column were calculated as previously described, and the original score (O_r) of the randomly selected compounds was calculated. There were 10 million O_r values in total.
Because every original score (O) was calculated over the same length, equal to the number of active compounds against the cell line, it was unnecessary to standardize this value. The significance of the original score, here defined as the association score (AS score, P_O), was calculated as the probability of getting a random O_r that is not less than O in random sampling.
Molecular property and similarity calculation and automatic data processing were performed with Pipeline Pilot (version 8.5). The TL-SEA algorithm was implemented with a MATLAB script (7.14, 2012a).
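As a hedged illustration (in Python rather than the MATLAB script used in this study), the AS score can be computed as an empirical tail probability over the random original scores, followed by the AS < 0.03 selection used in the results. The function names and protein labels are hypothetical.

```python
from typing import Dict, List, Sequence


def association_score(o: float, random_scores: Sequence[float]) -> float:
    """AS score P_O: fraction of random original scores O_r that are >= O."""
    n_ge = sum(1 for o_r in random_scores if o_r >= o)
    return n_ge / len(random_scores)


def select_proteins(as_scores: Dict[str, float], cutoff: float = 0.03) -> List[str]:
    """Return proteins with AS score below the cutoff, most significant first."""
    hits = [protein for protein, score in as_scores.items() if score < cutoff]
    return sorted(hits, key=lambda protein: as_scores[protein])
```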

Protein-compound association networks
During the calculation of the AS score (P_O), a P-value (P_Z) matrix between the NCI compounds and the BindingDB proteins was formed. For matrix elements with a very low P_Z value, there was a good chance that the corresponding NCI compound and the protein can bind to each other. To illustrate the relationships between NCI chemicals and BindingDB proteins clearly, we retained the matrix elements with P_Z lower than 0.0001 and converted the matrix into a chemical-protein interaction network. The network was rendered with Cytoscape [79] (version 2.8.2) using the Force-Directed BioLayout. Each edge was weighted by the reciprocal of the negative common logarithm of P_Z.
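The conversion from the P_Z matrix to a weighted edge list can be sketched as below. The dictionary layout, the function name and the floor applied to P_Z values of zero (1e-7, i.e. one over the 10^7 random samples) are assumptions made for this illustration; the edge weight follows the definition above, 1 / (-log10 P_Z).

```python
import math
from typing import Dict, List, Tuple


def build_edges(
    p_z: Dict[Tuple[str, str], float],
    cutoff: float = 1e-4,
    floor: float = 1e-7,  # assumed lower bound so that log10 is defined for P_Z = 0
) -> List[Tuple[str, str, float]]:
    """Keep compound-protein pairs with P_Z < cutoff and attach edge weights."""
    edges = []
    for (compound, protein), p in p_z.items():
        if p < cutoff:
            p = max(p, floor)
            weight = 1.0 / (-math.log10(p))
            edges.append((compound, protein, weight))
    return edges
```

The resulting triples could be written to a delimited text file for import into a network viewer.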

ACKNOWLEDGMENTS AND FUNDING
This work was supported by the Fundamental Research Funds for the Central Universities (2662014PY007) and the National Natural Science Foundation of China (21075046, 21275061). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.