Identifying drug-pathway association pairs based on L1L2,1-integrative penalized matrix decomposition

The traditional methods of drug discovery follow the “one drug-one target” approach, which ignores the cellular and physiological environment of the action mechanism of drugs. However, pathway-based drug discovery methods can overcome this limitation. This kind of method, such as the Integrative Penalized Matrix Decomposition (iPaD) method, identifies the drug-pathway associations by taking the lasso-type penalty on the regularization term. Moreover, instead of imposing the L1-norm regularization, the L2,1-Integrative Penalized Matrix Decomposition (L2,1-iPaD) method imposes the L2,1-norm penalty on the regularization term. In this paper, based on the iPaD and L2,1-iPaD methods, we propose a novel method named L1L2,1-iPaD (L1L2,1-Integrative Penalized Matrix Decomposition), which takes the sum of the L1-norm and L2,1-norm penalties on the regularization term. Besides, we perform permutation test to assess the significance of the identified drug-pathway association pairs and compute the P-values. Compared with the existing methods, our method can identify more drug-pathway association pairs which have been validated in the CancerResource database. In order to identify drug-pathway associations which are not validated in the CancerResource database, we retrieve published papers to prove these associations. The results on two real datasets prove that our method can achieve better enrichment for identified association pairs than the iPaD and L2,1-iPaD methods.


INTRODUCTION
With the rapid development of data generated from genetic analyses and functional genomics, identifying drug targets has become more and more feasible [1]. The modern drug discovery has found thousands of drug targets and the clinical testing agents. These agents can produce curative effects via regulating different targets in various biological and related diseases of pathways. Traditional drug discovery methods follow the "one drugone target" line. But many complex diseases are related to dysfunction of multiple pathways rather than individual genes [2]. These methods ignore the relationship among genes and the systemic nature of human diseases. In general, pathways are defined as the interaction of multiple genes, that is, pathways can be regarded as the target of drugs [3,4]. Abnormal biological pathways can provide views for the aberrant imbalance underlying diseases and find targets for complex diseases intervention [5].
Compared with the "one drug-one target" methods, the systematic biology approach takes the drug effects into the global physiological environment account [6]. These existing computational methods for identifying drug targets include the Gene Set Enrichment Analysis (GSEA) method [7], the FacPad method [2], the iFad method [5] and the iPaD method [8], etc. The GSEA www.impactjournals.com/oncotarget/ Oncotarget, 2017, Vol. 8, (No. 29), pp: 48075-48085 Research Paper www.impactjournals.com/oncotarget method has several disadvantages. Firstly, for every paired drug-pathway association, calculation must be done once at every turn. Secondly, the genes which belong to the same pathway will have the common weights, therefore, when a subblock of genes serve as the critical interaction groups for a specific drug. In summary, the GSEA method is time-consuming and inaccurate. The FacPad method has been proposed to identify drug-pathway association pairs, and it develops a sparse Bayesian factor analysis model to deal with treatment response data that are derived from microarray platforms [2]. And the iFad method is also a sparse Bayesian factor model, which is proposed to infer the drug-pathway association using gene expression and drug sensitivity data. In literature [5], the authors apply this method on the NCI-60 data set. The gene expression and drug sensitivity data are downloaded from the CellMiner database [9] (http://discover.nci.nih.gov/ cellminer). The iFad method is effective in identifying paired drug-pathway associations. Since this method uses the Markov Chain Monte Carlo (MCMC) [10] to perform statistical inferences, it is computationally expensive. Also it has many turning parameters which should be specified by users. In order to improve the algorithm speed and performance, a method named iPaD (integrative Penalized Matrix Decomposition) is proposed by Li et al. [8] to identify paired drug-pathway associations. The iPaD method applies the integrative penalized matrix decomposition method to analyze the gene expression data and the drug sensitivity data. The scalable bi-convex optimization algorithm is used to solve the objective function. Since the L 1 -norm penalty can produce sparsity, the L 1 -norm regularization item is added on the drugpathway association matrix. By applying the iPaD method on the NCI-60 and Cancer Cell Line Encyclopedia (CCLE) datasets, Li et al. [8] prove that the iPaD method performs better than the iFad method in computational efficiency and identifying drug-pathway association pairs. Since the L 2 1 , -norm regularization can penalize each row of the matrix as a whole and can enhance the sparsity among the rows [11], based on this theory, a method named L 2 1 , -iPaD (L 2 1 , -integrative Penalized Matrix Decomposition) is proposed to identify paired drug-pathway associations [12]. The L 1 -norm penalization can produce scattered and unstructured sparsity matrix, yet L 2 1 , -norm penalization can produce structured row sparsity matrix [11]. Moreover, the sum of the L 1 -norm and L 2 1 , -norm penalization can produce row structure with intra-row sparsity. In this paper, for the purpose of enhancing the sparsity of the drug-pathway association matrix and improving the performance of the method, we use the sum of the L 1 -norm regularization and L 2 1 , -norm regularization instead of the L 1 -norm regularization. In this article, a novel method named "L L 1 21 , -iPaD" is proposed to identify paired drug-pathway associations. Our method has the following advantages. Firstly, for the first time, we propose the L L 1 21 , -iPaD method by replacing the L 1 -norm regularization with the sum of the L 1 -norm regularization and L 2 1 , -norm regularization. Secondly, the L L 1 21 , -iPaD method can be used to analyze gene expression and drug sensitivity data. Thirdly, it gives an effective method to identify drug-pathway association pairs. Experimental results in two real datasets prove that the L L 1 21 , -iPaD method is effective.
The remainder of this paper is organized as follows. Firstly, we will introduce two real datasets, give out the experimental results and show the comparison of our method with the state-of-art methods. Then we will provide discussion of this article and outline the future works. And thirdly, we will introduce the notations and definitions in this paper. Finally, we will describe the related works and methodology of the L L 1 21 , -iPaD.

THE INTRODUCTION FOR DATASETS CCLE data set
The CCLE data set is downloaded from the CCLE project, which is a competitive resource to implement a detailed genetic characterization for a large panel of human cancer cell lines. The CCLE project (http:// software.broadinstitute.org/software/cprg/?q=node/11) can provide public access analysis, visualization of DNA copy number, mutation data, mRNA expression, and so on, for about 1046 cancer cell lines. The CCLE data set consists of 480 cell lines with drug sensitivity data for 22 drugs and transcription data for 1802 genes covering 58 KEGG pathways. In fact, the CCLE data set in the CCLE project contains 18988 genes and 24 chemical drugs. The detailed data processing can be found in [8]. The drug sensitivities are measured by area over the dose-response curve ('activity area'). EC 50 , IC 50 and maximum activity area (' A max ') can also measure drug sensitivities. But activity area has two advantages. Firstly, it has less missing values. Secondly, it can both reflect drug potency and efficacy. But EC 50 and IC 50 can only reflect drug potency, and A max can only reflect its efficacy. In this paper, the density of the gene-pathway association prior knowledge matrix H 1 ( ) is 3.95%. And for the drug-pathway association prior knowledge matrix H 2 ( ) , we set it to zero.

NCI-60 data set
The NCI-60 data set contains 57 cell lines with gene expression data for 1863 genes covering 58 KEGG [13] pathways and drug sensitivity data for 101 drugs. The NCI-60 data set is from the NCI-60 project, which provides various type of 'Omics' features of 60 cell lines with 9 different cancer types. The gene expression data and the drug sensitivity profiles are downloaded from the CellMiner database [9]. The detailed data processing can be found in [5]. The drug sensitivity data are obtained by GI 50 values, those values can reflect the potency of drugs. www.impactjournals.com/oncotarget The GI 50 is defined as the needed concentration for 50% of maximum cell growth inhibition.

RESULTS
In this Section, we evaluate the performance of our proposed L L 1 21 , -iPaD method by applying to the CCLE and NCI-60 datasets. To show the effectiveness of our proposed method, we also compare our method with the iPaD and L 2 1 , -iPaD methods in this Section.

The results on the CCLE data set
In this experiment, we first use the five-fold crossvalidation to obtain the optimal λ 1 and λ 2 values, and then obtain the sparse drug-pathway association matrix B 2 ( ) corresponding to the optimal parameter values.
Similar to the iPaD and L 2 1 , -iPaD methods, our method also can assess the relative importance of the coefficients in the drug-pathway association matrix B 2 ( ) via solving the L L 1 21 , -problem for a decreasing λ sequence. For each λ value, we record the order of the coefficients in which they become nonzero. In general, the more important coefficients ought to become nonzero earlier than the less important coefficients. However, this procedure cannot be used to assess the significance of the coefficients. Therefore, we perform permutation test to assess the significance of the coefficients in the drugpathway association matrix B 2 ( ) . In this paper, we run 2000 permutations to obtain the P-values. The P-values of our method, the L 2 1 , -iPaD and iPaD methods on the CCLE data set are listed in Table 1, in which the superior results are shown in bold type. In this paper, the known drug-pathway associations are served as the validated information. In Table 1, the drug named Nutlin-3 is related with the Chronic myeloid leukemia pathway. A published paper suggests that Nutlin-3 can up-regulate the expression of Notch1 in both lymphoid and myeloid leukemic cells for a part of the negative feedback antiapoptotic mechanism [14]. Besides, the authors in [15] confirm that IAP inhibition using a small synthetic inhibitor (LBW-242) increases the sensitivity of CML cells to TKI. This drug-pathway association pair is not validated in the CancerResource, but our method can find their associations. And in Table 1, only 5 drug-pathway association pairs are validated in the CancerResource. For the rest 10 drug-pathway associations which are not validated in the CancerResource, we retrieve published papers to prove their associations. Only one association pair is not found from published papers. Besides, similarly, the authors in [16] suggest that cotreatment with LBH589 and 17-AAGcan induce more apoptosis of IM-resistant primary CML-BC and acute myeloidleukemia cells than treatments with either agent alone. And in this paper, our new method also can find that 17-AAG is associated with the Chronic myeloid leukemia pathway. Moreover, the drug-pathway pairs corresponding to nonzero elements in the matrix B 2 ( ) are selected as the identified drugpathway association pairs. In this experiment, our method identifies 413 drug-pathway pairs that have p-value no more than 0.05, and 70 drug-pathway pairs are validated in the CancerResource database. The L 2 1 , -iPaD method identifies 368 drug-pathway pairs that have p-value no more than 0.05, and 66 drug-pathway pairs are validated in the CancerResource database. However, iPaD identifies 88 drug-pathway pairs that have p-value no more than 0.05, and only 25 drug-pathway pairs are validated in the CancerResource database. When we set the P-value cutoff as 0.005, 51 drug-pathway association pairs are identified by the iPaD method, with 16 association pairs validated in the CancerResource database. But for our method and the L 2 1 , -iPaD method, 53 drug-pathway association pairs are identified, with 16 association pairs validated in the CancerResource database. In addition, in [8], we can easily find that the results of iFad and GSEA methods are poorer than the iPaD method, so in this paper, we does not compare our method with the iFad and GSEA methods.
Then we compute the identification and verification rates of drug-pathway association pairs on the CCLE data set. Specifically, we make the ratio of number of identification (or verification) and total number of drugpathway pairs as the identification (or verification) rate. The identification and verification rates of drug-pathway association pairs on the CCLE data set for the L L 1 21 , -iPaD, L 2 1 , -iPaD and iPaD methods can be found in Table 2  and Table 3. It is obvious that our method can identify more drug-pathway association pairs than other existing methods.

The results on the NCI-60 data set
Similar to the L 2 1 , -iPaD and iPaD methods, we also apply our method to the NCI-60 data set. For the NCI-60 data set, we also run 2000 permutation to evaluate the significance of the coefficients in the drug-pathway association matrix B 2 ( ) . The P-values of our method, the L 2 1 , -iPaD and iPaD methods on the NCI-60 data set are listed in Table 4. The authors in [17] prove that the mechanism of action of the EA derivatives prepared in this study is more complex than the inhibition of glutathione S-transferase p ascribed as unique effect to EA and might help to overcome tumor resistances. And in Table 4, Melphalan is associated with T cell receptor signaling pathway. A study published in 2003 suggests that Melphalan can control the expression of T cell receptor signaling pathway [18]. In Table 4, only   Only one pair associations are not found from published papers. So, our method can also identify associations, which are not validated in CancerResource, in the NCI-60 data set between drugs and pathways. For example, a paper studied in 2000 writes that an in vitro study the effects of MPA (Mycophenolic acid) on human peripheral blood lymphocyte activation markers and on cell cycle characteristics are investigated [19]. Moreover, the drugpathway pairs corresponding to nonzero elements in the matrix B 2 ( ) are selected as the identified drug-pathway association pairs. In this experiment, our method identifies 593 drug-pathway pairs that have p-value no more than 0.05, and 163 drug-pathway pairs are validated in the CancerResource database. The L 2 1 , -iPaD method identifies 562 drug-pathway pairs that have p-value no more than 0.05, and 163 drug-pathway pairs are validated in the CancerResource database. However, iPaD identifies 247 drug-pathway pairs that have p-value no more than 0.05, and only 74 drug-pathway pairs are validated in the CancerResource database. And when we set the P-value cutoff as 0.005, the results of our method is similar to the L 2 1 , -iPaD method, but the number of identification and verification are higher than the iPaD method. Similar to the CCLE data set, we also compute the identification and verification rates of drug-pathway association pairs on the NCI-60 data set. The identification and verification rates of drug-pathway association pairs on the NCI-60 data set for the L L 1 21 , -iPaD, L 2 1 , -iPaD and iPaD methods can be found in Table 5 and Table 6.

DISCUSSION
Identifying drug-targets is a momentous issue for the bioinformatics and a crucial step for the drug discovery. In this paper, we proposed a novel method named "L L 1 21 , -iPaD" to identify the paired drug-pathway associations, and used it to jointly analyze the gene expression and drug sensitivity data. In addition, for the L L 1 21 , -iPaD method, two parameters need to adjust. So, we use five-fold cross-validation to choose the optimal parameters. Besides, we perform permutation test to assess the significance of the identified drug-pathway association pairs. In order to evaluate the performance of our method, we apply it to two real datasets, including the CCLE and NCI-60 datasets. Moreover, we compare our method with the iPaD and L 2 1 , -iPaD methods. The experimental results demonstrate that our method is superior to the iPaD and L 2 1 , -iPaD methods. Our method can identify more drug-pathway association pairs which are validated in the CancerResource database than other methods. Besides, for the rest drug-pathway associations which are not validated in the CancerResource database, we retrieve published papers to prove their associations.
With the development of the high-throughput technology, more and more genomic data sets are generated from various fields. At present, one of our central task is to develop the feasible and efficient method to analyze them. Our method is one of the useful ways for identifying drug-pathway association pairs. In the future, we will develop more efficient and robust methods to jointly analyze high dimensional data and to solve the computational challenges.

NOTATIONS AND DEFINITIONS
In this Section, we summarize the definition of norms, which will be used in the methods. Given a matrix For the matrix M ∈ × R d n , the L 1 -norm of a matrix was first presented in [20], whose definition can be written as ( The Frobenius norm of the matrix M can be defined as

M m
And the L 2 1 , -norm of a matrix was first introduced in [21]. Until now, the L 2 1 , -norm has been used in many fields, such as the feature extraction [22] and the image processing [23,24] (6) where ⋅ F denotes the Frobenius norm. For the Eq.(6), the optimization model of iPaD method [8] is defined as follows: indicates that the j-th gene is not associated with the i -th pathway. The authors in [8] assume that the known gene-pathway associations are complete, so, they pay attention to discover the novel drug-pathway associations. Therefore, the second constraint condition is used to incorporate known gene-pathway associations. Besides, the first constraint condition is used to guarantee that the model is identical.

L 2,1 -iPaD method
Since the L 2 1 , -norm penalty can produce rows sparsity [25], another effective method named L 2 1 , -iPaD is proposed to identify the novel drug-pathway association pairs [12]. In this paper, the L 2 1 , -norm regularization is used to replace the L 1 -norm regularization to enforce sparse constraint on rows. And it modifies the optimization problem (7) as follows:

METHODOLOGY
Since the gene-pathway association information is available and complete, the central interest lies in the inference of the matrix B 2 ( ) , that is, the paired drugpathway associations. We further consider enhancing the sparsity of the matrix B 2 ( ) . Thus, we add the L 1 -norm regularization to impose the penalty among all elements in the matrix B 2 ( ) and propose our new L L 1 21 , -iPaD method. The objective function of the L L 1 21 , -iPaD method is defined as follows: where λ 1 and λ 2 are two adjustable parameters, which can increase or decrease the sparsity of the matrix B where . The iPaD method [8] used the traditional gradient descent method to optimize X. Here, we also apply the gradient descent method to solve this problem. The objective function of Eq.(10) can be written as follows: So that X can be updated by Here, μ is the iteration step size. At every iteration, we check whether X k +1 is located in the feasible fields X X : , , , . If X k +1 satisfies this condition, we perform next step, otherwise, we make www.impactjournals.com/oncotarget X X X k k k + + + = 1 1 1 . The Nesterov's method [26] can achieve a convergence rate becoming 1 2 t (The original convergence rate is 1 t). So, when we update X, we also apply this method to quicken the convergence speed.
Least squares method is a common optimization algorithm to solve the unconstrained optimization problems. In order to use the prior knowledge matrix H where we omit the symbol of the objective function. The objective function can be converted into the following equation: where I is a unit matrix with the size of K K × , and J 1 ⋅ ( ) Hence, we can obtain: For the part of B  In order to simplify the optimization problem (22), we set X XD The objective function is equal to the following equation: And then we compute the derivative of J 2 B 1 ( ), then set its result to zero, we have: We sum up the L L 1 21 , -iPaD method as the Algorithm 1. Besides, in this paper, we also use the softimpute algorithm to deal with the missing values problems in solving X. The detailed steps can be found from the iPaD [8] and L 2 1 , -iPaD methods [12].

Parameter selection and significance test
Compared with the iPaD and L 2 1 , -iPaD methods, our new L L 1 21 , -iPaD method has two adjustable parameters (λ 1 and λ 2 ), which can control the sparsity of the drugpathway association matrix B 2 ( ) . In this paper, We perform five-fold cross-validation to find the suitable parameters. In the five-fold cross-validation, we divide the matrix Y 2 ( ) into 5 folds. At each round of the cross-validation, we make each of the 5 folds as missing values, and the rest of 4 folds as training data. It is obvious that it is difficult to find the optimal parameters λ 1 and λ 2 , simultaneously. Therefore, we solve the L L 1 21 , -iPaD problem to produce a λ 1 sequence by fixing λ 2 , and then we use five-fold crossvalidation to find an optimal parameter λ 1 . We treat the minimum residual sum of squares (RSS) corresponding to the λ 1 value as the optimal value. Similarly, we find an optimal parameter λ 2 by fixing λ 1 . Since our method is a sparse optimization algorithm, there are many zero coefficients in the matrix B 2 ( ) . We usually treat those nonzero coefficients as the identified core drug-pathway association pairs. After searching for the optimal parameters, we perform permutation test [27] to assess the significance of the coefficients. We first obtain the estimated value of the matrix B

ACKNOWLEDGMENTS
This work was supported in part by the grants of the National Science Foundation of China, Nos. 61572284 and 61502272.