Accurate prediction of protein-protein interactions by integrating potential evolutionary information embedded in PSSM profile and discriminative vector machine classifier

Identification of protein-protein interactions (PPIs) is of critical importance for deciphering the underlying mechanisms of almost all biological processes in the cell and provides great insight into the study of human disease. Although much effort has been devoted to identifying PPIs from various organisms, existing high-throughput biological techniques are time-consuming and expensive, and suffer from high false-positive and false-negative rates. Thus, there is an urgent need for in silico methods that predict PPIs efficiently and accurately in this post-genomic era. In this article, we report a novel computational model combining our newly developed discriminative vector machine classifier (DVM) and an improved Weber local descriptor (IWLD) for the prediction of PPIs. Two components, differential excitation and orientation, are exploited to build evolutionary features for each protein sequence. The main characteristics of the proposed method lie in introducing an effective feature descriptor, IWLD, which can capture highly discriminative evolutionary information from position-specific scoring matrices (PSSM) of protein data, and in employing the powerful and robust DVM classifier. When applying the proposed method to the Yeast and H. pylori data sets, we obtained prediction accuracies as high as 96.52% and 91.80%, respectively, which are significantly better than those of previous methods. Extensive experiments were then performed for predicting cross-species PPIs, and the predictive results were also promising. To further validate the performance of the proposed method, we compared it with the state-of-the-art support vector machine (SVM) classifier on the Human data set. The experimental results indicate that our method is highly effective for PPI prediction and can be taken as a supplementary tool for future proteomics research.


INTRODUCTION
Oncotarget, 2017, Vol. 8, (No. 14), pp: 23638-23649, www.impactjournals.com/oncotarget
In this post-genomic era, protein-protein interactions (PPIs) can provide great insights into the intrinsic mechanisms of biological processes within a cell, and so PPI networks have been drawing increasing attention. Recently, a number of high-throughput biological techniques, such as yeast two-hybrid screens [1], mass spectrometric protein complex identification (MS-PCI) [2] and protein chips [3], have been proposed to identify interactions between proteins. As a result, a large amount of PPI data from various kinds of organisms has been collected, and a number of databases, like DIP [4], BIND [5] and MINT [6], have also been constructed. However, such experimental methods for identifying PPIs are usually labor-intensive and time-consuming. The PPI pairs identified by these traditional techniques only account for a small part of the entire PPI network [7,8]. Moreover, those high-throughput techniques suffer from high rates of false positive and false negative results. All these limitations call for robust and effective in silico methods as a complement to biological experimental techniques for the prediction of protein-protein interactions.
As a beneficial supplement to biological methods, a number of computational methods have been developed to predict protein interactions from different sources of information, such as protein domains, phylogenetic profiles, gene co-expression and secondary structures [9][10][11][12]. However, such methods need specific domain knowledge, which limits their wider application. Evolutionary information embedded in protein sequences has good capability for predicting PPIs [13]. Zahiri et al. [14] proposed a novel algorithm named PPIevo for detecting PPIs, which extracted the evolutionary feature from position-specific scoring matrices (PSSM) of protein sequences. Hamp et al. [15] combined evolutionary profiles from protein sequences with profile-kernel support vector machines (SVM) to predict PPIs and obtained good results. An et al. [16] reported the RVM-BiGP prediction model to predict PPIs from protein sequences, and the results are very promising. Nevertheless, there is still room to improve the performance of the state-of-the-art prediction methods.
This paper is an extension of our previous work [17]. In this study, we report a novel computational model to predict PPIs using the evolutionary information of proteins. The main improvements of the proposed method lie in introducing an effective feature extraction method, namely the improved Weber local descriptor (IWLD), and in using our newly developed discriminative vector machine (DVM) classifier. Specifically, a protein sequence of length L is first converted to an L-by-20 position-specific scoring matrix (PSSM). Then, an IWLD descriptor is used to extract discriminative evolutionary information from the PSSM, and a 256-dimensional histogram feature vector is constructed for each protein accordingly. Next, the two histogram vectors of a protein pair are combined into a 512-dimensional feature vector. Furthermore, the dimensionality reduction tool PCA (principal component analysis) is employed to extract the highly discriminatory information and reduce noise. Finally, the DVM classifier is used to carry out classification. In this work, we first evaluated the proposed method on two PPI data sets, Yeast and H. pylori, and obtained predictive accuracies of 96.52% and 91.80%, respectively. Then, extensive experiments were performed to compare the proposed method with the state-of-the-art SVM classifier on the Human data set. In addition, comparisons between our method and other previous methods were also carried out. All the experimental results indicate that the proposed method is effective for PPI prediction.

Evaluation of predictive ability
To decrease data dependence and avoid over-fitting of the prediction model, a five-fold cross-validation strategy was used in our study. Namely, the whole data set was evenly divided into five subsets, four of which were randomly chosen for training, and the remaining one for testing. To assess the validity of the proposed method, the random selection was repeated five times, so that five training sets and five validation sets were generated. To be fair, the parameters of DVM were set to the same values in all experiments. The predictive results of the proposed method on the Yeast and H. pylori PPI data sets are shown in Table 1 and Table 2.
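The splitting step can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `five_fold_splits`, the fixed random seed and the interleaved assignment of shuffled indices to folds are all assumptions made for the example.

```python
import random

def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment of samples to folds
    # Fold f takes every n_folds-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test_idx = folds[f]
        train_idx = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train_idx, test_idx

# 11188 is the number of protein pairs in the Yeast data set.
splits = list(five_fold_splits(11188))
```

Each sample appears in exactly one test fold, so averaging the five fold-wise scores uses every protein pair once for testing.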
It can be observed from Table 1 that, when applied to the Yeast data set, the proposed method achieves an average accuracy, sensitivity, precision and MCC of 96.52%, 94.86%, 98.11%, and 93.08%, respectively. Similarly, Table 2 shows that on the H. pylori data set the average accuracy obtained using our method is 91.80%, with an average sensitivity of 92.15%, an average precision of 91.47%, and an average MCC of 83.60%. In addition, the standard deviations are relatively low. For the Yeast data set, the standard deviations of accuracy, sensitivity, precision and MCC are 0.46%, 0.59%, 0.48% and 0.92%, respectively. The standard deviations of accuracy, sensitivity, precision and MCC on the H. pylori data set are 0.85%, 1.54%, 0.91% and 1.69%, respectively. The ROC curves obtained with five-fold cross-validation on the Yeast and H. pylori data sets are illustrated in Figure 1 and Figure 2, respectively.
From Table 1 and Table 2, it can be concluded that the proposed predictive model combining DVM and the IWLD descriptor is accurate and effective for the prediction of PPIs on the two data sets. In our predictive model, the PSSM not only provides the order information of the protein sequence but also retains sufficient evolutionary information. Next, by using the differential excitation and orientation components, the IWLD descriptor is able to retain local, highly discriminative information for PPI prediction. In addition, the application of PCA reduces the dimensionality of the IWLD vector, decreases the impact of noise and accelerates the predictive process. Consequently, our proposed method is suitable for predicting PPIs on the two data sets.

Comparison with SVM classification model
Support vector machine (SVM) is one of the most widely used classification models for PPI prediction. In this study, we used the LIBSVM toolbox to carry out the prediction of PPIs (available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/). To further verify the performance of the proposed method, we applied SVM to predict PPIs on the Human data set and compared its performance with that of DVM. To be fair, the two predictive models adopted the same feature extraction method. Here, the Gaussian function was chosen as the kernel function of SVM. A general grid search method was employed to optimize the two parameters of SVM (the kernel width parameter and the regularization parameter), which were tuned to 0.01 and 0.6, respectively. The predictive results of the two methods are given in Table 3. When using the DVM classifier to identify PPIs on the Human data set, we obtained promising results, with average accuracy, sensitivity, precision and MCC of 97.30%, 95.70%, 98.61% and 94.63%, respectively. Meanwhile, the SVM-based method performed relatively poorly, with lower average accuracy, sensitivity, precision and MCC of 90.60%, 91.61%, 89.01% and 81.22%, which indicates that DVM performs better than SVM for predicting PPIs. In addition, DVM is more stable than SVM because it has lower standard deviations of the evaluation criteria. Specifically, the DVM-based method yielded standard deviations of accuracy, sensitivity, precision and MCC as low as 0.60%, 0.87%, 0.80% and 1.20%, which are lower than the corresponding values of 0.95%, 0.89%, 1.72% and 1.82% for the SVM-based method. Furthermore, Figure 3 and Figure 4 show the ROC curves obtained by DVM and SVM, respectively. It can be observed that DVM yielded a higher average AUC (area under the ROC curve) than the SVM classifier.
By analyzing the experimental results, we can conclude that DVM is more effective and robust than SVM in predicting PPIs. There are two possible explanations for the results. (1) Based on k nearest neighbors (kNNs), the robust M-estimator and manifold regularization, DVM decreases the influence of outliers and avoids the requirement that the kernel function satisfy the Mercer condition. (2) Although there are three parameters (β, γ, and θ) to be tuned in DVM, they affect its performance only slightly as long as they are adjusted within suitable ranges. Therefore, DVM is more suitable for predicting PPIs than SVM.
Table 4. The basis of this hypothesis is that homologs tend to have similar functional behavior and so they preserve the same PPIs [18]. When applying the proposed method to the prediction of PPIs from these five species, the average accuracies range from 76.23% to 92.72%. On the one hand, these promising results indicate that Yeast proteins may share similar interaction mechanisms with the other five species and that their sequence data are sufficient for the prediction of PPIs in other species; on the other hand, they demonstrate that the proposed method has good generalization ability. In addition, the prediction results suggest that PPIs in one species can be employed to identify PPIs in other species.

Comparison with other methods
So far, a variety of machine-learning-based computational methods have been proposed for PPI prediction. To further validate the effectiveness of our method, we also compared our DVM-based predictive model using the IWLD descriptor with several previous methods (see Table 5 and Table 6) on the Yeast and H. pylori data sets. In Table 5, the prediction accuracy of the previous methods on the Yeast data set varies from 75.08% to 93.92%, while our proposed method achieves a higher value of 96.52%. Similarly, for sensitivity and precision, our predictive model yields better performance than the others. Moreover, the corresponding standard deviations indicate that the proposed method is stable and robust. Although the RF + PR-LPQ method has lower standard deviations, ensemble classifiers usually perform better than single classifiers, so our method, which relies on a single classifier, can still be viewed as one of the most competitive computational methods for predicting PPIs.
Similar results for the different methods on the H. pylori data set can be found in Table 6. The accuracies of the other methods vary from 83.00% to 89.47%, while our proposed method attains a higher value of 91.80%. The same is true for precision, sensitivity and MCC. The predictive results in Table 5 and Table 6 indicate that the DVM-based classifier incorporating the IWLD descriptor can improve the performance of PPI prediction compared with the state-of-the-art methods. The promising prediction results of our method may be attributed to the novel feature extraction method, which can provide highly discriminative information, and to the choice of the DVM classifier, which has been demonstrated to be robust and powerful [19].

CONCLUSIONS
In this work, we put forward a novel evolutionary-information-based computational model for predicting PPIs, which combines our newly developed discriminative vector machine classifier (DVM) with an improved Weber local descriptor (IWLD) to capture highly discriminative information. To minimize data dependence and avoid over-fitting, five-fold cross-validation was adopted. When applied to the Yeast and H. pylori data sets, the model achieved prediction accuracies of 96.52% and 91.80%, respectively. Additionally, to evaluate the generalization capability of the proposed method, extensive experiments were performed to predict PPIs on five other species data sets. The method was also compared with an SVM-based model and with other previous works. The results show that the proposed method is very competitive for predicting PPIs and can be taken as a useful supplementary tool to the traditional experimental methods for future proteomics research.

Golden standard data sets
In this study, we verified the proposed method on a high-confidence PPI data set, Yeast, gathered from the publicly available Database of Interacting Proteins (DIP), version DIP_20070219 [4]. All protein pairs were aligned by a multiple sequence alignment tool, CD-HIT [28]. To reduce fragments and similarity, protein pairs with ≤50 residues or ≥40% sequence identity were all removed. The remaining 5594 interacting protein pairs form the positive data set, and 5594 additional protein pairs from different subcellular localizations were chosen to construct the negative data set. Therefore, the Yeast data set finally contains 11188 protein pairs, of which half are positive samples and half are negative samples.
To further test the generality of the proposed method, we also evaluated it on two other PPI data sets: Human and H. pylori. The first data set, Human, comes from the Human Protein Reference Database (HPRD). By using the aforementioned steps, we selected 3899 protein pairs as the positive data set and 4262 additional protein pairs from different subcellular localizations as the negative data set. As a result, the Human data set finally consists of 8161 protein pairs. Similarly, the second data set, H. pylori, consists of 2916 protein pairs, of which half are interacting pairs and half are non-interacting pairs, as described by Martin et al.

Improved Weber local descriptor
Inspired by Weber's Law, Chen et al. [29] proposed the original Weber local descriptor (WLD) for image recognition, which contains two components, namely differential excitation and orientation. The differential excitation component of WLD is the ratio between two terms: one is the relative intensity difference of an interest point against its neighbors; the other is the intensity of the point itself. We first calculate the intensity differences between the interest point $x_c$ and its neighbors with the filter $f_{00}$ (see Figure 5):

$$v_s^{00} = \sum_{i=0}^{p-1} \Delta x_i = \sum_{i=0}^{p-1} (x_i - x_c)$$

where $x_i$ represents the $i$-th neighbor of $x_c$ and $p$ is the number of its neighbors. We then calculate the ratio of the intensity differences and the intensity of $x_c$:

$$G_{ratio}(x_c) = \frac{v_s^{00}}{v_s^{01}}$$

where $v_s^{01}$ is the output of the filter $f_{01}$ (see Figure 5). As described before, $v_s^{01}$ is just the original intensity of $x_c$. Next, the arctangent function is employed to construct the differential excitation $\xi(x_c)$:

$$\xi(x_c) = \arctan\left[\frac{v_s^{00}}{v_s^{01}}\right] = \arctan\left[\sum_{i=0}^{p-1} \frac{x_i - x_c}{x_c}\right]$$

In addition, the orientation component of WLD describes the gradient orientation of the interest point. In the original WLD, only 4 neighbors of $x_c$ are utilized, which may lose some important discriminating information and makes the component sensitive to noise. In our study, we adopted an improved WLD (IWLD) descriptor by introducing Sobel operators (see Figure 6). By taking into account all 8 neighbors of $x_c$, it can not only preserve sufficient orientation information but also effectively suppress the noise. Thus, the orientation component of IWLD is computed as:

$$\theta(x_c) = \arctan\left(\frac{v_s^{11}}{v_s^{10}}\right)$$

where $v_s^{10}$ and $v_s^{11}$ denote the outputs of the filters $f_{10}$ and $f_{11}$ (see Figure 6).
To perform histogram statistics, the differential excitation $\xi$ is quantized into M intervals, each delimited by a lower bound and an upper bound, and every point is assigned the index m of the interval that contains its differential excitation. Similarly, the orientation $\theta$ is quantized into T dominant orientations, indexed by t. By calculating the m and t values of each point in an image, a 1D histogram vector can be obtained accordingly. To fully mine the local discriminative information, we first divide the image into V × H sub blocks. Here, V represents the number of sub blocks in the vertical direction and H represents the number of sub blocks in the horizontal direction, and the histogram vector of each block is obtained accordingly. Then all the histogram vectors of the image are concatenated into the final one-dimensional IWLD feature vector.
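The descriptor described above can be sketched in plain Python. This is only an illustration under several assumptions that the text does not fix: the function name `iwld_histogram` is hypothetical, border points without 8 neighbors are skipped, a small `eps` replaces a zero center intensity, the standard 3 × 3 Sobel kernels stand in for the filters of Figure 6, `atan2` is used so that the orientation covers the full circle, and both components are quantized uniformly. The defaults M = T = 8 and V = H = 2 are the values chosen in this work.

```python
import math

def iwld_histogram(img, M=8, T=8, V=2, H=2, eps=1e-6):
    """Return V*H*M*T bin counts for a 2-D matrix given as a list of rows."""
    rows, cols = len(img), len(img[0])
    hist = [0] * (V * H * M * T)
    for r in range(1, rows - 1):            # skip the border (assumption)
        for c in range(1, cols - 1):
            xc = img[r][c]
            nb = [[img[r + dr][c + dc] for dc in (-1, 0, 1)] for dr in (-1, 0, 1)]
            # Differential excitation: arctan(sum_i (x_i - x_c) / x_c).
            diff = sum(sum(row) for row in nb) - 9 * xc
            denom = xc if abs(xc) > eps else eps   # assumed guard against x_c = 0
            xi = math.atan(diff / denom)           # lies in [-pi/2, pi/2]
            # Orientation from the horizontal and vertical Sobel responses.
            gx = (nb[0][2] + 2 * nb[1][2] + nb[2][2]) - (nb[0][0] + 2 * nb[1][0] + nb[2][0])
            gy = (nb[2][0] + 2 * nb[2][1] + nb[2][2]) - (nb[0][0] + 2 * nb[0][1] + nb[0][2])
            theta = math.atan2(gy, gx)             # lies in [-pi, pi]
            # Uniform quantisation (assumption) into M and T bins.
            m = min(int((xi + math.pi / 2) / (math.pi / M)), M - 1)
            t = min(int((theta + math.pi) / (2 * math.pi / T)), T - 1)
            # Sub-block index: V blocks vertically, H blocks horizontally.
            v = min(r * V // rows, V - 1)
            h = min(c * H // cols, H - 1)
            hist[((v * H + h) * M + m) * T + t] += 1
    return hist

# Toy 30-by-20 integer matrix standing in for a PSSM (values are made up).
toy_pssm = [[((3 * r + 7 * c) % 11) - 5 for c in range(20)] for r in range(30)]
vector = iwld_histogram(toy_pssm)
print(len(vector))  # 256
```

The sketch returns raw counts; whether and how the histograms are normalized before classification is not specified here and is left to the caller.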
In this work, there are four free parameters (M, T, V and H) to be tuned. Through grid search on the Yeast and H. pylori data sets, we chose M=8, T=8, V=H=2 in our experiments, so each protein sequence sample is transformed into an IWLD vector of M × T × V × H = 8 × 8 × 2 × 2 = 256 dimensions. Next, the two IWLD vectors of each protein pair are concatenated into a 512-dimensional vector. Then, the dimensionality reduction algorithm PCA is employed to reduce the impact of noise and accelerate the predictive process, and the final 200-dimensional reduced vector is constructed for the subsequent classification.

Discriminative vector machine
Classification is a fundamental issue in the field of pattern recognition, and numerous classification algorithms exist for different recognition tasks. In this work, our newly developed discriminative vector machine (DVM) classifier [19] was adopted for classification. DVM is a probably approximately correct (PAC) learning classifier which can reduce the error caused by generalization and is very robust. For a given test sample $y$, the first step of DVM is to find its $k$ nearest neighbors (kNNs) to suppress the effect of outliers. The kNNs of $y$ can be expressed as $X_k = [x_1, x_2, \ldots, x_k]$, where $x_i$ denotes the $i$-th nearest neighbor. Equally, $X_k$ can also be represented as $X_k = [X_{k,1}, X_{k,2}, \ldots, X_{k,c}]$, where $X_{k,i}$ comes from the $i$-th class. So the objective of DVM is to solve the following minimization problem:

$$\min_{\alpha_k} \; \beta\,\|\alpha_k\| + \sum_{i=1}^{d} \phi\big((y - X_k\alpha_k)_i\big) + \gamma \sum_{p=1}^{k}\sum_{q=1}^{k} w_{pq}\,(\alpha_k^{p} - \alpha_k^{q})^2 \tag{7}$$

where $\alpha_k$ can be denoted as $[\alpha_k^{1}, \alpha_k^{2}, \ldots, \alpha_k^{k}]$ or $[\alpha_{k,1}, \alpha_{k,2}, \ldots, \alpha_{k,c}]$, where $\alpha_{k,i}$ is the coefficient from the $i$-th class; $\|\alpha_k\|$ is a norm of $\alpha_k$, and the corresponding (squared) L2 norm is employed in our calculation; $(y - X_k\alpha_k)_i$ is the $i$-th element of $y - X_k\alpha_k$; and $\phi$ is a robust M-estimator to improve the robustness of DVM. The M-estimator is a generalized maximum likelihood operator proposed by Huber to estimate parameters under the cost function [30]. In this work, a robust Welsch M-estimator is adopted to attenuate error so that outliers have less impact on classification. The last term of Eq. (7) is a manifold regularization, where $w_{pq}$ is the similarity between the $p$-th and the $q$-th nearest neighbors of $y$. In this work, $w_{pq}$ is defined as the cosine distance between the $p$-th and the $q$-th NN of $y$. Then the corresponding Laplacian matrix L can be expressed as

$$L = D - W \tag{8}$$

where $W$ is the similarity matrix whose element is $w_{pq}$ and $D$ is a diagonal matrix whose $p$-th diagonal element is the sum of $w_{pq}$ over $q$. According to Eq. (8), the last term of Eq. (7) can be rewritten, up to a constant factor, as $\gamma\,\alpha_k^{T} L\,\alpha_k$. Furthermore, a diagonal matrix $P = \mathrm{diag}(p_1, \ldots, p_d)$ is constructed and its element is denoted as:

$$p_i = \exp\left(-\frac{\big((y - X_k\alpha_k)_i\big)^2}{\sigma^2}\right) \tag{9}$$

where $\sigma$ is the kernel size, which can be calculated in the following form:

$$\sigma^2 = \theta\,\frac{(y - X_k\alpha_k)^{T}(y - X_k\alpha_k)}{d} \tag{10}$$

where d is the dimension of y and θ is a constant to curb outliers. In this work, it is assigned to 1.0 as in the literature [31]. By merging Eq. (8), (9) and (10), the minimization of Eq. (7) can be converted to the following problem:

$$\min_{\alpha_k} \; (y - X_k\alpha_k)^{T} P\,(y - X_k\alpha_k) + \beta\,\|\alpha_k\|_2^2 + \gamma\,\alpha_k^{T} L\,\alpha_k \tag{11}$$

According to the theory of half-quadratic minimization, the global solution for a given $P$ can be described as:

$$\alpha_k = \left(X_k^{T} P X_k + \beta I + \gamma L\right)^{-1} X_k^{T} P\,y \tag{12}$$

After the related coefficients are calculated, the test sample $y$ can be identified as the $i$-th class if the residual $R_i$ is the minimum value:

$$R_i = \|y - X_{k,i}\,\alpha_{k,i}\| \tag{13}$$

By means of the robust M-estimator and manifold regularization, which suppress the effect of outliers and strengthen its discriminatory ability, the DVM classifier has better robustness and higher generalization ability than kNNs. In this work, there are two classes in total to be identified: non-interacting protein pair (class 1) and interacting protein pair (class 2). If the residual $R_1$ is the minimum, the test sample is classified as a non-interacting protein pair; otherwise it is identified as an interacting protein pair. For the three free parameters (β, γ and θ) of the DVM model, it is time-consuming to directly search for their optimal values. Fortunately, the DVM algorithm is so stable that all these parameters affect the performance only slightly if they are set in feasible ranges. Based on the above knowledge and through grid search, the parameters β and γ are set as 1E-3 and 1E-4, respectively. As described before, θ is a constant and is always set to 1 throughout the entire process. For large data sets, the DVM classifier needs to spend relatively more time in finding the representative vector, so multi-dimensional indexing techniques can be adopted to speed up the search process to a certain extent.
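The classification procedure can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names `solve`, `dot` and `dvm_predict`, the neighborhood size `k=5`, the Euclidean neighbor search, the fixed number of reweighting iterations and the guards against zero norms are assumptions made for the example. The sketch follows the steps named in the text: find the kNNs of the test sample, build a cosine-similarity graph Laplacian, repeatedly solve a reweighted, regularized least-squares problem with Welsch-type weights, and assign the class with the smallest reconstruction residual. The default values of `beta`, `gamma` and `theta` are the ones reported above.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dvm_predict(train_x, train_y, y, k=5, beta=1e-3, gamma=1e-4, theta=1.0, n_iter=10):
    """Return the label whose neighbours reconstruct y with the smallest residual."""
    d = len(y)
    # Step 1: k nearest neighbours of y (Euclidean distance is an assumption).
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], y)))[:k]
    X = [train_x[i] for i in order]            # k neighbour vectors of length d
    labels = [train_y[i] for i in order]
    # Step 2: cosine-similarity matrix W and graph Laplacian L = D - W.
    norm = [math.sqrt(dot(x, x)) or 1.0 for x in X]
    W = [[dot(X[p], X[q]) / (norm[p] * norm[q]) for q in range(k)] for p in range(k)]
    L = [[(sum(W[p]) if p == q else 0.0) - W[p][q] for q in range(k)] for p in range(k)]
    # Step 3: half-quadratic iterations, i.e. reweighted regularised least squares.
    alpha = [0.0] * k
    for _ in range(n_iter):
        e = [y[i] - sum(alpha[a] * X[a][i] for a in range(k)) for i in range(d)]
        sigma2 = theta * dot(e, e) / d or 1.0           # kernel size, guarded against 0
        p = [math.exp(-ei * ei / sigma2) for ei in e]   # Welsch-type weights
        A = [[sum(p[i] * X[a][i] * X[b][i] for i in range(d))
              + (beta if a == b else 0.0) + gamma * L[a][b]
              for b in range(k)] for a in range(k)]
        rhs = [sum(p[i] * X[a][i] * y[i] for i in range(d)) for a in range(k)]
        alpha = solve(A, rhs)
    # Step 4: class-wise reconstruction residuals; the smallest one wins.
    residual = {}
    for c in set(labels):
        rec = [sum(alpha[a] * X[a][i] for a in range(k) if labels[a] == c) for i in range(d)]
        residual[c] = math.sqrt(sum((y[i] - rec[i]) ** 2 for i in range(d)))
    return min(residual, key=residual.get)

# Toy data: label 1 clusters near the first axis, label 2 near the third axis.
toy_x = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.1, 0.0, 0.1],
         [0.0, 0.1, 1.0], [0.1, 0.2, 0.9], [0.1, 0.0, 1.1]]
toy_y = [1, 1, 1, 2, 2, 2]
label_a = dvm_predict(toy_x, toy_y, [1.0, 0.1, 0.05], k=4)
label_b = dvm_predict(toy_x, toy_y, [0.05, 0.1, 1.0], k=4)
```

The linear neighbor search in step 1 is the part that grows with the size of the training set, which is where an indexing structure would help.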

Procedure of proposed model
The procedure of our proposed model mainly contains two steps: feature extraction and classification.
Feature extraction is divided into three steps: (1) the PSI-BLAST tool is used to represent each protein sequence and the PSSM is obtained accordingly; (2) the PSSM of each protein is transformed into the corresponding histogram vector via the IWLD descriptor; (3) dimensionality reduction of the histogram vector is performed by the PCA algorithm. Similarly, sample classification consists of two steps: (1) based on the data sets of Yeast, H. pylori and Human, the DVM model is trained; (2) the trained DVM model is then employed to predict PPIs and its performance is evaluated accordingly. Furthermore, an SVM model is also constructed for predicting PPIs on the Human data set and the corresponding evaluation is performed. The overall flow chart of our method is shown in Figure 7.

Evaluation criteria
To evaluate the performance of the predictive methods, four criteria, namely the accuracy (Acc), sensitivity (Sen), precision (Pre), and Matthews's correlation coefficient (MCC), were introduced, which can be calculated as follows:

$$\begin{aligned}
\mathrm{Acc} &= \frac{TP + TN}{TP + FP + TN + FN}, \qquad
\mathrm{Sen} = \frac{TP}{TP + FN}, \qquad
\mathrm{Pre} = \frac{TP}{TP + FP}, \\
\mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}
\end{aligned} \tag{1}$$

where TP (true positive) represents the number of interacting protein pairs predicted correctly while FP (false positive) denotes the number of non-interacting protein pairs predicted falsely. Similarly, TN (true negative) stands for the number of non-interacting protein pairs predicted correctly, and FN (false negative) denotes the number of interacting protein pairs predicted falsely. The receiver operating characteristic (ROC) curve is a standard technique for summarizing classifier performance over a range of trade-offs between true-positive and false-positive rates. In our study, ROC curves were also calculated to evaluate the validity of the prediction models.
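The four criteria follow directly from the confusion-matrix counts, as the short sketch below shows. The function name `evaluate` and the example counts are made up for illustration and do not come from the experiments; the sketch also assumes that none of the denominators is zero.

```python
import math

def evaluate(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, precision, MCC) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)
    pre = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return acc, sen, pre, mcc

# Hypothetical counts: 105 interacting and 95 non-interacting test pairs.
acc, sen, pre, mcc = evaluate(tp=90, fp=10, tn=85, fn=15)
print(f"Acc={acc:.4f} Sen={sen:.4f} Pre={pre:.4f} MCC={mcc:.4f}")
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, so it stays informative when the positive and negative sets differ in size, as in the Human data set.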

ACKNOWLEDGMENTS
This work is supported in part by the National Science Foundation of China, under Grants 61373086, 11301517, 61572506, 11301517 and 11631014, in part by Guangdong Natural Science Foundation, under Grant