An ensemble approach for large-scale identification of protein-protein interactions using the alignments of multiple sequences

Protein-protein interactions (PPIs) are not only a critical component of various biological processes in cells, but also the key to understanding the mechanisms leading to healthy and diseased states in organisms. However, identifying the interactions among proteins through biological experiments is time-consuming and costly. Hence, developing more efficient computational methods has rapidly become an attractive topic in the post-genomic era. In this paper, we propose a novel method for inferring protein-protein interactions from protein amino acid sequences alone. Specifically, each protein amino acid sequence is first transformed into a Position-Specific Scoring Matrix (PSSM) generated from multiple sequence alignments; the Pseudo PSSM (PsePSSM) is then used to extract feature descriptors. Finally, an ensemble Rotation Forest (RF) learning system is trained to predict PPIs based solely on protein sequence features. When applied to three benchmark data sets (Yeast, H. pylori, and independent data sets), our method achieves average accuracies of 98.38%, 89.75%, and 96.25%, respectively. To further evaluate the prediction performance, we also compare the proposed method with other methods on the same benchmark data sets. The experimental results demonstrate that the proposed method consistently outperforms other state-of-the-art methods. Therefore, our method is effective and robust and can serve as a useful tool for exploring and discovering new relationships between proteins. A web server is publicly available at http://202.119.201.126:8888/PsePSSM/ for academic use.


INTRODUCTION
Protein-protein interactions (PPIs) play an important role in almost every cellular process [1,2]. A variety of biochemical activities carried out through PPIs are the foundation of life, such as immune response, regulation of transcription and translation, DNA replication, and endocrine function [3]. In recent decades, in order to understand the mechanisms of these biochemical activities, a variety of experimental methods have been designed to detect the interactions between proteins, for example, two-hybrid systems [4,5], mass spectrometry [6,7], immunoprecipitation [8], and protein chip technology [9]. However, identifying the interactions among proteins through biological experiments alone is time-consuming, costly, and limited to small scale. Therefore, there is an urgent need for computational methods that predict protein-protein interactions efficiently and at large scale.
So far, a number of computational methods have been proposed to predict protein-protein interactions. These methods can be roughly divided into three types: structure-based methods [10][11][12][13], sequence-based methods [14][15][16][17][18][19][20][21][22][23][24][25] and function-annotation-based methods [26][27][28][29]. Among them, sequence-based approaches require neither protein structure information nor other prior knowledge, which has attracted growing interest from researchers. For example, Martin et al. developed a computational model to identify the interactions among proteins by using the signature descriptor [30]. This model achieved accuracies of 70% and 80% when tested on the H. pylori and Yeast data sets by 10-fold cross-validation. Shen et al. proposed the conjoint triad approach to predict human PPIs by considering the local environments of residues [16]. In their experiment, the accuracy of this model reached 83.9%. Ahmad et al. proposed a neural-network-based algorithm to predict DNA-binding sites, which used the evolutionary information of amino acid sequences in the form of their position-specific scoring matrices [31].
In this paper, we propose a novel sequence-based computational method for predicting potential protein-protein interactions. Specifically, we first convert each protein amino acid sequence into a Position-Specific Scoring Matrix (PSSM) [32], which contains evolutionary information; we then use the Pseudo Position-Specific Score Matrix (PsePSSM) [33][34][35] algorithm to extract features that capture more information. Finally, the Rotation Forest (RF) [36,37] classifier is applied to determine whether the proteins interact. In the experiment, the proposed method is applied to the Yeast data set, where the accuracy of five-fold cross-validation is 98.38%. We also verified the method on the Helicobacter pylori, C. elegans, E. coli, H. sapiens and M. musculus data sets, obtaining accuracies of 89.75%, 98.50%, 91.00%, 97.45% and 98.08%, respectively. To further evaluate the prediction performance, we also compare the proposed method with other existing methods. The comparison shows that the proposed method consistently outperforms other state-of-the-art methods.

Evaluation measures
Four standard criteria are used to evaluate the performance of our approach: accuracy (Accu.), sensitivity (Sen.), precision (Prec.) and the Matthews correlation coefficient (MCC). MCC represents the correlation coefficient between the observed and the predicted class. It ranges from -1 (the worst predictive model) to 1 (the best predictive model). These measures are defined as follows:

$$\mathrm{Accu.} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

$$\mathrm{Sen.} = \frac{TP}{TP + FN} \qquad (2)$$

$$\mathrm{Prec.} = \frac{TP}{TP + FP} \qquad (3)$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (4)$$

where TP denotes the number of positive samples that are correctly predicted; FP denotes the number of negative samples that are incorrectly predicted as positive; TN denotes the number of negative samples that are correctly predicted; and FN denotes the number of positive samples that are incorrectly predicted as negative. In addition, the receiver operating characteristic (ROC) [38] curve is used to assess the performance of the classifier. The default decision threshold of the classifier is 0.5; the ROC curve plots the true positive rate against the false positive rate as this threshold is varied, so that the effect of the threshold is shown graphically.
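The four measures can be computed directly from the confusion counts. The following is a minimal sketch using the standard definitions of accuracy, sensitivity, precision and MCC; the function name `ppi_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def ppi_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, precision, MCC) from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the confusion matrix is empty;
    # returning 0.0 in that case is a common convention, assumed here.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, sensitivity, precision, mcc


# Hypothetical counts for illustration only (not results from the paper).
acc, sen, prec, mcc = ppi_metrics(tp=90, fp=5, tn=95, fn=10)
```

A perfect classifier gives an MCC of 1 and a classifier that gets every sample wrong gives -1, which matches the direction of the MCC scale.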

Assessment of prediction ability
To achieve the best performance of the rotation forest, we use a grid search to tune its two parameters: K, the number of feature subsets, and L, the number of decision trees. In this study, PCA [36] was chosen as the rotation forest transformation method and the J48 decision tree [39] from the WEKA machine learning workbench was selected as the base classifier. Figure 1 shows the accuracy of the classifier under different parameter values. Figure 1 shows that the average prediction accuracy increases rapidly with L at first, and that the rate of increase slows once L is greater than 5. By contrast, the accuracy fluctuates as K increases. After a comprehensive assessment, we chose K=8 and L=5 as the optimal parameters.
In this paper, 5-fold cross-validation is used to evaluate our model. More specifically, the entire feature data set is randomly divided into five approximately equal subsets. Four of these subsets are used for training and the remaining subset for testing. The cross-validation process is repeated 5 times so that each subset is used for testing once. Table 1 lists the results of our predictions on the Yeast data set: the average accuracy, precision, sensitivity, and MCC are 98.38%, 99.92%, 96.84%, and 96.82%, respectively. The prediction accuracies of the five models are all greater than 98.17%, the precisions are greater than 99.62%, the sensitivities are greater than 96.32%, and the MCC values are greater than 96.40%. The ROC curves obtained on the Yeast data set are shown in Figure 2. In this figure, X-ray depicts
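The five-fold partition described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `five_fold_splits`, the fixed seed and the stride-based fold assignment are assumptions, and only the data set size (11188 Yeast protein pairs) comes from the paper.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding through the shuffled list gives folds whose sizes differ by at most one.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# 11188 is the number of protein pairs in the Yeast data set.
splits = list(five_fold_splits(11188))
```

Each sample index appears in exactly one test fold, so every protein pair is used for testing once and for training four times.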

The performance of the proposed method on the H. pylori data set
To better evaluate the performance of the proposed model in PPI prediction, we also tested it on the H. pylori data set. We use the same feature extraction method and the same RF parameters; the results are shown in Table 2. On the H. pylori data set, the accuracies of the 5 models are 92.45%, 88.16%, 90.05%, 89.37%, and 88.70%, respectively. Table 2 shows that our model performs well, with an average accuracy of 89.75%, precision of 90.18%, sensitivity of 89.12%, and MCC of 81.62%. Additionally, Table 2 shows that the standard deviations of accuracy, precision, sensitivity and MCC are as low as 0.0167, 0.0274, 0.0183 and 0.0269, respectively. The ROC curves are shown in Figure 3.
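As a quick arithmetic check, the average accuracy and its standard deviation can be recomputed from the five per-fold accuracies listed above. The sketch assumes that the reported standard deviation is the sample standard deviation (n - 1 denominator) expressed as a fraction rather than a percentage; the paper does not state this explicitly.

```python
import statistics

# Per-fold accuracies (%) on the H. pylori data set, as listed in the text.
fold_accuracies = [92.45, 88.16, 90.05, 89.37, 88.70]

mean_accuracy = statistics.mean(fold_accuracies)
# Sample standard deviation (n - 1 denominator), converted from percent to a fraction.
std_accuracy = statistics.stdev(fold_accuracies) / 100

print(round(mean_accuracy, 2))  # 89.75
print(round(std_accuracy, 4))   # 0.0167
```

Under this assumption the values agree with the reported average accuracy of 89.75% and standard deviation of 0.0167.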

Comparison with previous methods
In recent years, many researchers have proposed models to predict PPIs and achieved good results. To further evaluate the prediction performance, we compare the proposed method with these methods on the same benchmark data sets. In addition, the support vector machine (SVM), a state-of-the-art classification algorithm, has been successfully used to predict PPIs. In this experiment, we therefore also compare the classification performance of the Rotation Forest classifier and the SVM classifier on the Yeast data set. Table 3 and Table 4 summarize the results of these comparisons. Table 3 shows the average prediction results of the different models on the Yeast data set: the accuracies obtained by the other methods lie between 75.08% and 89.33%, whereas the average accuracy obtained by our method is 98.38%. In the comparison of classifiers, the accuracy obtained with the rotation forest classifier is higher than that obtained with the support vector machine classifier. Table 4 shows the performance of different methods on the H. pylori data set. We can see from the Table 4

Performance on independent data sets
After completing the experiments on the Yeast and H. pylori data sets, we tested the performance of the proposed method on the independent data sets (C. elegans, E. coli, H. sapiens and M. musculus). In this experiment, we use the entire Yeast data set as the training set and the independent data sets as test sets to predict protein-protein interactions. Table 5 lists the accuracy of our method on the four data sets. It can be seen from the table that the highest accuracy of the proposed method is 98.50% on the C. elegans data set, and even the lowest accuracy achieved on the E. coli data

Validate potential protein-protein interactions from the PPIs database
After evaluating the effectiveness of the proposed model by 5-fold cross-validation, we calculate the interaction probability for all potential protein-protein pairs in the Yeast data set. Specifically, all the negative and positive data used in the 5-fold cross-validation experiments are used for training, and all the unknown protein-protein pairs are used as the test set. The predicted protein pairs ranked in the top 100 of the potential PPI list are considered highly probable protein-protein interactions and are further verified against three public databases (i.e. DIP [47], MINT [48] and IntAct [49]). These databases have been supplemented with newly detected protein-protein interactions since the gold standard data used in this study were collected in 2007. The predicted probabilities for the top 100 potential PPIs in Yeast are given in Supplementary Table S1. As shown in Table 6, 15 new protein-protein interactions are confirmed. Note that high-ranked interactions that have not yet been reported may also exist in reality. Based on these results, we anticipate that the proposed model can be used to predict new protein-protein interactions.
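The ranking and verification step can be sketched as follows. This is an illustration only: the function name `top_k_pairs`, the protein identifiers, the probabilities and the `known_interactions` set are all hypothetical stand-ins for the model output and for the contents of DIP, MINT and IntAct.

```python
def top_k_pairs(scored_pairs, k=100):
    """Return the k protein pairs with the highest predicted interaction probability."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical identifiers and probabilities for illustration only.
candidates = [(("P1", "P2"), 0.91), (("P3", "P4"), 0.42), (("P5", "P6"), 0.99)]
top_two = top_k_pairs(candidates, k=2)

# Stand-in for interactions found in public databases; frozenset makes the
# lookup independent of the order in which the two proteins are listed.
known_interactions = {frozenset(("P2", "P1"))}
confirmed = [pair for pair, _ in top_two if frozenset(pair) in known_interactions]
```

Pairs that rank highly but are absent from `known_interactions` are not necessarily wrong; as noted above, they may be interactions that have not been reported yet.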

Data sources
We evaluate our model on the publicly available Saccharomyces cerevisiae data set introduced by Guo et al. [17]. The PPI data were extracted from the S. cerevisiae core subset of the Database of Interacting Proteins (DIP) [47], version DIP_20070219. The reliability of the core subset was tested by two algorithms, the paralogous verification method (PVM) and the expression profile reliability (EPR) [50]. Protein pairs containing a protein with fewer than 50 residues were removed. To reduce pairwise sequence redundancy, the CD-HIT tool [51,52] was applied with a threshold of 40% identity. Eventually, 5594 protein pairs remain and form the positive data set. The negative data set consists of 5594 additional protein pairs whose members have different subcellular localizations. The final data set therefore contains 11188 protein pairs, half positive and half negative.
For comparison, we further assess the capabilities of our model on the H. pylori data set, which was described by Rain et al. [53]. It can be downloaded at http://www.cs.sandia.gov/~smartin/software.html. This data set contains 2916 protein pairs, half of which are interacting pairs and half non-interacting pairs. It provides a platform for comparing different methods [30,42,43,45,46].

Position-specific scoring matrix
The Position-Specific Scoring Matrix (PSSM), initially introduced by Gribskov et al. [32], is used to detect distantly related proteins. It has been applied successfully to protein secondary structure prediction [54], prediction of disordered regions [55], and protein binding site prediction [56]. A PSSM is an L × 20 matrix, where L is the length of the protein sequence, which can be denoted as

$$P = \{a_{i,j} : i = 1, \ldots, L;\ j = 1, \ldots, 20\} \qquad (5)$$

where $a_{i,j}$, in the i-th row of the PSSM, represents the probability of the i-th residue being mutated into type j of the 20 native amino acids during the evolution of the protein, as derived from multiple sequence alignments.
To extract the evolutionary information, each protein sequence in the data set is aligned against the SwissProt database with the Position-Specific Iterated BLAST (PSI-BLAST) [57] tool to search for homologous sequences. For each residue, PSI-BLAST returns a 20-dimensional vector indicating the probabilities of conservation against mutations to the 20 different amino acids, including the residue's own type. To obtain broad and highly homologous sequences, we set the e-value to 0.001 and the number of iterations to 3. PSI-BLAST and the SwissProt database can be downloaded at http://blast.ncbi.nlm.nih.gov/Blast.cgi.
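For readers who want to reproduce this step, the settings above map onto a PSI-BLAST invocation roughly as sketched below. The sketch assumes the NCBI BLAST+ `psiblast` command-line program and a locally formatted SwissProt database; the paper does not state which PSI-BLAST build or command was used, and the function name and file paths are hypothetical. The function only assembles the argument list and does not run anything.

```python
def build_psiblast_command(fasta_path, db_path, pssm_path, evalue=0.001, iterations=3):
    """Assemble a BLAST+ psiblast argument list that writes an ASCII PSSM file."""
    return [
        "psiblast",
        "-query", fasta_path,             # single protein sequence in FASTA format
        "-db", db_path,                   # assumed: locally formatted SwissProt database
        "-evalue", str(evalue),           # e-value threshold from the text
        "-num_iterations", str(iterations),  # number of iterations from the text
        "-out_ascii_pssm", pssm_path,     # where the PSSM is written
    ]


# Hypothetical file names for illustration only.
command = build_psiblast_command("protein.fasta", "swissprot", "protein.pssm")
# To execute it (requires BLAST+ to be installed):
#   import subprocess; subprocess.run(command, check=True)
```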

Pseudo position-specific score matrix
To reduce the loss of sequence-order information, we adopt the concept of pseudo amino acid composition introduced by Chou et al. [58]. The PSSM scores are first standardized as

$$a_{i,j} = \frac{a^{0}_{i,j} - \frac{1}{20}\sum_{k=1}^{20} a^{0}_{i,k}}{\sqrt{\frac{1}{20}\sum_{u=1}^{20}\left(a^{0}_{i,u} - \frac{1}{20}\sum_{k=1}^{20} a^{0}_{i,k}\right)^{2}}} \qquad (6)$$

where $a^{0}_{i,j}$ represents the original scores directly generated by PSI-BLAST, which are typically positive or negative integers. The standardized scores $a_{i,j}$ have zero mean over the 20 amino acids and remain unchanged if they are passed through the same conversion procedure again. A positive score implies that the corresponding mutation appears more frequently in the alignment than expected by chance, and a negative score, on the contrary, implies that the corresponding mutation appears less frequently in the alignment than expected by chance. However, according to the definition of the PSSM, proteins of different lengths yield matrices with different numbers of rows. Equation 7 is therefore employed to represent the PSSM of a protein sample, so that the PSSM descriptor has a uniform size:

$$P_{PSSM} = [\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_{20}]^{T}, \qquad \bar{a}_j = \frac{1}{L}\sum_{i=1}^{L} a_{i,j} \quad (j = 1, 2, \ldots, 20) \qquad (7)$$

where $\bar{a}_j$ denotes the average score of the amino acid residues in protein P being mutated into amino acid type j during evolution. However, if only $P_{PSSM}$ is used to represent the protein P, all the sequence-order information is lost. To prevent this, the concept of pseudo amino acid composition is used to extend Equation 7. Hence, the PsePSSM descriptor is defined as

$$\theta_j^{\xi} = \frac{1}{L - \xi}\sum_{i=1}^{L - \xi}\left(a_{i,j} - a_{i+\xi,j}\right)^{2} \quad (j = 1, 2, \ldots, 20;\ \xi < L) \qquad (8)$$

$$P^{\xi}_{PsePSSM} = [\bar{a}_1, \ldots, \bar{a}_{20}, \theta_1^{\xi}, \ldots, \theta_{20}^{\xi}]^{T} \qquad (9)$$

where $\theta_j^{\xi}$ is the correlation factor of amino acid type j.
Although the value of $\xi$ can be 0, 1, 2, …, or 49, considering time cost and efficiency we set $\xi$ to 0, 1, 2, 3 and 4, so a 200-dimensional feature vector is eventually used in this study.
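The descriptor described above can be sketched in plain Python. Several points are assumptions rather than statements from the paper: the correlation factor is taken to be the mean squared difference between standardized scores that are a fixed number of positions (the lag) apart, as in the commonly used PsePSSM definition; each lag is assumed to contribute 40 components (20 column means plus 20 correlation factors), which is one way to arrive at the stated 200 dimensions with lags 0 to 4; and the function names and the random toy matrix are hypothetical.

```python
import math
import random


def standardize_row(row):
    """Standardize one PSSM row to zero mean and unit spread over its 20 scores."""
    mean = sum(row) / len(row)
    std = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
    # A constant row has zero spread; map it to zeros to avoid dividing by zero.
    return [(v - mean) / std if std else 0.0 for v in row]


def pse_pssm(pssm, lags=(0, 1, 2, 3, 4)):
    """pssm: list of L rows of 20 raw scores. Every lag must be smaller than L."""
    rows = [standardize_row(r) for r in pssm]
    length, n_cols = len(rows), len(rows[0])
    # Column means of the standardized matrix (length-independent part).
    means = [sum(r[j] for r in rows) / length for j in range(n_cols)]
    features = []
    for lag in lags:
        # Correlation factors: mean squared difference between rows `lag` apart.
        theta = [
            sum((rows[i][j] - rows[i + lag][j]) ** 2 for i in range(length - lag))
            / (length - lag)
            for j in range(n_cols)
        ]
        features.extend(means + theta)
    return features


# Toy 60 x 20 matrix of random integer scores, for illustration only.
rng = random.Random(0)
toy_pssm = [[rng.randint(-5, 5) for _ in range(20)] for _ in range(60)]
features = pse_pssm(toy_pssm)
```

Because the output length depends only on the number of lags, proteins of any length (above the largest lag) map to vectors of the same size, which is what makes the descriptor usable as classifier input.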

Rotation forest
Rotation Forest (RF) is a recently proposed ensemble classifier that uses independently trained decision trees. Its main idea is to encourage individual accuracy and diversity within the ensemble simultaneously. To generate the training samples for a base classifier, the feature set is randomly divided into K subsets. A linear transformation is applied to each subset, and all the principal components are retained to preserve the information in the data. The rotation yields new features for the training samples, which ensures diversity. Hence the rotation forest can enhance the accuracy of the individual classifiers and the diversity of the ensemble at the same time.
Let X be the training sample set, an N × n matrix of N samples with n features, Y be the corresponding labels, and S be the feature set. Assume that the rotation forest contains L decision trees and that the feature set is randomly divided into K subsets of equal size. The preprocessing steps for the i-th classifier are as follows. First, select an appropriate parameter K that is a factor of n, and randomly divide S into K disjoint subsets, so that each feature subset contains C = n/K features. Second, from the training data set X, select the columns corresponding to the features in the j-th subset $R_{i,j}$ to form a new matrix $X_{i,j}$. Then a bootstrap subset of objects, three-quarters the size of the data set, is drawn to construct a new training subset $X'_{i,j}$. The linear transformation is applied to each $X'_{i,j}$ with all principal components retained, and the resulting rotated features are used to train the classifier. In the prediction phase, the test sample x is assigned to the class with the greatest confidence. The schematic diagram of the prediction model is shown in Figure 4.
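The two preprocessing steps that the text spells out, the random feature partition and the three-quarter bootstrap, can be sketched as follows. This covers only those two steps; the PCA rotation and the training of the J48 trees are not shown. The function names are hypothetical, sampling with replacement is an assumption, and n = 200 features with K = 8 subsets is used purely as an example.

```python
import random


def split_features(n_features, k, rng):
    """Randomly partition feature indices into k disjoint subsets of equal size."""
    if n_features % k != 0:
        raise ValueError("k must be a factor of the number of features")
    order = list(range(n_features))
    rng.shuffle(order)
    size = n_features // k  # C = n / K features per subset
    return [order[i * size:(i + 1) * size] for i in range(k)]


def bootstrap_rows(n_samples, rng, fraction=0.75):
    """Draw row indices with replacement, three-quarters the size of the data set."""
    size = int(n_samples * fraction)
    return [rng.randrange(n_samples) for _ in range(size)]


rng = random.Random(0)
subsets = split_features(200, 8, rng)  # 8 subsets of 25 features each
rows = bootstrap_rows(100, rng)        # 75 row indices drawn from 100 samples
```

Repeating both steps with a fresh random state for every tree is what gives each base classifier a different rotated view of the same training data.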

CONCLUSION
In this paper, we proposed a novel method to predict protein-protein interactions using the Rotation Forest classifier. To preserve as much information as possible, we first convert the protein amino acid sequences into PSSMs, then extract features using the PsePSSM algorithm, and finally determine whether a protein pair interacts using the RF classifier.
To evaluate the performance of the proposed method, we applied it to the Yeast, H. pylori and independent data sets. In addition, we compared the proposed method with other existing methods. The experimental results demonstrate that the proposed method is feasible and effective for predicting protein interactions. The low standard deviations of the evaluation criteria indicate that our method is stable and robust. In future studies, we will focus on improving the classification algorithm to achieve higher predictive accuracy and lower time consumption in predicting protein-protein interactions.

WEBSERVER
For the convenience of researchers, we have built a web server that implements the proposed prediction model. The web server provides the source code and the Yeast data sets used in this article for download. It can be accessed at http://202.119.201.126:8888/PsePSSM/. Users can query the predicted results for the Yeast data sets through the webpage and receive the prediction results by e-mail.