Systematic identification of self-interacting proteins with ensemble classifiers using evolutionary information

As the center of most biological processes, Protein-Protein Interactions (PPIs) underlie the formation of biological mechanisms. Deregulation of PPIs results in many diseases, including cancer and pernicious anemia. Self-interacting Proteins (SIPs) are a special and important type of PPI. Although experimental methods have generated a large amount of SIP data, the self-interacting proteins detected so far cover only a small part of the complete network. There is therefore a great need for computational methods that predict SIPs efficiently and accurately. In the present study, we introduce a novel computational method that predicts SIPs from protein sequence information. Specifically, each protein sequence is converted into a Position-Specific Scoring Matrix (PSSM) containing evolutionary information. An effective feature extraction approach, Auto Covariance (AC), is then employed to construct a feature set. Finally, an improved Rotation Forest (RF) model is used to remove noise from the feature set and give prediction results. On yeast and human SIP data sets, the proposed method achieves accuracies of 80.50% and 93.70%, respectively. Our method also performs well when compared with the SVM classifier and other existing methods. Consequently, the proposed method can be considered a promising model for predicting SIPs. In addition, to support further research, a user-friendly web server is freely available for academic use at http://www.proteininteraction.cn/sip/.


INTRODUCTION
As both the material basis of life and the main bearers of life activities, proteins affect cells through interactions with other components. Among these interactions, Protein-Protein Interactions (PPIs) have attracted particular attention from researchers because of their critical roles in living organisms. Deregulation of PPIs results in many diseases, including cancer and pernicious anemia. The PPI data accumulated from numerous earlier small-scale experiments and some recent large-scale experiments allow us to establish proteome-wide PPI networks [1][2][3][4], which help to deepen the system-level understanding of cell structure and function and provide a theoretical basis for the discovery of new drug targets and for drug design.
One special type of PPI involves self-interacting proteins (SIPs). SIPs are proteins of which two or more copies can interact with each other. When the two interaction partners are identical copies encoded by the same gene, the interaction can result in the formation of a homodimer. When more than two copies of a protein interact with each other, they form a homotrimer or a higher-order homo-oligomer. Recent research has shown that homo-oligomerization plays important roles in a variety of vital biological processes, such as immune response, enzyme activation, signal transduction and gene expression regulation [5][6][7][8][9]. Ispolatov et al. [10] noted that SIPs occupy a significant position in protein interaction networks (PINs), meaning that SIPs are very likely to interact with a large number of other proteins. This also shows their functional importance for cellular systems. Pereira-Leal and collaborators carried out a genome-wide, cross-species analysis of the origins and evolution of protein complexes. Their conclusion indicates that many protein complexes evolved first through self-interactions and then through the duplication of these self-interacting proteins [11]. In addition, self-interaction is one of the key factors that regulate protein function. Through self-interactions, the functional diversity of proteins can be greatly expanded without increasing the size of the genome [12].
Recently, some computational methods for the prediction of PPIs have been developed [13,14]. By analyzing the relationship between codon pair usage and PPIs in yeast, Zhou et al. concluded that the codon pair usage of interacting protein pairs differs greatly from random expectation. Motivated by this finding, they proposed a novel method named CCPPI, which predicts PPIs by using codon pair frequency differences as the input of a Support Vector Machine [15]. Based on pairwise similarity theory, Zaki et al. proposed a simple and efficient method for predicting PPIs that uses only the protein primary structure [16]. You et al. designed a method called PCA-EELM (Principal Component Analysis-Ensemble Extreme Learning Machine) that predicts PPIs using only protein sequence information. On the PPI data of Saccharomyces cerevisiae, this model yields 87.00% prediction accuracy, 86.15% sensitivity and 87.59% precision [17]. These methods generally take into account correlational information between protein pairs, such as coevolution, co-localization and co-expression. However, such information is not available when predicting protein self-interaction. Furthermore, the data sets used by these methods do not contain interactions between identical partners, making them unsuitable for SIP prediction. Therefore, there is a strong motivation to design efficient and reliable computational methods for the large-scale prediction of SIPs.
In this study, building on the Rotation Forest (RF) algorithm [18,19], we designed an improved RF-based approach (ImRF) [20,21] that predicts SIPs using only protein amino acid sequences. First, the candidate self-interacting protein sequence is converted into a Position-Specific Scoring Matrix (PSSM) [22]. Second, an effective feature extraction method called Auto Covariance (AC) [23] is used to extract a feature vector from the PSSM. Finally, the features retained by weighted selection are fed into the RF classifier to predict SIPs. In the experiments, the proposed model was evaluated on yeast and human SIP data sets. The results show that our model achieved 80.50% and 93.70% prediction accuracy with 85.30% and 94.70% specificity on these two data sets, respectively. To further evaluate the performance of our model, we compared it with other existing methods and with the state-of-the-art support vector machine (SVM) classifier on the yeast and human data sets. The results indicate that our model can effectively extract useful information from large amounts of data and produce better prediction accuracy.

Performance of the proposed method
To prevent over-fitting from affecting the measured performance of our model, we divided each data set into a training set and an independent test set. Taking the human data set as an example, we randomly selected about 1/6 of the samples from the whole human data set as the independent test set. Since negative instances far outnumber positive ones in the human data set, we randomly selected negative samples from the remaining human negative data to build a training set with a positive-to-negative ratio of about 1:1. To ensure the reliability of the results, the independent test set and training set were constructed 5 times, and the experiments were repeated on each construction. The final results are expressed as mean and standard deviation. The same strategy was applied to the yeast data set. To guarantee a fair outcome, several parameters of our model were optimized. Through grid search, the parameter lg of the feature extraction method AC was set to 5. In the improved rotation forest algorithm, the feature selection rate was r = 0.7, the number of subsets K = 5, and the number of decision trees L = 7.
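The splitting procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `make_split`, the use of integer identifiers as samples, and the choice to hold out the same fraction of each class (a stratified hold-out) are assumptions made for the example.

```python
import random

def make_split(positives, negatives, test_fraction=1 / 6, seed=0):
    """Hold out about `test_fraction` of the samples, then balance training 1:1.

    Assumption: the hold-out is stratified, so the test set keeps the
    natural class imbalance of the full data set.
    """
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)

    n_pos_test = round(len(pos) * test_fraction)
    n_neg_test = round(len(neg) * test_fraction)
    test = [(s, 1) for s in pos[:n_pos_test]] + [(s, 0) for s in neg[:n_neg_test]]

    train_pos = pos[n_pos_test:]
    remaining_neg = neg[n_neg_test:]
    # Undersample the remaining negatives to match the training positives.
    train_neg = rng.sample(remaining_neg, min(len(train_pos), len(remaining_neg)))
    train = [(s, 1) for s in train_pos] + [(s, 0) for s in train_neg]
    rng.shuffle(train)
    return train, test

# Five independent constructions, sized like the human data set
# (1,441 positives and 15,938 negatives, represented here by integer ids).
splits = [make_split(range(1441), range(1441, 1441 + 15938), seed=s) for s in range(5)]
```

Each element of `splits` is one (training set, test set) pair; a model would be trained and evaluated once per pair and the five scores summarized by their mean and standard deviation.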
The results of our method on the yeast and human data sets are shown in Tables 1, 2. It can be seen from Table 1 that the accuracies of the five experiments on the yeast data set are all at least 79.09%. Specifically, the accuracies of the individual experiments are 79.89%, 79.09%, 80.91%, 82.95% and 79.55%. The average accuracy, specificity, sensitivity, and MCC are 80.50%, 85.30%, 42.60%, and 23.20%, respectively, with standard deviations of 1.50%, 2.10%, 3.40%, and 1.60%. Table 2 lists the experimental results of our method on the human data set. The accuracies of the five experiments are 93.88%, 92.72%, 93.56%, 94.44%, and 94.40%. The average accuracy, specificity, sensitivity, and MCC are 93.70%, 94.70%, 34.00%, and 15.40%, respectively, with standard deviations of 0.60%, 0.70%, 3.80%, and 0.90%. The ROC curves obtained on the yeast and human data sets are shown in Figures 1, 2. In those figures, the x-axis depicts the False Positive Rate (FPR) and the y-axis the True Positive Rate (TPR).
With an appropriate classifier and feature extraction method, our method achieves good results when predicting SIPs, as Table 1 and Table 2 show. The improvement in prediction accuracy may be attributed to the following three reasons: (1) PSSM has the advantage of resisting background noise and reducing the redundancy of prediction results. It retains enough prior information about protein sequences, thus helping to improve the prediction accuracy. (2) The feature extraction method AC takes neighboring effects into account, which makes it possible to discover patterns that run through entire sequences. (3) High-dimensional data not only increase the computational cost but are also likely to contain redundant information. In this experiment, we use the improved rotation forest method to calculate the weight of each feature and remove the features with small weights. This increases the proportion of useful information and helps to improve the performance of the classifier. The experimental results show that these factors help in the prediction of SIPs.
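As a quick check of how the summary statistics relate to the per-run values, the short sketch below recomputes the mean and standard deviation of the five yeast accuracies quoted above. The use of the sample standard deviation (n - 1 in the denominator) is an assumption; the text does not say which estimator was used.

```python
import statistics

# Accuracies (%) of the five yeast experiments, as quoted in the text.
yeast_acc = [79.89, 79.09, 80.91, 82.95, 79.55]

mean_acc = statistics.mean(yeast_acc)
std_acc = statistics.stdev(yeast_acc)  # sample standard deviation (n - 1)

print(f"mean = {mean_acc:.2f}%, std = {std_acc:.2f}%")
```

Under this assumption the values come out at about 80.48% and 1.54%, which agree to one decimal place with the reported 80.50% and 1.50%.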

Comparison with the SVM Classifier
Although the experimental results show that our proposed prediction model performs well, we compare it with the state-of-the-art support vector machine (SVM) classifier to gain a clearer understanding of our classifier. In this experiment, we used the same feature extraction method and applied the SVM to the yeast and human data sets. We use the LIBSVM tool [24] to run the SVM classifier. The SVM parameters determined by grid search are c = 10 and g = 10; all other parameters keep their default values.
Tables 3, 4 list the prediction results of the SVM classifier on the yeast and human data sets, respectively. It can be seen from Table 3 that the average accuracy of SVM on the yeast data set is 78.10%, with the five experiments yielding 78.64%, 77.27%, 78.07%, 78.07%, and 78.30%. In contrast, the improved rotation forest classifier achieved an average accuracy of 80.50%. Similarly, as displayed in Table 4, the average accuracy of SVM on the human data set is 91.30%, with the five experiments yielding 91.72%, 92.16%, 90.16%, 91.48%, and 91.12%, whereas the accuracy of the improved rotation forest classifier is 93.70%. The ROC curves obtained on the yeast and human data sets are shown in Figures 3, 4.

Comparison with other methods
To further evaluate the performance of the proposed method, we also compared our final model with three existing SIP predictors, SLIPPER [25], CRS [26] and SPAR [26], and three PPI predictors, DXECPPI [27], PPIevo [28] and LocFuse [29], on the yeast and human data sets. Tables 5, 6 list the results of these methods on the yeast and human data sets. We can observe from Table 5 that the proposed method performs well on the yeast data set: its accuracy is second only to the highest and is 6.84% higher than the average accuracy of the other six methods. Similarly, as shown in Table 6, the prediction results of the proposed method on the human data set are clearly higher than those of the other six methods. Its accuracy is 1.61% higher than that of the best competing method and 16.31% higher than the average of the other six methods. These results show that the proposed method improves accuracy over the existing methods and is suitable for predicting SIPs.

Web server
To make the proposed model convenient to use, a user-friendly web server has been made available at http://www.proteininteraction.cn/sip/. The web server mainly provides prediction of protein self-interaction for the yeast data set. Users input yeast protein sequences on the web page and enter an e-mail address at which to receive the results. After the submit button is pressed, the server automatically predicts whether the proteins can self-interact, based on our proposed method. Once the server has finished computing, users can check their mailbox for an e-mail that shows the predicted results.

CONCLUSIONS
In this paper, we proposed a novel computational method for the large-scale and efficient prediction of protein self-interaction from protein sequence information, which combines the feature extraction method AC with an improved rotation forest classifier. To evaluate the performance of the proposed method, we applied it to the yeast and human data sets. We also compared it with the state-of-the-art support vector machine classifier and with other popular methods commonly used for PPI prediction. In these comparisons, our method achieved good performance. The experimental results on the yeast and human data sets show that the prediction accuracy achieved by our method is significantly improved. In addition, for the convenience of researchers, we constructed a user-friendly web server based on the proposed method. It provides users with a prediction of whether proteins can self-interact. In future research, we will focus on more effective feature extraction methods and machine learning algorithms to improve the prediction accuracy.
In these experiments, we extracted only protein sequences in which the two interaction partners are exactly the same and the interaction type is 'direct interaction' in the relevant databases. In total, we obtained 2,994 human protein self-interaction instances.
To evaluate the performance of our model, we constructed the data sets through the following steps [26]. First, we kept only proteins with 50 to 5,000 residues and removed the remaining proteins from the whole human proteome. Second, to ensure the quality of the protein self-interaction data, each sample in the positive data set must satisfy one of the following conditions: (1) at least two publications reported the protein self-interaction; (2) the protein is defined as a homo-oligomer (including homodimer and homotrimer) in UniProt; (3) at least two large-scale experiments or one small-scale experiment detected the self-interaction. Finally, to construct the negative data set, we removed from the whole human proteome the predicted SIPs annotated in UniProt and all types of SIPs (including proteins annotated with the broader 'physical association' as well as 'direct interaction'). As a result, the human SIP data set consists of 1,441 positive samples and 15,938 negative samples. We used the same strategy to construct the yeast data set, which contains 710 positive samples and 5,511 negative samples.
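The length filter and the three quality conditions for positive samples can be written as two small predicates. The sketch below is only illustrative: the `ProteinRecord` class and all of its field names are hypothetical, since the text does not describe how the annotations are stored.

```python
from dataclasses import dataclass

@dataclass
class ProteinRecord:
    """Hypothetical record; the field names are illustrative only."""
    sequence: str
    n_publications: int = 0          # publications reporting the self-interaction
    uniprot_homo_oligomer: bool = False
    n_large_scale: int = 0           # large-scale experiments detecting it
    n_small_scale: int = 0           # small-scale experiments detecting it

def length_ok(rec, lo=50, hi=5000):
    """Keep proteins whose length is between 50 and 5,000 residues."""
    return lo <= len(rec.sequence) <= hi

def is_reliable_positive(rec):
    """A positive sample must satisfy at least one of the three conditions."""
    return (
        rec.n_publications >= 2
        or rec.uniprot_homo_oligomer
        or rec.n_large_scale >= 2
        or rec.n_small_scale >= 1
    )

example = ProteinRecord(sequence="M" * 120, n_large_scale=1)
keep_as_positive = length_ok(example) and is_reliable_positive(example)
```

Here `example` passes the length filter but is supported by a single large-scale experiment only, so it would not be kept as a positive sample.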

Position-specific scoring matrix
A position-specific scoring matrix (PSSM) is generated from a set of sequences that share structural or sequence similarity. Initially introduced by Gribskov et al. [22], it is used for detecting distantly related proteins. PSSM has been applied successfully in areas such as protein secondary structure prediction [36], protein binding site prediction [37], and prediction of disordered regions [38]. A PSSM is an N × 20 matrix, which can be denoted as

$$M = \{ M_{i,j} : i = 1, \ldots, N,\; j = 1, \ldots, 20 \}$$

where N represents the length of the protein sequence and 20 is the number of amino acids. Each entry $M(i, j)$ is defined in terms of $e_{i,j}$, which represents the probability that the ith residue is mutated into the jth native amino acid during the evolutionary process captured by the multiple sequence alignment. To generate the PSSM of evolutionary information, we run the Position-Specific Iterated BLAST (PSI-BLAST) tool [39] on each protein sequence. For each residue position, PSI-BLAST returns a 20-dimensional vector that indicates the probabilities of conservation against mutation to the 20 different amino acids, including the residue's own type. To obtain broad and highly homologous sequences, in this study we set the e-value to 0.001, the number of iterations to 3, and the search database to SwissProt. The PSI-BLAST application and the SwissProt database can be downloaded from http://blast.ncbi.nlm.nih.gov/Blast.cgi.
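For illustration, the PSI-BLAST settings above map onto the command-line options of the `psiblast` program from NCBI BLAST+ roughly as follows. The option names are those of BLAST+; the file names and the local database name `swissprot` are hypothetical, and the sketch only builds the argument list rather than claiming to reproduce the exact invocation used in this study.

```python
def psiblast_command(fasta_path, pssm_path, db="swissprot"):
    """Build a psiblast argument list with e-value 0.001 and 3 iterations.

    `db` is assumed to name a locally formatted SwissProt BLAST database.
    """
    return [
        "psiblast",
        "-query", fasta_path,            # single-sequence FASTA file
        "-db", db,
        "-evalue", "0.001",
        "-num_iterations", "3",
        "-out_ascii_pssm", pssm_path,    # text PSSM, one row per residue
    ]

cmd = psiblast_command("protein.fasta", "protein.pssm")
print(" ".join(cmd))
# With BLAST+ and the database installed, the command could be run with
# subprocess.run(cmd, check=True).
```

Building the command as a list avoids shell quoting problems when sequence file names contain spaces.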

Auto covariance
As one of the most efficient methods for analyzing the statistics of vector sequences, Auto Covariance (AC) has been widely used in the prediction of secondary structure content [40,41], protein family classification [42,43], and protein interaction prediction [23]. An AC variable measures the average correlation between two residues separated by a given distance in a protein sequence. In its expression, lg is the distance between residues, i denotes the ith amino acid, L denotes the length of the protein sequence, and $M_{i,j}$ indicates the PSSM score of the ith residue for the jth amino acid type. With this expression, the number of AC variables, M, can be calculated as $M = lg \times N$, where N is the number of descriptors. After all the data in the database had been processed, each protein sequence was represented as a vector of AC variables, and a protein pair was characterized by concatenating the vectors of the two proteins in the pair.
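A minimal sketch of AC feature extraction from a PSSM is given below. It assumes the form of auto covariance commonly used in the sequence-based PPI literature, which is written out in the docstring; the function name and the plain list-of-lists representation of the PSSM are choices made for this example.

```python
def auto_covariance(pssm, max_lag):
    """Auto covariance features of a PSSM (assumed standard form).

    For each lag lg = 1..max_lag and each column j:

        AC(lg, j) = 1 / (L - lg) * sum_{i=1}^{L-lg}
                    (M[i][j] - mean_j) * (M[i + lg][j] - mean_j)

    where L is the sequence length and mean_j is the mean of column j.
    `pssm` is a list of L rows with the same number of columns (20 for a PSSM).
    Returns a flat list of max_lag * n_columns values.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")

    means = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features

# Toy PSSM with 30 residues and 20 columns; lg = 5 gives 5 * 20 = 100 features.
toy_pssm = [[(i * (j + 1)) % 7 - 3 for j in range(20)] for i in range(30)]
vector = auto_covariance(toy_pssm, max_lag=5)
```

Because the output length depends only on the maximum lag and the number of columns, proteins of different lengths are mapped to feature vectors of the same size.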

Feature weighted rotation forest
In this paper, an improved rotation forest algorithm is proposed, which adds weight-based feature selection to the original rotation forest. This removes features with small weights, which act as noise, increases the proportion of useful information and improves the accuracy of the classifier. We use the $\chi^2$ statistic to calculate the weight of the features. The weight of a feature F against the class feature C is calculated as follows:

$$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(p_{ij} - q_{ij})^2}{q_{ij}} \qquad (3)$$

where n is the number of values in feature F, m is the number of values in class C, $p_{ij}$ is the count of samples that take the value $d_i$ in F and the value $c_j$ in C, and $q_{ij}$ is the expected value of $d_i$ and $c_j$, defined as:

$$q_{ij} = \frac{\mathrm{count}(F = d_i) \times \mathrm{count}(C = c_j)}{N} \qquad (4)$$

where $\mathrm{count}(F = d_i)$ is the number of samples with value $d_i$ in feature F, $\mathrm{count}(C = c_j)$ is the number of samples with value $c_j$ in class C, and N is the total number of samples in the training set. To make full use of the useful information, we perform the following steps. First, use formula (3) to calculate the weight of each feature; second, sort the features in descending order of weight; finally, select new features from the full feature set in accordance with a given feature selection rate r. After executing these steps, we construct a new data set and use it as the input of the rotation forest.
Rotation forest is a popular ensemble classifier. To generate the training samples of each base classifier, the feature set is randomly divided into K subsets. A linear transformation is applied to each subset, and all the principal components are retained to preserve the precision of the data. The rotation produces new features for the training samples, which ensures the diversity of the data. Therefore, the rotation forest can enhance the accuracy of the individual classifiers and the diversity within the ensemble at the same time.
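The weighting and selection steps can be sketched in a few lines. This is an illustration under assumptions rather than the authors' implementation: it treats each feature as discrete (continuous AC values would first need to be binned, and the binning scheme is not specified in the text), and the function names are invented for the example.

```python
from collections import Counter

def chi_square_weight(feature_values, labels):
    """Chi-square statistic of one discrete feature against the class labels."""
    n_samples = len(labels)
    value_counts = Counter(feature_values)            # count(F = d_i)
    class_counts = Counter(labels)                    # count(C = c_j)
    observed = Counter(zip(feature_values, labels))   # p_ij

    weight = 0.0
    for d, n_d in value_counts.items():
        for c, n_c in class_counts.items():
            expected = n_d * n_c / n_samples          # q_ij
            weight += (observed[(d, c)] - expected) ** 2 / expected
    return weight

def select_features(columns, labels, rate):
    """Return the indices of the top `rate` fraction of features by weight."""
    weights = [chi_square_weight(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda k: weights[k], reverse=True)
    n_keep = max(1, round(len(columns) * rate))
    return sorted(ranked[:n_keep])

labels = [0, 0, 1, 1]
columns = [
    [0, 1, 0, 1],   # independent of the labels
    [0, 0, 1, 1],   # perfectly aligned with the labels
]
kept = select_features(columns, labels, rate=0.5)
```

In this toy case the second column receives the larger weight and is the one retained; with r = 0.7, the same function would keep the 70% of features with the largest weights.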
Assume that $\{x_i, y_i\}$ contains T training samples, in which $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ is an n-dimensional feature vector. Let X be the training sample set, Y the corresponding labels and F the feature set. Then X is a T × n matrix whose rows are the observed feature vectors. The feature set is randomly divided into K equal subsets, where K is a suitable factor of n. Let the number of decision trees be L; the decision trees in the forest can then be represented as $G_1, G_2, \ldots, G_L$. The algorithm proceeds as follows.
(1) Select a suitable parameter K, which is a factor of n, and randomly divide F into K disjoint subsets, each containing $n/K$ features.
(2) From the training data set X, select the columns corresponding to the features in the subset $G_{i,j}$ to form a new matrix $X_{i,j}$, then draw a bootstrap subset of objects comprising 75% of X to constitute a new training set $X'_{i,j}$.
(3) Matrix $X'_{i,j}$ is used as the input of the feature transform that produces the coefficients in a matrix $M_{i,j}$, in which the jth column of coefficients is taken as the jth characteristic component.
(4) The coefficients obtained in the matrices $M_{i,j}$ are used to construct a sparse rotation matrix $P_i$, which is expressed as follows:

$$P_i = \begin{bmatrix} M_{i,1} & 0 & \cdots & 0 \\ 0 & M_{i,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & M_{i,K} \end{bmatrix}$$

In the prediction period, the test sample x is rotated by $P_i$ and passed to the classifier $G_i$, whose output is used to determine the class to which x belongs. Next, the class confidence is calculated by means of the average combination over the L classifiers.
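The bookkeeping in steps (1), (2) and (4) and the average combination can be sketched without any numerical library, as below. This is a simplified illustration under stated assumptions: the PCA of step (3) and the training of the decision trees are left out, the helper names are invented for the example, and the final class is taken to be the one with the largest averaged confidence, as in the standard rotation forest.

```python
import random

def split_features(n_features, K, rng):
    """Step (1): randomly partition feature indices into K disjoint subsets."""
    if n_features % K != 0:
        raise ValueError("K must be a factor of the number of features")
    idx = list(range(n_features))
    rng.shuffle(idx)
    size = n_features // K
    return [idx[k * size:(k + 1) * size] for k in range(K)]

def bootstrap_rows(n_samples, rng, fraction=0.75):
    """Step (2): bootstrap sample (with replacement) of 75% of the rows."""
    return [rng.randrange(n_samples) for _ in range(int(n_samples * fraction))]

def block_diagonal(blocks):
    """Step (4): put square coefficient blocks on the diagonal, zeros elsewhere."""
    n = sum(len(b) for b in blocks)
    P = [[0.0] * n for _ in range(n)]
    offset = 0
    for b in blocks:
        for r, row in enumerate(b):
            for c, value in enumerate(row):
                P[offset + r][offset + c] = value
        offset += len(b)
    return P

def average_confidence(per_tree_confidences):
    """Average the class-confidence vectors of the trees; return (mu, best class)."""
    n_trees = len(per_tree_confidences)
    n_classes = len(per_tree_confidences[0])
    mu = [sum(tree[j] for tree in per_tree_confidences) / n_trees
          for j in range(n_classes)]
    return mu, max(range(n_classes), key=lambda j: mu[j])

rng = random.Random(0)
subsets = split_features(70, K=5, rng=rng)       # e.g. 70 features kept, K = 5
rows = bootstrap_rows(200, rng)                  # 150 row indices out of 200
P = block_diagonal([[[1, 2], [3, 4]], [[5]]])    # toy coefficient blocks
mu, predicted = average_confidence([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
```

A full implementation would also reorder the columns of the rotation matrix to match the original feature order before projecting the data, a detail omitted here for brevity.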

Performance evaluation
In this experiment, we use the prediction accuracy (Accu.), sensitivity (Sen.), specificity (Spe.), and Matthews Correlation Coefficient (MCC) as the evaluation criteria to assess the performance of our method. They are defined as:

$$\mathrm{Accu.} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sen.} = \frac{TP}{TP + FN}$$

$$\mathrm{Spe.} = \frac{TN}{TN + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, FN represent the number of true positives, true negatives, false positives and false negatives, respectively. Moreover, the receiver operating characteristic (ROC) curve [44] is used to visually display the performance of the classifier. The area under the ROC curve (AUC) is also calculated as an indicator of assessment.
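The four criteria can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions of accuracy, sensitivity, specificity and MCC; the function name and the example counts are illustrative and are not taken from the tables of this paper.

```python
import math

def evaluate(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion counts."""
    accu = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)      # assumes at least one positive sample
    spe = tn / (tn + fp)      # assumes at least one negative sample
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the confusion matrix is
    # empty; returning 0.0 in that case is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accu": accu, "sen": sen, "spe": spe, "mcc": mcc}

scores = evaluate(tp=40, tn=45, fp=5, fn=10)
print({name: round(value, 4) for name, value in scores.items()})
```

On strongly imbalanced test sets, accuracy can stay high even when sensitivity is low, which is why MCC and the ROC curve are reported alongside it.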

Figure 1: Performance comparison performed by our proposed model on Yeast SIPs data set in terms of ROC curves and AUCs. As a result, SIP-ECEI yielded high performance with the AUC of 0.6375.

Figure 2: Performance comparison performed by our proposed model on Human SIPs data set in terms of ROC curves and AUCs. As a result, SIP-ECEI yielded high performance with the AUC of 0.7463.

Figure 3: Performance comparison performed by SVM model on Yeast SIPs data set in terms of ROC curves and AUCs. As a result, SIP-ECEI yielded high performance with the AUC of 0.6328.

Figure 4: Performance comparison performed by our proposed model on Human SIPs data set in terms of ROC curves and AUCs. As a result, SIP-ECEI yielded high performance with the AUC of 0.7463.
In the expected value $q_{ij}$, the class count is the number of samples with value $c_j$ in class C, and N is the total number of samples in the training set.
In step (2) of the rotation forest algorithm, the matrix $X_{i,j}$ is followed by a bootstrap subset of objects comprising 75% of X, which constitutes a new training set $X'_{i,j}$.
In the prediction period, the classifier output is used to determine the class to which x belongs. Next, the class confidence is calculated by means of the average combination.