iPhos-PseEn: Identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier

Protein phosphorylation is a posttranslational modification (PTM or PTLM), where a phosphoryl group is added to the residue(s) of a protein molecule. The most commonly phosphorylated amino acids occur at serine (S), threonine (T), and tyrosine (Y). Protein phosphorylation plays a significant role in a wide range of cellular processes; meanwhile its dysregulation is also involved with many diseases. Therefore, from the angles of both basic research and drug development, we are facing a challenging problem: for an uncharacterized protein sequence containing many residues of S, T, or Y, which ones can be phosphorylated, and which ones cannot? To address this problem, we have developed a predictor called iPhos-PseEn by fusing four different pseudo component approaches (amino acids’ disorder scores, nearest neighbor scores, occurrence frequencies, and position weights) into an ensemble classifier via a voting system. Rigorous cross-validations indicated that the proposed predictor remarkably outperformed its existing counterparts. For the convenience of most experimental scientists, a user-friendly web-server for iPhos-PseEn has been established at http://www.jci-bioinfo.cn/iPhos-PseEn, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved.

Protein phosphorylation is one of the most-studied post-translational modification (PTM or PTLM) that can alter the structural conformation of a protein, causing it to become activated, deactivated, or modifying its function.The most commonly phosphorylated amino acids are serine (S-type), threonine (T-type), and tyrosine (Y-type).
In human cells, phosphorylation also plays a critical role in the transmission of signals controlling a diverse array of cellular functions, such as cell growth, survival differentiation, and metabolism; while its dysregulation is implicated in many diseases.Therefore, information of phosphorylation sites in proteins is significant for both basic research and drug development.
Many efforts have been made to identify the protein phosphorylation.These methods include mass spectroscopy [19,20], phosphor-specific antibody [21], etc.Unfortunately, these experimental techniques are both time-consuming and expensive.Facing the explosive growth of protein sequences merging in post genomic age, it is highly desired to develop computational methods for effectively identifying the phosphorylation sites in proteins.
Actually, by using computational approaches such as artificial neural networks, hidden Markov models, and support vector machines, some prediction method were developed based on various different features including disorder scores, KNN scores, amino acid frequency [22,23], and attribute grouping and position weight amino acid composition [24].
In view of its importance and urgency, it is certainly worthwhile to further improve the prediction quality by introducing some novel approaches as elaborated below.
According to the Chou's 5-step rule [25] and demonstrated in a series of recent publications [11,12,18,[26][27][28][29][30], to develop a really useful sequence-based predictor for a biological system, we should stick to the following five guidelines and make them crystal clear: (1) how to construct or select a valid benchmark dataset to train and test the predictor; (2) how to formulate the biological sequence samples with an effective mathematical expression that can truly reflect their essential correlation with the target concerned; (3) how to introduce or develop a powerful algorithm (or engine) to run the prediction; (4) how to properly conduct crossvalidation tests to objectively evaluate the anticipated accuracy; (5) how to provide a web-server and user guide to make people very easily to get their desired results.Below, we are to address the five procedures one-by-one.However, their order may be changed in order to match the rubric style of Oncotarget.

A new ensemble web-server predictor
By fusing four different pseudo component approaches, a new ensemble classifier, named iPhos-PseEn, has been established for predicting phosphorylation sites in proteins.

Success rates and comparison with the existing methods
The success rates achieved by the iPhos-PseEn predictor via the 5-fold cross validation for S-, T-and Y-type phosphorylation are given in Table 1, where for facilitating comparison the corresponding rates by Musite [22] and PWAAC [24] are also listed.As we can see from the table, compared with its counterparts, iPhos-PseEn is remarkably better than its counterparts in predicting all the three phosphorylation types as measured with all the four metrics, clearly indicating that the proposed predictor not only can achieve higher sensitivity, specificity, and overall accuracy but is also much more stable.As shown from the table, compared with Sp, the improvement in Sn is relatively less significant.This is quite normal because the metrics Sn and Sp are used to measure a predictor from two different angles and hence they are actually constrained with each other [28,31,32].
Graphical approach is a useful vehicle for analyzing complicated biological systems as demonstrated by a series of previous studies (see, e.g., [33][34][35][36][37][38][39][40]).Here, to provide an intuitive comparison, the graph of Receiver Operating Characteristic (ROC) [41,42] was utilized to show the advantage of iPhos-PseEn over the Musite [22] and PWAAC [24].In Figure 1 the green and red graphic lines are the ROC curves for the Musite and PWAAC, respectively; while the blue graphic line for the proposed predictor iPhos-PseEn.The area under the ROC curve is called AUC (area under the curve).The greater the AUC value is, the better the predictor will be [41,42].As we can see from Figure 1, the area under the blue curve is remarkably greater than that under the red or green line, once again indicating that the proposed predictor is indeed much better than Musite and PWAAC predictors.Therefore, it is anticipated that iPhos-PseEn will become a useful high throughput tool in this important area, or at the very least, play a complementary role to the existing methods.
Why could the proposed method enhance the prediction quality so significantly?The key is the following.Many important features, which have been proved being closely correlated with phosphorylation sites by previous investigators, such as disorder, nearest neighbor scores, amino acid occurrence frequency, and amino acid position weight, are fused into an ensemble classifier via the general PseAAC approach, as will be elaborated in the Materials and Methods section.

Web server and user guide
As pointed out in two recent review papers [16,65], a prediction method with its web-server available would practically much more useful.The web-server for iPhos-PseEn has been established.Moreover, to maximize the convenience for users, a step-by-step guide is provided below. (1) Opening the web-server at http://www.jci-bioinfo.cn/iPhos-PseEn, you will see the top page of iPhos-PseEn on your computer screen, as shown in Figure 2. Click on the Read Me button to see a brief introduction about this predictor.(2) Either type or copy/paste your query protein sequences into the input box at the center of Figure 2. The input sequences should be in the FASTA The method developed by Gao et al. [22].The method developed by Huang et al. [24].
The method proposed in this paper.1 obtained by iPhos-PseEn on the benchmark dataset via the 5-fold cross validation test.(5) As shown on the lower panel of Figure 2, you may also choose the batch prediction by entering your e-mail address and your desired batch input file (in FASTA format of course) via the Browse button.
To see the sample of batch input file, click on the button Batch-example.( 6) Click the Supporting Information button to download the benchmark dataset used in this study.(7) Click the Citation button to find the relevant papers that document the detailed development and algorithm for iPhos-PseEn.

Benchmark dataset
To ensure a high quality, the benchmark dataset used in this study was constructed based on UniProtKB/Swiss-Prot database (released September 2015) at http://www.ebi.ac.uk/uniprot/ according to following the procedures: (1) Open the web site at http://www.uniprot.org/,followed by clicking the button "Advanced".(2) Select "PTM/ Processing" and "Modified residue [FT]" for "Fields".(3) Select "Any experimental assertion" for "Evidence".(4) Type "human" for "Term" to do search.( 5) Collected were only those proteins that consist of 50 and more amino acid residues to exclude fragments.(6) The proteins thus obtained were subject to a screening operation to remove those sequences that had 50% ≥ pairwise sequence identity to any other.
After strictly following the aforementioned procedures, we finally obtained 1,770 proteins, of which 638 are nonphosphorylated proteins and 1,132 are phosphorylated proteins.The latter contain 845 phosphoserine proteins, 386 phosphothreonine proteins, and 249 phosphotyrosine proteins.Note that some of phosphorylated proteins may be with multilabel, meaning they may belong to more than one type.
For facilitating description later, the Chou's peptide formulation was adopted.The formulation was used to investigate the signal peptide cleavage sites [43], nitrotyrosine sites [9], methylation sites [7], enzyme specificity [44], protein-protein interactions [45], hydroxyproline and hydroxylysine sites [8], and proteinprotein binding sites [46].According to Chou's scheme, a potential phosphorylation site-containing peptide sample can be generally expressed by where the symbol  denotes the single amino acid code S, T, or Y, the subscript ξ is an integer, R −ξ represents the ξ -th upstream amino acid residue from the center, the R +ξ the ξ -th downstream amino acid residue, and so forth ( Figure 3).The (2 1)  ξ + -tuple peptide sample ( ) ξ P  can be further classified into the following two categories: , if its center is a phosphorylation site , other wise  In literature the benchmark dataset usually consists of a training dataset and a testing dataset: the former is used for training a model; while the latter, testing the model.But as pointed out in a comprehensive review [47], there is no need to artificially separate a benchmark dataset into the two parts if the prediction model is analyzed with the jackknife test or subsampling (K-fold) cross-validation because the outcome thus obtained is actually from a combination of many different independent dataset tests.Therefore, the benchmark dataset ( ) ξ   for the current study can be formulated as where the positive subset The detailed procedures to construct the benchmark dataset are as follows.(1) As done in [48], slide the ( ) ξ + -tuple peptide window along each of the aforementioned 1,770 protein sequences, and collected were only those peptide segments that have S, T, and Y at the center.(2) If the upstream or downstream in a protein sequence was less than ξ or greater than L − ξ where L is the length of the protein sequence concerned, the lacking residue was filled with the same residue of its closest neighbor.(3) The peptide segment samples thus obtained were put into the positive subset ( )   if their centers have been experimentally annotated as the phosphorylation sites; otherwise, into the negative subset ( ) ) To reduce redundancy, all those peptide samples were removed if they had pairwise sequence identity with any other.
Note that the length of peptide samples and their number thus generated would depend on the ξ value.Many tests by previous investigators [22][23][24], however, had indicated that it would be most promising when 6 ξ = or the sample's length was 2 1 13.ξ + = Accordingly, hereafter we only consider the case of 6 ξ = ; i.e., the samples with 13 amino acid residues.Thus, the benchmark datasets thus obtained for  2 is a summary of their sizes.

Incorporate extracted features into general pseudo amino acid composition
With the avalanche of biological sequence generated in the post-genomic age, one of the most important problems in computational biology is how to formulate a biological sequence with a discrete model or a vector, yet still considerably keep its sequence order information or essential feature.This is because all the existing machine-learning algorithms can only handle vector but not sequence samples, as elaborated in [16].
Because it has been widely and increasingly used, recently three powerful open access soft-wares, called 'PseAAC-Builder' [51], 'propy' [52], and 'PseAAC-General' [64], were established: the former two are for generating various modes of Chou's special PseAAC; while the 3rd one for those of Chou's general PseAAC [25], including not only all the special modes of feature vectors for proteins but also the higher level feature vectors such as "Functional Domain" mode (see Eqs.9-10 of [25]), "Gene Ontology" mode (see Eqs.11-12 of [25]), and "Sequential Evolution" or "PSSM" mode (see Eqs.13-14 of [25]).Inspired by the successes of using PseAAC to deal with protein/peptide sequences, three web-servers [66][67][68] were developed for generating various feature vectors for DNA/RNA sequences.Particularly, recently a powerful web-server called Pse-in-One [69] has been developed that can be used to generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the need of users' studies.
According to the general PseAAC [25], the peptide sequence of Eq.1 or Eq.4 can be formulated as ( ) where the components 1, 2, , ) will be defined by how to extract useful features from the relevant protein/ peptide sequence, and T is the transpose operator.

Disorder Score (DS)
Disorder score is a feature to measure the stability of the local structure.Although disordered region does not have fixed three-dimensional structure in proteins, its functional importance has been increasingly recognized [70][71][72][73].It was recently used for identifying protein methylation sites [7].Particularly, it has been observed that the phosphorylation sites have a strong tendency to be located in disordered regions [74].Using the VSL2 program [75], the disorder score of each amino acid residues in a protein can be calculated and expressed by 1,1 where , k j d is the DS score of the k-th amino acid residue ( 1, 2, , ) k L =  when its type is ( 1, 2, , 20) j =  .Thus, to reflect the disorder information, the PseAAC of Eq.4 was defined by where the components are taken from the disorder score matrix of Eq.5 according to the constituent amino acids in Eq.1 as well as their positions in the relevant protein.

K Nearest Neighbor Score (KNNS)
Local sequence clusters often exist around phosphorylation sites because the PTM samples in a same family usually share similar patterns.To reflect this kind of patterns, the PseAAC of Eq.4 was defined by =κ κ κ κ κ T P (7) as done in [22,23] via the BLOSUM62 matrix [76].

Amino Acid Occurrence Frequency (AAOF)
To reflect the amino acid occurrence frequency, the component in Eq.4 are defined by a 20-D vector; i.e., [ ] where f 1 is the occurrence frequency of amino acid A in the relevant 13-tuple peptide sample, f 2 is the occurrence frequency of amino acid C, and so forth (according to the alphabetical order of the single-letter codes for 20 native amino acids).

Position Weight Amino Acid Composition (PWAAC)
Position weight amino acid composition can reveal the sequence-order information around some PTM sites, and it had been used in identifying viral protein phosphorylation sites [24] as well as methylation sites [77].To reflect this kind of information, the PseAAC of Eq.4 was defined by where ξ is the same as in Eq.1, and
As shown above, by using DS, KNNS, AAOF, and PWAAC, the sample of Eq.1 can be defined by four different PseAAC vectors, as indicated in Eqs.6, 7, 8, and 9, respectively.Accordingly, we have four different basic RF predictors; i.e., RF (1), when the sample is based on DS or Eq.6 RF(2), when the sample is based on KNNS or Eq.7 RF(3), when the sample is based on AAOF or Eq.8 RF(4), when the sample is based on PWAAC or Eq.9

Ensemble random forests
As demonstrated by a series of previous studies, such as signal peptide prediction [82,83], membrane protein type classification [84,85], protein subcellular location prediction [86][87][88], protein fold pattern recognition [89], enzyme functional classification [90], protein-proteins interaction prediction [45], and protein-protein binding site identification [46], the ensemble predictor formed by fusing an array of individual predictors via a voting system can generate much better prediction quality.
Here, the ensemble predictor is formed by fusing the aforementioned four different individual RF predictor of Eq.12; i.e., ( ) where E  denotes the ensemble predictor, and the symbol ∀ denotes the fusing operator [47].In the current study, the concrete fusion process can be described as follows.For a query sample of Eq.1, it would be in turn predicted by RF(1), RF(2), RF(3) and RF(4), respectively.If most outcomes indicated that it belonged to phosphorylation segment, its central residue  was predicted to be phosphorylation site; otherwise, nonphosphorylation site.If there was a tie, the result could be randomly picked between the two.But this kind of tie case rarely happened.For more detailed about this, see a comprehensive review [47] where a crystal clear elucidation with a set of elegant equations are given and hence there is no need to repeat here.
The predictor established via the above procedures is called "iPhos-PseEn", where "i" stands for identify", "Phos" for "phosphorylation site", and "Pse" for "pseudo components", and "En" for "ensemble".Depicted in Figure 4 is a flowchart to show how the ensemble predictor is working.
As mentioned in Introduction, among the five guidelines in developing a useful predictor, one of them is how to objectively evaluate its anticipated success rates [25].To fulfil this, the following two things need to consider: one is what metrics should be used to measure the predictor's quality; the other is what kind of test method should be taken to derive the metrics rates.Below, let us to address such two problems.

Metrics used to reflect the success rates
A set of four metrics are usually used in literature to measure the quality of a predictor: (1) overall accuracy or Acc; (2) Mathew's correlation coefficient or MCC; (3) sensitivity or Sn; and (4) specificity or Sp [91].But the conventional formulations for the four metrics are not intuitive, and most experimental scientists feel hard to understand them, particularly for the MCC.Fortunately, if using the symbols introduced by Chou [92] in studying the signal peptides, the set of four metrics can be written as the following forms [5,93]: where N + represents the total number of true- phosphorylation samples investigated, whereas N + − the number of phosphorylation samples incorrectly predicted to be of false-phosphorylation sample; N − the total number of false-phosphorylation samples, whereas N − + the number of false-phosphorylation samples incorrectly predicted to be of true-phosphorylation sample.
Note that, of the four metrics in Eq.14, the most important are the Acc and MCC: the former reflects the overall accuracy of a predictor; while the latter, its stability in practical applications.The metrics Sn and Sp are used to measure a predictor from two opposite angles.When, and only when, both Sn and Sp of a tested predictor are higher than those of the other tested predictor, we can say the former predictor is better than the latter one.
Also, it is instructive to point out that the set of equations given in Eq.14 is valid for the single-label systems only.As for the multi-label systems whose emergence has become increasingly often in the system biology [103][104][105] and system medicine [106], a completely different set of metrics is needed as elucidated in [107].

Cross-validation
With a set of intuitive evaluation metrics clearly defined, the next step is what kind of validation method should be used to derive the metrics values.
In this study, however, to reduce the computational time, we adopted the 5-fold cross-validation method, as done by many investigators with SVM as the prediction engine.Given below is a more rigorous description of 5-fold cross-validation on a benchmark dataset  .
First, randomly divided the benchmark dataset  into five groups 1  , 2  , 3  , 4  , and 5  , with each having approximately the same number of samples not only for the main-set level but also for all the sub-set levels considered, as can be formulated by where the symbol  means that the divided datasets are about the same in size, and so are their subsets [27].Next, each of the five sub-benchmark datasets was singled out one-by-one and tested by the model trained with the remaining four sub-benchmark datasets.The crossvalidation process was repeated for five times, with their average as the final outcome.In other words, during the process of 5-fold cross-validation, both the training dataset and testing dataset were actually open, and each sub-benchmark datasets was in turn moved between the two.The 5-fold cross-validation test can exclude the "memory" effect, just like conducting 5 different independent dataset tests.
As we can see from Table 2 or Supporting Information S1, S2, and S3, the negative subset ( ) is much larger than the positive subset ( )
The ratio is about 10:1 for all the three types of phosphorylation.
Although this might reflect the real world in which the non-phosphorylation sites are always the majority compared with the phosphorylation ones, a predictor trained by such a highly skewed benchmark dataset would inevitably have the bias consequence that many phosphorylation sites might be mispredicted as nonphosphorylation ones.To deal with this kind of situation, we randomly divide the negative subset into ten groups with each having about the same size.Thus, for each of the three types of phosphorylation, we have ten benchmark datasets in which the positive and negative samples are about the same.It was based on each of such ten datasets that the 5-fold cross-validation was performed, followed by taking an average for the final score.

CONCLUSIONS
The iPhos-PseEn predictor is a new bioinformatics tool for identifying the phosphorylation sites in proteins.Compared with the existing predictors in this area, its prediction quality is much better, with remarkably higher sensitivity, specificity, overall accuracy, and Mathew's correlation coefficient.For the convenience of most experimental scientists, we have provided its web-server and a step-by-step guide, by which users can easily obtain their desired results without the need to go through the detailed mathematics.
We anticipate that iPhos-PseEn will become a very useful high throughput tool, or at the very least, a complementary tool to the existing methods for predicting the protein phosphorylation sites. b

d
See Eq.14 for the definition of metrics.

Figure 1 :
Figure 1: The intuitive graphs of ROC curves to show the performance of Musite, PWAAC, iPhos-PseEn, respectively, for the case of the center residue  is (A) S, (B) T, and (C) Y. See the main text for further explanation.

P
 denotes a true phosphorylation segment with S, T, or Y at its center, ( ) − ξ P  denotes a corresponding false phosphorylation segment, and the symbol ∈ means "a member of" in the set theory.

Figure 3 :
Figure 3: A schematic drawing to show the peptide model


 only contains the samples of false phosphorylation segments ( ) − ξ P  (see Eq.2); while  represents the symbol for "union" in the set theory.

Figure 4 :
Figure 4: A flow chart to show how the four individual random forest predictors are fused into an ensemble classifier via a voting system.See Eqs.12-13 as well as the relevant text for further explanation.

Table 2 : Summary of phosphorylation site samples in the benchmark dataset a Subset Phosphorylation type and number of samples
Of the negative samples, 21,564 from the 845 phosphoserine proteins and the 21,968 from the 638 non-phosphorylated proteins.Of the negative samples, 4,307 from the 386 phosphothreonine proteins and the 5,432 from the 638 non-phosphorylated proteins.d Of the negative samples, 3,968 from the 249 phosphotyrosine proteins and the 4,362 from the 638 non-phosphorylated proteins.
b c