iHyd-PseCp: Identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general PseAAC

Protein hydroxylation is a posttranslational modification (PTM), in which a CH group in Pro (P) or Lys (K) residue has been converted into a COH group, or a hydroxyl group (−OH) is converted into an organic compound. Closely associated with cellular signaling activities, this type of PTM is also involved in some major diseases, such as stomach cancer and lung cancer. Therefore, from the angles of both basic research and drug development, we are facing a challenging problem: for an uncharacterized protein sequence containing many residues of P or K, which ones can be hydroxylated, and which ones cannot? With the explosive growth of protein sequences in the post-genomic age, the problem has become even more urgent. To address such a problem, we have developed a predictor called iHyd-PseCp by incorporating the sequence-coupled information into the general pseudo amino acid composition (PseAAC) and introducing the “Random Forest” algorithm to operate the calculation. Rigorous jackknife tests indicated that the new predictor remarkably outperformed the existing state-of-the-art prediction method for the same purpose. For the convenience of most experimental scientists, a user-friendly web-server for iHyd-PseCp has been established at http://www.jci-bioinfo.cn/iHyd-PseCp, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved.


INTRODUCTION
Protein post-translational modification (PTM or PTLM) is one of the most efficient biological mechanisms for expanding the genetic code and regulating cellular physiology. Hydroxylation is one type of PTM that can take place in proteins to hydroxylate proline and lysine. Hydroxyproline (HyP) is the key factor in stabilizing collagens [1,2], whose instability or abnormal activity may cause stomach cancer [3] and lung cancer [4,5]. Hydroxylysine (HyL) is also found in collagen, which may affect fibrillogenesis, cross-linking, and matrix mineralization [6]. Consequently, identifying the HyP and HyL sites in proteins is an indispensable step for decoding protein function. It is also crucially important for in-depth understanding the physiological roles of hydroxylation. Meanwhile, it can also provide useful information for developing drugs to treat various diseases associated with hydroxylation.
Although the information of HyP and HyL can be determined by means of large-scale mass spectrometry, it is time-consuming and expensive. Therefore, it is highly demanded to develop computational methods to deal with this problem. In a pioneer work, by incorporating dipeptide position-specific propensity into the general Chou's pseudo amino acid composition (PseAAC) [7] and iHyd-PseCp: Identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general PseAAC using the discriminant function algorithm used by Chou et al. for identifying the HIV protease cleavage sites [8,9], Xu et al. [10] proposed a predictor called iHyd-PseAAC to identify the HyP and HyL sites in proteins. Although these authors did make contribution in stimulating the development of this area, more work is definitely needed to improve the prediction quality. And the current study is to devote to do so by introducing the sequence-coupled approach.
According to the Chou's 5-step rule [7] and concurred by many investigators in a series of recent publications [11][12][13][14][15][16][17][18][19][20][21][22][23], for developing a new prediction method that can be widely used by broad users, we should consider the following five points: (1) the prediction method should be with a web-server accessible to public; (2) a compelling demonstration to show its prediction quality being improved over the existing counterparts; (3) a good benchmark dataset used to train or test the new model; (4) an elegant mathematical formulation to represent the statistical samples concerned; and (5) a powerful algorithm to operate the calculation. Below, let us address the above points one-by-one.

A new predictor and its user guide
A powerful predictor, called iHyd-PseCp, has been developed for identifying the HyP and HyL sites in proteins. The new predictor is accessible to the public.
Users can easily get their desired results by following the instructions below.

RESULTS AND ANALYSIS
The success rates achieved by the new predictor iHyd-PseCp via the rigorous jackknife test on the 164 hydroxyproline proteins are given in Table 1, where for facilitating comparison, the corresponding rates obtained by the predictor iHyd-PseAAC [10] are also listed. Also, the jackknife success rates by the new predictor iHyd-PseCp on the 33 hydroxylysine proteins are given in Table 2, along with the corresponding rates obtained by the predictor iHyd-PseAAC [10]. As we can see from Table 1, for predicting the HyP sites, the newly proposed method has remarkably outperformed the state-of-theart method from all the four angles: overall accuracy Acc, stability MCC, sensitivity Sn, and specificity Sp. As for the prediction of the HyL sites, it can be observed from Table 2 that the new predictor iHyd-PseCp has significantly outperformed iHyd-PseAAC [10] in Acc and MCC. Although the rate of Sn by the new predictor is about 9% lower than that by iHyd-PseAAC, interestingly the rate of Sp by the new predictor is about 16% higher than that by iHyd-PseAAC.
It is instructive to point out that, of the four metrics, the most important are the Acc and MCC [11,12,21,22]: the former reflects the overall accuracy of a predictor; while the latter, its stability in practical applications. The metrics Sn and Sp are used to measure a predictor from two opposite angles. When, and only when, both Sn and Sp of the predictor A are higher than those of the predictor B, can we say A is better than B [19]. In other words, Sn and Sp are actually constrained with each other [24]. Therefore, it is meaningless to use only one of the two for comparing the quality of two predictors. A meaningful comparison in this regard should count the rates of both Sn and Sp, or even better count the rate of their combination, which is none but the score of MCC.
Graphic analysis is a very useful vehicle to deal with complicated biological systems as demonstrated by a series of previous studies (see, e.g., ). To provide an intuitive comparison of the proposed method with the existing state-of-the art method [10] by using the graphic analysis, let us use the Receiver Operating Characteristic (ROC) graphs [51,52] as given in Figure 2. In the figure, the green and red graphic lines are the ROC curves for iHyd-PseCp and iHyd-PseAAC [10], respectively, where panel (a) is for the case in predicting HyP sites in proteins, and panel (b) for the case of HyL. The area under the ROC curve is called AUC (area under the curve). The greater the AUC value is, the better the predictor will be [51,52]. As we can see from Figure 2, the area under the green curve is remarkably greater than that under the red line for both the HyP and HyL cases, once again indicating that the proposed predictor is indeed much better than iHyd-PseAAC [10]. Accordingly, we anticipate that iHyd-PseCp will become a useful bioinformatics tool for identifying HyP and HyL sites in proteins, or at the very least, play a complementary role to the existing state-of-the art tool in this area.
Why could the proposed method be able to increase the prediction quality so substantially? This is due to the fact that the amino-acid-coupled effects around the hydroxylation sites have been taken into account via the conditional probability approach. Similar remarkable successes have also been observed in predicting betaturns [53], alpha-turns [54], tight turns and their types in proteins [55], specificity of GalNAc-transferase [56], HIV protease cleavage sites [8,24,57], as well as signal peptide cleavage sites [58][59][60].

Benchmark dataset
The benchmark dataset used in this study was derived from the same proteins as used by Xu et al. [10]. They consist of 164 hydroxyproline proteins and 33 hydroxylysine proteins. The former were used to construct the benchmark dataset for studying the HyP sites, while the latter used to construct the benchmark dataset for studying the HyL sites.
To make the description mathematically more rigorous and clear, the Chou's scheme [61] was adopted to formulate peptide samples, as done recently by many authors in studying the nitrotyrosine sites [62], methylation sites [63], protein-protein interaction [64], and protein-  The scores here were generated by the rigorous jackknife tests on the 164 hydroxyproline proteins as adopted by Xu et al. [10]. b The predictor developed by Xu et al. [10]. c The predictor proposed in this paper. d See Eq.9 for the metrics definition. www.impactjournals.com/oncotarget protein binding sites [65]. According to Chou's scheme, a potential hydroxylation site-containing peptide sample can be generally expressed by where the symbol  denotes the single amino acid code P or K, the subscript ξ is an integer, R −ξ represents the ξ -th upstream amino acid residue from the center, the R +ξ the ξ -th downstream amino acid residue, and so forth. The (2 1) ξ + -tuple peptide sample ( ) ξ P  can be further classified into the following two categories: where ( ) + ξ P  denotes a true hydroxylation segment with P or K at its center, ( ) − ξ P  a false hydroxylation segment with P or K at its center, and the symbol ∈ means "a member of" in the set theory.
In literature the benchmark dataset usually consists of a training dataset and a testing dataset: the former is used for training a model, while the latter for testing the model. But as pointed out in a comprehensive review [66], there is no need to artificially separate a benchmark dataset into the two parts if the prediction model is examined by the jackknife test or subsampling (K-fold) cross-validation since the outcome thus obtained is actually from a combination of many different independent dataset tests.
Thus, the benchmark dataset ( ) î   for the current study can be formulated as where the positive subset are as follows.
(1) As done in [61], slide the ( ) 2 1 ξ + -tuple peptide window along each of the aforementioned 164 hydroxyproline protein sequences, and collected were only those peptide segments that have P (Pro) at the center. (2) If the upstream or downstream in a protein sequence was less than ξ or greater than L − ξ where L is the length of the protein sequence concerned, the lacking amino acid was filled with a dummy residue X. (3) The peptide segment samples thus obtained were put into the positive subset if their centers have been experimentally annotated as the hydroxylation sites; otherwise, into the negative subset. (4) The peptide samples thus obtained were subject to a screening procedure to window those that were identical to any other in a same subset; excluded from the benchmark dataset were also those that were self-conflict, namely, occurring in both hydroxylation group and nonhydroxylation group.
By following the same procedures but using the 33 hydroxylysine proteins and focusing on K (Lys), instead of the 164 hydroxyproline proteins and P (Pro), we obtained the benchmark dataset . Because the length of peptide sample ( ) ξ P  is 2 1 ξ + (see Eq.1), the benchmark dataset with different ξ value will contain peptide segments with different number of amino acid residues, as illustrated below ( ) 13 But preliminary tests had indicated that it would be most promising when 10 ξ = . Consequently, for further study below, instead of Eq.3, we shall consider ( ) ( ) 10 10 10 10 10 10 P (P) (P), when P K (K) (K), when K where the benchmark dataset 10 (P)

Sequence-coupled information and general PseAAC
With the avalanche of biological sequence generated in the post-genomic age, one of the most important problems in computational biology is how to formulate a biological sequence with a discrete model or a vector, yet still considerably keep its sequence pattern or order information. This is because all the existing machinelearning algorithms can only handle vector but not sequence samples, as elaborated in [67].
To address this problem, the pseudo amino acid composition [68,69] or PseAAC was proposed. Ever since the concept of pseudo amino acid composition or Chou's PseAAC [70][71][72] was proposed, it has rapidly penetrated into nearly all the areas of computational proteomics (see, e.g., [73][74][75][76][77][78][79][80] as well as a long list of references cited in [81,82]) and many biomedicine and drug development areas [67,[83][84][85][86]. Because it has been widely and increasingly used, recently three powerful open access soft-wares, called 'PseAAC-Builder' [70], 'propy' [71], and 'PseAAC-General' [81], were established: the former two are for generating various modes of Chou's special PseAAC; while the 3rd one for those of Chou's general PseAAC [7], including not only all the special modes of feature vectors for proteins but also the higher level feature vectors such as "Functional Domain" mode (see Eqs. 9-10 of [7]), "Gene Ontology" mode (see Eqs. 11-12 of [7]), and "Sequential Evolution" or "PSSM" mode (see Eqs.13-14 of [7]). Inspired by the successes of using PseAAC to deal with protein/peptide sequences, three web-servers [87][88][89] were developed for generating various feature vectors for DNA/RNA sequences as well. Particularly, recently a powerful web-server called Pse-in-One [90] has been developed that can be used to generate any desired feature vectors for protein/peptide and DNA/ RNA sequences according to the need of users' studies.
According to the general PseAAC [7], the peptide sequence of Eq.1 can be formulated as ( ) ( ) ( ) 10 10 10 where ( ) (R | R ) (R | R ) and ( ) (R | R ) In Eq.7 10 10 9 (R | R ) is the conditional probability of amino acid 10 R − occurring at the left 1st position (see Eq.1) given that its closest right neighbor is 9 R − , 9 9 8 (R | R ) is the conditional probability of amino acid 9 R occurring at the left 2nd position given that its closest right neighbor is 8 R − , and so forth. Note that in Eq.7, only are of non-conditional probability since the right neighbor of 1 R − and the left neighbor of 1 R + are always  (namely Pro for the case of HyP, or Lys for the case of HyL). All these probability values can be easily derived from the positive training subsets taken from Supporting Information S1 and S2, respectively, as done in [9]. Likewise, the components in Eq.8 are the same as those in Eq.7 except for that they are derived from the negative training subsets in Supporting Information S1 and S2, respectively. www.impactjournals.com/oncotarget

Random forests algorithm
The random forests (RF) algorithm is a powerful algorithm and has been widely used in many areas of computational biology (see, e.g. [13-15, 64, 65, 91-93]). The algorithm of random forest is based on the ensemble of a large number of decision trees, where each tree gives a classification and the forest chooses the final classification via the most votes (over all the trees in the forest). In the most commonly used type of random forests, split selection is performed based on the so-called decrease of Gini impurity. In this study, the random forest is used to rank the features using Gini importance that is implemented with the machine learning platform scikit-learn. The detailed procedures of RF and its formulation have been very clearly described in [94], and hence there is no need to repeat here.
For the current study, all the involved peptide samples were converted into a 20-D (dimensional) vector according to Eq.6, and then entered into the RF operation engine as the input. And the output would indicate whether the center residue  of the query peptide is a "hydroxylation site" or "non-hydroxylation site". Note that, in using the current prediction method, one must observe the self-consistency principle: if the center residue of a query peptide is P, =  then the corresponding training data must be taken from ( ) if the center residue of a query peptide is K, =  then the training data must be taken from The predictor established via the above procedures is called "iHyd-PseCp", where "i" stands for "identify", "Hyd" for "hydroxylation site", "Pse" for "general PseAAC", and "Cp" for "sequence coupled effect".
As pointed out in the Introduction section, one of the keys in establishing a useful predictor is how to properly evaluate its anticipated success rates. To realize this, we need to consider the following two things: one is what metrics or scales should be used to quantitatively measure its prediction quality; the other is what validation method should be adopted to calculate or derive the metrics values. Below, let us address the two problems.

A set of four metrics
The following four metrics are usually used in literature to measure the quality of binary classification: (1) overall accuracy or Acc; (2) Mathew's correlation coefficient or MCC; (3) sensitivity or Sn; and (4) specificity or Sp (see, e.g., [95]). Unfortunately, the conventional formulations for the four are not intuitive and that most experimental scientists feel difficult to understand them, particularly for the one of MCC. Interestingly, by using the Chou's symbols and derivation in studying signal peptides [96], the aforementioned four metrics can be easily converted into a set of following equations [97,98]:  9) where N + represents the total number of hydroxylation sites investigated whereas N + − the number of true hydroxylation sites incorrectly predicted to be of nonhydroxylation site; N − the total number of the nonhydroxylation sites investigated whereas N − + the number of non-hydroxylation sites incorrectly predicted to be of hydroxylation site.
According to Eq.9, it is crystal clear to see the following. When 0 N + − = meaning none of the true hydroxylation sites are incorrectly predicted to be of nonhydroxylation site, we have the sensitivity Sn 1 = . When N N + + − = meaning that all the hydroxylation sites are incorrectly predicted to be of non-hydroxylation site, we have the sensitivity Sn 0 = . Likewise, when 0 N − + = meaning none of the non-hydroxylation sites are incorrectly predicted to be of hydroxylation site, we have the specificity Sp  we have Acc 0.5 = and MCC 0 = meaning no better than random guess. Therefore, using Eq.9 has made the meanings of sensitivity, specificity, overall accuracy, and Mathew's correlation coefficient much more intuitive and easier-to-understand, particularly for the meaning of MCC, as concurred recently by many investigators (see, e.g., [11, 12, 16, 18, 19, 21-23, 64, 65, 99-108]).
Note that, however, the set of equations defined in Eq.9 is valid only for the single-label systems. For the multi-label systems whose emergence has become more frequent in system biology [109][110][111] and system medicine [112], a completely different set of metrics are needed as elaborated in [113].

Jackknife test
With a set of well-defined metrics to measuring the quality of a predictor, the next thing is what kind of validation method should be used to score these metrics. In predictive analytics, the following three cross-validation methods are often used: (1) independent dataset test, (2) subsampling (or K-fold cross-validation) test, and (3) jackknife test [114]. Of these three, however, the jackknife test is deemed the least arbitrary that can always yield a unique outcome for a given benchmark dataset as elucidated in [7]. Accordingly, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors (see, e.g., [73-76, 78-80, 115-123]). Therefore, the jackknife test was also adopted in this study to score the metrics of Eq.9. In the jackknife test, each of the samples in the benchmark dataset is singled out one-by-one and tested by the predictor trained with the remaining samples. During the jackknifing process, both the training dataset and testing dataset are literally open, and each sample is in turn moved between the two. The jackknife test can exclude the "memory" effect; it can also avoid the arbitrariness problem occurring in the independent dataset test and subsampling test as pointed out in [7] because the outcome obtained by the jackknife test is always unique for a given benchmark dataset.

CONCLUSIONS
The iHyd-PseCp predictor is a new bioinformatics tool for identifying the hydroxylation sites in proteins. Compared with the existing state-of-the-art predictor in this area, its prediction quality is much better, with remarkably higher overall accuracy and stability. For the convenience of most experimental scientists, we have provided its web-server and a step-by-step guide, by which users can easily obtain their desired results without the need to go through the detailed mathematics. The reason of including them in this paper is for the integrity of the new prediction method, and that these techniques, such as incorporating the sequence-coupled approach into the general PseAAC, may be of use as well in developing other tools in computational biology.
We anticipate that iHyd-PseCp will become a very useful high throughput tool for both basic research and drug development in the areas relevant to the protein hydroxylation.

ONLINE SUPPORTING INFORMATION Supporting Information S1
The benchmark dataset (P)  used to train and test the model for predicting the possibilit,y of hydroxylation at Pro site. Supplementary Data S1.

Supporting Information S2.
The benchmark dataset (K)  used to train and test the model for predicting the possibility of hydroxylation at Lys site. Supplementary Data S2.