iRNAm5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition

Occurring at cytosine (C) of RNA, 5-methylcytosine (m5C) is an important post-transcriptional modification (PTCM). The modification plays significant roles in biological processes by regulating RNA metabolism in both eukaryotes and prokaryotes. It may also, however, cause cancers and other major diseases. Given an uncharacterized RNA sequence that contains many C residues, can we identify which one of them can be of m5C modification, and which one cannot? It is no doubt a crucial problem, particularly with the explosive growth of RNA sequences in the postgenomic age. Unfortunately, so far no user-friendly web-server whatsoever has been developed to address such a problem. To meet the increasingly high demand from most experimental scientists working in the area of drug development, we have developed a new predictor called iRNAm5C-PseDNC by incorporating ten types of physical-chemical properties into pseudo dinucleotide composition via the auto/cross-covariance approach. Rigorous jackknife tests show that its anticipated accuracy is quite high. For most experimental scientists’ convenience, a user-friendly web-server for the predictor has been provided at http://www.jci-bioinfo.cn/iRNAm5C-PseDNC along with a step-by-step user guide, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. It has not escaped our notice that the approach presented here can also be used to deal with many other problems in genome analysis.


INTRODUCTION
Post-transcriptional modifications (PTCM) of RNA plays a paramount role for the metabolism processes of RNAs, such as for their splicing export, immune tolerance, and transcription [1][2][3]. So far, more than 100 distinct PTCMs have been identified in tRNAs, rRNAs, Mt-tRNAs, miRNAs, lincRNAs, miscRNAs, protein-coding genes, pseudogenes, etc. [1]. Among these modifications, the methylation of the 5-methylcytosine (m 5 C) is an epigenetic one [4] formed by the action of RNA methyltransferases [5] (Figure 1). The m 5 C modification is well investigated in DNA, but the corresponding studies in cellular RNA were mainly confined to tRNA and rRNA [6].
Actually, the m 5 C modification site in RNA has various biological functions, including the one that can regulate RNA metabolism in both eukaryotes and prokaryotes [7]. Furthermore, it plays a key role in yeast cell fate decision [4]. It is also significant for animal (such as mouse) and human embryonic development [1]. Although many efforts have been made by using biological experiments to determine the m 5 C sites in RNA (see, e.g. [2,3]), it is time-consuming and expensive to completely rely on the experimental approaches alone. Facing today's explosive growth of uncharacterized RNA sequences, it is highly demanded to develop computational approach to help getting the information.
Very recently, in a pioneering study, Feng et al. [8] proposed an interesting method to identify RNA m 5 C sites via the powerful PseKNC approach [9][10][11]. But no webserver has been provided for their method, and hence its practical application value is quite limited. In view of this, the present study was initiated to fill such an empty area.

RESULTS AND DISCUSSION
A predictor called "iRNAm5C-PseDNC" has been established. The success rates achieved by it on the benchmark dataset constructed based on experimental servations (Supplementary Information 1 [4,5]. During the modification process, a methyl group is attached to the 5th atom of the 6-atom ring. SAM and SAH are the abbreviations of S-adenosylmethionine and S-adenosylhomocysteine, respectively. The former is the source of the methyl group; while the latter, the byproduct. www.impactjournals.com/oncotarget  [12]. c See ref. [13]. d See ref. [14]. e See ref. [15]. f See ref. [16]. g The web-server predictor developed in this paper. where the definitions for the metrics Sn, Sp, Acc, and MCC are given in Eq.13 of the MATERIALS AND DISCUSSION section later. Since it is the first web-server predictor ever developed for identifying the m 5 C sites in RNA sequences, it is not possible to demonstrate its power by comparing with its counterparts for exactly the same purpose. Nevertheless, we can indirectly show its power via a cohort of the anticipated success rates (Table 1) reported from the five powerful web-server predictors in genome and proteome analyses [12] [13][14][15][16]. As we can see from the table, the iRNAm5C-PseDNC is with the highest score for Acc (see column 3), and the same is true for MCC (column 4), indicating the proposed predictor is not only high in overall accuracy but also quite stable.
Also, it is instructive to point out that, among the four metrics in Eq.13, the most important are the Acc and MCC. The metrics Sn and Sp are used to measure a predictor from two completely opposite angles, and they are actually constrained with each other [17]. Therefore, it is meaningless to use only one of the two for comparison [18]. When, and only when, both Sn and Sp of the predictor A are higher than those of the predictor B, can we say A is better than B. In other words, a meaningful comparison in this regard should count the scores of both Sn and Sp, or even better, the rate of their combination that is none but the score of MCC [19,20]. Now, let us use graphic analysis to further show the proposed predictor's quality. Graphs are a useful vehicle for studying complicated biological systems because they can provide intuitive insights, as demonstrated by a series of previous studies (see, e.g., [21][22][23][24][25][26][27][28]). To provide an intuitive illustration, the graph of Receiver Operating Characteristic (ROC) [29,30] was adopted as given in Figure 2, where the green line is the ROC for iRNAm5C-PseDNC. The area under the ROC curve is called the AUC (area under the curve). Being within the region of 0 and 1, the greater the AUC is, the better the predictor would be. For the current predictor, the AUC is 0.9626, which is very close to 1, the value for a perfect predictor.

MATERIALS AND METHODS
As practiced in a series of recent studies [19, 20, 35-37, 39-41, 46-53] in complying with the 5-step rule proposed in [54], to establish a really useful sequence-based predictor for a biological system, one should make the following five steps very clear: (1) how to construct or select a valid benchmark dataset to train and test the predictor; (2) how to formulate the biological sequence samples with an effective mathematical expression that can truly reflect their essential correlation with the target concerned; (3) how to introduce or develop a powerful algorithm (or engine) to run the prediction; (4) how to properly conduct cross-validation tests to objectively evaluate the anticipated accuracy; (5) how to provide a web-server and user guide to make people very easily to get their desired results. In the rest of this paper, we are to address these point-by-point.

Benchmark dataset
To make the description simpler and cleaner, the Chou's sequential scheme, which had been used by many previous investigators for analyzing the enzyme specificity [55], signal peptide cleavage sites [56], nitrotyrosine sites [16,57], hydroxyproline or hydroxylysine sites [50,58], methylation sites [34,39,59], protein-protein interaction [32], protein-protein binding sites [60,61], carbonylation sites [48], and phosphorylation sites [51], was adopted in this study. According to Chou's scheme, a potential RNA m 5 C modificationsite sample can be generally expressed by where the center symbol  denotes the single nucleic acid code cytosine (C), the subscript ξ is an integer, N −ξ represents the ξ -th upstream nucleotide from the center, the N +ξ denotes the ξ -th downstream nucleotide, and so forth. The ( ) 2 1 ξ + -tuple RNA sample R ξ ( )  can be further classified into the following two categories: if its center can be of m C where R ξ sample with C at its center, R ξ − ( )  a false one with C at its center, and the symbol ∈ means "a member of" in the set theory.
In literature the benchmark dataset usually consists of a training dataset and a testing dataset: the former is for training a model, while the latter for testing it. But as elucidated in a comprehensive review [62], there is no need to artificially separate a benchmark dataset into the two parts if the prediction model is tested by the jackknife or subsampling (K-fold) cross-validation because the outcome thus obtained is actually from a combination of many different independent dataset tests. Thus, the www.impactjournals.com/oncotarget benchmark dataset S C ξ ( ) for the current study can be formulated as  , respectively (see Eq.3); while ∪ denotes the symbol of "union" in the set theory [62].
The benchmark dataset used in this study was derived from RMBase [1], which is a resource for decoding the landscape of RNA modifications from highthroughput sequencing data. The detailed procedures are as follows. (1) The genomic sequences downloaded from RMBase [63] are in the form of DNA; to make the entire description of this paper in a coherent manner, we first change the code T to U for all the genomic sequences taken from RMBase and make them become RNA sequences. (2) As done in [64], by sliding the But it was observed via preliminary tests that when ξ = 20 (i.e., the RNA samples formed by 41 nucleotides), the corresponding results were most promising. In other words, we observed a turning point for the success rates at ξ = 20 . After this point, the success rates would become going down with the increase of such parameter. Accordingly, hereafter we only consider the 41-tuple nucleotide samples without explicitly mentioning the parameter ξ any more.
The benchmark dataset  thus obtained is given in Supplementary Information 1, which can also be downloaded at http://www.jci-bioinfo.cn/iRNAm5C/Supp-S1.pdf. It contains 1,900 RNA samples, of which 475 belong to the positive subset  + and 1,425 to the negative subset  − .

Sample formulation
An RNA samples in the aforementioned benchmark dataset can be generally expressed as where N 1 represents the 1st nucleotide of the RNA sample at its sequence position 1, N 2 the 2nd nucleotide at its position 2, and so forth. Except for N C 20 = , they can be any of the four nucleotides; i.e., Based on the sequential model of Eq.6, one could directly utilize BLAST to perform statistical analysis. Unfortunately, this kind of straightforward and intuitive approach failed to work when a query RNA sample did not have significant similarity to any of the character-known RNA sequences. To overcome this problem, investigators have shifted their focus to the discrete or vector model. The reason of doing so is also due to the fact that nearly all the existing machine-learning aorithms can be directly used to handle vector models but not sequences, as elaborated in [45]. One of the well-known vector models for DNA/ RNA sequences is the k-tuple nucleotide (or k-mers) composition; i.e., where f i k represents the normalized occurrence frequency of the i-th k-mer, and the symbol T is the transpose operator.
When k = 1 , Eq.8 reduces to where f 1 1 , f 2 1 , f 3 1 and f 4 1 are the normalized occurrence frequencies of adenine, cytosine, thymine, and uracil in the RNA sequence, respectively.
When k = 2 , Eq.8 reduces to where f 1 2 is the normalized occurrence frequencies of AA in the RNA sequence, f 2 2 is that AC, f 3 2 is that AG, f 4 2 is that AU, and so forth.
As we can see from above, the vector's dimension will rapidly increase with the k value, causing the so-called "high-dimension disaster" [65] or overfitting problem. This will significantly reduce the deviation tolerance or cluster-tolerant capacity [66], and make the prediction model contain a lot of noise and very unstable.
Therefore, the k-mers approach is useful only when the value of k is very small. In other words, it can only be used to incorporate the local or short-range or local sequence-order information, but certainly not the long-range or global sequence-order information.
According to the concept of pseudo components, the RNA sequence can be generally formulated by [11,54] where the subscript Ω is integer and its value as well as the components Ψ Ω u u ( , , , ) = 1 2  will depend on how to extract the desired information from the RNA sequence of Eq.6.
In this study, we used the approach called "physicalchemical property matrix combined with auto/crosscovariance" proposed by Liu et al. [39] to define the components in Eq.7. According to that approach, the vector components in Eq.7 are given by where λ is an integer within the range from 0 to 39. Using exactly the same calculation approach as elaborated in [39], we found that λ = 5 was optimal choice for the current study. As for how to calculate the concrete values in Eq.12, see ref. [39] where a crystal clear description had been given and hence there is no need to repeat here.

Random forest algorithm
Being a powerful algorithm, the random forest (RF) has been increasingly used to analyze various different problems in computational biology (see, e.g. [32, 36, 37, 40, 48, 50-52, 60, 61, 81-84]). The essence of RF is to compare each individual classifier as a tree, and the combination of many such classifiers as a forest. In this study, 100 trees were used for the forest, and dimension of the random subspace was 22. Each tree in the forest is trained with different part of the benchmark dataset, and hence may yield a different result. The final outcome is determined via a vote from all the trees. For more information about RF, see [85] where a very detailed description has been given, and hence there is no need to repeat here.

Test procedure
One of the important procedures [54] in developing a new prediction method is how to objectively evaluate its anticipated success rate [54]. To address this, we need to consider two issues. (1) What metrics should be used to quantitatively reflect the predictor's quality? (2) What kind of test approach should be utilized to score the metrics?

Metrics formulation
The following metrics are generally used to measure the prediction quality from four different angles: (1) Acc for measuring the overall accuracy of a predictor, (2) MCC for its stability, (3) Sn for its sensitivity, and (4) Sp for its specificity [86]. Unfortunately, their conventional formulations as given in [86] lack intuitiveness and most experimental scientists feel difficult to understand them, particularly for the MCC. Interestingly, using the Chou's symbols introduced in studying signal peptides [56], Xu et al. [13] and Chen et al. [12] converted them into a set of four intuitive equations, as given by (13) where N + represents the total number of the true m 5 C sites investigated, while N − + is the number of the true m 5 C sites incorrectly predicted to be of false m 5 C site;   µ µ µ µ www.impactjournals.com/oncotarget N − is the total number of the false m 5 C sites investigated, while N + − is the number of the false m 5 C sites incorrectly predicted to be of true m 5 C site.
According to Eq.13, it is crystal clear to see the following. When N − + = 0 meaning none of the true m 5 C sites are incorrectly predicted to be of false m 5  . and MCC = 0 meaning no better than random guess. Therefore, Eq.13 has made the meanings of sensitivity, specificity, overall accuracy, and stability much more intuitive and easier-to-understand, particularly for the meaning of MCC, as concurred recently by many investigators (see, e.g., [18, 20, 31-35, 47-52, 60, 75-77, 84, 87-92]).
Note that, however, the set of equations defined in Eq.13 is valid only for the single-label systems. For the multi-label systems whose emergence has become more frequent in system biology [93][94][95] and system medicine [96] or biomedicine [40], a completely different set of metrics are needed as elaborated in [97].

Test method
Now let us discuss what kind of test method should be used to score the four metrics in Eq.13. In statistical analysis, the following three methods are often used to test a predictor: (1) independent dataset test, (2) subsampling (or K-fold cross-validation) test, and (3) jackknife test [98]. Of these three, however, the jackknife test is deemed the least arbitrary that can always yield a unique outcome for a given benchmark dataset as elucidated in [54]. Accordingly, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors (see, e.g., [15,[99][100][101][102][103][104][105][106][107][108]).
Accordingly, here we also used the jackknife test to check the quality of iRNAm5C-PseDNC predictor. During the jackknifing process, both the training dataset and testing dataset are actually open, and each sample will be in turn moved between the two. The jackknife test can exclude the "memory" effect. Also, the arbitrariness problem [54] rooted in the independent dataset and subsampling tests can be completely avoided because the outcome obtained by the jackknife cross-validation is always unique for a given benchmark dataset.