iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences

Catalyzed by adenosine deaminase (ADAR), the adenosine to inosine (A-to-I) editing in RNA is not only involved in various important biological processes, but also closely associated with a series of major diseases. Therefore, knowledge about the A-to-I editing sites in RNA is crucially important for both basic research and drug development. Given an uncharacterized RNA sequence that contains many adenosine (A) residues, can we identify which one of them can be of A-to-I editing, and which one cannot? Unfortunately, so far no computational method whatsoever has been developed to address such an important problem based on the RNA sequence information alone. To fill this empty area, we have proposed a predictor called iRNA-AI by incorporating the chemical properties of nucleotides and their sliding occurrence density distribution along a RNA sequence into the general form of pseudo nucleotide composition (PseKNC). It has been shown by the rigorous jackknife test and independent dataset test that the performance of the proposed predictor is quite promising. For the convenience of most experimental scientists, a user-friendly web-server for iRNA-AI has been established at http://lin.uestc.edu.cn/server/iRNA-AI/, by which users can easily get their desired results without the need to go through the mathematical details.


INTRODUCTION
RNA editing is a post-transcriptional modification that changes the genomic template through the insertion, deletion, deamination or substitution of nucleotides within the edited RNA molecule.Among the five types of RNA editing reported so far, the modification from adenosine to inosine, the so-called "A-to-I" editing, is the most common one [1,2].This type of editing is catalyzed by the enzyme called "adenosine deaminase" (ADAR) [3] as shown in Figure 1.The concrete process is: adenosine is deaminated to inosine, followed by decoding to become guanosine (G) due to the polymerase enzyme and translational machinery [4].
Besides altering the genetic code that can expand the transcriptome and proteome, the A-to-I editing may also involve in various important biological processes ranging from alternative splicing [5], nonsense-mediated mRNA decay [6] to gene expression and translation [7].RNA secondary and teritary structures may also be affected by A-to-I editing [8].In addition, RNA A-to-I editing was also found to be closely associated with the formation of cancers [9][10][11] and a series of major diseases by editing of glutamate receptors, editing of serotonin receptors and by other mechanisms [12].Therefore, knowledge about the A-to-I editing sites in RNA is crucially important for both basic research and drug development.
Although the experimental method called "RNA-Seq" is a powerful tool for determining the RNA editing

Research Paper
candidates [13,14], it is time-consuming.Besides, since the A-to-I editing sites determined by the RNA-Seq tool are derived indirectly from the A-to-G sites rather than directly from the original A-to-I sites themselves, it is very difficult to discriminate RNA editing events from the case of A-to-G mutations in the single-nucleotide polymorphism (SNP) [15] that simply does not exist in the reference genome.And hence the A-to-I editing sites obtained by RNA-Seq often include many false positive ones.
Therefore, it would be very useful for in-depth genome analysis or drug development to develop a sequence-based computational method that can effectively predict which adenosine sites in a RNA sequence can be "A-to-I" edited, and which ones cannot.
The present study was devoted to address this problem.

RESULTS AND DISCUSSION
A computational method called "iRNA-AI" has been developed.It is the first predictor ever established by using the computational approach and sequence information alone to identify human A-to-I editing sites.
Rigorous cross-validations on a well-established benchmark dataset (Supporting Information S1) have shown that the iRNA-AI predictor can achieve very high scores in sensitivity (Sn), specificity (Sp), overall accuracy (Acc), and stability (MCC); i.e., For the rigorous but intuitive definitions about S n , S p , Acc, and MCC, see Eq.16 given later.
Since so far there is no other existing computational method whatsoever for predicting the A-to-I editing sites in RNA of human transcriptome, it is not possible to show the predictor's power by the comparison manner between counterparts.Nevertheless, the power of iRNA-AI and its quality can be examined via a practical application on a set of experiment-confirmed independent dataset (Supporting Information S2), which contains 3,243 true A-to-I editing sites and 3,243 false A-to-I editing sites.The corresponding success rates thus obtained are given by n p S 84.19% S 89.36% Acc 93.81% MCC 0.80 The above results indicate that the success rates achieved by the predictor iRNA-AI on the independent dataset of Supporting Information S2 are quite consistent with those via the jackknife test on the benchmark dataset of Supporting Information S1.
For the convenience of most experimental scientists, the web-server of iRNA-AI has been established at http:// lin.uestc.edu.cn/server/iRNA-AI/.
Moreover, to maximize the users' convenience, a step-by-step guide has been provided in Supporting Information S3, by which users can easily get their desired results.
Because knowledge about the positions of A-to-I editing sites would be of great help for in-depth understanding the biological functions and processes concerned.It is anticipated that iRNA-AI will become a useful high throughput tool for understanding the biological significance of A-to-I RNA editing, or at the very least, a complementary tool to the existing experimental methods in this regard.Although the detailed ADAR's action mechanism is not well understood yet, the crystal structure of human ADAR has been solved (PDB code: 3IAR) that provides structural basis for the elucidation of its catalytic mechanisms and its specific recognition of the target sequence.Since both computational biology and structural biology have made great contributions to understand enzyme activities and catalytic mechanisms [16,17], it is anticipated that the current iRNA-AI predictor will become a useful tool for revealing the catalytic mechanism of ADAR.

MATERIALS AND METHODS
Prediction is always difficult, particularly in dealing with a complicated biological system as studied here.Nevertheless, a prediction method would be deemed rewarding or successful if it could timely help getting some useful information or stimulate and inspire some other relevant methods.To realize this, we should make the following five procedures very clear as done in a series of recent publications [18][19][20][21][22][23][24][25][26][27][28][29] according to the Chou's 5-step rules [30]: (1) how to construct or select a valid benchmark dataset to train and test the model; (2) how to formulate the biological sequence samples with an effective mathematical expression that can truly reflect their essential correlation with the target concerned; (3) how to introduce or develop a powerful algorithm (or engine) to run the prediction; (4) how to properly conduct cross-validation tests to objectively evaluate the anticipated accuracy; (5) how to provide a web-server and user guide to make people very easily to get their desired results.Since the content about the web-server has already been described in the RESULTS AND DISCUSSION section, below let us address the other four procedures one-by-one.

Benchmark dataset
For facilitating formulation, the Chou's sequential scheme [31] was adopted.It was successfully used to study signal peptide cleavage sites [32,33], hydroxyproline and hydroxylysine sites [22,34], methylation sites [35][36][37], nitrotyrosine sites [38,39], carbonylation sites [19], phosphorylation sites [24], sumoylation sites [29], and protein-protein binding sites [40,41].According to Chou's scheme, a potential RNA A-I editing site sample can be generally expressed by where the symbol  denotes the single nucleic acid code A (adenine), the subscript ξ is an integer, N -ξ represents the ξ-th upstream nucleotide from the center, the N + ξ the ξ-th downstream nucleotide, and so forth.The (2ξ + 1)-tuple RNA sample ( ) ξ R  can be further classified into the following two categories: , if its center can be of A-to-I editing , otherwise where R  denotes a true A-to-I editing segment with A at its center, ( ) − ξ P  a false one with A at its center, and the symbol ∈ means "a member of" in the set theory.
The benchmark dataset is derived from DARNED database [42] that contains 333,216 A-to-I editing sites confirmed by experiments.The detailed procedures to construct the benchmark dataset are as follows.(1) As done in [43], by sliding the (2ξ + 1)-tuple nucleotide window (Figure 2) along each of the RNA sequences taken from DARNED database, collected were only those RNA segments with A =  at the center.( 2) The RNA segment samples thus obtained were marked with a positive label if their centers were experimentally annotated as the A-to-I editing sites, while those with a negative label if their centered adenosine could not be edited to inosine as confirmed by experiments.(3) To reduce redundancy or homology bias, we used the CD-HIT program [44] to remove those RNA segments that had 60% pairwise sequence identity with any other in a same-labeled group.By strictly following the above procedures, we obtained an array of benchmark datasets with different ξ values, and hence different lengths of RNA samples as well (see Eq.3), as illustrated below 37 nucleotides, when 18 39 nucleotides, when 19 41 nucleotides, when 20 49 nucleotides, when 24 51 nucleotides, when 25 43 nucleotides, when 26 −ξ +ξ along a RNA sequence.During the sliding process, the scales on the window are aligned with different nucleotides so as to define different (2ξ + 1) -nt RNA samples.Adapted from Chou [43] with permission.See the text for further explanation.
In Eq.5 the symbol  means "formed by".But it was observed via preliminary tests that when 25 ξ = (i.e., the RNA samples formed by 51 nucleotides), the corresponding successful scores (see Eq.16 later) were most promising.Accordingly, hereafter we only consider the 51-tuple nucleotide samples.
After going through the above procedures, we obtained 6,243 positive label samples, from which we randomly picked out 3,000 to form the positive subset for the benchmark dataset.But keep it in mind that the remaining 3,243 samples would be used later for other purpose.
The corresponding negative label samples were substantially more than the positive ones.Although this reflects the fact that in the real world most adenosine nucleotides in RNA cannot be edited to inosine, a machine learning predictor trained by an imbalanced or highly skewing benchmark dataset may negatively affect its performance [36,45].To balance out the numbers between positive and negative samples for model training, we also randomly picked out 3,000 negative label samples to form the negative subset for the benchmark dataset.Thus, the benchmark dataset  can be formulated as where the positive subset +  contains 3,000 true A-to-I editing site-containing RNA sequences, while the negative subset −  contains 3,000 false A-to-I editing sitecontaining RNA sequences, and  denotes the symbol of "union" in the set theory [46].
The detailed sequences for the samples in the benchmark dataset are given in Supporting Information S1.
In literature the benchmark dataset usually consists of a training dataset and a testing dataset: the former is for the usage of training a model, while the latter for testing the model.But as elucidated in a comprehensive review [46], there is no need to artificially separate a benchmark dataset into the two parts if the prediction model is examined by the jackknife test or subsampling (K-fold) cross-validation since the outcome thus obtained is actually from a combination of many different independent dataset tests.According to such a point of view, it is enough to use the benchmark dataset of Eq.6 alone for the current study.It is instructive, however, to use the proposed predictor on an independent dataset for demonstrating its practical application.The independent dataset Ind  can be formulated as Ind Ind Ind where the positive independent subset Ind +  contains the remaining 3,243 samples mentioned above, while the negative independent subset Ind −  also contains 3,243 samples constructed in a way similar to that of −  in Eq.6.None of the samples in the independent dataset Ind   occurs in the benchmark dataset  .
The detailed sequences for the samples in the independent dataset are given in Supporting Information S2.

Formulation of RNA samples
With the avalanche of biological sequences emerging in the post-genomic era, one of the most challenging problems in computational biology is how to formulate a biological sequence with a discrete model or vector that can, however, reflect its essential sequence pattern or feature.This is indeed indispensible since nearly all the existing machine-learning algorithms were devised to handle vectors but not sequences, as elucidated in a recent review [47].Unfortunately, a biological sequence expressed with a vector might totally lose its sequenceorder information [48] and sequence pattern features as well.To deal with such a problem for protein/peptide sequences, the pseudo amino acid composition (PseAAC) [49][50][51] was proposed.Ever since the concept of PseAAC was proposed in 2001 [48], it has been widely used in nearly all the areas of computational proteomics (see the long lists of papers cited in two review papers [51,52].Inspired by its great successes, the concept of PseAAC has been extended to cover DNA/RNA sequences as well by introducing the pseudo K-tuple nucleotide composition (PseKNC) [53,54], which has been proved very useful in computational genetics/genomics [55,56] as well as conducting various genome analyses (see, e.g., [26,28,[57][58][59][60][61][62][63][64] and a review article [65]).Also, because the approach of pseudo components has been increasingly used in both computational proteomics and genomics, a very powerful web-server called "Pse-in-One" [66] has been established that can be used to generate various modes of pseudo components for both protein/peptide and DNA/RNA sequences.
According to [65], the general form of PseKNC for an RNA sequence sample can be expressed as [ ] where T is a transpose operator, while the subscript Z is an integer and its value as well as the components u φ ( ) will depend on how to extract the desired information from the RNA sequence sample concerned.To enable Eq.8 to reflect both the short-and long-range sequence coupling information within the RNA sample, we are to use the nucleotide chemical property and nucleotide density to define its components as described below.

Physicochemical properties of nucleotides
RNA is formed by four types of nucleotides: A (adenosine), C (cytidine), G (guanosine), and U (uridine).Among the four types: (1) A and G have two rings, whereas C and U only one; (2) from the angle of chemical functionality, A and C can be categorized as amino group, while G and U as keto group; (3) in forming the secondary or tertiary structure, there are three hydrogen bonds between C and G but only two between A and U (Figure 3), and hence, the former is stronger than the latter in hydrogen bonding, which would play different roles for the low-frequency vibration [67,68] and its biological function accordingly [69,70].Therefore, the four types of nucleotides can be classified into three different groups as shown in Table 1.

Distribution density of nucleotides
To reflect the occurrence frequency of a nucleotide and its distribution along the sequence of a RNA sample, we use the following equations where D i is the density of the nucleotide N i at the site i of a RNA sequence, L i the length of the sliding substring concerned, @ denotes each of the site locations counted in the substring, and 1, if N the nucleotide concerned (N ) 0, otherwise For instance, suppose a RNA sequence "ACGUA".The density of "A" at the sequence position 1, 2, 3, 4, or 5 is 1 1/1 = , 0.5 1/ 2 = , 0.33 1/ 3 ≈ , 0.25 1/ 4 = , or 0.4 2 / 5 = , respectively; that of "C" is 0 0 /1 = , 0.5 1/ 2 = , 0.33 1/ 3 ≈ , 0.25 1/ 4 = , or 0.2 1/ 5 = , respectively; and so forth.By combing Eq.9 and Eq.11, the i-th nucleotide of Eq.3 can be uniquely defined by a set of four variables; i.e.,

Support vector machine (SVM) operation engine
Being a machine learning algorithm based on statistical learning theory, SVM has been widely and successfully used in the realm of bioinformatics [58,62,[73][74][75] and computational biology [18,36,[59][60][61]76].The basic idea of SVM is to transform the input data into a high dimensional feature space and then determine the optimal separating hyperplane.For a brief formulation of SVM and how it works, see the papers [77, 78]; for more details about SVM, see a monograph [79].
In the current study, the LibSVM package 3.18 was used to implement SVM, which can be freely downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/.Because of its effectiveness and fast speed in training process, the radial basis kernel function (RBF) was used to obtain the best classification hyperplane here.In the SVM operation engine, the regularization parameter C and the kernel width parameter γ were optimized via an optimization procedure using the grid search approach as described by 2 with step with step where C ∆ and ∆γ represent the step gaps for C and , γ respectively.
The predictor obtained via the above process is called iRNA-AI, where ''i'' stands for ''identify", and ''AI'' for ''A-to-I editing" sites in RNA sequence.

Prediction quality examination
How to objectively evaluate the anticipated success rates is an indispensible step in developing a new predictor [30].To address this, we need to consider two issues: one is what metrics should be used to reflect the predictor's success rates; the other is what test method should be adopted to derive the metrics rates.
To quantitatively evaluate the quality of a binary classification predictor, four metrics are generally needed [80].They are: (1) Acc for the predictor's overall accuracy; (2) MCC for its stability; (3) Sn for its sensitivity; and (4) Sp for its specificity.Unfortunately, the conventional formulations for the four metrics are not quite intuitive and most experimental scientists feel difficult to understand them, particularly the stability of MCC.Fortunately, as elaborated in [59,81], by using the Chou's symbols and derivation in studying signal peptides [82], the conventional metrics can be converted into a set of four intuitive equations, as formulated below: where N + represents the total number of A-to-I editing samples investigated, while N + − is the number of true A-to-I eting samples incorrectly predicted to be of non-Ato-I editing sample; N − the total number of the non-A-to-I editing samples investigated, while N − + the number of the non-A-to-I editing samples incorrectly predicted to be of true A-to-I editing sample.Now it is crystal clear to see the following from Eq.16.When 0 N + − = meaning none of the true A-to-I editing samples are incorrectly predicted to be of non-A-to-I editing See the section of "Physicochemical Properties of Nucleotides" for further explanation.
sample, we have the sensitivity Sn = 1.When N N + + − = meaning that all the true A-to-I editing samples are incorrectly predicted to be of non-A-to-I editing sample, we have the sensitivity Sn = 0. Likewise, when 0 N − + = meaning none of the non-A-to-I editing samples are incorrectly predicted to be of true-A-to-I editing sample, we have the specificity Sp = 1; whereas N N − − + = meaning that all the non-A-to-I editing samples are incorrectly predicted to be of true A-to-I editing samples, we have the specificity Sn = 0. Accordingly, it has rendered the meanings of sensitivity, specificity, overall accuracy, and stability much more intuitive and easier-to-understand by using Eq.16, particularly for the meaning of MCC, as concurred recently by many investigators (see, e.g., [36,38,40,45,61,62,75,[83][84][85][86][87][88]).
Note that, however, the set of metrics as defined in Eq.16 is valid only for the single-label systems.As for the multi-label systems whose emergence has become more frequent in system biology [89][90][91] and system medicine [92] or biomedicine [25], a completely different set of metrics are needed as elucidated in [93].
With a set of good metrics to measure the quality of a predictor, the next thing we need to consider is what validation approach should be adopted to score these metrics.In statistical prediction, the following three crossvalidation methods are usually applied: (1) independent dataset test, (2) subsampling (or K-fold cross-validation) test, and (3) jackknife test [94].Of these three, however, the jackknife test is deemed the least arbitrary that can always yield a unique outcome for a given benchmark dataset as elucidated in [30].Accordingly, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors (see, e.g., [63,76,78,[95][96][97][98][99][100][101][102][103][104][105][106][107]).In view of this, here we also used the jackknife test to examine the quality of iRNA-AI predictor.During the jackknifing process, both the training dataset and testing dataset are actually open, and each sample will be in turn moved between the two.The jackknife test can exclude the "memory" effect.Also, the arbitrariness problem as mentioned in [30] with the independent dataset and subsampling tests can be totally avoided since the outcome obtained by the jackknife crossvalidation is always unique for a given benchmark dataset.
Even though, however, in order to reduce the computational time, the K-fold cross-validation approach has still often been used, as done by many investigators with SVM as the prediction engine (see, e.g., [25,28,102,107]).Also, for demonstrating the practical application of a predictor, the independent dataset test has often been used as well [18].

Figure 1 :
Figure 1: An illustration to show the most common type of RNA editing, a modification from adenosine (A) to inosine (I) or the "A-to-I editing" that is catalyzed by the adenosine deaminase (ADAR).See the text for further explanation.

Figure 2 :
Figure 2: Illustration to show the sequence segments highlighted by sliding the scaled window [ ] ,

Figure 3 :
Figure 3: Illustration to show the structure of paired nucleic acid residues.The upper panel is the A-U pair bonded to each other with two hydrogen bonds; the lower panel is the G-C pair with three hydrogen bonds.It can also be seen from the figure that A and G have two rings, while C and U have one ring.Also, according to chemical functionality, A and C can be classified into the amino group, while G and U into the keto group.See the main text for further explanation.
of true A-to-I editing samples in the positive dataset and none of the non-A-to-I editing samples in the negative dataset are incorrectly predicted, we have the overall accuracy Acc = 1 and MCC = 1; when N N the true A-to-I editing samples in the positive dataset and all the non-A-to-I editing samples in the negative dataset are incorrectly predicted, we have the overall accuracy Acc = 0 and MCC = -1; whereas when / have Acc = 0.5 and MCC = 0 meaning no better than random guess.