Small non-coding RNA profiling in human biofluids and surrogate tissues from healthy individuals: description of the diverse and most represented species

The known roles of non-coding RNAs in biological processes and diseases are continuously expanding. Next-generation sequencing, together with parallel improvements in bioinformatics analyses, allows the accurate detection and quantification of an increasing number of RNA species. To explore new potential biomarkers for disease classification, a clear overview of the expression levels of common and unique small RNA species among different biospecimens is necessary. However, except for miRNAs in plasma, there are no substantial data on the expression patterns of the various small RNAs in multiple specimens from healthy humans. By analysing small RNA-sequencing data from 243 samples, we identified and compared the most abundantly and uniformly expressed miRNAs, and non-miRNA species of a size compatible with the library preparation, in four different specimens (plasma exosomes, stool, urine, and cervical scrapes). Eleven miRNAs were detected in all specimens, while a total of 231 miRNAs were unique to a single specimen type. Classification analysis using these miRNAs recognized the sample types with an accuracy of 99.6%. piRNAs and tRNAs were the most represented non-miRNA small RNAs in all specimen types analysed, particularly in urine samples. The present data also allowed the identification of the most uniformly expressed small RNAs in each sample type. A signature of small RNAs for each specimen could represent a reference gene set in validation studies by RT-qPCR. Overall, the data reported here provide insight into the composition of the human miRNome and of other small non-coding RNAs in various specimens from healthy individuals.


Detection of isomiRNAs
IsomiR analysis was performed with the isomiRID algorithm [4] using default settings. Only isomiRs with a median number of reads greater than 20 in at least one biospecimen were considered. A maximum of three mismatches between reads and reference miRNA sequences was allowed.
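The abundance filter described above can be illustrated with a minimal sketch. This is not part of isomiRID; the function name, the data layout, and the example read counts are hypothetical and serve only to show the "median greater than 20 in at least one biospecimen" rule.

```python
from statistics import median

def passes_median_filter(counts_by_specimen, threshold=20):
    """Return True if the median read count is strictly greater than
    `threshold` in at least one specimen type."""
    return any(
        median(counts) > threshold
        for counts in counts_by_specimen.values()
        if counts  # skip specimen types without samples
    )

# Hypothetical read counts per sample, grouped by specimen type.
example = {
    "isomiR_A": {"plasma": [5, 30, 42], "urine": [0, 2, 1]},
    "isomiR_B": {"plasma": [3, 8, 19], "urine": [20, 20, 20]},
}

kept = [name for name, counts in example.items() if passes_median_filter(counts)]
```

In this toy example only isomiR_A is retained, because a median of exactly 20 does not exceed the threshold.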

Analysis of other sncRNAs
The set of small RNA-Seq reads not aligned by SHRiMP to miRNA sequences was aligned against the human genome assembly hg38 (GRCh38) using Bowtie2 v2.2.7 with default settings [5].
Read alignment files were used to quantify the expression of ncRNA annotations from Gencode v24 [6] and the Database of small human non-coding RNAs (DASHR) [7]. Specifically, Gencode v24 was used to isolate ncRNA annotations shorter than 70 bp; 276 ncRNAs met this threshold. DASHR was used to identify the set of piRNA (average length 31 ± 1 bp) and tRNA (average length 74 ± 7 bp) annotations. In total, 34,175 piRNA and 643 tRNA annotations were isolated from DASHR (Supplementary Table 1C).
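As an illustration of the length threshold applied to the Gencode annotations, the following sketch assumes 1-based inclusive coordinates (the GTF convention) and a simple list of records. The record names and coordinates are hypothetical, and whether length was computed on the genomic span or on the transcript is not stated in the text.

```python
def feature_length(start, end):
    """Length in bp of a feature with 1-based inclusive coordinates."""
    return end - start + 1

def short_ncrnas(annotations, max_length=70):
    """Keep annotations strictly shorter than `max_length` bp."""
    return [a for a in annotations if feature_length(a["start"], a["end"]) < max_length]

# Hypothetical records; names and coordinates are illustrative only.
annotations = [
    {"name": "ncRNA_1", "start": 1000, "end": 1068},  # 69 bp, kept
    {"name": "ncRNA_2", "start": 2000, "end": 2069},  # 70 bp, dropped
    {"name": "ncRNA_3", "start": 3000, "end": 3120},  # 121 bp, dropped
]

kept = [a["name"] for a in short_ncrnas(annotations)]
```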
Reads mapping to ncRNA loci were counted using the featureCounts algorithm from the Subread v1.5.0 package (Liao et al., 2014). The algorithm was run with the options -O and -M, which count reads overlapping more than one feature and multi-mapping reads, respectively. Reads mapped to Gencode v24, piRNA, and tRNA genes were counted separately.
To identify the ncRNAs expressed in each biospecimen, annotations with a median read count greater than 20 were selected. Read counts were then normalized by computing the library size factor [8]. The read count tables from the three studies of plasma exosome samples were merged into a single dataset. Since these datasets were generated in independent studies, surrogate variable analysis (SVA) [3] was performed to correct the read counts. The analysis was performed using the svaseq function of the sva package, setting the number of surrogate variables to three.
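The text does not specify how the library size factor was computed. The sketch below assumes the widely used median-of-ratios estimator, which may differ from the procedure in [8]; the function name, data layout, and counts are hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors (assumed method, not confirmed by the text).

    counts: dict mapping sample name -> list of read counts, with features
    in the same order for every sample.
    """
    samples = list(counts)
    n_features = len(counts[samples[0]])

    # Log geometric mean per feature, using only features with
    # non-zero counts in every sample.
    log_geo_means = []
    for i in range(n_features):
        values = [counts[s][i] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append((i, sum(math.log(v) for v in values) / len(values)))

    # Size factor = median ratio of a sample's counts to the geometric means.
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][i]) - lg for i, lg in log_geo_means]
        factors[s] = math.exp(median(log_ratios))
    return factors

# Hypothetical counts: sample B was sequenced twice as deeply as sample A.
raw = {"A": [10, 20, 40], "B": [20, 40, 80]}
factors = size_factors(raw)
normalized = {s: [c / factors[s] for c in raw[s]] for s in raw}
```

After dividing by the size factors, the two toy samples have matching normalized counts, which is the intended effect of correcting for sequencing depth.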

External data integration
The set of sncRNAs identified in this study was compared with public lists of sncRNAs detected in specimens and tissues from healthy individuals, as reported in the supplementary materials of relevant publications and in databases. Specifically, normalized expression values from the DASHR database were used to compare the expression of miRNAs and other sncRNAs in plasma, serum, and eight human tissues related to the biospecimens analysed in this study. Supplementary data from [9] were used to verify miRNA expression levels in plasma and urine samples. Data from [10] and [11] were used to assess the expression of miRNAs in normal colon tissues. The expression levels of miRNAs and tRNAs in 40 plasma samples were retrieved from [12]. miRNA expression levels from different specimens were retrieved from GSE85830 [13].
ExoCarta database [14] annotations were used to retrieve information about miRNA expression in extracellular vesicles.

Bioinformatic tools
The lists and expression levels of sncRNAs identified in the different specimen types were compared using Venn diagrams and the heatmap.2 R function. Principal component analysis (PCA) was performed using the prcomp R function and the autoplot function from the ggfortify R package. The contribution of the expression level of each sncRNA to the classification of specimen type was evaluated using Weka 3.6.12 [15]. The Weka RandomForest classifier was applied with default settings and 10-fold cross-validation. The contribution of each covariate to the classification results was evaluated using Weka ChiSquaredAttributeEval, which scores each attribute (sncRNA expression) by the chi-squared statistic testing its independence from the class (specimen type). The miRNA functional enrichment analysis was performed using the EnrichR web tool [16] on the list of validated miRNA targets annotated in the miRWalk 2.0 database [17].
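The chi-squared attribute evaluation can be illustrated with a simplified sketch that computes the statistic from the contingency table of a discretized attribute against the class. This is not the Weka implementation: the expression values are assumed to be already discretized into "high" and "low", and all labels and data are hypothetical.

```python
from collections import Counter

def chi_squared(attribute_values, class_labels):
    """Chi-squared statistic for independence between a discrete
    attribute and the class labels."""
    n = len(class_labels)
    observed = Counter(zip(attribute_values, class_labels))
    attr_totals = Counter(attribute_values)
    class_totals = Counter(class_labels)

    stat = 0.0
    for a, a_total in attr_totals.items():
        for c, c_total in class_totals.items():
            # Count expected in this cell if attribute and class were independent.
            expected = a_total * c_total / n
            stat += (observed[(a, c)] - expected) ** 2 / expected
    return stat

specimens = ["urine"] * 4 + ["stool"] * 4

# Hypothetical sncRNA that is high in urine and low in stool.
informative = ["high"] * 4 + ["low"] * 4
# Hypothetical sncRNA whose level is unrelated to the specimen type.
uninformative = ["high", "low"] * 4

score_informative = chi_squared(informative, specimens)
score_uninformative = chi_squared(uninformative, specimens)
```

A larger statistic indicates a stronger departure from independence, so the first toy sncRNA would rank above the second as a discriminating attribute.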