microRNA profiles in urine by next-generation sequencing can stratify bladder cancer subtypes

Bladder cancer (BC) is the most frequent malignancy of the urinary tract with a high incidence in men and smokers. Currently, there are no non-invasive markers useful for BC diagnosis and subtypes classification that could overcome invasive procedures such as cystoscopy. Dysregulated miRNA profiles have been associated with numerous cancers, including BC. Cell-free miRNAs are abundantly present in a variety of biofluids including urine and make them promising candidates in cancer biomarker discovery. In the present study, the identification of miRNA fingerprints associated with different BC status was performed by next-generation sequencing on urine samples from 66 BC and 48 controls. Three signatures based on dysregulated miRNAs have been identified by regression models, assessing the power to discriminate different BC subtypes. Altered miRNAs according to invasiveness and grade were validated by qPCR on 112 cases and 65 controls (among which 46 cases and 16 controls were an independent group of subjects while the rest were replica samples). The area under the curve (AUC) computed including three miRNAs (miR-30a-5p, let-7c-5p and miR-486-5p) altered in all BC subtypes showed a significantly increased accuracy in the discrimination of cases and controls (AUC model = 0.70; p-value = 0.01). In conclusions, the non-invasive detection in urine of a selected number of miRNAs altered in different BC subtypes could lead to an accurate early diagnosis of cancer and stratification of patients.


Samples processing and rna extraction
Urine samples from each participant were collected in the morning, stored at 4° C until the processing consisting of a centrifugation at 3,000 g for 10 min. The urine supernatant aliquots were then transferred in tubes and centrifuged again at 12,000 g for 10 min at 4° C and stored at -80° C until use.
Total RNA was extracted with Urine microRNA Purification kit (Norgen biotek corp, Canada), starting from 1ml of urine and according to the manufacturer's standard protocol. The extracted RNA was eluted with 30 μL of RNase-free water. RNA quantity was determined by Qubit ® 2.0 Fluorometer with Qubit ® microRNA Assay Kit (Life technologies, Italy).
The study was conducted in two phases. In the Discovery phase, expression of miRNAs in urine from BC cases and controls matched for age and smoking was evaluated by small RNA-sequencing (small RNAseq). In the Replica/Validation phase, miRNAs that were differentially expressed in the Discovery phase were further validated on the same set of BC cases and controls (Replica) and on urine samples from independent group of cases and controls (Validation) using individual microRNA LNA PCR primer sets (Exiqon, Denmark).

rna library preparation and sequencing
For sequencing, the small RNA transcripts were converted into barcoded cDNA libraries as previously described by us [3]. Library preparation was performed with the NEBNext Multiplex Small RNA Library Prep Set for Illumina (New England BioLabs Inc., USA) followed by small RNA-seq analysis on the Illumina HiSeq2000 platform (Illumina Inc., USA). For each library, 6 μL of small RNA (~100 ng of RNA) was used in all the experimental procedures as starting material. Each library was prepared with a unique indexed primer so that the libraries could all be pooled into one sequencing lane (each pool including 24 samples). Multiplex adaptor ligations, reverse transcription primer hybridization, reverse transcription reaction and the PCR amplification were processed with regard to the protocol for library preparation (Protocol E7330, New England BioLabs Inc., USA). After PCR pre-amplification, the cDNA constructs were purified with the QIAQuick PCR Purification Kit (Qiagen, Germany) following the modifications suggested by the NEBNext Multiplex Small RNA Library Prep Protocol and loaded on the Bioanalyzer 2100 (Agilent, Germany) using the DNA High Sensitivity Kit (Agilent, Germany) according to the manufacturer's protocol.
Further quality control check and gel size selection was performed using Agencourt AMPure XP Beads (Beckman Coulter, Italy) according to the NEBNext Multiplex Small RNA Library Prep Protocol.
A concluding Bioanalyzer 2100 run with the High Sensitivity DNA Kit (Agilent, Germany) completed the workflow of library preparation. The obtained sequence libraries were subjected to the Illumina sequencing pipeline, passing through clonal cluster generation on a single-read flow cell by bridge amplification on the cBot and 50 cycles sequencing-by-synthesis on the HiSeq2000 (Illumina Inc., USA).

mirna quantification by quantitative reverse transcriptase polymerase chain reaction (qpCr)
Candidate miRNA biomarkers selected in the Discovery phase were further validated in independent urine samples by real-time quantitative PCR (qPCR) using the miRCURY LNA ™ Universal RT microRNA PCR system (Exiqon A/S, Vedbaek, Denmark). Reverse transcription (RT) was performed using the Universal cDNA synthesis kit II (Exiqon) according to the manufacturer's instructions with the addition of one spike-in (UniSp6) to the RT reaction. For qPCR, complement cDNA was diluted 1:40. Four ul of 1:40 water diluted cDNA products were mixed at 5 ul of ExiLENT SYBR Green Mastermix and 1 ul of specific miRNA probe (Exiqon). All cDNA products were prepared in duplicate PCR reactions following manufacturer's instructions. For quality control purpose, one RNA sample was measured twice and a sample containing nuclease-free water and carrier RNA was profiled as negative control. All the reactions were run on a ABI Prism 7900 Sequence Detection System (Applied Biosystems, Foster City, CA, USA). A melt curve analysis was performed for amplification specificity of each individual target per sample.

Computational analysis mirna sequencing analyses
The obtained FASTQ files were quality-checked using FastQC software (http://www.bioinformatics. babraham.ac.uk/projects/fastqc/). The base quality and the N content features were also considered: all files passed both these checks.
Small-RNA-seq analysis was as described in [4]. Briefly, reads shorter than 14 nucleotides were discarded from the analysis; the remaining reads were clipped from the adapter sequences using Cutadapt software (http://cutadapt.readthedocs.io/en/stable/index.html). In Cutadapt, the maximum error rate in terms of mismatches, insertions and deletions was set to 0.15. The trimmed reads were mapped against the precursor miRNA sequences downloaded from miRBase (Release 21) using the Shrimp algorithm [5] setting the options for miRNA sequences alignment. The use of precursor miRNAs as reference guarantees a precise and specific count detection. Only those reads with maximum 2 mismatches were retained. After reads filtering steps a matrix of integer values called count matrix was created. The value in the i-th row and the j-th column of the matrix reports how many reads have been unambiguously assigned to mature miRNA i in the sample j.
Due to the high variability among BC subtypes (i.e. invasive and non-invasive, but also non-invasive according to grade), we initially compared miRNA expression levels in healthy subjects with respect to MIBC, NMIBC G1+G2 and NMIBC G3 categories.
The candidate miRNAs were selected by uniform (ad-hoc) pipeline of analysis (adapted from [6]) composed by two statistical methods both running on the original miRNA counts matrix: (i) identification of differentially expressed miRNAs by negative binomial generalized linear models implemented in DESeq2 package in R (version 1.6.3); (ii) computation of a regression model in which single variable levels (i.e. individual miRNA expression levels) are used to predict the class label (i.e. BC patient or healthy control) of each subject.

Differentially expressed mirnas
The identification of DEmiRNAs was performed by DESeq2 package [7] (version 1.6.3) developed in the Bioconductor suite [www.bioconductor.org] through R statistical programming language (version 3.1.1).
The hidden variation that might be affecting part of the differential expression levels of miRNAs in the dataset due to batch effect was detected by the SVA package [8] (version 3.18). The number of surrogate variables estimated from SVA package considered in the further statistical analysis was equal to the number of batches.
The hypothesis tests used in DESeq2 was based on the likelihood ratio test (LRT). The LRT is useful for testing multiple terms at once; the terms considered in the present study were the following confounding variables: smoke, age, the surrogate variables identified by SVA package, and the class of samples (i.e. healthy or BC).

identification of mirnas associated with high predictive power
The identification of a set of miRNA predictors is based on the computation of a regression model in which single variable levels (i.e. miRNA expression values) are used to predict the class label (i.e. cancer or healthy) of each subject [9].
Let N be the number of subjects with n + m = N, where n being the number of subjects belonging to class 0 (C0, i.e. cancer samples) and m the number of subjects belonging to class 1 (C1, i.e. healthy samples). Let also J be the number of miRNAs in the count matrix. and healthy subject (C1), the classification vector ŷ , predicted using the j-th gene as independent variable, is obtained by applying the following criterion: The comparison of ŷ with y measures the ability of each predictor to correctly classify the subjects. This quantity is called predictive power (PP) and is defined as follows: The J values of PP form a distribution of predictive power values, describing the ability of the DEGs of classifying the samples. The P (with P ≤ J) good predictors miRNAs are chosen among those miRNAs whose PP ≥ q, with q being a quantile of the predictive powers distribution, describing the minimal number of subjects a miRNAs needs to correctly classify to be considered a good predictor.

Selection of candidate mirnas
The methodologies to determine the differential expressed miRNAs and the computation of the predictive power are applied on the miRNA counts matrix and the results are selected according the following criteria. From the first method, the candidate miRNAs (DEmiRNAs) are those associated with adjusted FDR≤0.05 and with the mean read count ≥300. From the latter method, the candidate PPmiRNAs are associated with a predictive power greater than 0.70.
The final selection of candidate miRNAs for the Replica/Validation step was based on the common miRNAs between the list of candidate miRNAs identified by both the two statistical methods. However, a careful examination of the reads counts for a single miRNA across the classes (cancer and healthy samples) was performed on all the common miRNAs and the best DEmiRNAs and PPmiRNAs in order to select those having a clear separation of the counts distribution between the two classes.
Finally, for the sake of completeness we also take into account miRNA biomarkers already reported in literature, as an additional validation of our results. In particular, miR-106b was also included in the validation step since it has been indicated as a good biomarker for BC in several papers [10][11][12].

Selection of mirnas as endogenous controls from nGS data
The selection of endogenous miRNA controls for normalizing miRNA expression in the qPCR analysis for the Replica/Validation phases was performed using the approach developed by Eisenberg and Levanon [13] and adapted to miRNA NGS data.
In details, miRNAs were selected considering the individual raw count and the following criteria: (i) the single miRNA expression must be observed in all samples (at least 2 reads for each sample), (ii) with a low variance between samples (log2 standard deviation value <12), and (iii) no outstanding expression in any sample (log2 fold change between -4 and 7).
Two reference genes (miR-28-3p and miR-361-3p) were responding to the selection criteria and were employed as endogenous controls in the qPCR analyses.

external validation in tCGa miBC tumor and normal tissues
Results from the Discovery phase between controls and MIBC cases were further compared with an openaccess dataset of MIBC individuals from The Cancer Genome Atlas (TCGA) project (for a detailed description of the generation of data see [14]).
The differential expression analysis of the raw count data between MIBC tumor and normal tissues from the same subjects was performed using DESeq2 package. P-values adjusted for multiple testing (False Discovery Rate, FDR and Bonferroni`s) were considered as statistically significant.

analysis of target genes
For selected miRNAs, the set of validated target genes was extracted by the miRWalk database [15]. EnrichR was used for gene ontological analysis and pathway enrichment [16,17].
The list of target genes for each selected miRNA was also filtered with the list of genes dysregulated in BC as reported in the BC Cluster database (http://www.bccluster. org/). The resulting genes in common were tested for their over-representation using EnrichR.
The relevance of each gene set enrichment was estimated using a p-value adjusted for multiple testing based on hypergeometric distribution. Gene sets with a probability values under 5% were considered significantly overrepresented.

analysis of qpCr data
GenEx software (Multi-D) was used for data preprocessing including inter-plate calibration, evaluation of isolation and reverse transcription efficiency, setting specific cut-offs for negative control miRNA Ct values, and duplicates averaging. The analyses were performed calculating delta Ct (DCt) values either by global mean normalization or normalization according to the two reference genes (miR-28-3p and miR-361-3p) selected from NGS. The fold-change was calculated as log 2 -∆∆CT between BC (or BC subcategories) and control samples.
MiRNAs with a Ct value > 38 were deemed to be not detected. To avoid biased inference due to qPCR nondetects (Ct value = 40) a left-censoring approach was employed. Ct values of 40 were in fact substituted with the highest observed Ct value for a given miRNA [18]. Ct values were then normalized by subtracting the Ct value of the selected endogenous controls or the global mean Ct from each of the 21 miRNAs of interest. Differential miRNA expression was determined by logistic regression adjusted for age and smoking. The unadjusted p-values< 0.05 were considered as statistically significant, since these analyses were hypothesis-driven.
To test whether miRNA expression values exhibit an increasing or decreasing behavior from healthy controls to MIBC patients, a trend test adjusted for multiple testing (FDR) was performed. Similar analyses were conducted considering only BC cases categories (i.e. NMIBC G1+G2, NMIBC G3, MIBC).
Finally, we tested for the improvement in casecontrol discrimination when considering the information given by miRNA expression together with traditional risk factors. Predictive performance was assessed with respect to the ability of each miRNA and the traditional BC risk factors to discriminate between cases and controls using the Receiver Operating Curves (ROC) of the two models by the DeLong test (pROC package). To perform the ROC curve and to assess the area under the curve (AUC), functions prediction and performance from the ROCR package were used reSultS

Sample details
In total, 116 urine samples were included to be performed by small RNA-seq in the Discovery phase. Of these, one sample was discarded since the patient was affected by bladder hyperplasia, while another sample was discarded since the number of raw reads obtained resulted extremely low. Finally, 114 samples were used for the analyses (66 BC cases and 48 controls). Among cases, 10 resulted MIBC while 56 were NMIBC (39 G1 + G2 and 17 G3) ( Table 1).
An average of 10.6 million reads were generated from the 114 libraries, ranging from 0.18 to 84.8 million reads per sample. Raw reads were trimmed for adaptor sequence: reads with length less than 14 nucleotides were discarded. This gave an average of 7.37 million reads per sample, ranging from 0.15 to 62.9 million reads (Supplementary Table 1). Out of an average of 7.37 million of trimmed reads an average of 0.65 (6%) millions of reads were mapped to the list of pre-miRNAs collected in miRBase (release 21). Considering all samples, an average of 975 (38%) unique miRNAs were identified and associated with at least one read (ranging from 302 (11%) to 1744 (68%)). After the mapping step, we created a count matrix composed by 114 samples and 1822 miRNAs having at least one read in one sample. We selected those miRNAs passing to the Discovery phase based on those having at least 20 counts considering all samples. 1787 out of 1822 miRNAs have been considered to the biomarkers identification.
The list of target genes for each selected miRNA was also filtered with the list of genes dysregulated in BC as reported in the BC Cluster database (http://www. bccluster.org/). The resulting genes in common (n = 65) were tested for their over-representation using EnrichR. Several genes were over-represented in a list of KEGG pathways and again "Pathways in cancer_Homo sapiens_ hsa05200" (Adjusted p-value ≤ 0.0000001), "MicroRNAs in cancer_Homo sapiens_hsa05206" (Adjusted p-value = 0.00001), and "Bladder cancer_Homo sapiens_hsa05219" (Adjusted p-value = 0.0001) resulted among the top 10 significant pathways (data available upon request).