Genetic variation in long noncoding RNAs and the risk of nonalcoholic fatty liver disease

The human transcriptome comprises a myriad of non protein-coding RNA species, including long noncoding RNAs (lncRNAs), which have a remarkable role in transcriptional and epigenetic regulation. We hypothesized that variants in lncRNAs influence the susceptibility to nonalcoholic fatty liver disease (NAFLD). Using next generation sequencing, we performed a survey of genetic variation associated with randomly selected lncRNA-genomic regions located within both experimentally validated and computationally predicted regulatory elements. We used a two-stage (exploratory, n = 96 and replication, n = 390) case-control approach that included well-characterized patients with NAFLD diagnosed by liver biopsy. We sequenced > 263 megabase pairs at quality score > Q17, in a total of 2,027,565 reads, including 170 lncRNA-genomic regions. In the sequencing analysis and the validated dataset, we found that the rs2829145 A/G located in a lncRNA (lnc-JAM2-6) was associated with NAFLD and the disease severity. Prediction of regulatory elements in lnc-JAM2-6 showed potential sequence-specific binding motifs of oncogenes MAFK and JUND, and the transcription factor CEBPB that is involved in inflammatory response. The A-allele was significantly associated with NAFLD as disease trait (p = 0.0081) and the disease severity (NASH-nonalcoholic steatohepatitis vs. controls: OR 2.36 [95% CI: 1.54−3.62], p = 0.000078). The A-allele carriers also have significantly higher body mass index and glucose-related traits compared with homozygous GG. Hence, our results suggest that variation in lncRNAs contributes to NAFLD severity, while pointing toward the complexity of the genetic component of NAFLD, which involves still unexplored regulatory regions of the genome.


Next generation sequencing (NGS)
DNA was isolated from whole blood, as previously described [26,27,34] and was quantified by a Qubit DNA high-sensitivity assay kit (Life Technologies, Carlsbad, CA, USA).
Library preparation for each sample was performed using the IT AmpliSeq 2.0 Beta kit following the manufacturer's instructions (Life Technologies, Carlsbad, CA, USA). Briefly, 10 ng of DNA was used as a template to generate the amplicon library for sequencing variation in the selected regions by the Ampliseq software (Life Technologies, Carlsbad, CA, USA). Genomic regions of interest were PCR amplified prior to sequencing, and the sequencing adaptors with short stretches of index sequences (96 barcodes) that enabled sample identification were ligated to the amplicons using the IT Xpress barcode adaptor kit. The prepared library was quantified using the Ion library TaqMan Quantitation Kit. Sequencing template preparation (emulsion PCR and beadsenrichment) from sequencing libraries was carried out using an Ion OneTouch Template Kit and Ion OneTouch system (Ion OneTouch Instrument and Ion OneTouch ES, Life Technologies, Carlsbad, CA, USA) according to the manufacturer's protocol. Prepared templates were sequenced using Ion Sequencing Kit v2. Ion Torrent Suite software 4.2.1 (Life Technologies, Carlsbad, CA, USA) was used for converting raw signals into base calls and extracting FASTQ files of sequencing reads. Number of nucleotide flows during sequencing was set to 500 cycles.

Variant calling, estimation of quality control, data analysis and prediction of variant/ mutation effect
The data obtained from the Ion Torrent PGM were processed using the Ion Torrent Suite Software v 4.2.1 (Life Technologies, Carlsbad, CA, USA). Variants were annotated with dbSNP (http://www.ncbi.nlm.nih.gov/ SNP/) IDs using SnpSift. In silico analysis, aimed at predicting gene and transcript functional consequences was performed by the bioinformatic tool variant Effect Predictor (VEP) (http://useast.ensembl.org/Homo_ sapiens/Tools/VEP), employing ENSEMBL transcripts and SnpEff platform (http://snpeff.sourceforge.net/) using University of California, Santa Cruz (UCSC) transcripts. CellBase available at http://docs.bioinfo.cipf.es/projects/ cellbase was used to annotate variants with the phenotype information from HGMD, ClinVar, UNIPROT and COSMIC. Annotation in http://www.ncbi.nlm.nih.gov/ SNP was also used to determine whether variants were novel or already associated with a phenotype.
Read alignment to the hg19 Human genome reference sequence was performed using torrent mapping alignment program (TMAP) included in PGM. The Ion Torrent variant caller plug-in was used to perform the variant calling with the germline low stringency settings. In order to filter erroneous base callings, control quality filtering steps were performed using proprietary Perl scripts. At the variant coverage depth > 20, we inspected the presence of strand bias by checking the number of bases covered in both strands for the variant, the reference >= 4 reads in both strands, and the quality for variant calling > 17. The retained variants were visually examined using Integrative Genomics Viewer (IGV) software (http://www.broadinstitute.org/igv/) to check for any inconsistency in the base calls.
RegulomeDB, a database that annotates SNPs with known and predicted regulatory elements in the intergenic regions of the H. sapiens genome, was used for the prediction of variants effects' (http://www.regulomedb. org/snp/). HaploReg tool, available at http://compbio. mit.edu/HaploReg, was used for exploring regulatory elements in the selected SNP/s, including information of variants in linkage disequilibrium (LD) from the 1000 Genomes Project, linked SNPs and small indels along with chromatin state and protein binding annotation from the Roadmap Epigenomics and ENCODE projects, and the effect of SNPs on regulatory motifs.

Prediction of functional elements potentially associated with regulation of gene expression in the sequenced regions
We focused on the following data: (1) ENCODE Project transcription factors binding sites (TFBS) for HepG2 cell line and (2) Roadmap Epigenomics Project 15-state chromatin segmentations for various cell lines (HepG2, adult liver, adipose nuclei, heart left ventricle, heart right ventricle and heart right atrium). Each region of interest was bioinformatically assessed using AnnotationHub, an R package available through Bioconductor as explained elsewhere [18].
The ENCODE (Encyclopedia Of DNA Elements) project (https://www.genome.gov/encode/) was used to predict TFBS; the cell type selected was HepG2 (which is a cell line derived from a male patient with liver carcinoma). The rationale of this selection is: 1-our work was focused on the liver, 2-this is a model system for metabolism disorders, 3-the cell line represents the endoderm lineage, 4-ENCODE only contains information on designated cell types.
TFBS data is based upon ChIP-seq experiments and consists of peak calls (regions of enrichment) based on an uniform processing pipeline developed for the ENCODE Integrative Analysis effort. The score values were computed at UCSC based on signal values assigned by the ENCODE uniform analysis pipeline. The input signal values were multiplied by a normalization factor calculated as the ratio of the maximum score value (1000) to the signal value at 1 standard deviation from the mean, with values exceeding 1000 capped at 1000. This has the effect of distributing scores up to mean + 1std across the score range, but assigning all above to the maximum score.
The NIH Roadmap Epigenomics Mapping Consortium (http://www.roadmapepigenomics.org/) was used to predict information on chromatin accessibility in the sequenced regions chromatin segmentations. Chromating state learning data is based upon different chromatin marks (H3K4me3, H3K4me1, H3K36me3, H3K27me3, H3K9me3) in their spatial context (chromatin states) across the epigenome that were analyzed with ChroHMM v1.10 by the Roadmap Epigenomics Project. The trained model was then used to compute the posterior probability of each state for each genomic bin in each cell type. Our regions of interest were labeled using the state with the maximum posterior probability.

Prediction of miRNAs target genes and pathway analysis
The Co-expression Meta-analysis of miRNA Targets (CoMeTa) available at http://cometa.tigem.it/ index.php was used to predict target genes of miRNAs as well as pathway analysis. This platform is based on the assumption that the targets of a given miRNA are likely to be co-expressed and therefore to belong to the same miRNA gene network. The CoMeTa tool aims at the inference of miRNA targets and miRNA-regulated gene networks by integrating expression data from hundreds of cellular and tissue conditions. CoMeTa integrates expression data from hundreds of cellular systems and multiple tissues along with the analysis of 675 human miRNAs.