Elucidating drivers of oral epithelial dysplasia formation and malignant transformation to cancer using RNAseq

Oral squamous cell carcinoma (OSCC) is a prevalent cancer with poor prognosis. Most OSCC progresses via a non-malignant stage called dysplasia. Effective treatment of dysplasia prior to potential malignant transformation is an unmet clinical need. To identify markers of early disease, we performed RNA sequencing of 19 matched HPV negative patient trios: normal oral mucosa, dysplasia and associated OSCC. We performed differential gene expression, principal component and correlated gene network analysis using these data. We found differences in the immune cell signatures present at different disease stages and were able to distinguish early events in pathogenesis, such as upregulation of many HOX genes, from later events, such as down-regulation of adherens junctions. We herein highlight novel coding and non-coding candidates for involvement in oral dysplasia development and malignant transformation, and speculate on how our findings may guide further translational research into the treatment of oral dysplasia.

Clinical information for the 19 HPV-negative patients used in this study is given in Supplementary  Table S1. Each patient had three samples sequenced: normal oral mucosa (N), dysplasia (D) and OSCC tumour (T). On average 490ng total RNA was used per sample resulting in an average library output of weight: 120ng and fragment length: 200bp. Each library passed quality assessment via the bioanalyser and, following processing, every sequencing file (fastq) passed the per-base and per-sequence quality tests in the FastQC programme. Sequencing resulted, on average, in 79 ± 33 million reads, with 76% ± 8% aligning, per sample (Supplementary Table S2). These metrics indicate that the RNA extracted from FFPE tissue was sufficient for high-quality sequencing results; highly-degraded RNA would produce adapter-contaminated libraries (evident on bioanalyser traces) and not result in such high percentages of unique alignments. Gene expression was quantified as Fragments per Kilobase per million Mapped (FPKM) for each of 62,766 protein-coding and non-coding genes. In total 29,733 of these were shown to be expressed (average FPKM across any of the three groups: N, D or T, of at least 0.1 FPKM) and used in downstream analysis.

Study design and patient selection
Patients were selected from an existing prospective sample collection from patients attending the Oral surgery outpatients' clinic of Leeds General Infirmary with potential or proven cancerous oral lesions. The study was approved by the local ethics committee; (REC ref no 07/Q1206/30 and 08/H1306/127). All patients provided informed consent and were anonymised. All formalin-fixed paraffin-embedded (FFPE) blocks from the patient's surgical sample were then obtained from archive to produce the research samples used in this study.
Criteria for patient selection were as follows: patients had (i) OSCC, (ii) no previous OSCC (iii) no adjuvant treatment prior to surgery, (iv) tumour, normal epithelium and dysplastic epithelium present in their surgical sample and (v) clinical diagnosis of HPV negative tumour (subsequently confirmed via RNA analysis, see below), were selected for further sampling (see Supplementary Table S1). All blocks derived from the surgical samples were then reviewed by a pathologist to identify the most representative blocks from which to source each sample type for every patient. Care was taken to select non-adjacent areas for each sample type wherever possible.

rNA extraction from FFPE samples
Total RNA was prepared from the above selected samples from macro-dissected FFPE tissue using a commercially available kit. Briefly, 4 μm-thick sections were cut from each selected FFPE tissue block and stained with H&E. An independent pathologist, blind to the patient identity and diagnosis, reviewed all the marked H&E slides in order to (i) confirm the diagnosis and histology reported in the original pathology report; (ii) mark representative areas of normal epithelium, dysplastic epithelium and cancer for macro-dissection and (iii) evaluate and record the percentage of tumour cells in the marked area ( Figure 1). Seven further consecutive 10 μm-thick sections were cut from each block for RNA extraction, followed by a final 4 um slide for H&E staining and review by the pathologist for persistence of histology throughout the sampling. Slides for RNA extraction were heated on a hot plate at 60°C for 3 min, and then rehydrated by immersion in xylene for 5 min, 100% ethanol for 5 min, 90% ethanol for 5 min, 70% ethanol for 5 min. Sections were immediately macro-dissected using sterile disposable scalpels to harvest the desired tissue area; the corresponding H&E stained, marked slide was used as a guide and great care was taken to avoid sampling adjacent areas of tissue. All the macro-dissected tissue from each individual sample was placed in a separate sterile centrifuge tube labelled with the unique patient study ID and sample ID. RNA extraction was performed using the High Pure FFPE RNA Micro kit according to the manufacturer's instructions (Roche, Burgess Hill, West Sussex, UK). RNA samples were quantified and quality checked using a Nanodrop™ 8000 (Thermo Fisher scientific Ltd, Altrincham, Cheshire, UK), a 2200 TapeStation (Agilent Technologies UK Ltd, Wokingham, Berkshire, UK) and the Quant-it™ RNA BR Assay kit for the Qubit ® 2.0 Fluorometer (Life Technologies Ltd, Paisley, UK).

Library preparation
Strand-directional whole transcriptome sequencing libraries were prepared using the ScriptSeq™ Complete Kit (Human/Mouse/Rat)-Low Input (Epicentre, Madison, Wisconsin, USA), following the manufacturers instructions for FFPE samples using the Ampure XP system (BeckmanCoulter) for clean-up steps where recommended. Briefly, total RNA samples were treated with Ribo-zero to remove rRNA, followed by a column clean-up and quality check on a 2200 Tapestation using an R6K high sensitivity screen tape. rRNA depleted samples were then subject to fragmentation, cDNA synthesis and terminal tagging prior to PCR amplification and index tagging

SUPPLEMENTAry rESULTS
www.impactjournals.com/oncotarget/ Oncotarget, Supplementary Materials 2015 using the ScriptSeq™ Index PCR primers (epicentre). Purified Libraries were quality checked and quantified on a 2200 Tapestation using a D1000 screen tape. Samples with >10% adapter contamination were subject to repeat purification using the Ampure XP system.

Sequencing and alignment
100bp paired-end sequencing was performed on a HiSeq2500. Samples were barcoded and sequenced to the equivalent of four samples per lane by sequencing 12 samples over three lanes, where possible, to alleviate possible lane effects. After merging data for the same samples from different lanes, the result was two fastq files, one for each paired end read, per sample. Fastq files were processed using Trim Galore! version 0.2.7, to remove low quality bases, trim adaptors and fix paired-end reads, retaining unpaired reads of at least 35bp post-trimming (http://www.bioinformatics.babraham.ac.uk/projects/trim_ galore/). At this stage there are three fastq files per sample: one for each paired read and one containing unpaired reads. Two alignments are then performed per sample, a paired end alignment and single end alignment of the unpaired reads. Each alignment uses Tophat2, version 2.0.7, to align reads in a strand directional manner to human reference genome GRCh37.p11, using the gencode. v17 genome annotation as a guide, allowing each read to align a maximum of five times (1). Two mismatches were allowed per read, with only one in the first 20bp. The two alignment files were merged, sorted and reheaded to create a single bam file per sample. An example of the commands used, detailing each parameter, are:

Alignment statistics
The alignment statistics, detailed in Supplementary  Table S2

HPV analysis
To confirm that our samples were HPV negative, as suggested by the clinical reports, we aligned all sequenced reads to the HPV16 and HPV18 genomes using Tophat2. As a positive control we also aligned RNAseq data from a HPV positive patient's trio of samples to these genomes. The number of RNA derived reads aligning to the HPV18 genome was zero in all cases. The number of RNA derived reads aligning to the HPV16 genome was zero in all samples from all patients included in this study except PG137 for which there were 7 reads and 4 reads that aligned in the N and T samples respectively. In contrast, the HPV positive patient had zero aligned in their N sample and 4210 and 1137 reads aligning in their D and T samples respectively.

Expression quantification
Bam files were input to cuffdiff, version 2.1.1, in patient matched pairs (2). Cuffdiff is used to assign multireads (reads that have more than valid alignment) to a single location using the -u parameter. The cuffdiff output, used for downstream analysis, was the Fragments Per Kilobase per million reads Mapped (FPKM) and the count data (denoted raw_frags) required by the edgeR programme. FPKM is the normalised expression metric used to compare genes within and between samples.

Differential expression analysis
EdgeR was used (3), as opposed to using cuffdiff directly, for differential expression analysis because it can be applied in paired mode using the generalized linear modeling approach as demonstrated in section 4.4 of the edgeR User's Guide (March 2013). Three pairwise comparisons were made: normal epithelium versus dysplasia (NvD), dysplasia versus tumour (DvT) and normal epithelium versus tumour (NvT). A false discovery rate of 0.01 is used to determine differential expression. This is an example script used in the edgeR analysis, complete with comments:

Functional enrichment
The David Bioinformatics Database 6.7 was used to assess functional enrichment, via the web server (4). The background from which to measure enrichment was the list of all genes expressed within at least one of the sample groups (29,733 genes). These were input using their Ensembl ID. Individual gene lists were then input as per the details in the results section, for which the findings are described and tabulated in Figure 2. Gene ontology terms were inspected, and pathway analysis using Biocarta, KEGG and Panther. Gene family enrichment was also inspected based on the Panther database terminology. Significant associations were those with a p-value < 0.05 following a Benjamini-Hochberg multiple testing correction.

Heatmaps, boxplots and principal component analysis (PCA)
Heatmaps were created from log 2 fold change values in R using the heatmap.2 function in the gplots library. Colours were selected from the colorRampPalette in the RColorBrewer library. Boxplots were created in R using the standard boxplot function. PCA was performed for all expressed genes, and separately for all expressed protein-coding genes, using the prcomp function in R. An iterative R script was used to plot the first ten PCs against one another, with the group (N, D or T) that the sample belonged to annotated. Upon determining the biplot that best separated the three biologically relevant groups, the weighting of each gene within those PCs was ascertained by accessing the relevant rotation matrix for the prcomp object in question. The average magnitude of the weights given to each gene by the two most informative PCs was used to rank the genes in this analysis and an R script was used to annotate each gene according to whether it had was DE in any of the pairwise comparisons.

Correlation and network creation
All 29,733 expressed genes were correlated against one another, as previously described (5). Briefly, the correlation matrix was generated by multiplying the input expression matrix, after normalizing to give each gene a mean of 0 and a magnitude of 1, by its transpose. To ascertain the threshold of significance, 1% of the correlations (~4.42 × 10 6 values) were selected at random and the correlation values that demarcated the top and bottom 0.0005% were calculated: −0.74 and 0.95. Any correlations equal to or below or above these thresholds, respectively, were retained. The resulting network was plotted using Cytoscape v3.0.2, with annotation files indicating, per gene, the type of gene and whether, and between which pairwise comparisons, it was DE.

Immune cell quantification
ESTIMATE was run using the expressed gene FPKM values for all samples as input. The output was an immune score and stroma score per sample. Score differences were then calculated for each patient as score Y minus score X for an XvY analysis. Statistical tests were run in R, using Shapiro-Wilk tests to confirm the data were normally distributed before performing paired t-tests.
Immune-cell specific genes were extracted from the supplemental material of Bindea et al (6). The number of significantly (p < = 0.1) upregulated genes of each immune cell type was compared to the total number of significantly upregulated genes, in the NvD and DvT analyses separately. A Fisher's test (p <=0.05) was used to identify immune cell types for which a significant number of genes had been upregulated in either analysis.