Transcriptome analysis reveals differentially expressed lncRNAs between oral squamous cell carcinoma and healthy oral mucosa

Oral cavity and oropharyngeal squamous cell carcinoma (OSCC) is a major cancer type in the head and neck region. To better understand the roles long non-coding RNA (lncRNA) play in OSCC carcinogenesis, we compared the expression levels of 3,054 probe sets for lncRNAs between 167 OSCCs and 45 healthy oral mucosa using an Affymetrix HG U133 plus 2.0 array dataset. We found 658 lncRNA transcripts (790 probe sets) to be significantly differentially expressed using a criteria of FDR < 0.01, with 36 of them (39 probe sets) showing more than a 2-fold change. We further validated the top differentially expressed lncRNAs in three independent datasets from Gene Expression Omnibus (GEO) repository: GSE42743, GSE9844, and GSE6791. Fourteen lncRNAs (15 probe sets) were validated in all three datasets using the criteria FDR < 0.01: LOC441178, C5orf66-AS1, HCG22, FLG-AS1, CCL14/CCL15-CCL14, LOC100506990, TRIP10, PCBP1-AS1, LINC01315, LINC00478, COX10-AS1/LOC100506974, MLLT4-AS1, MIR31HG, and DUXAP10/LINC01296. Three lncRNAs in the validated list which showed the highest fold change (LOC441178, HCG22 and C5orf66-AS1) were verified by quantitative RT-PCR in a subset of 20 OSCCs and 10 control samples. In silico prediction of their functional role has given us directions for further investigation.


INTRODUCTION
Oral cavity and oropharyngeal squamous cell carcinoma (OSCC) is the eighth most common cancer among men and fourteenth among women in the U.S. according to recent data [1]. The estimated incidence is 11.1 [SEER, http://seer.cancer.gov/] per 100,000 in the U.S. About 600,000 new cases arise annually worldwide [1][2][3]. Tobacco and alcohol consumption, as well as infection with high risk human papillomavirus (HPV), have been shown to be the main risk factors of OSCC in the U.S. [2,[4][5][6]. However, the precise mechanisms of OSCC carcinogenesis are not well understood; and little improvement has been made to the overall prognosis for advanced-stage OSCC in the past two to three decades, leaving the patients and their families with heavy disease burden [7]. There continues to be an urgent need to achieve a better understanding of the mechanisms of oral carcinogenesis in order to aid the discovery of effective therapeutic targets.
It is estimated that >70% of the human genome can be transcribed. However, only 2% are protein-coding and the majority of transcripts are not [8]. These majority transcripts are categorized as non-coding RNA (ncRNA). Except for those "housekeeping" ncRNAs such as tRNAs, rRNAs, there are also other mRNA-like transcripts which can be subdivided by length: small ncRNA (< 200 nt) and long ncRNA (> 200 nt) [8]. Small ncRNAs, like microRNAs, have been studied extensively and there is evidence to suggest that they may play an important role in cancer, including OSCC [9]. However, the study of lncRNAs in cancer has only begun in the last decade [10].

Research Paper
Oncotarget 31522 www.impactjournals.com/oncotarget The functions of only a handful of lncRNAs have been studied with results showing that they play a role as transcriptional and post-transcriptional regulators. Their reported functions include: 1) Remodeling chromatin state (e.g. HOTAIR or Xist influences chromatin remodeling by recruiting PRC2 [11][12][13]); 2) Providing stability to proteins or protein complexes (e.g. MALAT1 and NEAT1 serve as molecular scaffolds for proteins within nuclear speckles and paraspeckles [10,14]); 3) Competing with endogenous RNAs to modulate their functions ( e.g. UCA1 and MEG3 regulate oncogenes by "sponging" miRNAs and decreasing their function [15,16]). Gibb et al. was the first to report the expression of lncRNA in oral mucosa, implying lncRNAs contribute to the oral transcriptome [17]. Subsequent studies have evaluated whether lncRNAs are involved in tumor development by comparing their expressions between cancers and controls. Several such studies reported aberrantly expressed lncRNAs in oral cancers. HOTAIR (HOX transcript antisense RNA) was reported to be upregulated in OSCC [18,19] and laryngeal squamous cell carcinoma [20,21]. An increase in its expression was associated with metastasis and poor prognosis of OSCC [18]. In 2015, Sharma S. et al. reported HPV16 oncoprotein E7 could be involved in cervical cancer carcinogenesis through regulating HOTAIR expression and function [22]. Whether this molecular event also occurs in HPVrelated oral cancer remains unknown. MALAT1 (metastasis associated lung adenocarcinoma transcript 1) is another reported lncRNA. This large, infrequently spliced noncoding RNA was aberrantly expressed in lung cancer and cervical cancer, and its upregulation is associated with growth and metastasis of several types of cancer, such as colorectal, pancreatic and gastric cancers [23,24] and in tongue squamous cell carcinoma and OSCC [23,25]. Other lncRNAs that have been reported in oral cancer include MEG3, which was down-regulated in tongue cancer [26], and UCA1, which was up-regulated in tongue cancer [27]. Some studies of head and neck cancer [20,[28][29][30] suggested that lncRNA, including AFAP1-AS1, AB209630, GAS5, and HOTAIR, could potentially serve as predictors of patient outcome or treatment response.
To better understand the role of lncRNAs in OSCC carcinogenesis, and to gain insight for the identification of potentially clinically relevant targets, we used a whole genome approach to examine expression differences of lncRNAs between OSCC and normal oral tissue. Previous studies by Risueño et al. [31], Liao et al. [32] and Michelhaugh et al. [33] have shown that existing microarray data can be mined to study lncRNA transcription. In this study, we developed our own approach to identify and validate a list of significantly dysregulated lncRNAs between OSCC and oral mucosa from healthy individuals using our previously generated microarray data.

Differentially expressed lncRNA genes between cases and controls
To identify differentially expressed lncRNAs between OSCC and control, we used a dataset comprised of 167 OSCCs and 45 oral mucosa samples from healthy controls previously generated using Affymetrix Human Genome U133 Plus 2.0 array (Affymetrix, Santa Clara, CA, USA). Out of 3,054 candidate probe sets that represent 2,172 lncRNA transcripts (see Supplementary  Table 1), we found 790 probe sets (representing 658 lncRNAs transcripts) to be significantly differentially expressed with the criteria FDR < 0.01(see Supplementary  Table 2). Of them, 568 were down-regulated and 222 were up-regulated in cancer. Amongst all, 39 probe sets (36 lncRNA transcripts) have a fold change difference > 2, with the majority (31 out of 39) of them being downregulated. The two most differentially expressed probe sets identified the same lncRNA LOC441178 with both showing a greater than 14-fold down-regulation in OSCC compared to control oral mucosa. To validate our findings in external datasets, we analyzed the expression of the top 39 probe sets that were identified in our study in three external datasets downloaded from GEO, adjusting for age and sex. Study participants' characteristics for our study and for these three validation sets are summarized in Table 1. With a criteria of FDR < 0.01, most of the 39 probes were differentially expressed between cases and controls in at least one of the three GEO datasets. The 15 probe sets (representing 14 lncRNA transcripts) that were differentially expressed in all three validation datasets were as follows: LOC441178, C5orf66-AS1, HCG22, FLG-AS1, CCL14/CCL15-CCL14, LOC100506990, TRIP10, PCBP1-AS1, LINC01315, LINC00478, COX10-AS1/ LOC100506974, MLLT4-AS1, MIR31HG, and DUXAP10/ LINC01296. The results on their differential expressions are shown in Table 2.

Verification of LOC441178, HCG22 and C5orf66-AS1 by quantitative RT-PCR
To confirm the differential expression signals found in the array data, three lncRNAs (LOC441178, HCG22 and C5orf66-AS1) in the validated list showing the highest fold changes were chosen to undergo qRT-PCR testing in a subset of 20 OSCCs and 10 normal controls samples (see Supplementary Table 3). The samples were randomly chosen from the original FHCRC study samples which consist 167 OSCCs and 45 oral mucosa from healthy people. LOC441178 has two identified transcripts, one shorter transcript that codes for a 93 aa protein, and the other longer transcript is identified as a lncRNA. To confirm that the lncRNA transcript is truly www.impactjournals.com/oncotarget differentially expressed, we designed a PCR primer pair that targets only the lncRNA transcript. Figure 1 shows the differential expressions of the three lncRNAs in OSCC and controls (P < 0.05). The correlation of Affymetrix array data and PCR results were calculated. Pearson's correlation coefficient value of LOC441178, HCG22 and C5orf66-AS1 were 0.88, 0.83 and 0.72, respectively (see Supplementary Figure 1).

Prediction of putative targets of lncRNA LOC441178 and functional clustering
Terai G et al. [38] have reported a comprehensive interaction database of lncRNA and mRNA based on sequence complementarity. We used that database to predict the targets of the No. 1 differentially expressed lncRNA LOC441178. The top 100 putative targets are shown in Supplementary Table 4. Functional annotation of these putative 100 target genes using DAVID bioinformatics classification tool [39] showed that they are enriched in 8 clusters with suggested functional involvement in cellular motility, ion channel signaling and ATP metabolism (see Table 3).

DISCUSSION
In this study, we identified 790 significantly differentially expressed probe sets, representing 658 lncRNAs, between OSCCs and normal oral mucosa, with 39 of them, representing 36 lncRNAs, showing more than a 2-fold change in their expression levels. We also used three independent datasets to validate our findings, and 14 lncRNAs were found to be significantly differentially expressed in all data sets. We suspect these differentially expressed lncRNAs in OSCC may be involved in oral cancer carcinogenesis, and they also could serve as new choices for the investigation of potential OSCC biomarkers and/or therapeutic targets. Some of our findings are novel in that of the 14 validated lncRNAs, more than a half (8 out of 14) have never been reported to be associated with cancer before: LOC441178, COX10-AS1, PCBP1-AS1, FLG-AS1, MLLT4-AS1, LINC01315, LOC100506990, and CCL15-CCL14. LOC441178 was the top differentially expressed lncRNA in all four datasets, and was confirmed by qRT-PCR in a subset of samples. Prediction of targets of lncRNA LOC441178 in silico using a lncRNA-RNA interactions database [38] showed that LOC441178 will  potentially interact with both coding and non-coding RNAs, some of which are related to cancer, such as coding mRNA MUC16/CA125 [40], and non-coding RNA KCNQ1OT1 [41]. Functional clustering of these predicted targets give us an initial glance at the biological functions associated with this top lncRNA. With more than 100 fold change difference between cancer and control in qRT-PCR testing, it may have strong potential as a diagnostic biomarker for OSCC. HCG22 was found to be downregulated in oral cancer and its lower expression was reported to be associated with poor survival in a recent lncRNA study using TCGA data [42]. While our qRT-PCR results confirm the down-regulation of HCG22 in OSCC, we did not find an association between HCG22 expression levels and survival in our study population (data not shown). C5orf66-AS1, also known as Epist, was previously found to be downregulated and may act as a tumor suppressor in esophageal squamous cell carcinoma [43]. It has the same expression pattern in OSCC (see Figure 1), suggesting that C5orf66-AS1 may be involved in multiple cancer types that occur in the upper aerodigestive tract. Within 5kb distance of C5orf66-AS1 is pituitary homeobox 1 (PITX1). PITX1 expression level was found recently to be a novel predictor for treatment response of head and neck cancer [44]. Further study is needed to see if C5orf66-AS1 could be a regulator of this neighboring PITX1 gene. MIR31 host gene (MIR31HG), also named lncRNA LOC554202, was found to be down-regulated in gastric cancer[45] and bladder cancer [46], but up-regulated in breast cancer [47]. Further, there is evidence showing that it was upregulated in oncogene-induced senescence process; and knockdown of MIR31HG induced a tumor suppressor p16INK4A dependent senescence phenotype, which might link it to HPV-related cancer carcinogenesis [48]. In our study, MIR31HG expression does not appear to be significantly different between HPV-positive OPC (oropharyngeal cancer) and HPV-negative OPC (P = 0.02, lower in HPV-positive OPC). However, we found it was significantly differentially expressed between subsites of OSCC (OPC vs. OCC (oral cavity cancer), P = 9.29E-07, lower in OPC), and between smokers and non-smokers (P = 5.05E-05, lower in smokers). LINC01296 was reported to be dysregulated in colorectal cancer and was suspected to be a potential prognostic biomarker [49]. The expression level of the probe set for LINC01296 in our study was also significantly different between OSCC and controls, suggesting that this lncRNA may be involved in carcinogenesis process.
Some previously reported cancer-associated lncRNAs, such as HOTAIR, UCA1, and MALAT1, were found to be differentially expressed in our study, but none was in our top, validated list. Interestingly, some lncRNAs showed opposite expression direction in our study than in previous studies focusing on individual lncRNAs in head and neck cancers. Supplementary Table 5 shows their deregulation status in other studies and in our data. The discrepancies perhaps could be related to differences in sample size, platform used, origin of the control tissue used, or tumor site.
There are a large number of lncRNAs with unknown function that have yet to be studied. While genome wide study of lncRNA is a good approach to unveil more lncRNAs that may be involved in carcinogenesis or may represent potential therapeutic targets, making direct comparisons of results from various studies has been difficult. A major challenge is the use of different assay platforms by different studies. The lack of standardization in gene nomenclature is also an issue. Several studies of   [42,50,51]. Two other studies have looked at lncRNA expression using re-annotated microarray data in OSCC. Even when the same microarray platform or a different version of the same microarray platform was used, difficulty in direct comparison remains. Gao et al. [52] found eight differentially expressed lncRNAs by comparing 26 tongue squamous cell carcinomas versus 12 controls, using the same Affymetrix U133 Plus 2.0 Gene Chip array data like we did. None of the eight appeared in our differential expression list. Differences in tumor site, sample size, and probe sets reannotation processing may be the reasons between their results and ours. Zhang et al. [53] compared 57 OSCCs and 22 normal controls using an Affymetrix Human Exon 1.0 Array. In their results, 160 lncRNAs were found to be dysregulated in cancer, of which seven were found in our 658 lncRNA list (LOC645949, LINC00896, FAM99B, LINC00167,  LINC01056, SND1-IT1, NCBP2-AS2). The reasons of so few overlaps are unknown and could be related to different version of the Affymetrix platform used and the difference in data processing. Zhang et al. excluded probe sets that map to multiple transcripts in their analysis; while we did not. Excluding probe sets that also identify coding transcripts may mean a loss of potential lncRNA of interest. In our study, we chose to keep those probe sets in the initial analysis, because in some cases, it is possible to determine lncRNA transcript expression status more definitively through other methodologies such as quantitative RT-PCR. This approach allowed us to identify our most differentially expressed lncRNA LOC441178.
The recent development of using CRISPRi method for high throughput functional screening of lncRNAs has increased the number of lncRNAs with known functions [54]. However, this is beyond the scope of the current report. In silico prediction of the potential targets and the pathway analysis of these targets gave us some suggestion to the possible biological functions of our top differentially expressed lncRNA. However, further research is needed to explore more on these functions and the potential usage as biomarkers and/or therapeutic targets for OSCC.

Gene expression array data and identification of lncRNAs
The gene expression microarray data used in this study was generated previously by our group as part of an OralChip study at the Fred Hutchinson Cancer Research Center (FHCRC). The study was designed to identify gene expression profiles that are related to clinical outcomes of OSCC [34,35,55,56]. The microarray data were generated using the Affymetrix Human Genome U133 Plus 2.0 array (Affymetrix, Santa Clara, CA, USA) and were deposited in the Gene Expression Omnibus (GEO) Repository under the general accession number GSE30784. The dataset has 212 samples in total, 167 of them are OSCCs and 45 of them are oral mucosa from healthy controls. The samples and patients' clinical data including age, gender, tumor site, stage, and HPV status were obtained from the original OralChip study. This investigation has been approved by the institutional review office of Fred Hutchinson Cancer Research Center. All microarray data have passed two rounds of quality control checks [34,56] . CEL files were preprocessed and normalized using RMA algorithm in Partek Genomics Suite TM software version 6.6. To determine the candidate probe sets on the microarray that represent lncRNAs, first, we matched lncRNAs identified by HGNC (HUGO Gene Nomenclature Committee) Database [57] (downloaded from http://www.genenames. org/ at 1/27/2016) with Affymetrix probe ID using Entrez gene ID or Ensemble gene ID. 2,183 probe sets were matched to transcripts identified as "RNA, long noncoding" by HGNC; second, we identified 2,488 probe sets mapped to lncRNA from the Affymetrix annotation file (downloaded from http://www.affymetrix.com/ at 1/25/2016, netaffx-build=35) for the U133 2.0 Plus array. After removal of 1,617 duplicates, we obtained a final list of 3,054 probe sets that represent 2,172 lncRNA transcripts.

Statistical analysis
The statistical analysis of differentially expressed lncRNA (between cases and controls) was performed using linear regression, adjusting for age and sex. The false discovery rate (FDR) of 0.01 and a 2-fold difference in expression between the two groups were used as criteria to select the differentially expressed lncRNA. All analyses were performed using R version 3.3.0.

Validation using independent datasets
In order to validate our findings, we searched Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih. gov/geo/) for datasets with OSCC and normal control samples that also used the Affymetrix U133 Plus 2.0 Gene Chip array. Three datasets were identified: GSE42743 with 74 oral cavity squamous cell carcinoma and 29 adjacent normal tissue [35]; GSE9844 with 26 tongue squamous cell carcinoma and 12 matched adjacent normal tissue [36]; and GSE6791 comprised of 28 cervical cancers, 42 head and neck cancers and 14 site-matched normal oral tissue [37]. We used these three data sets to validate our findings after excluding samples of irrelevant anatomic site and samples containing missing data. The false discovery rate (FDR) < 0.01 was used as criteria in each dataset during validation comparison.

Quantitative reverse transcription polymerase chain reaction (qRT-PCR)
Among the validated list of lncRNAs, we chose three (LOC441178, C5orf66-AS1 and HCG22) with the highest fold change to do further validation by performing qRT-PCR using a subset of 20 oral cancer samples and 10 control samples randomly chosen from samples in the OralChip study. Each sample was assayed in triplicate in 10 uL reaction volumes using the QuantiTect SYBR Green RT-PCR kit (Qiagen, Valencia, CA, USA) on a 7900HT Sequence Detection System (Life Technologies, Carlsbad, CA, USA). The cycling conditions were as follows: 30 min incubation at 50°C, 15 min incubation at 95°C, and 40 cycles of 15 s at 94°C, 30 s at 55°C, and 30 s at 72°C. Ten-point standard curves were generated using Total RNA -Human Normal Tissue Tongue (Biochain Inst., Newark, CA, USA) for all three genes. The linear correlation coefficient (R2) was ≥ 0.99 for all runs. The mean threshold cycle (Ct) values were calculated from the triplicate Ct values. Samples that had Ct values with SD > 0.35 in their triplicate run were repeated. Mean Ct values were standardized to the mean Ct value of beta actin housekeeping gene (Quantitect primers (Qiagen, Valencia, CA, USA). Correlations between qRT-PCR and microarray data were performed using R Stats Package. Primer pairs used for each gene were as follows: LOC441178: Forward primer GGGTATTTTGTGC TCCCCCA; Reverse primer CAGGCACTGAAGGTTCG GAT; HCG22: Forward primer ACAGCAGTGAAACC CACCA; Reverse primer GAAGCCCAATCCAACA AAGAGC; C5orf66-AS1: Forward primer GCTTCGCGTC AAGAGGGTAT; Reverse primer AAGCCGCGGGAA TGTCTTTA. P value was calculated using student's t test between groups.

Functional predictions
Prediction of the putative targets of lncRNA LOC441178 was done using a comprehensive interaction database of lncRNA-mRNA which was reported by Terai G et al. [38] in 2016. GO ontology and KEGG pathway analysis and functional clustering of the top 100 targets was done by a gene functional classification tool, DAVID Bioinformatics Resources 6.8, NIAID/NIH [39]. The enrichment thresholds were by the web tool default and the enrichment score was calculated automatically by the tool.