LSCC SNP variant regulates SOX2 modulation of VDAC3

Lung squamous cell carcinoma (LSCC) is a genomically complex malignancy with no effective treatments. Recent studies have found a large number of DNA alterations such as SOX2 amplification in LSCC patients. As a stem cell transcription factor, SOX2 is important for the maintenance of pluripotent cells and may play a role in cancer. To study the downstream mechanisms of SOX2, we employed expression quantitative trait loci (eQTLs) technology to investigate how the presence of SOX2 affects the expression of target genes. We discovered unique eQTLs, such as rs798827-VDAC3 (FDR p-value = 0.0034), that are only found in SOX2-active patients but not in SOX2-inactive patients. SNP rs798827 is within strong linkage disequilibrium (r2 = 1) to rs58163073, where rs58163073 [T] allele increases the binding affinity of SOX2 and allele [TA] decreases it. In our analysis, SOX2 silencing downregulates VDAC3 in two LSCC cell lines. Chromatin conformation capturing data indicates that this SNP is located within the same Topologically Associating Domain (TAD) of VDAC3, further suggesting SOX2's role in the regulation of VDAC3 through the binding of rs58163073. By first subgrouping patients based on SOX2 activity, we made more relevant eQTL discoveries and our analysis can be applied to other diseases.


INTRODUCTION
Lung cancer is the leading cause of cancer death in the United States with approximately 158,000 deaths in 2016 [1,2]. Lung squamous cell carcinoma (LSCC) represents a large portion of lung cancers with over 60,000 new cases diagnosed each year [3,4]. The five year survival rate for LSCC is only 12% [1,[4][5][6]. Unlike adenocarcinoma lung cancer (LUAD), the other major subtype of lung cancer, there are no targeted treatments for LSCC [7][8][9]. The higher molecular complexity of LSCC has hampered our understanding and the discovery of druggable targets for LSCC.
Recent studies, including the comprehensive analysis by The Cancer Genome Atlas (TCGA) network, have found a large number of DNA alterations [4]. The most notable alteration is amplification of chromosome 3q, which contains the SOX2 locus [10]. SRY (sex determining region Y)-box 2 (SOX2) is a stem cell transcription factor that plays a role in cell self-renewal and differentiation [11][12][13][14]. It has found to be amplified and/or overexpressed in 63% of LSCC at the transcript level and 80-90% at the protein level [10,15,16]. Other studies have shown that SOX2 overexpression is crucial in promoting LSCC formation in vivo upon loss of Pten and Cdkn2ab [17].
There are different molecular mechanisms at play in SOX2-active LSCC patients compared to SOX2inactive LSCC patients. SOX2 contains a DNA-binding region and regulates the expression of many downstream genes by binding to enhancer regions and facilitating in the remodeling of chromatin and subsequent initiation and transcription of target genes [18][19][20]. Other studies have found that single nucleotide polymorphisms (SNPs), located within the binding region of transcription factors, that can change the binding affinity of the factors and consequently the expression of their target genes [21][22][23][24][25]. Therefore, some SNPs, especially those located in SOX2 Research Paper www.oncotarget.com binding sequences, can alter SOX2's binding affinity to DNA. The increased or reduced binding of SOX2 to DNA can affect the transcription and expression of nearby target genes. Matrix eQTL allows us to identify local and distal eQTLs that are associated with the expression of target genes [26]. We can understand the downstream mechanism of SOX2 by identifying eQTLs in SOX2active patients that are not in SOX2-inactive patients [26][27][28]. Analysis of flanking sequences of the eQTL or the SNPs in strong linkage equilibrium (LD) (r 2 < 0.8) can elucidate which motifs the SNPs may alter. Additional analysis of ChIP-seq and high-throughput chromatin conformation capture (Hi-C) data can further confirm the interaction of SOX2 to specific regions of the genome and thereby its modulation of target genes [29,30]. The general workflow is summarized in Figure 1. Our goal is to understand the complex mechanisms of LSCC in order to better develop targeted treatments that work for a subset of patients with SOX2 activation.

SOX2 in LSCC patients
Unlike LUAD, SOX2 is often seen amplified in LSCC patients [4]. Model-based analysis of regulation of gene expression (MARGE) confirms high regulatory potential of SOX2 in LSCC patients and suggests that it is a master regulator [31] (Supplementary Table 1). The increased presence of this transcription factor may lead to differential regulation of downstream genes. To understand SOX2's roles in LSCC, multiple types of data from TCGA was analyzed. Patients were first divided into two groups: SOX2-active and SOX2-inactive based on their gene expression, copy number variation, methylation of SOX2. For methylation, only the 14 SOX2 CpGs located within the promoter region of SOX2 are considered. These CpGs have high variance and are highly correlated with SOX2 gene expression (Supplementary Figure 1). Patients with activated SOX2 met all three of the following criteria ( 2) SOX2 copy number variation is duplicated compared to normal (segment means > 0. 5) 3) SOX2 promoter region is hypomethylated (methylation β-values ≤0.4) Patients that did not meet all three criteria are considered as SOX2 inactive. A total of 366 LSCC patients had all three data types available and were included in our study. Out of the 366 patients, 219 had high SOX2 expression, 212 had SOX2 copy number amplification, and 292 patients had SOX2 hypomethylation. Although many patients met more than one criterion, only 159 patients which had all three criteria are grouped as SOX2active and the other 196 patients are grouped as SOX2inactive ( Figure 2D).

Matrix eQTL identifies SNP to VDAC3 expression association
Expression quantative trait loci (eQTL) analysis correlates SNP genotypes gene expression variations [26]. There are different sets of eQTLs in SOX2-active patients that are not in SOX2-inactive patients. Using a highly efficient and accurate eQTL analysis tool called Matrix eQTL [26,28], we identified 686,251 cis SNP-gene pairs in the SOX2-active patients and 688,591 pairs in SOX2- on their SOX2 gene expression, copy number variation, and methylation. Matrix eQTL analyses are performed on each group of patients to identify group-specific SNP-gene pairs. Only the eQTL pairs with genes that are downregulated after SOX2 silencing are included in our analysis. Only the SNP-gene pairs that are unique in SOX2 active patients are considered. Right: Further analyses on SOX2 motifs, ChIPseq binding peaks, and topologically associating domains are conducted to validate a SOX2→ SNP→ Gene relationship. inactive patients (p < 0.05) within 1 Mbps of each other. Since we are interested in the mechanism of SOX2, we focused on SNP-gene pairs that are putatively affected by SOX2 expression. Fang WT, et al. significantly knocked down SOX2 expression using siRNAs in two LSCC cell lines with high expression levels of SOX2 (LK2 and NCI-H20) [32]. The gene expression of the SOX2-knocked down cells was profiled using an array. After normalization, we conducted differential expression analysis and found 266 unique genes downregulated in SOX2 siRNA cells compared to control cells as shown in Figure 3A. Of the large number of SNP-gene pairs from our Matrix eQTL analysis, 8,546 SOX2-active and 8,956 SOX2-inactive pairs contained a gene that is downregulated upon silencing SOX2. Looking at the top results with FDR p < 0.01, only 16 and 38 cis SNP-gene pairs remain (Table 1 and Supplementary Table 2). For each pair, the expression of the gene is correlated with the genotype of the SNP. For example, patients with active SOX2, the genotypes for SNP rs798827 is significantly correlated with the expression of the gene VDAC3 (FDR p-value = 0.0034), where allele G is correlated with lower expression of VDAC3, and allele T is correlated with higher expression of VDAC3. Not only that, but VDAC3 is downregulated upon SOX2 silencing in two LSCC cell lines. Further analysis reveals one gene-pair overlap between SOX2 active and inactive cis-Matrix eQTL results. The correlation of this SNP-gene pair may be independent of SOX2 activity.

SOX2 regulates VDAC3 expression
Among the 16 SNP-gene pairs in SOX2 active patients that were not in SOX2 inactive patients, SNP rs798827 to gene VDAC3 pair caught our interest (Table 1). Voltage-dependent anion-selective channel 3 (VDAC3) are small integral membrane channels important for controlling the flux of metabolites between mitochondria and cytoplasm [33,34]. It has previously been shown to play a role in apoptotic and oxidative stress signaling [35][36][37]. It was also found to be downregulated upon SOX2 silencing in two LSCC cell lines ( Figure 3B).
Not only is the expression of VDAC3 significantly higher in SOX2 active patients when compared to SOX2 inactive patients (p = 0.0031), the expression of VDAC3 is also correlated with the genotype of SNP rs798827 in SOX2 active patients, but not in SOX2 inactive patients (ANOVA t-test p < 0.05) ( Figure 4A and 4B). The T genotype of rs798827 correlated with a significantly higher expression of VDAC3 than the G genotype. This association is only seen in SOX2 active patients, but not in SOX2 inactive patients ( Figure 4B). SOX2 ChIP-seq data from Watanabe H, et al. identified 5371 regions with high SOX2 interactions in a LSCC cell line (HCC95) [30]. SOX2 peaks were detected using MACS and normalized for copy number variation. Figure 4C shows the SOX2 ChIP-seq peak at the promoter region of VDAC3. ChIP-seq peaks for H3K27ac, an active enhancer mark, is also shown for the same LSCC cell line [38]. Collectively, our analysis indicates that SOX2 binds to the promoter region of VDAC3 and subsequently, regulates the expression of VDAC3.

SNPs or LD SNPs are located in SOX2 binding motifs
The HaploReg v4.1 web interface by Broad Institute is a well annotated database for SNP and their linkage disequilibrium (LD) SNPs [39][40][41]. When a SNP is in strong LD to another SNP, then those SNPs are highly associated with one another. In other words, the alleles of a few SNPs can suggest the alleles of their LD SNPs. Analyzing the 16 target SNPs from our cis-Matrix eQTL analysis of SOX2 active patients, we identified over 350 SNPs within strong LD (r 2 > 0.80) to our 16 SNPs. SOX2 is a member of the SOX HMG box family of transcription factors [42]. It can bind to specific regions of the genome at consensus binding sequences. Looking at the flanking sequences of those LD SNPs, seven unique SNPs are located in and may alter a SOX family binding motif ( Table 2). The SNP from rs798827-VDAC3 pair is within a strong LD to SNP rs58163073. This LD SNP is actually an insertion variation where an [A] nucleotide is inserted. The flanking sequences of LD SNP rs58163073 make up the SOX2 binding motif and the SNP genotype modified the position weight matrix score. The [T] allele has a stronger binding affinity for SOX2 (11.5)

VDAC3 and rs58163073 are located within the same TAD
The spatial organization of the genome plays a role in the transcriptional regulation of genes [43]. Chromatin conformation capturing methods such as Hi-C can reveal regions of the genome that have high interaction [44][45][46]. These regions are referred to as Topologically Associating Domains (TADs) [43,47]. Hi-C data can be visualized in Hi-C heat maps [47]. Rao, S. S. P., Schmitt, A., and the Dekker Laboratory generated multiple Hi-C data for lung tissues and cell lines [29,46,[48][49][50]. Figure 6 shows Hi-C heat map at chr8:41000000-44000000 for lung cell line IMR90. Figure 7 shows the Hi-C heap maps for lung cancer cell line A549, and two lung tissue samples. Figures were generated using 3D Genome Browser [51]. The location of VDAC3 and SNP rs58163073 are indicated. From analyzing Hi-C maps of lung cell lines (normal and cancer) and lung tissues, we have confirmed that VDAC3 and rs58163073 are located within the same TAD.

Significant association between SNP and clinical features
Minor allele frequencies (MAF) of rs798827 and rs58163073 varied by race, with Black or African Americans more likely to carry the minor alleles than Whites and Asians. The MAF of rs798827 were 0.32 for Black or African American, 0.03 for White, and 0.15 for Asian according to information from the 1000 Genomes Project. These frequencies were found to be similar in LSCC patients, 0.48, 0.07, and 0.22 for each race respectively, again, with a higher frequency of the minor alleles in Black or African American patients ( Figure 8A and 8B). Our SNP is also associated with location of tumor. The malignant location of patients with the major allele were 47% left and 53% right lungs. Patients with the minor allele had tumor more predominantly located on the right lungs (24 % left and 76% right, chi-square test 0.007461) Figure 8C. There were no significant association between SNP genotype, and stage and age at diagnosis (Table 3).

DISCUSSION
LSCC remains a complex disease with many molecular alterations. SOX2 is often seen amplified in LSCC patients and plays an important role in the progression and development of LSCC, however its mechanisms are not very well understood. Utilizing multiple layers of data from genomic to epigenomic, we are able to depict a potential mechanism of SOX2 in LSCC. First, LSCC patients were separated into two groups: SOX2-active and SOX2-inactive based on their SOX2 gene expression, copy number variation, and methylation. By first distinguishing patients based on their SOX2 activity, the mechanisms of SOX2 can be better elucidated. Using Matrix eQTL analysis, we linked variations in gene expression to SNP genotypes. Differences in SNP genotypes, especially within regulatory elements or binding motifs, can affect the binding affinity of transcription factors such as SOX2, and alter downstream transcription of target genes.
In our study, we focused on eQTLs that are found in SOX2-active patients but not in SOX2-inactive patients. We identified a SNP-gene pair (rs798827-VDAC3) that is unique in SOX2-active patients. The genotype of rs798827 is significantly correlated to the expression of VDAC3 in SOX2-active patients but not in SOX2-inactive patients. Since SOX2 is a transcription factor that binds to specific regions of the genome, variations in the binding sequences may alter the binding affinity of SOX2 to that region. Using HaploReg 4.1, we identified multiple SNPs within a strong LD to our eQTL SNPs. We found that rs798827 is within strong LD to SNP rs58163073 and motif analysis indicates that SNP rs58163073 is located within the binding motif of SOX2. The [T] allele of SNP rs58163073 increased the position weight matrix score for SOX2 and the [TA] allele decreased it. This SNP directly alter the binding sequence of SOX2 and the [T] allele increased its binding affinity.
Our cell line and ChIP-seq analysis further suggests that SOX2 regulates VDAC3. When SOX2 was silenced in two LSCC cell lines, the expression of VDAC3 was also downregulated. In addition, SOX2 ChIP-seq peaks show SOX2 binding in the promoter region of VDAC3. SNP rs58163073 and VDAC3 are also located within the same TAD in lung tissues, normal lung, and lung cancer cell lines. VDAC3 is the least investigated isoform of voltage-dependent anion channels. They are localized in the mitochondrial outer members and are pore-forming structures that control the exchange of metabolites between the mitochondria and cytoplasm [33]. They also play crucial roles in oxidative stress, maintaining redox status, and mediating cytochrome c apoptosis [34,52]. Other studies have reported that VDAC3deficient cancer cells have reduced permeability for ADP/ ATP and decreased mitochondrial membrane potential [53][54][55]. The pathways VDAC3 were most involved in were cancer and reproductive system disease, according to Ingenuity Pathway Analysis (IPA) [56]. Deletion of VDAC3 significantly increases cell resistance to antitumor agent Erastin, which targets VDAC2 and VDAC3 [57,58].
Further analysis shows that Black or African American patients have a higher MAF than those of White or Asian patients and patients with MAF had tumors located more predominantly on the right lungs.  The eQTL SNP, LD SNP, and reference and alternative alleles of the LD SNPs are shown. The coefficient of linkage disequilibrium (D`) and the square of correlation coefficient (r 2 ) are shown along with the frequency of the alternative allele in African (AFR), American (AMR), Asian (ASN), and European (EUR) populations. Motifs whose position weight matrix scores are affected by the LD SNPs are listed.   Collectively, these analyses support the notion that the SNP variant rs58163073 affects the binding of SOX2, which in turn affects SOX2's modulation of target genes such as VDAC3 (Figure 9). This discovery is found only when SOX2 activity is first considered. Our analysis can be used to understand the downstream mechanisms of transcription factors in other diseases and cancers.

LSCC patient data
Patient gene expression, copy number variation, methylation, and SNP data were collected from The Cancer Genome Atlas (TCGA) database. Specifically, Illumina HiSeq RNAseqv2 was used for gene expression data, Genome Wide SNP 6.0 was used for both copy number variation and SNP genotype data, and Human Methylation 450 bioassay data set was used for methylation data. A total number of 366 LSCC patients had all three datatypes available.

Matrix eQTL computation
Expression quantitative trait loci (eQTL) analysis was conducted with Matrix eQTL: Ultra-fast eQTL analysis via large matrix operations by Andrey Shabalin [26]. Software and resources were obtained from http:// www.bios.unc.edu/research/genomic_software/Matrix_ eQTL/. Output threshold p-values were set at 0.05. Local eQTLs (or cis eQTLs) were set at the default distance of 1 million base pairs of each other.
SOX2 siRNA lung cancer cell line SOX2 silencing experimental data was obtained from NCBI GSE48871 [32]. Two LSCC cell lines (H520 and LK2) were treated with either pooled siRNA sequences of SOX2 or scrambled control siRNAs) and their gene expression was profiled using Illumina's BeadChip Human HT-12v3 array. The gene expression data distribution was normalized to 0.

Linkage disequilibrium SNPs and binding motifs
LD SNPs were explored using HaploRegv4.1 hosted by Broad Institute [39]. The database table is curated from information from the 1000 Genomes Project. Pairwise LD was calculated for all pairs of SNPs within 250 kb. A LD threshold of r 2 > 0.8 was used to query variants. The 16 eQTLs were within strong LD to 263 SNPs. HaploRegv4.1 also contained information on regulatory motif changes. Only the SNP variants that overlapped the binding motif of SOX2 and changed the position weight matrix (PWM) score were considered. Under these conditions, only seven unique LD SNPs are identified.

SOX2 and H3K27ac ChIP-seq in HCC95 cell line
SOX2 ChIP-seq data was obtained from NCBI-GSE46837 [30]. A LSCC cell line with SOX2 amplification was used in this study. SOX2 binding sites were detected using Model-based Analysis of ChIP-Seq (MACS) and normalized for copy number variation. There were 5371 peaks with high SOX2 interaction. H3K27ac ChIP-seq data was obtained from NCBI-GSE66992 [38]. LSCC cell line, HCC95 was also used and H3K27ac binding sites were called by MACs.

Lung tissue and cell line Hi-C data
Hi-C data was obtained and visualized on 3D Genome Browser [59]. IMR90 Hi-C data was from Rao S, et al. (2014) [48], A549 cell line Hi-C data was from ENCODE Encyclopedia version 3 (2010) [29,60], and two normal lung tissue Hi-C data were from Shmitt A, et al. (2016) [49]. Hi-C heat maps resolutions are at 10 kb, 40 kb, 40 kb, and 40 kb respectively, and between 41000000 and 44000000 position on chromosome 8, which is where VDAC3 gene and associated SNPs are located.

ACKNOWLEDGMENTS
We would like to acknowledge the members of Center for Bioinformatics and Systems Biology at Wake Forest School of Medicine for valuable discussions and advices. The authors acknowledge the Texas Advanced Computing Center (TACC) at the University of Texas at Austin (http://www.tacc.utexas.edu) and the DEMON high performance computing (HPC) cluster at Wake Forest University School of Medicine for providing HPC resources that have contributed to the research results reported within this paper.