Convergent evidence from systematic analysis of GWAS revealed genetic basis of esophageal cancer

Recent genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) associated with risk of esophageal cancer (EC). However, investigation of genetic basis from the perspective of systematic biology and integrative genomics remains scarce. In this study, we explored genetic basis of EC based on GWAS data and implemented a series of bioinformatics methods including functional annotation, expression quantitative trait loci (eQTL) analysis, pathway enrichment analysis and pathway grouped network analysis. Two hundred and thirteen risk SNPs were identified, in which 44 SNPs were found to have significantly differential gene expression in esophageal tissues by eQTL analysis. By pathway enrichment analysis, 170 risk genes mapped by risk SNPs were enriched into 38 significant GO terms and 17 significant KEGG pathways, which were significantly grouped into 9 sub-networks by pathway grouped network analysis. The 9 groups of interconnected pathways were mainly involved with muscle cell proliferation, cellular response to interleukin-6, cell adhesion molecules, and ethanol oxidation, which might participate in the development of EC. Our findings provide genetic evidence and new insight for exploring the molecular mechanisms of EC.


INTRODUCTION
Esophageal cancer is the 6 th leading cause of death from cancer and the 8 th most common cancer in the world [1]. Epidemiological researches have demonstrated that both environmental factors (eg. alcohol consumption) and genetic factors (genetic variants) contribute to the risk of EC development [2]. Meanwhile, genome-wide association study (GWAS) offers the opportunity to investigate genetic factors involved in this complex disorder and several single nucleotide polymorphisms (SNPs) have been identified to be significantly associated with risk of EC [3]. However, results Research Paper www.impactjournals.com/oncotarget from current GWAS of EC mainly focus on individual SNPs with highly statistical significance (P-value < 5.0E-08), investigation of genetic basis from the perspective of systematic biology and integrative genomics remains scarce.
Due to the polygenic risk of complex disorders, the effect size attributable to individual genetic variants was typically modest, suggesting that individual genetic variants identified by GWAS may only accounted for a very small amount of the genetic risk and heritability of complex disorders [4]. The combined effect of multi genetic variants or genes with modest effect also plays important roles in genetic basis of complex disorders such as esophageal cancer [5]. GWAS provides us an important data source for the investigation of multi-variant/ gene effect. Moreover, combining GWAS data with bioinformatics methods such as expression quantitative trait loci (eQTL) analysis, pathway based analysis, and network analysis, the integrative genomics approach could provide systematic evidence to genetic basis of disease [6].
In this study, we explored genetic basis of EC by comprehensive data mining and systematical data analysis based on GWAS data and a series of bioinformatics methods, which may provide better understanding for the molecular mechanisms that contribute to the development of EC.

Identification of SNPs associated with risk of esophageal cancer
By comprehensive data search and collection, we obtained a total of 7 published GWAS of esophageal cancer [21]- [27], in which the sample size ranged from four thousands to twenty thousands and the ethnic groups of samples were mainly Asian descent except one study with European descent, detecting 500 thousands to one million of SNPs from the whole genome in each GWAS. A total of 211 SNPs reported with P-value <5.0E-05 were obtained and considered as risk SNPs of esophageal cancer. Summary of GWAS including disease, ethnic groups, sample size, genotyping platform, and number of detected SNPs was shown in Table 1 and details of reported SNPs, their P-values and odds ratios were shown in Supplementary Table S1. As shown in Supplementary Table S5, results of power analysis demonstrated for two-stage designed GWAS, in the initial analysis, studies with initial sample size larger than 3000 had more than 90% power of detecting risk SNPs. While initial sample size of study [27] was 1109, the power of detecting risk SNPs with allele frequency of 0.1 and 0.9 was less than 70%. When combining initial sample and replicated sample, all studies achieved more than 90% power at any allele frequency level.

Functional annotation and expression quantitative trait loci (eQTL) analysis
As shown in Supplementary Table S2, for 211 risk SNPs, related chromosome, genome position, allele change, and mapped gene/region were annotated. These 211 risk SNPs were mapped into 170 genes, which were considered as genes associated with risk of esophageal cancer. By eQTL analysis, among 211 risk SNPs, we observed 44 SNPs with significant gene expression changes in several esophageal tissues, including esophagus muscularis (sample size: 241), esophagus mucosa (sample size: 218), esophagus gastroesophageal junction (sample size: 235), with permutation adjusted P-values < 0.05. Detailed results including SNPs, esophageal tissues, gene with altered expressions and P-values were displayed in Supplementary Table S3.

Pathway enrichment analysis
By pathway enrichment analysis, with the threshold of Benjamini-adjusted P-value < 0.05, we obtained 38 significant GO terms and 17 significant KEGG pathways, which were considered as significant pathways of esophageal cancer. Meanwhile, fold enrichment of risk genes in each significant pathway were all larger than 1.5, demonstrating risk genes were significantly enriched in these pathways. The details of significant pathways including pathway ID, P-values, involved genes, and fold enrichment were shown in Table 2.

Pathway grouped network analysis
As shown in Supplementary Table S4, a pathway grouped network was constructed with significant interacted pathways involved and 55 pathways of EC were significantly grouped into 9 sub-networks with Group P-value < 0.05. As shown in Figure 1, Group 1 included smooth muscle cell proliferation related pathways; Group 2 included phosphatidylcholine biosynthetic process related pathways; Group 3 were cellular response to interleukin-6 involved pathways; Group 4 were muscle cell proliferation related pathways; Group 5 and Group 6 was related with cell adhesion molecule binding and Cell adhesion molecules (CAMs) respectively; Group 7 was amyloid precursor protein catabolic process; Group 8 was related with ethanol oxidation and Group 9 was negative regulation of cAMP biosynthetic process.
Besides, 3 significant pathways including protein O-linked glycosylation, positive regulation of synapse assembly, glycerolipid metabolism were independent and not grouped into any cluster.

DISCUSSION
In this study, we employed an integrative genomics approach to investigate genetic risk factors and biological functions of EC. By systematic data analysis, evidence from large-scale GWAS, eQTL, pathway and network were obtained. As shown in Supplementary Figure S1, nine risk SNPs on alcohol dehydrogenase genes (eg. ADH4, ADH1C) were identified to have significantly differential gene expression levels under different genotypes on esophageal tissues including esophagus muscularis and esophagus mucosa, as alcohol drinking has been considered as an important risk factor of EC [2], and previous animal studies also demonstrated impairment of aldehyde dehydrogenase could increase accumulation of acetaldehyde-derived DNA damage in the esophagus after ethanol ingestion [7]. Our eQTL results indicated compared with non-risk alleles/genotypes, risk alleles/genotypes of these GWAS identified SNPs had differential gene expression levels, thus altered expression of risk genes might contribute to the molecular mechanisms of EC and were worthy of further investigation.
By functional annotation with genome information, 211 risk SNPs were mapped into 170 genes, which were enriched into 38 significant GO terms and 17 significant KEGG pathways by pathway enrichment analysis. Then these EC related pathways were significantly grouped into 9 sub-networks according to shared risk genes among pathways. Two pathway groups related to muscle cell proliferation were identified, with genes such as FGFR2 and FOXP1 involved. In accordance with our results, have shown FGFR2 are able to promote tumor development and progression in esophageal carcinoma [8] and FOXP1, as a member of Forkhead-box (FOX) family genes, was reported to be associated with poor prognosis of multicancer [9]. In addition, the alcohol related pathway group including alcohol dehydrogenase (NAD) activity, aldehyde dehydrogenase (NAD(P)+) activity, ethanol oxidation, alcohol catabolic process was identified, which provided genetic evidence and biological explanation for the risk of alcohol drinking on development of EC [2]. Meanwhile, some interleukin-6 (IL-6) mediated immunity pathways were also grouped, such as cellular response to IL-6, T-helper cell differentiation and positive T cell selection, which also demonstrated an important involvement of IL-6 on the development of EC [10]. Moreover, the identification of cell adhesion molecules (CAMs) related pathway groups were supported by previous studies reporting altered expression of CAMs during prognosis and tumor behavior in EC [11].
Results from pathway grouped network analysis demonstrated some pathways were shared among different groups, such as immune related pathways including T-helper cell differentiation, alpha-beta T cell activation and positive regulation of interferongamma production; as well as muscle development related pathways such as regulation of muscle organ development and smooth muscle cell proliferation, indicating EC related genes and pathways did not function independently, but functioned in the form of interacting with each other. Therefore, results from our study revealed the multi-gene effect on genetic basis of EC, supporting the view indicating that combined effect of multi genetic variants or genes with modest effect were also involved in genetic basis of complex disorders such as EC [5].
In conclusion, in this study, we explored genetic basis of EC by comprehensive data mining and systematical data analysis based on GWAS data, evidence from SNP, gene, gene expressions, pathway and network were identified, which might provide new insight for exploring the molecular mechanisms of EC.

Identification of SNPs associated with risk of esophageal cancer
In order to identify SNPs associated with risk of esophageal cancer, GWAS of esophageal cancer were collected from GWAS catalog (https://www.genome. gov/gwastudies/), which collected all currently published GWAS of various traits. Besides, we also searched public database of Pubmed to collect recently published GWAS of esophageal cancer. Information of GWAS studies including sample size, genotyping platform, ethnic groups, reported SNPs and their P-values were collected. Due to the polygenic risk of complex disorders, individual genetic variants may only accounted for a very small amount of the genetic risk and heritability of complex disorders such as esophageal cancer [4], in order to more comprehensively capture SNPs with small effect size, we used genetic association P-value of 5.0E−05 as a criterion for identifying SNPs that are associated with risk of esophageal cancer. To detect the power of each GWAS in identifying risk SNPs, we performed power analysis by QUNTO (http://biostats.usc.edu/Quanto.html) [12]. To comprehensively investigate the power of GWAS, three levels of risk allele frequency was assumed, which were 0.1, 0.5 and 0.9 respectively. The odds ratio was assumed as 1.20, demonstrating a "weak to moderate" gene effect, and two-tailed α was set as 0.05.

Functional annotation and expression quantitative trait loci (eQTL) analysis
To identify genes of SNPs and candidate regulatory SNPs at disease-associated loci, we annotated genome information to SNPs including related chromosome, genome position, allele changes, mapped genes by using data from 1000 Genomes Project [13] and ENCODE (Encyclopedia of DNA Elements) projects [14]. Genes mapped by risk SNPs were considered as risk genes of esophageal cancer.
Meanwhile, some GWAS identified SNPs had regulatory functions by causing differential gene expressions with different genotypes and understanding the functional consequence of genetic variants was essential for biological interpretation on genetic etiology of disease [15]. Expression quantitative trait locus (eQTL) analysis was the most common approach used to dissect the effects of genetic variation on gene expression. As esophageal cancer occurred in esophageal tissues, the expression effect of risk SNPs in these tissues was worthy of being investigated. To detect the potential impact of risk SNPs on gene expression in esophageal tissues, we performed eQTL analysis by investigating the tissue specific expression distributions of SNPs in diverse human tissues using the Genotype Tissue Expression portal (GTEx) [16], a database that contained RNA sequencing data from 1641 samples across 43 tissues from 175 individuals. For each tissue, significance correlations between genotypes and gene expression levels were determined by linear regression on quantile normalized gene-level expression values, with permutation-adjusted P-value < 0.05 as significance. As referred in [16], The eQTL was calculated for SNPs within ±1 Mb of the transcriptional start site (TSS) of each gene. If more than one target gene was identified for one SNP by eQTL analysis, gene with the most significant P-value was chosen.

Pathway enrichment analysis
To investigate whether risk genes of esophageal cancer identified from GWAS were enriched in functional pathways, we performed pathway enrichment analysis. Information from Kyoto encyclopedia of genes and genomes (KEGG) database [17] and Gene ontology (GO) terms [18] was used to annotate related pathways. The pathway enrichment test was based on hypergeometric test, the P-value was corrected by Benjamini-Hochberg methods and the significance was set as 0.05. To measure the magnitude of risk gene enrichment, we calculated the fold enrichment of involved risk genes in each pathway. The fold enrichment was obtained by calculating proportion of involved risk genes versus proportion of involved genes in human genome with a total of 29960 genes in each pathway according to the method applied in [19], with a suggested threshold of fold enrichment as 1.5 and above.

Pathway grouped network analysis
To investigate whether identified pathways were biologically interconnected, we constructed a pathway grouped network of risk genes of esophageal cancer by using a Cytoscape plug-in called "ClueGO" [20]. The relationship between pathways was defined based on their shared genes and calculated by chance corrected kappa statistics. Then the created network represented the pathways as nodes which were linked based on a predefined kappa score level. In our pathway grouped network analysis, we set the kappa score level as "0.4" as ClueGo referenced. The group P-value was determined by hypergeometric test, the P-value was corrected by Benjamini-Hochberg methods and the significance was set as 0.05. The final network was visualized by Cytoscape software (Version 3.1.1).