Tobacco smoking and methylation of genes related to lung cancer development

Lung cancer is a leading cause of cancer-related mortality worldwide, and cigarette smoking is the major environmental hazard for its development. This study intended to examine whether smoking could alter methylation of genes at lung cancer risk loci identified by genome-wide association studies (GWASs). By systematic literature review, we selected 75 genomic candidate regions based on 120 single-nucleotide polymorphisms (SNPs). DNA methylation levels of 2854 corresponding cytosine-phosphate-guanine (CpG) candidates in whole blood samples were measured by the Illumina Infinium Human Methylation450 Beadchip array in two independent subsamples of the ESTHER study. After correction for multiple testing, we successfully confirmed associations with smoking for one previously identified CpG site within the KLF6 gene and identified 12 novel sites located in 7 genes: STK32A, TERT, MSH5, ACTA2, GATA3, VTI1A and CHRNA5 (FDR <0.05). Current smoking was linked to a 0.74% to 2.4% decrease of DNA methylation compared to never smoking in 11 loci, and all but one showed significant associations (FDR <0.05) with life-time cumulative smoking (pack-years). In conclusion, our study demonstrates the impact of tobacco smoking on DNA methylation of lung cancer related genes, which may indicate that lung cancer susceptibility genes might be regulated by methylation changes in response to smoking. Nevertheless, this mechanism warrants further exploration in future epigenetic and biomarker studies.


INTRODUCTION
Lung cancer is the most common cancer and a leading cause of cancer-related mortality globally [1]. In recent years, several large genome-wide association studies (GWASs) have been conducted to identify genetic risk factors of lung cancer [2]. They have successfully identified numerous single-nucleotide polymorphisms (SNPs) that might play a role in the pathophysiology of lung cancer, such as loci located in chromosomal regions 15q (nicotinic acetylcholine receptor subunits: CHRNA3, CHRNA5), 5p (TERT-CLPTM1L) and 6p (BAT3-MSH5).
Smoking, the best established environmental hazard of lung cancer, accounts for 80% of the worldwide lung cancer burden in males and at least 50% in females [1]. Recent studies have shown that smoking could interact with genetic variation to influence lung cancer, including lung tumor initiation and progression [3,4]. DNA methylation, which could be employed as a useful and stable surrogate of the genetic response, has recently been suggested to be one of the potential mechanisms of such interaction for smoking-related health outcomes [5,6].
Recently, a number of epigenome-wide association studies (EWASs) have established the important role of tobacco smoking in genomic DNA methylation profiles within whole blood samples. They identified smoking-related CpG sites in various genes, such as AHRR, F2RL3 and GPR15, and showed that these sites could be utilized as quantitative biomarkers of current and past smoking exposure and as predictors of smoking-associated health risks [5-8]. Two further studies, by Steenaard et al. and Ligthart et al., demonstrated that smoking is associated with differential DNA methylation of risk genes for coronary artery disease and diabetes [9,10]. However, no previous studies have systematically addressed the impact of smoking on DNA methylation of risk loci for lung cancer. Hence, we conducted an epigenetic investigation in the ESTHER study, focusing on the association of smoking with whole blood DNA methylation of loci at/near confirmed lung cancer related genes, with the aim of identifying methylation signals that could aid in the development of risk prediction models or advance the understanding of the links between smoking and lung cancer.

Participant characteristics
Characteristics of the study population in the discovery (n=978) and validation (n=531) panels were comparable with respect to age, lifestyle factors, smoking behavior and prevalent diseases, and are summarized in Table 1. Average age in the two subsets was about 62 years. More than half of the participants in each subset were ever smokers, and around 18% still smoked at the time of recruitment. In both subsets, the proportion of men was much higher among current smokers than among never smokers: 60.8% vs. 29.4% in the discovery panel and 48.0% vs. 21.1% in the validation panel (data not shown). Average cumulative smoking exposure in current and former smokers was 36.8 and 23.3 pack-years, respectively, in the discovery panel, and 33.9 and 19.9 pack-years, respectively, in the validation panel. Average cessation time for former smokers was similar in the two subsets, at approximately 17 years.

Associations between tobacco smoking and methylation of lung cancer related genes
DNA methylation levels of 2854 CpG candidates corresponding to 75 genes were measured by the Illumina Infinium Human Methylation450 Beadchip array. Associations between current smoking exposure (current vs. never; independent variable) and methylation levels of these candidates (dependent variable) were assessed by three mixed linear regression models (Models 1-3) with methylation assay batch as random effect and increasing adjustment for potential confounders (details are presented in Methods). Models 1 and 2 were less powerful (Supplementary Table S1). After full control for confounding factors (Model 3), 31 of the 2854 CpG candidates passed the threshold of FDR <0.05 in the discovery phase (Figure 1, Supplementary Table S2). These 31 CpG sites were then tested for replication in the validation panel by the fully adjusted mixed linear regression model (Model 3). As a result, 13 of the 31 CpG sites were confirmed as significantly smoking-related loci (Table 2, FDR <0.05). Among these, only cg24287110 (KLF6) was previously reported to be related to smoking exposure [11]. The remaining 12 sites were located in 7 genes: STK32A (n=1), TERT (n=2), MSH5 (n=2), ACTA2 (n=1), GATA3 (n=3), VTI1A (n=2) and CHRNA5 (n=1). Current smoking was mostly associated with hypomethylation (11 sites), whereas hypermethylation was observed at cg17928584 (STK32A) and cg19696491 (CHRNA5). Effect sizes of the 13 CpG sites between never and current smokers ranged from 0.6% to 2.9%.
Furthermore, in the analyses of associations between other smoking indicators and the 13 validated CpG sites which were identified as the smoking-related loci, all loci except cg19696491 (CHRNA5) were significantly associated with pack-years (Table 3, FDR<0.05), whereas none of the 13 loci exhibited an association with the time since smoking cessation after FDR correction. In line with this, comparisons of methylation between current and former, or between former and never smokers generally were weaker, and did not reach significance, with the possible exception of cg19335412 (ACTA2) (adjusted p-value = 0.018 for the comparison of former and never smokers). However, methylation changes associated with former smoking were generally in the same direction as those associated with current smoking (detailed data not shown).

Characteristics of significant CpG sites
Genome characteristics of the 13 validated CpG sites are presented in Table 4. They are located on chromosomes 5 (n=3), 6 (n=2), 10 (n=7) and 15 (n=1). Eight of these 13 CpG sites are located in gene bodies, 4 at transcription start sites (TSS200/TSS1500) and only one in an untranslated region (3′UTR). None of them is located at a cis-eQTL. With the exception of three CpG sites within GATA3, the distances between the significant CpG sites and their corresponding lung cancer related SNPs were less than 1 Mb. Correlations between methylation at the 13 sites are described in Supplementary Table S3. Significant moderate pairwise correlations were frequently observed, and stronger positive correlations were seen between CpG sites located in the same gene. In particular, cg19696491 within CHRNA5 showed highly significant correlations (p<0.0001) with all other CpG sites except cg11430077 (GATA3) and cg24287110 (KLF6).
i: A pack-year was defined as having smoked 20 cigarettes per day for 1 year, including all participants from validation panel, pack-year = 0 for never smokers
j: Former smokers only, data missing for 9 and 3 participants, respectively, in discovery and validation panels; cessation time equals age at recruitment minus age at cessation

DISCUSSION
In the present study, based on two independent subgroups of a population-based cohort of older adults from Germany, we identified 13 smoking-related CpG sites within 8 genes suggested to be associated with lung cancer development by GWASs. Smoking-induced hypomethylation was observed for loci within KLF6, TERT, MSH5, ACTA2, GATA3 and VTI1A, and hypermethylation was observed for loci within STK32A and CHRNA5. The effect sizes between never and current smokers ranged from 0.6% to 2.9%. These findings may indicate that lung cancer susceptibility genes might be regulated by methylation changes in response to smoking. The associations with smoking may also partly explain the positive correlation of methylation levels between the identified sites.
Altogether, we were able to identify 12 novel smoking-related CpG sites and replicate one previously identified locus in two independent subsamples. Although their methylation alterations were not as pronounced as those of well-established smoking-related CpG sites, such as cg05575921 (AHRR) and cg03636183 (F2RL3) [8,12-14], a clear pattern was consistently observed: compared with never smokers, current smokers showed the lowest methylation levels and former smokers intermediate levels at all hypomethylated loci, and the highest and intermediate levels, respectively, at the hypermethylated loci. Although differences between former and never smokers were weaker and not statistically significant, they were in the same direction as differences between current and never smokers, and additional associations were observed between cumulative smoking exposure and methylation at the identified sites. This pattern of "methylation recovery" after quitting smoking is consistent with findings from recent epigenetic studies of smoking cessation [11,14,15]. Accordingly, it appears worthwhile to further explore dose-response relationships of life-time smoking exposure with methylation at the identified loci in larger cohorts.
Our study also provides evidence that might narrow the apparent ethnic discrepancy in lung cancer susceptibility. We identified methylation changes in three genes, VTI1A, STK32A and GATA3, that have rarely been reported in relation to lung cancer among Caucasians. The corresponding SNP rs7086803 of VTI1A (vesicle transport through interaction with t-SNAREs 1A) was only identified in female non-smoking Asians as the strongest association signal of lung cancer [16]. A recent study further identified it as a potential contributor to lung cancer susceptibility and poor survival in smoking Chinese [17], but this locus has never demonstrated a significant association with lung cancer in GWASs among other ethnicities. Likewise, STK32A (encoding serine/threonine kinase 32A) was only reported by a GWAS in a Chinese population, and the risk allele, rs2895680, was significantly associated with smoking dose [18]. Lastly, for GATA3 (GATA binding protein 3), no corresponding SNP has yet been disclosed by any GWAS on lung cancer, while only an adjacent SNP, rs1663689, was identified in a Chinese population and might mediate genetic damage among workers exposed to polycyclic aromatic hydrocarbons [18,19]. Overall, our study might provide some indication that these loci play a role in the pathway between smoking and lung cancer development in the Caucasian population as well, which should be followed up in further research. Furthermore, we also identified CpG sites within two well-established lung cancer related genes. CHRNA5 is one of the three cholinergic nicotine-receptor genes within genome region 15q25, encoding nicotinic acetylcholine receptors (nAChRs) in neuronal and other tissues [20]. Its association with smoking quantity was reported in 2008, suggesting that SNPs in nAChRs may alter the risk of lung cancer through smoking behavior and regulate direct effects of nicotine as well [20].
Our finding of hypermethylation of cg19696491 within CHRNA5 under smoking exposure possibly reflects altered expression of CHRNA5, which could provide a potential mechanism to support this suggestion. TERT (telomerase reverse transcriptase) is another plausible lung cancer gene candidate, which is known for its function in telomere replication and maintenance [21]. It is located in the 5p15.33 region, which is involved not only in lung cancer but also in brain, bladder and prostate cancer development [22]. Moreover, locus cg12324353 within TERT was recently reported to be related to coronary artery disease [9]. These findings indicate that the genotypes and epigenotypes of TERT might provide valuable contributions to signatures for risk of a wide range of cancers and chronic diseases, which warrants further exploration. The same applies to another three genes, KLF6 (Krüppel-like zinc finger transcription factor) [23], MSH5 (MutS protein homolog 5) [24] and ACTA2 (alpha-smooth muscle actin) [25], which were also found to be associated with lung cancer by several previous GWASs, albeit not as prominently as CHRNA5 and TERT.
a: According to GRCh37/hg19
b: This SNP is located close to GATA3
c: CHRNA5 is cis-eQTL gene of this SNP
Major strengths of the present study include the relatively large sample size with detailed information on a broad range of covariates in a large population-based cohort and the comprehensive validation in an independent group. Although smoking and lung cancer related changes of methylation would be expected to primarily manifest in buccal tissues [26], we were able to disclose such changes in DNA of whole blood samples, which would be the primary sample matrix available in screening settings in general practice.
Even though associations of smoking with DNA methylation in whole blood may be affected by smoking-related shifts in leukocyte distribution, the observed associations persisted after control for leukocyte distribution by the Houseman algorithm [27]. Furthermore, even potential (residual) confounding by leukocyte distribution would not impair the potential utility of the methylation patterns for risk prediction. Lastly, one plausible explanation for our observation could be that DNA methylation lies on the regulatory pathway linking smoking with lung cancer, which would be in line with Zhang et al.'s finding that the association between smoking and lung cancer was strongly attenuated or even disappeared when DNA methylation was included in predictive models [28]. Therefore, further studies focusing on elucidating potential causal pathways would be desirable. Still, alternative or additional explanations, such as DNA methylation being a more reliable marker of smoking exposure or DNA methylation reflecting susceptibility to smoking exposure, also have to be kept in mind. In addition, genomic variations might influence the DNA methylation patterns identified in our study. However, due to the lack of gene expression data and the limited number of lung cancer cases in our study population, we were not able to address potential underlying pathophysiological mechanisms.
Even with significant strides in diagnosis and treatment, the prognosis of lung cancer remains poor, with overall 5-year survival rates around 15%, primarily owing to detection at advanced stages [29]. Screening by available routine assays such as sputum cytological examination and chest radiography, but also by low-dose computed tomography, has serious limitations [30,31]. Therefore, novel approaches for enhanced risk stratification and performance of lung cancer screening would be highly desirable. DNA methylation signatures might be a promising approach toward this end. Recently, Zhang et al. demonstrated the potential of methylation of F2RL3, a strongly smoking-associated locus, as a predictor of lung cancer risk [28]. Further studies should evaluate the extent to which the identified CpG sites may be more predictive of lung cancer than self-reported smoking indicators or genetic background, and then address the potential of such CpG sites, alone or in combination with other markers, to predict lung cancer

Study population
All study subjects were selected from the ESTHER study, an ongoing statewide population-based cohort study conducted in southwest Germany. Details of the study design have been reported previously [32]. Briefly, 9949 older adults (aged 50-75 years) were enrolled by their general practitioners during a routine health check-up between July 2000 and December 2002, and followed up thereafter. Two independent subgroups were selected as discovery panel and validation panel, respectively, for epigenetic analyses. The discovery panel included 1000 participants who were recruited consecutively at the start of the ESTHER study between July and October 2000. The validation panel included 548 participants randomly selected from participants recruited between October 2000 and March 2001. The study was approved by the ethics committees of the University of Heidelberg and the state medical board of Saarland, Germany. Written informed consent was obtained from all participants.

Data collection
Information on socio-demographic characteristics, lifestyle factors, health status, and history of major diseases at baseline was obtained by standardized self-administered questionnaires. Participants were asked about past and present cigarette, cigar and pipe smoking behavior and were then categorized into current, former and never smokers. Detailed information on smoking history was also obtained from the questionnaires, including age at initiation and smoking intensities at various ages, as well as age at quitting for former smokers. Twenty-two and seventeen participants were excluded from the discovery and the validation panel, respectively, because of missing information on smoking status. Additional information on body mass index (BMI) and prevalent diseases, such as diabetes, cancer, or cardiovascular disease, was extracted from a standardized form filled in by the general practitioners during the health check-ups. Prevalent cardiovascular disease at baseline was defined by either physician-reported coronary heart disease or a self-reported history of myocardial infarction, stroke, pulmonary embolism or revascularization of the coronary arteries. Prevalent cancer [ICD-10 C00-C99 except non-melanoma skin cancer (C44)] was defined by either self-report or records from the Saarland Cancer Registry. Blood samples were taken during the health check-up and stored at -80°C until further processing. Whole blood DNA was extracted by using a salting out procedure [33].
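The smoking indicators used in the analyses can be illustrated with a short calculation. The following is a minimal sketch, assuming the pack-year definition used in this study (20 cigarettes per day for 1 year) and cessation time computed as age at recruitment minus age at cessation. The function names and the representation of smoking history as (cigarettes per day, years) periods are hypothetical and chosen only for illustration; they are not the questionnaire's actual data format.

```python
def pack_years(periods):
    """Cumulative exposure in pack-years.

    `periods` is a hypothetical list of (cigarettes_per_day, years) tuples,
    one per phase of smoking. One pack-year = 20 cigarettes/day for 1 year.
    Never smokers have an empty history and therefore 0 pack-years.
    """
    return sum(cigs_per_day / 20.0 * years for cigs_per_day, years in periods)


def cessation_time(age_at_recruitment, age_at_cessation):
    """Years since quitting: age at recruitment minus age at cessation."""
    return age_at_recruitment - age_at_cessation


# Illustrative history: 10/day for 5 years, 20/day for 15 years, 30/day for 10 years
history = [(10, 5), (20, 15), (30, 10)]
total_exposure = pack_years(history)   # 2.5 + 15.0 + 15.0 pack-years
years_quit = cessation_time(62, 45)    # former smoker recruited at 62 who quit at 45
```

Summing over periods is one simple way to accommodate intensities that changed with age; other aggregation schemes would be equally compatible with the definition.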

DNA methylation data
DNA methylation of whole blood samples was assessed by the Illumina Infinium Human Methylation450 Beadchip array (Illumina, San Diego, CA, USA). As previously described [34], samples were analyzed following the manufacturer's instructions at the Genomics and Proteomics Core Facility of the German Cancer Research Center, Heidelberg, Germany. Illumina's GenomeStudio® (version 2011.1; Illumina Inc.) was employed to extract DNA methylation signals from the scanned arrays (Module version 1.9.0; Illumina Inc.). Methylation status of a specific CpG site was quantified as a β value ranging between 0 (no methylation) and 1 (full methylation). According to the manufacturer's protocol, no background correction was done and data were normalized to internal controls provided by the manufacturer. All controls were checked for inconsistencies in each measured plate. Signals of probes with a detection p-value >0.05 were excluded from analysis. We used the Illumina normalization and preprocessing method implemented in Illumina's GenomeStudio ("Illumina normalization").
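To make the β value and the detection p-value filter concrete, the sketch below computes β from methylated and unmethylated signal intensities and masks unreliable signals. It assumes the commonly documented Illumina definition β = M / (M + U + 100); the offset of 100 and the function and parameter names are assumptions not stated in this paper, and the intensities are invented for illustration.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Beta value from methylated (M) and unmethylated (U) intensities.

    Assumed definition: beta = M / (M + U + offset), with negative
    intensities clipped to 0. The offset (100 here, an assumption)
    stabilizes the ratio when both intensities are low.
    """
    m = max(methylated, 0.0)
    u = max(unmethylated, 0.0)
    return m / (m + u + offset)


def mask_by_detection(betas, detection_pvalues, threshold=0.05):
    """Replace betas whose detection p-value exceeds the threshold with None."""
    return [b if p <= threshold else None
            for b, p in zip(betas, detection_pvalues)]


# Invented intensities for three probes in one sample
betas = [beta_value(900, 0), beta_value(0, 1000), beta_value(450, 450)]
kept = mask_by_detection(betas, [0.001, 0.20, 0.03])
```

With this definition β approaches, but never quite reaches, 1 for a fully methylated site, which is consistent with the nominal 0 to 1 range described above.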

Identification of CpG candidates
GWASs for lung cancer conducted among smokers, non-smokers and the general population that were published from 2007 to July 2015 [2, 16-21, 23-25, 35-39] were reviewed by one of the authors (XG), from which 120 lung cancer related SNPs within 59 genetic regions were identified (Figure 2). Furthermore, since cis-expression quantitative trait loci (cis-eQTL) might affect the expression levels of nearby genes [40], we identified 33 cis-eQTL within 1 Mb of the identified SNPs from the blood cis-eQTL database (FDR <0.05) [40]. After excluding 17 duplicates, we identified 3044 corresponding methylation probes within the remaining 75 lung cancer related genes from the probe database of the Illumina 450K assay. Subsequently, we excluded 3 probes containing SNPs with a minor allele frequency above 1% from the candidate list, since variation at these SNPs can bias the methylation measurement [41]. We also excluded known cross-reactive and polymorphic probes (n=187), as they could introduce bias in the results [42]. Finally, we obtained a list of 2854 probes considered for further analysis (Supplementary Table S1).
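The candidate selection described above can be checked with simple bookkeeping. The following sketch only re-derives the reported counts; the variable names are illustrative, and it assumes that the 17 duplicates are removed from the combined set of 59 SNP-based regions and 33 cis-eQTL genes, which is one reading of the text.

```python
# Counts reported in the text; names are hypothetical labels for this check.
snp_based_regions = 59   # regions derived from 120 lung cancer related SNPs
cis_eqtl_genes = 33      # cis-eQTL genes within 1 Mb of those SNPs
duplicates = 17          # overlap between the two sources (assumed reading)

candidate_genes = snp_based_regions + cis_eqtl_genes - duplicates

probes_on_array = 3044       # 450K probes mapped to the candidate genes
snp_containing_probes = 3    # probes containing SNPs with MAF above 1%
cross_reactive_probes = 187  # known cross-reactive and polymorphic probes

final_probes = probes_on_array - snp_containing_probes - cross_reactive_probes

print(candidate_genes, final_probes)
```

Both results agree with the 75 genes and 2854 probes stated in the text.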

Statistical analysis
The study populations in the discovery and validation panels were described with respect to major socio-demographic characteristics, lifestyle factors, smoking behavior and prevalent diseases.
Firstly, we chose the current and never smokers from the discovery panel to investigate the associations between current smoking exposure (current vs. never; independent variable) and methylation levels of the 2854 CpG candidates (dependent variable). Three mixed linear regression models with methylation assay batch as random effect were employed, controlling for potential confounding factors, including factors that have been shown to be associated with DNA methylation in previous studies [43-47]. Model 1 was adjusted for age (years) and sex. Model 2 was additionally adjusted for the leukocyte distribution estimated by the Houseman algorithm [27]. Model 3 was further adjusted for alcohol consumption (abstainer, low [women: 0 -<20 g/d, men: 0 -<40 g/d], intermediate [20 - ), the prevalence of cardiovascular diseases (yes/no), diabetes (yes/no) and cancer (yes/no). After correction for multiple testing by the false discovery rate (FDR, Benjamini-Hochberg method [48]), CpG sites with corrected p-values <0.05 were selected (raw p-value <5.4×10^-4). A Manhattan plot was generated with the R package 'qqman'. Identified sites were then validated in current and never smokers from the validation panel. Loci with replication FDR <0.05 were considered smoking-associated loci.
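The multiple-testing correction can be illustrated as follows. This is a minimal pure-Python sketch of the Benjamini-Hochberg step-up adjustment, not the SAS procedure used in the study; the function names and the toy p-values are illustrative. The helper `bh_cutoff` shows why a raw threshold of roughly 5.4×10^-4 would be expected when 31 of 2854 tests are declared significant at FDR <0.05.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def bh_cutoff(n_significant, n_tests, alpha=0.05):
    """Largest raw p-value that can still be rejected: (k / m) * alpha."""
    return n_significant / n_tests * alpha


toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
discovery_cutoff = bh_cutoff(31, 2854)  # about 5.43e-4
```

The cutoff of about 5.43×10^-4 obtained for 31 significant sites among 2854 candidates appears consistent with the raw p-value threshold reported above.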
To evaluate the impact of cumulative smoking exposure and smoking cessation on DNA methylation, we separately performed additional analyses on the associations of pack-years and time since cessation of smoking with the validated smoking-associated CpG sites in the validation panel. Furthermore, the differences in the methylation of the validated CpG sites were compared for current smokers vs. former smokers and for former smokers vs. never smokers. In all aforementioned analyses, the models were adjusted for covariates as in Model 3 and p-values were corrected by FDR (FDR <0.05). Mutual correlations between methylation at the validated CpG sites were assessed by Spearman's correlation coefficients. All data analyses were conducted by SAS version 9.3 (SAS Institute Inc., Cary, NC, USA).
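For readers unfamiliar with the correlation measure used for the validated CpG sites, the sketch below computes Spearman's correlation coefficient as the Pearson correlation of average ranks. It is a plain-Python illustration under that standard definition rather than the SAS routine used in the study, and the methylation values are invented.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented beta values at two CpG sites across five participants
site_a = [0.81, 0.79, 0.84, 0.77, 0.80]
site_b = [0.62, 0.60, 0.66, 0.58, 0.61]
rho = spearman(site_a, site_b)
```

Because the measure depends only on ranks, it is insensitive to the skewed distributions that β values often show.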