Functional intronic ERCC1 polymorphism from RegulomeDB can predict survival in lung cancer after surgery.

We searched for potential regulatory single nucleotide polymorphisms (SNPs) in excision repair cross-complementing group 1 (ERCC1) using RegulomeDB, a database integrating information from the Encyclopedia of DNA Elements (ENCODE) project, and investigated their association with survival after surgery in non-small cell lung cancer (NSCLC). Among 364 SNPs found within the ERCC1 region using RegulomeDB, four top-priority SNPs (rs2298881C>A, rs1049739A>G, rs10415949A>G and rs6509214G>T) were selected for this study. The four SNPs were investigated in a discovery set of 316 patients, and a replication study was performed in a validation cohort (n = 579). Of the four SNPs analyzed in the discovery set, rs2298881C>A and rs6509214G>T were significantly associated with survival outcomes. The association was consistently observed only for rs2298881C>A in the validation cohort. In combined analysis, rs2298881C>A was significantly associated with worse overall survival and disease-free survival (P = 0.0002 and 0.02, respectively). In a luciferase assay, reporter gene expression was lower for the rs2298881 A allele than for the C allele (P = 0.02). ERCC1 rs2298881C>A, an intronic SNP, is the first genetic polymorphism with functional evidence of regulating ERCC1 expression, and the SNP is associated with the prognosis of NSCLC. Our results support the role of RegulomeDB as a comprehensive source of prioritized candidate SNPs for genetic association studies.


INTRODUCTION
Excision repair cross-complementing group 1 (ERCC1) is involved in the nucleotide excision repair pathway, which eliminates bulky DNA adducts caused by carcinogens in tobacco smoke and by platinum-based chemotherapeutic agents [1,2]. ERCC1 has therefore been linked both to protection against the development and progression of cancer and to resistance to platinum-based anticancer drugs: a double-edged sword. Based on the biological significance of ERCC1, its use as a predictive or prognostic biomarker has been pursued by a large number of cancer researchers. The expression of ERCC1, measured by quantitative real-time polymerase chain reaction or immunohistochemistry, has been correlated with the clinical outcomes of non-small cell lung cancer (NSCLC) [3][4][5]. Genetic polymorphisms of ERCC1 have also been investigated for association with the risk and clinical outcome of many types of cancer, including NSCLC [6][7][8][9][10][11][12][13][14]. The most widely studied single nucleotide polymorphisms (SNPs) include rs11615T>C (N118N), which is the only SNP tested in the exon region of ERCC1, and rs3212986C>A in the 3′-UTR of ERCC1 (Q504K for CD3EAP, antisense to ERCC1). Not surprisingly, the association of these SNPs with NSCLC has not been consistent across studies.
The Human Genome Project revealed that only 2% of the human genome consists of protein-coding genes, with the vast majority remaining as 'junk DNA' [15]. However, despite intensive studies focused on protein-coding genes, our understanding of the genome is far from complete [16]. In addition, nearly 90% of the variants identified as phenotype-associated SNPs in genome-wide association studies (GWAS) are located within intergenic or intronic regions, which poses an obstacle to their interpretation [16,17]. Therefore, it has been suggested that the genomic regions outside protein-coding genes may hold the key to unlocking the vast genetic information of the human genome.
The Encyclopedia of DNA Elements (ENCODE) project aims to describe all functional elements encoded in the human genome [16]. It revealed that 80% of the genome, especially outside of protein-coding regions, contains elements linked to biochemical functions such as DNA-transcription factor binding, providing new insights into the mechanisms of gene regulation [16]. RegulomeDB is a database that integrates a large collection of regulatory information from ENCODE and other data sources [17], making it a rich source of information that may provide putative mechanistic explanations for genetic association studies including GWAS [18]. To date, only a few studies have utilized RegulomeDB to predict the regulatory function of SNPs in non-coding regions that were identified by GWAS or candidate gene studies [19][20][21][22].
RegulomeDB provides a scoring system that prioritizes SNPs based on the degree of experimental or computational evidence that a variant lies in a functional location and likely results in a functional consequence, e.g., alteration of transcription factor binding and gene expression [17]. Therefore, selection of potential regulatory SNPs using RegulomeDB may help to improve the power to detect true causal variants in genetic association studies. In the present study, we selected SNPs with high confidence of functional consequence in the ERCC1 gene region using RegulomeDB and investigated the association between those SNPs and the survival of NSCLC patients after curative surgery.

RESULTS
Patient characteristics and clinical predictors
The clinical and pathologic characteristics of patients in the discovery and validation sets and their association with overall survival (OS) and disease-free survival (DFS) are shown in Table 1. In univariate analysis, pathologic stage was significantly associated with OS and DFS in both sets (log-rank P [P_L-R] for OS = 2.0 × 10⁻⁶ and 0.0006; P_L-R for DFS = 4.0 × 10⁻¹⁰ and 5.0 × 10⁻⁷, respectively). Gender was associated with OS in the discovery set (P_L-R = 0.04), and age was associated with OS and DFS in the validation set (P_L-R = 0.0001 and 0.01, respectively).

Associations between SNPs and survival outcomes
Among the four SNPs analyzed in the discovery set, rs2298881C>A and rs6509214G>T were significantly associated with survival outcomes when adjusted for age, gender, smoking status, tumor histology, pathologic stage, and adjuvant therapy (Table 2). However, the association was consistently observed only for rs2298881C>A in an independent validation set, where it was in the same direction as in the discovery set. In combined analysis, rs2298881C>A was significantly associated with worse OS and DFS (adjusted HR [aHR] for OS, 1.37; 95% CI, 1.16-1.63; P = 0.0002; aHR for DFS, 1.17; 95% CI, 1.03-1.34; P = 0.02; under an additive genetic model; Table 2 and Figure 1).
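As an illustrative check that is not part of the original analysis, the reported hazard ratios and confidence intervals can be related to the reported P values. The sketch below assumes Wald-type 95% confidence intervals that are symmetric on the log scale, which the text does not state explicitly; the function name `wald_from_ci` is hypothetical.

```python
import math

def wald_from_ci(hr, lower, upper, z_crit=1.96):
    """Back-calculate the log-HR standard error, z statistic, and two-sided P value.

    Assumes a Wald-type confidence interval: exp(log(HR) +/- z_crit * SE).
    """
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return se, z, p

# Values reported for rs2298881C>A in the combined analysis (additive model).
for label, hr, lower, upper in [("OS", 1.37, 1.16, 1.63), ("DFS", 1.17, 1.03, 1.34)]:
    se, z, p = wald_from_ci(hr, lower, upper)
    print(f"{label}: SE(log HR) = {se:.3f}, z = {z:.2f}, approximate P = {p:.4f}")
```

Because the published hazard ratios and interval limits are rounded to two decimals, the back-calculated P values can be expected to agree with the reported ones only approximately.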

Effect of rs2298881C>A on the promoter activity of ERCC1
To investigate whether rs2298881C>A affects promoter activity of ERCC1, we generated three pGL3-ERCC1 constructs: pGL3-ERCC1pro, with the cloned ERCC1 promoter region alone, and pGL3-ERCC1pro_C and pGL3-ERCC1pro_A, with both the promoter region and the fragment containing rs2298881C>A (Figure 2A). As shown in Figure 2B, luciferase activity was significantly higher in H1299 cells transfected with pGL3-ERCC1pro_C or pGL3-ERCC1pro_A than in cells transfected with pGL3-ERCC1pro, suggesting that the fragment containing rs2298881C>A enhanced the activity of the ERCC1 promoter. Reporter gene expression was lower for the A allele of rs2298881C>A than for the C allele in the luciferase assay (P = 0.02; Figure 2B). These results suggest that the intronic SNP rs2298881C>A may alter ERCC1 expression by affecting ERCC1 promoter activity.
Table footnotes: The numbers of patients in the discovery, validation, and combined cohorts for rs2298881 were 316, 579, and 895, respectively. § The numbers of patients in the discovery, validation, and combined cohorts for rs6509214 were 316, 412, and 728, respectively, due to lack of available samples for validation.

DISCUSSION
We investigated the association between potential regulatory SNPs in the ERCC1 gene region selected from RegulomeDB and the survival of patients with surgically resected early-stage NSCLC in a relatively large two-stage study including 895 patients. Our study showed a significant association between ERCC1 rs2298881C>A and the prognosis of patients with early-stage NSCLC, which was reproducible in an independent set of patients. We also report that rs2298881C>A, an intronic SNP of ERCC1, is the first genetic polymorphism with functional evidence of regulating ERCC1 expression. These findings suggest that ERCC1 rs2298881C>A could be used as a prognostic marker for early-stage NSCLC, and that RegulomeDB may be useful in selecting potentially functional SNPs in regulatory regions for genetic association studies.
In the present study, we searched for regulatory SNPs in the ERCC1 gene region using RegulomeDB and showed that rs2298881C>A was associated with a worse prognosis in NSCLC patients after curative resection. An in vitro luciferase assay showed that the ERCC1 rs2298881 C-to-A change was associated with reduced promoter activity of ERCC1. According to RegulomeDB, rs2298881C>A has the highest level of evidence for a regulatory role among SNPs in the ERCC1 gene region. In addition, based on RegulomeDB, rs2298881C>A is the only SNP in the whole genome reported as an eQTL predicted to regulate the expression of ERCC1. Our result is in line with the growing recognition of the regulatory function of non-coding DNA and suggests the need for investigating variants in regulatory regions.
Recent results of the ENCODE project provided evidence that genetic variation in non-coding DNA plays an important role in the regulation of gene expression. The ENCODE data show that the results of GWAS are typically enriched for variants within non-coding functional units, suggesting that many of these regions could be causally linked to disease [17,18,24]. Therefore, RegulomeDB, which contains ENCODE data, is a powerful tool for predicting the likelihood that a SNP lies in a functional location, thereby facilitating the prioritization of SNPs for genetic association studies. The results of our study support the role of RegulomeDB in selecting putative regulatory SNPs for future genetic association studies.
Genetic polymorphisms of ERCC1 have been investigated in terms of the risk and the clinical outcomes in many types of cancer including NSCLC [6][7][8][9][10][11][12][13][14]. However, most of the studies have focused on only a few SNPs, such as ERCC1 rs11615T>C (N118N) and rs3212986C>A in the 3′-UTR, and the results have not been consistent among studies. We previously investigated these two SNPs in terms of the clinical outcomes of early-stage NSCLC after surgery and advanced NSCLC after platinum-based chemotherapy in Koreans [13,25,26]. However, neither rs11615T>C nor rs3212986C>A showed a significant association with the outcome of NSCLC [13,25,26]. In the present study, we searched RegulomeDB for potential regulatory SNPs in ERCC1, and investigated their association with survival after surgery in NSCLC. In fact, rs2298881C>A and other intronic SNPs such as rs3212961A>C and rs3212948G>C were selected as haplotype-tagging SNPs in the previous studies that first evaluated those SNPs [27,28]. However, a small number of studies on rs2298881C>A with variable numbers of patients have shown discrepant associations with various types of cancer [27,29,30,31].
Figure 2: A. ... reporter gene assays. Promoters are marked by white blocks and the fragments including the rs2298881C>A site by black blocks, and the arrow indicates the direction of transcription. The first base of the translation start site is denoted as +1. The ERCC1 promoter was amplified from human genomic DNA and cloned into the pGL3 basic vector (pGL3-ERCC1pro). DNA fragments containing the SNP site were cloned into the other multiple cloning site (pGL3-ERCC1pro_C and pGL3-ERCC1pro_A). B. Luciferase activity according to ERCC1 rs2298881C>A. H1299 cells were transfected with the pGL3-ERCC1pro, pGL3-ERCC1pro_C and pGL3-ERCC1pro_A constructs, respectively. Each bar represents the mean ± S.E.M. of firefly luciferase activity normalized to Renilla luciferase activity. Experiments were performed in triplicate. P value by Student's t-test. luc, luciferase.
www.impactjournals.com/oncotarget
In this study, we included a total of 895 patients, which is relatively large for a study of surgically resected NSCLC. The association between ERCC1 rs2298881C>A and survival outcomes was replicated across both the discovery and validation sets, which substantially reduces the likelihood of a false-positive finding [32,33]. In addition, the association of rs2298881C>A with survival outcome was biologically plausible. It is possible that the ERCC1 rs2298881 C-to-A change in the putative regulatory region leads to reduced promoter activity and decreased ERCC1 expression, resulting in decreased DNA repair capacity and therefore a worse disease outcome. However, in our preliminary analysis, no significant difference in the relative expression level of ERCC1 mRNA among genotypes of rs2298881C>A was observed in either tumor or paired non-malignant lung tissues (Supplementary Figure 1). Future studies are required to understand the biologic mechanism of the observed association between the SNP and survival outcomes.
In conclusion, this study showed that ERCC1 rs2298881C>A could predict the survival outcomes of patients with surgically resected early-stage NSCLC. RegulomeDB may be useful as a practical tool for selecting potentially functional SNPs in regulatory regions for future genetic association studies.

MATERIALS AND METHODS
Study population
The discovery set included 316 patients with pathologic stage I, II, or IIIA (micro-invasive N2) NSCLC who underwent curative surgical resection at Kyungpook National University Hospital (KNUH) between September 1998 and August 2007. Genomic DNA samples from tumor and corresponding non-malignant lung tissue specimens were provided by the National Biobank of Korea-KNUH, which is supported by the Ministry of Health, Welfare and Family Affairs. Written informed consent was obtained from all patients prior to surgery. All materials derived from the National Biobank of Korea-KNUH were obtained under institutional review board-approved protocols. The validation set included 579 patients with pathologic stage I, II, or IIIA NSCLC who underwent curative surgical resection at KNUH (n = 99), Seoul National University Hospital (n = 307), or Seoul National University Bundang Hospital (n = 173). Written informed consent was obtained from all patients before surgery, and the research protocol was approved by the institutional review board at each hospital. All of the patients included in this study were ethnic Koreans. Patients who underwent chemotherapy or radiotherapy prior to surgery were excluded to avoid treatment effects on DNA. The pathologic staging of the tumors was determined according to the International System for Staging Lung Cancer [23].

SNP selection and genotyping
Three hundred and sixty-four SNPs were found within the ERCC1 gene region, NC_000019.9 (45910591..45982241, complement) in the Genome Reference Consortium Human Build 37 patch release 13 (GRCh37.p13) assembly, using RegulomeDB (http://regulome.stanford.edu). RegulomeDB provides a scoring system with categories ranging from 1 to 6 based on the degree of experimental or computational evidence for a functional consequence of a given variant. Category 1 includes variants that are known expression quantitative trait loci (eQTLs), which have been shown to be associated with expression of target genes, and is further divided into subcategories 1a to 1f. Because a lower score indicates stronger evidence that a variant is located in a functional region, a variant scored as 1a most likely affects transcription factor binding and expression of a target gene. We prioritized the 364 SNPs using RegulomeDB and selected the five SNPs that were classified into category 1: two SNPs (rs2298881C>A and rs1049739A>G) had a score of 1b, and three SNPs (rs10415949A>G, rs6509214G>T, and rs7245548C>T) had a score of 1f. Among those, rs7245548 was not genotyped because it was in linkage disequilibrium with rs6509214 according to the HapMap JPT database. Finally, four SNPs (rs2298881C>A, rs1049739A>G, rs10415949A>G, and rs6509214G>T) were selected for genotyping to investigate their relationship with survival in NSCLC.
Genomic DNA was extracted from tissues with the QIAamp® genomic DNA kit (Qiagen, Hilden, Germany) according to the manufacturer's protocol. The rs2298881C>A SNP was genotyped using the TaqMan® assay (Applied Biosystems, Foster City, CA) following the manufacturer's instructions. The TaqMan probes were predesigned and synthesized by Applied Biosystems. The rs10415949A>G and rs6509214G>T SNPs were genotyped using SEQUENOM's MassARRAY® iPLEX assay according to the manufacturer's instructions. Duplicate samples and negative controls were included to ensure the accuracy of genotyping. For validation of genotyping, approximately 5% of the samples in the cohort were randomly selected and genotyped again with a restriction fragment length polymorphism assay by a different investigator, and the results were 100% concordant.
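The prioritization described above can be sketched as a simple filter. The scores for the five category 1 SNPs are those given in the text; the two placeholder entries, the dictionary layout, and the function names are hypothetical. The sketch assumes that the RegulomeDB scores have already been exported to a local mapping and does not represent any particular RegulomeDB interface.

```python
# Scores for the five category 1 SNPs are taken from the text. The placeholder
# entries are hypothetical and stand in for the remaining lower-priority SNPs.
regulome_scores = {
    "rs2298881": "1b",
    "rs1049739": "1b",
    "rs10415949": "1f",
    "rs6509214": "1f",
    "rs7245548": "1f",
    "rs_placeholder_A": "4",  # hypothetical
    "rs_placeholder_B": "6",  # hypothetical
}

# SNP to drop -> SNP retained, for pairs reported to be in linkage disequilibrium.
ld_redundant = {"rs7245548": "rs6509214"}

def score_key(score):
    """Sort key: numeric category first, then subcategory letter (1a before 1b, etc.)."""
    return (int(score[0]), score[1:])

def prioritize(scores, max_category=1, redundant=None):
    """Keep SNPs at or above the category threshold and drop LD-redundant ones."""
    redundant = redundant or {}
    kept = [
        (snp, score)
        for snp, score in scores.items()
        if int(score[0]) <= max_category
        and not (snp in redundant and redundant[snp] in scores)
    ]
    return sorted(kept, key=lambda item: score_key(item[1]))

selected = prioritize(regulome_scores, max_category=1, redundant=ld_redundant)
for snp, score in selected:
    print(snp, score)
```

With these inputs the filter retains the four SNPs that were genotyped in this study. Sorting by score keeps the 1b variants ahead of the 1f variants, mirroring the rule that a lower score indicates stronger evidence.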

Cloning of the luciferase reporter gene and luciferase assay
The rs2298881C>A SNP is an eQTL for the ERCC1 gene and is located in intron 1. RegulomeDB suggests that the SNP lies in a location that overlaps a transcription factor binding site and regulates gene expression. We investigated whether rs2298881C>A affects ERCC1 promoter activity by luciferase reporter assay. The pGL3-Basic Vector (Promega, Madison, WI, USA) was used to construct luciferase reporter plasmids according to the manufacturer's protocols. Briefly, the promoter region of ERCC1 (-970 to -29 bp, the transcriptional start site is designated as +1) was amplified by polymerase chain reaction from human genomic DNA and cloned into the pGL3-Basic vector to generate pGL3-ERCC1pro. Two 160 bp fragments containing either the C or the A allele of ERCC1 rs2298881C>A were amplified from genomic DNA samples and cloned into pGL3-ERCC1pro to generate pGL3-ERCC1pro_C and pGL3-ERCC1pro_A, respectively. All constructs were verified by direct sequencing before use. Human non-small cell lung cancer cells (H1299) were maintained at 37°C in a 5% CO2 atmosphere in RPMI-1640 medium containing 10% heat-inactivated fetal bovine serum (FBS). The cells were transfected with 300 ng of each plasmid DNA (pGL3-ERCC1pro, pGL3-ERCC1pro_C, or pGL3-ERCC1pro_A) and 30 ng of pRL-SV40 Vector (Promega, Madison, WI, USA) using Effectene® (Qiagen, Hilden, Germany) according to the manufacturer's protocol. Luciferase activity was measured on an Orion L Microplate Luminometer (Berthold Detection Systems GmbH, Pforzheim, Germany) using the Dual-Luciferase® Reporter Assay System (Promega, Madison, WI, USA). Firefly luciferase activity measurements were normalized to pRL-SV40 Renilla luciferase activity to correct for variations in transfection efficiency. Each experiment was conducted in triplicate.
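The normalization and comparison steps can be illustrated with a minimal sketch. The luminescence readings below are hypothetical and are not data from this study; the function names are likewise illustrative. The sketch computes the firefly/Renilla ratio per replicate, the mean ± S.E.M., and a pooled-variance Student's t statistic.

```python
import math
from statistics import mean, stdev

def normalize(firefly, renilla):
    """Firefly/Renilla ratio for each replicate well."""
    return [f / r for f, r in zip(firefly, renilla)]

def mean_sem(values):
    """Mean and standard error of the mean."""
    return mean(values), stdev(values) / math.sqrt(len(values))

def student_t(a, b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2

# Hypothetical raw readings from triplicate wells, NOT data from the study.
ratio_c = normalize([1200, 1365, 1250], [100, 105, 100])  # pGL3-ERCC1pro_C
ratio_a = normalize([900, 1050, 950], [100, 105, 100])    # pGL3-ERCC1pro_A

mean_c, sem_c = mean_sem(ratio_c)
mean_a, sem_a = mean_sem(ratio_a)
t_stat, dof = student_t(ratio_c, ratio_a)
print(f"C allele: {mean_c:.2f} ± {sem_c:.2f}; A allele: {mean_a:.2f} ± {sem_a:.2f}")
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

A P value would then be read from the t distribution with the returned degrees of freedom, for example using a statistics package. Dividing by the Renilla signal before averaging is what corrects each well for its own transfection efficiency.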

Statistical analysis
Differences in the distribution of genotypes according to the clinicopathologic factors of patients were compared using χ² tests. Hardy-Weinberg equilibrium was tested using a goodness-of-fit χ² test with 1 degree of freedom. The genotypes for each SNP were analyzed as a three-group categorical variable and were also grouped according to dominant, recessive, and additive models. Overall survival (OS) was measured from the day of surgery to the date of the last follow-up or the date of death. Disease-free survival (DFS) was calculated from the day of surgery until recurrence or death. The survival estimates were calculated using the Kaplan-Meier method. Differences in OS and DFS according to the SNPs were compared using log-rank tests. A Cox proportional hazards regression model was used for the multivariate survival analyses, and the analyses were always adjusted for age (> 63 versus ≤ 63 years), gender (male versus female), smoking status (ever versus never), tumor histology (squamous versus non-squamous), pathologic stage (II-IIIA versus I), and adjuvant therapy (yes versus no). The hazard ratio (HR) and 95% confidence interval (CI) were also estimated. A P value cut-off of 0.05 was adopted for all statistical analyses. The statistical analyses were performed using SAS Genetic software (SAS Institute, Cary, NC).
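Two of the steps above, the Hardy-Weinberg goodness-of-fit test and the grouping of genotypes under genetic models, can be sketched as follows. The genotype counts are hypothetical and are not data from this study. The coding of the three models follows the usual convention (number of variant alleles for the additive model), which is an assumption because the text does not spell out the coding. The sketch also assumes a polymorphic SNP, so that no expected count is zero.

```python
import math

def hwe_chi_square(n_major_hom, n_het, n_minor_hom):
    """Goodness-of-fit chi-square test (1 degree of freedom) for Hardy-Weinberg equilibrium."""
    n = n_major_hom + n_het + n_minor_hom
    p = (2 * n_major_hom + n_het) / (2 * n)  # major allele frequency
    q = 1 - p                                # minor allele frequency
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_major_hom, n_het, n_minor_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail for 1 df
    return chi2, p_value

def code_genotype(minor_allele_count):
    """Code a genotype carrying 0, 1, or 2 minor alleles under three genetic models."""
    return {
        "additive": minor_allele_count,
        "dominant": int(minor_allele_count >= 1),
        "recessive": int(minor_allele_count == 2),
    }

# Hypothetical genotype counts (for example CC, CA, AA), NOT data from the study.
chi2, p_value = hwe_chi_square(50, 40, 10)
print(f"chi-square = {chi2:.3f}, P = {p_value:.3f}")
for count in (0, 1, 2):
    print(count, code_genotype(count))
```

Under the additive coding, a hazard ratio from the Cox model is interpreted per copy of the variant allele, which is how the aHR values reported under the additive model are usually read.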