Risk assessment models for genetic risk predictors of lung cancer using two-stage replication for Asian and European populations

Over the past ten years, candidate-gene studies and genome-wide association studies have achieved considerable success. However, few studies have systematically evaluated genetic effects on lung cancer risk in large-scale, ethnically diverse populations. We systematically reviewed the relevant literature and selected 241 important genetic variants identified in 124 articles. A two-stage case-control study within specific subgroups was performed to assess their effects [training set: 2,331 cases vs. 3,077 controls (Chinese population); testing set: 1,937 cases vs. 1,984 controls (European population)]. LASSO penalized regression and a genetic risk score (GRS) system were used for variable selection and model development. The change in the area under the receiver operating characteristic curve (AUC) between the epidemiologic model with and without the GRS was used to compare predictions. Thirty-eight genetic variants were retained, and the lung cancer risk for subjects in the upper GRS quartile was more than four times that of subjects in the lowest quartile (odds ratio: 4.64, 95% CI: 3.87–5.56). In addition, adding genetic predictors to the smoking-only model significantly improved the predictive value for lung cancer: AUC, 0.610 versus 0.697 (P < 0.001). Similar performance was observed in the European population and in the two data sets combined. Our findings suggest that genetic predictors can improve the predictive ability of risk models for lung cancer across different populations, indicating that such a risk assessment model may be a promising tool for screening and prediction in high-risk populations.


INTRODUCTION
Lung cancer is one of the most commonly diagnosed malignancies and the leading cause of cancer-related deaths in the world, with almost 1.6 million deaths per year (19.4% of total cancer mortality) [1]. It is well known that the major environmental cause is tobacco smoking, which accounts for over 80% of all lung cancer cases. However, fewer than 20% of smokers develop lung cancer, suggesting that individual variation in genetic susceptibility may play an important role [2]. Over the past ten years, both candidate-gene studies and genome-wide association studies (GWAS) have successfully identified dozens of loci associated with lung cancer risk. Although researchers have tested whether genetic variants identified in previous studies increase the predictive ability of models for common disorders such as cardiovascular disease [3], breast cancer [4,5], prostate
Despite significant advances in medical therapy, the prognosis of lung cancer remains poor, with a five-year survival rate of 16.6% [13], as most cases are diagnosed at an advanced stage. Indeed, when lung cancer is detected before metastasis, the five-year survival rate can reach 60-80% [14]. Therefore, early detection and diagnosis of lung cancer should be a focus of future research. In this respect, screening populations at high risk of lung cancer is an important element.
We therefore systematically reviewed the relevant literature and screened out the genetic variants associated with lung cancer risk. We then performed a two-stage case-control study with nearly ten thousand samples to assess the effects of the selected genetic predictors. This study showed that genetic predictors could improve the predictive ability of risk models for lung cancer among different populations, which may benefit clinical practice and public health.

General description of subjects
The NJMU GWAS contains 2,331 lung cancer cases and 3,077 healthy controls and was used as the training set to construct the model, while the EAGLE study, containing 1,937 cases and 1,984 controls, was used to validate the model. Across the two data sets, cases had a significantly higher smoking rate than controls (76.85% vs. 52.66%; Supplementary Table S1).

General information of genetic risk score
Forty of the 241 lung cancer-associated SNPs were significantly associated with lung cancer risk in this study at P < 0.05 in univariate analysis (data not shown). LASSO penalized regression applied to these SNPs then selected 38 SNPs in the training set, as shown in Table 1. To assess the cumulative risk conferred by the genetic predictors, we calculated a "genetic risk score" (GRS). In the combined European and Chinese samples, the mean risk score among lung cancer cases (1.04 ± 0.14) was higher than that among cancer-free controls (0.99 ± 0.15), with an average risk score of 1.01 ± 0.15 overall. We further split the GRS into two subgroups at its 90th percentile: a low-risk group (GRS < 1.21) and a high-risk group (GRS ≥ 1.21). Under this classification, of the 9,329 individuals in the combined population, 8,446 were classified into the low-risk group, of whom 3,792 (44.90%) were lung cancer cases, and 883 were classified into the high-risk group, of whom 476 (53.91%) were lung cancer cases.
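The group counts above are enough to recompute the reported case proportions and, as an illustration only, a crude odds ratio for the high-risk versus low-risk group. The sketch below assumes a simple unadjusted 2x2 table with a Woolf (log-scale) confidence interval; this crude estimate is not reported in the study, and the function and variable names are hypothetical.

```python
import math


def odds_ratio_ci(cases_exposed, controls_exposed,
                  cases_unexposed, controls_unexposed, z=1.96):
    """Crude odds ratio with a Woolf (log-scale) confidence interval."""
    odds_ratio = (cases_exposed * controls_unexposed) / (
        controls_exposed * cases_unexposed)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / cases_exposed + 1 / controls_exposed
                          + 1 / cases_unexposed + 1 / controls_unexposed)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Counts taken from the text (combined Chinese and European samples).
high_total, high_cases = 883, 476      # GRS >= 1.21
low_total, low_cases = 8446, 3792      # GRS < 1.21
high_controls = high_total - high_cases
low_controls = low_total - low_cases

# Percentage of cases within each GRS group.
case_share_high = 100 * high_cases / high_total
case_share_low = 100 * low_cases / low_total

# Unadjusted illustration only; not a result reported by the study.
crude_or, ci_low, ci_high = odds_ratio_ci(
    high_cases, high_controls, low_cases, low_controls)

print(f"cases in high-risk group: {case_share_high:.2f}%")
print(f"cases in low-risk group:  {case_share_low:.2f}%")
print(f"crude OR: {crude_or:.2f} ({ci_low:.2f}-{ci_high:.2f})")
```

Because this estimate ignores smoking and other covariates, it should not be compared directly with the model-based odds ratios reported in Table 2.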

Cumulative effects of genetic and environmental factors with lung cancer
The odds ratios (ORs) for lung cancer were examined by percentiles of the GRS, alone and in combination with smoking status. In the discovery stage, the estimated OR for subjects in the upper GRS quartile was 4.64 (95% CI: 3.87-5.56) compared with the lowest quartile (P for trend: 7.52E-69). When smoking was also taken into account, the increase in risk was more pronounced (P for trend: 5.41E-94, Table 2). This trend was validated in the external data: the risk of lung cancer increased 4.36 times when the smoking factor was combined with the GRS (P for trend: 1.81E-53).

Discrimination performance
To further assess the discriminative accuracy of the model, we measured the area under the receiver operating characteristic (ROC) curve using the C statistic (Table 3). The model based only on the smoking factor had low discriminatory accuracy in the training data set (AUC = 0.610, Table 3). When the genetic factors were added, performance improved (AUC = 0.697, Table 3, Figure 1A), and this held for squamous cell carcinoma, adenocarcinoma and other types of lung cancer (Supplementary Figure S2A-S2C). Similar performance was observed in the testing samples [C statistics: 0.625 (95% CI: 0.613-0.637) vs. 0.647 (95% CI: 0.630-0.664), P = 0.004, Figure 1B] and in the two data sets combined [C statistics: 0.625 (95% CI: 0.615-0.634) vs. 0.658 (95% CI: 0.647-0.669), P < 0.001, Figure 1C]. The Hosmer-Lemeshow goodness-of-fit test indicated that the extended model was adequate (P > 0.05, Table 3). In addition, the genetic model performed moderately, with an AUC of 0.604, among non-smokers in the two data sets (Supplementary Figure S2D).
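For reference, the Hosmer-Lemeshow statistic is commonly written in the form below. This is the standard textbook expression, given as an illustration only; the number of groups G used in this study is not stated in the text. Here O_g is the observed number of cases in group g, E_g is the expected number (the sum of predicted probabilities in the group), and n_g is the group size.

```latex
\hat{C} = \sum_{g=1}^{G} \frac{\left(O_g - E_g\right)^2}{E_g \left(1 - E_g / n_g\right)}
```

The statistic is usually compared with a chi-square distribution with G - 2 degrees of freedom, so a P value above 0.05 gives no evidence of poor calibration.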

DISCUSSION
In this study involving 4,268 lung cancer cases and 5,061 cancer-free controls, 38 of 241 SNPs systematically identified from previous studies were used to calculate a genetic risk score. Risk assessment models combining the genetic variants and the smoking factor were a useful tool for predicting lung cancer risk. In the present study, the model with only the smoking factor showed low discriminatory accuracy (AUC = 0.610 in the discovery data set). However, when a genetic risk score based on 38 SNPs was added to the model, the AUC increased to 0.697 (P < 0.001), indicating that genetic predictors could improve the discriminatory ability of the traditional risk model. Furthermore, these results were validated in the external EAGLE data set and in the combined data sets, which suggests that this risk prediction model can be applied directly in the European population.
Oncotarget 53961 www.impactjournals.com/oncotarget
Risk prediction models have improved our ability to diagnose, treat and even prevent disease by identifying high-risk individuals [15]. Recently, many risk prediction models for lung cancer have been developed, such as the Bach, LLP and Etzel models [16][17][18], but most of their predictors are traditional factors (age, smoking status, family history, occupational exposure and so on), with moderate predictive ability (AUC: 0.55-0.70). These models were constructed in European populations, so it is unclear whether they can be applied directly in the Chinese population. In addition, genetic information, which remains stable over an individual's lifetime, might be used to improve the prediction accuracy of these models.
For the testing set (the EAGLE study), smoking status was missing for five subjects.
Many studies have indicated that genetic variants might play an important role in lung cancer risk [19,20]. So far, GWAS have identified some important lung cancer susceptibility loci: 22q12 (MTMR3-HORMAD2), 3q28 (TP63) and 5p15 (TERT-CLPTM1L) [21][22][23]. Of the 38 SNPs evaluated for clinical utility in the present study, the three variants with the strongest signals, as judged by the β coefficient, were mainly located at these loci. The variant rs17728461 included in our model is located in an intron at 22q12.2, a region that includes the HORMA domain-containing protein 2 gene (HORMAD2). The putative functions of the gene include mitotic checkpoints, chromosome synapsis and DNA repair. HORMAD2 has also been identified as a cancer-testis (CT) gene by in silico methods [21], which suggests that HORMAD2 may contribute to lung adenocarcinoma risk [24]. The SNP rs753955 is located in an intron in the 13q12.12 region between MIPEP and TNFRSF19, which was identified as a lung cancer risk locus by recent GWA studies [21]. The MIPEP protein is primarily involved in the maturation of oxidative phosphorylation-related proteins, and TNFRSF19, a member of the TNF-receptor superfamily, activates the JNK signaling pathway when overexpressed in cells.
The 5p15 region containing the TERT and CLPTM1L genes has been linked to lung cancer risk by recent GWA studies in European [22,25,26,27], East Asian and African-American populations [23,28]. The lung cancer marker rs465498 [21], located in CLPTM1L, which encodes the cleft lip and palate-associated transmembrane 1-like protein, contributed strongly to our genetic risk model. Across the 38 SNPs included in our model, the β coefficients calculated by LASSO ranged from 0.0075 to 0.0535, suggesting that each genetic variant makes only a small contribution to the risk prediction model when considered alone and is of little practical value on its own.
Recently, several studies have reported that better prediction can be achieved when genetic determinants are combined with traditional approaches to assess individual risk [10][11][12]. Weissfeld et al. [12] constructed a lung cancer risk prediction model and found that the AUC improved from 0.717 to 0.725 when GWAS susceptibility regions were added to an age and smoking risk factor-only model. However, only six SNPs were included in that model. In our current study, more genetic variants were incorporated, although the performance of the risk assessment model remained limited. The AUC increased from 0.610 to 0.697 when the 38-SNP GRS was added to the smoking-only model in our discovery set.
This study has several notable strengths. First, the risk prediction model was developed in a Chinese population and externally validated in a European population, which suggests that the model generalizes well. We may therefore be able to use the model to predict lung cancer risk across different ethnic populations. Furthermore, to our knowledge, this study constructed the risk prediction model by systematically screening and evaluating genetic susceptibility variants from previously published papers, an approach that offers good predictive accuracy. The 38-SNP GRS has public health utility for screening high-risk individuals. As shown in Table 2, the risk of lung cancer in the highest GRS group is 131% higher than in the lowest group for the combined Chinese and European populations. The score can help individuals make a better decision about whether to be screened by locating themselves along the spectrum of lung cancer risk [29]. In addition, for never smokers the predictive value of the genetic model was moderate, and for the whole population our risk model combining genetic variants with the smoking factor significantly improved prediction. Therefore, a risk model with a GRS combining multiple loci can be used to improve the identification of persons at high risk for lung cancer.
However, some limitations of our study should be noted. This research included only smoking status as a traditional non-genetic factor, which led to poor discrimination. Other studies, such as that of Spitz MR et al., also drew on other clinical information, such as family history of lung cancer and asbestos exposure, in addition to tobacco smoke [18]. Moreover, GWAS and candidate-gene studies mainly focus on common proxy SNPs, leaving many rare and low-frequency loci and copy number variants for lung cancer still to be discovered. Combining these additional variants might improve the classification of lung cancer risk.
In conclusion, this is the first attempt to explore the risk predictive effects of genetic risk factors associated with lung cancer in both Chinese and European populations. In our study, 38 genetic variants identified by GWAS or candidate-gene strategies were used to construct the risk prediction models. Risk predictive models that incorporate both a genetic risk score based on these SNPs and smoking factors for lung cancer may be useful in identifying high-risk populations for targeted cancer prevention. More genetic risk variants and other epidemiological factors should be well evaluated and incorporated into the risk-predicting models to improve the ability of personalized risk assessment.

Study subjects
The training set was derived from a lung cancer GWAS at NJMU (Nanjing Medical University) [21,30] and comprised 2,331 lung cancer cases and 3,077 cancer-free controls. The testing set, used to validate the risk prediction model, comprised 1,937 cases and 1,984 controls derived from the NCI GWAS Environment and Genetics in Lung Cancer Etiology (EAGLE) [25].
Subjects in the two stages were genotyped using the Affymetrix Genome-Wide Human SNP Array 6.0 [21,30] and the Illumina Human660W-Quad v1.0 DNA Analysis BeadChip platform (Illumina, San Diego, CA, USA) [25], respectively. To facilitate further analysis, imputation was performed with IMPUTE2 software, using 1000 Genomes Project data (Phase III) as the reference set. We implemented a 4-Mb sliding window to impute across the genome, resulting in 744 windows. A pre-phasing strategy with SHAPEIT software version 2 was adopted to improve the imputation performance. The phased haplotypes from SHAPEIT were fed directly into IMPUTE2.
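The windowing step can be sketched as follows, assuming non-overlapping 4-Mb chunks generated per chromosome. The text does not state whether the windows overlap or carry flanking buffers, and both the function name and the chromosome length used here are hypothetical.

```python
def sliding_windows(chrom_length_bp, window_bp=4_000_000):
    """Return 1-based inclusive (start, end) chunks covering one chromosome."""
    windows = []
    start = 1
    while start <= chrom_length_bp:
        # The last chunk is truncated at the chromosome end.
        end = min(start + window_bp - 1, chrom_length_bp)
        windows.append((start, end))
        start = end + 1
    return windows


# Hypothetical 10.5-Mb chromosome, used only for illustration.
example_windows = sliding_windows(10_500_000)
for start, end in example_windows:
    print(start, end)
```

Each interval would then be imputed separately and the results concatenated; summing the chunk counts over all chromosomes gives the total number of windows.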

Literature review strategy and SNP selection
Eligible studies were identified through a literature search of PubMed (last search on June 30, 2015) using the following keywords: "Lung cancer AND polymorphism". We then scrutinized the full text of each paper against the following criteria (Supplementary Figure S1): i) the study was conducted in a human population and published in English; ii) the paper had an observational (case-control or cohort) study design with a sample size of at least 500 vs. 500; iii) the authors reported odds ratios (ORs) and their 95% confidence intervals (CIs) for the relevant SNPs. From the studies that met the inclusion criteria, 241 genetic variants in 124 papers were selected for our study.
We screened all the SNPs from the relevant papers mentioned above according to three criteria (Supplementary Figure S1)

Statistical analyses
We used the NJMU GWAS samples as the training set to guide model development and the European samples as the validation set to assess the accuracy of the risk model. Four steps were performed to develop the risk model (listed in Supplementary Figure S1).
Step I SNP screening. Forty significant SNPs (P < 0.05) were selected by univariate analysis using PLINK 1.07. We then applied the Least Absolute Shrinkage and Selection Operator (LASSO) penalized regression model in the discovery stage (2,331 cases/3,077 controls), and 38 genetic variants were included in our predictive models.
Step II Model construction. To evaluate the contribution of the genetic factors, we constructed two risk models: the epidemiologic model (containing the smoking factor only) and the extended model (adding the genetic variants, summarized by the genetic risk score). In this model, the "genetic risk score" (GRS) represents the cumulative effect of multiple genetic risk variants as follows:
$$\mathrm{GRS} = \sum_{i=1}^{k} \beta_i \times \mathrm{SNP}_i$$
where k is the number of SNPs replicated in the study; SNP_i is the number of risk alleles (0, 1, 2) for SNP i; and β_i is the regression coefficient for SNP i, derived from LASSO selection. Note that we rescaled the weighted score to reflect the number of risk alleles: each point of the genetic risk score corresponded to one risk allele.
Step III Model evaluation. Model discrimination was evaluated by receiver operating characteristic (ROC) curves and the C statistic. A nonparametric approach was used to compare the areas under the ROC curves (AUC) for the two models [31]. To quantify discriminatory improvement for models with and without the genetic risk score, we also set a cut-off value of the GRS.
Step IV Model validation. We validated the risk model in the EAGLE samples (1,937 cases vs 1,984 controls) with the same risk predictors and evaluation strategies.
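The genetic risk score described in Step II can be illustrated with a minimal sketch, assuming the score is the sum of risk-allele counts weighted by the LASSO coefficients. The function name and the example coefficients and genotypes are hypothetical (only the coefficient range 0.0075 to 0.0535 comes from the text), and the rescaling step is omitted because its exact form is not given.

```python
def genetic_risk_score(risk_allele_counts, betas):
    """Weighted sum of risk-allele counts (0, 1 or 2 per SNP)."""
    if len(risk_allele_counts) != len(betas):
        raise ValueError("one coefficient is required per SNP")
    for count in risk_allele_counts:
        if count not in (0, 1, 2):
            raise ValueError("risk-allele counts must be 0, 1 or 2")
    return sum(beta * count for beta, count in zip(betas, risk_allele_counts))


# Hypothetical coefficients for three SNPs (illustration only).
example_betas = [0.0075, 0.0300, 0.0535]
# Hypothetical genotypes for one subject, coded as risk-allele counts.
example_genotypes = [2, 1, 0]

example_score = genetic_risk_score(example_genotypes, example_betas)
print(round(example_score, 4))
```

In the extended model, a score of this kind would enter the regression alongside the smoking factor as a single covariate.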
All statistical analyses were performed with PLINK 1.07 and R software (version 3.1.1; The R Foundation for Statistical Computing). P < 0.05 was used as the criterion of statistical significance and all statistical tests were two sided.
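The C statistic used in Step III can be read as the probability that a randomly chosen case has a higher predicted risk than a randomly chosen control. The sketch below computes it by direct pairwise comparison, counting ties as one half. The function name and the example scores are hypothetical, and the sketch does not reproduce the nonparametric comparison of two AUCs cited in [31].

```python
def c_statistic(case_scores, control_scores):
    """Share of case-control pairs in which the case scores higher."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must be non-empty")
    concordant = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                concordant += 1.0
            elif case == control:
                concordant += 0.5  # ties count as half a concordant pair
    return concordant / (len(case_scores) * len(control_scores))


# Hypothetical predicted risks (illustration only).
example_cases = [1.10, 1.04, 0.98, 1.21]
example_controls = [0.99, 0.95, 1.04, 0.90]

example_auc = c_statistic(example_cases, example_controls)
print(example_auc)
```

The pairwise loop is quadratic in sample size, which is acceptable for a small illustration; for thousands of subjects a rank-based calculation would give the same value more efficiently.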