Lung cancer susceptibility from GSTM1 deletion and air pollution with smoking status: a meta-prediction of worldwide populations

The Glutathione S transferase mu 1 (GSTM1) gene has been associated with lung cancer (LC) risk because the GSTM1 enzyme plays a vital role in the detoxification pathway and protects against toxic insults. The major objective of this study was to investigate the GSTM1 deletion pattern and its association with LC in the world's population using meta-prediction techniques. The secondary objective was to examine the effects of air pollution, smoking status, and other factors on gene-environment interactions with GSTM1 deletion and LC risk. We completed a comprehensive search that yielded a total of 170 studies (40,296 cases and 48,346 controls) published from 1999 to 2017 for meta-analyses. The results revealed that the GSTM1 deletion type was associated with increased risk of LC, while the GSTM1 present type had a protective effect for all populations combined worldwide. In subgroup analysis, the rank order of risk among racial-ethnic groups, from highest to lowest, was Chinese, South East Asian, other North Asian, European, and finally American. Additional predictive analyses showed that air pollution played a significant role, with increased risks of GSTM1 deletion and LC susceptibility, and that the risks increased for smokers with higher levels of air pollution. Based on the findings of the meta-predictive analysis, increased air pollution levels and smoking status had additive effects on LC risk susceptibility and GSTM1 gene polymorphisms, consistent with gene-environment interactions. Future studies are needed to examine how GSTM1 interacts with environmental factors and whether dietary interventions can mitigate the toxic effects for LC prevention.


INTRODUCTION
Lung cancer (LC) is the second most commonly diagnosed cancer among adults and accounts for 25% of all cancer deaths, with delayed diagnosis at a late stage being associated with poor prognosis [1][2][3][4]. The Glutathione S transferase mu 1 (GSTM1) gene has been associated with LC risk, with the GSTM1 enzyme playing a vital role in the detoxification pathway and having a protective effect against toxic insults [2,5-7]. GSTM1 is one of the phase II detoxification enzymes that detoxify electrophilic compounds, including carcinogens, therapeutic drugs, environmental toxins, and byproducts of oxidative stress, by conjugation with glutathione (GSH). The GSTM1 gene is known to be highly polymorphic, and the polymorphism affects the expression of enzyme levels [5][6][7][8][9][10][11]. Two identified variants in GSTM1 are a deletion and a substitution. A deletion of GSTM1, or null mutation, deactivates the enzyme, which results in the loss of function within the detoxification pathway [2][3][4]. The GSTM1 null genotype has been associated with increased risk of many cancers [8], and increased environmental toxins and carcinogens further increase the susceptibility to LC [2,4,5,7,12].
Environmental toxicants such as air pollution and smoking can expose the lung to oxidative stress and dysregulate reactive oxygen species [2,4,5,13-15]. Studies suggest that exposure to oxidative stress causes damage to cellular DNA that leads to mutations, genomic instability, and ultimately malignancy [2-4, 13, 14, 16]. Several studies indicated that consumption of cruciferous vegetables can reduce the risk of LC. These plants contain isothiocyanates (ITC) and indole-3-carbinol, which are known to induce phase II enzymes in the detoxification pathway [14,17,18]. ITC and indoles may inhibit the bioactivation of carcinogens from air pollution and smoke, enhance excretion of carcinogenic metabolites before they cause damage to DNA, and induce cell cycle arrest and apoptosis [18,19]. These processes affirm the crucial role of micronutrients in the detoxification pathway for LC prevention.
To date, results from epidemiological studies on the association of GSTM1 mutation and LC have been inconsistent, with heterogeneous findings. Meta-predictive analysis can be used to address heterogeneous findings and to cross-validate the findings using various analytical methods [20]. Additional studies indicated the effects of air pollution on the association with GSTM1 deletion. Despite these findings, previous meta-analyses did not examine the effects of gene-environment interaction, specifically air pollution and smoking status, on the association between GSTM1 and LC risk. To fill this gap and to provide further evidence, we conducted a meta-analysis with added meta-predictive techniques to examine the impact of exposure to air pollution on the risk of GSTM1 deletion and LC susceptibility in various populations of the world, with subgroup analyses of LC types, smoking status, and gender status. In this meta-prediction study, we integrated big-data machine-learning analytics with the conventional pooled analysis, including global maps and heat maps to visualize grouping patterns.

RESULTS

Characteristics of study subjects
The study selection process is summarized in Figure 1. We initially identified 450 potentially relevant studies published from 1999 to 2017. Through a systematic screening process, we located a total of 163 papers (40,296 cases and 48,346 controls) that included data for GSTM1 deletion. These studies were conducted on 5 continents, and 7 studies also included data for more than one racial-ethnic group, yielding a total of 170 studies (see Supplementary Table 1; see Figure 2 for % GSTM1 deletion in control and LC groups).

Subgroup analyses by smoking and gender status per total population and ethnic groups
By smoking status (Table 1), the risk of LC was mixed, with inconsistent findings across ethnic subgroups. The risk was slightly higher for non-smokers (RR = 1.15, p < 0.0001) than for smokers (RR = 1.09, p < 0.0001). However, the reverse was noted among Chinese (smokers: RR = 1.27, p < 0.0001; non-smokers: RR = 1.22, p < 0.0001) (see Supplementary Table 3). Subgroup analyses of smoking status showed no significant differences for the other racial-ethnic subgroups.
The risk of LC was also mixed and presented inconsistent findings across gender subgroups (

Subgroup analyses by countries
To identify sources of heterogeneity, we further performed subgroup analyses by country, using a geographic information system (GIS) to visualize regional distributions and to validate the heterogeneity of the findings. Countries were divided based on geographical area. These geographical analyses showed that the rank order of LC risk with GSTM1 deletion, from highest to lowest, was Chinese, South East Asian, other North Asian, European countries, and American countries (Table 2, Supplementary Figure 1A-1D). The global maps demonstrated the variations in GSTM1 deletion and the associated LC risk susceptibilities across regions. In the first two GIS maps, we used a continuous color spectrum from yellow to red to represent increasing levels of polymorphisms; in the third map, we used red and green, with red indicating LC risk and green indicating protective effects. Similar to the pooled meta-analysis, the GIS maps showed that GSTM1 deletion was a risk factor for LC in most countries except Australia, Pakistan, Poland, Sweden, Italy, the United Kingdom (UK), and Portugal (Supplementary Figure 2).

Meta-prediction
Given the heterogeneous findings on the effects of GSTM1 deletion and the risk susceptibility of LC, we performed meta-predictive analysis using both big-data machine-learning predictive analytics and conventional analyses (Table 3). We used both partition trees and Tukey's tests to examine the potential interaction between air pollution and deletions, and their impact on LC risks. Based on the guidelines from the World Health Organization on air pollution, we used the levels of death from air pollution (APD) as the measure of air quality (Level 2: 51-100, Level 3: 101-250, and Level 4: > 251 deaths/million) [33][34][35][36][37][38]. The partition tree and Tukey's test results converged and showed significant differences between APD Levels 3 and 4 (p < 0.0001), and between Levels 2 and 4 (p = 0.0056), for percent GSTM1 deletion by APD for LC cases. The same trend of statistical significance was noted for the GSTM1 present type for LC cases. Furthermore, for the risk associated with GSTM1 deletion, significant differences were identified between Levels 3 and 4 (p = 0.0479), with the smallest AICc of -24.28. There were no significant differences based on gender status. To further illustrate these differences, we plotted the results on nonlinear curves. With increased air pollution (Level 2: < 100, Level 3: 101-250, and Level 4: > 251 deaths/million), we noticed increased percentages of GSTM1 deletion for all LC groups (Supplementary Figure 3A), NSCLC (Supplementary Figure 3B), mixed LC type (Supplementary Figure 3C), and non-smoker groups (Supplementary Figure 3D). In contrast, the increase in deletion rates across air pollution levels was not as noticeable for the control groups.
Higher percentages of GSTM1 deletion were also noted for smokers (Figure 3, left graph). We noticed obvious increases in LC risk for smokers as air pollution rose from the low levels (Level 2 and Level 3) to the high level (Level 4) (Figure 3, right graph). Similar trends, but no obvious increases in LC risk, were noted for non-smokers or other LC types with GSTM1 deletion (Supplementary Figure 3A-3D). The most noteworthy finding is that, in smokers, the LC risk based on GSTM1 deletion (Figure 3, right graph) was significantly higher at Level 4 (RR = 1.25) than at the other two levels (RR = 1.01) (p < 0.05 for both Tukey's tests of Level 4 versus Levels 3 and 2). These significantly increased LC risks for smokers at higher air pollution, in contrast to the lack of noticeable increases for other subgroups (Supplementary Figure), are consistent with effects of gene-environment interactions based on GSTM1 deletion interacting with air pollution and smoking status. The results on the heat map were revealing for data density, with the red blocks being the areas of high data concentration and the nonlinear fit line following the dense data (the red cells) for the percentages of GSTM1 deletion (Figure 4A, 4B) and LC risk (Figure 4C).

Note: GSTM1 = Glutathione S transferase mu 1; RR = relative risk; LC = lung cancer; NSCLC = non-small cell lung cancer; NA = not available; -- = no data; A: 46 studies had data for both smoker and non-smoker groups; B: 20 studies had both male and female groups.

DISCUSSION
To date, previous studies have presented the combined effects of the GST family on LC risk [21][22][23][24][25][26]. Using meta-predictive analyses, we provided the most inclusive analyses of LC risk susceptibility based on GSTM1 deletion interacting with air pollution and smoking status. We completed a comprehensive search that yielded a total of 170 studies (40,296 cases and 48,346 controls) published from 1999 to 2017. The analyses by country indicated increased GSTM1 deletion rates and LC risks in Asian countries. In subgroup analysis, the rank order of risk among racial-ethnic groups, from highest to lowest, was Chinese, South East Asian, other North Asian, European, and finally American. These studies were conducted across the globe (e.g., Australia, Europe, North and South America, and Asia).
Additional noteworthy findings from subgroup analyses showed that, with GSTM1 deletion, non-smokers had a higher risk of LC than smokers in worldwide populations combined. Conversely, in the Chinese subgroup, smokers with GSTM1 deletion had higher risks of LC than non-smokers. Our finding that non-smokers with GSTM1 deletion had an overall higher risk of LC than smokers is consistent with a previous meta-analysis of Chinese populations [2]. However, we used risk ratios to standardize these risks (with the total counts as the denominator instead of one of the deletion or present types as the denominator), in contrast to the odds ratios (using one of the deletion or present types as the denominator) used in previous meta-analyses. The standardized ratio is necessary when examining gene-environment interactions across various factors for their standardized effects on the outcomes of polymorphism or LC risk [20,33,34]. The mechanisms by which tobacco raises LC risk include inhibition of the GST detoxification pathway and effects on phase I metabolism by cytochrome P4501A, which promote the carcinogenic effect and limit the detoxification property of GSTM1 [35,36]. Furthermore, the high prevalence of air pollution in China and in other countries may have more impact on the results for smoking and LC risk [3,35-40]. Future studies can continue to use the standardized risk ratios to examine differences in risk across subtypes.
In the gender subgroup analyses, males had a higher risk of LC than females in the Southeast Asian subgroup. In two Southeast Asian studies, male patients had a history of cigarette smoking, tobacco chewing, and drinking alcohol [35,36], with more smokers in the LC group (66%) than in the control group (37%). A possible additional explanation for the difference in risk between genders may lie in dietary intake: quercetin-rich foods consumed by South Asians could reduce the risk of LC through overall upregulation of GSTM1, especially for smokers [19,39]. Individual studies noted more GSTM1 deletion in squamous cell carcinoma (SCC) than in other LC subtypes [37,38], in younger and female LC patients [39], and in smokers [2,40]. A previous meta-analysis of Chinese populations presented higher risks of LC with GSTM1 deletion for SCC and adenocarcinoma (AC) than for the small cell (SC) LC type [2]. A second meta-analysis in the Chinese population also reported that SCC and SC LC were more strongly associated with smoking history than AC subtypes [28]. No previous meta-analysis reported an interaction structure or nested structure of LC subtypes with smoking and gender status; rather, all studies reported grouping strata of these factors with GSTM1 without an interaction structure. For LC subtypes, we found similar risks for NSCLC and mixed LC subgroups, with the strata of smoking status according to the data presented in the original studies. Future studies are needed to report GSTM1 deletion with interactions of LC subtypes nested within the smoking and gender strata. Our findings illustrate the complexity of gene-environment interactions with smoking status across regions and ethnic groups. As the studies on ITC and indoles from cruciferous vegetable consumption showed a critical role of micronutrients in the detoxification pathway and LC prevention [14,17-19], studies are needed to further identify ways to decrease LC risks in population studies.
Specifically, future studies are needed to examine how diet and environmental factors, including air pollution and smoking status, interact with GSTM1 deletion and polymorphisms across different regions and ethnic groups to prevent LC. Dietary management can be further examined in future intervention studies of gene-environment interactions for LC prevention. Additionally, future research can be designed to examine other factors in gene-environment interactions, such as ITC vegetable consumption and other risk factors, for LC prevention.
Using meta-predictive techniques, we further presented the potential impact of air pollution on increased GSTM1 deletion rates and LC risks. Air pollution played a significant role with increased GSTM1 deletion and LC susceptibility for smokers. In countries with high levels of air pollution (Level 4), for smokers, GSTM1 deletion was a risk to LC susceptibility for most countries except Turkey. Among countries with lower levels of air pollution (Level 2), for smokers, GSTM1 deletion was a risk in Finland and India. From the risk analyses, we found that smoking and increased air pollution had additive effects to the LC risk susceptibilities in addition to the effects of GSTM1 gene polymorphisms on LC risks, for gene-environment interactions (Figure 2).
This meta-analysis should be interpreted within the context of its potential limitations. One limitation is that this is a population-based study. While we added the effects of air pollution and smoking as possible important contributors to GSTM1 deletion and LC risks, this study was not designed to examine the mechanisms that delineate the interaction effects of air pollution and smoking on GSTM1 deletion and LC. From our meta-prediction analysis, we found that air pollution, but not gender or other factors, was the most influential factor in its effects on GSTM1 polymorphism and LC risks interacting with gene polymorphisms [41][42][43][44][45]. As none of the original individual studies reported GSTM1 deletion within the interaction or nested contexts of LC subtypes with smoking, we were unable to delineate additional interaction effects for other factors, such as gender status, with the current meta-analysis data layout. Future studies are needed to accumulate common data elements for the important factors in addition to the gene polymorphisms, with a data repository that enables the examination of gene-environment interactions of GSTM1 with various LC subtypes using newly emerging interaction analytics [46,47].

MATERIALS AND METHODS

Characteristics of original studies
A literature search was conducted using the PubMed database for human studies on LC and GSTM1. The database was periodically searched for the latest articles over the course of the investigation through 2017, until no additional eligible studies were identified. Additionally, previous meta-analyses and review papers were used to cross-reference and trace back to all original studies (see Supplementary References 1-33 following Supplementary Table 1). For the 163 papers included, additional factors such as gender, smoking status, and types of LC were entered into the database for analysis. Seven papers had data for two racial-ethnic groups for both LC cases and controls [21-26,48], yielding 7 additional study groups. These studies were conducted across the globe (e.g., Australia, Europe, North and South America, and Asia). Furthermore, the racial and ethnic composition of each study was checked. The most investigated racial-ethnic populations for GSTM1 in association with LC were Asian (85 studies), Caucasian (68 studies), African (8 studies), and mixed-race groups (8 studies).

Inclusion and exclusion criteria
The inclusion criteria were studies that 1) examined the association of GSTM1 and LC risk, reporting the genotype counts for both LC cases and controls, and 2) were written in English or 3) had an abstract written in English with clearly presented tables of genotype counts. We excluded studies that 1) were written in non-English languages without genotype counts, 2) did not provide GSTM1 genotype counts for LC cases and controls, or 3) were duplicate studies. Figure 1 presents the study selection process. Of 451 identified potentially relevant articles, 240 were excluded as they did not provide GSTM1 genotype counts for LC cases and controls, and 34 were previous meta-analyses.

Quality measures
Data extraction and entry were checked for accuracy and systematically organized to identify possible patterns. A preliminary analysis was run to ensure that the ranges of entries and pooled results were accurate for all studies. Each study was evaluated for quality using a set of appropriate indicators adapted from multiple sources on the assessment of studies. Integrated sources for these criteria included the U.S. QUOROM consensus process on the quality of meta-analysis [49], quality reporting for observational studies [50,51], and recent studies using similar analytics [45,52]. Details on the quality indicators used to assess the studies are presented in Supplementary Table 1. The total range of the quality score was 0-30 based on three domains: 1) external validity with 10 items on demographic factors (score range of 0-11); 2) internal validity with 12 items on methods and procedures (score range of 0-12); and 3) report quality with 7 items on study results (score range of 0-7) [52]. The total quality score of included studies ranged from 8 to 28 (out of a maximum score of 30). Studies that scored above 50% of the possible total score were judged to have trustworthy findings [49]. We included all studies, as we did not observe differences from the pooled analyses when studies with low quality scores were analyzed in separate groups for sensitivity analyses.

Data synthesis and analysis
We entered the air-quality data for various countries. Specifically, we checked various sources for the most current and complete air-pollution data, including the death rates from air pollution (death rates per million, Level 1: < 50, Level 2: 51-100, Level 3: 101-250, Level 4: 251-400, Level 5: > 401) [53,54]. We further verified these levels against current scales of air pollution data [55-58], and the most complete and current air pollution data were used for the analyses. There were no studies with Level 1 or Level 5 pollution; therefore, only Levels 2-4 were included in the final analysis. Prior to the analyses, we entered all data into Excel spreadsheets (Microsoft Corp, Redmond, WA). We checked Hardy-Weinberg Equilibrium (HWE), which was developed to assess the distribution equilibrium for the evolutionary mechanisms in population genetics [59,60]. Departure from HWE with P < 0.05 may be associated with factors such as population migration or stratification, and disease association. The association of GSTM1 deletion with LC risk was estimated by calculating pooled risk ratios (RRs) and 95% CIs between cases and controls, using StatsDirect version 3.0 software (Cheshire, UK). The pooled RR has been used in the most recent consensus reports as a standardized risk ratio for more conservative reporting and for standardization across all factors included in the gene-environment interaction analysis [20].
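The pooled RR calculation described above can be sketched as follows. This is a minimal illustration and not the StatsDirect procedure: it assumes that the RR compares the proportion of GSTM1 deletion among cases with that among controls (our reading of "total counts as the denominator"), uses a Wald interval on the log scale, and pools with fixed-effect inverse-variance weights. The function names and all counts are hypothetical.

```python
import math

def risk_ratio(del_cases, n_cases, del_controls, n_controls, z=1.96):
    """RR of GSTM1 deletion (cases vs controls) with a Wald CI on the log scale.

    Assumption: RR = (deletion cases / all cases) / (deletion controls / all controls).
    """
    rr = (del_cases / n_cases) / (del_controls / n_controls)
    # Standard error of log(RR) for two independent proportions
    se = math.sqrt(1 / del_cases - 1 / n_cases + 1 / del_controls - 1 / n_controls)
    ci = (math.exp(math.log(rr) - z * se), math.exp(math.log(rr) + z * se))
    return rr, se, ci

def pooled_risk_ratio(studies, z=1.96):
    """Fixed-effect inverse-variance pooling of log RRs (illustrative only)."""
    weighted_sum = 0.0
    weight_total = 0.0
    for counts in studies:
        rr, se, _ = risk_ratio(*counts)
        weight = 1 / se ** 2          # inverse-variance weight
        weighted_sum += weight * math.log(rr)
        weight_total += weight
    log_pooled = weighted_sum / weight_total
    se_pooled = math.sqrt(1 / weight_total)
    ci = (math.exp(log_pooled - z * se_pooled), math.exp(log_pooled + z * se_pooled))
    return math.exp(log_pooled), ci

# Hypothetical counts: (deletion cases, total cases, deletion controls, total controls)
example_studies = [(60, 100, 50, 100), (120, 220, 100, 210)]
pooled_rr, pooled_ci = pooled_risk_ratio(example_studies)
print(round(pooled_rr, 3), tuple(round(v, 3) for v in pooled_ci))
```

A random-effects model would be the more usual choice when heterogeneity is high, as it is here; the fixed-effect version is shown only because it is the shortest correct form of inverse-variance pooling.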
We used the JMP Pro 13 program (SAS Institute, Cary, NC) for meta-prediction analysis to examine the association of air pollution associated death (APD) with GSTM1 deletion polymorphisms and LC risk. We used partition trees to examine the association between independent and dependent variables. The "goodness of the partition" can be judged using Akaike's information criterion (AIC) or AIC with correction (AICc), in which a smaller AIC or AICc suggests a better model [52,61].
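The core step of a partition tree, choosing the cut point on a predictor that best separates the outcome, can be sketched as below. This is a generic illustration that minimizes the within-group sum of squared errors; it is not JMP's implementation, whose splitting criterion may differ. The helper names and the data values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the group mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) for the binary split x <= threshold
    that minimizes the within-group SSE of y; None if x has one unique value."""
    unique_x = sorted(set(x))
    best = None
    # Candidate thresholds are midpoints between adjacent unique predictor values
    for low, high in zip(unique_x, unique_x[1:]):
        threshold = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical data: APD level per study and percent GSTM1 deletion in LC cases
apd_level = [2, 2, 3, 3, 4, 4]
pct_deletion = [48.0, 50.0, 49.0, 51.0, 60.0, 62.0]
print(best_split(apd_level, pct_deletion))
```

With these made-up values the chosen cut separates Level 4 from Levels 2 and 3, which mirrors the kind of grouping the partition tree is used to detect. A full tree would apply the same search recursively within each resulting group.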
AIC is a fit index that trades off the complexity of a model against how well the model fits the data. Increasing the number of free parameters to be estimated improves the model fit; however, the model might become unnecessarily complex. To reach a balance between fit and parsimony, the "best" model is the one with the lowest AIC value. In this sense, AIC is preferable to the R² used in meta-regression [20], which never decreases as additional variables enter the model and therefore favors complexity; adjusted R² applies only a mild penalty for added variables. AIC, by contrast, does not necessarily improve when variables are added. Rather, it varies with the composition of the predictors, and thus it is a better indicator of model quality.
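For reference, the standard definitions are AIC = 2k - 2 ln(L) and AICc = AIC + 2k(k + 1)/(n - k - 1), where k is the number of estimated parameters, n the number of observations, and L the maximized likelihood. The sketch below applies these formulas; the function names and the log-likelihood values are hypothetical and are not taken from the JMP output.

```python
def aic(log_likelihood, k):
    """Akaike's information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction term 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)

# Hypothetical comparison on n = 10 observations:
# model B fits slightly better but uses twice as many parameters.
model_a = aicc(log_likelihood=-10.0, k=2, n=10)
model_b = aicc(log_likelihood=-9.5, k=4, n=10)
print(round(model_a, 3), round(model_b, 3))
print("preferred:", "A" if model_a < model_b else "B")
```

In this made-up comparison the small gain in likelihood for model B does not offset its extra parameters, so the simpler model A has the lower AICc.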
Additionally, we used nonlinear fit, heat maps, and Tukey's post hoc test to further validate the meta-prediction findings. In particular, we used Tukey's tests to compare AICc results with the partition trees [62]. All p values were two-tailed with a significance level of P < 0.05. GIS maps were prepared to better visualize the heterogeneity of GSTM1 deletion polymorphism with LC risks on the world map. We applied meta-predictive analytical techniques using recursive partition trees, nonlinear fit, and heat maps for data visualization to reveal nonlinear patterns in this study, in addition to the conventional pooled-analysis technique, to visualize the heterogeneity. While meta-regression is commonly used in advanced meta-analysis for meta-prediction [20], it is important to point out that regression analysis, as a linear model, is unable to detect nonlinear patterns. Further, it is well known that regression based on R² tends to yield a complex and overfitted model because R² never decreases with additional predictors. On the other hand, AIC or AICc does not necessarily improve with the addition of variables. Rather, it varies with the composition of the predictors; thus, it is more likely to yield an optimal model [63][64][65].

Agreed with manuscript results and conclusions: all authors reviewed and approved the final manuscript.

This meta-analysis was registered with PROSPERO, International prospective register of systematic reviews, number: 96460 at https://www.crd.york.ac.uk/PROSPERO/#myprospero; http://prisma-statement.org/Protocols/Registration. There was no prior study registered on this subject in this registry.

CONFLICTS OF INTEREST
The authors declare no conflicts of interest.

FUNDING
Funding support includes the Doctoral Research Council Grants, Azusa Pacific University; Research Start-up fund from Augusta University awarded to the corresponding author; and Fu Jen University Hospital, Taiwan, awarded to the first author.