Genetic susceptibility to bone and soft tissue sarcomas: a field synopsis and meta-analysis

Background The genetic architecture of bone and soft tissue sarcomas susceptibility is yet to be elucidated. We aimed to comprehensively collect and meta-analyze the current knowledge on genetic susceptibility in these rare tumors. Methods We conducted a systematic review and meta-analysis of the evidence on the association between DNA variation and risk of developing sarcomas through searching PubMed, The Cochrane Library, Scopus and Web of Science databases. To evaluate result credibility, summary evidence was graded according to the Venice criteria and false positive report probability (FPRP) was calculated to further validate result noteworthiness. Integrative analysis of genetic and eQTL (expression quantitative trait locus) data was coupled with network and pathway analysis to explore the hypothesis that specific cell functions are involved in sarcoma predisposition. Results We retrieved 90 eligible studies comprising 47,796 subjects (cases: 14,358, 30%) and investigating 1,126 polymorphisms involving 320 distinct genes. Meta-analysis identified 55 single nucleotide polymorphisms (SNPs) significantly associated with disease risk with a high (N=9), moderate (N=38) and low (N=8) level of evidence, findings being classified as noteworthy basically only when the level of evidence was high. The estimated joint population attributable risk for three independent SNPs (rs11599754 of ZNF365/EGR2, rs231775 of CTLA4, and rs454006 of PRKCG) was 37.2%. We also identified 53 SNPs significantly associated with sarcoma risk based on single studies. Pathway analysis enabled us to propose that sarcoma predisposition might be linked especially to germline variation of genes whose products are involved in the function of the DNA repair machinery. Conclusions We built the first knowledgebase on the evidence linking DNA variation to sarcomas susceptibility, which can be used to generate mechanistic hypotheses and inform future studies in this field of oncology.


INTRODUCTION
Sarcomas are a family of rare malignant tumors arising from bone and soft tissues with more than 50 different histologies accounting for about 1-2% of cancers in adults and 15-20% in children (worldwide incidence: approximately 200,000 cases per year). The pathogenesis of sarcomas is multifactorial including environmental (such as exposure to ionizing radiations or chemical carcinogens) and genetic components, although the disease rarity represents an objective hurdle to the research in this field of investigation. Significant advances have been made in the understanding of the acquired genetic events leading to sarcomagenesis. It has been recognized that three types of somatic DNA alterations, translocations, mutations, and copy number variations, play a key role in these tumors [1]. As a consequence, sarcomas are grouped into two categories: balanced translocation associated sarcomas (BATS) and complex genotype/karyotype sarcomas (CGKS), which are characterized by a stable genome and genomic instability, respectively [2]. A potential therapeutic implication of such genetic taxonomy classification is that some recurrent chromosomal translocations might be exploited for the development of drugs targeting the protein products of fusion oncogenes [1].
Conversely, knowledge on the role of germline DNA variations in sarcomagenesis is sparse and limited. Although a minority of sarcomas arise within well characterized heritable cancer predisposition syndromes (e.g., osteosarcoma and Bloom syndrome, desmoid tumors and familial adenomatous polyposis) [3], the vast majority of sarcomas occur sporadically and the role of the genetic background in their pathogenesis is to be uncovered. Recent advances in molecular high-throughput technology, which conduct of genome wide association studies (GWAS), is accelerating the pace of discovery of sarcoma predisposition loci.
Looking at the already existing international literature, some investigators have meta-analyzed the evidence regarding a handful of SNPs such as XRCC3 rs861539 [4], MDM2 rs2279744 [5,6], and CTLA4 rs231775 [7]: however, to the best of our knowledge no comprehensive collection of the available data in this field of oncology has been published thus far.
With the present work we systematically reviewed and meta-analyzed the available evidence in this field in order to: 1) provide readers with the first knowledgebase dedicated to the relationship between germline DNA variation and sarcoma risk; 2) identify areas lacking of meaningful information thus helping to inform future studies; and 3) suggest a biological interpretation of current findings utilizing network and pathway analysis [8] after integrating multiple sources of biological data [9].

Characteristics of the eligible studies
We identified 90 eligible articles, comprising 47,796 subjects, 14,358 cases and 33,438 controls. The details of the literature search are summarized in Figure 1.
Based on the prevalent ancestry (ie. the race of at least 80% of the enrolled subjects) the majority of the studies were Asian (N=57 studies) the rest being Caucasian (N=25 studies), or mixed (N=8 studies). Based on study design, half of included studies were population based case-controls studies (N=40 studies), the remaining were hospital based (N=39 studies), with a few (N=11) being mixed or not specified. Two studies were GWAS [10,11].
According to histology, the majority of the eligible studies investigated bone tumors (N=65) and the remaining investigated Ewing's sarcoma (N=9), soft tissue sarcomas (N=6), chordoma (N=4), hemangiosarcoma (N=1), and mixed sarcomas (N=5). Thirteen studies investigated pediatric subjects or young adults. Although pediatric/young age ranged from 0 to 35 years old in eligible studies, most of the studies considered subjects < 20 years old.
We evaluated the included studies following the criteria of the Newcastle-Ottawa scale (NOS) scoring system. The mean score was 7.8. The main features of all the eligible studies and the NOS score are available on Table 1.

Characteristics of the retrieved genetic variants
Overall, data on 1,126 polymorphisms involving 320 genes were retrieved. Variations were mainly SNPs, only six being insertion/deletions of more than one nucleotide. Based on the number of different genetic variations studied, the 11 most studied genes were the following: Thirty-seven of these genetic variants were located no more than 2kb upstream the relevant gene, ten no more than 500bp downstream the relevant gene, 493 in introns, 100 in exons (non-UTRs), 19 in the 3'-UTR, seven in the 5'-UTR. Moreover, 413 SNPs were located in intergenic regions more than 2kb upstream or more than 500 bp downstream the relevant gene and 41 in non-coding transcripts. Among the exonic SNPs, 63

Meta-analysis findings
At least two independent datasets were available for 51 genetic variations allowing us to perform 118 metaanalyses, 16 of them were histology-based meta-analysis on osteosarcoma and Ewing's sarcoma. Moreover, 13 sensitivity analysis were performed considering the ethnicity of the different datasets. The results of data metaanalyses are comprehensively reported in Supplementary  Table 2. Polymorphism "rs" identifier, nucleotide change and amino acid change are reported in Supplementary  Table 3.
The number of subject (cases plus controls) enrolled in the 118 meta-analyses ranged from 144 to 5,347 (median: 1,195). Based on the number of subjects, the 10 most studied genetic variants, all with 5,347 subjects, were the following: EGR2 rs224292 and rs224278, ADO rs1848797 and rs1509966, MDM2 rs1690916, LOC107984012 rs9633562, rs944684 and rs6479860, ZNF365 rs11599754 and rs10761660.
Of the 118 meta-analyses and 13 sensitivity analysis (131 total analyses) performed, 55 resulted to be statistically significant (P-value <0.05). The level of summary evidence, among the significant associations identified by meta-analysis, was high, intermediate, and low in 9, 38, and 8 analyses respectively. The most frequent single cause of non-high-quality level of evidence was between-study heterogeneity followed by the small sample size. Considering all statistically significant metaanalyses FPRP was optimal (<0.2) at least at the 10E-3 level for 10/55 analysis, 9 of them with high level of summary evidence.
The details of significant associations are reported in Table 2.
In order to provide an estimate of the impact of germline variants on sarcoma risk, the PAR (population attributable risk) was calculated. As an example, we considered the following three independent SNPs with high quality evidence on their relationship with sarcoma risk: rs11599754 of ZNF365/EGR2 (chromosome 10, risk allele: C, risk allele frequency in European ancestry population: 0.39, meta-

Associations based on single studies
Beside the variations resulted to be statistically significantly associated with sarcoma risk in this metaanalysis, we retrieved from the included articles 906 SNPs statistically significantly associated with sarcoma risk (P-value <0.05) based on single-study analysis. In Table 3 are reported 53 SNPs strongly associated with Ewing's sarcoma or osteosarcoma risk (P-value <E-06), retrieved from the included studies.
One dataset was available for each of those genetic variants. Although it was not possible to perform a metaanalysis, a strong association with sarcoma risk was found (P-values range from E-20 to E-06). Ewing's sarcoma associations in European and US European-descendant population mainly involved the candidate risk loci at 1p36.22, 10q21 reported by Postel-Vinay et al [10] GWAS and in the following related study of Grünewald et al [29]. The 1p36.22 variants associated with Ewing's sarcoma are located 25 kb proximal to the TARDBP gene. TARDBP (Tat activating regulatory DNA-binding protein, or TDP-43, transactive response DNA-binding protein) is a highly conserved DNA-and RNA-binding protein involved in RNA transcription and splicing. The 10q21 variants strongly associated with Ewing's sarcoma are located in a block containing four genes: ADO (encoding cysteamine dioxygenase), ZNF365 (encoding zinc-finger protein 365), EGR2 (encoding early growth response protein 2) and LOC107984012 (unknown function).

Network and pathway analysis findings
Using the 36 genes whose SNPs were significantly associated with sarcoma risk (including data from both meta-analysis and single studies) and were also characterized by a significant eQTL effect, we found that the corresponding protein products interact with each other beyond chance (observed edges: 120; expected edges: 12; PPI enrichment P-value <10E-20), with an average node degree equal to 6.7 (see Figure 2). Such enrichment indicates that the input molecules -as a whole group -are at least partially biologically connected. This high connectivity prompted us to conduct pathway analysis, which showed that the identified network is significantly enriched in DNA repair proteins, as shown in Table 4.
In particular, many sarcoma risk genes appear to be involved in all main DNA repair pathways, including single strand break repair pathways (

DISCUSSION
We described the findings of the first field synopsis and meta-analysis dedicated to the relationship between germline DNA variation and risk of developing bone and soft tissue sarcomas, which is based on genotyping data from 90 studies enrolling almost 48,000 people with a control-to-case ratio equal to 2. The resulting knowledgebase will be hosted by our cancer-dedicated website (at www.mmmp.org) [100] as a freely available online data repository that will be annually updated.
Overall, our findings support the hypothesis that genetic polymorphism does contribute to sarcoma susceptibility. This is exemplified by the population attributable risk (PAR=37.2%) calculated for three SNPs associated with the risk of sarcoma at a high level of evidence (rs11599754 of ZNF365/EGR2, rs231775 of CTLA4, and rs454006 of PRKCG), which indicates that more than one third of sarcoma cases would not occur in a hypothetical population where these three risk variants were absent. This remarkable influence of just three SNPs is linked not only to the high frequency of the risk alleles but also to the interesting fact that the risk, defined as odds ratio, associated with single variants ranged between 1.35 and 1.48, which are values higher than those usually observed for other malignancies such as breast [101], colorectal [102], and gastric carcinomas [103], which generally include odds ratios between 1.10 and 1.30. Considering that the mean risk among variants significantly associated with sarcoma predisposition was even higher (approximately 1.70, see Table 2), one might speculate that germline DNA variation is especially important in the determinism of the susceptibility to this family of tumors.
Overall, the quality of the available data, which was thoroughly assessed by means of both Venice criteria and false positive report probability (FPRP), was satisfactory considering that the statistically significant evidence on 47 of 55 variants for which a meta-analysis was feasible was classified as high to moderate level of quality with 10 SNPs considered adequate according to the FPRP ( Table 2). A statistically significant association was also demonstrated for additional 906 SNPs, for which only a single data source was available, which pinpoints the urgent need for replication studies in order to validate or refute these findings.
Conventional meta-analysis of single variants led us to identify 55 SNPs significantly associated with sarcoma risk (  in single studies (Table 3): these variants are linked to a variety of genes whose protein products are involved in several cell activities. Therefore, we tried to provide readers with a preliminary interpretation of these findings from the functional biology viewpoint. Using modern SNP-to-gene and gene-to-function approaches such as integrative analysis of genetic variation with expression quantitative trait locus (eQTL) data [9] and respectively pathway/network analysis [8], we hypothesize that germline variation of the DNA repair machinery might be of special relevance for the development of this type of cancer ( Figure 2). This finding -which has been very recently confirmed in patients with Ewing's sarcoma [104] -is in line with the complex gene and chromosome abnormalities that characterized some sarcoma histologies, as well as with the epidemiological observation that people accidentally [105] or therapeutically [106] exposed to ionizing radiations and thus prone to develop DNA damage are at higher risk of different types of sarcomas. In this regard, it is interesting to note that peripheral blood mononuclear cells of patients diagnosed with sarcomas show a higher sensitivity to mutagens in vitro as compared to controls [107], which supports the hypothesis that the genetic background can make the difference on an individual basis in terms of response to environmental carcinogens potentially involved in sarcomagenesis. Finally, also somatic DNA alterations appear to confer a defective DNA repair capability to some sarcoma types such as Ewing's sarcoma [108], and thus the combinatory study of germline and somatic DNA variations characterizing sarcomas might lead to better understand the cascade of molecular events underlying sarcomagenesis, as recently proposed for the EWSR1-FLI1 fusion gene and the SNPs near EGR2 in Ewing's sarcoma patients [29].
Overall, these converging data suggest that more investigation aimed to fully elucidate whether the germline individual capacity of repairing genomic damage can actually affect the predisposition to a complex and heterogeneous trait such as sarcomas might be particularly fruitful.
In our work we also confirmed the association between sarcoma risk and variants of single genes, such as ZNF365, ADO, EGR2, CTLA4, TP53, CD86, NUDT6, MDM2, ERCC5 and ADAMTS6 just to mention the top ten by statistical significance. Many of these genes are not known to be involved in DNA repair and thus the relationship between these single gene findings and network/pathway analysis might appear of unclear interpretation and doubtful importance. However, we must remember that current evidence (and thus our analysis) is based on 88 candidate gene studies and only two GWAS: therefore, more extensive investigation is needed on the variation of pathways for which data on single genes are currently available. In this regard, our meta-analysis data can be utilized to inform future studies on candidate pathways whose genetic variation could affect sarcoma susceptibility. This systematic review also underscores the main limitation of the evidence on the genetic susceptibility of sarcomas. In fact, most of current information is driven by data from studies investigating bone tumors (78 of 90, 86.6%). Studies focusing on soft tissue sarcomas are thus eagerly awaited, the formation of international consortia being advocated in order to overcome the hurdle of disease rarity. Hopefully, technological improvements in direct DNA sequencing such as next generation sequencing (NGS) methods will further accelerate the discovery pace in this field of investigation, as recently reported [104].
Nevertheless, we also recognize some limitations of this synopsis: data from different tumor types and population ethnicity were pooled together to find associations despite the diversity of sarcoma histologies, leading to high level of between-study heterogeneity. To overcome to this limitation we performed subgroup and sensitivity analysis whenever possible. Moreover, despite our efforts to avoid the issue of overlapping series, it is always possible that partial overlaps between multiple series published by the same research groups that cannot be detected by full text reading did remain included in pooled analyses: however, we believe that the influence  of this potential residual overlapping on the overall results is reasonably low.
In conclusion, we hope that the creation of the first knowledgebase dedicated to the relationship between germline DNA variation and sarcoma risk can not only represent a valuable reference for investigators involved in sarcoma research but also inform future studies based on the gaps of the current literature.

Search strategy, eligibility criteria, quality score assessment and data extraction
This study followed the principles proposed by the Human Genome Epidemiology Network (HuGeNet) for the systematic review of molecular association studies [109].
We considered eligible all the studies concerning the association between any genetic variant and the predisposition to sarcoma in humans, providing the raw data necessary to calculate risk of developing a sarcoma or the summary data. Exclusion criteria were: virusinduced sarcomas (HHV8 -Kaposi sarcoma); sarcomas secondary to radiation therapy; sarcomas secondary to burns/scars/surgery; associations between mitochondrial DNA variations and sarcomas; gastrointestinal stromal tumors (GIST).
Database search of original articles analyzing the association between any genetic variant and susceptibility to sarcoma was conducted independently by two investigators though the following database: MEDLINE (via the PubMed gateway); The Cochrane Library; Scopus; Web of Science. The search included the following three groups of keywords: 1) sarcoma, solitary fibrous tumor, chordoma, tenosynovitis, fibromatosis, desmoids, myofibroblastic, myopericytoma, myxoma, Ewing, desmoplastic, PEComa, haemangioendothelioma, lymphangioma, myoepithelioma; 2) risk, sarcomagenesis, tumorigenesis, predisposition, susceptibility; 3) polymorphism, SNP, variant, genome wide association study and its acronym GWAS. Searches were conducted using all combinations of at least one keyword from each group. References from eligible articles were also used to refine the literature search.
The quality of the studies was evaluated according to Newcastle-Ottawa quality assessment scale (NOS) [110]. In brief, the following three parameters were evaluated with a "star system": the selection of the study groups (0 to 4 "stars"), the comparability of the groups (0 to 2 "stars"), and the ascertainment of either the exposure or outcome of interest for case-control or cohort studies respectively (0 to 3 "stars"). The maximum total score was 9 "stars" and represented the highest quality. Data were extracted independently by two investigators using a template. Every disagreement was resolved by a third investigator in order to reach consensus. Authors were contacted whenever unreported data were potentially useful to enable the inclusion of the study into the systematic review. The data extracted from eligible studies were: authors, journal, year of publication, region or country where the study was conducted, hospital where the patients were diagnosed, number of patients with sarcoma enrolled and healthy control subjects, period of enrolment, prevalent ethnicity (>80%, categorized in Caucasian, Asian, African and mixed), subjects age, genetic polymorphisms and allelic frequency in both cases and controls (if no raw data were available, summary data were collected, i.e. odds ratios and confidence intervals), study design (population-based versus hospital-based), statistical methods used, and sarcoma histology.
We considered data published in different articles by the same Author/s with the same (or similar) number of subjects enrolled in the same period of time in the same hospital, to be derived by the same group of patients. In publications with either overlapping cases or controls, the most recent or largest population was chosen.
For analysis purposes, the search was closed in August 2017.

Statistical analysis
We calculated summary odds ratios (ORs) and their corresponding 95% confidence intervals (95%CI) starting from raw data to measure the strength of association between each polymorphism and sarcoma risk.
Whenever possible, we calculated the pooled ORs assuming 3 different genetic models: per-allele (additive), dominant and recessive. If the included studies reported exclusively per-allele ORs, as in GWAS, we calculated the pooled OR assuming the per-allele (additive) model.
Random effects meta-analysis based on the inverse variance method was used to calculate summary ORs; this model reduces to a fixed effect meta-analysis if betweenstudy heterogeneity is absent. We chose this model for the large between-study heterogeneity usually expected in genetic association studies. A meta-analysis was performed only if at least two independent data sources were available. In case of GWAS, we considered as data source the joint analysis between the discovery and the validation phases. Subgroup analysis by histological subtype (Ewing's sarcoma vs osteosarcoma) was planned if data permitted.
Regarding ethnicity, analyses were divided in 4 groups: African (if the datasets were all African population-based), Asian (if the datasets were all Asian population-based), Caucasian (if the datasets were all Caucasian population-based), and mixed (if the datasets were African, Asian and Caucasian or if the datasets were from mixed ethnicity). In order to test any dominant study driving effect, sensitivity analysis by ethnicity (Asian vs Caucasian/other) was performed in mixed meta-analyses, with more than two datasets, excluding either the Asian study or the Caucasian study from the meta-analysis.
Between-study heterogeneity was formally assessed by the Cochran Q-test and the I-squared statistic, the latter indicating the proportion of the variability in effect estimates linked to true between-study heterogeneity as opposed to within-study sampling error.
All statistical analyses were performed with RevMan 5 (Review Manager computer program, version 5.3; Copenhagen, The Nordic Cochrane Centre, The Cochrane Collaboration, 2014).

Assessment of cumulative evidence
With the aim to assess the credibility of statistically significant associations based on the results of data metaanalysis, we used the Venice criteria [111]. In brief, we defined credibility levels based on the strength (classified as A=strong, B=moderate or C=weak) of three following parameters: amount of the evidence, replication of the association and protection from bias. We graded the amount of evidence, which approximately depends on the study sample size, based on the sum of cases and controls. Grade A, B or C was assigned to meta-analyses with total sample size >1000, 100-1000 and <100, respectively. Also, the replication of the association was graded considering the amount of between-study heterogeneity. We assigned grade A, B or C to meta-analyses with I-squared <25%, 25-50% and >50%, respectively. We graded protection from bias as A if no bias was observed, B if bias was potentially present or C if bias was evident. While assessing protection from bias we also considered the magnitude of the association. We assigned a score of C to an association characterized by a summary OR<1.15 or a summary OR>0.87 if the effect of the polymorphism was protective.
In addition to the Venice criteria, we assessed the noteworthiness of significant findings by calculating the false positive report probability (FPRP) [112], which is defined as the probability of no true association between a genetic variant and disease (null hypothesis) given a statistically significant finding. FPRP is based not only on the observed P-value of the association test but also on the statistical power of the test and on the prior probability that the molecular association is real following a Bayesian approach. We calculated FPRP values for two levels of prior probabilities: at a low prior (10E-3) that would be similar to what is expected for a candidate variant, and at a very low prior (10E-6) that would be similar to what would be expected for a random variant. To classify a significant association as 'noteworthy', we used a FPRP cut-off value of 0.2. www.oncotarget.com Overall, we defined the credibility level of the cumulative evidence as high (Venice criteria A grades only coupled with "noteworthy" finding at FPRP analysis), low (one or more C grades combined with lack of noteworthiness), or intermediate (for all other combinations).
To estimate the impact of genetic variation on the risk of sarcomas, we calculated the so called population attributable risk (PAR) using the following formula: Pr (RR − 1)/[1 + Pr (RR − 1)], where Pr is the proportion of control subjects exposed to the allele of interest and the relative risk (RR) was estimated using the summary estimates (i.e. ORs) calculated by the meta-analysis. The joint PAR for combinations of polymorphisms was calculated as follows: 1 − (∏ 1→n [1 − PARi]), where PARi corresponds to the individual PAR of the ith polymorphism and n is the number of polymorphisms considered [113].

Network and pathway analysis
In order to explore the mechanisms underlying the pathogenesis of sarcomas, we utilized network and pathway analysis to test the hypothesis that genes whose variations are associated with sarcoma risk interact with each other possibly within the frame of some specific molecular pathways [8].
To this aim, we first selected SNPs significantly associated with sarcoma risk. In case of SNPs located in intergenic regions we selected the first closest and the second closest genes, not necessarily upstream and downstream of the SNPs of interest.
Since most SNPs are intergenic or intronic and thus no obvious functional effect can be inferred, expression quantitative trait locus (eQTL) analysis was used to identify genes whose expression is affected by DNA variants [114]. The resulting gene list was the input for both network and pathway analysis.
For the former, the STRING web server was employed to study protein-protein interaction (PPI) across the selected genes [115], the confidence score being set >0.4. As a measure of across network connectivity STRING provides the average node degree, where degree is the conceptually simplest centrality measure as it measures the number of edges between protein connections attached to a protein; moreover, STRING computes the PPI enrichment P-value, which is significant when input proteins have more interactions among themselves than what would be expected for a random set of proteins of similar size, drawn from the genome.
As regards pathway analysis, the Enrichr web server was utilized to identify in our list over-representation of genes involved in specific pathways described in dedicated databases [116]. Hypergeometric distribution with Fisher's exact test was used to calculate the statistical significance of gene overlapping, followed by correction for multiple hypotheses testing using the false discovery rate [FDR] method.

Declarations
Ethics approval and consent to participate: Not applicable Consent for publication: Not applicable Availability of data and material: All data generated or analysed during this study are included in this published article [and its supplementary information files].