gave comprehensive molecular portraits of human Exploration of immunogene in colon cancer recurrence

Colon adenocarcinoma is the third most common cancer with high risk of recurrence and deteriorative consequences. Given the importance of immune genes in tumor regulation and cancer immunotherapy, there is a need to comprehensively profile the immunoregulatory genes from multiple types of colon cancer patient genomic data for discovering important associations and potential therapeutic targets of colon cancer recurrence. We used publicly available colon tumor tissue genomic data from The Cancer Genome Atlas database and immune genes data from innateDB database in this study. We derived the immune genes profiles by exploring multiple genomic profiles (gene expression, clinical and somatic mutation) in colon cancer. Some of the synthetic lethal genes we identified, such as CASP14, MS4A6E, KIR2DL1, KIR3DL1, KIR2DL3, CCL1, IL36B, FOXO3, POU2F1, SMAD3, HOXA9, PACS1, PROM1, DIDO1, SRC, CBFA2T2, NCOA6, PGAM1 and PROC, have been suggested to be potential targets correlated with immune genes for colon cancer recurrence treatment. Moreover, TLR2 could be promisingly new early stage indicator for colon adenocarcinoma recurrence. This is a systematic study that combines three different types of genomic data to


INTRODUCTION
Colon adenocarcinoma (COAD), as a kind of colorectal cancer, has traditionally been treated surgically [1]. However, many cases of colon cancer are systemic at the time of diagnosis, and apparently curative surgery is turned out to be at a late stage. But, tumor recurrence as a consequence of circulating tumor cells is unmanageable before the surgery [2][3][4]. It has been recognized that cancer is associated with the genetic, genomic and epigenetics changes [5,6]. Meanwhile, a demonstrative influence of the host immune response on tumour invasion, recurrence and metastasis has come from analyses of the in situ immune components and how these components are organized within human tumours. Indeed, immune infiltrates are heterogeneous between tumour types, and are very diverse from patient to patient. All immune cell types may be found in a tumour, including macrophages,

dendritic cells, mast cells, natural killer (NK) cells, naive and memory lymphocytes, B cells and effector T cells (including various subsets of T cell: T helper cells, T helper 1 (T H 1) cells, T H 2 cells, T H 17 cells, regulatory T (T Reg ) cells, T follicular helper (T FH ) cells and cytotoxic T cells)
. These immune cells can be located in the core (the centre) of the tumour, in the invasive margin or in the adjacent tertiary lymphoid structures (TLS) [7]. In some cases, immune cells constitute an additional, prominent component of the host response to cancer, but their participation in tumour pathogenesis remains incompletely understood [8]. Therefore, finding genes correlated with immunogenes for COAD recurrence is becoming more and more important.
A large number of COAD genomic data are emerging and promoting colon cancer research. The Cancer Genome Atlas (TCGA, https://portal.gdc.cancer. gov/) gave comprehensive molecular portraits of human colon cancer by integrating various types of "omics" data including genomic DNA copy number arrays, DNA methylation, exome sequencing, messenger RNA arrays, microRNA (miRNA) sequencing, and reversephase protein arrays. The related investigations have greatly advanced our understanding of colon cancer in molecular profiles [9], although translation of genomic findings into clinical applications remains challenging. The high-quality TCGA primary colon tumor samples and their comprehensive molecular profiles could be an invaluable source of information for molecular exploration of colon cancer and discovery of new treatment targets. Immune gene list from InateDB, as a manually-curated knowledgebase of the genes, proteins, and particularly, the interactions and signaling responses involved in mammalian innate immunity [10], include Immport [11], Immunogenetic Related Information Source (IRIS), Septic Shock Group, MAPK/NFKB Network, Calvano et al., Nature 2005 [12], and Immunome Database. Considering the dependability of all immune genes, we selected immune genes that are distributed in more than two above gene lists, and then incorporated the innate immunity genes that are not distributed in other gene lists.
Microsatellite instability (MSI) is caused by a defect in the mismatch repair (MMR) machinery, one of the main mechanisms responsible for recognizing and repairing errors in newly synthesized DNA. MSI results from biallelic inactivation of one of the MMR genes [13]. MSI, as one of immunogenic subtype in Colorectal cancer, has been exploded in research paper [14,15]. Mutation burden defined by the number of mutated genes in every cancer sample mirrors the degree of mutation [16]. The relationship of mutations and MSI or gene expression level and MSI need to be analyzed.
In this study, we explored immune genes correlated with COAD recurrence by survival analyses based on immunogene mutations profiles. We analyzed immune gene somatic mutation and gene expression data to identify potential SL genes for above immune genes, evaluated MSI status, mutation burden and compared the relation of MSI with somatic mutation and gene expression level. We also identified potential sign for recurrent COAD by immunogene mutations and clinical profiles and verified the results by PubMed references (http://www.ncbi.nlm. nih.gov/entrez/query.fcgi?db = PubMed).

RESULTS
By using Gene Set Enrichment Analysis (GSEA) software [17], we identified 10 canonical pathways significantly associated with this 1219-gene set that have significant differences between recurrent cancer and normal samples. We obtained 10 pathways correlated with 1219 Immune genes involved in COAD by a threshold of adjusted P-values (FDR) < 0.05 ( Figure  1 and Supplementary Table 1), such as Genes involved in Immune System [18], Matrisome and Matrisome Associated [19,20], Innate Immune System, Complement and Coagulation Cascades, Adaptive Immune System, Pathways in cancer, Hemostasis, Hematopoietic cell lineage, Cytokine-cytokine receptor interaction [21].
We compared disease-free survival (DFS) between immune genes mutated and wildtype cancer samples, both of more than 3 samples in COAD. Finally, 69 Genes were found that all have a worse DFS prognosis, and then are bad for recurrent COAD patients to survive. We plot DFS curves for immune gene-muted and gene-wildtype cancer samples with significant poor prognosis by a threshold of adjusted P-value (FDR) < 0.05 ( Figure 2).
According to the MSI status, we divided all recurrent COAD samples and all non-recurrent cancer samples into three groups, including MSI group, Microsatellite stability (MSS) group, and non-available group. Then, we concluded that the COAD recurrence has no correlation with the stability of microsatellites.
DNA mutation types were classified by TCGA mutation dataset, which included Silent, Missense_ Mutation, Frame_Shift_Del, Nonsense_Mutation, In_ Frame_Ins, Splice_Site, Frame_Shift_Ins, In_Frame_Del ( Figure 3 and Supplementary Table 2). We found that higher mutation burden of recurrent cancer samples concentrated in missense mutation except for silent mutation.
Based on the TCGA data, the stability of microsatellites has no significant correlation with cancer recurrence. Then, we concluded that immunogene cell adhesion molecule L1-like (CHL1, P-value = 0.019, odds ratio = 7.644) and insulin receptor substrate 1 (IRS1, P-value = 0.026, odds ratio = Inf (denominator is zero)) somatic mutation have correlation with the MSI and MSS. At the same time, the differences of expression levels of the 69 immunogenes in MSI cancer samples compared to those in MSS cancer samples were shown in Table 1  (Supplementary Table 3).
Comparison expression level of immune genes that are correlated with recurrent COAD in recurrent cancer and normal samples, which shows that 31 immune genes have significant difference between the two groups (FDR < 0.05, and fold change > 1.5). The FDR was estimated using the method of Benjami and Hochberg [22]. Heat map of 31 immune genes expression ( Figure 4 and Supplementary Table 4) shows the expression trend in recurrent cancer and normal samples. Figure 4. Differential gene expression levels of immune genes in recurrent cancer compared to those in normal samples. The columns represent recurrent cancer samples (63), normal samples (41) without cancer, and the rows represent immune genes. The red color indicates that a gene is more highly expressed, and the green color indicates that it isn't.
We verified the 31 immune genes correlated with COAD recurrence by using the PubMed database. All www.impactjournals.com/oncotarget research papers related to the 31 genes provided direct or indirect evidences suggesting that COAD recurrence is affected by immunogenes ( Table 2). We found that the related experiments of the 1 2 immune genes have been published in research papers.
We identified a set of candidate synthetic lethal (SL) genes [38] for 31 immune genes from above results. 19 SL genes for 6 immune genes have been found in mutation data (Table 3).
Moreover, we compared IC 50 (drug concentration that reduces viability by 50%) values between immune genes that identified for 19 SL genes' higher-expressionlevel and lower-expression-level cancer cell lines for each of the 265 compounds mentioned above. We found that GULP1 had significantly lower IC 50 values in lower-expression-level cancer cell lines than in higherexpression-level cancer cell lines, and other 4 immune genes turned out to be the reverse (P-value < 0.05 and top 5 ascending order by FDR) ( Table 4 and Supplementary  Table 5). This indicates that immune genes expression levels of the cancer cell lines significantly influence the sensitivity of these cell-lines to drugs. Table 4 lists drug sensitivity differences in differential expression levels of immune genes obtained by the Cancer Cell Line Project.
We compared the immune gene mutation rates among different clinical phenotypes of cancer patients    using Fisher's Exact Test. Stage phenotype that represents tumor size and spread was divided into two classes: early stage (Stage I-II) vs. late stage (Stage III-IV) [50]. Only gene TLR2 mutation was correlated with stages, and its mutation rate was higher in early stage than late stage subjects (unadjusted P-value = 0.041, odds ratio = Inf). Gene toll-like receptor 2 (TLR2) mutation rate is 3.7%, that is, 8 samples, which include Missense Mutation (3),  (1) and Silent Mutation (2).

DISCUSSION
In this study, we performed extensive analyses of immune gene mutation, gene expression, and clinical data from COAD datasets in TCGA. We computed the somatic mutation burden, MSI status and correlation with each other of 69 immunogenes, verified the immunogenes correlated with COAD recurrence by literature, identified potential druggable SL partners of immune genes that are correlated with COAD recrudesce, analyzed correlation between early and late stages.
Druggable SL gene partners that were identified for immune genes that are correlated with COAD recidivation may yield insights into the personalized treatment of patients with immune gene-mutated cancers, since no druggable immune gene mutants. An example of successful application of the synthetic lethality approach is the targeting of cancers with dysfunction of the breast-cancer susceptibility genes 1 and 2 (BRCA1 and BRCA2) by poly(adenosine diphosphate ADP-ribose) polymerase (PARP) inhibitors [51]. In the present study, we identified potential immune genes SL partners based on the assumption that mutation of SL gene has more effects on the expression of immune genes in SL gene-mutated cancers than those in both SL gene-wildtype cancers and normal tissue. Moreover, we validated the correlation between drug sensitivity and gene expression by exploring the pharmacogenomic data from the Cancer Cell Line Project. We identified a threshold of P-value < 0.05 in the result of different sensitivity based on the pharmacogenomic data. Drugs had significant differences between higher-expression-level and lower-expressionlevel genes. ATP5A1, IL18R1, and SPP1 all have higher sensitivity in higher-expression-level than lower-expressionlevel. GULP1 and LYL1 all have lower sensitivity in higherexpression-level than lower-expression-level. CGP-082996 [52], as inhibitor of CDK4, is the only one drug that has lower sensitivity in higher-expression-level than in lowerexpression-level based on different drug sensitivity of ATP5A1. We identified that CDK4 expression significantly correlates with ATP5A1 expression in COAD (Pearson product-moment correlation, P-value < 0.05). The result is based on COAD dataset from TCGA indicating that CDK4 expression positively correlates with ATP5A1expression (P-value = 0.004, Correlation coefficient = 0.221), and then explains the reason of lower drug sensitivity in ATP5A1 higher-expression-level.
TLR2 plays an important role in Lewis lung carcinoma metastatic growth [53]. Recurrence pattern of COAD includes metastatic recurrence [54] and disseminated recurrence [55]. TLR2 is also required for rapid inflammasome activation [56,57] and is identified as an indication gene correlated with cancer stages and cancer recurrence by this study. TLR2 interacts with a number of gene products (proteins) (Figure 5, generated by the BioGRID [58]). For example, autosomal recessive deficiencies of IRAK1 and MYD88 impair Toll-like receptor (TLR) to recurrent life-threatening bacterial diseases [59]. TIRAP is dispensable in TLR2 signalling at high ligand concentrations in macrophages and dendritic cells, with MyD88 probably coupling to the TLR2 receptor complex at sufficient levels to allow activation but having an inhibitory role in the signalling of TLR3 to JNK [60,61]. Although these results need to be validated through experimental investigation, they represent a promising direction for future studies.
Overall, we found 31 immune genes related to colon cancer recurrence. In particular, 12 of the 31 identified genes have been reported in the literatures to be related to colon cancer recurrence [23][24][25][26][27][28][29][30][31][32][33][34][35][36][37], which can be regarded as validation to our prediction results for these 12 genes. For the other 19 genes identified by us, we cannot exclude the possibility that some of them are due to false positive prediction. Therefore, there is a need for further experimental studies and external database of our predicted genes to validate our method and the results.
Our study was based on the data from the TCGA database. It is desirable to test and apply our method on the data from other sources. We therefore searched for the relevant data from other databases including Gene Expression Omnibus (GEO) [62], Ensembl [63], Hugo Gene Nomenclature Committee (HGNC) [64], ArrayExpress [65], and Catalogue of Somatic Mutations in Cancer (COSMIC) [66]. These databases provide comprehensive genome data for colon cancer, such as gene expression dataset, somatic mutation dataset and gene location dataset even gene Methylation dataset. However, none of them provide the information of the clinical prognosis tracking, which is needed by our study. Hence, we were unable to test and apply our method on the data from these databases. It is expected that more comprehensive information such as clinical prognosis tracking will be made available in more databases, which will better serve the need of various research and discovery efforts.

Materials
We downloaded the colon carcinoma gene expression (microarray), gene somatic mutation data and clinical dataset from the TCGA data portal (https:// portal.gdc.cancer.gov/). For TCGA samples, somatic mutations were revealed from exome sequencing of matched tumor and normal tissue genome pairs. In the gene expression data, a total of 287 colon cancer samples and 41 normal samples were found. Considering that the gene expression activity is our primary concern, and for statistical consistency, we analyzed the same 287 colon cancer samples in the other 2 data types. There are 79 and 4 samples missing in gene somatic mutation and clinical data, respectively. Thus, we analyzed 287 colon cancer samples in the gene expression, 208 colon cancer samples in the gene mutation data and 283 colon cancer in the colon cancer clinical data. We used clinical data for survival analyses and indication of tumor status from FireBrowse (http://gdac. broadinstitute.org/). According to the dataset mentioned above, 63 recurrent tumor samples and 177 non-recurrent tumor samples have been divided into two cancer groups. Considering that the TCGA dataset activity is our primary concern, and for statistical consistency, we analyzed 2546 immune genes contained in COAD from InnateDB (http://www.innatedb. ca/redirect.do?go=resourcesGeneLists) [10]. We obtained pharmacogenomic data from the Cancer Cell Line Project (http://www.cancerrxgene.org/) covering 265 screened compounds and their targets, and including cancer cell lines' drug response, drug sensitivity, and gene expression data. Ethical approval was avoided since we used only publicly available data and materials in this study.

Comparison of immune genes expression levels
We normalized TCGA RNA-Seq gene expression data by base-2 log transformation. We identified differentially expressed genes between two classes of Every dot represents a sample, wherein, the red dots are recurrent caner samples and blue dots are non-recurrent cancer samples. The vertical axis shows the number of mutation genes in every cancer sample. www.impactjournals.com/oncotarget samples using Student's t test. To adapt to multiple tests, we calculated adjusted P-values (FDR) for t test P-values. We used the threshold of FDR < 0.05 and mean geneexpression fold-change > 1.5 to identify the differentially expressed genes.

Comparison of the immune gene mutation rates and expression levels
We compared the immune gene mutation rates among different clinical phenotypes of cancer patients using Fisher's Exact Test. Tumor stage phenotype was divided into two classes: early stage (Stage I-II) vs. late stage (Stage III-IV). A threshold of P-value < 0.05 was used to evaluate the correlation in mutation rates between the two classes of phenotypes.

Gene-set enrichment analysis
We performed pathway analysis of the gene sets using KEGG(www.genome.jp/kegg/), REACTO ME (www.reactome.org/) and the GSEA tool (http:// software.broadinstitute.org/gsea/msigdb/). We carried out network analysis of gene sets interacting with a  (41) without cancer, and the rows represent immune genes. The red color indicates that a gene is more highly expressed, and the green color indicates that it isn't. number of gene products (proteins) generated by the BioGRID [58].

Survival analyses
We performed survival analyses of TCGA patients based on COAD mutation data. Kaplan-Meier survival curves were used to show the survival (disease free survival) differences between gene-mutated cancer patients and gene-wildtype cancer patients. We used the log-rank test to calculate the significance of survival-time differences between the two classes of patients with a threshold of P-value < 0.05.

Microsatellite instability status in all cancer samples
We divided the MSI into three groups, including MSI, MSS and non-available, and analyzed the correlation between stability of microsatellites and somatic mutation by using Fisher's Exact Test. Based on the cancer recurrence immunogenes, we compared the correlation of 69 genes between gene somatic mutation and MSI by using Fisher's Exact Test (P-value < 0.05) in COAD. At the same time, we analyzed the differential expression levels in MSI cancer samples compared to those in MSS cancer samples by Student's t test with a threshold of FDR < 0.05 and a fold change > 1.5.

Somatic mutation burden of all gene-mutated samples
We integrated the clinical data and somatic mutation data. Then, we showed mutation burden and mutation types of all recurrent caner samples.

Verification of genes by the research based on the PubMed
We obtained related research papers of experimental data from PubMed searches, and integrated all genes information to verify the 31 immune genes correlated with COAD recurrence.

Identification of potential SL genes for immune genes
We identified the set of immune genes whose expression has significant difference between gene-mutated cancers and gene-wildtype cancers (Student's t test, FDR < 0.05, fold change > 1.5), and has significant difference between gene-mutated cancers and normal tissues (Student's t test, FDR < 0.05, fold change > 1.5). We identified potential SL genes for immune genes from the intersection of these two gene sets. To identify genes whose elevated expression is specifically related to other gene-mutated cancers, we believe that it is necessary to exclude as many genes as possible whose expression is significantly different between in gene-wildtype cancers and in normal tissues.

Comparison of drug sensitivity in cancer cell lines
We compared IC 50 values belonging to intestine tissue cancer cell lines between immune genes higherexpression-level and lower-expression-level cancer cell lines for compounds using Student's t test. We identified the compounds for which immune genes higher-expressionlevel and lower-expression-level cancer cell lines have significantly different IC 50 values using a threshold of P-value < 0.05.

Author contributions
JL conceived of and performed the research. QS performed data analyses, and wrote the manuscript. YC and SY performed data analyses. All authors read and approved the final manuscript.