Identification of novel prognostic markers of survival time in high-risk neuroblastoma using gene expression profiles

Neuroblastoma is the most common extracranial solid tumor in childhood. Patients in high-risk group often have poor outcomes with low survival rates despite several treatment options. This study aimed to identify a genetic signature from gene expression profiles that can serve as prognostic indicators of survival time in patients of high-risk neuroblastoma, and that could be potential therapeutic targets. RNA-seq count data was downloaded from UCSC Xena browser and samples grouped into Short Survival (SS) and Long Survival (LS) groups. Differential gene expression (DGE) analysis, enrichment analyses, regulatory network analysis and machine learning (ML) prediction of survival group were performed. Forty differentially expressed genes (DEGs) were identified including genes involved in molecular function activities essential for tumor proliferation. DEGs used as features for prediction of survival groups included EVX2, NHLH2, PRSS12, POU6F2, HOXD10, MAPK15, RTL1, LGR5, CYP17A1, OR10AB1P, MYH14, LRRTM3, GRIN3A, HS3ST5, CRYAB and NXPH3. An accuracy score of 82% was obtained by the ML classification models. SMIM28 was revealed to possibly have a role in tumor proliferation and aggressiveness. Our results indicate that these DEGs can serve as prognostic indicators of survival in high-risk neuroblastoma patients and will assist clinicians in making better therapeutic and patient management decisions.


INTRODUCTION
Neuroblastoma is the most common extracranial solid tumor in childhood accounting for approximately 15% of pediatric cancer death [1][2][3]. It develops anywhere along the sympathetic nervous system with 60% of the tumors occurring in the abdomen, commonly in the adrenal gland [4,5].
Outcomes ranging from spontaneous regression to relentless progression despite extensive therapies indicate the heterogeneity of neuroblastoma [6]. The Children's Oncology Group (COG) classifies neuroblastoma patients into low, intermediate and high-risk groups. Patients classified in low-risk groups have good outcomes contrary to high-risk groups who present poor outcomes despite extensive therapies [4] and with a disproportionate number dying or suffering profound treatment related morbidities [7,8]. Tumors in high-risk neuroblastoma patients are often metastatic, resulting in survival rates of less than 50% [1].
The objective of our study is to identify a genetic signature from gene expression data that can serve as prognostic indicators of survival time in high-risk neuroblastoma patients and that could be therapeutic targets in that patient category.

RESULTS
Querying the Xena TARGET dataset returned 20 and 12 SS and LS samples, respectively. Based on the gene expression levels in these samples, the edgeR filterByExpr function removed 35,873 low expressed genes and kept 24,610 genes for downstream analysis. The DGE analysis with DESeq2 identified 40 DEGs between the SS and LS groups, of which 21 genes were upregulated and 19 genes were downregulated. Table 1 shows information about the 40 DEGs.
The Gene Ontology (GO) Molecular Function enrichment analysis revealed that upregulated genes were mainly enriched in MAP kinase activity, retinol binding and RNA polymerase II activating transcription factor binding, as well as in other activities shown in ( Figure 1A). No statistically significant results (p-adjusted value < 0.05) were obtained for the downregulated genes as well as for the other GO categories; Biological Process and Cellular Component. In addition, the Disease Ontology enrichment analysis associated upregulated and downregulated genes with several genetic disorders; intellectual disability, cardiac dysfunction, bone development, impaired infertility and pulmonary dysfunction caused by diaphragm-associated abnormalities ( Figure 1B and 1C).
Reconstruction of gene regulatory networks, using the GENIE3 algorithm, for the SS and LS samples respectively deduced 1,966,606 and 1,967,020 weighted interactions involving the DEGs. Applying a weight threshold value of 0.00251 resulted in 1018 and 650 DEG interactions for the SS and LS groups, respectively. The visualization of the 1018 DEG interactions using Cytoscape enabled the detection of 4 essential regulatory networks ( Figure 2). The first network ( Figure 2A) involves SMIM28, LGR5, PRSS12, EVX2, NHLH2 and HOXD10. All of these DEGs are upregulated, and the last three genes are transcription factors. The following network ( Figure 2B) interconnect MAPK15, Lnc-ZNF814-1, EDIL3, NBAS and CYP17A1. The first two genes are upregulated and the last three genes are downregulated. Most of the DEGs in the third and last networks ( Figure 2C and 2D) are downregulated with the exception of MEG9 and STRA6 which are upregulated. Interestingly, these interactions between the DEGs are   Table 3 yielded 43 SS and 19 LS samples, respectively. DEGs that do not have an associated NCBI GeneID were not found in this dataset, particularly those that are identified as long non-coding RNAs (lncRNAs). Only 25 of the 40 DEGs have expression data in this dataset. Based on the results of feature selection with scikit-learn and several classification tests, the following 16 features were selected for the machine learning construction of the training and test sets; EVX2, NHLH2, PRSS12, POU6F2, HOXD10, MAPK15, RTL1, LGR5, CYP17A1, OR10AB1P, MYH14, LRRTM3, GRIN3A, HS3ST5, CRYAB and NXPH3. The evaluation of the Support Vector Machines (SVM) and Artificial Neural Networks (ANN) models using 5-fold cross-validation resulted in an accuracy of 78% and 87% for SVM and ANN, respectively. By testing the ML models on the GSE49711 test set, ANN again achieved better results with an accuracy of 82% of samples correctly classified as SS or LS compared to SVM which obtained an accuracy of 79% (Table 2).

DISCUSSION
We aimed at identifying genes that are differentially expressed between high-risk SS and LS patients that could be potential prognostic indicators and or therapeutic targets. The results of the DGE analysis between the two groups showed the upregulation and downregulation of genes associated with neuroblastoma and other cancers.

Upregulated genes
The upregulated DEGs included some genes whose overexpression in the SS group have been correlated with poor survival in neuroblastoma and other cancers. Higher expression levels of NHLH2 were found to be higher in unfavourable neuroblastomas and was significantly associated with a poor prognosis [22]. Additional roles of NHLH2 in the obesity and fertility has also been uncovered [23,24]. The upregulation of PRSS12 in this study is similar to the results of Hiyama et al. [25], which reported the overexpression of PRSS12 in neuroblastoma tumors with high telomerase activity correlating with unfavourable tumors. Serine proteases are often altered and significantly upregulated in cancer as malignant cells need proteolytic activities to enable their growth, survival and expansion. Our result support that upregulation of PRSS12 is indicative of poor survival in neuroblastoma. HOXD10 is a transcription factor whose expression is altered in many cancers. Its high expression gives cancer cells proliferative and migratory abilities [26]. Elevated expression of HOXD genes including HOXD10 was reported to be associated with unfavourable prognosis and poor outcome in neuroblastoma [27], which supports our results indicating a more aggressive disease in the SS group.
LGR5 is a stem cell marker which is highly expressed and associated with an aggressive phenotype in neuroblastoma [28,29]. It has also been associated with pancreatic ductal adenocarcinoma [30] and colorectal cancer [31].
LGR5 potentially contributes to stem cell maintenance and self renewal and is indicative of poor survival in high-risk neuroblastoma. SMIM28 is a less studied protein whose upregulation is indicative of poor survival in this study. Similar to our results, Jiang et al. [32] reported the upregulation of SMIM28 in prostate cancer. EVX2 is a homeobox transcription factor essential for vertebrate spinal cord neuronal specification [33]. POU6F2 belongs to the POU class homeobox family whose members are transcriptional regulators and is involved in hereditary predisposition to Wilms tumor, a pediatric malignancy of the kidney [34]. Functional studies are required to elucidate the roles of SMIM28, EVX2 and POU6F2 in high-risk neuroblastoma.
MAPK15, a protein kinase involved in many cellular activity including cell proliferation was upregulated in this study. Highest levels of MAPK15 was found in aggressive embryonal carcinomas and it acts by sustaining the progression of the cell cycle of embryonal carcinomas by limiting p53 activation and preventing the facilitation of p53 dependent mechanisms that results to the arrest of the cell cycle [35]. Neuroblastoma is a malignancy of embryonal origin and upregulation of MAPK15 would be expected to facilitate tumor progression and indicative of aggressive disease and poor survival as in the SS group. RTL1, a paternally expressed imprinted gene highly expressed in the fetus, placenta and brain. It has been reported to be a driver of hepatocellular carcinoma [36]. Being a protease, it possibly promotes tumor invasion and metastasis in neuroblastoma tumors. Higher expression of RTL1 is thus suggestive of poor prognosis in neuroblastoma. Also upregulated is STRA6, a plasma membrane protein that transports retinol and is involved in a signalling mediated by JAK2, STAT3 and STAT5 [37]. Its upregulation indicates poor survival in our study possibly through its maintenance of cancer stem cells and promotion of tumor formation as reported in colorectal cancer [37,38].
Ten lncRNAs were upregulated in this study. Three of these (lnc-SPG21-45, lnc-NSUN6-1 and lnc-KLHL28-1) are antisense to ANKDD1A, CACNB2 and C14orf28 genes, respectively, which are associated with other cancers and diseases, possibly by regulating their expression. C14orf28 has been observed to be overexpressed in colorectal cancer cells, promoting proliferation, migration and invasion [39]. ANKDD1A has been described as a functional tumor suppressor with germline variants predicting poor patient outcomes in low grade glioma [40], and is frequently methylated in glioblastoma multiforme [41] and in clinically nonfunctioning pituitary adenomas [42]. CACNB2 is a calcium channel protein linked to diabetic retinopathy [43], bipolar disorder [44], hypertension [45] and autism spectrum disorders [46]. The role of calcium signalling in cancer has been reviewed by [47][48][49]. MEG9, is located in an imprinted non-coding RNA genomic region, DLK1-DIO3 [50,51]. LINC01410 is a lncRNA highly expressed in pancreatic cancer tissues and cell lines [52]. High expression in cholangiocarcinoma and gastric cancer patients have been associated with poor prognosis and survival [53,54]. These lncRNAs may be facilitating the promotion of tumor progression, proliferation and invasion which might have impacted the survival of patients in the short survival group. It is well established that metastasis is the primary cause of cancer mortality. Functional studies are required to evaluate the roles of these lncRNAs in neuroblastoma.

Downregulated genes
The downregulated genes include genes such as AMIGO2, LRRTM3, GRIN3A, MYH14, EDIL3, FNDC9, involved in neural function and development, and in extracellular matrix (ECM) organization. AMIGO2 is a transmembrane molecule expressed in neuronal tissues and participates in their formation [55]. EDIL3 is an inducer of the epithelial-mesenchymal transition, that promotes angiogenesis and invasion in hepatocellular carcinoma [56]. FNDC9 which exhibits biased expressed in the brain is an ECM protein involved in tumorigenesis in different cancers [57]. MYH14 is a myosin, an actin-dependent motor protein that plays a role in neuritogenesis. Members of the myosin superfamily have been known to enhance or suppress tumor progression [58]. MYH14 could be suppressing tumor progression in high-risk neuroblastoma. The downregulation of these ECM associated genes could be promoting the invasion and metastasis of neuroblastoma tumors. OR10AB1P belongs to the olfactory receptor family of genes. Olfactory receptors are expressed in various human tissues and are involved in different cellular processes such as migration and proliferation. Some are biomarkers for prostate, lung and small intestine carcinoma tissues [59]. Decreased expression of CRYAB indicated its tumor suppressor function in bladder cancer [60]. It may thus also be functioning as a tumor suppressor in neuroblastoma. Ubiquitin C (UBC) is a polyubiquitin precursor. Ubiquitination has been associated with many cellular processes which play roles in tumorigenesis.
CYP17A1 is a key enzyme in the steroidogenic pathway with restricted expression in the adrenal gland. Neuroblastoma tumors commonly occur in the adrenal medulla. The downregulation of CYP17A1 may be an indicator of poor prognosis in neuroblastoma. LRRTM3 has high expression in the brain and belongs to a group of proteins involved in nervous system development. How it contributes to neuroblastoma remains to be investigated but it is currently a candidate gene for alzheimer's disease [61,62], a neurological disorder. NBAS is thought to be involved in golgi-to-ER transport and is typically amplified in MYCN-amplified neuroblastoma tumors [63]. GRIN3A is a glutamate receptor that promotes nerve outgrowth [64]. The downregulation of GRIN3A may suggest a higher level of disease because glutamate is a major excitatory neurotransmitter in the CNS that is involved in many neuronal processes. Functional studies are required to investigate how these genes contribute to poor survival in neuroblastoma.

Gene and disease ontology enrichment analyses
The molecular function activities; MAP kinase activity, retinol binding and RNA polymerase II activating transcription factor binding, enriched by the upregulated genes MAPK15, STRA16 and NHLH2, respectively, are activities that promote tumor cell proliferation. Deregulation of the MAPK signalling was associated with cancer development, progression and cell proliferation [65,66]. Retinol binding through the STRA6 upregulation activates a signalling cascade that is found to play a role in cancer initiation, maintenance and growth [38]. Furthermore, the increased global transcription activity (activation of RNA polymerase II) indicates an intensity of a rapid proliferation of cancer cells [67]. These activities again demonstrate the aggressiveness of the neuroblastoma tumors in SS patients compared to LS patients.
The enriched diseases by the upregulated and downregulated genes, particularly the disorders inducing heart failure (Cardiomyopathy and Congenital heart disease) and respiratory illness (Diaphragmatic dysfunction) may have negative impact on survival [68,69], and could be the cause of death in the SS neuroblastoma patients. There is no indication in the clinical information accompanying the gene expression count dataset if the patients suffered from additional disorders in addition to neuroblastoma. However, it is reported that high-risk neuroblastoma survivors treated with intensive multimodality therapy are at risk for a broad variety of treatment-related late effects including cardiac dysfunction, bone development disease, pulmonary dysfunction and impaired fertility [70]. These treatmentrelated morbidities were enriched by the upregulated and downregulated DEGs ( Figure 1B and 1C), suggesting that the cause of death in patients with SS neuroblastoma may be due to treatment complications.

Regulatory network analysis
After the application of a weight threshold of 0.00251 to the GENIE3 output, the numbers of interactions involving the DEGs in the SS and LS groups were 1018 and 650, respectively. There is a clear substantial difference in number of interactions between the two groups, indicating a higher cellular/tumor activity in the SS group which could be a sign of tumor aggressiveness in patients with SS neuroblastoma. In addition, it is noticeable from the networks in Figure 2 the importance of the genes SMIM28, MAPK15 and UBC as origins of most of the interactions. The role of MAPK15 and UBC in tumorigenesis has been reported in many previously discussed scientific works, while SMIM28 is a less studied gene with an unclear role in cancer. However, observation of the role of the final target genes of SMIM28 in Figure  2A network, which are LGR5 and HOXD10, can shed light on the role of this gene. As reported previously, both LGR5 and HOXD10 were associated with cancer cell proliferation and tumor aggressiveness in neuroblastoma. Thus, it is

Machine learning
Although not all the DEGs were included in the training and testing of the machine learning models, the obtained prediction results were significantly good. The ANN model obtained the highest accuracy of 82% for the classification of the external neuroblastoma samples (GSE49711) into short and long survival classes. This high classification accuracy makes it possible to consider that the DEG expression profiles have been preserved in the various high-risk neuroblastoma tumors, although neuroblastoma is known to be heterogeneous. Therefore, the DEG list can serve as prognostic indicators (genetic signature) for survival time in high-risk neuroblastoma patients and can be targets for drug discovery analyses. Relatively similar to our study, other studies have proposed genetic signatures for prognostic stratification of patients with neuroblastoma. Liu et al. [71] used an unsupervised biclustering machine learning technique to find high-risk neuroblastoma subtypes. They proposed a signature of 238 neuroblastoma-specific immune genes to identify ultra high-risk and high-risk neuroblastoma subtypes. Russo et al. [72] applied K-means clustering to 27 kinome gene signature to identify ultra high-risk subtypes of high-risk neuroblastoma. Formicola et al. [73] used Cox regression and Kaplan-Meier analysis methods to propose a 18-gene expression based risk scoring system to predict overall survival of patients with stage 4 neuroblastoma. We demonstrated the use of a shorter signature in a 16-gene expression classifier based on survival time to stratify high-risk neuroblastomas into SS (ultra high-risk) and LS subtypes. The number of genes in our classifier should provide the advantage of being less costly and easier to implement.

MATERIALS AND METHODS
The workflow describing the steps and methods undertaken in this study is illustrated in Figure 3. It includes 5 essential steps; dataset retrieval, differential gene expression, disease/gene ontology enrichment, gene regulatory network inference and machine learning.

Datasets
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiative employed comprehensive molecular characterization of hard-to-treat childhood cancers which included neuroblastoma. TARGET data is accessible via the TARGET data matrix as well as via the Xena browser. The Xena browser is a web based visualization and exploration tool for multi-omic data large public repositories and private datasets [74]. The TARGET neuroblastoma dataset in the Xena database [74,75] is composed of high-risk neuroblastoma samples with available clinical information. Gene expression RNA-seq read counts of the TARGET neuroblastoma dataset (dataset ID: TARGET-NBL. htseq_counts. tsv) were obtained from the GDC hub in Xena browser using xenaPython package. Fields used in querying the dataset are described in Table 3. The dataset itself was composed of 151 samples in total. Querying the dataset with the above fields returned 32 neuroblastoma samples; 20 of which with an overall survival time of less than 730 days (2 years) and vital status is "dead" were considered short survival (SS), while 12 samples with an overall survival time greater than 2555 days (7 years) and vital status is "alive" were considered long survival (LS).

Differential gene expression analysis
Normalized expression counts were converted to raw counts and filtered to remove low expressed genes using the edgeR filterByExpr function in R [76]. We then performed a differential gene expression (DGE) analysis between the short and long survival groups using the DESeq2 package in R [77]. DESeq2 uses shrinkage estimators for dispersion and fold change for its comparative differential gene expression estimation [77]. Differentially expressed genes (DEGs) were selected by meeting criteria of adjusted p-value < 0.05. Gene Ontology (GO) enrichment analysis was carried out to functionally annotate the DEGs using clusterProfiler [78] and visualized using enrichplot. The DOSE R library [79] was used to detect the diseases enriched by the upregulated and downregulated DEGs.

Machine learning
Scikit-learn [80] and LibSVM [81] libraries were used for machine learning (ML) model creation and classification tasks. Features (genes) not present in the test dataset (GSE49711) were not used for machine learning prediction. The top upregulated and downregulated genes used as features for machine learning (ML) prediction of  patient survival were; EVX2, NHLH2, PRSS12, POU6F2,  HOXD10, MAPK15, RTL1, LGR5, CYP17A1, OR10AB1P,  MYH14, LRRTM3, GRIN3A, HS3ST5, CRYAB, NXPH3. The features were extracted from the log-normalized counts data. For LibSVM, the feature values of the training and test sets were scaled using a built-in python script. With regards to ANN, the feature values of the training and test sets were scaled with the Scikit-learn MinMaxScaler function. Algorithms used for training and evaluation of the models include; Support Vector Machines (SVM) and Artificial Neural Networks (ANN). 5-fold cross validation was done to determine the best parameters of the SVM and ANN models which was then applied to our test samples. Both models, created with LibSVM and ANN were applied to predict the classification of samples in an external dataset (GSE49711) with same sample characteristics (overall survival < 730 days with vital status as dead for SS samples, and overall survival > 2555 days with vital status as alive for LS samples). The evaluation metrics for the LibSVM and ANN models were precision, recall and accuracy.

Regulatory networks
The GENIE3 package [82] in R was used for genetic regulatory network inference analysis. The GENIE3 algorithm uses a Random Forest or Extra Randomized Trees approach to infer gene regulatory networks from gene expression data [82]. It outputs a ranked list of each pairwise comparison from the most to the least confident regulatory connection. The library was run on the gene expression counts from the SS and LS samples separately. For the output analysis, only connections involving the DEGs were considered. In addition, a weighting threshold of 0.00251 was applied to reduce the large number of connections and also to focus on high confident regulatory connections. Cytoscape [83] was used for the visualization of gene regulatory networks.

CONCLUSIONS
The DGE analysis is a powerful technique for identifying DEGs in a studied condition. In this study, using DESeq2 we identified 40 DEGs between SS and LS neuroblastoma samples. Many of the DEGs were found to be related to different cancers (including neuroblastoma), thus strengthening their possibility of being associated with neuroblastoma. The ML models based on 16 DEGs were capable of stratifying high-risk neuroblastoma samples on the basis of survival time, demonstrating their ability to be used as a genetic signature or prognostic indicators of survival in high-risk neuroblastoma patients. This study furthers our understanding of the molecular mechanisms of neuroblastoma in high-risk patients. We identified SMIM28 gene to be critical for tumor proliferation making it as a possible gene therapy target. Nevertheless, additional studies are required to elucidate the role of SMIM28 in the pathogenesis of neuroblastoma. Finally, prognostic stratification of highrisk neuroblastoma patients will help clinicians in making better therapeutic and patient management decisions.

Author contributions
AC and HB funded the project. JG, AG and HB designed the study. AG and HB implemented the analysis. AG, AF and HB wrote the manuscript. All authors read and approved the final manuscript.

ACKNOWLEDGMENTS
We want to acknowledge the support received from the South African Medical Research Council and National Research Foundation of South Africa.

CONFLICTS OF INTEREST
Authors have no conflicts of interest to declare.