Constructing Bayesian networks by integrating gene expression and copy number data identifies NLGN4Y as a novel regulator of prostate cancer progression

To understand the heterogeneity of prostate cancer (PCa) and identify novel underlying drivers, we constructed integrative molecular Bayesian networks (IMBNs) for PCa by integrating gene expression and copy number alteration data from published datasets. After demonstrating such IMBNs with superior network accuracy, we identified multiple sub-networks within IMBNs related to biochemical recurrence (BCR) of PCa and inferred the corresponding key drivers. The key drivers regulated a set of common effectors including genes preferentially expressed in neuronal cells. NLGN4Y—a protein involved in synaptic adhesion in neurons—was ranked as the top gene closely linked to key drivers of myogenesis subnetworks. Lower expression of NLGN4Y was associated with higher grade PCa and an increased risk of BCR. We show that restoration of the protein expression of NLGN4Y in PC-3 cells leads to decreased cell proliferation, migration and inflammatory cytokine expression. Our results suggest that NLGN4Y is an important negative regulator in prostate cancer progression. More importantly, it highlights the value of IMBNs in generating biologically and clinically relevant hypotheses about prostate cancer that can be validated by independent studies.


Constructing IMBNs for prostate cancer by integrating gene expression and CNA data
Two prostate cancer data sets were used in the study, the TCGA prostate adenocarcinoma (PRAD) study [1] and Taylor prostate cancer study [2]. For the TCGA PRAD dataset, the gene expression and gene-based CNA data were downloaded from the TCGA data portal (https:// tcga-data.nci.nih.gov/tcga/). For the Taylor dataset, gene expression and CNA data were downloaded from the GEO repository with accession numbers GSE21034 and GSE21035. All data are log2 transformed.
It has been shown that CNA contributes most significantly to PCa tumorigenesis and progression [3]. CNA alters expression level of underlying genes directly. We defined cis-CNA genes as genes whose expression levels and their CNAs are significantly correlated. Then, we can decompose expression variance of a gene G into multiple parts as G cisCNA R * ε + (Supplementary Figure S12), due to its cis-CNA, due to its regulators R , and their interactions.
As searching for an optimal Bayesian network structure is a NP-hard problem, we can only include a limited number of genes with informative gene expression or CNAs as nodes in our network reconstruction procedure. For the TCGA PRAD data set, we selected 8,907 informative genes that were expressed in the tumor tissues (the mean expression levels >5) and whose expression levels varied (the standard deviation >0.5). We also included 3,012 cis-CNAs (p-value <0.01 for the Spearman's correlation between gene expression and CNA after multiple testing correction) as nodes. Similarly for the Taylor data set, we selected 6,955 informative genes with the mean expression levels >4 and the standard deviation >0.35 as well as 157 cis-CNAs as nodes.
The selected gene expression and gene-based CNA profiles were input into a Bayesian network reconstruction software package, RIMBANet (Reconstructing Integrative Molecular Bayesian Network) (4-7). A Bayesian network is a directed acyclic graph in which the edges of the graph are defined by conditional probabilities that characterize the distribution of states of each node given the state of its parents [8]. The network topology defines a partitioned joint probability distribution over all nodes in a network, such that the probability distribution of states of a node depends only on the states of its parent nodes: formally, a joint probability distribution p X ( ) on a set of nodes X can be decomposed as p X p X X ( ) ( |Pa( )), represents the parent set of X i . In our networks, each node represents expression level or CNA of a gene. These conditional probabilities reflect not only relationships between genes, but also the stochastic nature of these relationships, as well as noise in the data used to reconstruct the network.
Bayes formula allows us to determine the likelihood of a network model M given observed data D as a function of our prior belief that the model is correct and the probability of the observed data given the model: We can set a model M 's prior probability based on biological knowledge. For genes with cis-CNAs, we assume that the expression variations of these genes were directly affected by their CNAs. To represent the assumption, we set a structure prior p cis CNA G ( -) 1, i i → = which is equivalent to start a searching process with an initial structure with a set of → cis CNA G i i edges instead an empty initial structure. We also assume the cis-CNAs only affected expression levels of their cis genes directly, and any trans effects on other genes were through expression variations of their cis genes. Thus, we set the prior The number of possible network structures grows super-exponentially with the number of nodes, so an exhaustive search of all possible structures to find the one best supported by the data is not feasible, even for a relatively small number of nodes. We employed Monte Carlo Markov Chain (MCMC) [9] simulation to identify potentially thousands of different plausible network structures, which are then combined to obtain a consensus network (see below). Each reconstruction begins with a null network. Small random changes are then made to the network by flipping, adding, or deleting individual edges, ultimately accepting those changes that lead to an overall improvement in the fit of the network to the data. We assess whether a change improves the network model using the Bayesian Information Criterion BIC [10], which avoids overfitting by imposing a cost on the addition of new parameters. This is equivalent to imposing a lower prior probability P M ( ) on models with larger numbers of parameters.
Searching optimal BN structures given a dataset is an NP-hard problem. We employed an MCMC method to do local search of optimal structures. As the method is stochastic, the resulting structure will be different for each run. In our process, 1,000 BN structures were reconstructed using different random seeds to start the stochastic reconstruction process. From the resulting set of 1,000 networks generated by this process, edges that appeared in greater than 30% of the networks were used to define a consensus network. A 30% cutoff threshold for edge inclusion was based on our simulation study [11], where a 30% cutoff yields the best tradeoff between recall rate and precision. The consensus network resulting from the averaging process may not be a BN (a directed acyclic graph). To ensure the consensus network structure is a directed acyclic graph, edges in this consensus network were removed if and only if [1] the edge was involved in a loop, and [2] the edge was the most weakly supported of all edges making up the loop.

Network analysis
1) The degree of a node in a network is generally defined as the number of edges connecting the node to other nodes. As connections in constructed IMBNs are sparse, we define the degree of a node in an IMBN as the number of nodes that can be reached from the seed node within two hops. 2) Key regulator analysis is aimed to identify genes with high potentials to regulate a large number of genes when perturbed. We use the degree of a node in an IMBN defined above as a measurement of transcriptional regulation potential. Given d as the degree of a node, key regulators are defined as nodes with d d d 2 ( ). σ > + 3) A gene's network neighborhood was defined as genes whose distance to the seed gene is <= k. As the size of a gene's network neighborhood greatly affect the significance of functional annotation of the gene's network neighborhood by enrichment analysis, we aimed to define network neighborhoods resulting to similar sizes. We adjusted k according to the connectivity of each gene so that the defined neighborhood was of similar size for all genes. Particularly, for each gene, we chose the smallest k at which the number of genes in its neighborhood is >=100. 4) Sub-networks associated with Biochemical Recurrence (BCR): We first selected genes significantly associated (adjusted P-value < 0.01) with BCR using the Cox regression model, termed as "initial BCR genes". To remove sporadic associations and expand high confident associations, we then projected the initial BCR genes onto the IMBNs for prostate cancer constructed above and identified genes whose neighborhood is significantly enriched (p-value of Fisher's exact test <10 −6 ) for the initial BCR genes, which together formed BCR subnetworks. The genes in the BCR subnetworks are termed as "network selected BCR genes" collectively. We constructed sub-networks for genes positively and negatively associated with BCR separately.

Compilation of cancer genes and high confident prostate cancer related genes
We assembled a list of 152 high confident prostate cancer related genes (Supplementary Table S2). These genes have been identified in at least two previous studies related to prostate cancer (Supplementary Table S3). We also compiled a list of 813 known cancer genes (not restricted to prostate cancer) from cancer gene census [18] and KEGG cancer pathways [15] (Supplementary Table  S6).

ERG and AR signatures
A list of 87 TMPRSS2-ERG fusion signature genes (comparing PCa patients with and without the fusion) was collected from a previous study [19] (Supplementary Table  S7). We also compiled a list of 157 AR signature genes (Supplementary Table S8) from multiple sources including 27 AR transcriptional targets based on experimental perturbation [20] and two curated AR pathway gene sets: PID_AR_PATHWAY (61 genes) and HALLMARK_ ANDROGEN_RESPONSE (101 genes) from MsigDB database [16].

Human tissue atlas
Gene expression profiles of 126 primary human tissues were downloaded from http://xavierlab2.mgh. harvard.edu/EnrichmentProfiler/download.html. Given a gene, the preferentially expressed tissues are defined as the top 5 tissues with the highest gene expression level.

Patient's clinical information
For the Taylor's dataset, following radical prostatectomy, patients were followed with history, physical exam, and serum PSA testing every 3 months for the first year, 6 months for the second year, and annually thereafter. Biochemical recurrence (BCR) was defined as PSA ≥0.2 ng/ml on two occasions. For the TCGA PRAD dataset, all tumor samples in TCGA were from primary tumors prior any treatment. BCR events were reported in the dataset without specific definition of BCR.
In the Taylor's dataset, 21.6% of the patients were documented with clinical metastatic events, 72.7% were documented as "NO" in the metastatic event, and 5.6% were NA. For the TCGA dataset, there was no information about metastatic event.

Patient treatment information
The treatment information is sparse for the Tylor and TCGA datasets. For Tylor' dataset, 9.3% of patients are documented with chemotherapy, and the rest are documented as "NA"; 23.3% are documented with hormone therapy, and the rest "NA"; 16% are documented with radiotherapy, and the rest "NA". In summary., 6% of patients are documented with all three types of therapies, i.e., chemotherapy, hormone therapy and radiotherapy, 10.7% documented with two types of therapies, 9.3% documented with one type of therapy and 74% documented with NA for all three therapies. For TCGA dataset, 7.4% patients are documented as "YES" in adjuvant radiation, 40% as "NO", the rest 52.6% as "Not Available" or "Unknown". As for drug treatment, 11.4% are documented with hormone therapy, 1% with chemotherapy, and the rest with no information.