Using molecular functional networks to manifest connections between obesity and obesity-related diseases

Obesity is a primary risk factor for many diseases, including certain cancers. In this study, we developed three algorithms, a random-walk based method (OBNet), a shortest-path based method (OBsp) and a direct-overlap method (OBoverlap), to reveal obesity-disease connections in protein-interaction subnetworks corresponding to thousands of biological functions and pathways. Through literature mining, we also curated a list of obesity-associated diseases, against which we compared the methods. OBNet outperforms the other two methods. OBNet can predict whether a disease is obesity-related based on its associated genes. Meanwhile, OBNet identifies extensive connections between obesity genes and genes associated with a few diseases at various functional modules and pathways. Using breast cancer and Type 2 diabetes as two examples, OBNet identifies meaningful genes that may play key roles in connecting obesity and the two diseases. For example, TGFB1 and VEGFA are inferred to be the top two key genes mediating the obesity-breast cancer connection in modules associated with brain development. Finally, the top modules identified by OBNet in breast cancer significantly overlap with modules identified from a TCGA breast cancer gene expression study, revealing the power of OBNet in identifying biological processes involved in the disease.


INTRODUCTION
Obesity is a medical condition in which excess body fat accumulates, and it may have serious negative effects on human health. For example, it is known that obesity plays an important role in the development of many diseases, such as Type 2 diabetes (T2D), hypertension, cardiovascular disease, coronary heart disease and several types of cancers [1][2][3][4][5]. Obesity can also cause chronic low-grade inflammation, which contributes to the occurrence and development of metabolic disorders [6,7]. Over the past twenty years, there has been a sustained increase in the obese population, and the levels of overweight persons in many countries have already reached epidemic proportions [8], which presents an urgent need to study obesity and its association with multiple obesity-related diseases (ORDs).

Research Paper
At present, most studies focus on single mechanisms connecting obesity and multiple ORDs. For example, Ersilia et al. studied the role of adiponectin in obesity and obesity-related diseases and found that enhancing the expression of adiponectin may represent a useful therapeutic approach against obesity and ORDs [9]. There are also studies targeting the association between obesity and specific diseases. For example, Zhang et al. used an integrated network of obesity and T2D to study their connections [10]. However, a global view of the association between obesity and ORDs that simultaneously incorporates most biological processes and pathways and spans multiple disease types is still largely missing.
The key bottleneck is the lack of sufficient knowledge on the diseases at the whole genome and transcriptome levels.
Recently, with the rapid advances of various sequencing techniques, we are in a better position to close this gap. First, our knowledge about obesity and disease associated genes has expanded quickly. For example, the genome-wide association study (GWAS) Catalog provides a quality-controlled, manually curated and literature-derived collection of all published association studies, which has assessed more than 100,000 SNPs and their potential association with different traits and diseases [11]. By incorporating the SNP and gene associations, one can infer trait/disease associated genes through the GWAS Catalog [12]. The Online Mendelian Inheritance in Man (OMIM) database also curates a comprehensive and authoritative compendium of human genes and known Mendelian disorders. By now, it has collected information on all known Mendelian disorders and over 15,000 genes [46]. In addition, differential gene expression analysis and regression analysis between control and disease samples provide data-driven approaches to infer tissue-specific obesity and disease associated genes [13,14].
Second, our knowledge about gene interactions and gene signalling cascades has been enriched. For example, there are various protein-protein interaction databases, including the Human Protein Reference Database (HPRD) [15] and STRING [16]. In addition, tissue- and disease-specific gene co-expression and regulatory networks can be inferred from databases like Genotype-Tissue Expression (GTEx) [17] and The Cancer Genome Atlas (TCGA) [18] using machine learning based methods like the weighted gene co-expression network analysis (WGCNA) [19] and the Bayesian network. As a result, it is possible to study the interaction between various types of gene signatures on multiple-level molecular networks. For example, Wang et al. constructed a network approach to analyse the connections among aging and a few diseases [20]. Guney et al. proposed a network-based in silico drug screening using the shortest path between drug targets and disease genes in a protein interaction network [21].
In this paper, we have developed and compared three different algorithms to identify the putative connections between obesity and obesity-related diseases, namely (1) a random walk and gene set enrichment based network method called OBNet, (2) a shortest path based network method called OBsp and (3) a direct gene set overlap based method called OBoverlap. Using these methods, we try to answer a few questions, including: (1) Which diseases are more relevant to obesity at the molecular level? (2) Are there common biological functions or pathways involved in the connection between obesity and many ORDs? (3) Are there disease-specific obesity-related biological functions or pathways? (4) What genes are critical in mediating the connection between obesity and ORDs?

RESULTS
Previous studies suggested that the network concept can explain the connections between many traits and diseases [12,20,21]. As such, we modelled the obesity-disease association by the mutual reachability between obesity and ORD genes in protein interaction networks. Here the concept of reachability was employed to describe potential interaction (or mutual influence) between obesity and disease genes. Generally speaking, if two genes are close in a network, they might interact with each other. We developed three methods, called OBNet, OBsp, and OBoverlap, to quantify this reachability. Since the obesity-disease association might be enriched in a few biological processes [9], we also studied the network reachability on specific biological processes or pathways.

An overview of OBNet, OBsp and OBoverlap
We present an overview of the OBNet and OBsp algorithms in Figure 1. The major steps of OBNet are shown in Figure 1A; the procedure is similar to our previous algorithm GeroNet [12]. Specifically, a list of obesity and disease genes, a reference network, GO biological processes and KEGG pathways were first collected. The genes in each GO term and KEGG pathway were mapped onto the reference network to define a modularized network, which could be further expanded by a random walk with restart (RWR) procedure to construct an expanded modularized network. The (expanded) modularized networks present a network view of specific functions related to the GO function or KEGG pathway that defines them. After that, the obesity genes and disease genes were mapped to each (expanded) modularized network. The mutual reachability between obesity genes and disease genes was estimated by using RWR and a gene set enrichment analysis (GSEA) on three types of networks, namely the modularized network, the expanded modularized network and the whole network, corresponding to OBNet-Modularized network, OBNet-Expanded modularized network, and OBNet-Whole network, respectively. The significance of the mutual reachability was evaluated by a permutation analysis, in which the obesity genes were randomly permuted, and the significance p-value was adjusted for multiple testing. Finally, the diseases were ranked by their minimum adjusted p-values across all (expanded) modularized networks, and those with low adjusted p-values were considered obesity-related. In this way, we also identified GO biological processes and KEGG pathways in which obesity and disease genes are significantly associated (or reachable). The details of each step are presented in Materials and Methods.
OBsp (OBsp-Modularized network and OBsp-Expanded modularized network) is generally similar to OBNet (OBNet-Modularized network and OBNet-Expanded modularized network). The difference is that OBsp evaluates the reachability of obesity and disease genes by their shortest paths in the (expanded) modularized network (see Figure 1B and Materials and Methods). OBoverlap calculates the Jaccard coefficient between the set of obesity genes and the set of disease genes and ranks the diseases based on this coefficient.
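As an illustration of the direct-overlap idea behind OBoverlap, the following minimal Python sketch ranks diseases by the Jaccard coefficient between their gene set and an obesity gene set. The function names and the toy gene sets are hypothetical and chosen only for demonstration; they are not taken from the study's data or code.

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two collections of gene symbols."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_overlap(obesity_genes, disease_to_genes):
    """Return (disease, score) pairs sorted from highest to lowest Jaccard coefficient."""
    scores = {d: jaccard(obesity_genes, g) for d, g in disease_to_genes.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, made-up gene sets used purely for illustration.
obesity = {"MC4R", "POMC", "BDNF", "ADRB2"}
diseases = {
    "disease_A": {"MC4R", "POMC", "INS"},   # shares 2 genes with the obesity set
    "disease_B": {"TP53", "BRCA1"},         # shares none
}
ranking = rank_by_overlap(obesity, diseases)
```

Because this score ignores the network entirely, two gene sets that never share a gene but sit next to each other in the interaction network receive a coefficient of zero, which is the situation the network-based methods are meant to address.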

Comparison of OBNet, OBsp and OBoverlap
We adopted the obesity genes and disease genes from our previous work [12], which merges GWAS Catalog and OMIM disease genes (Supplementary Dataset 1), and used the STRING PPI network with confidence level 400 as the reference network. We then compared the three methods by their accuracy in predicting obesity-related diseases. For this purpose, we constructed a gold set of obesity-related diseases based on literature mining, which consists of 51 diseases (Supplementary Table 1). Specifically, we first searched all PubMed abstracts between 2009 and 2015, and ranked the diseases according to the Jaccard coefficient between the abstracts in which the disease name occurs and those in which the term "obesity" occurs (Supplementary Dataset 2). To make sure that the 51 diseases are really ORDs, we also searched the literature to confirm their association with obesity (Supplementary Dataset 3).
We compared 6 methods, i.e., OBNet-Modularized network, OBNet-Expanded modularized network, OBNet-Whole network, OBsp-Modularized network, OBsp-Expanded modularized network and OBoverlap, in predicting ORDs in the gold set, and plotted their receiver operating characteristic (ROC) curves in Figure 2 and their sensitivity in Supplementary Figure 2. OBNet-Expanded modularized network has an area under the curve (AUC) of 0.79, outperforming the other methods. This indicates that the association between obesity and ORDs is possibly mediated by a few biological processes and pathways, and that different pathways may contribute differently to the association [9]. Interestingly, direct overlap performs better than the shortest path based methods, which are commonly used in studying the connections between entities such as drugs and diseases [21]. This suggests that selecting an appropriate computational model is critical in data-driven studies. We adopted the best model, OBNet-Expanded modularized network, in all following studies.

Obesity related diseases predicted by OBNet-Expanded modularized network
We listed in Table 1 the top 40 diseases based on their association with obesity, and presented a full table of 147 diseases in Supplementary Table 2. As a criterion to evaluate obesity, body mass index ranks first, followed by autism spectrum disorder-bipolar disorder-schizophrenia. It is well known that obesity is prevalent among people, especially children, with autism spectrum disorder [22]. Interestingly, parental obesity is also a risk factor for autism spectrum disorder in children [23], which indicates that autism and obesity truly interact with each other. However, the mechanisms behind the interaction are largely unknown, and the functional modules mediating the interaction might shed some light on them.
Further down the list, we find a few heart diseases such as coronary artery disease and atrial fibrillation, metabolic diseases and traits such as Type 2 diabetes and HDL cholesterol, and cancers such as chronic lymphocytic leukaemia and breast cancer. The connection between heart diseases and obesity has been well recognized. According to the American Heart Association, obesity can raise blood cholesterol and increase blood pressure, and can induce many heart diseases such as coronary artery disease (http://www.heart.org/HEARTORG/HealthyLiving/WeightManagement/Obesity/Obesity-Information_UCM_307908_Article.jsp#.WO5TYPnyvcs). Many studies have shown that obesity is associated with diabetes, especially Type 2 diabetes [24,25]. For example, a cross-sectional study revealed that 75% of the patients with Type 2 diabetes in Brazil are overweight (BMI >25 kg/m²), among which 30% are obese [26]. The proportions of overweight Type 2 diabetes patients are 85% and 86%, respectively, in the United Kingdom [27] and the United States (Centers for Disease Control and Prevention (CDC) 2004), and those for obesity are 52% and 55%, respectively, for the two countries. Similarly, of patients with Type 1 diabetes, 55.3% are overweight (BMI ≥25 kg/m²), 16.6% are obese (BMI ≥30 kg/m²), and 0.4% have morbid obesity (BMI ≥40 kg/m²). Finally, it is of note that the association between obesity and a few cancers has also raised wide concern (https://well.blogs.nytimes.com/2016/08/24/obesity-linked-to-at-least-13-types-of-cancer/?_r=0).
According to breast cancer research in the United Kingdom, scientists estimated that 7% to 15% of breast cancer cases in developed countries are caused by obesity [28][29][30], and two cohort studies based on the Cancer Research UK study and the Million Women Study have found that obese women have a 30% higher risk of postmenopausal breast cancer than women with a healthy weight [31,32].
It is worth noting that several diseases in Table 1 were not classified as ORDs in the gold ORD set (Supplementary Table 1), e.g., metabolic syndrome, thyroid function and HDL cholesterol. They are possibly false-positive predictions by OBNet. However, we did find evidence supporting their connection with obesity. For example, Rashid and Genest found that obesity alters HDL cholesterol levels, which are associated with the development of coronary artery disease [33]. More attention should be paid to these diseases.
www.impactjournals.com/oncotarget

Function modules mediating obesity and diseases
Since the function modules mediating obesity and ORDs might be critical in revealing the biology underlying their connections, we zoomed into the significant modules, in which obesity genes and disease genes interact significantly (at FDR 0.05) in the GSEA analysis. As a result, we identified 1232 disease-module pairs involving 781 unique functional modules (Supplementary Dataset 4). Clearly, a disease should be more obesity-related if it interacts with obesity in multiple function modules. We plotted in Figure 3A the top 37 ORDs according to their number of significant function modules. Body mass index, Type 2 diabetes, and inflammatory bowel disease (IBD) rank in the top 3. The interaction of obesity and IBD has been a hot topic in recent years because both are highly prevalent in western societies. For example, Flores et al. showed that obesity is highly prevalent in IBD patients in the US population [34]. A recent study suggested that the association may be related to shared dietary or environmental exposures that exert their effect through changes in the intestinal microbiota [35].
Figure 1: Overview of the OBNet and OBsp algorithms. OBNet: A list of obesity and disease genes, a reference network, GO biological processes and KEGG pathways are first collected. The genes in specific GO term and KEGG pathway are mapped onto the reference network to define a modularized network, which could be further expanded by a random walk with restart (RWR) procedure to construct an expanded modularized network. After that, the obesity genes and disease genes are mapped to each (expanded) modularized network. The mutual reachability between obesity genes and disease genes is estimated by using RWR and a gene set enrichment analysis (GSEA). The significance of the mutual reachability is evaluated by using a permutation analysis, in which the obesity genes are randomly permuted, and the significance p-value is adjusted for multiple testing. Finally, the diseases are ranked by the minimum adjusted p-values across all (expanded) modularized networks and those with low adjusted p-values are obesity-related. OBsp: The mutual reachability of obesity and disease genes is estimated by the average shortest path between the two sets.
It is of note that different function modules play different roles in mediating obesity-disease association. Some modules (networks) mediate the connection of obesity and a wide range of diseases, while others are disease specific. We plotted in Figure 3B the top 40 networks most frequently involved in obesity-disease interactions. Metabolism-associated modules are the most prevalent in the figure. For example, GO:0006091_generation of precursor metabolites and energy, GO:0005996_monosaccharide metabolic process and GO:0006720_isoprenoid metabolic process rank 2nd, 7th and 8th, respectively. It is known that obesity has a significant impact on macronutrient metabolism, which might be a key factor in inducing obesity-related diseases [36].

Network view of the connection between obesity and ORDs and the key connector genes
For a better view of the network modules mediating obesity-disease associations, we plotted obesity genes and disease genes in the significant subnetworks. In addition, we performed key connector analysis (KCA, see Materials and Methods for details) to infer key genes connecting obesity and ORDs [12,55]. KCA has been proven to be effective in identifying important genes associated with a set of target genes in a network [10,12]. We selected two diseases commonly believed to be obesity-related, Type 2 diabetes and breast cancer, as case studies.

Case study 1: Type 2 diabetes
The connection between Type 2 diabetes and obesity is most significant on the module corresponding to GO:0050795_regulation of behaviour, with FDR 1.71E-6 (Supplementary Dataset 4). Thus, we focused on the subnetwork associated with the function "regulation of behavior", which consists of 497 genes and 23,135 interactions. We then performed KCA of the connecting genes on the subnetwork. For a better view, we retrieved the subnetwork consisting of the top 5 key connector genes and their neighbouring genes (see Figure 4A). The top 5 key connector genes are ADCY2, NPY, GCG, KNG1 and SST, respectively, among which ADCY2 is the most significant. Adenylyl cyclase type 2 (ADCY2) encodes a member of the family of adenylyl cyclases, which are membrane-associated enzymes that catalyze the formation of the secondary messenger cyclic adenosine monophosphate (cAMP) from ATP. Interestingly, this gene has protein interactions with many known obesity genes (e.g., MC4R, POMC, ADRB2, ADCY9 and BDNF) and also with many known Type 2 diabetes associated genes (INS-IGF2, GRK5, ADCY5) within the subnetwork, supporting its critical role in mediating the Type 2 diabetes-obesity interaction in the function module. In addition, other top key connectors such as NPY and GCG are also related to both obesity and Type 2 diabetes in humans [37].

Case study 2: Breast cancer
The connection between breast cancer and obesity is most significant on the module corresponding to GO:0030902_hindbrain development, with FDR 1.74E-3 (Supplementary Dataset 4). We focused on the subnetwork corresponding to the function "hindbrain development", which consists of 500 genes and 8186 interactions. Similarly, we performed KCA of the connecting genes and retrieved the subnetwork consisting of the top 5 key connector genes and their neighbouring genes (see Figure 4B). As can be seen from Figure 4B, the top 5 key connector genes are TGFB1, VEGFA, CTNNB1, GAPDH and PPARG, respectively, among which TGFB1 is the most significant. Transforming Growth Factor Beta 1 (TGFB1) encodes a secreted protein that performs many cellular functions, including the control of cell growth, cell proliferation, cell differentiation and apoptosis [38]. A few studies have suggested that TGFB1 is critical to both obesity and breast cancer. For example, Yadav et al. suggested that TGFB1/SMAD3 regulated the white adipose tissue (WAT) transcriptome in a mouse model of diet-induced obesity, and Candida et al. have shown a mechanistic relationship between TGFB1 and breast cancer [39]. As a result, TGFB1 might play a role in connecting obesity and breast cancer. Other top key connectors such as VEGFA, GAPDH and PPARG also play a role in breast cancer and obesity [12,40].

Validation of OBNet by a gene expression study
In addition to the numerous literature-supported findings identified by OBNet, we also performed an orthogonal validation using gene expression data. Specifically, we downloaded the gene expression data of 531 breast cancer and 62 matched normal samples from The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) (on Dec 16, 2015). We then applied WGCNA to construct gene co-expression modules from the 531 cancer samples, and obtained 29 co-expression modules. After that, we performed module differential connectivity (MDC) analysis [41] to identify the modules significantly perturbed by breast cancer. Specifically, for each module, MDC calculates the ratio between the average connectivity of all gene pairs for breast cancer samples and that of the gene pairs for normal samples. A module with MDC larger than (less than) 1 gains (loses) connectivity when changing from normal to cancer state. The significance of MDC is estimated by a permutation study on the samples [13,41]. As such, there are 10 modules significantly perturbed (at FDR 0.05) by breast cancer (see Supplementary Table 3 and Supplementary Dataset 5).
Figure 4: Network topology and key genes connecting (A) obesity and Type 2 diabetes in regulation of behaviour and (B) obesity and breast cancer in hindbrain development. We use node shape to denote key connectors: (1) square represents the top 5 key connectors; (2) circle represents expanded obesity and disease genes. We use fill colour to denote new (expanded) obesity and disease information: (1) red represents obesity gene; (2) blue represents disease gene.
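The MDC ratio described in this subsection can be sketched in a few lines of Python. The sketch below is only an illustration under explicit assumptions: connectivity between two genes is taken to be the absolute Pearson correlation of their expression profiles, every gene is assumed to have non-zero variance, and the function names and toy expression values are hypothetical. The permutation test on the samples is not shown.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length lists (non-zero variance assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_connectivity(expr):
    """Average |r| over all gene pairs; expr maps gene -> list of sample values."""
    pairs = list(combinations(expr, 2))
    return sum(abs(pearson(expr[g], expr[h])) for g, h in pairs) / len(pairs)


def mdc(expr_cancer, expr_normal):
    """Ratio of mean connectivity in cancer samples to that in normal samples."""
    return mean_connectivity(expr_cancer) / mean_connectivity(expr_normal)


# Made-up expression values for one three-gene module.
cancer = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [1, 3, 5, 7]}
normal = {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 1], "g3": [1, -1, -1, 1]}
ratio = mdc(cancer, normal)  # a value above 1 means connectivity is gained in cancer
```

In the toy module all three gene pairs are perfectly correlated in the cancer samples, while only one pair is in the normal samples, so the ratio exceeds 1 and the module would be read as gaining connectivity.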
We then compared the 10 MDC modules with the top 10 modules mediating the interaction between breast cancer and obesity as identified by OBNet (Supplementary Table 4). Specifically, for each module pair, we performed a Fisher's exact test on the overlap of their genes and calculated adjusted p-values using the Benjamini-Hochberg method. The adjusted p-values are shown in Table 2. As can be seen, 6 of the OBNet modules (i.e., GO:0033135_regulation of peptidyl-serine, GO:0030003_cellular cation homeostasis, GO:0006521_regulation of cellular amino acid metabolic process, GO:0048871_multicellular organismal homeostasis, GO:0055080_cation homeostasis, GO:0055082_cellular chemical homeostasis) overlap significantly with at least one MDC module (FDR<0.05). At the same time, 3 of the MDC modules also overlap significantly with OBNet modules. We then performed a permutation study to assess the significance of the overlap (between the top 10 modules of OBNet and MDC). We randomly shuffled the genes in the 29 WGCNA modules (keeping the number of genes in each module), selected the top 10 differential modules by MDC, and overlapped these modules with the OBNet modules. The process was repeated 1000 times. We then calculated the number of significantly overlapping OBNet modules for each run (see Supplementary Figure 1). As a result, the mean number of significant overlaps is 0.67 (±1.22), and the p-value of observing 6 or more significant overlaps is 5.16E-6, assuming a normal distribution. This indicates that our method could help to find significant co-expression modules associated with breast cancer without using gene expression data.
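The module-overlap test used above can be illustrated with a short, self-contained Python sketch: a one-sided Fisher's exact test computed from the hypergeometric tail, followed by Benjamini-Hochberg adjustment. The choice of a one-sided test, the `universe_size` parameter (the number of background genes) and the function names are assumptions made for illustration; the text does not state which variant or background was used.

```python
from math import comb


def fisher_overlap_p(set_a, set_b, universe_size):
    """One-sided Fisher's exact p-value: P(overlap >= observed) under random draws."""
    set_a, set_b = set(set_a), set(set_b)
    a, b = len(set_a), len(set_b)
    k = len(set_a & set_b)
    total = comb(universe_size, b)
    # Sum the hypergeometric probabilities of every overlap size from k upward.
    tail = sum(comb(a, i) * comb(universe_size - a, b - i)
               for i in range(k, min(a, b) + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: two small "modules" drawn from a universe of 10 genes.
p_overlap = fisher_overlap_p({1, 2, 3, 4, 5}, {4, 5, 6}, universe_size=10)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

In practice each OBNet module would be tested against each MDC module, and the adjustment would be applied to the full list of resulting p-values.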

DISCUSSION
In this paper, we used three different methods to identify the connections between obesity and obesity-related diseases, namely OBNet, OBsp and OBoverlap. OBNet on the expanded modularized network outperformed the other methods, indicating that the interaction between obesity genes and ORD genes might be enriched in specific functional modules and pathways. This observation is supported by the literature. For example, Gluckman and Hanson found that developmental and epigenetic pathways are critical in connecting obesity and diseases [42]. Singla et al. identified the roles of metabolic functions and pathways in connecting obesity and metabolic diseases [36]. It is worth noting that GWAS and OMIM disease genes might suffer from false positives and incompleteness, and thus the obesity and disease signatures used in this study might not be very accurate. Nevertheless, OBNet can predict obesity-associated diseases based on these genes with reasonable accuracy. As our knowledge of diseases and obesity accumulates, the performance of OBNet could be further improved in the future. In addition, OBNet can identify important disease-associated modules found in a gene expression study, which also confirms its potential in disease studies.
The co-expression modules were named by colors and the genes in each module were listed in Supplementary Dataset 5.
However, OBNet has a few limitations. First of all, OBNet does not reflect any tissue specificity. It is known that in different tissues, obesity might correlate with different diseases. For example, adipose tissue dysfunction relates obesity to diabetes and vascular diseases [43]. A possible solution is to make use of tissue-specific networks constructed from tissue-specific data such as Genotype-Tissue Expression (GTEx) [17]. Unlike the PPI network, which reflects general protein interactions, this kind of network can capture more tissue-specific gene interactions. Second, the mutual reachability of genes alone might not reflect all aspects of gene interactions. In OBNet, we treat each gene with equal importance, which is not generally true. A future direction is to infer the importance of obesity and disease genes based on their roles in shaping obesity and diseases, and to incorporate this information into the algorithm. Third, it is known that GO and KEGG have some overlap, which might have some influence on the algorithms. However, a previous study suggests that the influence might not be critical [12]. Finally, it might be useful to integrate various omics data such as gene expression into OBNet.
Although we studied the interaction between obesity and diseases in this study, the three methods proposed may have further applications. In principle, they could be used to study the interactions between any two traits or diseases. For example, by studying the reachability between drug targets (or perturbed genes) and disease genes, one can predict the sensitivity of a drug to a specific disease and meanwhile infer the major biological functions and pathways involved in drug response. Another interesting topic is the interaction between environmental factors (such as smoking, drinking or microbes) and diseases. However, this is beyond the scope of this study.

Collection of data and data pre-processing
In this paper, the obesity genes and disease genes were obtained from the NIH GWAS Catalog and Online Mendelian Inheritance in Man (OMIM) [44]. By merging the two sources, gene lists for 257 diseases were obtained (see Supplementary Dataset 1). The detailed merging method was provided in our recent work (Yang et al. 2016b).
The reference protein-protein interaction (PPI) network was extracted from the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) [45]. In this paper, we adopted the STRING400 network, i.e., the STRING PPI network with medium (400) confidence. Finally, we downloaded 2968 GO BPs from the Gene Ontology database and 197 KEGG pathways to generate various network modules.

Constructing gold ORD set
We applied literature-based text mining to evaluate whether a disease is associated with obesity. Specifically, we ranked each disease according to the Jaccard coefficient between the disease name and the term "obesity" in PubMed abstracts published from 2009 to 2015 (Supplementary Dataset 2). The PubMed abstracts containing the term "obesity" from 2009 to 2015 were retrieved by using Entrez Programming Utilities (http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=[obesity]+AND+2009:2015[pdat]&retmax=999999), and the abstracts containing the disease name, and those containing both the disease name and "obesity", were retrieved similarly. The Jaccard coefficient, a reasonable measure of the co-occurrence of obesity and a disease [46], is calculated as
$$J = \frac{|\mathrm{PubMedID}_{disease} \cap \mathrm{PubMedID}_{obesity}|}{|\mathrm{PubMedID}_{disease} \cup \mathrm{PubMedID}_{obesity}|}$$
where $\mathrm{PubMedID}_{disease}$ and $\mathrm{PubMedID}_{obesity}$ are the sets of PubMed IDs corresponding to the PubMed abstracts containing the disease name and the term "obesity", respectively.
Based on our knowledge, we selected diseases with a Jaccard coefficient larger than or equal to 0.004 as obesity-related diseases. However, it is of note that some obvious obesity-related diseases are not captured by this threshold. According to previous studies, diseases associated with obesity include cardiovascular diseases, metabolic diseases, serious psychiatric illness [47], coronary heart disease [48], Alzheimer's disease [49] and inflammatory diseases [50]. Therefore, schizophrenia, major depressive disorder, inflammatory bowel disease, Alzheimer's disease and four other diseases were added to the list. In the end, we obtained 51 diseases that are defined as ORDs and used in our study (Supplementary Table 1). To further validate the 51 diseases as ORDs, we also provided literature supporting them (Supplementary Dataset 3).

Methods to identify obesity-disease association
We used three algorithms to identify the association between obesity and diseases, namely OBNet based on a procedure similar to gene set enrichment analysis and a random walk with restart procedure, OBsp performed by using the shortest path algorithm, and OBoverlap based on direct overlapping between obesity and disease associated genes.

OBNet
OBNet is generally similar to our previous software GeroNet [12]. Specifically, we first mapped the genes of each KEGG pathway or GO BP to the PPI network to generate a variety of modularized networks. It is worth noting that we only used GO BPs or KEGG pathways with fewer than 500 genes and ignored those with overly large gene sets. Based on each modularized network, an expanded modularized network was obtained by using a random walk with restart (RWR) procedure until it reached 5 times the original gene set size or a maximum of 500 genes. A modularized network or an expanded modularized network is considered as a module. Second, we mapped obesity genes and disease genes to each module or the whole PPI network and then used RWR and a procedure similar to gene set enrichment analysis (GSEA) to estimate the mutual reachability between obesity genes and disease genes (see below). If too few obesity genes or disease genes are mapped to a network, RWR may not perform well. We therefore only considered modularized networks or expanded modularized networks that contain at least 5 obesity genes and 5 disease genes. Each disease is evaluated separately. After that, a permutation test is used to estimate the significance of the reachability between obesity genes and disease genes by randomly permuting the obesity genes.
RWR: For a PPI network $G = (V, E)$ which contains a set of proteins $V$ and a set of interactions $E$, an $n \times n$ adjacency matrix $A$ is used to represent the PPI network, where $n$ is the number of proteins. The entry at row $i$ and column $j$ is set to 1 if there is an interaction between protein $i$ and protein $j$; otherwise it is set to 0. The adjacency matrix $A$ is then normalized; we write $W$ for the normalized matrix. The random walk starts from a set of seed genes, such as obesity genes, disease genes or modularized network genes. The initial state $P_0$ is represented by a column vector $P_0 = [\psi_1, \psi_2, \ldots, \psi_n]^T$, where $\psi_i$ is set to $1/m$ for the $m$ seed genes and 0 for the other genes in the network. The walker then randomly visits adjacent genes at every tick of time ($t \to t + 1$). The state probabilities $P_{t+1}$ at time $t + 1$ are calculated as
$$P_{t+1} = (1 - r)\, W P_t + r P_0$$
where $P_t$ is the vector of probabilities at time $t$ and $r$ is the restart probability (i.e., the probability of starting again from the seed genes). For simplicity, we set $r$ to 0.5 in this study. The process stops when it reaches a steady state, i.e., when the difference between $P_t$ and $P_{t+1}$ is smaller than 1e-6, a threshold used by previous studies [51].
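The random walk with restart described above can be sketched in plain Python as follows. This is an illustrative sketch rather than the implementation used in the study: it assumes column normalization of the adjacency matrix (each node splits its probability equally among its neighbours), measures the difference between successive states with the L1 norm, and uses a hypothetical four-node toy network.

```python
def random_walk_with_restart(adj, seeds, r=0.5, tol=1e-6, max_iter=10000):
    """adj maps node -> set of neighbours (undirected); returns node -> probability."""
    nodes = sorted(adj)
    seeds = sorted(set(seeds) & set(adj))  # ignore seeds that are not in the network
    p0 = {v: 0.0 for v in nodes}
    for s in seeds:
        p0[s] = 1.0 / len(seeds)           # 1/m for each of the m seed genes
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: r * p0[v] for v in nodes}             # restart term
        for u in nodes:
            if adj[u]:                                   # isolated nodes pass nothing on
                share = (1.0 - r) * p[u] / len(adj[u])   # assumed column normalization
                for v in adj[u]:
                    nxt[v] += share
        diff = sum(abs(nxt[v] - p[v]) for v in nodes)    # assumed L1 difference
        p = nxt
        if diff < tol:
            break
    return p


# Toy path network a - b - c - d with a single seed gene "a".
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
steady = random_walk_with_restart(toy, seeds=["a"])
```

Genes closer to the seed end up with higher steady-state probability, and it is this ordering that the enrichment step consumes.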
Obesity-disease association on module: We use a method similar to GSEA [52] to calculate a score that indicates the reachability between a set of obesity genes and a set of disease genes within a module. Using the disease genes as seed genes, we run RWR and sort the genes of the module by the resulting probabilities. We then walk down the sorted gene list and keep a running score: for each gene that is not an obesity gene, $-\sqrt{G/(N-G)}$ is added to the score, where N is the number of genes in the module and G is the number of obesity genes; otherwise, $\sqrt{(N-G)/G}$ is added. This generates a curve, and the peak value of the curve is defined as $ES_1$. Similarly, we calculate $ES_2$ using obesity genes as seed genes. The two scores are then combined, using a parameter β, into the enrichment score $ES_\beta$. To assess the significance of $ES_\beta$, we permute the obesity genes in the module 100 times to obtain the null distribution of enrichment scores. From this null distribution, $ES_\beta$ is converted to a normal z-score, from which a p-value is calculated and adjusted for multiple testing. We then define the p-value of an obesity-disease association as the minimum adjusted p-value over all relevant modules. The diseases are ranked by their p-values.
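The running-sum score and its permutation-based z-score can be sketched as follows. The sketch assumes the classical GSEA increments, $\sqrt{(N-G)/G}$ for a target gene and $-\sqrt{G/(N-G)}$ otherwise, and a one-sided normal p-value. It computes a single directional score only and does not attempt the β-weighted combination of $ES_1$ and $ES_2$. All function names are hypothetical.

```python
import math
import random
import statistics


def enrichment_score(ranked_genes, target_genes):
    """Peak of the running sum over a gene list sorted by RWR probability."""
    targets = set(target_genes) & set(ranked_genes)
    n, g = len(ranked_genes), len(targets)
    if g == 0 or g == n:
        raise ValueError("need at least one target and one non-target gene")
    hit = math.sqrt((n - g) / g)    # added for a target gene
    miss = -math.sqrt(g / (n - g))  # added for any other gene
    running = peak = 0.0
    for gene in ranked_genes:
        running += hit if gene in targets else miss
        peak = max(peak, running)
    return peak


def permutation_z(ranked_genes, target_genes, n_perm=100, seed=0):
    """Observed score, z-score and one-sided p-value from label permutation."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, target_genes)
    g = len(set(target_genes) & set(ranked_genes))
    # Null distribution: random gene sets of the same size within the module.
    null = [enrichment_score(ranked_genes, rng.sample(ranked_genes, g))
            for _ in range(n_perm)]
    sd = statistics.stdev(null)
    if sd == 0:
        raise ValueError("null distribution has zero variance")
    z = (observed - statistics.mean(null)) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of N(0, 1)
    return observed, z, p_value
```

The two increments are chosen so that the running sum returns to zero at the end of the list, so a high peak indicates that the target genes are concentrated near the top of the ranking.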
We evaluated OBNet by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, comparing the inferred ranking of diseases against the gold-standard ORD list. On this basis, we set β to 0.1, which achieved the best performance.
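For a ranked list, the AUC equals the probability that a randomly chosen gold-standard disease is ranked ahead of a randomly chosen non-gold disease. The sketch below computes it directly from that definition. The function name and the disease labels in the usage note are hypothetical, and counting ties as one half is a common convention assumed here.

```python
def rank_auc(p_values, gold):
    """AUC of a disease ranking against a gold-standard set.

    p_values: dict disease -> p-value (smaller means ranked higher)
    gold:     set of diseases regarded as obesity-related
    """
    pos = [p for d, p in p_values.items() if d in gold]
    neg = [p for d, p in p_values.items() if d not in gold]
    if not pos or not neg:
        raise ValueError("need at least one gold and one non-gold disease")
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp < pn:
                wins += 1.0   # gold disease ranked ahead
            elif pp == pn:
                wins += 0.5   # ties count as half (assumed convention)
    return wins / (len(pos) * len(neg))
```

For example, with p-values `{"A": 0.01, "B": 0.2, "C": 0.03, "D": 0.5}`, a gold set `{"A", "C"}` gives an AUC of 1.0, whereas `{"A", "D"}` gives 0.5.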

OBsp
OBsp is broadly similar to OBNet, except that it uses shortest paths to measure the reachability between obesity genes and disease genes. Specifically, for a given disease and a given module, we first calculated the shortest path between every pair of a disease gene and an obesity gene using the shortest.paths function of the igraph package in R. The disease-obesity distance for the module was calculated as the average length of these shortest paths over all gene pairs. Finally, the disease-obesity reachability was calculated as the minimum distance across all modules, and the diseases were ranked by this reachability.
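The OBsp distance can be sketched in Python with a breadth-first search, which yields shortest path lengths in an unweighted network. The study itself used igraph in R, so this is an illustration only. Skipping unreachable pairs is an assumption, since the text does not say how they were handled, and the function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def module_distance(adj, disease_genes, obesity_genes):
    """Average shortest path length over disease-obesity gene pairs."""
    lengths = []
    for d in disease_genes:
        if d not in adj:
            continue
        dist = bfs_distances(adj, d)
        # Assumption: pairs with no connecting path are skipped.
        lengths.extend(dist[o] for o in obesity_genes if o in dist)
    return sum(lengths) / len(lengths) if lengths else float("inf")


def disease_reachability(modules, disease_genes, obesity_genes):
    """Minimum module distance over a list of module networks."""
    return min(module_distance(adj, disease_genes, obesity_genes)
               for adj in modules)
```

A smaller value means the disease genes sit closer to the obesity genes in at least one module, so diseases would be ranked in increasing order of this value.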

OBoverlap
OBoverlap calculates the Jaccard coefficient between the set of obesity genes and the set of genes associated with each disease, and ranks the diseases by this coefficient.
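The Jaccard coefficient is the size of the intersection of two sets divided by the size of their union. A minimal Python sketch of the ranking follows. The function names are hypothetical, and returning 0 for two empty sets is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two gene collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_overlap(obesity_genes, disease_to_genes):
    """Diseases sorted by decreasing Jaccard coefficient with obesity genes."""
    scores = {disease: jaccard(obesity_genes, genes)
              for disease, genes in disease_to_genes.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because this score only counts shared genes, it cannot detect a disease whose genes are close to obesity genes in the network without being identical to them.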

Key connector analysis
We adopted a previously established software package, key driver analysis (KDA) [53], to identify key connectors in the PPI network. KDA was originally designed to identify "key regulators" in a directed regulatory network. When it is applied to undirected networks such as PPI networks, we refer to the key nodes as "key connectors", because such networks do not necessarily contain directional information [53]. These key connectors function more like "hub" genes than "master regulators". Specifically, KDA takes a set of genes G and an undirected gene network N as inputs. It offers two search strategies for identifying key connectors, dynamic neighbourhood search (DNS) and static neighbourhood search (SNS). We adopted DNS in this study: (1) It first generates a subnetwork $N_G$ consisting of all nodes in N that are no more than L steps (L = 2 in this study) away from the nodes in G. (2) For each gene g in $N_G$, DNS then searches for the genes at a distance of no more than h = 1, 2, …, H (H = 2 in this study) from g in $N_G$. This set of genes (not including g) is denoted by $N_G(\mathrm{HLN}_{g,h})$. For each h, the hypergeometric test is then used to calculate the enrichment of G in $N_G(\mathrm{HLN}_{g,h})$, with the genes in $N_G$ as background. The final enrichment p-value of each gene g is the minimum p-value across the h layers. (3) The Bonferroni correction is applied to adjust for multiple testing, and the genes with significant Bonferroni-adjusted p-values (≤ 0.05) are reported as key connectors.
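Steps (1) to (3) can be illustrated with a simplified Python sketch. It is not the KDA package: it assumes an upper-tail hypergeometric test, uses all genes of the subnetwork as the background population, and applies the Bonferroni factor over all genes tested. The network is assumed to be a dictionary of neighbour sets, and all function names are hypothetical.

```python
from collections import deque
from math import comb


def nodes_within(adj, sources, max_steps):
    """All nodes at most max_steps edges away from any source node."""
    dist = {s: 0 for s in sources if s in adj}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        if dist[u] == max_steps:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when drawing without replacement."""
    top = min(successes, draws)
    favourable = sum(comb(successes, i) * comb(population - successes, draws - i)
                     for i in range(k, top + 1))
    return favourable / comb(population, draws)


def key_connectors(adj, gene_set, L=2, H=2, alpha=0.05):
    """Return {gene: Bonferroni-adjusted p-value} for significant genes."""
    genes = set(gene_set) & set(adj)
    sub_nodes = nodes_within(adj, genes, L)               # step (1)
    sub_adj = {u: adj[u] & sub_nodes for u in sub_nodes}
    raw_p = {}
    for g in sub_nodes:                                   # step (2)
        best = 1.0
        for h in range(1, H + 1):
            hood = nodes_within(sub_adj, [g], h) - {g}
            if hood:
                hits = len(hood & genes)
                best = min(best, hypergeom_upper_tail(
                    hits, len(sub_nodes), len(genes), len(hood)))
        raw_p[g] = best
    n_tests = len(raw_p)                                  # step (3)
    return {g: p * n_tests for g, p in raw_p.items() if p * n_tests <= alpha}
```

In a toy network where one hub is linked only to genes in G and a second hub is linked mostly to background genes, this sketch reports the first hub and not the second, which matches the "hub" interpretation of key connectors given above.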

Function enrichment
Functional enrichment analysis was performed with DAVID Bioinformatics Resources 6.8.