Identification of the potential crucial genes in invasive ductal carcinoma using bioinformatics analysis

Invasive ductal carcinoma (IDC) is a common histological type of breast cancer. The aim of this study was to identify the potential crucial genes associated with IDC and to provide valid biological information for further investigations. The gene expression profiles of GSE10780 which contained 42 histologically normal breast tissues and 143 IDC tissues were downloaded from the GEO database. Functional and pathway enrichment analysis of differentially expressed genes (DEGs) were performed and protein-protein interaction (PPI) network was analyzed using Cytoscape. In total, 999 DEGs were identified, including 667 up-regulated and 332 down-regulated DEGs. Gene ontology analysis demonstrated that most DEGs were significantly enriched in mitotic cell cycle, adhesion and protein binding process. Through PPI network analysis, a significant module was screened out, and the top 10 hub genes, CDK1, CCNB1, CENPE, CENPA, PLK1, CDC20, MAD2L1, HIST1H2BK, KIF2C and CCNA2 were identified from the PPI network. The expression levels of the 10 genes were validated in Oncomine database. KIF2C, MAD2L1 and PLK1 were associated with the overall survival. And we used cBioPortal to explore the genetic alterations of hub genes and potential drugs. In conclusion, the present study identified DEGs between normal and IDC samples, which could improve our understanding of the molecular mechanisms in the development of IDC, and these candidate genes might be used as therapeutic targets for IDC.


INTRODUCTION
Breast cancer is the most common malignancy diagnosed in women worldwide, and is the second leading cause of cancer death among women. It has become one of the austerity issues in the world. Although the prognosis of patients is generally favorable due to the early detection and comprehensive treatment, the morbidity of breast cancer is rising. Additionally, the recurrence rate remains high and 20%-30% of patients will develop distant metastases with a median two-year survival time [1,2]. Several studies have revealed that the rise of breast cancer cases is often due to the inherent susceptibility of genes and it is easy to relapse even after surgery of removing the primary tumor [3].
Invasive breast carcinoma is a heterogeneous group of tumors that exhibit different morphological spectrums and clinical behaviors. Treatment strategies for patients are designed based on the histological characteristics of tumor and other prognostic factors [4]. Breast cancer type is one of the most vital characteristics, and it is an important prognostic factor for breast cancer patients. To treat patients with invasive breast carcinoma, it is necessary for us to understand the specific biological characteristics of a given histological type [4]. According to a reported data, invasive ductal carcinoma (IDC) is a

Research Paper
Oncotarget 6801 www.impactjournals.com/oncotarget common histological type of breast cancer, accounting for about 75% of all invasive breast carcinoma cases [5]. Despite clinicians suggest that invasive ductal carcinoma always requires radical treatment, chemotherapy, and radiotherapy, to date, a lack of knowledge regarding the precise molecular targets for IDC limits the ability to treat advanced diseases [6].
Microarray is a high-throughput platform for the analysis of gene expression profiles, it has been widely used for investigating the underlying regulatory network involved in different types of cancer with great clinical applications: to improve the clinical diagnosis, and to discover new drug targets. Using microarray technology, several studies have exploited gene expression profiles of breast cancer and demonstrated prognostic significance. The analysis of BRCA1/2 mutation has been already used in clinical practice as a prognostic marker for breast cancer [7]. In a recent meta-analysis, AMDC2, TSHZ2, and CLDN11 were significantly related to the disease-free survival of breast cancer patients [6]. However, the results are often inconsistent due to sample heterogeneity.
In the present study, we have downloaded public microarray data from Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) to identify DEGs between IDC samples and normal samples. Subsequently, functions of DEGs were further analyzed by gene ontology (GO) annotation, pathway enrichment and proteinprotein interaction (PPI) network construction using bioinformatics methods. By way of identifying DEGs and analyzing their biological functions and key pathways, we will have a better understanding of the mechanisms of IDC pathogenesis, and explore the potential candidate biomarkers for early diagnosis, individualize the prevention and therapy of IDC patients.

Identification of DEGs
In the present study, 143 IDC samples and 42 normal samples in the dataset of GSE10780 were analyzed. Based on the cut-off criteria (adjusted P-value < 0.01 and |log2 foldchange (FC)|> 1), a total of 999 DEGs were identified, including 667 up-regulated and 332 downregulated DEGs. DEGs expression heat map is shown in Supplementary Figure 1, and the hierarchical cluster analysis of the data demonstrated that the DEGs could be used to distinguish IDC samples from the normal samples. We then analyzed GSE21422 to validate the results, the overlapping DEGs were identified using Venn diagram in Supplementary Figure 2.

Gene ontology enrichment analysis
Functional enrichment analysis of DEGs was performed using DAVID online tool. The DEGs were categorized into three functional groups: biological process (BP), molecular function (MF) and cellular component (CC). GO analysis results (Table 1) showed that in the biological process, up-regulated genes were mainly enriched in cell division, sister chromatid cohesion, mitotic nuclear division, chromosome segregation and DNA replication; down-regulated genes were enriched in cell adhesion, hemidesmosome assembly, response to drug, and positive regulation of nitric oxide biosynthetic process. For molecular function, up-regulated genes were significantly enriched in protein binding, microtubule binding, protein kinase binding, identical protein binding and protein heterodimerization activity; while down-regulated genes were enriched in heparin binding, transcriptional activator activity, RNA polymerase II core promoter proximal region sequence-specific binding, growth factor activity, protein homodimerization activity and Wnt-protein binding. In the cellular component analysis, up-regulated genes were mainly enriched in nucleoplasm, condensed chromosome kinetochore, cytosol, chromosome, centromeric region and chromosome, centromeric region; down-regulated genes were enriched in extracellular space, extracellular region, proteinaceous extracellular matrix, extracellular exosome and cell surface. The results demonstrated that most DEGs were significantly enriched in mitotic cell cycle, adhesion and protein binding process.

Pathway enrichment analysis
According to KEGG pathway analysis, significantly enriched pathways of DEGs were shown in Table 2. Up-regulate genes were mainly enriched in cell cycle, DNA replication, viral carcinogenesis, systemic lupus erythematosus and pyrimidine metabolism pathways; while down-regulate genes were significantly enriched in regulation of lipolysis in adipocytes, PI3K-Akt signaling pathway, PPAR signaling pathway and cytokine-cytokine receptor interaction pathways.
Moreover, the total of 955 nodes and 2246 edges were analyzed using plug-in MCODE, and the most www.impactjournals.com/oncotarget significant modules were screened out, which contained 13 nodes and 75 edges ( Figure 1). Strikingly, all of the genes in this module were up-regulated DEGs. According to GO enrichment analysis, in biological process, the genes were mainly associated with sister chromatid cohesion, cell division, mitotic nuclear division, mitotic sister chromatid segregation and chromosome segregation. In molecular function, these genes were significantly enriched in protein binding, histone kinase activity, kinetochore binding, anaphasepromoting complex binding and centromeric DNA binding. In the cellular component analysis, they were mainly enriched in condensed chromosome kinetochore, cytosol, chromosome, centromeric region, kinetochore and spindle pole (Table 3). KEGG pathway analysis demonstrated that these genes were mainly involved in cell cycle, oocyte meiosis and progesterone-mediated oocyte maturation pathways (Table 4).

Validation of the hub genes
To confirm the reliability of the hub genes, we used ONCOMINE (www.oncomine.org), a cancer microarray database and web-based data-mining platform to validate the expression levels of the 10 genes [8]. We performed the differential analysis between IDC and normal samples using TGCA datasets. Consistently, the top 10 hub genes were significantly up-regulated in IDC (Supplementary Figure 3). In order to identify the hub genes which would be potentially associated with overall survival of IDC patients, we evaluated the associations between hub genes' expression and patients' survival using Kaplan-Meier curve and Log-rank test. The results showed that 3 hub genes (KIF2C, MAD2L1 and PLK1) were associated with the overall survival ( Figure 2). We summarized the association of the three hub genes' expression levels and clinical features in Tables 5-7. The result demonstrated that MAD2L1and KIF2C were significantly associated with patients age, ER and PR status, tumor stage and size; PLK1was related to ER and PR status, tumor stage and size. The difference expressions of MAD2L, KIF2C and PLK1 among molecular subtypes were shown in Figure 3.

Mining genetic alterations and potential drugs of hub genes
We analyzed the 10 hub genes using cBioportal to explore their cancer genomic alterations in breast cancer and to find out potential drugs. Among the 10 breast cancer studies, alterations ranging for the hub genes were found from 0% to 51.7% (Figure 4). In the study of Curtis et al. and/or Pereira et al. [9], 634 cases (25%) had an alteration in at least one of the 10 genes queried. The frequency of alteration in each of the selected genes was shown in Figure 5. Mutual exclusivity analysis showed that 43 gene pairs had significant co-occurrent alterations  Table 1). The drugs for hub genes were showed in Figure 6. Among them, CCNA2, CDK1 and PLK1 were targets of most drugs.

DISCUSSION
IDC is a common histological type of breast cancer, it is important to understand the molecular mechanisms for the treatments. Nowadays, microarray has been widely used to analyze the expression changes of mRNA in breast cancer and predict the potential therapeutic targets. In the present study, we explored potential crucial genes associated with IDC using bioinformatics analysis. Based on the cutoff criteria, a total of 667 up-regulated and 332 down-regulated DEGs were identified from GSE10780.
Through GO enrichment analysis, in biological process, the up-regulated genes were most significantly involved in cell division. While down-regulated genes were significantly enriched in cell adhesion. It is easy to understand that uncontrolled cell division is the hallmark of cancers, and loss of cell-cell adhesion is an important step in the acquisition of the invasive, metastatic phenotype [10,11]. In the molecular function portion, upregulated genes were mainly enriched in protein binding, microtubule binding and protein kinase binding. It is pointed out that many microtubule binding proteins were associated with oncogenesis. Microtubule end-binding protein 1 (EB1) was demonstrated up-regulated both in human breast cancer specimens and cell lines. The level of EB1 could indicate the malignancy of breast cancer and is reported to be correlated with clinical characteristics, including higher histological grade, higher pathological tumor node metastasis stage (pTNM), and higher incidence of lymph node metastasis [12]. While down-regulated genes were most significantly enriched in heparin binding, a previous study reported that the expression of heparinbinding epidermal growth factor-like growth factor (HB-EGF) was inversely related to biological aggressiveness of the breast carcinoma, suggesting that HB-EGF may play an important role in process of breast carcinoma [13]. As for cellular component, up-regulated genes mainly located in the cell nucleus, and down-regulated genes were mostly enriched in extracellular positions. This result indicated that DEGs may participate in DNA replication and cell adhesion. GO enrichment analysis demonstrated that DEGs might play crucial roles in oncogenesis through cell division, adhesion and binding-related mechanisms.
Furthermore, pathway analysis revealed that upregulated genes were mainly engaged in cell cycle and DNA replication, suggesting DEGs may participate in cell proliferation. Down-regulated genes were enriched in regulation of lipolysis in adipocytes, PI3K-Akt signaling pathway, PPAR signaling pathway and cytokine-cytokine receptor interaction. Recent studies have demonstrated  Oncotarget 6805 www.impactjournals.com/oncotarget that obesity is associated with increased recurrence and reduced survival of breast cancer, and adipocyte lipolysis may play an important role in the provision of metabolic substrates to breast cancer cells. While further studies are needed to explore the complex metabolic symbiosis between tumor-surrounding adipocytes and cancer cells that stimulate their invasiveness [13,14]. In addition, PI3K-Akt signaling pathway and PPAR signaling pathway were confirmed that both play crucial roles in cell proliferation, invasion and metastasis [15], PI3K-Akt signaling pathway had been demonstrated to be activated in breast cancer, in the present study, 15 genes (IL6, FGF10, GNG11, KIT, PCK1, LAMB3, COL6A6, RELN, LAMC2, ANGPT1, TNN, PDGFD, EGF, PIK3R1, GHR) were down regulated in this pathway, and the result was validated using TCGA database. These genes are      not crucial genes in the PI3K-Akt signaling pathway, they may participate other pathways which can regulate tumorigenesis, the downregulation of these genes may promote the occurrence and development of tumor. Through PPI network construction, the module analysis revealed that the DEGs were mainly involved in cell division, cycle and binding-related mechanisms. And we listed the top 10 hub genes with higher degrees: CDK1, CCNB1, CENPE, CENPA, PLK1, CDC20, MAD2L1, HIST1H2BK, KIF2C and CCNA2. The 10 hub genes were validated in the TCGA database. PLK1, MAD2L1 and KIF2C were demonstrated significantly associated with overall survival and clinical features. In addition, these hub genes all had alterations in breast cancer.
CDK1 have been demonstrated to be a potential prognostic indicator, the protein encoded by this gene is a member of the Ser/Thr protein kinase family, which is essential for G1/S and G2/M phase transitions of eukaryotic cell cycle. S. J. Kim et al. considered that CDK1 is strongly associated with clinical outcomes of breast cancer patients, and regarded CDK1 as a new independent prognostic factor [16,17]. CCNB1 is a regulatory protein involved in mitosis, it is expressed predominantly during G2/M phase. It is reported that   CCNB1 is a power prognostic factor for the survival of ER+ breast cancer patients [18], and it is also involved in therapy resistance [19]. CENPE is a kinesin-like motor protein that accumulates in the G2 phase of the cell cycle, while CENPA is proposed to be a component of a modified nucleosome or nucleosome-like structure. CENPE and CENPA are two AU-rich elements (AREs) involved in the mitotic cell cycle, a recent study revealed that defects in ARE-mediated posttranscriptional control could lead to carcinogenesis. Recent studies have also shown that the survival of breast cancer patients is related to high levels of the mitotic ARE-mRNA signature [20]. PLK1 is a Ser/Thr protein kinase which performs important roles in the M phase of the cell cycle. PLK1 antagonizes p53 during DNA damage response, and alteration of mRNA and protein expression related to DNA damaging, replication and repairing was detected in PLK1-silenced tumor cells, including the DNA-dependent protein kinase (DNAPK) and topoisomerase II alpha (TOPO2A) [21]. M Wierer et al. reported PLK1 mediates estrogen receptor (ER)-regulated gene transcription in human breast cancer cells [22]. Evidence also revealed that breast cancer cells with treatment of siRNAs targeting PLK1 could improve the sensitivity toward paclitaxel and Herceptin [23]. CDC20 acts as a regulatory protein which is an essential component of cell division. High CDC20 is reported to be associated with an aggressive course of disease and poor prognosis [24]. MAD2L1 is a component of the mitotic spindle assembly checkpoint, and it may play crucial roles in the progression of breast cancer. Interestingly, MAD2L1 could inhibit the activity of the anaphase promoting complex by sequestering CDC20 until all chromosomes are aligned at the metaphase plate. Lowering the expression of MAD2L1 by siRNAs could reduce tumor cell growth and inhibit cell migration and invasion [25]. MAD2 overexpression has recently been shown to lead to tumor initiation and progression through the acquisition of chromosomal instability (CIN) in mice, tumors that experience transient MAD2 overexpression and consequent CIN results in markedly elevated recurrence rates [26]. It is reported that measuring the expression of MAD2L1 may assist the prediction of breast cancer prognosis [25]. The potential mechanisms of MAD2L1 in breast cancer require further investigation. HIST1H2BK is a core component of nucleosome, which participates in the pathway of activated PKN1 stimulates transcription of androgen receptor regulated genes KLK2 and KLK3. While the role of HIST1H2BK in breast cancer remains unclear. KIF2C is a kinesin-like protein that functions as a microtubule-dependent molecular motor. It can depolymerize microtubules at the plus end, thereby promoting mitotic chromosome segregation. T Abdelfatah et al. revealed the overexpression of KIF2C protein is associated with unfavorable clinic-pathological features and predicted poor clinical outcomes [27]. While the underlying mechanisms are not clear, further investigations are needed to identify the role of KIF2C in breast cancer. CCNA2 functions as a regulator of the cell cycle to promote transition through G1/S and G2/M. Several studies have demonstrated that CCNA2 has significant power to predict the survival of breast cancer patients and it is also found that CCNA2 was closely associated with tamoxifen resistance [28,29]. In our present study, only MAD2L1, KIF2C and PLK1 were associated the overall survival of IDC patients. The three genes all play important roles in the process of mitotic cell cycle. It is easy to understand that uncontrolled cell cycle is an important step of cancer occurrence and development, while further studies are still required to explore the mechanisms.
Using bioinformatics analysis, our study identified 667 up-regulated and 332 down-regulated DEGs. Among the10 hub genes, MAD2L1, KIF2C and PLK1 were potential biomarkers for the prognosis of IDC patients. The results of the present study may give valuable indication for basic and clinical research. However, further molecular biological experiments are needed in order to confirm the functions of identified DEGs.

Identification of DEFs in IDC
The gene expression profiles of GSE10780 were downloaded from the GEO database. GSE10780 which was submitted by Chen D et al. was based on GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) and contained 185 samples, including 42 histologically normal breast tissues and 143 IDC tissues. The probes without annotation of gene expression profiles were filtered and probes were transformed into gene symbol. The gene expression profile data was preprocessed using the Robust Multi-array Average (RMA) algorithmin affy package within Bioconductor (http://www. bioconductor.org) in R. After background correction, quantile normalization and probe summarization, we used the Linear Models for Microarray Data (LIMMA, http://www.bioconductor.org/packages/release/bioc/html/ limma.html) package in R to identify DEGs by comparing expression value between samples in IDC and normal group. The corresponding p value of gene symbols after classical T-test was defined as adjusted p-value, adjusted P-value <0.01 and |log2 foldchange (FC)|>1 were set as the cutoff criteria. And five healthy tissue samples and five IDC samples from GSE21422 (based on GPL570 platform) were analyzed to validate the results.

Gene ontology and pathway enrichment analysis of DEGs
Functional and pathway enrichment analysis of DEG were carried out using the database for Annotation, www.impactjournals.com/oncotarget Visualization and Integrated Discovery (DAVID, http:// david.abcc.ncifcrf.gov/). In the present study, we performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis based on DAVID online tools. The Go terms were classified into three categories, including cellular component (CC), biological process (BP) and molecular function (MF). P-value < 0.01 was considered as statistically significant differences. For KEGG pathway analysis, P-value < 0.01 was set as the cutoff criterion to identify the enriched pathways.

Construction of PPI network
The PPI network of DEGs in our study were constructed using Search Tool for the Retrieval of Interacting Gene (STRING) database. STRING is an online tool to predict the protein-protein interaction information, and can provide system-wide view of cellular processes. To evaluate the interactive relationships among DEGs, we established the PPI network using STRING, and "Confidence score > 0.7" was selected as the cut-off criterion. Then, PPI network was visualized by cytoscape software (http://www.cytoscape.org/). Molecular Complex Detection (MCODE) was subsequently applied to screen the modules of PPI network in cytoscape. The criteria were set as follows: "degree cutoff = 2", "node score cutoff = 0.2", "k-core = 2" and "max.depth = 100". The hub proteins are a small number of proteins that have many interactions with other proteins, to screen out these important nodes in the PPI network, the plug-in CytoHubba was utilized in the present study.

Validation of the hub genes
To confirm the reliability of the hub genes, we used ONCOMINE (www.oncomine.org), a cancer microarray database and web-based data-mining platform to validate the expression levels of the 10 genes [8]. We performed differential analysis between IDC and normal samples using TGCA datasets. In addition, the RNA sequencing data and clinical information were downloaded from TCGA database (https://cancergenome.nih.gov/). A total of 778 IDC cases were analyzed in our study. The RNA sequencing data were normalized using R language package. Patients clinical information included sex (male and female), age at diagnosis (<35, 35-49, 50-64, ≥65 years), race (white, black, Asian), ER, PR and HER2 status, and tumor-node-metastasis (TNM) stage. The prognostic value of each differentially expressed mRNA was evaluated using Kaplan -Meier survival curves by logrank test. A p-value < 0.05 was defined as significant.

Exploring cancer genomics data by cBioportal
The cBioPortal for Cancer Genomics (http://www. cbioportal.org/) provides visualization, analysis and download of large-scale cancer genomics data sets [30,31]. In this study, we used cBioPortal to explore the genetic alterations of hub genes and potential drugs.

Human participants and animal rights
This article does not contain any studies with human participants or animals performed by any of the authors.