Computational modeling suggests impaired interactions between NKX2.5 and GATA4 in individuals carrying a novel pathogenic D16N NKX2.5 mutation

NKX2.5, a homeobox containing gene, plays an important role in embryonic heart development and associated mutations are linked with various cardiac abnormalities. We sequenced the NKX2.5 gene in 100 congenital heart disease (CHD) patients and 200 controls. Our analysis revealed a total of 7 mutations, 3 in intronic region, 3 in coding region and 1 in 3’ UTR. Of the above mutations, one mutation was found to be associated with tetralogy of fallot (TOF) and two (rs2277923 and a novel mutation, D16N) were strongly associated with VSD. A novel missense mutation, D16N (p-value =0.009744), located in the tinman (TN) region and associated with ventricular septal defect (VSD), is the most significant findings of this study. Computational analysis revealed that D16N mutation is pathogenic in nature. Through the molecular modeling, docking and molecular dynamics simulation studies, we have identified the location of mutant D16N in NKX2.5 and its interaction map with other partners at the atomic level. We found NKX2.5-GATA4 complex is stable, however, in case of mutant we observed significant conformational changes and loss of key polar interactions, which might be a cause of the pathogenic behavior. This study underscores the structural basis of D16N pathogenic mutation in the regulation of NKX2.5 and how this mutation renders the structural-functional divergence that possibly leading towards the diseased state.

To conduct the 3D model, the BLAST (from NCBI) was performed with the aim to identify the existing crystal structures with high identity and similarity. Overall, very less sequence similarity was found for N-term and C-term regions, however, exceptionally high similarity was found for HD region. Additionally, we performed disorder tendency and protein binding analysis through DISOPRED [3] and IUPRED [4], which highlight the probability estimation of each residue in the sequence ( Figure 1E), which claimed well that except HD region, N-term and C-term are unstructured and highly disordered. Henceforth, to construct 3D model, we divided the model building steps into two parts: first, to build HD region and then its N-term. The C-term was not constructed, as the aim was to explore the localization of novel identified mutation D16N which is localized in the N-term.
Building N-term model of NKX2.5 From the BLAST database, we failed to identify any protein structures (identification of template step) related to sequences of the N-term (from residue 1-137). The disorders prediction from DISOPRED server indicates that the N-term is highly unstructured and disordered. Therefore, we utilized comparative modeling strategy, shown as Scheme1: flowchart of molecular modeling pipeline (Figure 3), to build model for N-term region. Furthermore, the prediction of protein binding zone in the disordered N-term region was made by Anchor [7]. Apart from the PSIPRED, JPred [8] and RaptorX [9], tools were also used for the prediction of secondary structures in the disorder region of N-term. The fold recognition (FUGUE [10] and SAM-T06 [10]) and meta-threading and machine learning approach (eThread [11]) strategies were used for building the N-term region. Furthermore, to get the N-term model, the top models obtained from two different methods was compared with predicted secondary structure outcomes, and the consensus results which also matched with the result of ANCHOR (protein binding motifs regions) used as template for applying the comparative modeling strategy All models which have shown good DOPE-score, and Ramachndran plot were taken as template for model generation through Modeller v.-9.14(500 full-atom models). Now to merge the separately build, N-term and HD region, the templates of N-term and HD were used for further comparative modeling to get the final model (N-term + HD). In final models, the loop regions were refined using the LOOPY module of Jackal package [12]. Finally, the models were subjected to molecular dynamics simulations to remove bad contacts, steric clashes, and for optimization of hydrogens.
The biomolecular surface area, accessible to the solvent accessible surface area (SASA) [13] was performed. It is well known that hydrophobic amino acids, which are deeply buried in the protein core act as a driving force for folding of protein. Therefore, now a day, SASA used as a tool to distinguish correctly from incorrectly folded protein models. So according to this concept, in general the hydrophilic and polar amino acid should be exposed toward solvent and hydrophobic amino acid should be deeply buried and orient toward protein core.
The sequences alignment (with ClustalW) followed by model building through Modeller v.-9.14 (500 fullatom models). In each case, the best model was evaluated on the basis of DOPE-score, and Ramachandran plot was selected for further analysis. We used the ProSA program to check the fitness of the sequences relative to the obtained structures. The secondary structure predictions were performed similarly as in NKX2. 5.
For further analysis of robustness and quality of modeled proteins (NKX2.5 and GATA4, both), the ERRAT was used.

Protein-DNA docking
NKX2.5-DNA complex was formed by two-way. First, the DNA was taken from published crystal structure of NKX2.5-ANF complex PDB-ID "3RKQ" and "4S0H" and the complex with best minimized model was formed by overlay of the HD region and crystal of NKX2.5. Formed complex was further energy minimized to remove bad clashes. Secondary, the protein-DNA docking was performed through web servers HADDOCK2.2 [18] and NPDock [19]. The best complex was screened out by taking as standard, interaction of HD of NKX2.5 with DNA as reported in PDB ID: 3RKQ (Supplementary Figure 2A).

Protein-Protein Docking (P-P docking) NKX2.5-GATA4
An extensive multi-step protein-protein docking between best models of GATA4 and of NKX2.5 was performed by SWARMDOCK [20] and Maestro/PIPER (Schrödinger Inc.) tool [21]. The docking outcomes were separately evaluated to get the best complexes. Furthermore, the focused docking was conducted to get the proper orientation of the complexes (i.e binding mode analysis). However, to get more robust outcome and to validate the result of protein-protein docking we further performed the SiteMap [22] analysis through Schrodinger module.
Role of Mutation: We mutate D16 of NKX2.5 into N16 and relax the protein by using OPLS3 force field [23] in Schrödinger suite (Schrödinger Release, 2017-1) [24] the final protein is now NKX2.5 MT. The mutated protein was superimposed over wild type, to check the variability in the interaction map.

Construction of homeodomain (HD)
NKX2.5 HD adopts a unique canonical form of architecture, which is specific to HD containing proteins acting as transcription factor ( Figure 5). The best 3D model for HD (from amino acid residue 141-192) of NKX2.5 has been obtained using as template structures i.e. the PDB-IDs: "3RKQ" (chain a, 137-194, resolution: 1.7 Å) [25], "4S0H" (chain b, 138-194, resolution: 2.8 Å) [26] and "5FLV" (chain a, 140-191, resolution 3.0 Å) [27]. The Final model of HD has 97.8% residues in the most favored regions in Ramachandran plot, Prosa Z-score of -4.54 and Errat overall quality factor of 91.7 while, template 4S0H has 92.0% residue in the most favored region in Ramachandran plot (Supplementary Figure 3B), Prosa Z-score of -4.57 (Supplementary Figure 4B) and Errat overall quality factor of 98.0 (Supplementary Figure  5C). The HD zone appears well structured as in agreement with its low propensity to the disorder ( Figure 1E). The 3D homology model of HD matches very well with the secondary structure predictions and the architecture of secondary structure in 3-dimensional space completely mimic the published crystals "3RKQ" and "4S0H". The Dope score of model was showing the same trend as in crystals (Supplementary Figure 9A). The C-alpha RMSD analysis between HD region of final model of NKX2.5 and its template is 0.345Å, indicating the robustness and good quality of the model.

Construction of N-term region
The N-term region is positioned above the DNA binding site of HD in such a manner so that it makes a lid like structure, probably N-term regulates the on/off switch behavior. N-term model has 86.7% residues in the most favored regions in Ramachandran plot (Supplementary Figure 3C) and a Prosa Z-score of 4.28 (Supplementary Figure 4C). The solvent accessibilitycalculation (SASA) was carried out to further validate the model (Supplementary Figure 6A). Final model falls in the line with the result of SASA as the exposed residues lied at outer face of model away from the core/grove and exposed to solvent, while the deep and buried residues form hydrophobic inner globular core.

Modeling of GATA4
Human GATA4 is broadly divided into three big domains, viz. N-term, CORE and C-term. The N-term have two small Transcriptional activation domain (TAD), core contain two Zn-finger (ZnF) of the motif Cys-X2-Cys-X17-Cys-X2-Cys that directs binding to the nucleotide sequence element (A/T) GATA (A/G). [28] The C-term also has one large TAD. The  Figure  1C) [29]. Each ZnF made up of one helix of 10 amino acid and two anti-parallel beta sheets. (Figure 6B and S8) From the Figure 1F it is clear that the GATA4 have two strong protein binding zones one is localized at the in C-term (from amino acid residue 300-420), and other is at the center between two Zn-Finger residues 248-268.
However, at N-term, either very weak or no any protein binding region is identified.
Multiple sequence alignment of human GATA family proteins human GATA: GATA1 to GATA6, and mouse GATA: GATA1, GATA3 and GATA4; and from the crystal structures template PDBs, it was clear that core region from residue 214 to 326 remain highly conserved throughout its the family. However, while opposite to this very less similarity was found at in N-term and C-term regions. (Supplementary Figure 6B). The best 3D model of core region has been obtained by using as crystal structures (template PDB ID: "4HC7" (chain a, 259-366, GATA3-complex2, resolution: 2.7 Å) 3 ; "3VD6" (chain c, 200-318, 1.9 Å) 3 ; "4HC9" (chain a, 257-371, GATA3-complex1, resolution: 1.6 Å) 4 ; "4HCA" (chain a, 257-371, resolution 2.8 Å) and "2M9W" (chain a, 261-321, GATA4) shown in Supplementary Figure  9. This model of core region has 99.9% residues in the most favored regions in Ramachandran plot and a Prosa Z-score of 8.7. These values, compared also with those of the template structures (including ERRAT values, Supplementary Figure 5G and Dope_score, Supplementary Figure 9B), indicating the quality of the model.

Disorder, globularity and binding region prediction in GATA4
Similar to NKX2.5 the disorderness was also calculated for GATA4. As shown in Figure 1F that the amino acid residue from 1 to 170 and 320 to 425 are highly disordered (confidence score >0.8), residue 170 to 210 are moderate disordered (confidence score 0.6-0.8), residue 210 to 320 and 425 to 442 are structured (confidence score <0.5). The confidence score per residue-wise, in which the confidence score above 0.5 is counted as disordered region. From the Figure 1F it is clear that, the GATA4 have two strong protein binding sites, one localized in C-term disorder part (from residue 300-420) and, second is at connecting loop i.e between two ZnF-domains (residues 248-268). The globularity prediction ( Figure 1F) also matched with DISOPRED and IUPRED findings that the region beyond core region (ZnF domains) is nonglobular and highly flexible. So here we consider only ZnF domain regions for model building. Similar to NKX, the best model was evaluated on the basis of Prosa score which

Explanation of ERRAT
Although ERRAT is very effective in identifying erroneous and non-erroneous regions in model structure but have one weakness of higher sensitivity toward small deviations in atomic positions. It means those regions in protein which are flexible or have high "R" value, disordered or involved in disorder-to-order transition (during interaction with other partner proteins) have high error value, and because of this, although the structure of protein is compact, though the overall quality factor decreased significantly. It is well evident in case of human GATA3, PDB ID: 4HCA and 4HC7 (from 260-369, resolution 2.80 Ȧ). In both PDBs, the residue 304-316 has shown higher disorder tendency (> 0.5) ( Figure 1F). This region is flexible enough, as in 4HCA it is missing, while in 4HC7, it has shown very high B-factor. The flexibility and disorderedness impact the overall quality of proteins as in 4HC7, the calculated overall quality of is 73.6% (error value 99.0%), while in 4HCA it is 90% (Supplementary Figure 7E and 7F). Furthermore, the same analysis was performed on GATA4 model, where the region to be address are localized from residue 257-270. This region also has shown the disordered tendency > 0.5. The overall quality factor for model GATA4 is 79.9 (error value 99.0%) (Supplementary Figure 7G) which is considerably higher value than crystal 4HC7, indicating the reliability of our constructed model.
From this discussion, it is convincible that the regions in protein which are disordered have high error value in ERRAT score. In NKX2.5, the crystal information is available for HD region only. The overall quality factor for HD region is 91.7, which is near to its crystal structure, PDB: 4S0H (94.8). In the final model of NKX2.5 (N-term + HD), the overall quality factor decreased enormously (61.1). The abrupt drop of quality is due to disordered N-term region of NKX2.5 (Supplementary Figure 7D). Transfec server predicts total 13 putative sites with Relative profile score threshold set to 80%. The place marked in yellow show best motif. ΔΔGb value higher than zero means decrease in binding affinity and value less than zero means increase in binding affinity.