Autosomal InDel polymorphisms for population genetic structure and differentiation analysis of Chinese Kazak ethnic group

In the present study, we assessed the genetic diversity of the Chinese Kazak ethnic group using 30 autosomal insertion/deletion (InDel) loci and explored the genetic relationships between the Kazak group and 23 reference groups. The expected heterozygosity ranged from 0.3605 (HLD39) to 0.5000 (HLD136), and the observed heterozygosity ranged from 0.3548 (HLD39) to 0.5283 (HLD136). The combined power of discrimination and the combined power of exclusion for all 30 loci in the studied Kazak group were 0.999999999999128 and 0.9945, respectively. These data indicate that the panel of 30 InDels is highly efficient for forensic individual identification but may not have enough power for paternity cases. The interpopulation differentiations, PCA plots, phylogenetic trees and STRUCTURE analyses showed a close genetic affiliation between the Kazak and Uigur groups.


INTRODUCTION
In the last few years, insertion and deletion polymorphisms (InDels), a novel class of polymorphic markers, have attracted increasing attention in medical and forensic genetics. As biallelic markers, InDels combine desirable characteristics of both single nucleotide polymorphisms (SNPs) and short tandem repeats (STRs), which makes them promising genetic markers for forensic purposes. InDels can be readily analyzed by capillary electrophoresis, which is widely available in forensic DNA laboratories [1]. Moreover, InDels have small amplicons and relatively low mutation rates, which makes them well suited to degraded or ancient DNA [2,3]. Beyond forensic casework, ancestry informative InDels are good candidates for biogeographic ancestry analyses because their allele frequencies differ significantly between groups or populations [4].
The Kazak national minority, whose population exceeds one million, lives mainly in the Ili Kazak Autonomous Prefecture, Barkol Kazak Autonomous County and Mori Kazak Autonomous County in the Xinjiang Uygur Autonomous Region, China. Some Kazaks live in Qinghai and Gansu Provinces (http://en.people.cn/102759/102835/7562907.html). To date, one commercial kit, the Qiagen Investigator DIPplex reagent (Qiagen, Hilden, Germany), has been used for multiplex amplification of 30 autosomal InDels, and several population studies based on this kit have been published [5][6][7][8]. To further clarify the genetic background and origin of the Kazak ethnic minority, we collected bloodstain samples from the Kazak group in the Xinjiang Uygur Autonomous Region and obtained population data using this kit. We then calculated the statistical parameters of the 30 autosomal InDel loci and evaluated the population genetic differentiations between the Kazak group and 23 previously published populations.

Forensic statistical parameter analysis
Allele frequencies and forensic efficiency parameters of the 30 InDels in the Kazak group are shown in Figure 1 and Supplementary Table 1. After Bonferroni correction, the Hardy-Weinberg equilibrium (HWE) test showed no significant deviation from expectation at any locus (p>0.0017), with the lowest p value observed at the HLD97 locus (0.0455). The expected heterozygosity (He) ranged from 0.3605 (HLD39) to 0.5000 (HLD136), and the observed heterozygosity (Ho) ranged from 0.3548 (HLD39) to 0.5283 (HLD136). The match probability (MP), the typical paternity index (TPI) and the polymorphic information content (PIC) ranged from 0.3619 to 0.4736, 0.7749 to 1.0599 and 0.2955 to 0.3750, with mean values of 0.3971, 0.9344 and 0.3555, respectively. The power of discrimination (DP) ranged from 0.5264 (HLD39) to 0.6381 (HLD101). The highest power of exclusion (PE) was 0.2135 at the HLD136 locus, and the lowest was 0.0887 at the HLD39 locus. The combined power of discrimination (CDP) and the combined power of exclusion (CPE) for all 30 loci in the Kazak group were 0.999999999999128 and 0.9945, respectively. The high CDP demonstrates that the 30 InDels have sufficient potential for forensic individual identification. However, compared with a previous study of 21 STR loci in the Kazak group [9], the CPE value was relatively low, which suggests that the panel of 30 InDels should be treated only as a supplement to STR loci in kinship analyses.
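The relationships among several of these parameters can be illustrated with a short sketch. It uses the standard textbook formulas for a biallelic locus and for combining independent loci; the function names are hypothetical, and this is not the Powerstat spreadsheet used in the study. For a biallelic locus with allele frequency 0.5, the formulas reproduce the upper limits reported above (He = 0.5000, PIC = 0.3750), and 0.05/30 gives the Bonferroni threshold of about 0.0017.

```python
from math import prod


def biallelic_he(p):
    """Expected heterozygosity of a biallelic locus with allele frequency p."""
    q = 1.0 - p
    return 1.0 - (p * p + q * q)


def biallelic_pic(p):
    """Polymorphic information content of a biallelic locus with allele frequency p."""
    q = 1.0 - p
    return 1.0 - (p * p + q * q) - 2.0 * p * p * q * q


def combined_power(per_locus_values):
    """Combine per-locus powers (DP or PE) over independent loci: 1 - prod(1 - v)."""
    return 1.0 - prod(1.0 - v for v in per_locus_values)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    print(biallelic_he(0.5), biallelic_pic(0.5))
    print(round(bonferroni_threshold(0.05, 30), 4))
    # Two hypothetical loci, each with a power of 0.5 (illustrative values only).
    print(combined_power([0.5, 0.5]))
```

Because DP = 1 - MP at each locus, applying `combined_power` to the per-locus DP values is equivalent to computing 1 minus the product of the per-locus match probabilities.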

Linkage disequilibrium analysis
Linkage disequilibrium (LD) between the 30 InDels was tested using the SNPAnalyzer program. As shown in Supplementary Figure 1, no crimson block outlined by a thick black curve appears in the graph, and no significant LD was observed between any pair of InDels, with all r² values below 0.1 (data not shown). Thus, these genetic markers can be regarded as relatively independent in the subsequent statistical analyses.
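As a rough illustration of a pairwise r² value, the sketch below computes the squared Pearson correlation between genotype dosages (0, 1 or 2 copies of the insertion allele) at two loci. This composite, genotype-based estimate is an assumption chosen for simplicity; the text does not describe how SNPAnalyzer derives r², and the function name is hypothetical.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two lists of genotype dosages (0/1/2).

    x and y hold one dosage per individual, in the same individual order.
    Returns 0.0 if either locus shows no variation in the sample.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical dosages for four individuals at two loci.
    print(genotype_r2([0, 0, 2, 2], [0, 2, 0, 2]))
    print(genotype_r2([0, 1, 2, 1], [0, 1, 2, 1]))
```

Under this estimate, a pair of loci would be treated as approximately independent when the value falls below a chosen cutoff such as the 0.1 mentioned above.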

Interpopulation differentiations
Interpopulation differentiations were assessed using the analysis of molecular variance (AMOVA) method. As shown in Supplementary Table 2, we calculated locus-by-locus pairwise p values, based on the allele frequencies of the 30 InDel loci, between the studied Kazak group and 23 previously published populations: Kazak1 [7], Uigur [7], Yi [10], Xibe [11], Tujia [12], South Korean [13], She [14], two Tibetan groups [15], three Han populations from different regions [7,14,16], six Mexican groups [17], four European groups and a Uruguayan group [18][19][20][21]. The geographical locations of the studied Kazak group and the reference populations are shown in Supplementary Figure 2. The smallest differences were observed between the studied Kazak group and the Kazak1 and Uigur groups, with significant differences at 1 and 6 loci, respectively. The largest differences were found between the Kazak group and the six Mexican groups, at 21-26 loci. Among the 30 loci, HLD111 and HLD81 showed the highest population differentiation, with significant differences between the Kazak group and 21 of the compared populations, whereas HLD77, HLD93, HLD101 and HLD136 showed the lowest, with significant differences in only 7 pairwise comparisons. These results show that the allele frequency distributions of some InDel loci differ significantly among ethnic groups. Hence, allele frequency data for more InDels in more populations may be required for forensic applications.
As shown in Figure 2, a heat map of the pairwise Fst values between the Kazak group and the reference populations was drawn with the R statistical software [22]. The shade of blue in the heat map represents the genetic distance between each pair of populations: a darker color indicates a larger Fst value and a more distant genetic relationship, whereas a lighter color indicates a smaller Fst value and a closer genetic relationship. Close genetic relationships were again observed between the studied Kazak group and the Kazak1 and Uigur groups, whose pairwise Fst values appear as the lightest cells.
[...] components accounted for 88.60% of the total variance. Four distinct areas were observed, and the Kazak group again clustered with the Kazak1 and Uigur groups.
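The pairwise Fst values summarized in the Figure 2 heat map can be illustrated with a minimal sketch. It assumes a simple heterozygosity-based estimator, Fst = (H_T - H_S) / H_T, computed from allele frequencies of biallelic loci with equal population weights and no sample-size correction. This is not the estimator implemented in the software used in the study, and the function names and frequencies are hypothetical.

```python
def pairwise_fst(freqs_a, freqs_b):
    """Heterozygosity-based Fst between two populations over biallelic loci.

    freqs_a and freqs_b list the frequency of the same allele at each locus.
    Uses the ratio of sums over loci: sum(H_T - H_S) / sum(H_T).
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("both populations need the same number of loci")
    numerator = 0.0
    denominator = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        p_bar = (pa + pb) / 2.0
        h_total = 2.0 * p_bar * (1.0 - p_bar)  # heterozygosity of the pooled pair
        h_within = (2.0 * pa * (1.0 - pa) + 2.0 * pb * (1.0 - pb)) / 2.0
        numerator += h_total - h_within
        denominator += h_total
    return numerator / denominator if denominator > 0 else 0.0


def fst_matrix(populations):
    """Build a symmetric {name: {name: Fst}} table suitable for a heat map."""
    names = list(populations)
    return {
        a: {b: pairwise_fst(populations[a], populations[b]) for b in names}
        for a in names
    }


if __name__ == "__main__":
    # Hypothetical allele frequencies at two loci for three populations.
    pops = {"A": [0.40, 0.55], "B": [0.42, 0.50], "C": [0.80, 0.20]}
    for name, row in fst_matrix(pops).items():
        print(name, {k: round(v, 4) for k, v in row.items()})
```

The resulting table has zeros on the diagonal, and smaller off-diagonal values correspond to the lighter cells described for the heat map.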
We then carried out a population structure analysis of the 24 populations with the ADMIXTURE v1.23 software [24], which infers individual genetic ancestry coefficients for a given value of K (the number of hypothetical ancestral populations) and thereby reveals population structure. Although the 30 InDel loci are not ideal ancestry informative markers and have limited differentiation power, they were still able to distinguish the ancestries of the studied Kazak group and the other populations to some extent. As shown in Figure 4a, the ancestry components of the Kazak group were similar to those of the Central Asian groups (Uigur and Kazak1) at the different K values. More effective ancestry informative markers are still needed to identify and estimate the ancestry components of admixed populations in later studies. We also repeated the population structure analysis with the STRUCTURE program v2.2; the results are given in Figure 4b. At K=2, the Asian and European groups were almost entirely composed of the yellow and blue components, respectively, whereas the Kazak, Kazak1 and Uigur groups showed a mixture of the blue and yellow components. The Uigur and Kazak groups were better separated from the European and East Asian groups at K=3, in accordance with the output posterior probabilities, which indicated that K=3 was the most appropriate configuration (Supplementary Figure 3).

Genetic distances and phylogenetic analysis
Genetic distance (D_A distance) reflects the genetic divergence between populations: populations with similar allelic distributions have small genetic distances. To estimate the genetic distances between the Kazak group and the 23 reference populations, we calculated D_A values on the basis of the 30 InDels. As presented in Table 1, the smallest distance was found between the Kazak and Kazak1 groups (D_A=0.0007), followed by the Uigur group (D_A=0.0016). The largest genetic distance was found with the Mexican Amerindian group (D_A=0.0526). These findings were consistent with the Fst and PCA results mentioned above.
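A minimal sketch of the distance calculation follows, assuming that D_A here denotes Nei's D_A distance (Nei et al., 1983), D_A = 1 - (1/r) Σ_j Σ_i sqrt(x_ij · y_ij), where r is the number of loci and x_ij and y_ij are the frequencies of allele i at locus j in the two populations. For biallelic InDels each locus contributes two terms. The function name and the example frequencies are hypothetical.

```python
from math import sqrt


def nei_da(freqs_a, freqs_b):
    """Nei's D_A distance between two populations over biallelic loci.

    freqs_a and freqs_b list the frequency of the same allele at each locus;
    the other allele's frequency is taken as 1 - p.
    """
    if not freqs_a or len(freqs_a) != len(freqs_b):
        raise ValueError("both populations need the same non-zero number of loci")
    total = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        total += sqrt(pa * pb) + sqrt((1.0 - pa) * (1.0 - pb))
    return 1.0 - total / len(freqs_a)


if __name__ == "__main__":
    # Hypothetical frequencies at three loci for two similar populations.
    pop_x = [0.45, 0.60, 0.30]
    pop_y = [0.47, 0.58, 0.33]
    print(nei_da(pop_x, pop_y))
```

The distance is 0 for identical allele frequencies and 1 when the two populations share no alleles at any locus, so the small values reported in Table 1 for the Kazak1 and Uigur groups correspond to very similar allele frequency profiles.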
We further conducted phylogenetic reconstruction using two different methods. An unrooted tree was constructed with the PHYLIP software (version 3.6) from the allele frequencies of all InDel loci to reveal the genetic relationships between the studied Kazak group and the other populations. As shown in Figure 5a, the upper branch contained ten East Asian groups, whereas the lower one consisted of the Uruguayan group, four European populations and six Mexican populations. The Kazak, Kazak1 and Uigur groups lay between these two branches. A second phylogenetic tree, reconstructed from the D_A distances with the MEGA software using the neighbor-joining (N-J) method, is shown in Figure 5b; two main clusters can be seen in the dendrogram. The first cluster was composed of the East Asian groups, the Central Asian groups (Kazak, Kazak1 and Uigur), the Uruguayan group and the European groups, while the second consisted of the six Mexican groups. The Kazak group clustered first with the Kazak1 and Uigur groups and then with the other groups, which indicates close relationships between the studied Kazak group and the Kazak1 and Uigur groups. Yuan et al. presented an N-J tree based on 21 autosomal STR loci and also found that the Kazak group was closely related to the Uigur group [10]. A dendrogram based on 17 Y-chromosomal STRs likewise indicated a close relationship between the Kazak and Uigur groups [25], and the same result was obtained in previous HLA and mtDNA studies [26,27]. The clustering in the phylogenetic trees was in good accordance with the results of the interpopulation differentiations, PCA plots, genetic distances and STRUCTURE analyses above. Historically, the Uigurs were a branch of the Turkic people, while the Kazaks were formed through the long-term intermingling of the Turkic, Wusun, Khitan and Mongolian peoples.
Furthermore, the Kazaks and Uigurs were the main populations along the Silk Road of ancient China, and they share a common religious belief and culture, which indicates that the two groups, both located in Central Asia, have had a close geographic connection since ancient times. Gene flow between them may therefore have occurred and may explain the close affiliation between the Kazak and Uigur groups.

Samples collection and DNA isolation
Bloodstain samples were collected from 513 unrelated healthy individuals (200 males and 313 females) of the Kazak group residing in the Xinjiang Uygur Autonomous Region, China. Written informed consent was obtained from all participants in this study. The research was carried out according to the human and ethical research principles of Xi'an Jiaotong University Health Science Center, China, and was approved by the ethics committee of Xi'an Jiaotong University Health Science Center. During sample collection, we ensured that no two individuals were related by blood within at least three generations. Human genomic DNA was isolated from the bloodstain samples using the Chelex-100 method [28].

PCR amplification and InDels genotyping
Fluorescent multiplex PCR amplification of the 30 autosomal InDels was performed in a single reaction with the Investigator DIPplex reagent (Qiagen, Hilden, Germany) on a GeneAmp PCR System 9700 thermal cycler (Applied Biosystems, Foster City, CA, USA) following the manufacturer's instructions. InDel genotyping was performed by capillary electrophoresis on an ABI 3500 Genetic Analyzer (Applied Biosystems, Foster City, CA, USA), and the results were analyzed with GeneMapper v3.2 software (Applied Biosystems, Foster City, CA, USA).

Statistical analysis
Allele frequency distributions and forensic statistical parameters, including HWE, MP, Ho, PE, DP, PIC and TPI, of the 30 InDel loci were computed with the modified Powerstat (version 1.2) spreadsheet (Promega, Madison, WI, USA). The He values were calculated as described previously [29]. The locus-by-locus p values were estimated with the Arlequin software (version 3.0) using the AMOVA method. The heat map of pairwise Fst values was drawn in the R statistical software v3.0.2. SNPAnalyzer (version 2.0, Istech, South Korea) was used to test LD for all pairs of InDels. Two PCA plots were generated with the EIGENSOFT v6.0.1 software and MATLAB 2007a (MathWorks Inc., USA), respectively. Population genetic structure analyses were performed with the ADMIXTURE v1.23 and STRUCTURE v2.2 programs. Two phylogenetic trees were constructed, one with the PHYLIP software (version 3.6) from the allele frequencies of the 30 InDels and the other with the MEGA software (version 5.0) from the D_A values.

CONCLUSION
In this study, we obtained the allele frequencies and forensic parameters of 30 autosomal InDels in the Kazak group for use in population genetics and forensic science. We found that the panel was highly efficient for forensic individual identification but could only serve as a supplement to STR loci in paternity cases. The results of the interpopulation differentiations, PCA plots, phylogenetic trees and STRUCTURE analyses indicated a close genetic relationship between the Kazak and Uigur groups. Further studies are needed to better understand the origin and genetic background of the Kazak group.