Comprehensive mapping of the human papillomavirus (HPV) DNA integration sites in cervical carcinomas by HPV capture technology

Integration of human papillomavirus (HPV) DNA into the host genome can be a driver mutation in cervical carcinoma. Identification of HPV integration at base resolution has been a longstanding technical challenge, largely due to sensitivity masking by HPV in episomes or concatenated forms. The aim was to enhance the understanding of the precise localization of HPV integration sites using an innovative strategy. Using HPV capture technology combined with next generation sequencing, HPV prevalence and the exact integration sites of the HPV DNA in 47 primary cervical cancer samples and 2 cell lines were investigated. A total of 117 unique HPV integration sites were identified, including HPV16 (n = 101), HPV18 (n = 7), and HPV58 (n = 9). We observed that the HPV16 integration sites were broadly located across the whole viral genome. In addition, either single or multiple integration events could occur frequently for HPV16, ranging from 1 to 19 per sample. The viral integration sites were distributed across almost all the chromosomes, except chromosome 22. All the cervical cancer cases harboring more than four HPV16 integration sites showed clinical diagnosis of stage III carcinoma. A significant enrichment of overlapping nucleotides shared between the human genome and HPV genome at integration breakpoints was observed, indicating that it may play an important role in the HPV integration process. The results expand on knowledge from previous findings on HPV16 and HPV18 integration sites and allow a better understanding of the molecular basis of the pathogenesis of cervical carcinoma.


INTRODUCTION
Cervical carcinoma is one of the most commonly occurring cancers in women worldwide [1,2], while greater than 99% of the disease contains human papillomavirus (HPV) sequences [3,4]. Viral oncoproteins E6 and E7 encoded by HPV, which would respectively inactivate p53 and members of the pRb family, interfere with the cellular control mechanisms of the cell cycle. In addition, some studies demonstrated that these two oncoproteins also cooperatively disturb the mechanisms of chromosome duplication and segregation during mitosis and induce thereby severe chromosomal instability [5].
Studies have shown a significant correlation between integration and progression of cervical dysplastic lesions to invasive carcinomas. In pre-invasive lesions, HPV16 integration has already been found to occur in cervical intraepithelial neoplasia (CIN) [6], whereas in most cases of invasive carcinoma, HPV DNA sequences are frequently integrated into the host genome [7][8][9][10]. Although not much is known about the exact mechanisms of HPV genome integration, it is widely accepted that HPV integration can be a driver mutation in cervical carcinogenesis, with consequences influencing both the cellular and viral genomes. Recently, Akagi et al presented a model of "looping" by which HPV integrant-mediated www.impactjournals.com/oncotarget DNA replication and recombination may result in viral-host DNA concatemers, disrupting genes involved in oncogenesis and/or amplifying HPV oncogenes E6 and E7. Their results suggested that HPV directly promotes genomic instability [11]. Therefore, viral genome integration represents a crucial step in tumorigenesis, and elucidation of these particular integration events is pertinent for understanding HPV induced carcinogenesis.
Determining the DNA sequence of viral-cellular junctions would therefore indicate most directly HPV integration, as attempted in a number of previous studies where genomic HPV integration sites had been analyzed via many different assays. However, the precise identification of tiny stretches against a background of massive amount of episomal forms is an ongoing technical challenge. In addition, due to the nature of integration, which may be incorporated either as a single or multiple numbers of concatenated form, the latter can actually hinder integration detection [12]. Thus, development of more efficient methods would facilitate a comprehensive mapping of HPV integration sites to gain deeper insight into cervical carcinogenesis.
Here, in order to contribute to the understanding of the presence and the precise localization of HPV integration sites in cervical cancer, we studied a large series of cases using base-resolution genome-wide HPV capture technology with the next generation sequencing. More specifically, we simultaneously evaluate HPV prevalence and detailed characterization of the exact integration sites of the HPV DNA into the human genome in 47 primary cervical cancer samples and 2 cell lines.

RESULTS
In this study, 49 DNA samples including 47 from fresh-frozen cervical carcinomas and 2 cervical cancer cell lines (SiHa and HeLa) were detected using the method of HPV capture combined with next generation sequencing (Supplementary Table S1).

Determination of potential HPV integration sites
As described in the Bioinformatics Analysis method, if a specific position has one or more discordant read pairs mapped with one end to a human chromosome and the other to the HPV reference genome, it will be considered as a potential HPV integration site. A total of 319 potential HPV integration sites were discovered in 25 HPV DNA-positive cases including 19 cases for HPV16, 3 cases for HPV18, 1 cases for HPV58, and 2 cell lines (SiHa, HeLa), with frequencies ranging from 1 to 59 per sample (Supplementary Table S1). The prevalence of integrated HPV16 DNA was 63% (19/30), while the prevalence of integrated HPV18 DNA reached 100% (3/3). Additionally, HPV integration was detected in one HPV58 DNA-positive sample, but the HPV45 and HPV31positive cases showed no integration ( Figure 1 and Supplementary Table S1).

Validation and assay of HPV integration sites
In the attempt to confirm the newly uncovered HPV integration sites and further identify the exact integration site and the sequence between viral and cellular genome, all the 319 potential HPV integration sites were detected via targeted PCR amplification and Sanger sequencing. The Sanger sequencing validation rate based on one, two, three, or more than three discordant paired-end reads, was 3.7% (6/162), 47.8% (22/46), 44.4% (4/9), and 83.3% (85/102), respectively (Supplementary Table S1). The sequences of the fusion genes were first characterized by the NCBI human mega Blast database alignment tool and the UCSC Blat database to obtain the nucleotide resolution of the gene. A total of 117 unique HPV integration sites were identified, including HPV16 (n = 101), HPV18 (n = 7), and HPV58 (n = 9) (Table 1 and Supplementary  Tables S1 and S2); thus providing further basis upon which to perform additional research on these validated integration sites.
SiHa and HeLa are cell lines with defined integration sites, which provide an invaluable standard for validation of diagnostic assays (Supplementary Figure S1 and Supplementary Figure S2). Consistent with other studies, two integration sites were observed in SiHa cell line at loci 13q22.1 [11,13,14] and four integration sites were observed in concatenated form in HeLa cell line at loci 8q24.21 [14][15][16][17].
Additionally, the frequencies of HPV16 integration sites ranged from 1 to 19 per sample ( Figure 2B). That is, a single HPV integration site was found in 6 cases; 3 cases found two sites and 2 cases found three. Three additional cases found four sites each, and there was one case each with eight, twelve, fourteen, sixteen and nineteen sites (n = 99 sites) ( Figure 2B and Supplementary  Table S1). Altogether, 68% (13/19) of the HPV16 DNA-positive carcinomas were found to harbor more than one integration sites per sample.

Characterization of genomic DNA sequences adjacent to HPV integration breakpoints
Through PCR amplification and Sanger sequencing, we obtained the viral-cellular junction sequences.   Table S2). Overlapping, defined by nucleotides shared between viral and cellular sequence, was the most prevalent pattern accounting for 67% of the 117 validated viral-cellular sequences. Since some nucleotides between viral and cellular sequence were not aligned to a particular genomic sequence, we categorized them as the insertion of some unaligned nucleotides, which group amounts to approximately 27% of all 117 sequences. About 6% of the 117 viral-cellular sequences were present with direct ligation. Additionally, information of the cellular sequences was observed and collected with RepeatMasker. It was found that 51.3% of the 117 integration sites were distributed in repeating elements throughout the human genome (Supplementary Table S2), consistent with the actual proportion (50%) of repetitive sequence in the human genome [18]. All integration loci were checked for the presence of fragile sites. Of the 117 integration loci identified, 69 (59%) were located in or close to a fragile site (Supplementary  Table S2): 20 (17%) within common or rare fragile sites and 49 (42%) up to 5 Mb adjacent to fragile sites. 48 (41%) integration sites were not associated with fragile sites.
Meanwhile, the genomic regions within 50 kb of the viral-cellular junctions were investigated. Approximately 47% (55/117) of the 117 confirmed HPV integration sites were located within cellular genes, and 54 cellular DNA breakpoints were located in introns and one breakpoint was located in exon (Supplementary Table S2). Moreover, the following cellular genes occurred in at least two unique HPV integration sites, including CACNG7 (2), CAPG (2), HDAC4 (2), TP63 (2), CAGE1 (2), PVT1 (5), MACROD2 (6), and NFIB (6). The targeted genes are in the same orientation as the cellular sequence of viral-cellular junctions in 31 HPV integration sites, and in the opposite orientation in 24 HPV integration sites (Supplementary Table S2).

Correlation between HPV status and clinical information
Among the 47 patients with cervical cancer, 17 patients had received chemotherapy and radiotherapy before surgery, and the remaining 30 patients had not been treated before surgery (Supplementary Table S1). The prevalence of HPV DNA in 17 patients with treatment and in 30 patients without treatment before surgery was 76.5% (13/17), and 86.7% (26/30), respectively. However, there was no statistically significant difference (76.5% vs 86.7%, P = 0.435). In consideration of HPV integration status, we found that in 13 HPV DNA-positive patients treated with radiotherapy and chemotherapy before surgery, HPV integrations were not detected in 10 patients, including 8 patients positive for HPV16 and 1 patient positive for HPV45 and 1 patient positive for HPV58. Altogether, in 17 patients with cervical cancer who had received radiotherapy and chemotherapy before surgery, 82.4% (14/17) showed HPV negative or no HPV  Table 1 and Supplementary  Table S1. www.impactjournals.com/oncotarget integration sites (Supplementary Table S3). According to the frequency of HPV16 integration sites per sample, we found that all the 5 cervical cancer cases harboring more than four HPV16 integration sites had received diagnosis of clinical stage III carcinoma (Supplementary Table S1 and Figure 4). No significant difference was found between HPV integration and age at diagnosis (Supplementary Table S3).

DISCUSSION
Utilization of the PCR-based approach has prevailed in the existing attempts of detection of HPV integration sites in the human genome, such as Amplification of Papillomavirus Oncogene Transcripts (APOT) in discriminating HPV mRNAs derived from integrated and episomal viral genomes [19,20], Restriction Site PCR (RS-PCR) [21,22], Southern blot and Detection of Integrated Papillomavirus Sequences (DIPS) in detecting integrated HPV DNA regardless of its transcriptional status [15], and Real-time PCR, which can determine the physical status (integrated or episomal) of HPV estimated by calculating HPV E2:E6/E7 ratio by realtime PCR amplification of HPV E2 and E6/E7 [23]. However, all these assays usually are time-consuming and fall short of sensitivity, because of a battery of typespecific primers adopted and innate random distribution of HPV integration sites. Furthermore, sensitivity may be heavily influenced by a number of factors, such as the virus copy number, the physical status (episomal or integrated of HPV), and integrated form (in a single copy or in concatenated form) [12]. To overcome these and other potential limitations, HPV capture combined with next generation sequencing, which presents a more efficient method as described here, is not only able to detect the signals of multiple virus in a single run with high specificity and sensitivity, but has the ability to distinguish, from a diverse background of mainly episomal or concatenated forms, a discreet number of integrated forms, and further determinate the sequence between viral and cellular junctions.  6p24.3, 8q24.21, 9p22.3, 11p15.4, 13q22.1, 20p12.1, 20q11.22, and 20q13.2. www.impactjournals.com/oncotarget HPV capture in this study was designed based on probes from full-length HPV genome of 17 types. Thus, the HPV subtypes and HPV DNA integration sites were simultaneously detected in 49 cervical cancer samples (47 primary tumors and 2 cell lines). Out of which 82.9% of the 47 cervical cancer cases was HPV DNA positive, including HPV16(63.8%), HPV18(8.5%), HPV58(10.6%), HPV45 (2.1%), and HPV31 (2.1%). These percentages are relatively well aligned with existing findings [24][25][26], that is, HPV16 is by far the most prevalent type responsible for more than 50% of all the cervical cancer cases worldwide, followed by HPV18 and HPV58 and other high-risk HPV types that are less prevalent.
The sensitivity of the HPV capture combined with sequencing can be estimated from the results obtained from the well characterized cervical carcinoma cell lines SiHa and HeLa. SiHa, a HPV16-positive cervical cell line, harbors only one HPV16 DNA copy per cell [27,28]. Two validated HPV integration sites were found in SiHa cell line, as the same sites reported previously [10][11][12] (Supplementary Figure S1 and Supplementary  Table S2). Four HPV18 integration sites were detected on the chromosomal band 8q24.21 in HeLa cell line (Supplementary Figure S2 and Supplementary Table S2), which were consistent with previous studies [15,16]. Thus, our analysis had achieved 100% sensitivity and specificity compared with previous reported sites for SiHa and HeLa [11,15,16].
With this strategy, 117 unique and validated HPV integration sites and high-quality nucleotide sequences of the viral-cellular junctions were obtained. Of the 30 HPV16 DNA-positive cervical cancer samples, 63% (19/30) harbored HPV integration sites, similar to previously reported percentages [10,29], indicating the presence of a large number of integrated viral genomes in HPV16positive cervical carcinomas. Moreover, 68% (13/19) of the HPV16 DNA-positive carcinomas were found to harbor more than one DNA junctions per sample. This high percentage of multiple HPV16 integration sites exceeds 15~20% reported in previous studies using DIPS-PCR and RS-PCR methods [22,30,31], and also exceeds 49% reported in a recent study using TEN16 (Tagging, Enrichment, and Next-generation sequencing of HPV16) assay [10], indicating the more effective performance of HPV capture combined with sequencing adopted in this study for HPV DNA integration site analysis.
It is worthy to note that our data, obtained from 101 validated HPV16 integration sites including 2 integration sites from SiHa cell line, indicated that all the regions of HPV genome can be targeted for the linkage to cellular sequences, even E6 and E7. Doubt arises on the feasibility of the widely used E2:E6/E7 ratio by real-time PCR amplification to determine the physical status of HPV infection, which is based on the assumption that the viral E2 gene is frequently disrupted upon HPV integration, and the HPV E6 and E7 open reading frames are almost always retained, whereas other portions of the HPV genome may be deleted [32]. In contrast, the breakpoints of HeLa and 3 other HPV18positive cervical cancer cases observed to occur at viral E1, E2, L1 and LCR region have been reported [15,33].
Of the 4 HPV18 DNA-positive cervical carcinomas including HeLa, 100% (4/4) harbored HPV integration sites. This percentage conforms to the estimation of 100% of HPV18-positive cervical carcinomas harboring integrated viral DNA by Southern analysis [33]. The high integration frequency of HPV18 may be related to its greater transforming efficiency in vitro and its reported clinical association with more aggressive cervical cancers [34]. According to the frequency of HPV16 integration sites per sample, we divide them into two groups, one is more than four HPV16 integration sites, and the other is no more than four HPV16 integration sites (x-axis). The number of patients with clinical stages in two groups is shown (y-axis). PCR and Sanger sequencing were used to verify the HPV integration breakpoints based on one or more discordant paired-end reads. Although the low validation rate for one to three discordant paired-end reads may arise from technical error or integration occurred in a small subset of cancer cells, practically, the criterion using more than three discordant read pairs can cover most breakpoints identified (85/117). It should be noted that no explanation could sufficiently capture the failure of the remaining 16.7% of integration sites for validation based on more than three discordant reads, such as the number of discordant paired-end reads, and the proportion of repetitive sequence in the human genome. Additional research is warranted to reduce the false positive rate.
Viral integration within fragile sites has previously been reported in several studies [22,35]. Fragile sites are genomic regions prone to chromosome breaks that facilitate foreign DNA integration. Of the 117 integration loci identified, 69 were located in or close to a fragile site. It is note that the clustered integration sites on 6p24.3, 8q24.21, 13q22.1, and 20p12.1 were located within 1.8, 1.8, 0.8, and 3 Mb, respectively, of an adjacent fragile site, and that the clusters on 9p22.3, 11p15.4, 20q11.22, and 20q13.2 were located >5 Mb distant to any known fragile site. Thus, it is tempting to speculate that unknown fragile sites may be identified based on the HPV integration hotspots.
In total, 55 (47%) of the 117 confirmed HPV integration sites were located within cellular genes, most frequently in intronic regions. Moreover, several cellular genes existed in two or more unique HPV integration sites in the current study, including CACNG7, CAPG, HDAC4, TP63, CAGE1, PVT1, MACROD2, and NFIB. HPV integration into an intron will probably disturb expression of the targeted cellular gene. However, only few examples exist, because a direct link between HPV integration and gene alteration must be verified by functional data. TP63, a member of the p53 protein family, acts as a tumor suppressor protein and aberrant expression was noted for several cancer entities including cervical cancer [36,37], supporting the assumption that insertional mutagenesis of cellular genes by HPV integration contributes to cervical carcinogenesis.
In our data, overlapping was the most prevalent pattern accounting for 67% of the 117 validated viral-cellular sequences. A significant enrichment of overlapping nucleotides shared between the human genome and HPV genome at integration breakpoints was observed, indicating that it may play a vital role in the HPV integration process.
Additionally, we observed that the prevalence of HPV DNA in 17 patients who had been treated with radiotherapy and chemotherapy before surgery was lower than that in the group of 30 patients without the pre-operative procedures, but no statistically significant difference was found. While considering the status of HPV integration, we found that 82.4% showed negative for HPV or an absence of HPV integration sites in 17 patients with cervical cancer who had received pre-operative radiotherapy and chemotherapy. These results might indicate that radiotherapy and chemotherapy before surgery had some impact on HPV in cervical carcinoma, especially on HPV integration. A plausible explanation for the decrease of HPV integration is that these cells have been more effectively cleared by treatment. Additionally, we found that all the cervical cancer cases harboring more than four HPV16 integration sites showed clinical features and diagnosis of stage III carcinoma, indicating that the complexity of HPV integration might be associated with advanced cervical cancers.
Taken together, the method of HPV capture combined with sequencing allow a better analysis of multiple HPV subtypes at the same time and to identify the precise nucleotide sequences at the viral-cellular junctions as well as the affected cellular genes. The high number of 117 validated unique HPV integration sites deduced in the study demonstrates the effective performance of this strategy at single-nucleotide resolution. HPV integration is a special pattern of cancer mutation, which has the well-established potential to affect the cellular genes involved in cancer carcinogenesis. Therefore, identification of HPV integration in cervical cancers can provide new insight into the understanding of the cancer genome projects and its role in the pathogenesis of cervical carcinoma. Furthermore, because of the intrinsic random distribution of HPV integration sites, the efficient determination of HPV integration sites would allow them to be served as individualized markers in cervical cancer screening and in the follow-up of patients for the early detection of recurrent disease and metastasis in the future.

Sample collection and cell lines
A total of 47 fresh tissue specimens were collected from patients with cervical cancers who had undergone surgeries at Anyang Cancer Hospital, Henan province, China, between 2009 and 2010. Median age of these patients was 51 years with range 29 to 81 years. Among whom 42 had squamous cell carcinoma, 3 had adenocarcinoma and 2 adenosquamous carcinoma. The HPV16-and HPV18-positive human cervical carcinoma cell lines SiHa and HeLa with defined integration sites served as positive control. SiHa and HeLa cell lines had been obtained from AACT many years ago, and were often used for research purposes. Before undergoing experimentation in this study, these two cell lines had been identified using a PCR-based short tandem repeat (STR) genotyping method.
Individual informed consents had been collected from all study participants. This study received ethical approval from the Institutional Review Board of the hospital.

DNA preparation
DNA was extracted using DNeasy Blood & Tissue kit (Qiagen, Hilden, Germany) according to the manufacturer's protocol. The DNA concentration was determined via a Nano-Drop (NanoDrop Technologies, Wilmington, Delaware USA).

Overall process of HPV capture and sequencing
Prepare DNA whole-genomic library In order to establish the whole genomic library with the targeted gene specifically, a stretch of 3~5 µg genomic DNA was sheared to about 150 bp fragments by Covaris S2. These fragments were subsequently end blunted, "A" tailed, adaptor ligated and amplified 10 cycles by PCR. Out of which 1 µl of the prepared library samples was quantified using Nanodrop 2000, and 3 µl identified in 2% agarose gel and the final size of the electrophoresis fragment was around 300bp~500 bp.

Capture targeted gene regions and sequencing
HPV probes were designed according to full-length genome of 17 HPV types (6,11,16,18,31,33,35,39,45, 52, 56, 58, 59, 66, 68, 69, and 82) by MyGenostics (MyGenostics, Baltimore, MD, USA). The overall experiment was conducted according to the manufacturer's protocol. In brief, the whole-genomic libraries were hybridized with HPV probes (MyGenostics GenCap Technology), adsorbed onto the beads via biotin and streptavidin magnetic beads, and the uncaptured DNA fragments were removed by washing. Then the eluted fragments containing the targeted gene were enriched by 18 cycles of PCR to generate libraries for sequencing. Libraries were quantified and sequenced for paired-end 100bp using the Illumina HiSeq 2000 sequencer (Illumina Inc., San Diego, CA, USA).

Bioinformatics analysis method
The 100 bp paired-end reads were preceded into bioinformatics analysis. For quality control, we first filtered low quality reads using the Trim Galore program. Then, 3′/5′ adaptors were trimmed using the Cutadapt program implemented in Trim Galore, thereby rendering high quality clean reads, whose quality value exceeds 20 and read length greater than 80 bp. The high quality reads were obtained for subsequent analysis.
Illumina clean reads were mapped to human genome (GRCh37/hg19) and HPV genome of 17 types (6,11,16,18,31,33,35,39,45,52,56,58,59,66,68,69, and 82) using the BWA program. The quality scores were recalibrated and realigned to reference sequences using the GATK software package. For the potential HPV integration sites in the human genome, we aligned the Illumina clean reads to the reference human genome hg19 using the BWA program, and simultaneously performing recalibration and realignment with GATK. The paired-end read, uniquely mapped with one end to a human chromosome and the other to the HPV reference genome, is identified as a discordant read pair. If a specific position has one or more discordant read pairs, it would be considered as a potential HPV integration site and the breakpoints of which were then identified using the BreakDancer program. The integration sites were annotated according to human genome (GRCh37/hg19) from the UCSC database and HPV genome.
The sample was considered HPV DNA positive when the average sequencing depth was more than 10 and more than 50% of the virus genome's total length was covered by at least 4x.
The average sequencing depth was calculated based on sequencing tags containing HPV sequences and according to the following formula: Sequencing Depth = Total HPV bases/genome size. Genome size here means the size of the specific HPV genome.

PCR amplification and Sanger sequencing
To validate the potential HPV integration sites, we designed primers to amplify regions of interest by PCR. PCR products were sequenced directly from both directions using conventional Sanger sequencing. If unsuccessful, the integration sites were identified by TA cloning of PCR products and sequencing. All sequences of the fusion genes were characterized by the NCBI human mega Blast database alignment tool and the UCSC Blat database.

Statistical analysis
Fisher's exact test was used for statistical analysis in the present study. P-value of less than 0.05 was considered as statistically significant. All the P-values presented are two-sided.

GRANT SUPPORT
This work was supported by grants from Natural Science Foundation of China (81321003), and Beijing Municipal Science and Technology (Z151100001615022).

CONFLICTS OF INTEREST
No potential conflicts of interest were disclosed.