Single-strand DNA library preparation improves sequencing of formalin-fixed and paraffin-embedded (FFPE) cancer DNA

Mathias Stiller; Antje Sucker; Klaus Griewank; Daniela Aust; Gustavo Bruno Baretton; Dirk Schadendorf; Susanne Horn

doi:10.18632/oncotarget.10827

Research Papers:

Single-strand DNA library preparation improves sequencing of formalin-fixed and paraffin-embedded (FFPE) cancer DNA

PDF | Full Text | Supplementary Files | How to cite

Oncotarget. 2016; 7:59115-59128. https://doi.org/10.18632/oncotarget.10827

Metrics: PDF 3258 views | Full Text 5626 views

Mathias Stiller1,2,3, Antje Sucker2, Klaus Griewank2, Daniela Aust4, Gustavo Bruno Baretton4, Dirk Schadendorf2, Susanne Horn1,2

1Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, D-04103 Leipzig, Germany

2Department of Dermatology, University Hospital, West German Cancer Center, University Duisburg-Essen, and German Consortium for Translational Cancer Research (DKTK), D-45147 Essen, Germany

3Department for Translational Skin Cancer Research, University Duisburg-Essen, and German Consortium for Translational Cancer Research (DKTK), D-45141 Essen, Germany

4Departments of Surgery and Pathology, Technical University Dresden, D-01307 Dresden, Germany

Correspondence to:

Susanne Horn, email: [email protected]

Mathias Stiller, email: [email protected]

Keywords: high-throughput sequencing, library preparation, formalin-fixed paraffin embedded (FFPE) tissue, cancer, whole exome sequencing

Received: April 27, 2016 Accepted: June 30, 2016 Published: July 24, 2016

ABSTRACT

DNA derived from formalin-fixed and paraffin-embedded (FFPE) tissue has been a challenge to large-scale genomic sequencing, due to its low quality and quantities. Improved techniques enabling the genome-wide analysis of FFPE material would be of great value, both from a research and clinical perspective.

Comparing a single-strand DNA library preparation method originally developed for ancient DNA to conventional protocols using double-stranded DNA derived from FFPE material we obtain on average 900-fold more library molecules and improved sequence complexity from as little as 5 ng input DNA. FFPE DNA is highly fragmented, usually below 100bp, and up to 60% of reads start after or end prior to adenine residues, suggesting that crosslinks predominate at adenine residues. Similar to ancient DNA, C > T substitutions are slightly increased with maximum rates up to 3% at the ends of molecules. In whole exome sequencing of single-strand libraries from lung, breast, colorectal, prostate and skin cancers we identify known cancer mutations. In summary, we show that single-strand library preparation enables genomic sequencing, even from low amounts of degraded FFPE DNA. This method provides a clear advantage both in research and clinical settings, where FFPE material (e.g. from biopsies) often is the only source of DNA available. Improving the genetic characterization that can be performed on conventional archived FFPE tissue, the single-strand library preparation allows scarce samples to be used in personalized medicine and enables larger sample sizes in future sequencing studies.

INTRODUCTION

Technical advances in recent years have enabled comprehensive genetic analyses that have greatly increased our knowledge of the pathogenesis of human cancers. Various molecular targets have been identified by genetic studies and the subsequent development of compounds targeting altered molecules and pathways has revolutionized cancer therapy.

A caveat of sophisticated genetic assays is that they frequently require high-quality DNA. Most large-scale genetic studies performed to date have thus relied on fresh-frozen material. However, fresh-frozen material is rarely available, the majority of tumor tissue is formalin-fixed and paraffin-embedded (FFPE), which is used for histopathological analysis and allows tissue to be conveniently stored with minimal decay for many years. Vast collections of FFPE tissues have been compiled over the last decades, however, are generally not accessible to modern genetic studies. The low quality and quantity of DNA obtained from FFPE material has impeded large-scale analyses, such as genome sequencing and RNAseq approaches [1, 2]. Decay during fixation and long-term storage of FFPE tissues results in DNA molecules being fragmented and chemically modified, e.g. by hydrolytic damage and crosslinks interconnecting DNA strands as well as DNA and proteins [1]. Methods that would enable high-throughput sequencing analysis of partially degraded tissues with low amounts of input material (i.e. 5 ng or less DNA) are highly sought after [2–5]. Several studies have applied standard library preparation protocols, e.g. by Illumina and SOLiD, to high amounts (> 1 μg) of FFPE input DNA [6, 7] as well as lower quantities ranging down to 50 ng [8, 9]. Only very occasionally these standard protocols have been applied to even lower amounts of FFPE DNA down to 5 ng [5, 10].

The majority of library preparation methods utilize double-stranded DNA. To enable analysis of damaged and fragmented DNA from fossils, these methods have been improved to work with subnanogram amounts of extracted DNA, allowing highly multiplexed target capture and sequencing of ancient DNA. The most recent development is a library preparation specifically targeting single-stranded DNA [11]. Only using this newly developed protocol enabled the genome of an ancient hominin specimen to be sequenced to high coverage, which had not been possible before due to low complexity, i.e. insufficient numbers of unique DNA molecules, in the sequencing libraries [11, 12].

Hence, this single-strand DNA library preparation method may also be beneficial for working with FFPE material as the protocol is optimized for highly fragmented DNA. Furthermore, FFPE-isolated DNA may be expected to be at least partially single-stranded, as isolation protocols generally apply heat to reverse the crosslinking of DNA and proteins.

In this study we test and compare two double-strand library preparation methods, one specifically designed and developed for ancient DNA [13] (hereafter called “MPI”) as well as a commercially available kit from New England Biolabs (hereafter called “NEB”) advertized to require only 5 ng of input DNA, to the single-strand preparation protocol [11]. All three methods were applied to DNA extracts derived from four to 19-year old melanoma FFPE tissue sections. Yield is first quantified in terms of total molecule numbers in the sequencing libraries. A subset of the libraries was then subjected to whole genome shotgun sequencing on an Illumina HiSeq or MiSeq instrument to further characterize the library molecules with respect to sequence complexity, molecule length distribution, base composition and potential damage and fragmentation patterns. Lastly, whole exome enrichment and sequencing was performed with FFPE DNA extracted from an array of different cancer types. Collectively these data will allow a comprehensive evaluation of the methods with respect to their suitability for routine processing of cancer FFPE DNA in large-scale studies, e.g. aiming at genome wide mutational profiling of various cancers from scarce input material like biopsies and histological sections.

RESULTS

Yield of library preparations

Three library preparation methods were tested on a set of 21 melanoma FFPE DNA samples and yielded a total of 2.04 × 10⁹–1.16 × 10¹⁰ molecules for the single-strand method, 3.94 × 10⁶–5.22 × 10⁷ molecules for the MPI double-strand method and 1.26 × 10⁶–1.44 × 10⁷ molecules for the NEB double-strand method (Table 1, Figure 1A). Library preparations with the single-strand method thus yielded between 68 and 5,018 times more library molecules than either of the double-strand methods. These differences were highly significant for both comparisons (two-sided Wilcoxon rank sum tests, P = 3.716 × 10⁻¹², n = 21, Figure 1B). The difference between both double-strand methods was less pronounced, however, the MPI method performed significantly better than the NEB kit (P = 7.995 × 10⁻⁰⁶). When comparing both library types it was apparent that single-strand libraries consist of shorter molecules than double-strand libraries (Figure 1C). Negative controls carried through library preparation and sequencing showed orders of magnitude lower molecule numbers, approximately ten times less reads after sequencing and a lower percentage of mapped unique reads than authentic samples (10% vs. on average 61% in genomic libraries and 1% vs. on average 17% in exome libraries, Tables 2 and 3).

Figure 1: Comparison of double-strand and single-strand libraries. (A) Overall copy number yield of library preparation measured in digital droplet PCR. (B) Comparisons of copy number yield. Single-strand (ss) library preparation outperforms both double-strand (ds) methods (P = 3.716 × 10⁻¹² for both, ss vs. dsNEB and ss vs. dsMPI comparisons). Custom ds library preparation (dsMPI) has higher yield than the commercial method (dsNEB, P = 7.995 × 10⁻⁰⁶ for dsNEB vs. dsMPI). Raw p-values from two-sided Wilcoxon rank sum test (n = 21 samples per library preparation). (C) Molecule length of 100,000 unique, mapped reads in single-strand and double-strand libraries. As merged reads were analyzed, the plot displays reads with a maximum length of 141 bp (2 × 76 bp reads before overlap merging) for ss libraries. (D) Estimated complexity of single-strand libraries (ss) is higher than of double-strand (ds) libraries.

Table 1: Samples and library preparation yield

Five ng of FFPE DNA yield higher numbers of molecules in single-strand (ss) than in double-strand (ds) DNA library preparation. Ds MPI: a custom method developed for ancient DNA at the Max Planck Institute EVA Leipzig [13]. Ds NEB: a commercial method from New England Biolabs. N.a. not available. N.d. not determined. NTC. Negative control (also see methods section). For a more detailed overview of the order of sequencing experiments see Supplementary Table S1.

Sequence complexity

Initial shotgun sequencing of the first four single-strand libraries showed high sequence complexity (100% unique reads for all samples), thus motivating the subsequent comparison to the double-strand library preparation methods (Table 2, Supplementary Table S1, Supplementary Figure S1A). Libraries were then prepared for a second batch of five samples using all three methods. All of those 15 libraries were then shotgun sequenced for comparison. The fraction of unique molecules and thus the complexity estimates of single-strand libraries were higher than those of double-strand libraries (Figure 1D, Table 2). Had the libraries been sequenced further, we estimate coverage would have reached 33–117× for whole genomes and between 17× and > 1,000× for exomes (Tables 2 and 3). The fraction of unique molecules remained stable throughout the sequencing run, allowing an extrapolation beyond the performed sequencing in order to identify rich libraries (Supplementary Figure S1B) [14].

Table 2: Sequencing characteristics of genomic single-and double-strand libraries

The percentages of mapped reads are reported without quality filters and before duplicate read removal as well as thereafter for the remaining unique reads. Ss. Single-strand. Ds. Double-strand. NTC. Negative control.

Patterns of damage in FFPE DNA

The fixation of tissue with formalin is known to alter the structure of DNA, causing fragmentation and alterations of the organic bases [15–17]. In our sequencing data, reads from FFPE DNA were observed to be short, i.e. mostly below 100 bp (Figure 1C, Supplementary Table S2). When analyzing the mapping positions of our FFPE DNA reads, we observed the first base of the reference genome outside the sequenced molecule to be an adenine in up to 66% of reads with a mean of 53% (Figure 2A, Supplementary Table S2). This pattern reflects a preferential starting point of the sequenced DNA after adenine bases and is hereafter called ‘A-fragmentation’. For double-strand libraries, A-fragmentation is detected as adenines outside the 5′-end and as thymines outside the 3′-end (Supplementary Figure S1C) when analyzing reverse strand molecules.

Figure 2: DNA damage in FFPE DNA. (A) A-fragmentation pattern of FFPE DNA molecules of single-strand library MS1. Frequencies of adenine as the first base of the reference sequence adjacent to the sequenced molecule are observed to be as high as 60%. (B) Substitution frequencies throughout the sequenced FFPE DNA molecules. (C) C > T and G > A substitution frequencies among called variants. (D) Alignment of reads to the human reference genome. C > T damage substitutions in sequencing reads are distinguished from C > T variants (arrow), C > T damage substitutions are dispersed throughout the sequencing reads. C > T variants occur at the same position in a fraction of the sequencing reads. Forward read bases are depicted as points, reverse read bases as commas.

FFPE DNA showed an elevated frequency of C > T throughout the sequencing reads and most pronounced at the ends of molecules with frequencies up to 3% (Figure 2D, Supplementary Table S2). This pattern was also independent from the tested methods of library preparation and present in single- and double-strand libraries (Supplementary Figure S1D).

Comparison of damage patterns in single-strand and double-strand libraries

We compared single-strand and double-strand libraries for five FFPE DNA extracts and identified differences in their fragmentation and substitution patterns. The amount of A-fragmentation in single-strand libraries was lower at the 5′-end when compared to corresponding double-strand libraries (P = 1.27 × 10⁻⁰⁵, Supplementary Figure S2). C > T substitutions at both molecule ends reached higher frequencies in single-strand libraries (P = 0.02) and the single-strand method utilized shorter molecules (P = 2 × 10⁻⁰⁵).

To test if the FFPE fixation procedure introduces these types of DNA damage, we performed a comparison of FFPE and snap-frozen samples obtained from the same tissues using published raw data [2]. In this study, three prostate tissue samples were taken during surgery and either snap-frozen or formalin-fixed and paraffin-embedded for 3–9 years. All three FFPE samples showed A-fragmentation, unlike their respective snap-frozen counterparts. The latter in fact showed higher frequencies of cytosine residues outside the 5′-ends of molecules and thus, a possible C-fragmentation (Supplementary Figure S3, Supplementary Table S2). Frequencies of C > T substitutions at 5′-ends were slightly increased in FFPE samples in two of three patients, however, not as high as in our long-term stored samples.

It is assumed that longer contact with formalin should increase the amount of DNA damage. Thus we tested whether the incubation time of those tissues in formalin was correlated with the amount of DNA damage observed. For single-strand libraries, we observe increased frequencies of C > T substitutions and shorter molecule length with increasing FFPE storage time (linear models with adjusted R² of 0.3 (5′-end), 0.23 (3′-end), 0.34 (molecule length) and p-values of: 0.0035, 0.0112 and 0.0029, respectively, of which the 3′-end loses significance after correction for multiple testing, Supplementary Figure S2).

Exome sequencing

Five ng of input DNA were retrieved from two 10 μm-sections per FFPE tissue and used to generate single-strand libraries. For exome sequencing on average 34% of reads mapped on target (Table 3). This may seem low for cancer samples in general, but FFPE DNA is very short and will not be captured as efficiently. Fold-enrichment was observed to be 41 to 55-fold and should be comparable across experiments, even when targeted genomic regions differ in size (Supplementary Table S3). Deeper sequencing could then be used to further increase coverage. As we refrained from deeper sequencing, single nucleotide variants (SNV) were called at low stringency (see methods section) for all exomes that had reached 1–5x average unique read coverage. We detected between 30 and 1,300 coding missense SNV per sample and a number of these variants have previously been reported in various types of cancer [18] (Table 4, Supplementary Table S6). Two types of substitutions, C > T and G > A, have been studied most frequently and were previously supposed to include artifacts caused by DNA damage. For all variants called in the exome data, we ascertained the substitution types (Supplementary Table S4), and found C > T and G > A to be most frequent (Figure 2C). With the two types of substitutions representing around 40% of all variant calls, they resemble the upper range of previously sequenced FFPE and fresh-frozen cancer DNA ranging around 20–40% [8, 19].

Table 3: Sequencing characteristics of single-strand exome libraries

N.d. not determined. NTC. Negative control.

Table 4: Coding missense SNV called in exome sequencing data

Subset of variants detected in this study and also detected previously in cencer sequencing [18]. Full lists of called variants are listed in Supplementary Table S6. Freq: frequency

The GC content of our genomic single-strand libraries ranged from 41–47%, which was slightly lower than that of double-strand libraries ranging from 46–50% and thus closer to the known GC content of the human genome around 42% (Supplementary Table S5). Single-strand exome libraries had a higher GC content of 45–51%, likely reflecting a preference of higher GC molecules in hybridization capture.

DISCUSSION

In this study, we successfully tested a single-strand DNA library preparation method, originally established for ancient DNA, to generate genome wide sequencing data of FFPE derived cancer DNA. A comparison with standard double-strand methods showed that single-strand library preparation is far superior to previously reported methods of analyzing FFPE-isolated DNA. Complexity of sequencing libraries is a critical issue when applying high-throughput sequencing to scarce FFPE samples [5]. Genome sequencing requires high complexity, meaning sufficient numbers of distinct, or unique, molecules, especially when increasing sequencing depth is desired. With 98–100% unique molecules in our genomic libraries the single-strand method outperforms two double-strand methods (58–93%), one developed for ancient DNA and the other specifically advertised to cope with low-quantity samples. Complexity of single-strand libraries was also considerably higher than that reported for alternative double-strand protocols by Illumina and SOLiD with 35–75% [3] and 60–63% [19] unique molecules, respectively. Especially after hybridization-based whole exome capture, the sequence complexity further decreases owing to the additional handling steps. Therefore, the complexity achieved in the initial library preparation is of critical importance. Consequently, our comparisons show that the single-strand libraries contain higher complexity after the exome enrichment, namely 55–87% unique molecules, compared to previously reported double-strand library preparations of Illumina protocols with 15–62% [20], 40–62% [2], 7–88% [5] and SOLiD protocols with ~50% [6] of unique molecules. Notably, other studies reportedly required hundreds of nano- or even micrograms of input DNA [2, 3, 6, 20]. Some previous studies using lower input amounts of FFPE DNA do not report sequence complexity. Between five and 50ng of input DNA were used to e.g. generate copy number karyograms with ~60% of reads mapped, however, the percentage of unique reads was not reported [4]. While input amounts of ≤ 250 ng FFPE DNA were previously reported as insufficient for adequate exome coverage [10], recently, ‘successfully’ sequenced exomes have been reported from as little as 16 ng input FFPE DNA in a large study of 99 FFPE samples [5]. In the same study, duplicate molecules in double-strand libraries were found to be problematic, requiring high amounts of additional sequencing to reach a sufficient unique sequencing coverage [5]. Eventually, such libraries had been discarded. Our study demonstrates that those hurdles can be overcome by employing the single-strand library preparation protocol to retrieve molecule counts sufficient for whole genome and exome sequencing at adequate unique molecule depth. We measure factors that allow us to estimate the anticipated coverage in order to conclude which samples are worthwhile sequencing. The pronounced gain of molecules in single-strand libraries appears to be at least partially the result of utilizing shorter molecules [11]. We observed a difference of over 20 bp in molecule length between single-strand (61.5 bp) and double-strand (86.5 bp) libraries (Supplementary Figure S2E). Hence, some of the gain of molecules is lost after sequencing, as very short molecules cannot be mapped accurately to the human reference genome and are thus excluded from the subsequent analysis. However, the percentage of mapped reads in genomic single strand libraries was on average 87% and with high complexity and molecule numbers these libraries are estimated to yield between 33x and 117x unique read whole genome coverage (Table 2). The targeted exome single-strand libraries had on average 71% mapped reads and are estimated to yield 17x–3,668x fold exome coverage. Thus, in single-strand libraries a loss of molecule length is well compensated by molecule numbers and complexity. On the contrary for double-strand libraries, which had lower molecule numbers and complexity, we also observed lower percentages of mapped reads, on average 48%. This factor may be an underestimate, as it should usually range higher. However, even with all molecules mapped, the double strand libraries would reach only 0.3x–0.89x genome coverage, which is insufficient for most applications.

The proportion of single-stranded DNA in DNA extracts likely benefits from heating steps during DNA extraction. The DNeasy Blood&Tissue kit (Qiagen) is a widely used standard with moderate heating at 56°C. Other extraction kits vary in heating temperature and duration. Most kits use a proteinase K digestion at ≥ 37°C, others are using higher temperatures up to 90°C combined with the addition of heated elution buffer to the spin column (Qiagen AllPrep Micro and DNA/RNA FFPE kit). In our case, the DNA Mini kit (Qiagen) was used and contains a 56°C and a 70°C incubation. The benefit shown for the single-strand over the double-strand library preparation may be even further increased when incubations are carried out at higher temperatures. However, DNA extraction procedures should be further optimized for the retrieval of single-strand DNA.

Fragmentation of FFPE DNA has previously been analyzed with capillary gel electrophoresis and average molecule length was found to range between 20 and 100 bp in DNA extracts isolated from FFPE tissues stored between 8 and 48 years [15]. Determining the length of molecules via sequencing we observe similar molecule lengths in our samples that have been stored 4–19 years (Figure 1, Supplementary Figure S4A and 4B). We analyzed fragmentation in the sequence context and found that most often the sequenced molecules start and end next to an adenine residue in the reference genome. As this A-fragmentation pattern was observed in our data obtained from single-strand and double-strand libraries, as well as in published data from double-strand libraries, we are confident it represents a ubiquitous pattern in FFPE DNA. A-fragmentation probably results from the process of interstrand crosslinking during formalin fixation. Formalin has been reported to introduce interstrand crosslinks preferentially at adenine residues [17, 21]. During DNA extraction, DNA strands presumably break adjacent to these crosslinks and thus preferentially release DNA molecules between adenines.

Relatively frequent events in formalin treated DNA are intrastrand crosslinks at GG-CC and GA-TC sites which, if repaired lead to tandem base substitutions [22]. These types of mutation supposedly make up about half of the formalin-induced mutations. However, DNA strands with intrastrand crosslinks presumably will not be available for library preparation, and the absence of DNA repair in fixed tissues may explain why these patterns are not obvious in our data. The same applies to G > T transversions, previously reported to be one of the most frequent single base substitutions [22]. Only DNA repair will change the N-2-hydroxymethyl adduct to guanine into a G > T transversion, which we consequently do not see elevated in formalin-fixed tissues. It is in general not clear if some of the true damage patterns of the molecule ends may be altered during sequence library preparation. Furthermore, unknown biases may exist in ligation efficiency at the molecule ends depending on the particular base composition.

The diversity of DNA alterations introduced by FFPE storage is not yet fully understood. We observed a potential C-fragmentation in the snap-frozen sample dataset (Supplementary Figure S3). Some FFPE libraries, A8239 and A8240, a lung and prostate cancer, further showed high guanine frequencies at the first base outside the sequenced molecule (Supplementary Figures S4C and S4D) and thus a potential G-fragmentation or depurination as described for ancient DNA [23, 24]. These results indicate a number of yet unknown patterns of DNA damage introduced during freezing, fixation, extraction and library preparation.

The deamination of C > T is a well-studied process that is known to increase the substitution frequencies of C > T in DNA sequences [23, 25–28]. With a frequency of around 1% within sequenced molecules and up to 3% at their ends, FFPE DNA contains lower frequencies of C > T damage than generally observed in ancient DNA, where frequencies often exceed 20–40% at molecule ends, e.g. in the genomes of Neanderthals and the Tyrolean Iceman [23, 24]. Our data reveal elevated frequencies of C > T, however, not G > A substitutions in FFPE DNA, which have previously been reported in sequences of PCR products and double-strand libraries [8, 9, 19, 20, 29]. PCR and double-strand library preparation often do not represent the original 3′-end of forward strand molecules due to a blunt ending reaction or by copying the opposite 5′-strand during the process of library preparation. G > A damage substitutions in sequencing reads are thus thought to represent reverse complements of the underlying cytosine deamination in the 5′-end of reverse strands of the original molecule [20, 23].

Comparing published DNA sequencing data from FFPE and frozen samples indicated that A-fragmentation and C > T substitutions are introduced by the FFPE fixation procedure. An analysis of samples with increasing storage time indicates that damage accumulates with FFPE storage time, as suggested previously [20, 30]. Possible batch effects of the fixation procedures performed many years ago cannot be excluded. To understand the relationship of FFPE fixation and DNA degradation in more detail, experiments with a prospective design including tests with varying formalin concentrations and fixation times, would be necessary.

Our data revealed diverse DNA alterations introduced by FFPE storage conditions of patient material. Presumably, previous studies describing mutational signatures across human cancer types have also contained a fraction of samples derived from FFPE material [31, 32], summarized at http://cancer.sanger.ac.uk/cosmic/signatures. Thus, understanding damage patterns in FFPE-derived DNA is of critical importance to distinguish and decipher the mutational signatures described for any given type of cancer.

In the past, elevated frequencies of non-reproducible substitutions have been reported for FFPE DNA and were long argued to be a major drawback of this material. However, frequent C > T and G > A transitions were originally detected in experiments based on PCR and Sanger sequencing, where single or few damaged molecules have been amplified [29]. Using current high-throughput sequencing techniques, the impact of single damaged molecules for variant calling and hence for the detection of mutations is much less problematic. Moreover, recent studies repeatedly showed a high concordance of variants called and similar mutational profiles both in frozen and FFPE tissues [5, 8, 9, 20], which can be explained by the low frequency of C > T damage in sequenced molecules. We estimate that as a result of DNA damage, at any sequenced position about 1% of sequencing reads show a C > T substitution (Figure 2B), which is comparable to previously reported data [19]. In standard genome or exome sequencing with e.g. 30–80x coverage, 1% of sequencing reads would correspond to single reads carrying C > T damage substitutions. However, reads of both sequencing directions should be required routinely to support variant calls as exemplified in Figure 2D. Moreover, variants are routinely reported to clinicians only when observed in a pre-determined minimal fraction, e.g. 5% of sequencing reads. The observed frequencies of FFPE damage C > T substitutions will therefore remain below this threshold in the majority of cases and would thus only rarely cause false positive variant calls. Assuming an underlying Poisson distribution, we estimate up to around 2% of variant calls may represent false positive findings (Supplementary Figure S5), which is concordant with previous results on FFPE exome sequencing of paired fresh-frozen and FFPE samples. False positive mutation calls were identified from FFPE GIST tumor samples using frozen and peripheral blood samples as controls [9]. False positives were observed in the range of 0–1% of all mutation calls for three high-quality FFPE samples and at 2% for one low-quality FFPE sample. Taking the step from sequencing reads to variant calls, previous studies notably showed similar substitution frequencies within variants called from FFPE and fresh-frozen cancer DNA. For the more frequent substitution types C > T and G > A previous estimates ranged around 20–40% [8, 19]. At low coverage we observe similar frequencies of C > T and G > A substitutions, around 40% and higher, in the variants called from single-stranded FFPE DNA of lung, colon, prostate and skin cancers (Figure 2C). As there is no obvious GC bias in single-strand libraries (Supplementary Table S5 and Supplementary Figure S6), we attribute these higher frequencies of C > T and G > A to the relatively shallow sequencing performed here. In future, detailed studies comparing the single-strand method on paired samples of frozen and FFPE material are needed. However, the identified variants in our exome sequencing of single-strand libraries revealed characteristic mutations that have previously been described for cancer DNA (Table 4, Supplementary Table S6). In summary, these data indicate that the slightly elevated frequency of C > T in FFPE DNA will not be detrimental to routine high-throughput sequencing in the clinical setting.

Most large-scale genetic research efforts in oncology to date have been required to rely on fresh-frozen material, as results obtained from FFPE tissue were poor. Many genetic studies have been impeded owing to insufficient or inadequate tumor tissue availability. As FFPE material has been the standard for archiving tumor tissues for decades, access to fresh-frozen material remains an exception in a routine clinical setting. Allowing large-scale genetic analyses based on the vast collections of FFPE material, we expect the single-strand library preparation method to greatly facilitate future research, especially in the field of molecular oncology.

In the era of personalized medicine, analyzing a patient’s individual tumor can be critical to determine genetic alterations relevant in terms of prognosis and therapy decisions. Especially in a clinical oncology setting, the effective use of DNA retrieved from FFPE or biopsy material is crucial, in particular for patients where additional tumor tissue is generally not available or obtaining it would require additional surgery.

In summary, our analyses indicate that the number of unique molecules obtained in double-strand sequencing libraries has hampered large-scale sequencing studies of FFPE DNA. Higher yield and sequence complexity can be obtained from the same source material using a single-strand library preparation protocol, which makes whole exome and genome sequencing feasible from only 5 ng of FFPE-isolated DNA, generally corresponding to one or two 10 μm-sections of FFPE tissue. Although the generated sequence data contain C > T DNA damage as well as strand breaks adjacent to adenines, we do not expect variant calling to be compromised if quality criteria, such as minimum bidirectional read coverage, are applied. Given the significant improvements over conventional double-strand protocols for high-throughput sequencing of FFPE DNA, we believe the single-strand library preparation protocol will be of great value to clinicians and researchers alike.

MATERIALS AND METHODS

Samples

Melanoma samples were obtained from the Skin Cancer Biobank (SCABIO), Department of Dermatology, University Hospital Essen. Samples of patients with lung, breast, colorectal and prostate cancer were obtained and anonymized at the Department of Pathology, University Hospital Carl Gustav Carus, Technical University Dresden. Tumor samples were collected during the surgery, subsequently formalin-fixed and paraffin-embedded and stored as FFPE blocks at room temperature. To study the effect of sample storage time we obtained data on the year of surgery and fixation of the sample as well as date of DNA extraction from SCABIO, University Hospital Essen and the regional clinical cancer registry Dresden.

Ethics statement

Participants consented to their biological material being used for diagnostic purposes, and remainders of the material were used in this study. The study was approved by the respective ethics committees of the faculties for medicine of the University Duisburg-Essen (study ID 11-4715) and the Technical University Dresden (study ID EK59032007), Germany. The dataset presented here is anonymized in a manner that beyond the year of surgery and the type of cancer no personal information is given for study participants.

DNA extraction

DNA extraction from tumor samples and corresponding blood was performed with the QIAamp DNA Mini kit (Qiagen, Hilden, Germany). This standard procedure includes incubation in xylene for deparaffinization and incubation with proteinase K at 56°C as well as an incubation step at 70°C for ten minutes.

Library preparation, quantification, amplification and enrichment

DNA libraries were prepared from a total of 5 ng of extracted genomic DNA. Prior to the library preparation a stock of 5 ng/ul was prepared and aliquoted to ensure equal input quantities in all preparations. Libraries were prepared using two previously published custom protocols especially optimized for ancient and degraded DNA, i.e. one for double-stranded DNA [13] and one for single-stranded DNA [11], as well as a commercially available kit - NEBNext^® Ultra™ DNA Library Prep Kit (New England BioLabs, Ipswich, MA) advertised to be capable of using DNA input amounts as low as 5 ng total. The double-strand DNA method was implemented as described in the original publication. The NEB kit was used according to the manufacturers recommendations, but with a 1:30 dilution of the adapters in the ligation step to reduce the otherwise extensive formation of adapter dimers. For the single-strand method we followed the protocol as published [11] including the preparation and aliquoting of all chemicals in a dedicated pre-PCR facility, and with a modification reported recently by Korlevic et al. [33]: the single-stranded adaptor oligonucleotide (CL78) was cleaned from human contamination and synthesis artifacts by E. coli exonuclease I treatment. For this purpose, 7.5 μM of oligonucleotide was incubated for 20 min at 37°C in 1×CircLigase II buffer with 46 U exonuclease I (New England BioLabs, Ipswich, MA) in a reaction volume of 23 μl. The exonuclease was then heat-inactivated for 1 min at 95°C. One microliter of each library was used to measure the number of library molecules by digital droplet (dd)PCR (QX200 system; Bio-Rad) using primers IS7 and IS8 [13] and EvaGreen chemistry following the manufacturer’s recommendations. When necessary, dilutions were measured. The total number of library molecules was then calculated for the full volume of each library (20 ul for the two MPI methods and 50 ul for the NEBNext kit). The remaining libraries were amplified and tagged with pairs of sample-specific barcodes [34] using AccuPrime Pfx DNA polymerase (Life Technologies, Carlsbad, CA) as described elsewhere [35], but using higher primer concentrations (1 μM).

Negative controls

Water was used in negative controls and carried through all steps of the library preparation until the ddPCR and partly also to sequencing. Molecule numbers in negative controls were orders of magnitude lower than samples, with the exception of one NEB double-strand preparation (NTC1 in Table 1). It has been reported that standard library preparation methods (e.g., Illumina’s TruSeq DNA sample preparation kit, no. 15026486 Rev. A, or NEB’s NEBNext Ultra DNA library preparation kit, v1.1) are less suitable for the generation of libraries from highly degraded and low-quantity DNA because a background of around 1 × 10E⁸ molecules may be created mostly derived from adapter dimers and synthesis artifacts present in the adapter oligonucleotide [11]. The number of sequencing reads and mapped reads was assessed for three negative controls of single-strand library preparation.

Exome capture

Two 10 μm-sections of ~25 square mm per FFPE tissue were used for DNA extraction and 5 ng of the extracted DNA was used to prepare single-strand DNA libraries. For exome capture, 4 ul of libraries prepared from single-stranded DNA were used for hybridization capture as described previously [36]. To more efficiently enrich for the coding fraction of the genome, two consecutive rounds of hybridization captures were carried out.

Sequencing and sequence analyses

Sequencing was performed on Illumina MiSeq and HiSeq instruments with parameters outlined in detail in Supplementary Table S1. Base calling was performed by Bustard (Illumina Corp.) for MiSeq data and by freeIbis [37] for HiSeq data. Illumina adapters were removed, putative chimeric sequences were flagged as failing quality control and overlapping paired reads were merged using leeHom with the “—ancientdna” option [38]. Reads were assigned to their sample of origin using deML [39] and jivebunny with default quality thresholds (https://github.com/udo-stenzel/biohazard). Reads were then mapped to the human genome reference hg19 using bwa-0.4.9. Library complexity was estimated from the sequencing reads using preseq-0.1.0 [14] with steps of 100,000 reads. Duplicate reads were marked using Picardtools (http://picard.sourceforge.net/) and the mapped fraction of reads was calculated without read mapping quality filters requiring at least one of the read segments mapped. The GC content was determined for unique mapped reads. Distributions of molecule length in sequencing libraries of ≥ 10 bp were determined using the “–m” option implemented in samtools (https://github.com/mpieva/samtools-patched) from 100,000 merged, unique and mapped reads and the median was calculated.

Estimates for genome and exome coverage were calculated as follows. Mappable bp were calculated by the fraction of unique reads multiplied with the fraction of mapped (unique) reads and multiplied with the total molecule count and with the median molecule length. We then extrapolated to mappable genomes and exomes. Substitution frequencies and the reference base composition were analyzed from 100,000 mapped reads of a minimum length of 30 bp using damage patterns (https://github.com/udo-stenzel/damage-patterns). We additionally analyzed publicly available sequencing data for FFPE and corresponding snap-frozen tissues obtained via ICGC (accession EGAD00001000033, DACO-1002654). For these single end reads, information was available only for the 5′-end.

For exome sequencing data the average coverage over the entire bed file of target regions was determined using bedtools [40]. The numbers of unique reads mapping to the entire genome and to the target region were determined using samtools (https://github.com/mpieva/samtools-patched) and requiring a mapping quality of 20.

SNV calling was performed using samtools-1.0 [41]. Variant calls required one unique read for each variant allele in forward and reverse direction, respectively, a coverage of at least 8 unique reads and a variant frequency of at least 10%. Annovar [42] was applied to identify nonsynonymous coding SNV not known from dbSNP138 or the 1000 genomes project (MAF > 0.01). SNV located in repetitive regions were excluded using the UCSC repeatmasker track and intersectBed [40]. From all called SNV we identified variants previously detected in cancer sequencing projects by cross referencing the COSMIC database [18]. As we used archived FFPE samples, it was not possible to obtain consent for the publication of the sequencing raw data retrospectively from some study participants as they are deceased. Thus, the sequencing raw data are not made publicly available.

Statistical analyses

To test whether library preparation yields and amount of DNA damage in different groups of samples varied significantly, we applied two-sided Wilcoxon rank sum tests in R v. 3.0. [43]. Correlations of C > T substitution and A-fragmentation frequencies, as well as molecule length with sample storage time were performed using linear models in R. The probability to observe > 5% of reads with C > T damage was modeled using the Poisson distribution for read coverage up to 100x with an expected frequency of C > T in 1% of reads at any sequenced nucleotide position. The model also required to observe at least two reads with C > T as routinely a minimum of two reads is required to support variant calls.

ACKNOWLEDGMENTS

We thank the patients who donated samples for this study. We also thank Matthias Meyer, Janet Kelso and Udo Stenzel for helpful discussion.

CONFLICTS OF INTEREST

The authors disclose no potential conflicts of interest.

FUNDING

This study was supported by the Max Planck Society and the German Cancer Aid (Deutsche Krebshilfe, DKH) grant no. 11402.

REFERENCES

1. van Beers EH, Joosse SA, Ligtenberg MJ, Fles R, Hogervorst FBL, Verhoef S, Nederlof PM. A multiplex PCR predictor for aCGH success of FFPE samples. Br J Cancer. 2005; 94:333–337.

2. Kerick M, Isau M, Timmermann B, Sultmann H, Herwig R, Krobitsch S, Schaefer G, Verdorfer I, Bartsch G, Klocker H, Lehrach H, Schweiger M. Targeted high throughput sequencing in clinical cancer Settings: formaldehyde fixed-paraffin embedded (FFPE) tumor tissues, input amount and tumor heterogeneity. BMC Med Genet. 2011; 4:68.

3. Schweiger MR, Kerick M, Timmermann B, Albrecht MW, Borodina T, Parkhomchuk D, Zatloukal K, Lehrach H. Genome-Wide Massively Parallel Sequencing of Formaldehyde Fixed-Paraffin Embedded (FFPE) Tumor Tissues for Copy-Number- and Mutation-Analysis. PLoS ONE. 2009; 4:e5548.

4. Wood HM, Belvedere O, Conway C, Daly C, Chalkley R, Bickerdike M, McKinley C, Egan P, Ross L, Hayward B, Morgan J, Davidson L, MacLennan K, et al. Using next-generation sequencing for high resolution multiplex analysis of copy number variation from nanogram quantities of DNA from formalin-fixed paraffin-embedded specimens. Nucleic Acids Res. 2010; 38:e151.

5. Allen EMV, Wagle N, Stojanov P, Perrin DL, Cibulskis K, Marlow S, Jane-Valbuena J, Friedrich DC, Kryukov G, Carter SL, McKenna A, Sivachenko A, Rosenberg M, et al. Whole-exome sequencing and clinical interpretation of FFPE tumor samples to guide precision cancer medicine. Nat Med. 2014; 20:682–688.

6. Menon R, Deng M, Boehm D, Braun M, Fend F, Boehm D, Biskup S, Perner S. Exome Enrichment and SOLiD Sequencing of Formalin Fixed Paraffin Embedded (FFPE) Prostate Cancer Tissue. Int J Mol Sci. 2012; 13:8933–8942.

7. Wagle N, Berger MF, Davis MJ, Blumenstiel B, DeFelice M, Pochanard P, Ducar M, Van Hummelen P, MacConaill LE, Hahn WC. High-throughput detection of actionable genomic alterations in clinical tumor samples by targeted, massively parallel sequencing. Cancer Discov. 2012; 2:82–93.

8. Munchel S, Hoang Y, Zhao Y, Cottrell J, Klotzle B, Godwin AK, Koestler D, Beyerlein P, Fan J-B, Bibikova M, Chien J. Targeted or whole genome sequencing of formalin fixed tissue samples: potential applications in cancer genomics. Oncotarget. 2015; 6:25943–61. doi: 10.18632/oncotarget.4671.

9. Astolfi A, Urbini M, Indio V, Nannini M, Genovese CG, Santini D, Saponara M, Mandrioli A, Ercolani G, Brandi G, Biasco G, Pantaleo MA. Whole exome sequencing (WES) on formalin-fixed, paraffin-embedded (FFPE) tumor tissue in gastrointestinal stromal tumors (GIST). BMC Genomics. 2015; 16:892.

10. Beltran H, Yelensky R, Frampton GM, Park K, Downing SR, MacDonald TY, Jarosz M, Lipson D, Tagawa ST, Nanus DM. Targeted Next-generation Sequencing of Advanced Prostate Cancer Identifies Potential Therapeutic Targets and Disease Heterogeneity. Eur Urol. 2012.

11. Gansauge M-T, Meyer M. Single-stranded DNA library preparation for the sequencing of ancient or damaged DNA. Nat Protoc. 2013; 8:737–748.

12. Meyer M, Kircher M, Gansauge M-T, Li H, Racimo F, Mallick S, Schraiber JG, Jay F, Prüfer K, de Filippo C, Sudmant PH, Alkan C, Fu Q, et al. A High-Coverage Genome Sequence from an Archaic Denisovan Individual. Science. 2012; 338:222–226.

13. Meyer M, Kircher M. Illumina Sequencing Library Preparation for Highly Multiplexed Target Capture and Sequencing. Cold Spring Harb Protoc. 2010; 2010:pdb.prot5448.

14. Daley T, Smith AD. Predicting the molecular complexity of sequencing libraries. Nat Methods. 2013; 10:325–327.

15. Zimmermann J, Hajibabaei M, Blackburn DC, Hanken J, Cantin E, Posfai J, Evans TC. DNA damage in preserved specimens and tissue samples: a molecular assessment. Front Zool. 2008; 5:18.

16. McGhee JD, Von Hippel PH. Formaldehyde as a probe of DNA structure. I. Reaction with exocyclic amino groups of DNA bases. Biochemistry. 1975; 14:1281–1296.

17. Huang H, Solomon MS, Hopkins PB. Formaldehyde preferentially interstrand cross-links duplex DNA through deoxyadenosine residues at the sequence 5′-d (AT). J Am Chem Soc. 1992; 114:9240–9241.

18. Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, Beare D, Jia M, Shepherd R, Leung K, Menzies A, Teague JW, Campbell PJ, Stratton MR, et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 2011; 39:D945–D950.

19. Yost SE, Smith EN, Schwab RB, Bao L, Jung H, Wang X, Voest E, Pierce JP, Messer K, Parker BA. Identification of high-confidence somatic mutations in whole genome sequence of formalin-fixed breast cancer specimens. Nucleic Acids Res. 2012; 40:e107–e107.

20. Spencer DH, Sehn JK, Abel HJ, Watson MA, Pfeifer JD, Duncavage EJ. Comparison of clinical targeted next-generation sequence data from formalin-fixed and fresh-frozen tissue specimens. J Mol Diagn. 2013; 15:623–633.

21. Huang H, Hopkins PB. DNA interstrand cross-linking by formaldehyde: nucleotide sequence preference and covalent structure of the predominant cross-link formed in synthetic oligonucleotides. J Am Chem Soc. 1993; 115:9402–9408.

22. Kawanishi M, Matsuda T, Yagi T. Genotoxicity of formaldehyde: molecular basis of DNA damage and mutation. Front Environ Sci. 2014; 2:36.

23. Briggs AW, Stenzel U, Johnson PL, Green RE, Kelso J, Prüfer K, Meyer M, Krause J, Ronan MT, Lachmann M. Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci USA. 2007; 104:14616–14621.

24. Keller A, Graefen A, Ball M, Matzas M, Boisguerin V, Maixner F, Leidinger P, Backes C, Khairat R, Forster M. New insights into the Tyrolean Iceman’s origin and phenotype as inferred by whole-genome sequencing. Nat Commun. 2012; 3:698.

25. Wang TP, Sable HZ, Lampen JO. Enzymatic deamination of cytosine nucleosides. J Biol Chem. 1950; 184:17–28.

26. Gilbert MTP, Binladen J, Miller W, Wiuf C, Willerslev E, Poinar H, Carlson JE, Leebens-Mack JH, Schuster SC. Recharacterization of ancient DNA miscoding lesions: insights in the era of sequencing-by-synthesis. Nucleic Acids Res. 2007; 35:1–10.

27. Stiller M, Green RE, Ronan M, Simons JF, Du L, He W, Egholm M, Rothberg JM, Keates SG, Ovodov ND. Patterns of nucleotide misincorporations during enzymatic amplification and direct large-scale sequencing of ancient DNA. Proc Natl Acad Sci USA. 2006; 103:13578–13584.

28. Weiss RB, Mineura K, Henderson EE, Duker NJ, DeRiel JK. Enzymic detection of uracil in a cloned and sequenced deoxyribonucleic acid segment. Biochemistry. 1983; 22:4501–4507.

29. Williams C, Pontén F, Moberg C, Söderkvist P, Uhlén M, Pontén J, Sitbon G, Lundeberg J. A high frequency of sequence alterations is due to formalin fixation of archival specimens. Am J Pathol. 1999; 155:1467–1471.

30. Araujo LH, Timmers C, Shilo K, Zhao W, Zhang J, Yu L, Natarajan TG, Miller CJ, Yilmaz AS, Liu T. Impact of Pre-Analytical Variables on Cancer Targeted Gene Sequencing Efficiency. PLoS ONE. 2015; 10:e0143092.

31. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SAJR, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Børresen-Dale A-L. Signatures of mutational processes in human cancer. Nature. 2013; 500:415–421.

32. Nik-Zainal S, Alexandrov LB, Wedge DC, Van Loo P, Greenman CD, Raine K, Jones D, Hinton J, Marshall J, Stebbings LA. Mutational processes molding the genomes of 21 breast cancers. Cell. 2012; 149:979–993.

33. Korlevic P, Gerber T, Gansauge M-T, Hajdinjak M, Nagel S, Ayinuer-Petri A, Meyer M. Reducing microbial and human contamination in DNA extractions from ancient bones and teeth. Biotechniques. 2015; 59:87–93.

34. Kircher M, Sawyer S, Meyer M. Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform. Nucleic Acids Res. 2012; 40:e3.

35. Dabney J, Meyer M. Length and GC-biases during sequencing library amplification: a comparison of various polymerase-buffer systems with ancient and modern DNA sequencing libraries. Biotechniques. 2012; 52:87–94.

36. Fu Q, Meyer M, Gao X, Stenzel U, Burbano HA, Kelso J, Pääbo S. DNA analysis of an early modern human from Tianyuan Cave, China. Proc Natl Acad Sci USA. 2013; 110:2223–2227.

37. Renaud G, Kircher M, Stenzel U, Kelso J. freeIbis: an efficient basecaller with calibrated quality scores for Illumina sequencers. Bioinformatics. 2013; 29:1208–1209.

38. Renaud G, Stenzel U, Kelso J. leeHom: adaptor trimming and merging for Illumina sequencing reads. Nucleic Acids Res. 2014; 42:e141–e141.

39. Renaud G, Stenzel U, Maricic T, Wiebe V, Kelso J. deML: robust demultiplexing of Illumina sequences using a likelihood-based approach. Bioinformatics. 2015; 31:770–772.

40. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010; 26:841–842.

41. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009; 25:2078–2079.

42. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010; 38:e164–e164.

43. Team RDC. R: A language and environment for statistical computing. ISBN 3-900051-07-0. R Foundation for Statistical Computing. Vienna, Austria, 2013. url: http://www.R-project.org). 2005.

Publication Alerts

European Cancer and Oncology Congress

Post-Publication Promotion

Research Papers:

Single-strand DNA library preparation improves sequencing of formalin-fixed and paraffin-embedded (FFPE) cancer DNA

Publication Alerts

European Cancer and Oncology Congress

Oncoscience

Post-Publication Promotion

Research Papers:

Single-strand DNA library preparation improves sequencing of formalin-fixed and paraffin-embedded (FFPE) cancer DNA