A comparative assessment of clinical whole exome and transcriptome profiling across sequencing centers: implications for precision cancer medicine

Advances in next generation sequencing technologies provide approaches to comprehensively determine genomic alterations within a tumor that occur as a cause or consequence of neoplastic growth. Though providers offering various cancer genomics assays have multiplied, the level of reproducibility in terms of the technical sensitivity and the conclusions resulting from the data analyses have not been assessed. We sought to determine the reproducibility of ascertaining tumor genome aberrations using whole exome sequencing (WES) and RNAseq. Samples of the same metastatic tumors were independently processed and subjected to WES of tumor and constitutional DNA, and RNAseq of RNA, at two sequencing centers. Overall, the sequencing results were highly comparable. Concordant mutation calls ranged from 88% to 93% of all variants including 100% agreement across 154 cancer-associated genes. Regions of copy losses and gains were uniformly identified and called by each sequencing center and chromosomal plots showed nearly identical patterns. Transcript abundance levels also exhibited a high degree of concordance (r2 ≥ 0.78;Pearson). Biologically-relevant gene fusion events were concordantly called. Exome sequencing of germline DNA samples provided a minimum of 30X coverage depth across 56 genes where incidental findings are recommended to be reported. One possible pathogenic variant in the APC gene was identified by both sequencing centers. The findings from this study demonstrate that results of somatic and germline sequencing are highly concordant across sequencing centers that have substantial experience in the technological requirements for preparing, sequencing and annotating DNA and RNA from human biospecimens.


Tissue acquisition and preparation
Metastatic tumor samples were obtained from patients who died of castration resistant prostate cancer during a rapid autopsy performed within 6 hours of death. Patients and family members signed a written consent for the Prostate Cancer Donor Program at the University of Washington [1]. The Institutional Review Board of the University of Washington approved this study. Tumor fragments were embedded in OCT freezing medium. Frozen sections were examined to assess tumor purity following hematoxylin and eosin staining. Frozen tumor pieces containing > 70% tumor cells based on visual analysis were sent to the Broad Institute and University of Michigan for processing and sequence analyses.

Library preparations and sequencing
Whole exome-Broad Institute DNA extraction was performed as previously described [2]. DNA libraries for massively parallel sequencing were generated as previously described. Libraries with concentrations above 40 ng/µl, as measured by a PicoGreen assay automated on an Agilent Bravo instrument, were considered acceptable for hybrid selection and sequencing. The exon capture procedure was performed as previously described. Libraries with concentrations between 40 and 50 ng/µL were normalized to 40 ng/µL, and 12.3 µL of library was combined with blocking agent, bait, and hybridization buffer. Finally, the hybridization reaction was reduced to 17 hours, with no changes to the downstream capture protocol. After post-capture enrichment, libraries were quantified using PicoGreen, normalized to equal concentration using a Perkin Elmer MiniJanus instrument, and pooled by equal volume on the Agilent Bravo platform. Library pools were then quantified using quantitative PCR (KAPA Biosystems) with probes specific to the ends of the adapters; this assay was automated using Agilent's Bravo liquid handling platform. Based on qPCR quantification, libraries were brought to 2 nM and denatured using 0.2 N NaOH on the Perkin-Elmer MiniJanus. After denaturation, libraries were diluted to 20 pM using hybridization buffer purchased from Illumina. Cluster amplification of denatured templates was performed according to the manufacturer's protocol (Illumina). HiSeq v3 cluster chemistry and flowcells, as well as Illumina's Multiplexing Sequencing Primer Kit. DNAs were added to flowcells and sequenced using the HiSeq 2000 v3 Sequencing-by-Synthesis method, then analyzed using RTA v.1.12.4.2 or later. Each pool of whole exome libraries was subjected to paired 76 bp runs. An 8-base index sequencing read was performed to read molecular indices, across the number of lanes needed to meet coverage for all libraries in the pool.
Sequence data processing: Exome sequence data processing was performed using established analytical pipelines at the Broad Institute. A BAM file was produced with the Picard pipeline (http://picard.sourceforge.net/), which aligns the tumor and normal sequences to the hg19 human genome build using Illumina sequencing reads. The BAM was uploaded into the Firehose pipeline (http:// www.broadinstitute.org/cancer/cga/Firehose), which manages input and output files to be executed by GenePatterm [3]. Quality control modules within Firehose were applied to all sequencing data for comparison of the origin for tumor and normal genotypes and to assess fingerprinting concordance. Cross-contamination of samples was estimated using ContEst [4].

Whole exome-University of Michigan
Matched normal genomic DNAs from frozen normal tissue blocks were isolated using the Qiagen DNeasy Blood & Tissue Kit, according to the manufacturer's instructions. Tumor genomic DNA and total RNA were purified from the same sample using the AllPrep DNA/ RNA/miRNA kit (QIAGEN) with disruption using a 5 mm bead on a Tissuelyser II (Qiagen). RNA integrity was verified on an Agilent 2100 Bioanalyzer using RNA Nano reagents (Agilent Technologies). Whole exome capture libraries were constructed from 100 ng to 1 μg of DNA from tumor and normal tissue after sample shearing, end repair, and phosphorylation and ligation to barcoded sequencing adaptors (NEB). Ligated DNA was size selected for lengths between 200-350 bp and subjected to exonic hybrid capture using SureSelect Exome v4 baits (Agilent). Paired-end libraries were sequenced with the Illumina HiSeq 2500, (2 × 100 nucleotide read length. Reads passing the chastity filter of Illumina BaseCall software were used for subsequent analysis.

Transcriptome-Broad Institute
RNA was extracted from frozen tissue using the miRNeasy Mini kit (Qiagen) according to the manufacturer's instructions, including the optional on-column DNase digest. All samples were quantified using Nanodrop and quality was evaluated using Agilent's Bioanalyzer 2100. Total RNA was quantified using the Quant-iT™ RiboGreen ® RNA Assay Kit (Invitrogen) and normalized to 4 ng/ul. An aliquot of 200 ng for each sample was transferred into library preparation which was an automated variant of the Illumina Tru Seq™ RNA Sample Preparation protocol (Revision A, 2010). This method uses oligo dT beads to select mRNA from the total RNA sample followed by heat fragmentation and cDNA synthesis from the RNA template. The resultant cDNA then goes through library preparation (end repair, base 'A' addition, adapter ligation, and enrichment) using Broad designed indexed adapters substituted in for multiplexing. After enrichment the libraries were quantified with qPCR using the KAPA Library Quantification Kit for Illumina Sequencing Platforms and then pooled equimolarly. The entire process is in 96-well format and all pipetting is done by either Agilent Bravo or PerkinElmer JANUS Mini liquid handlers.
Illumina Sequencing: Pooled libraries were normalized to 2 nM and denatured using 0.2 N NaOH prior to sequencing. Flowcell cluster amplification and sequencing were performed according to the manufacturer's protocols using either the HiSeq 2000 v3 or HiSeq 2500. Each run was a 76 bp paired-end with an eight-base index barcode read. Data was analyzed using the Broad Picard Pipeline which includes de-multiplexing and data aggregation

Transcriptome-University of Michigan
Transcriptome libraries were prepared using 200-1000 ng of total RNA isolated as above. Poly(A)+ RNA was isolated using Sera-Mag oligo(dT) beads (Thermo Scientific) and fragmented with the Ambion Fragmentation Reagent kit (Ambion, Austin, TX). cDNA synthesis, end-repair, A-base addition, and ligation of the Illumina indexed adapters were performed according to Illumina's TruSeq RNA protocol (Illumina). Libraries were sizeselected for 250-300 bp cDNA fragments on a 3% Nusieve 3:1 (Lonza) agarose gel, recovered using QIAEX II gel extraction reagents (Qiagen), and PCR-amplified using Phusion DNA polymerase (New England Biolabs). The amplified libraries were purified using AMPure XP beads (Beckman Coulter). Total transcriptome libraries were prepared as above, omitting the poly A selection step and captured using Agilent SureSelect Human All Exon V4 reagents and protocols. Library quality was measured on an Agilent 2100 Bioanalyzer for product size and concentration. Paired-end libraries were sequenced with the Illumina HiSeq 2500, (2 × 100 nucleotide read length), with sequence coverage to 50 M paired reads and 100 M total reads. Reads passing the chastity filter of Illumina BaseCall software were used for subsequent analysis.

Determination of tumor purity
Tumor content for each tumor exome library was estimated from the sequence data by fitting a binomial mixture model with two components to the set of most likely SNV candidates on 2-copy genomic regions. The set of candidates used for estimation consisted of coding variants that (i) exhibited at least 3 variant fragments in the tumor sample, (ii) exhibited zero variant fragments in the matched benign sample with at least 16 fragments of coverage, (iii) were not present in dbSNP, (iv) were within a targeted exon or within 100 base pairs of a targeted exon, (v) were not in homopolymer runs of four or more bases, and (vi) exhibited no evidence of amplification or deletion. In order to filter out regions of possible amplification or deletion, we used exon coverage ratios to infer copy number changes, as described below. Resulting SNV candidates were not used for estimation of tumor content if the segmented log-ratio exceeded 0.2 in absolute value. Candidates on the Y chromosome were also eliminated because they were unlikely to exist in 2-copy genomic regions. Using this set of candidates, we fit a binomial mixture model with two components using the R package flexmix, version 2.3-8. One component consisted of SNV candidates with very low variant fractions, presumably resulting from recurrent sequencing errors and other artifacts. The other component, consisting of the likely set of true SNVs, was informative of tumor content in the tumor sample. Specifically, under the assumption that most or all of the observed SNV candidates in this component are heterozygous SNVs, we expect the estimated binomial proportion of this component to represent one-half of the proportion of tumor cells in the sample. Thus, the estimated binomial proportion as obtained from the mixture model was doubled to obtain an estimate of tumor content.

University of Michigan
Paired-end reads were aligned using Novoalign v 3.02.00 and sorted using Novosort. (Novocraft Technologies) Variants in both normal and tumor libraries were identified using the local realignment haplotypebased caller FreeBayes [7]. process, Tophat2 internally deploys an ultrafast short read alignment tool Bowtie (Version 0.12.8) to map the transcriptome data. Potential false positive fusion candidates were filtered out using 'Tophat-Post-Fusion' module. Further, the fusion candidates were manually examined for annotation and ligation artifacts. Junction reads supporting the fusion candidates were re-aligned using an alignment tool BLAT (http://genome.ucsc.edu/ cgi-bin/hgBlat) to reconfirm the fusion breakpoint. Full length sequence of the fusion gene was constructed based on supporting junction reads, and evaluated for potential open reading frames (ORF) using an ORF finder (http:// www.ncbi.nlm.nih.gov/gorf/gorf.html). Further, the gene fusions with robust ORFs and the amino acid sequences of the fused proteins were explored using the Simple Modular Architecture Research Tool (SMART) evaluating for gain or loss of known functional domains in the fusion proteins. Candidate fusion junctions with 4 or greater spanning reads were reported.

Broad Institute
Copy ratios were calculated for each captured target by dividing the tumor coverage by the median coverage obtained in a set of reference normal samples. The resulting copy ratios were segmented using the circular binary segmentation algorithm (Olshen 2004). Genes in copy ratio regions with segment means of greater than log 2 (4) were evaluated for focal amplifications, and genes in regions with segment means of less than log 2 (0.5) were evaluated for deletions.

University of Michigan
Copy number aberrations were quantified and reported for each gene as the segmented normalized log2transformed exon coverage ratios between each tumor sample and matched normal sample [12]. To account for observed associations between coverage ratios and variation in GC content across the genome, lowess normalization was used to correct per-exon coverage ratios prior to segmentation analysis. Specifically, mean GC percentage was computed for each targeted region, and a lowess curve was fit to the scatterplot of log2-coverage ratios vs. mean GC content across the targeted exome using the lowess function in R (version 2.13.1) with smoothing parameter f = 0.05. The resulting copy ratios were segmented using the circular binary segmentation algorithm [13].