Structural variation discovery in the cancer genome using next generation sequencing: computational solutions and perspectives.

Somatic Structural Variations (SVs) are a complex collection of chromosomal mutations that could directly contribute to carcinogenesis. Next Generation Sequencing (NGS) technology has emerged as the primary means of interrogating the SVs of the cancer genome in recent investigations. Sophisticated computational methods are required to accurately identify SV events and delineate their breakpoints from the massive number of reads generated by an NGS experiment. In this review, we provide an overview of current analytic tools used for SV detection in NGS-based cancer studies. We summarize the features of common SV groups and the primary types of NGS signatures that can be used in SV detection methods. We discuss the principles, key similarities, and differences of existing computational programs and comment on unresolved issues in this research field. The aim of this article is to provide a practical guide to the relevant concepts, computational methods, software tools, and important factors for analyzing and interpreting NGS data for the detection of SVs in the cancer genome.

SV detection programs, are recorded and summarized.
The tested paired-end WGS dataset was generated by the Illumina sequencing service on the Illumina HiSeq 2000 platform from the primary tumor and matched blood of a patient with muscle-invasive transitional cell carcinoma of the urinary bladder (TCC-UB). Each read is 100 nucleotides long, and the mean library insert sizes are 320±15 nucleotides and 313±15 nucleotides for the tumor and matched normal samples, respectively. Sample preprocessing, fragmentation, and library preparation were performed following Illumina's standard protocols. The raw image data were processed by CASAVA [1], and the sequence reads were mapped to the hg19 reference genome using BWA [2]. The mean coverages are 61x and 46x, and the resulting BAM files are 155 GB and 119 GB for the tumor and matched blood samples, respectively.
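As a rough consistency check on these figures, the number of reads implied by a mean coverage can be estimated as coverage × genome length / read length. The sketch below assumes a genome length of about 3.1 Gb for hg19 (an approximation that is not stated in the text) and ignores unmapped, duplicate, and clipped reads; the function name is illustrative only.

```python
def estimated_read_count(mean_coverage, read_length, genome_length=3.1e9):
    """Approximate number of reads needed to reach a given mean coverage.

    genome_length defaults to ~3.1 Gb, an assumed approximate size for hg19.
    """
    return mean_coverage * genome_length / read_length


for sample, coverage in (("tumor", 61), ("matched blood", 46)):
    reads = estimated_read_count(coverage, read_length=100)
    # Paired-end data: two reads per sequenced fragment.
    print(f"{sample}: ~{reads / 1e9:.2f} billion reads, "
          f"~{reads / 2e9:.2f} billion read pairs")
```

Estimates of this kind give only an order of magnitude, but they help anticipate the input size that each SV detection program has to process.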
The tested SV detection programs are GASV [3], BreakDancer [4,5], HYDRA [6], SVDetect [7], CREST [8], DELLY [9], PRISM [10], and LUMPY [11]. PEMer [12] was not included because of the extremely high computational demand of its MEGABLAST [13] mapping step. The BAM files generated by BWA mapping serve as the input for all tested programs except PRISM, which requires SAM format. The programs that support parallel computing were tested on a computer cluster of 100 nodes, each with two Intel Xeon E5-2670 @2.6 GHz processors and 64 GB of memory. The programs that do not support parallel computing were tested on a Dell Linux workstation with two Intel Xeon E5-2620 v2 @2.1 GHz processors and 32 GB of memory.
Supplemental Table 1 shows the performance statistics for the programs that support parallel computing. Different programs support parallel computing in different ways. For example, CREST splits the jobs by chromosome, DELLY calls different types of SVs in parallel, and SVDetect and LUMPY use a pre-specified number of threads. The default or recommended settings were used for each program except HYDRA. Because the realignment component of HYDRA uses novoalign [14], whose noncommercial version does not support multithreading, we split the read files into smaller files of 100,000 reads each for the realignment step. As shown in Supplemental Table 1, all programs except SVDetect finished within 2 days on our test cluster. The computing statistics of the programs that do not support parallel computing are listed in Supplemental Table 2. The default settings were used for all of these programs. They all finished within several hours on our test workstation, with GASV using much more memory (9 GB) than the other two (1 GB).
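The read-file splitting used for the HYDRA realignment step can be illustrated with a minimal sketch. It assumes that the reads are stored as FASTQ with exactly four lines per record, which the text does not specify; the function name and the output file naming are hypothetical and do not reproduce the scripts actually used.

```python
import itertools


def split_fastq(path, reads_per_chunk=100_000, prefix="chunk"):
    """Split a FASTQ file into files of at most reads_per_chunk reads.

    Assumes the common four-lines-per-record layout (no wrapped sequences).
    Returns the list of chunk file names that were written.
    """
    written = []
    with open(path) as handle:
        for index in itertools.count():
            # Each read occupies exactly four lines in this layout.
            lines = list(itertools.islice(handle, 4 * reads_per_chunk))
            if not lines:
                break
            name = f"{prefix}_{index:05d}.fastq"
            with open(name, "w") as out:
                out.writelines(lines)
            written.append(name)
    return written
```

For paired-end data, both mate files would need to be split with the same chunk size so that mates stay in matching chunks. Each chunk can then be realigned as an independent single-threaded job on a separate cluster node.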
It should be emphasized that many additional factors, such as data quality, the complexity of the cancer genome, and sequencing coverage, can affect memory usage and runtime. Furthermore, these factors may affect the computing performance of different SV detection programs to different extents. For example, an increased number of split reads will greatly slow down programs that require read realignment but may have relatively little effect on the runtimes of programs without a realignment step. A systematic study of the performance of each method is beyond the scope of this review.
a. The computational environment is a Dell Linux workstation with two Intel Xeon E5-2620 v2 @2.1 GHz processors and 32 GB of memory, and the default settings were used for all programs.
b. The maximum memory usage in any step of the SV calling process.
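Runtime and peak memory statistics of this kind can be collected in several ways. The following is a minimal sketch using Python's standard `resource` module on a Unix-like system. It is not the measurement procedure used for the supplemental tables, and the command shown is a placeholder rather than an actual SV caller invocation.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(command):
    """Run a command; return (wall-clock seconds, peak child RSS).

    ru_maxrss for RUSAGE_CHILDREN is a high-water mark over all child
    processes this interpreter has waited for, and it reflects the largest
    single child rather than the sum over a process tree. Measure one
    program per interpreter. Units are kilobytes on Linux (bytes on macOS).
    """
    start = time.perf_counter()
    subprocess.run(command, check=True)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss


if __name__ == "__main__":
    # Placeholder command; substitute the actual SV caller invocation.
    seconds, peak = run_and_measure([sys.executable, "-c", "print('done')"])
    print(f"wall time: {seconds:.2f} s, peak RSS: {peak}")
```

For a multi-step pipeline, running each step through such a wrapper in a fresh interpreter and taking the largest value would give the maximum memory usage over all steps.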