In this blog series, I discuss the architecture of a state-of-the-art secondary pipeline that can detect single nucleotide variations (SNVs) and copy number variations (CNVs) in a single test leveraging next-gen sequencing. In Part I, we reviewed genetic variation in humans and looked at the key components of a systems architecture supporting this kind of analysis. Part II briefly reviews how algorithms such as GATK are leveraged to call single nucleotide variations. Part III will give you an overview of some of the design principles of a CNV analytics framework for next-gen sequencing data. Part IV will show examples of how such a CNV caller identifies CNVs. Finally, Part V will show what an integrated clinical workflow looks like.
To start things off, let’s take a look at the secondary analysis for SNVs. Sequencers take the physical sample and translate it into data; this step is referred to as primary analysis. The data output is a FASTQ file (shown in Fig. 1) containing millions of reads, varying in length from 50 to 250 base pairs, that consist of A, C, G and T character strings. The sequencer also computes a quality score, the so-called Phred score, for each base.
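To make the FASTQ format and Phred encoding concrete, here is a minimal sketch that parses one four-line FASTQ record and decodes its quality string. It assumes the common Phred+33 ASCII encoding; real pipelines use libraries such as Biopython or pysam rather than hand-rolled parsing.

```python
# Toy FASTQ parsing: a record is four lines: @id, sequence, '+', quality string.
# Assumes Phred+33 encoding (quality character = chr(Q + 33)).

def phred_to_error_prob(qual_char, offset=33):
    """Convert one ASCII quality character to a base-call error probability."""
    q = ord(qual_char) - offset      # Phred score Q
    return 10 ** (-q / 10)           # P(error) = 10^(-Q/10)

def parse_fastq_record(lines):
    """Split one FASTQ record into id, sequence, and integer Phred scores."""
    read_id = lines[0].lstrip("@")
    seq = lines[1]
    quals = [ord(c) - 33 for c in lines[3]]
    return read_id, seq, quals

record = ["@read1", "ACGTACGT", "+", "IIIIHHHH"]  # 'I' = Q40, 'H' = Q39
rid, seq, quals = parse_fastq_record(record)
```

A Phred score of 40 thus corresponds to a 1-in-10,000 chance that the base was called incorrectly, which is why these scores matter so much downstream.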
From there, the secondary analysis for SNVs proceeds in three discrete steps, as outlined in Fig 2. The first step in the data processing is called Alignment. This process determines the most likely location in the genome from which an observed DNA sequencing read originated, given the knowledge of which species the sequence came from. Today’s short-read alignment programs are generally used to align DNA sequence from the species of interest to the reference genome assembly of that species. This difference, although it initially seems subtle, has several consequences for the final algorithm design and implementation; for example, assumptions about the number of expected mismatches can be driven by the species’ polymorphism rate and the technology’s error rate rather than by considerations of evolutionary substitutions.
In general, these assumptions allow for much faster processing, since few low-quality alignments are either expected or scored. Given the massive data volumes produced by today’s sequencing machines, this has also allowed alignments to be computed without a correspondingly massive increase in hardware requirements. As sequencing capacity grows, algorithmic speed may become a more important bottleneck. Although there is a large and ever-growing number of implementations for short-read sequence alignment, the number of fundamental technologies they use is much smaller. These are:
- Hash table–based implementations, in which the hash may be created using either the reference genome or the set of sequencing reads.
- Burrows-Wheeler Transform (BWT)-based methods, which first create an efficient index of the reference genome assembly in a way that facilitates rapid searching with a low memory footprint.
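The hash-table approach above can be sketched in a few lines: index every k-mer of the reference, then look up k-mers from a read to find candidate mapping positions. The parameters and reference below are toy values for illustration; production aligners use far larger k-mers and much more sophisticated index structures.

```python
# Toy hash-table seeding: map every k-mer of the reference to its positions,
# then use exact k-mer hits from a read to suggest candidate alignment starts.
from collections import defaultdict

def build_kmer_index(reference, k=4):
    """Hash table from each k-mer to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_positions(read, index, k=4):
    """Candidate alignment start positions suggested by exact k-mer matches."""
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            hits.add(pos - offset)   # where the whole read would start
    return hits

ref = "ACGTTGCAACGTACGT"            # toy reference sequence
idx = build_kmer_index(ref)
hits = seed_positions("TTGCAA", idx)
```

Each candidate position found this way still has to be verified by a proper alignment algorithm, which is exactly the two-stage procedure described next.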
Alignment programs normally follow a multi-step procedure to map sequences accurately. In the first step, heuristic techniques quickly identify a small set of places in the reference sequence where the best mapping is most likely to be found. Once this smaller subset of candidate locations has been identified, slower and more accurate alignment algorithms such as Smith-Waterman are run on it. Since Smith-Waterman has quadratic runtime, it is best used selectively and only on reads of short length. Running these accurate alignment algorithms as a full search of all possible places where the sequence may map is computationally infeasible. Figure 4, below, shows an example of reads aligned against a reference genome.
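To illustrate why the accurate step is reserved for candidate regions, here is a minimal Smith-Waterman local alignment scorer. It fills the classic dynamic-programming matrix, which is where the quadratic runtime comes from; the scoring values are illustrative, not those of any specific aligner.

```python
# Minimal Smith-Waterman local alignment score.
# Fills an (len(a)+1) x (len(b)+1) matrix, hence O(len(a) * len(b)) time.

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the score of the best local alignment between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Scoring a 150 bp read against an entire 3-billion-base genome this way would be hopeless, but against a handful of seeded candidate windows it is cheap.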
The next step is Deduping. This step removes redundant reads that are introduced as a byproduct of the PCR amplification in the primary analysis. It cleans up the data, reduces redundancy and increases computational efficiency in downstream steps. (See Fig 4 as an example.)
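The core idea of duplicate marking can be sketched as follows: reads mapping to the same chromosome, position and strand are treated as PCR duplicates, and only the highest-quality copy is kept. This is a simplified version of the logic used by tools such as Picard MarkDuplicates; the read tuples here are hypothetical stand-ins for real aligned records.

```python
# Toy PCR-duplicate removal: reads with identical (chrom, pos, strand)
# are duplicates; keep the copy with the best mean base quality.

def dedupe(reads):
    """reads: list of (chrom, pos, strand, mean_quality, read_id) tuples."""
    best = {}
    for read in reads:
        key = read[:3]                       # same chrom, pos, strand
        if key not in best or read[3] > best[key][3]:
            best[key] = read                 # keep the higher-quality copy
    return sorted(best.values())

reads = [
    ("chr1", 100, "+", 35.0, "r1"),
    ("chr1", 100, "+", 38.5, "r2"),   # PCR duplicate of r1, better quality
    ("chr1", 250, "-", 30.0, "r3"),
]
unique = dedupe(reads)
```

Production tools also consider mate-pair coordinates and can mark rather than drop duplicates, but the keying idea is the same.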
The final step is the actual Variant Calling. While there are a number of different callers available, GATK, developed by the Broad Institute, is widely adopted in the clinical realm. Bayesian estimation of the most likely genotype from next-generation DNA sequencing reads has already proven to be valuable, and GATK follows these principles. At its core, it is a simple Bayesian genotyper. Though naïve, this genotyper provides a framework for implementing more advanced genotyping and variant discovery approaches that incorporate more realistic read-mapping error and base-miscall models. It serves both as the starting point for more advanced statistical inference tools and as an ideal place to highlight the shared-memory and distributed parallelization capabilities of the GATK core engine.
In short, it computes the posterior probability of each genotype given the pileup of sequencer reads that cover the current locus; the expected heterozygosity of the sample is used to derive the prior probability of each genotype in the Bayesian formulation (see Fig. 5).
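A toy version of this posterior computation looks like the following. The prior values derived from the heterozygosity and the uniform base-error model are illustrative placeholders, not GATK's actual model, but the posterior ∝ likelihood × prior structure is the same.

```python
# Toy Bayesian genotyper for one biallelic locus: posterior over the three
# diploid genotypes ref/ref, ref/alt, alt/alt given a pileup of base calls.
# Priors and error model are simplified placeholder values.

def genotype_posteriors(pileup, ref, alt, error_rate=0.01, theta=0.001):
    """Return P(genotype | pileup) for a string of observed bases."""
    priors = {ref + ref: 1 - 1.5 * theta,  # hom-ref dominates a priori
              ref + alt: theta,            # het prior ~ expected heterozygosity
              alt + alt: 0.5 * theta}      # hom-alt (toy value)
    unnormalized = {}
    for gt, prior in priors.items():
        likelihood = 1.0
        for base in pileup:
            # P(observed base | genotype): average over the two alleles; a read
            # from allele x shows x with prob 1 - e, any other base with e/3.
            p = sum((1 - error_rate) if base == allele else error_rate / 3
                    for allele in gt) / 2
            likelihood *= p
        unnormalized[gt] = prior * likelihood
    total = sum(unnormalized.values())
    return {gt: v / total for gt, v in unnormalized.items()}

# A balanced pileup of A and G reads overwhelms the small het prior.
post = genotype_posteriors("AAAGGG", ref="A", alt="G")
```

Note how the data dominate: even though the heterozygous prior is only 0.001, six informative reads are enough to make the het genotype the clear posterior winner.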
GATK has been designed as a multi-threaded algorithm, which allows parallel computing to be leveraged to speed up run time. Golden Helix has selected Sentieon, a commercial provider of GATK-based tools, due to its superior performance in benchmark tests. Sentieon achieved major performance improvements by implementing the GATK methods in low-level C and assembler rather than the Java-based reference implementation. In addition, the company also implemented alignment (BWA) and deduping as part of an integrated secondary analysis solution for SNVs.
Stay tuned for my next post in this series which will give you an overview of some of the design principles of a CNV analytics framework for next-gen sequencing data!