After the wet lab process has been completed, the bioinformatics analysis of the sequencing data begins. The next three blog posts will each focus on one aspect of this process:
- The building blocks of a bioinformatics pipeline, documentation and validation (today’s topic)
- Quality Management (Part V)
- Clinical Reporting (Part VI)
The Building Blocks of an NGS Pipeline
The bioinformatics process to analyze NGS data occurs in three steps. First, we need to generate the sequence read file. Each read consists of a linear nucleotide sequence (e.g. ACTGGCA), with each nucleotide being assigned a numerical quality value that relates to its predicted accuracy. This step occurs within a DNA sequencer, which is commercially available from companies such as Illumina. All sequence reads are stored in a FASTQ file generated by the sequencer. FASTQ files contain the compilation of individual sequence reads, which are typically between 50 and 150 base pairs long. Depending on the selected coverage, a FASTQ file might contain millions or even billions of short reads. Generating the FASTQ file is also called “Primary Analysis”.
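To make the structure of a FASTQ record concrete, here is a minimal Python sketch. It assumes simple four-line records and the Phred+33 quality encoding (the usual convention for current Illumina output, but an assumption about your data); the names `Read`, `parse_fastq` and `error_probability` are illustrative and not part of any particular tool.

```python
from typing import Iterator, List, NamedTuple


class Read(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(text: str, offset: int = 33) -> Iterator[Read]:
    """Yield reads from FASTQ text made of simple four-line records."""
    lines = [line for line in text.splitlines() if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record starting at line {i + 1}")
        # Each quality character encodes a Phred score as ASCII code minus offset.
        yield Read(header[1:], seq, [ord(c) - offset for c in qual])


def error_probability(phred: int) -> float:
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# A single made-up record using the example sequence from the text.
example = "@read1\nACTGGCA\n+\nIIIII5+\n"
reads = list(parse_fastq(example))
```

Under this encoding, a score of 20 corresponds to an estimated 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000.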
Second, the sequences in the FASTQ file need to be aligned to the human genome reference sequence. This is a computationally expensive step. Optimal alignment of a single read is tractable with dynamic programming, but applying it to millions or billions of reads against a reference of roughly three billion bases is not feasible within a reasonable time frame, so practical aligners rely on indexing and heuristics. Hence, many different algorithms with different optimization goals are described in the literature. One of the standard aligners used in day-to-day practice is BWA (Burrows-Wheeler Aligner). The output of this step is a BAM file, which contains all the reads from the FASTQ file aligned to the reference.
The next part of this step is to identify the differences between the patient’s sequence reads and the reference sequence. These differences include single nucleotide variants (SNVs), small insertions and deletions (indels), copy number variations (CNVs) and other structural variations. A number of software tools are available. For example, the Broad Institute maintains GATK, a widely adopted variant caller. GATK can be used for germline variants, and a special version is available for analyzing cancer-related mutations. Together, alignment and variant calling are referred to as “Secondary Analysis”. The result is a VCF file that contains all identified variants of the patient sample.
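The idea of comparing aligned reads against the reference can be illustrated with a toy pileup in Python. This is a deliberately naive sketch and not how BWA or GATK work internally: it assumes the reads are already aligned without gaps, ignores base qualities, and only looks for single nucleotide differences. The function name `call_snvs` and its thresholds are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def call_snvs(
    reference: str,
    aligned_reads: List[Tuple[int, str]],  # (0-based start on reference, read bases)
    min_depth: int = 3,
    min_alt_fraction: float = 0.3,
) -> List[Tuple[int, str, str, int, int]]:
    """Return (1-based position, ref base, alt base, alt count, depth) tuples."""
    pileup: Dict[int, Counter] = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos < len(reference):
                pileup[pos][base] += 1

    calls = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call at this position
        ref_base = reference[pos]
        for base, count in counts.most_common():
            if base != ref_base and count / depth >= min_alt_fraction:
                calls.append((pos + 1, ref_base, base, count, depth))
    return calls


# Made-up reference and reads; three of five reads carry an A at position 5.
reference = "ACTGGCATTA"
reads = [(0, "ACTGG"), (1, "CTGAC"), (2, "TGACA"), (3, "GACAT"), (4, "GCATT")]
calls = call_snvs(reference, reads)
```

In this example the only call is a G to A change at position 5, supported by 3 of the 5 reads covering that position. Production callers additionally model sequencing error, mapping quality and local realignment, which is why their versions and parameters matter for the documentation discussed later.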
Third, we need to enrich the VCF file by annotating all of the variants based on information from public and private data sources. Essentially, this step interprets the patient sample by identifying variants that damage or functionally impact one or more genes relevant to the patient’s observed phenotype. The number of variants in a VCF file depends very much on the initial target region (e.g. gene panels, exome or genome sequences). It can range from hundreds of thousands to millions of variants. In order to reduce the number of variants to only those that have high clinical relevance, a combination of quality filters, population frequency data and functional predictions is typically used. In the case of exome or genome sequences within a family, we eliminate variants that are present in the unaffected family members. Based on the resulting set of variants, it is possible to conduct a variant prioritization. This step utilizes public and private databases, such as Online Mendelian Inheritance in Man (OMIM), to identify the variants that are known to be associated with disease. The steps outlined to annotate, filter and prioritize variants are often referred to as “Tertiary Analysis”.
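The filtering and prioritization logic described above can be sketched in a few lines of Python. Everything here is illustrative: the `Variant` fields, the threshold values, the gene names and the `disease_genes` set (a stand-in for a lookup against a database such as OMIM) are assumptions, and removing every variant seen in an unaffected relative is a simplification that does not cover all inheritance models.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Variant:
    key: str              # hypothetical identifier, e.g. "chr1:1000:G>A"
    gene: str
    quality: float        # variant quality score from the VCF
    population_af: float  # allele frequency in a population database
    impact: str           # predicted functional impact: HIGH, MODERATE or LOW


def filter_and_prioritize(
    variants: Iterable[Variant],
    unaffected_keys: Set[str],   # variants seen in unaffected family members
    disease_genes: Set[str],     # genes with a known disease association
    min_quality: float = 30.0,
    max_population_af: float = 0.01,
) -> List[Variant]:
    kept = [
        v for v in variants
        if v.quality >= min_quality
        and v.population_af <= max_population_af
        and v.impact in {"HIGH", "MODERATE"}
        and v.key not in unaffected_keys
    ]
    # Known disease genes first, then higher quality first.
    return sorted(kept, key=lambda v: (v.gene not in disease_genes, -v.quality))


# Made-up variants for illustration only.
proband = [
    Variant("chr1:1000:G>A", "GENE_A", 50.0, 0.0001, "HIGH"),
    Variant("chr2:2000:C>T", "GENE_B", 10.0, 0.0001, "HIGH"),      # low quality
    Variant("chr3:3000:T>C", "GENE_C", 60.0, 0.2, "MODERATE"),     # too common
    Variant("chr4:4000:A>G", "GENE_D", 45.0, 0.001, "LOW"),        # low impact
    Variant("chr5:5000:G>T", "GENE_E", 70.0, 0.0, "MODERATE"),     # in unaffected parent
    Variant("chr6:6000:C>G", "GENE_F", 80.0, 0.002, "MODERATE"),
]
shortlist = filter_and_prioritize(
    proband,
    unaffected_keys={"chr5:5000:G>T"},
    disease_genes={"GENE_A"},
)
```

Two variants survive the filters, and the one in the known disease gene is ranked first even though the other has a higher quality score.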
Laboratories are obligated to document all algorithms, software and databases used in the analysis, interpretation and reporting of NGS results. The overarching objective is to create a repeatable pipeline that produces consistent results. This is a tall order. In practice, it means that the version of every pipeline component must be described and recorded. The documentation of each component may include the baseline version, the default installation settings and a description of any customization, such as changed configuration parameters, alternative algorithms or custom code. Here are a few examples of what needs to be documented:
a) Alignment process: the reference sequence version number and assembly details, the alignment algorithm, the data transfer process from the sequencer to the data repository and any quality control parameters, such as the number of reads per sample
b) Variant calling: thresholds for read coverage depth, variant quality scores and allelic read percentages
c) Variant filtering: standard filtering workflows to determine recessive, dominant and de novo variants
These are only a few examples. Obviously, the documentation of the entire process can be very involved. We hear from our clients that their SOPs for this portion of their process can exceed a hundred pages.
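Alongside the written SOP, one way to record component versions and parameters is a machine-readable manifest. The Python sketch below is only an illustration of that idea: the `Component` structure, the parameter names and the version strings are placeholders, not recommended values or a required format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Component:
    step: str                     # e.g. "alignment", "variant_calling"
    name: str                     # tool, database or reference name
    version: str                  # exact version in use (placeholder values below)
    parameters: Dict[str, object] = field(default_factory=dict)


def build_manifest(components: List[Component]) -> str:
    """Serialize the component list to stable, human-readable JSON."""
    return json.dumps([asdict(c) for c in components], indent=2, sort_keys=True)


# Placeholder entries; versions and thresholds are hypothetical.
components = [
    Component("alignment", "BWA", "example-version",
              {"reference_assembly": "example-assembly", "min_reads_per_sample": 1000000}),
    Component("variant_calling", "GATK", "example-version",
              {"min_read_depth": 20, "min_variant_quality": 30, "min_allelic_fraction": 0.2}),
    Component("variant_filtering", "custom-filter-workflow", "example-version",
              {"models": ["recessive", "dominant", "de_novo"]}),
]
manifest = build_manifest(components)
```

Keeping such a manifest under version control next to the SOP makes it easier to show exactly which configuration produced a given result.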
It goes without saying that any bioinformatics pipeline needs to be validated. We have to ensure that the results of the pipeline are accurate and up to date with current best practices. Once the pipeline is implemented and documented, and the first empirical data suggests that it yields the intended results, it’s time to perform and document a comprehensive validation. So, how do we do this?
First, the pipeline is locked down. This means that no one within the laboratory modifies or alters any part of the pipeline in any way: no further optimization of parameters or settings, and no upgrades to tools, operating systems or databases.
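A lock-down is primarily a matter of process and access control, but a simple technical check can help detect accidental changes. The Python sketch below fingerprints a configuration with SHA-256 and compares it against the value recorded at lock-down; the configuration keys and version strings are hypothetical.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Return a SHA-256 digest of the configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical configuration recorded at the moment the pipeline is locked.
locked = {
    "aligner": {"name": "BWA", "version": "example-1"},
    "caller": {"name": "GATK", "version": "example-1", "min_read_depth": 20},
}
locked_fingerprint = fingerprint(locked)


def is_unchanged(current: dict) -> bool:
    """True if the current configuration matches the locked one."""
    return fingerprint(current) == locked_fingerprint


# The same settings in a different order still match; an upgrade does not.
reordered = {"caller": locked["caller"], "aligner": locked["aligner"]}
upgraded = {
    "aligner": {"name": "BWA", "version": "example-2"},
    "caller": locked["caller"],
}
```

Running `is_unchanged` before each validation run gives a quick signal that a tool version or parameter has drifted from the locked state.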
Second, the lab identifies a sufficient number of samples to determine the pipeline’s analytic and diagnostic sensitivity and specificity, as well as the reproducibility of known outcomes. Here is what to look for in this step:
- Define a number of samples with known outcomes
- Define the regions of a genome that need to be assessed
- Define the types of variants that need to be detected
- Measure the error rates for false positives and false negatives in variant calls. The laboratory needs to determine the error rates for several representative examples by variant type.
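The measurements in the list above can be computed by comparing the pipeline's calls with a set of known variants. The Python sketch below is a simplified illustration: variants are represented as made-up `(position, type)` pairs, and `assessed_sites` (the number of positions evaluated, needed to count true negatives) is an assumed input that a real validation would derive from the defined regions.

```python
from typing import Dict, Set, Tuple

Call = Tuple[str, str]  # (variant key, variant type), e.g. ("chr1:100", "SNV")


def summarize(truth: Set[Call], called: Set[Call], assessed_sites: int) -> Dict[str, dict]:
    """Compute per-type counts, sensitivity and specificity."""
    report = {}
    for vtype in sorted({t for _, t in truth | called}):
        known = {k for k, t in truth if t == vtype}
        found = {k for k, t in called if t == vtype}
        tp = len(known & found)   # known variants that were detected
        fp = len(found - known)   # calls that are not in the truth set
        fn = len(known - found)   # known variants that were missed
        tn = assessed_sites - tp - fp - fn
        report[vtype] = {
            "tp": tp, "fp": fp, "fn": fn,
            "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
            "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        }
    return report


# Made-up truth set and pipeline output for illustration only.
truth = {
    ("chr1:100", "SNV"), ("chr1:200", "SNV"), ("chr1:300", "SNV"), ("chr1:400", "SNV"),
    ("chr2:50", "indel"), ("chr2:90", "indel"),
}
called = {
    ("chr1:100", "SNV"), ("chr1:200", "SNV"), ("chr1:300", "SNV"), ("chr1:999", "SNV"),
    ("chr2:50", "indel"),
}
report = summarize(truth, called, assessed_sites=1000)
```

Reporting the counts separately per variant type matters because a pipeline that performs well on SNVs may still miss a large share of indels, as in this example.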
In the end, the pipeline needs to be able to reproduce known outcomes. After any changes have been made, the updated pipeline must again be able to reproduce the known results. Any improvements in the sensitivity and accuracy of variant detection should be documented, as they reset the testing standards within the lab.