An Example of an Integrated Clinical Workflow for CNVs and SNVs
In this blog series, I discuss the architecture of a state-of-the-art secondary analysis pipeline that detects single nucleotide variations (SNVs) and copy number variations (CNVs) in one test using next-gen sequencing. In Part I, we reviewed genetic variation in humans and looked at the key components of a systems architecture supporting this kind of analysis. Part II reviews how algorithms such as GATK are used to call single nucleotide variations. Part III gives an overview of some of the design principles of a CNV analytics framework for next-gen sequencing data. Part IV shows some examples of how a CNV caller identifies CNVs. Finally, Part V shows what an integrated clinical workflow looks like.
Let’s start with a specific case. A patient’s DNA was sequenced using the TruSight Cardio Sequencing Kit from Illumina to identify causal variants implicated in inherited cardiac conditions (ICCs). The panel targets 174 genes with known associations to 17 ICCs, and this NGS gene panel test produced over 2,000 variants. To evaluate variants of clinical relevance, we must develop a workflow that:
- Identifies mutations in regions targeted by the gene panel
- Filters out low quality variants
- Identifies variants that are classified as pathogenic with a predicted missense or loss of function effect
The filtering steps of this workflow can be seen in figure 1.
The first filter card removes variants that lie outside the regions sequenced in the gene panel. Such variants are not the focus of this test and are unlikely to be covered by enough reads to be high quality, so they should be removed. The 421 remaining variants are then passed through a second filter, which keeps the variants that the variant caller classified as “PASS”. Next, using data from the VCF file created by the variant caller, filter card 3 keeps variants with a read depth greater than or equal to the specified threshold of 100. This threshold is an example of a value tuned during test validation to achieve the desired trade-off between sensitivity and specificity. These filtering steps produce a list of 328 variants.
The next step in the filter chain categorizes variants by Sequence Ontology effect prediction and keeps only those predicted to cause a loss of function or a missense change, resulting in 133 variants. Finally, to keep only variants of clinical relevance, a filter is applied on whether a variant is annotated in the ClinVar database as “Pathogenic” or “Likely Pathogenic”. This reduces the number of variants to 6.
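The first three filter cards can be sketched in Python as follows. This is a minimal illustration, not VarSeq's implementation: the `Variant` record, the `in_target` helper, the example region and the example variants are all hypothetical, and real input would come from a parsed VCF file and the panel's target regions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int            # 1-based position, as in a VCF record
    filter_status: str  # value of the VCF FILTER column
    depth: int          # read depth, e.g. the DP field


def in_target(variant, regions):
    """True if the variant lies in any (chrom, start, end) region (1-based, inclusive)."""
    return any(
        chrom == variant.chrom and start <= variant.pos <= end
        for chrom, start, end in regions
    )


def quality_filter_chain(variants, regions, min_depth=100):
    """Apply the three filters in order and return the survivors of each step."""
    on_target = [v for v in variants if in_target(v, regions)]
    passed = [v for v in on_target if v.filter_status == "PASS"]
    deep_enough = [v for v in passed if v.depth >= min_depth]
    return on_target, passed, deep_enough


# Made-up example data: one target region and four variants.
regions = [("chr11", 47_331_000, 47_353_000)]
variants = [
    Variant("chr11", 47_340_000, "PASS", 250),     # kept by all three filters
    Variant("chr11", 47_341_000, "PASS", 40),      # dropped: depth below 100
    Variant("chr11", 47_342_000, "LowQual", 300),  # dropped: not PASS
    Variant("chr1", 1_000_000, "PASS", 500),       # dropped: off target
]
on_target, passed, kept = quality_filter_chain(variants, regions)
print(len(on_target), len(passed), len(kept))  # prints: 3 2 1
```

Returning the survivors of every step, rather than only the final list, makes it easy to report the variant count after each filter card, as the workflow above does.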
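The effect and ClinVar filters could be sketched in Python like this. The set of Sequence Ontology terms counted as loss of function is an assumption made for illustration, as are the `is_reportable` helper and the example variants. A validated test would define its own list.

```python
# Sequence Ontology terms treated here as loss of function or missense.
# This particular set is an assumption for illustration.
LOF_OR_MISSENSE = {
    "missense_variant",
    "stop_gained",
    "frameshift_variant",
    "splice_donor_variant",
    "splice_acceptor_variant",
}
CLINICALLY_RELEVANT = {"pathogenic", "likely pathogenic"}


def is_reportable(effect, clinvar_class):
    """Keep a variant only if both the effect and the ClinVar class qualify."""
    return (
        effect in LOF_OR_MISSENSE
        and clinvar_class is not None
        and clinvar_class.strip().lower() in CLINICALLY_RELEVANT
    )


# Made-up annotated variants: (identifier, SO effect, ClinVar classification).
annotated = [
    ("var1", "missense_variant", "Pathogenic"),
    ("var2", "synonymous_variant", "Pathogenic"),
    ("var3", "stop_gained", "Uncertain significance"),
    ("var4", "frameshift_variant", "Likely pathogenic"),
    ("var5", "missense_variant", None),  # no ClinVar record
]
reportable = [vid for vid, effect, cls in annotated if is_reportable(effect, cls)]
print(reportable)  # prints: ['var1', 'var4']
```

The classification is compared case-insensitively so that “Likely Pathogenic” and “Likely pathogenic” are treated alike.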
Copy Number Variants (CNVs) provide critical evidence for many genetic tests run in a clinical lab. Along with the small variants, NGS data can also be used to call CNVs, providing extra value for data you may already have and discovering events that may not be captured by any of your existing testing technology.
The CNV algorithm implemented in VarSeq uses sample-level coverage statistics to detect CNV events. The coverage data for a given region is normalized against the same region in closely matched controls, and then metrics from this comparison (Z-score and Ratio) can be used to call single or double copy losses (deletion) or copy gain (duplication) CNV events.
The Z-score for a target measures the number of standard deviations a sample’s coverage is from the mean reference sample coverage, while the Ratio is the target coverage divided by the mean reference sample coverage.
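As a rough sketch, the two metrics can be computed for a single target as shown below. This assumes the coverage values are already normalized and uses the sample standard deviation. The `classify` thresholds are hypothetical placeholders, not VarSeq's calling rules.

```python
from statistics import mean, stdev


def target_metrics(sample_cov, reference_covs):
    """Return (z_score, ratio) for one target region."""
    ref_mean = mean(reference_covs)
    ref_sd = stdev(reference_covs)  # sample standard deviation (an assumption)
    z_score = (sample_cov - ref_mean) / ref_sd
    ratio = sample_cov / ref_mean
    return z_score, ratio


def classify(z_score, ratio, z_cut=3.0):
    """Toy classifier; every threshold here is a hypothetical placeholder."""
    if z_score <= -z_cut and ratio < 0.25:
        return "double copy loss"
    if z_score <= -z_cut and ratio < 0.75:
        return "single copy loss"
    if z_score >= z_cut and ratio > 1.25:
        return "copy gain"
    return "no call"


# Made-up normalized coverage for one target in five reference samples.
reference = [100, 110, 90, 105, 95]
z, ratio = target_metrics(50, reference)
print(round(z, 2), ratio, classify(z, ratio))  # prints: -6.32 0.5 single copy loss
```

A ratio near 0.5 is what one would expect when one of two copies is lost, as in a heterozygous deletion. Requiring a large Z-score as well guards against calling events in targets where the reference samples are themselves highly variable.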
In our example, the patient DNA was sequenced using the TruSight Cardio Sequencing kit. Variant filtration and annotation in VarSeq resulted in 6 variants classified as “Pathogenic” in the ClinVar database. In addition to the 6 pathogenic variants, the patient was found to possess a heterozygous deletion of exons 12-14 of the MYBPC3 gene. This gene is involved in production of cardiac myosin binding protein C (cardiac MyBP-C), which is found in heart (cardiac) muscle cells. Mutations in the MYBPC3 gene are a common cause of familial hypertrophic cardiomyopathy, accounting for up to 30 percent of all cases. Although some people with familial hypertrophic cardiomyopathy have no obvious health effects, all affected individuals have an increased risk of heart failure. Hence, this CNV event is important to report.
Now that a short list of variants and CNVs has been obtained, the clinician can further enrich the variant data set with additional clinically relevant content from OMIM and the ExAC database. OMIM is a comprehensive database of human genes and genetic phenotypes for all known Mendelian disorders and over 15,000 genes. OMIM is updated daily with new literature, cross-references and hand-written descriptions and interpretations of genes, phenotypes and variants. The OMIM annotation source is deeply integrated into VSReports. Of these six variants, a single variant was associated with Hereditary Hemochromatosis, which is flagged as a primary finding. In addition, the heterozygous deletion CNV event in the MYBPC3 gene should be reported as well (figure 3).
Archiving and Accessing Genomic Data
Now that we have developed a workflow and created a clinical report containing actionable data, we can store this information in our genetic data warehouse platform, VSWarehouse. VSWarehouse is a scalable, multi-project warehouse for NGS variant call sets, clinical reports and catalogs of variant assessments, allowing labs to draw insights from large volumes of historical data. VSWarehouse provides a platform to labs and healthcare professionals for result interpretation, result retrieval and result re-interpretation (see figure 4).
As new samples get uploaded into the VSWarehouse, the software generates a fully merged matrix of unique variants across both new and existing samples. Additionally, the software runs all algorithms and annotations that are configured with the project.
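To make the idea of a merged matrix concrete, here is a small Python sketch. It is not how VSWarehouse is implemented: the `merge_call_sets` function, the variant key layout and the sample data are assumptions chosen for illustration.

```python
def merge_call_sets(call_sets):
    """Build a variant-by-sample matrix from per-sample calls.

    call_sets maps sample name -> {variant key: genotype}, where a variant
    key is (chrom, pos, ref, alt). Samples without a call get None.
    """
    samples = sorted(call_sets)
    variant_keys = sorted({key for calls in call_sets.values() for key in calls})
    return {
        key: {sample: call_sets[sample].get(key) for sample in samples}
        for key in variant_keys
    }


# Made-up data: one existing sample and one newly uploaded sample.
existing = {"sampleA": {("chr1", 100, "A", "G"): "0/1"}}
new = {"sampleB": {("chr1", 100, "A", "G"): "1/1", ("chr2", 200, "C", "T"): "0/1"}}
matrix = merge_call_sets({**existing, **new})
print(len(matrix))  # prints: 2
```

Each unique variant appears once as a row, with one column per sample, so the variant shared by both samples is stored a single time.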
As a clinical lab generates new samples that pathologists analyze with NGS pipelines, a workflow should be in place that enables secure clinical research and auditable use of genetic data. For instance, because the public genomic databases used to annotate and classify variants change continuously, a variant that was classified as “unknown significance” could later be deemed “pathogenic” as annotation sources are updated. These updates could alter the diagnosis and treatment selection for a patient as time progresses (see figure 7).
In addition to tracking variant classification, VSWarehouse puts a system into place that allows clinicians to revisit clinical reports that have been generated in the past. The indexed reports are saved on the VSWarehouse server and can be queried at the variant or sample level. The rendered reports are hosted on the server and are ready for download or integration with other internal systems.
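One simple way to picture this kind of tracking is to compare variant classifications between two versions of an annotation source. The sketch below is an assumption-laden illustration: the `find_reclassified` helper and the variant identifiers are hypothetical, and it does not describe VSWarehouse internals.

```python
def find_reclassified(previous, current):
    """Return {variant: (old_class, new_class)} for variants whose class changed.

    Both arguments map a variant identifier to its classification in one
    version of an annotation source.
    """
    return {
        variant: (previous[variant], current[variant])
        for variant in previous
        if variant in current and previous[variant] != current[variant]
    }


# Made-up classifications from two annotation versions.
older = {"varA": "unknown significance", "varB": "benign", "varC": "pathogenic"}
newer = {"varA": "pathogenic", "varB": "benign", "varD": "likely pathogenic"}
changes = find_reclassified(older, newer)
print(changes)  # prints: {'varA': ('unknown significance', 'pathogenic')}
```

Variants present in only one version (`varC` and `varD` here) are ignored by this sketch. A production workflow would likely want to flag those as well.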
Projects hosted on VSWarehouse can be used as annotation sources in VarSeq and integrated into your custom variant annotation and interpretation workflow. This allows any new variant to be annotated, and potentially filtered, with the frequency of that variant in your warehouse projects. The annotations are versioned with the projects. This means that, just like our public annotations hosted on the cloud, you can always reproduce your analysis by continuing to use the exact same version, or choose when to update to the latest version, which may have more samples. In figure 8, the primary findings result from a prior TruSight Cardio Hereditary gene panel test that was used as an annotation source in VarSeq. In this instance, four variants from our prior clinical report matched the current dataset. This feature allows clinicians to compare and contrast findings, interpretations, variant classifications and recommendations across patients.
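Annotating a new variant with its frequency in a warehouse project might look like the following sketch. The `carrier_fraction` function and the data layout are hypothetical. The sketch counts the fraction of samples carrying the variant, which is a simplification of a true allele frequency.

```python
def carrier_fraction(warehouse_calls, variant_key):
    """Fraction of warehouse samples in which the variant was called.

    warehouse_calls maps sample name -> {variant key: genotype}.
    """
    if not warehouse_calls:
        return 0.0
    carriers = sum(1 for calls in warehouse_calls.values() if variant_key in calls)
    return carriers / len(warehouse_calls)


# Made-up warehouse project with four samples.
shared = ("chr1", 100, "A", "G")
rare = ("chr2", 200, "C", "T")
warehouse_calls = {
    "s1": {shared: "0/1"},
    "s2": {shared: "1/1", rare: "0/1"},
    "s3": {shared: "0/1"},
    "s4": {},
}
annotations = {key: carrier_fraction(warehouse_calls, key) for key in (shared, rare)}
print(annotations[shared], annotations[rare])  # prints: 0.75 0.25
```

A variant seen in most previous samples is more likely to be a common polymorphism or a systematic artifact, which is why an in-house frequency is a useful filter alongside public population databases.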
For this example, VarSeq makes it easy to get to a clinically actionable decision. The filters described above are adjustable, but this gives an idea of what is possible. The workflow demonstrated here is just one example of many different options available in VarSeq for evaluating gene panel data.