Accurate Annotations: Updates to the NHLBI Exome Sequencing Project Variant Catalog

Since its first release in early 2012, the population frequencies from the GO Exome Sequencing Project (ESP) of the National Heart, Lung and Blood Institute (NHLBI) have been a staple of the genomic community. With the recent release of ExAC exome variant frequencies, the ESP has been surpassed as the largest publicly available cohort of variant frequencies (by nearly an order of magnitude). Yet the NHLBI cohort, with its years of maturity and many citations, remains a widely used annotation source.

Over the years, we have kept up to date with the latest version of the ESP catalog’s public releases. Recently, with their ESP6500SI-V2-SSA137 release, NHLBI has also provided mappings for the new human reference genome, making it the first major frequency catalog available for GRCh38!

While updating to this release, the Golden Helix data curation team has also made a number of improvements over the raw released data to make our ESP6500 track vastly more useful:

  • Break out the monolithic genotype counts fields into individual numeric fields for heterozygous and homozygous counts.
  • Compute an Alt Allele Frequency field alongside the provided Minor Allele Frequency, to easily spot the “ref-is-minor” cases and be comparable to other catalogs such as 1000 Genomes and ExAC.
  • Support multi-allelic sites, using our genotype aware allelic splitting and left-align technique to provide focused records with relevant allele and genotype counts for the variants present in your data.

Frequencies and Counts: Enabling Flexible and Powerful Filters

The NHLBI population catalog was the first to provide counts of the genotypes observed in the cohort, as opposed to just counts and frequencies of individual alleles. However, using those counts has always required interpreting a dense monolithic field notation (such as “TT=6426/TC=72/CC=0” from our example below). This may be fine when inspecting a single variant to find out “does this rare variant ever occur in a homozygous state in the population?” but it doesn’t allow that useful question to be asked of all variants or used as a filter.
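To make the idea concrete, here is a minimal Python sketch of breaking that notation into numeric fields. It is an illustration only, not the code used to build the track: the function name and output keys are made up for this example, and it assumes single-base (SNP) genotypes written as two letters, like the string quoted above.

```python
def parse_genotype_counts(gts: str, ref: str) -> dict:
    """Split a 'TT=6426/TC=72/CC=0' style string into numeric counts.

    Assumption: every genotype is two single-base alleles (SNPs only).
    """
    result = {"hom_ref": 0, "het": 0, "hom_alt": 0}
    for item in gts.split("/"):
        genotype, sep, count = item.partition("=")
        if not sep or len(genotype) != 2:
            raise ValueError(f"unexpected genotype entry: {item!r}")
        n = int(count)
        first, second = genotype
        if first != second:
            result["het"] += n          # two different alleles
        elif first == ref:
            result["hom_ref"] += n      # two copies of the reference
        else:
            result["hom_alt"] += n      # two copies of a non-reference allele
    return result


# The example from this post, where the reference allele is C.
counts = parse_genotype_counts("TT=6426/TC=72/CC=0", ref="C")
print(counts)
```

Once the counts are plain integers, a question such as “is this variant ever homozygous in the cohort?” becomes a simple numeric comparison that can be applied to every variant.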

The new fields for the NHLBI ESP6500 track and their documentation. There are now a number of allele and genotype count (GTC) fields available for filtering.


By default, NHLBI provides just the minor allele frequency (MAF) for each variant. If the reference allele is the minor one, the reported frequency is therefore that of the reference allele, not the alternate (non-reference) allele. The MAF is a great field to use when filtering. For example, the filter “keep variants with a MAF < 0.01” keeps variants with an Alternate Allele Frequency (AAF) < 0.01 as well as those with an AAF > 0.99. If your sample carries the reference allele and the reference is minor, that allele may well be functional, so this is the desired behavior.

But it is often still useful to spot the “ref-is-minor” cases easily and potentially remove them, so we have also added the more common AAF to the new track.
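The relationship between the two frequencies can be sketched in a few lines of Python. This is a simplified illustration for a biallelic site, using the genotype counts from the example in this post; the function name and returned keys are assumptions for this sketch, and frequencies are expressed as fractions rather than percentages.

```python
def allele_frequencies(hom_ref: int, het: int, hom_alt: int) -> dict:
    """Derive AAF and MAF from genotype counts at a biallelic site."""
    ref_count = 2 * hom_ref + het   # each heterozygote carries one ref allele
    alt_count = 2 * hom_alt + het   # ...and one alt allele
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no genotypes observed")
    aaf = alt_count / total
    maf = min(aaf, 1.0 - aaf)
    return {"aaf": aaf, "maf": maf, "ref_is_minor": aaf > 0.5}


# TT=6426/TC=72/CC=0 with reference allele C: 72 het, 6426 hom alt.
freqs = allele_frequencies(hom_ref=0, het=72, hom_alt=6426)
keep_by_maf = freqs["maf"] < 0.01   # rare by minor allele frequency
keep_by_aaf = freqs["aaf"] < 0.01   # rare by alternate allele frequency
print(freqs, keep_by_maf, keep_by_aaf)
```

Here the MAF filter keeps the variant while the AAF filter drops it, which is exactly the “ref-is-minor” situation that having both fields makes easy to spot.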

NHLBI Prev and New Formats

Select fields from the ESP6500 annotations (previous version on top, current on bottom). This variant is a “ref-is-minor” case, where the ref (C) allele occurs only 72 times (from 72 heterozygous samples).

Getting the Right Numbers

While the majority of variants are straightforward to annotate, valuable data is lost and filters will lack precision without handling the edge cases of multi-allelic sites correctly, both in these population catalogs and in your samples.

Multi-allelic sites are those where more than one alternate (non-reference) allele is detected. This can be as simple as seeing three alleles for a SNP (A/C/T for example) but also includes loci where different insertions and deletions occur at the same reference base.

A complex multi-allelic site is now broken down into all allele pairs with observed genotypes in the cohort, each with precise and useful counts for filtering. In this case, the insertion of a T was originally encoded as a T -> TT, and is now correctly annotated to the sample’s variant.
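As a rough illustration of what “focused records” means, the Python sketch below splits the genotype counts of a multi-allelic SNP into one record per alternate allele. It is a simplified stand-in, not the curation team's actual splitting and left-align technique: it ignores indel normalization entirely, and the function name, record keys, and example counts are all invented for this sketch.

```python
def split_multiallelic(ref: str, alts: list, genotype_counts: dict) -> list:
    """Build one record per alt allele from counts keyed by allele pairs."""
    total_alleles = 2 * sum(genotype_counts.values())
    records = []
    for alt in alts:
        het = hom = 0
        for (a, b), n in genotype_counts.items():
            copies = (a == alt) + (b == alt)
            if copies == 1:
                het += n    # one copy of this alt (with ref or another alt)
            elif copies == 2:
                hom += n    # homozygous for this alt
        alt_count = het + 2 * hom
        records.append({
            "ref": ref,
            "alt": alt,
            "het": het,
            "hom_alt": hom,
            "alt_allele_count": alt_count,
            "aaf": alt_count / total_alleles if total_alleles else 0.0,
        })
    return records


# Made-up counts for an A/C/T site with reference allele A.
example_counts = {
    ("A", "A"): 100, ("A", "C"): 10, ("C", "C"): 1,
    ("A", "T"): 5, ("C", "T"): 2, ("T", "T"): 0,
}
for record in split_multiallelic("A", ["C", "T"], example_counts):
    print(record)
```

Each record carries only the counts relevant to one alternate allele, so a sample carrying the C allele is compared against the C counts rather than a blend of everything seen at the site.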


Ready to Rumble

Whether you have been waiting for this moment to start using the new reference sequence for analysis, or would like to take advantage of population genotype counts, this new release of NHLBI ESP6500 has a lot to be excited about.

It’s available today in Golden Helix’s unmatched public annotation catalog, and can be visualized in our free GenomeBrowse tool, as well as integrated into your variant analysis with VarSeq and SNP & Variation Suite.

Note: our next version of VarSeq will come bundled with updated trio-analysis and hereditary gene panel starter project templates that are built with this new track!
