On my flight back from this year’s Molecular Tri-Conference in San Francisco, I couldn’t help but ruminate over the intriguing talks, engaging round table discussions, and fabulous dinners with fellow speakers. And I kept returning to the topic of how we aggregate, share, and update data in the interest of understanding our genomes.
Of course, there were many examples of each of these topics given by speakers and through the many conversations I had. The ENCODE project’s massive data output is illuminating the functional value of the genome outside protein coding genes. The CHARGE consortium, with its deeply phenotyped and heavily sequenced cohort of 14,000 individuals, will take a step forward in our understanding of the genome as large as those made by the HapMap and 1000 Genomes Project.
Meanwhile, Dr. Valerie Schneider of NCBI kicked off AGBT with a talk titled “Taking Advantage of GRCh38”. Certainly there is a lot to show for the work of the Genome Reference Consortium in the five years since GRCh37 (hg19) was released. Most significantly, there is a 3% increase in protein-coding transcripts, incorporation of the rCRS mitochondrial reference (already used by GRCh37_g1k), and a greater emphasis on alternate loci outside the MHC (haplotype sequences representing variation from the reference).
With all these fixes and improvements in GRCh38, Valerie concluded, there is no reason not to start using it for your research.
Except there is.
There are hundreds of terabytes’ worth of reasons: GRCh37 annotations, alignments, variants, databases, and reports.
The World of Genomics Is Measured by a 3,137,144,693bp Ruler
Let’s face it. Five years has been long enough that we think of GRCh37 as THE human genome. Everything that does not have a GRCh37 coordinate might as well not exist.
And you know what, that’s OK!
Or, maybe it’s not OK. Before I refute my own point, let me expand on that.
Having a single chromosome name and number (or pair of numbers) as a way to compare ANY data source to ANY other data source is amazing, powerful, intoxicating.
Being able to use a Genome Browser to visualize every genomic annotation and piece of sample data in a single linear plane, with the context of all those sources being absorbed by the strengths of human visual perception, is invaluable.
But it goes beyond that. We are in the middle of massive efforts to expand our understanding of the genome in new and unprecedented ways. ENCODE is annotating bits and pieces outside of genes that are entirely unrecognizable other than through their GRCh37 coordinates (there is no RefSeqGene-style database of mRNA-like records that captures these elements in a reference-free context).
|               | GRCh37  | GRCh38  |
|---------------|---------|---------|
| # of Variants | 331,824 | 319,442 |
Statistics from my exome, processed with the exact same pipeline on GRCh37 and GRCh38, show ~12K fewer variants on the newer assembly. The rise in Ts/Tv indicates fewer false positives against the newer human reference.
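For context, Ts/Tv is the ratio of transitions (purine↔purine or pyrimidine↔pyrimidine substitutions) to transversions; purely random substitution errors push the ratio toward 0.5, while genuine exome variants trend well above 2, so a higher ratio suggests cleaner calls. A minimal sketch of computing it from SNV calls (the variant tuples in the example are illustrative, not drawn from the table above):

```python
# Sketch: computing a transition/transversion (Ts/Tv) ratio from SNV calls.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref: str, alt: str) -> bool:
    """True if the substitution stays within purines or within pyrimidines."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def ts_tv_ratio(snvs) -> float:
    """snvs: list of (ref, alt) single-base substitutions."""
    ts = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ts
    return ts / tv if tv else float("inf")

# Illustrative call set: two transitions (A>G, C>T), one transversion (A>C)
print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))  # → 2.0
```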
The National Institute of Standards and Technology (NIST) has been coordinating a heroic community effort to create a baseline truth for a single cell line (NA12878) and the SNV and indel calls made on it (check out their Nature Biotechnology publication from this month).
All of this work results in GRCh37 coordinate data. Lots of it.
Imperfect Alternatives to Genomic Reference Based Annotations
We are facing the quintessential chicken and egg problem, and it usually takes a bold move of a large data producer to kick off the slow and painful transition. A case in point is when the 1000 Genomes Project chose GRCh37 over NCBI36.
But this transition, I argue, will be more painful, slower-moving, and with less certainty about the benefits of the new reference.
Two expected rebuttals to such a claim: What about LiftOver? What about databases stored in HGVS notations anchored on RNA transcripts and not DNA chromosomes?
For the first question, if your goal is to gain from the transition to GRCh38 in terms of reduced false positives or more sequence specificity, then simply porting the results of analytics on GRCh37 buys you nothing and may pick up artifacts of the coordinate transition.
For the second, HGVS plays an important role in persisting descriptions of DNA mutations that have spanned many reference sequence versions. But because it can be ambiguous, involves human judgment and approximation, and often omits a strongly versioned mRNA sequence, it is a manually intensive and error-prone method for porting annotations between genome assemblies.
For example, in preparation for my talk at TriCon, we downloaded ClinVitae, an aggregate variant database that describes variants only in HGVS notation, and, using Invitae’s UTA translator, were able to programmatically map only 75% of its approximately 80,000 variants to GRCh37. Examples of ones that failed to map include NM_003611.2:c.936-?_1129+?del, NM_012459.2:c.-1131_-1130delGA and NM_000364.3:c.418C>T. These may be resolvable with some human intervention, but they also demonstrate the inherent complexity of using a third coordinate space (that of an mRNA) to transition between two different chromosomal coordinate spaces.
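To illustrate one class of failure: a description like c.936-?_1129+?del has intentionally uncertain breakpoints, so no software can place it on a chromosome without human judgment. Below is a toy parser (emphatically not UTA, just a sketch of the grammar problem) that accepts only fully specified cDNA deletions; note that even descriptions it can parse, like the delGA example, may still fail later at the transcript-to-genome projection step:

```python
import re

# Sketch: a toy HGVS cDNA-deletion parser. It extracts a transcript and
# interval only when both breakpoints are fully specified; uncertain
# breakpoints ("936-?") and other variant types fall outside the grammar.

CDNA_DEL = re.compile(
    r"^(?P<tx>NM_\d+\.\d+):c\."
    r"(?P<start>-?\d+)(?:_(?P<end>-?\d+))?"
    r"del(?P<bases>[ACGT]*)$"
)

def parse_cdna_deletion(hgvs: str):
    """Return (transcript, start, end) or None if the description is
    ambiguous or not a simple cDNA deletion."""
    m = CDNA_DEL.match(hgvs)
    if not m:
        return None
    start = int(m.group("start"))
    end = int(m.group("end") or start)
    return (m.group("tx"), start, end)

print(parse_cdna_deletion("NM_012459.2:c.-1131_-1130delGA"))   # parses
print(parse_cdna_deletion("NM_003611.2:c.936-?_1129+?del"))    # → None
```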
When the time comes, I will have no nostalgia or qualms about dumping GRCh37 for GRCh38, or GRCh38 for whatever is on the horizon beyond that.
But in terms of getting one’s work done, it’s a lot easier to wait for the chicken and egg to work out between themselves their provenance and viability.
Well, not quite.
When matching your variant data to other variant catalogs, you may run into:
- Insertion/deletion allele prefixing
- Complex variant haplotypes versus allelic primitive variants
- Allelic matching (with strand correctness)
- Different legitimate representations of the same DNA haplotype as mutations to the reference.
While I’m not going to dive into the details of each of these right now (I will cover these in my webcast on Wednesday), I don’t think the last point can be emphasized enough.
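As a small taste of the list above, here is a minimal sketch of allelic matching with strand correctness, assuming both records are simple (ref, alt) substitution pairs; a catalog entry recorded on the opposite strand only matches after reverse-complementing:

```python
# Sketch: strand-aware allele matching for simple substitutions.
# A catalog entry recorded on the minus strand will not string-match a
# plus-strand VCF record until one side is reverse-complemented.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def alleles_match(a: tuple, b: tuple) -> bool:
    """a, b: (ref, alt) allele pairs; match as-is or on the opposite strand."""
    if a == b:
        return True
    ref, alt = a
    return (revcomp(ref), revcomp(alt)) == b

# A plus-strand A>G call matches a minus-strand T>C catalog entry
print(alleles_match(("A", "G"), ("T", "C")))  # → True
```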
When Complete Genomics released their first public dataset in 2010, we saw grave discrepancies between their variants and those in 1000 Genomes due to a different preference of where to describe length polymorphisms in runs of repeated nucleotides.
About a year later, in the release notes of the 2.0.0 version of their bioinformatics pipeline, they noted:
Previous Assembly Pipeline versions used a right-shifted canonical form of indels. The current assembly pipeline now uses a left-shifted canonical form of indels, which is consistent with the recommended placement of variants for VCF format.
This alludes to the fact that the VCF 4.1 specification, released in 2011, added the following clarification to the description of the POS field:
When the position is ambiguous due to identical reference sequence, the POS coordinate is based on the leftmost possible position of the variant.
Those are very few words to clarify a very important and difficult-to-implement piece of the standard, but it shows that by this point we at least had a consensus on how to place ambiguous variants.
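Those few words hide a loop that is easy to get subtly wrong. A sketch of the standard left-alignment procedure (in the style later formalized by tools like vt normalize; positions are 0-based and `seq` is a local slice of the reference):

```python
# Sketch: left-aligning an indel against a reference sequence, per the
# VCF rule that POS is the leftmost possible position of the variant.

def left_align(seq: str, pos: int, ref: str, alt: str):
    changed = True
    while changed:
        changed = False
        # if ref and alt end in the same base, that base is not part of
        # the variant: drop it
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # an empty allele means we shifted past the event: extend both
        # alleles left with the preceding reference base
        if (not ref or not alt) and pos > 0:
            pos -= 1
            ref = seq[pos] + ref
            alt = seq[pos] + alt
            changed = True
    # trim shared leading bases, keeping one as the VCF anchor base
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# A right-shifted "TCT>T" deletion inside the CT repeat of GACTCTCTG
# left-aligns to "ACT>A" at position 1
print(left_align("GACTCTCTG", 5, "TCT", "T"))  # → (1, 'ACT', 'A')
```

SNVs pass through unchanged, and a right-shifted insertion in the same repeat run walks left the same way the deletion does.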
So We Agree? Left-Align, Forgive and Forget?
Even if every tool that writes a VCF file were to follow this detail of the specification correctly, and I can assure you that I have seen plenty of examples of tools that do not, it turns out that your VCF files are only the starting point.
Variants are almost entirely useless without the context of the many variant databases we can use to annotate them: dbSNP, 1000 Genomes, the NHLBI ESP6500, ClinVar, COSMIC, and computed functional predictions like SIFT and PolyPhen-2, to name a few.
While we have standardized on VCF, and VCF has standardized on “left-aligning” variants, our databases have inputs from many sources and historical contexts that do not.
This can lead to surprising results where the position of a variant through the dbSNP website (and UCSC Genome Browser) does not match the VCF file they produce.
Furthermore, since data is now being submitted en masse by projects like the 1000 Genomes Project, multiple entries for the same DNA mutation are starting to appear in dbSNP.
This shows up in some shockingly high-profile cases, such as the highest-frequency carrier mutation in Europeans: CFTR ΔF508.
This variant, present in one in thirty Caucasians, is historically described as the deletion of three bases at the 1521st coding position of CFTR (in HGVS notation: CFTR:c.1521_1523delCTT), which results in the loss of a phenylalanine (F) at the 508th amino acid of the encoded protein. ΔF508 is fully penetrant for cystic fibrosis when recessively inherited (the child has two copies), making CFTR the most heavily tested gene for this important carrier information.
But a VCF-style left-aligned call of this mutation is described one base to the left of the historical position recorded in the ΔF508 entry rs113993960, which was added in dbSNP build 132. Since the 1000 Genomes Project calls this variant in 12 samples, it has ended up in its left-aligned form in build 137 as rs199826652.
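To make the duplication concrete: both placements of the three-base deletion produce the identical edited sequence, which is exactly why naive position-plus-allele matching fails. The context string below is illustrative rather than the actual CFTR reference sequence:

```python
# Illustrative: two placements of a 3bp deletion in a repetitive context
# (modeled on CFTR ΔF508) yield the same resulting sequence. The context
# string is hypothetical, not the real CFTR reference.

ctx = "AATATCATCTTTGGTGTT"

def apply_deletion(seq: str, start: int, length: int) -> str:
    """Return seq with `length` bases removed starting at 0-based `start`."""
    return seq[:start] + seq[start + length:]

historical = apply_deletion(ctx, 8, 3)  # deletes "CTT" (rs113993960-style placement)
left_shift = apply_deletion(ctx, 7, 3)  # deletes "TCT", one base left (rs199826652-style)

print(ctx[8:11], ctx[7:10])             # → CTT TCT
print(historical == left_shift)         # → True
```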
Not only does this example explain a duplicate entry, but it also highlights the potentially catastrophic risk of a CFTR database not matching NGS data in VCF files when annotation is done purely on genomic coordinates.
Practice Zen Informatics: Complexity In, Clarity Out
No matter which genomic reference and coordinates you are using, it turns out there is no substitute for visualizing the local context at a given locus.
It is easy to spot the causes of wrongly annotated variants like the ones I listed earlier. You can find false-positives in your data or data duplication in variant catalogs.
But also importantly, you can see how multiple variants or annotations in a short distance may be interacting or exist as a single haplotype.
While we still call our NGS technology “short reads”, 75 to 150bp may be enough to construct confident haplotypes with multiple variants phased in a meaningful way.
In the webcast, I’ll cover an example in which this local context made me suspect a series of mismatches and a splice-site deletion that turned out to be incorrectly mapped due to some missing sequences in GRCh37. Spoiler alert: GRCh38 doesn’t fix this particular problem.
While this may sound discouraging, it’s only possible to have this many examples of edge cases because of the pure volume of data we are dealing with. In reality, the vast majority of variants are unambiguously defined and have no secret surprises.
But the devil is in the details, and I hope to keep building tools to keep the details from waving their pointy tail around.