In preparation for a webcast I’ll be giving on Wednesday on my own exome, I’ve been spending more time with variant callers and the myriad false positives one has to wade through to get to interesting, or potentially significant, variants.
So recently, I was happy to see a message in my inbox from the 23andMe exome team saying they had been continuing to work on improving their exome analysis and that a “final” analysis was now ready to download.
This meant I had both an updated “variants of interest” report as well as updated variant calls in a new VCF file. I’ll get to the report in a second, which lists rare or novel variants in clinically associated genes, but first let’s look at what changed in the variant calls.
New… and Improved?
At first blush, the variant files look quite different, as you can see in the first Venn diagram below comparing variants (both unique and common) between the original and new files. But after I applied a conservative filter (Genotype Quality (GQ) > 10 and Read Depth (DP) > 10) to the variants, things start to look less dramatic. So what is with all the new variants? It looks like many are just more aggressive variant calls. In fact, there were ten thousand variants with a read depth of just one (a single read) in the new file!
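A filter like this can be sketched with plain VCF line parsing (a minimal sketch; the field names GT, AD, DP, and GQ follow the standard VCF FORMAT conventions, and this assumes a single-sample file):

```python
def passes_filter(vcf_line, min_gq=10, min_dp=10):
    """Return True if the first sample's GQ and DP both exceed the thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt_keys = fields[8].split(":")                     # FORMAT column, e.g. GT:AD:DP:GQ
    sample = dict(zip(fmt_keys, fields[9].split(":")))  # first sample column
    gq = int(sample.get("GQ", 0))
    dp = int(sample.get("DP", 0))
    return gq > min_gq and dp > min_dp

# A variant supported by a single read fails the depth filter:
line = "chr6\t7585967\t.\tG\tGC\t30\tPASS\t.\tGT:AD:DP:GQ\t1/1:0,1:1:35"
print(passes_filter(line))  # False
```

In practice you would run this over every data line of the VCF and keep only the passing records before comparing call sets.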
From the second figure, those overlapping 79K variants have a 99.58% genotype concordance. Looking through the 328 discordant genotypes (the 0.42%), it looks like the majority are those notoriously hard-to-call InDel variants. Also, of those 12,882 variants unique to the new VCF file, 10,184 are InDels. More on that later…
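Genotype concordance over the shared variants can be computed with a simple set intersection (a sketch with made-up toy call sets; real keys would come from the VCF CHROM/POS/REF/ALT columns):

```python
def concordance(calls_a, calls_b):
    """Genotype concordance over variants present in both call sets.

    calls_a / calls_b map a variant key like ('chr6', 7585967, 'G', 'GC')
    to a genotype string like '0/1'.
    """
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return 0.0, []
    discordant = [k for k in shared if calls_a[k] != calls_b[k]]
    rate = 1 - len(discordant) / len(shared)
    return rate, discordant

old = {("chr1", 100, "A", "C"): "0/1", ("chr1", 200, "G", "T"): "1/1"}
new = {("chr1", 100, "A", "C"): "0/1", ("chr1", 200, "G", "T"): "0/1",
       ("chr2", 50, "T", "TA"): "1/1"}
rate, disc = concordance(old, new)
print(rate)  # 0.5
```

The discordant list is exactly what you would then eyeball for patterns, like the hard-to-call InDels above.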
Ok, so now let’s look at my updated “report”…
Oh no! I now have a scary red highlighted variant right at the top:
Well that was certainly not there in my first report. As I blogged about before, all my reported variants were missense variants, unlike this loss of function frameshift insertion!
(As an aside on a pet peeve, it is a bit lazy to report this variant the way they did. It’s actually an insertion of a C at position chr6:7585968. Of course, the VCF file encodes it as a G->GC mutation at chr6:7585967, but that’s because the VCF spec requires the REF field to contain at least one reference nucleotide. So insertions are essentially always left-shifted one base to include a reference base.)
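That left-shifting convention can be illustrated with a tiny helper (hypothetical; real tools look the anchor base up in the reference FASTA):

```python
def vcf_insertion(ref_base_before, pos_before, inserted):
    """Encode a pure insertion in VCF style.

    POS shifts left to the base before the insertion, REF is that
    reference base, and ALT is that base plus the inserted sequence.
    """
    return pos_before, ref_base_before, ref_base_before + inserted

# Inserting a C after the G at chr6:7585967 (i.e. at 7585968) becomes:
print(vcf_insertion("G", 7585967, "C"))  # (7585967, 'G', 'GC')
```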
Anyway, back to the point. What is going on with this new scary loss of function mutation in the DSP gene?
Naturally, the first thing I did was pull it up in GenomeBrowse to look at the evidence and context of the insertion.
There is a SNP in my exome near this location, but that’s not the variant we are looking for. In fact, my VCF file has that SNP too, and it is a very nice and clean homozygous variant (quite common and benign). No insertion in sight in the alignment!
Here is a table of the relevant variant information from the VCF file:
It’s important to note that the “LKB250_AD” field (the 8th column from left) contains the number of reads supporting the ref and alt alleles, respectively. For our clean homozygous C_C genotype on row one, “1,153” means that 1 sequence read contained the reference base and 153 supported the alternate “C” allele.
But my scary homozygous insertion (row 2) shows 153 reference bases and no reads supporting the insertion. Yet it was still called a homozygous variant!
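This pattern is easy to scan for programmatically: flag any called genotype whose AD field shows zero reads supporting the alternate allele (a sketch assuming a single alt allele per record, with GT and AD per the VCF FORMAT conventions):

```python
def is_phantom_call(gt, ad):
    """True if a non-reference genotype call has zero alt-supporting reads.

    gt: genotype string like '1/1'; ad: allelic depths like '153,0'
    (reads supporting ref and alt, respectively).
    """
    alleles = gt.replace("|", "/").split("/")
    if "1" not in alleles:
        return False          # not called as a variant at all
    depths = [int(d) for d in ad.split(",")]
    return depths[1] == 0     # variant called with no supporting reads

# The clean homozygous SNP: 1 ref read, 153 alt reads -> fine
print(is_phantom_call("1/1", "1,153"))   # False
# The scary insertion: 153 ref reads, 0 alt reads -> phantom
print(is_phantom_call("1/1", "153,0"))   # True
```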
I promptly sent an email off to 23andMe’s exome team letting them know about what is clearly a bug in the GATK variant caller. They confirmed it was a bug that went away after updating to a newer release. I talked to 23andMe’s bioinformatician behind the report face-to-face a bit at this year’s ASHG conference, and it sounds like it was most likely a bug in the tool’s multi-sample variant calling mode as this phantom insertion was a real insertion in one of the other samples.
Since there were 8,242 other InDels that match this pattern, I am most likely not looking at random noise but real “leaked” variants from other members of the 23andMe Exome Pilot Program. (Edit: After some analysis with a fixed version of GATK, Eoghan from 23andMe found that these genotypes were not leaked from other samples but completely synthetic.)
So not only was there a buggy variant in my exome, but there was the potential for personal variant genotypes to show up in other customers’ exomes! Of course, because all these phantom variants are at the “population” level, there is no way to see whom they belong to or even how common they are. So I think the data leakage here is not that concerning.
Bugs in Rapidly Changing Research Software? Why, of Course!
GATK is developed largely by team members of the 1000 Genomes Project. Their primary goal when they were founded was to call variants for the project on as many low-coverage whole-genome sequences as accurately as possible in an environment where every read counts. Their secondary goal was to share with the research community their technology and algorithms in a modular suite of open source software that support their publications (GATK stands for the Genome Analysis Toolkit, and it can do much more than just call variants).
With a research-focused mission, and with variant calling far from being a solved problem, it’s no surprise that each version of GATK introduces new features and tweaks to existing functionality with the possibility of bugs.
I talked to a speaker at ASHG who works on GATK to improve calling of MNPs (multi-nucleotide polymorphisms), and he mentioned a scenario where he saw GATK produce an obviously incorrect output, and as he tried to nail it down, it seemed to disappear. My engineers lovingly call these “Heisen Bugs” because they follow a pattern similar to the uncertainty principle, by which measuring or nailing down one aspect of the bug seems to morph it or make it temporarily disappear.
I don’t blame GATK for introducing bugs from one version to the next, or having an occasional Heisen Bug. Complex software has these issues by nature and preventing/fixing them is asymptotically difficult.
But because GATK has been used so prolifically in publications and is backed by the Broad Institute, it can be viewed as a “safe” choice. As small labs and clinical centers around the world are starting to set up their DNA-seq pipelines for gene panel and exome sequencing, they may choose GATK with the assumption that the output doesn’t need to be validated.
And that would be a mistake.
GATK is as susceptible to bugs as any complex software. Their new mixed licensing model (free for academic use, fee for commercial use) is intended to add more dedicated support resources to the team. I suggest they think about adding dedicated testers as well.
Clinical Certification is Fixing, Testing, and Not Tinkering
Coming from an environment of fast-paced innovation, constantly keeping up with the treasure trove of new public annotations and data sources, it first seemed archaic to me that clinical labs would freeze their informatics pipelines and update them only once a year, when they are willing to put in the ongoing effort to vet and accredit their entire stack of laboratory procedures to produce a test.
That seems crazy! For a whole year, they don’t get the benefits of the latest tools, the latest population catalogs, and the latest functional predictions!
Yes, but they also don’t get the uncertainty and risk of introducing bugs.
As a side note, as I’ve been comparing and analyzing the output of various secondary alignment and variant calling tools, I’ve collected quite a few examples like these. I will be presenting my lessons learned and common pitfalls to avoid as part of a short course at the 2013 X-Gen conference. So if you plan on attending X-Gen in San Diego this March, check out Short Course 4: Alignment and Assembly strategies.