GATK is a Research Tool. Clinics Beware.

         December 3, 2012

In preparation for a webcast I’ll be giving on Wednesday on my own exome, I’ve been spending more time with variant callers and the myriad of false-positives one has to wade through to get to interesting, or potentially significant, variants.

So recently, I was happy to see a message in my inbox from the 23andMe exome team saying they had been continuing to work on improving their exome analysis and that a “final” analysis was now ready to download.

This meant I had both an updated “variants of interest” report as well as updated variant calls in a new VCF file. I’ll get to the report in a second, which lists rare or novel variants in clinically associated genes, but first, let’s look at what changed in the variant calls.

New… and Improved?
At first blush, the variant files look quite different as you can see in the first Venn diagram below comparing variants (both unique and common) between the original and new files. But after I applied a conservative filter (Genotype Quality (GQ) >10 and Read Depth (RD) > 10) on the variants, things start to look less dramatic. So what is with all the new variants? It looks like many are just more aggressive variant calls. In fact, there were ten thousand variants with a read depth of just one (a single read) in the new file!

Variants present in the original and latest VCF file for my exome.

Variants after filtering out variants with
Read-Depth (RD) < 10 and Genotype Quality (GQ) < 10.

From the second figure, those overlapping 79K variants have a 99.58% genotype concordance. Looking through the 328 discordant genotypes (the 0.42%), it looks like the majority are those notoriously hard-to-call InDel variants. Also, of those 12,882 variants unique to the new VCF file, 10,184 are InDels. More on that later…

Ok, so now let’s look at my updated “report”…

Oh no! I now have a scary red highlighted variant right at the top:

The top of my “final” variant list report shows a new loss-of-function variant, not in my original report.

Well, that was certainly not there in my first report. As I blogged about before, all my reported variants were missense variants, unlike this loss of function frameshift insertion!

(As an aside on a pet peeve, it is a bit lazy to report this variant in the manner they did. It’s actually an insertion of a C at position chr6:7585968. Of course, the VCF file encodes it as a G->GC mutation at chr6:7585967, but that’s because the VCF spec requires there to be a reference nucleotide. So insertions are essentially always left-shifted one base to include a reference base.)

Anyway, back to the point. What is going on with this new scary loss of function mutation in the DSP gene?

Naturally, the first thing I did was pull it up in GenomeBrowse to look at the evidence and context of the insertion.

The site of the scary, loss of function insertion. There is clearly a common SNP (the highlighted bar) in my exome to the left of where the insertion is supposed to be, but definitely no insertion in sight!

There is an SNP in my exome near this location, but that’s not the variant we are looking for. In fact, my VCF file has that SNP too, which is a very nice and clean homozygous variant (quite common and benign). No insertion in sight at the alignment!

Here is a table of the relevant variant information from the VCF file:

The second row shows the buggy homozygous insertion. The third row contains an example of one of the other 8,242 bad InDel calls with the same pattern.

It’s important to note the “LKB250_AD” field (the 8th column from left) contains the number of reads that support the ref, alt genotype respectively. For our clean homozygous C_C genotype on row one, 1,153 means that 1 sequence read contained the reference base and 153 support the alternate “C” allele.

But my scary homozygous insertion (row 2) shows 153 reference bases and no reads supporting the insertion. Yet it was still called a homozygous variant!

I promptly sent an email off to 23andMe’s exome team letting them know what is clearly a bug in the GATK variant caller. They confirmed it was a bug that went away after updating to a newer release. I talked to 23andMe’s bioinformatician behind the report face-to-face a bit at this year’s ASHG conference, and it sounds like it was most likely a bug in the tool’s multi-sample variant calling mode as this phantom insertion was a real insertion in one of the other samples.

Since there were 8,242 other InDels that match this pattern, I am most likely not looking at random noise but real “leaked” variants from other members of the 23andMe Exome Pilot Program. (Edit: After some analysis with a fixed version of GATK, Eoghan from 23andMe found that these genotypes were not leaked from other samples but completely synthetic.)

So not only was there a buggy variant in my exome, but the potential for personal variant genotypes to show up in other customer’s exomes! Of course, because all these phantom variants are at the “population” level, there is no way to see whom they belong to or even how common they are. So I think the data leakage here is not that concerning.

Bugs in Rapidly Changing Research Software? Why, of Course!
GATK is developed largely by team members of the 1000 Genomes Project. Their primary goal when they were founded was to call variants for the project on as many low-coverage whole-genome sequences as accurately as possible in an environment where every read counts. Their secondary goal was to share with the research community their technology and algorithms in a modular suite of open source software that supports their publications (GATK stands for the Genome Analysis Toolkit, and it can do much more than just call variants).

With a research-focused mission, and with variant calling far from being a solved problem, it’s no surprise that each version of GATK introduces new features and tweaks to existing functionality with the possibility of bugs.

Another site of phantom variants. While I’m homozygous for a G insertion that is in dbSNP, there are surrounding phantom insertions most likely from other customer’s exome. This must be a highly mutated region.

I talked to a speaker at ASHG who works on GATK to improve calling of MNP (multi-nucleotide variants), and he mentioned a scenario where he saw GATK produce an obviously incorrect output and as he tried to nail it down, it seemed to disappear. My engineers lovingly call these “Heisen Bugs” because they follow a pattern similar to the uncertainty principle by which measuring or nailing down one aspect of the bug seems to morph it or make it temporarily disappear.

I don’t blame GATK for introducing bugs from one version to the next or having an occasional Heisen Bug. Complex software has these issues by nature and preventing/fixing them is asymptotically difficult.

But because GATK has been used so prolifically in publications and is backed by the Broad Institute, it can be viewed as a “safe” choice. As small labs and clinical centers around the world are starting to set up their DNA-seq pipelines for gene panel and exome sequencing, they may choose GATK with the assumption that the output doesn’t need to be validated.

And that would be a mistake.

GATK is as susceptible to bugs as much as any complex software. Their new mixed licensing model (free for academic, fee for commercial) is intended to add more dedicated support resources to the team. I suggest they think about adding dedicated testers as well.

Clinical Certification is Fixing, Testing, and Not Tinkering
Coming from the environment of fast-paced innovation and constantly keeping up with the treasure trove of new public annotations and data sources, it first seemed to me archaic that clinical labs would freeze their informatic pipeline for analysis and only update them once a year when they are willing to put ongoing effort into vetting and accrediting their entire stack of laboratory procedures to produce a test.

That seems crazy! For a whole year, they don’t get the benefits of the latest tools, the latest population catalogs, and the latest functional predictions!

Yes, but they also don’t get the uncertainty and risk of introducing bugs.

PS… As a side note, as I’ve been comparing and analyzing the output of various secondary alignment and variant calling tools, I’ve collected quite a few examples like these. I will be presenting my lessons learned and common pitfalls to avoid as part of a short course at the 2013 X-Gen conference. So if you plan on attending X-Gen in San Diego this March, check out Short Course 4: Alignment and Assembly strategies.

18 thoughts on “GATK is a Research Tool. Clinics Beware.

  1. Eoghan Harrington

    Hi Gabe,

    Thanks for the post, it was nice to meet you in person at ASHG. At 23andMe we followed up on this bug as soon as it was brought up in the community. The bug seems to have been brought up several times in the GATK communities however was only fixed in the GATK 2.2-2 release which unfortunately came out after we had sent back the data to you. We’ve rerun the DSP region using the newer version of GATK and the indel doesn’t show up in any individual – so it looks like the bug is fixed and that leakage from other pilot customers isn’t the issue in this case.

    We’vre in the process of rerunning all exomes using GATK 2.2-2 and will be sending out the updated results as soon as they’re ready.

    Thanks again,
    Eoghan

    Reply
    1. Gabe Rudy

      Thanks for the update Eoghan!

      Great meeting you at ASHG as well.

      Also great to know the InDels are not genotypes leaked from other pilot customers. I’ll update that paragraph to reflect that for sure.

      Look forward to updated results! I think you guys are do great work.

      Gabe

      Reply
  2. Peter Covitz

    This is a great example of the current state of the art of the field. Clearly this stuff is not ready for prime time in routine healthcare practice, and should fall under FDA medical device requirements once it is, as you suggest. But what I find interesting is the Direct-To-Consumer scenario that 23andMe represents. It is neither Research nor is it Healthcare. It is a consumer product. What is the right level of ‘protection’ for consumers? The debate about that question is really fascinating. I think blog posts like this one do a terrific job of informing that debate.

    Reply
    1. Gabe Rudy

      Hey Peter,

      Thanks for the comments. It’s true that many people forget to make the distinction between the consumer experience of 23andMe and what they are used to seeing in the research/clinical setting. It may be the same technology and even software, but it’s a very key distinction.

      Very good point about protection, as I don’t want to argue there might not be some appropriate level of consumer protection warranted. But given the pace of innovation and technological advancement, I’m in favor of the current approach to give consumers the benefit and responsibility of acquiring research-grade data.

      Reply
  3. Konrad Karczewski

    I’m glad you mention how crazy it seems not to use the latest software. Bugs are an unfortunate fact of this kind of software, but in a way, they’re not that much different from an inaccurate call due to an algorithm of lower sophistication. It’s unfortunate that in this case, the error was highlighted in red as a functional variant (and of course, interesting variants are going to be enriched for errors). But from a philosophical point of view, is that just as bad as not properly reporting a variant because the software didn’t pick it up? Of course, this all assumes proper understanding of the annotation and interpretation of variants (and occasionally, as you did, verification by painstaking manual inspection). I agree that freezing clinical pipelines is important, but at some level, we can also take control of our own data analysis. We’re developing some software to enable individuals to (hopefully) keep up with the moving target of these aligners and variant callers: http://www.stormseq.org

    Reply
    1. Gabe Rudy

      It’s definitely hard to see all the innovation going on and not want to have the absolute latest and best variant calls possible.

      But as I’ve talked to more folks running clinical labs, I’m beginning to understand the large differences between that “latest-is-greatest” attitude we have in research and the need for certainty and reproducability in the clinic.

      Even moreso, the types of variants they want for clinical reports are so slam-dunk obvious, you don’t really need more than the tools out there to call them. More importantly is having the databases of most-up-to-date clinical annotations about other lab’s work to classify those variants as pathogenic or benign. That’s where having feedback loops of near-realtime annotations adds real value (hopefully ClinVar takes us closer to that).

      But in contrast to exomes (which have huge variability in coverage), people doing amplicon resequencing on gene panels have absurdly clean and easy job of calling variants because their coverage is an even 200X over 100% of the target region. I’m not concerned about them needing the latest GATK to call germline mutations.

      Reply
      1. Konrad Karczewski

        I totally agree. In reality, our interpretation is way behind the calling anyway. I guess my concern is more on variants like indels or structural variants, that can arguably be more damaging than the non-synonymous SNP we find here and there, that some software doesn’t even try to call. Ignoring these is convenient, but maybe not the best strategy. No good answer though, I agree.

        Reply
  4. Daniel Nilsson

    I think you wonderfully pinpoint several current issues with variant calling in particular and genomics software development cycles in general.

    Some colleagues of mine caught a very similar problem some time back that could still cause this type of error even nowadays, bugfix or not. This occurred when additional “helper” genomes were included in calling for improved recalibration, and then subsequently removed to not be reported back to customer. Someone on the team had first taken a shortcut and not parsed the vcf genotype fields carefully, but only column stripped those vcf files. Thus rows for real variants present in another exome were included by GATK (perhaps UG at that point in time), very much as it was supposed to. As you say it was fixed, this is probably another issue: I’m not goint to say the 23andme team did the same thing mistake we did without having seen it first. But, to humor me, would you check to see if the genotype call for your indel in the vcf (not the graphical rendering) was really homozygous variant and not hom reference? The read counts would sort of point in the latter direction?

    Reply
    1. Gabe Rudy

      Hey Daniel,

      That’s another great example of leaking of variants, but as the 23andMe guy pointed out above, this bug turned out to be a different sort of beast.

      Here is the actual VCF line of the red-flagged variant:


      6 7585967 . G GC 30514.31 PASS FS=882.876;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENST00000379802;BaseQRankSum=66.62;SNPEFF_GENE_NAME=DSP;DS=True;DP=24821;culprit=HaplotypeScore;HaplotypeScore=941.6383;IndelType=INS.NumRepetitions_3.EventLength_1.RepeatExpansion_C.;AC=118;AB=1.0;MQRankSum=3.637;AF=0.3986;VQSLOD=7.7442;AN=296;MQ0=2;SNPEFF_EXON_ID=exon_6_7582875_7586950;SNPEFF_GENE_BIOTYPE=protein_coding;ReadPosRankSum=-22.946;InbreedingCoeff=-0.0102;HRun=0;SNPEFF_EFFECT=FRAME_SHIFT;MQ=58.55;QD=1.9 GT:AD:DP:GQ:PL 1/1:153,0:153:99.0:554,102,0

      Notice my genotype is 1/1 (i.e. alt/alt). So I’m not homozgyous reference.

      Reply
  5. Pingback: Expert tours his own exome, and finds mainly false alarms : Nature News Blog

  6. Geraldine Van der Auwera

    Hi Gabe,

    Great writeup — everyone in the GATK development team was very interested to read this! I personally love the “Heisen Bugs” term.

    We’re glad you brought up the point that GATK is research software. Some of the key features that make the GATK highly valuable as a research tool — the cutting-edge capabilities that no other software offers right now — have that double-edge, so to speak, that they’re experimental and subject to bugs. And overall, the GATK development cycle is very fast — we add new features, bugs come up, we patch and release updates. That’s why we ask people to always use the latest version; it’s the best way to combine having the latest capabilities but also the latest bugfixes.

    Of course this presents a problem for companies that use results in production, especially if they need to follow certified processes. And you are correct that the commercial version of GATK is essentially our answer to this problem. It’s not just about increased support, although that is definitely a big part of it. But the commercial version will be “frozen” at a certain point and stabilized for a much longer period of time than our regular releases. I believe the commercial version will be updated on a quarterly basis (our partners at Appistry are in charge of those details). This means that new features will be made available to commercial users once they are deemed mature enough, and the releases will be much “safer” than the regular “bleeding-edge” releases.

    Finally, one less-covered issue that does affect the quality of results is that people (or companies that offer analysis-as-service) sometimes decide they don’t need to apply certain steps in our Best Practices workflow (eg base quality score recalibration), usually because those steps are compute-intensive. Cutting corners like that is a Bad Idea but it’s more common than we’d like. You should be able to check the file headers to see what processing steps were applied.

    Reply
    1. Gabe Rudy

      Hey Geraldine,

      Thanks for the engaging response!

      It’s great to hear the GATK team perspective, and that you have a strategy to support more “stable” releases with your commercial partner versus tracking the fast pace of your development.

      Although, I’m sure even researchers would appreciate having a “stable” branch of the repo they could track that would not be aggressive about adding features as much as providing a solid platform to implement best-practice pipelines off of.

      Even keeping an up-to-date “Best-Practice Pipeline” with the latest version of GATK and supporting reference/variant files, which you recommend, is definitely not a small task. In fact, I would argue it is beyond most researchers’ capabilities with the limited bioinformatics support they have access to. (Hence folks like Seven Bridges Genomics have a shot at a business model in just providing a working GATK pipeline through a cloud-based environment).

      But I can see your point on leaning on Appistry to work that out, as it is a project in and of itself to define those freeze points and backport in bugfixes.

      Out of curiosity in understanding the gory details of this bug the 23andMe team ran into, I actually tried to see if I could find the commit on the GATK github repo that presumably fixed it somewhere before the 2.2-2 release. But the volume of commits got the better of me! There is definitely a fast-moving development team at work here!

      Reply
  7. Pepetideo

    Especially worrying considering the ginourmous license fee GATK wants to start charging for their software… I hate when a OpenSource software developed with and for the community decides to Close Source and go commercial.

    Reply
  8. Pingback: Understanding Translational Effects of Variants With SnpEff | bio+tech

  9. Pingback: Do I Really Want to See the Results of a Mail-In DNA Test? | 2ndPOWER

  10. Pingback: Do I Really Want to See the Results of a Mail-In DNA Test? « Music RSS

  11. Pingback: VarSeq: Making Variant Discovery and Gene Panels Easy | Our 2 SNPs…®

Leave a Reply

Your email address will not be published. Required fields are marked *