I recently curated the latest population frequency catalog from the 1000 Genomes Project onto our annotation servers, and I had very high hopes for this track. First of all, I applaud 1000 Genomes for the amount of effort they have put into providing the community with the largest set of high-quality whole-genome controls available.
My high hopes are well justified.
After all, the 1000 Genomes Phase 1 project completed at the end of 2010, and they have released their catalog of computed variants and corresponding population frequencies at least five times since.
In May 2011, they announced an interim release based only on the low-coverage whole genomes. This release was widely used, and one we also curated. Then in October 2011, their official first release was announced – an integrated call set that incorporated their exome data. Following that, version 1 was released and re-released three times throughout November and December 2011. In 2012 we saw version 2 in February, and finally version 3 was released in April.
But as it turns out, simply using the 1000 Genomes Phase1 Variant Set v3 as your population filter will fail to filter out some common and well-validated variants.
Before jumping in, I have to mention that simply stating in a paper or talk “we filtered by 1000 Genomes data” is entirely unreproducible. Not only does that fail to specify a version of the Phase 1 dataset, it also doesn’t say whether you used the Phase 1 release or one of the three preceding pilot datasets. Nor does it indicate which, if any, of the sub-population frequencies were used.
For this reason, Golden Helix names all of its tracks with specific source and version information and recommends that they be cited with these details.
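To show what I mean, here is a minimal sketch (in Python) of carrying source and version information through the filtering step so the result is citable. The track name, coordinates, frequencies, and threshold below are all invented for illustration:

```python
# A minimal sketch of version-aware population filtering. The track
# name, variants, and frequencies are hypothetical; the point is that
# the catalog version travels with the filtered result.

FREQ_TRACK = {
    "name": "1kG Phase1 Integrated Variant Set v3 (Apr 2012)",
    "frequencies": {
        ("1", 10505, "A", "T"): 0.12,   # hypothetical common variant
        ("1", 10611, "C", "G"): 0.002,  # hypothetical rare variant
    },
}

def filter_by_population(variants, track, max_af=0.01):
    """Keep variants absent from the track or at/below max_af, and
    return a provenance string naming the exact catalog used."""
    kept = [v for v in variants
            if track["frequencies"].get(v, 0.0) <= max_af]
    note = f"filtered against {track['name']} at AF <= {max_af}"
    return kept, note

candidates = [("1", 10505, "A", "T"), ("1", 10611, "C", "G"),
              ("1", 20000, "G", "A")]
kept, note = filter_by_population(candidates, FREQ_TRACK)
# kept drops only the common variant; note records catalog and threshold
```

The provenance string is the part a methods section needs: exactly which catalog, which version, and which threshold produced the filtered list.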
But why did the 1000 Genomes Project have so many releases of the same source data?
Well, they had very legitimate reasons.
So Why All These Versions?
In their recent publication in Genome Medicine, Gholson Lyon and Kai Wang discuss the challenges of finding causal variants in a clinical setting. While the challenges range from incomplete phenotyping to lack of decisive functional validation of putative variants, fundamental to any strategy of interpreting a list of variants is knowing whether each variant is novel to the individual or family, or how common it is in the population background.
The paper points out examples where variants were published partly on the notion that they were novel and thus likely to be impactful, only later to be discovered in the NHLBI-ESP project set of controls.
[This] underscores the importance of public databases on allele frequencies from large collections of samples, such as the 1000 Genomes Project and the NHLBI-ESP projects, in helping researchers and clinicians decide on the clinical relevance of specific variants in personal genomes.
So it makes sense that the 1000 Genomes Project revisits its collection of 1,092 samples, with low-coverage whole genomes and high-coverage exomes, to build up-to-date variant call sets that incorporate its latest algorithms for variant calling and quality filters.
Bigger and more accurate sets of controls will result in higher quality research by the community. Especially when the novelty of variants is a key factor in classification as an important functional variant, we want these controls to be as comprehensive as possible.
It is not only the size of your control database that matters, but also the choices you make in calling variants and balancing Type I and Type II errors in the QA filters of those calls.
What It Takes to Build a Variant List
While we probably haven’t seen the last bid to make a better variant calling algorithm, GATK has become the de facto industry best-practice choice. It has no rival in the task of simultaneously considering a thousand samples’ sequence alignments while making consensus genotype calls and associated likelihood scores.
Variant callers rarely suffer from a problem of sensitivity. If mismatches line up in a few reads, they can be interpreted as a heterozygous or homozygous variant at that locus.
The real challenge to a variant caller is dealing with all the ways in which errors (in sequence reads, repeated regions in the reference that confound mapping, small insertion/deletions that disrupt local alignments, and variability in coverage and capture technologies) can result in false-positive variants.
So it’s not surprising that most of the effort the 1000 Genomes Project describes in the READMEs that distinguish previous Phase 1 variant sets from newer ones went into systematically improving variant calling and post-calling filtering to remove these false positives.
In fact, in the case of InDels, they report that through orthogonal validation of their InDel calls and comparison to Complete Genomics public genomes for some shared samples, they discovered a number of systematic errors in the pipeline that produced a significant number of false positives:
This is an updated set from the version 2 (February 2012) of the 20110521 release.
~ 2.38M problematic indels have [been] identified and removed.
See here for more information.
Shaving with Too Sharp a Blade
So when testing this latest v3 variant track from 1000 Genomes, I decided to try filtering down the Complete Genomics 69 sample set to see how it performed.
Complete Genomics not only uses a completely different sequencing technology, it also has developed a set of custom algorithms for alignment and variant calling specialized to its own technology.
So I expected that filtering by 1000 Genomes would knock down most Single Nucleotide Variants (SNVs) and many small insertions and deletions, but leave behind more complex variants that can legitimately be described in multiple ways, or variants that favor one of the two technologies involved.
Instead, I was struck by this:
This is just one spot among many where 1000 Genomes v3 didn’t list what look like common variants in the Complete Genomics public genomes and its own earlier variant call sets.
How pervasive is this?
Well, there are definitely new variants added to the v3 call set. But the May 2011 version did not yet have InDels. Of the 2.7 million variants unique to version 3, 577,895 are insertions, 865,619 are deletions and 1,301,605 are SNVs. So for just the SNVs, there are roughly a million removed since the earlier release and a million added. Removing 1.1 million variants present in v1 might have cut too deep.
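This kind of release-to-release comparison boils down to set operations on variant keys. Here is a toy sketch (with invented variants, not real 1kG data) of tallying what is unique to a newer release by variant type:

```python
# Compare two call sets keyed by (chrom, pos, ref, alt) and tally what
# is unique to the newer release by variant type. Data is invented.

def variant_type(ref, alt):
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    return "insertion" if len(alt) > len(ref) else "deletion"

v1_calls = {("1", 100, "A", "G"), ("1", 200, "AT", "A")}
v3_calls = {("1", 100, "A", "G"), ("1", 300, "C", "CAA"), ("1", 400, "G", "T")}

unique_to_v3 = v3_calls - v1_calls        # added since the old release
removed_since_v1 = v1_calls - v3_calls    # culled by the new filters

counts = {}
for _chrom, _pos, ref, alt in unique_to_v3:
    t = variant_type(ref, alt)
    counts[t] = counts.get(t, 0) + 1
# counts tallies the additions by type; removed_since_v1 holds the cuts
```

At genome scale the sets live in VCF files rather than Python sets, but the added/removed bookkeeping is the same idea.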
But here is another important thing to keep in mind. dbSNP, the de facto catalog of all known variants, used the May 2011 interim variant set shown above to set the global allele frequency data for the variants in its catalog. While I’m sure in time they will update their attribute data with the latest frequencies, the pervasive use of dbSNP surely means people will assume the frequencies listed on their view of a SNP are the most trustworthy.
Still Waiting for the One Track to Rule Them All…
I wish I could recommend a single population variant track that could serve as the sole source of population variant frequency data for your study.
Like most things in bioinformatics, however, the answer is that the best tool for the job varies by the job.
For exome data, the NHLBI ESP variant set is invaluable. For whole genomes, a combination of this latest 1000 Genomes variant set and the CGI 69-sample variant set is probably a good start.
Of course, for working with small sample sets, augmenting population frequencies with control individuals run on the same platform and secondary analysis pipeline helps to dramatically reduce systematic false-positives from your variant list.
So while I’m pleased to see the continuous effort to build the one true variant frequency catalog, I’m afraid we aren’t quite there yet.
It is worth noting that the May 2011 set is quite different from the April 2012 set, not just because it only contains low-coverage SNPs, but also because the variant scoring process used to produce it was different from the process used to produce the sets released between October 2011 and April 2012.
For the three file sets released between October and April, no new sites were added; only sites we felt had a high FDR were removed.
Great blog post about the problems in the lack of overlap between Complete Genomics and 1000 Genomes data. I just had a paper published that addresses these same issues, which I think you and the Golden Helix team would find interesting: http://www.plosone.org/article/info:doi/10.1371/journal.pone.0040294
Thanks for the note, and I’m glad you found my post interesting.
I actually read your paper already as it has been consistently echoing through the twitterverse recently.
Great read! I definitely found the variant set comparison between 1kG and CGI very interesting.
In the case of the over-filtering 1kG did in their recent release, I wish I could get to the bottom of which filter culled these variants. I imagine it has something to do with being close to an indel, or sitting in some region with low mappability.
I understand why they might try to filter based on such a list of “trouble” regions.
I have been wondering whether some metric, like mappability, RepeatMasker regions, or large segmental duplications, lines up well with the regions enriched for systematic differences between CGI and 1kG, and is therefore very good at detecting problematic regions.
I would find such a validated tool to score or flag problematic regions very useful in analysis.
One approach I think looks very promising was in a recent publication called “Genomic Dark Matter,” where Hayan Lee basically simulates short reads from the reference genome to mimic a sequencing protocol or platform (like Illumina 100bp-PE) and then checks how accurately individual bases get covered by the aligned simulated reads.
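If I understand the approach correctly, the core idea can be sketched in a few lines. This toy version uses exact string matching in place of a real aligner, on an invented reference sequence:

```python
# Toy sketch of the simulated-read mappability idea: tile every read of
# a given length from the reference, "align" it by exact string search,
# and mark bases covered only by uniquely-mapping reads. A real
# pipeline would use an actual aligner and a sequencing error model.

def coverable_bases(reference, read_len):
    coverage = [0] * len(reference)
    n_starts = len(reference) - read_len + 1
    for start in range(n_starts):
        read = reference[start:start + read_len]
        # A simulated read "maps uniquely" if its sequence occurs
        # exactly once in the reference.
        hits = [i for i in range(n_starts)
                if reference[i:i + read_len] == read]
        if len(hits) == 1:
            for i in range(start, start + read_len):
                coverage[i] += 1
    return [c > 0 for c in coverage]

ref = "ACGTACGTTTTTTTTACGT"  # toy reference with a homopolymer run
covered = coverable_bases(ref, 4)
# Bases inside the repeat come back uncoverable at this read length
```

Running this over real chromosomes with realistic read lengths is essentially how you'd score which regions a given protocol can ever hope to call accurately.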
Anyway, these are just thoughts I’ve been having on finding ways to make these types of discrepancies seem more manageable and less scary, but there might not be an easy answer 🙂
Glad you enjoyed my paper. I am actually part of the 1KGP and it is a very interesting process seeing how they try to deal with things. Their conference calls are very interesting since I get to hear all of these topics discussed. Here is a synopsis of why I think the variant lists show up the way they do:
1. Calling variants from low-coverage (4x) sequencing is very hard and leads to tremendous variation among callers. The 1KGP has tried many different callers from different groups, and their results tend to vary more than one would want.
2. Up until now, SNPs, indels and SVs are called separately and then integrated afterwards. This is admittedly not the ideal way to call the variants, but currently, there are specialized callers for each type of variant and there is no caller that can accurately call SNPs, indels, MNPs and structural variants all at the same time.
3. The project is more concerned with reducing false positives than false negatives. Due to errors in the sequencing data, there could be evidence for a huge number of variants, and they want to avoid overcalling.
The project is definitely working on determining regions that are callable and not callable, but that work is still not ready for release. One current approach is to use decoy sequences representing highly repetitive sequence to catch reads that would otherwise map to a very large number of places in the genome and confuse the analysis.
Currently, they are working on phase 2 which involves the use of callers such as freebayes, cortex and a new version of GATK which do either haplotype-based calling or something akin to de novo assembly. I think that these callers will do a better job, but we are still comparing and assessing them.
I saw that paper from Hayan and I have had a few conversations with her and Mike Schatz about using their algorithm to produce metrics as to what read length is needed for different projects and whether paired-end is valuable. Illumina would always like people to do the longest paired-end reads possible so they make the most on reagents, but there are definitely cases where short reads would be sufficient (or 95% as good) which would be much cheaper.
I’m curious about the example you showed with the genome browser view of CG samples. There are 4 common variants shown over the span of about a dozen base pairs. I always get suspicious when I see a bunch of unusual features grouped tightly together like that. In my mind, it usually indicates a problem with the reference sequence, or the presence of a more complex polymorphism. Although the resolution makes it hard to be certain, it looks like the first 3 variants are all correlated, thus forming a haplotype that could be considered a 5-base substitution pattern with 2 alleles rather than a series of 3 SNVs. Is it possible that the variants are missing from the latest data release because the updated algorithm thus classified them as being part of a larger variant?
For the most part, I think the bioinformatics community is getting pretty good at dealing with SNVs. It is insertions, deletions, and larger variants that consistently cause problems.
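The representation question raised above is easy to illustrate. Here is a toy sketch (invented coordinates and sequence) of collapsing perfectly correlated adjacent SNVs into one block substitution, after which naive (pos, ref, alt) matching against the three original SNV records fails:

```python
# Collapse correlated adjacent SNVs into a single multi-nucleotide
# substitution, filling the gaps from the reference. Coordinates and
# sequence are invented for illustration.

def collapse_to_mnp(snvs, reference):
    """snvs: list of (pos, ref, alt) single-base variants on one
    haplotype, sorted by position. Returns one (start, ref, alt)
    block substitution spanning them."""
    start, end = snvs[0][0], snvs[-1][0]
    ref_block = list(reference[start:end + 1])
    alt_block = ref_block[:]
    for pos, ref, alt in snvs:
        assert reference[pos] == ref  # sanity-check against reference
        alt_block[pos - start] = alt
    return (start, "".join(ref_block), "".join(alt_block))

reference = "TTACGCGTT"
snvs = [(2, "A", "G"), (4, "G", "T"), (6, "G", "A")]
mnp = collapse_to_mnp(snvs, reference)
# mnp == (2, "ACGCG", "GCTCA"): one record where another caller saw three
```

Two callers describing the same haplotype this differently would show zero overlap under exact key matching, even though the underlying biology is identical.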
Thanks for this insight.
I assumed they would want to lean towards reducing false positives over false negatives, but it’s good to know that more definitively.
It’s always great to hear that new and better things are coming down the pipe!
I’ve always been impressed, though, by how Complete Genomics seems to have led the field in many of these informatics techniques (local indel re-alignment, haplotype/phased calling, etc.).
re: Hayan paper: That’s true, you could use those simulated reads and mappability scores to get a good idea of what the most cost-effective protocol is for a given application. Although paired-end protocols can give evidence for structural variants, you’re probably right that single-end and shorter reads with the same or higher coverage give you good small indel and SNV calls for certain applications at a more attractive cost.