Up until a few weeks ago, I thought variant classification was basically a solved problem. I mean, how hard can it be? We look at variants all the time and say things like, “Well that one is probably not too detrimental since it’s a 3 base insertion, but this frameshift is worth looking into.” What we fail to recognize is just how many assumptions went into the above statement. What transcript set are we using? In what part of the gene did the mutation occur? What subfeature of the gene are we looking at? Are there other ontologies for the variant? Why did we use the term we did? In order to develop a tool to annotate variants, rules to answer all these questions have to be codified into the software. Enumerating these assumptions means that a process that is subject to a great deal of human interpretation, is now a rigidly defined objective framework. There are currently three major tools that attempt to classify variants: Annovar, SnpEff and Variant Effect Predictor (VEP). It is no surprise that these tools do not always agree since the way the rules have been defined differ slightly between each application.
In a recent paper, Davis McCarthy et al. showed that both the choice of annotation software and transcript set can have a significant impact on the classification of variants. McCarthy compared Annovar and VEP. He also compared running Annovar using both the set of transcripts available from Ensembl and the commonly used set of RefSeq transcripts curated by NCBI. The paper’s key conclusion was that there was only about 65% concordance between the annotations for Loss of Function (LoF) variants produced by the two tools using the same transcript set. Using different transcript sets, the paper reported only 44% concordance among putative variants using Annovar.
I was so surprised by the finding I wanted to get a better feeling for the type of variants that cause these tools to differ. I also wanted to see how the widely used application, SnpEff, stacked up to the other two tools.
Since I was most interested in trying to find the types of variants that caused discrepancies between annotation algorithms, I decided the best way to dig into the problem was to create an artificial set of variants in the widely studied CFTR gene. Using Python I generated a set of variants amongst all positions covered by an exon in Ensembl’s database with 100 bp margins on either side (IPython Notebook). At each position I created:
All possible SNPs
All possible 1 base insertions following that position
Two possible 2 base insertions following that position
Two possible 3 base insertions following that position
Each possible 1, 2 and 3 base deletions following that position
At this point, a degree of subjectivity must enter into the analysis. I quickly had to start making decisions about how to compare annotations between tools. I did this comparison in Python using Pandas and HDF5 to back the analysis (IPython Notebooks normalizing annotations and analyzing Ensembl and Refseq comparisons).
The State of Variant Classification
The state of the variant annotator is…hmmm…complex? It’s extremely fragmented with each tool choosing to represent its results differently. But, the differences are not just superficial. Each algorithm makes significant choices about how to label variants. As mentioned above, these don’t always agree. Sometimes the tools differ in ways that are both arguably correct, but in a few cases, the generated annotations are flat out wrong.
Performing a perfect comparison of the three tools is frankly impossible. To start with each tool outputs its annotations in a different format. Annovar’s output is a tab separated file, while SnpEff and VEP produce VCF files which use the “INFO” field to encode their annotations. While SnpEff and VEP represent data in a consistent format, the format of Annovar’s rows changes depending on context. For example, Annovar uses the gene field to provide distance information for all intergenic variants. Similarly, for variants labeled as splicing, it overloads the gene field with HGVS notion. These decisions make sense from a human point of view but can cause problems when parsing the file. Since it is common to annotate thousands if not millions of variants at one time, I’d argue that the ability to easily parse the output file should be paramount when designing the tool.
Each tool uses a different nomenclature for defining sequence ontology. These differences encompass both semantically equivalent terms and terms which are similar but vary in their specificity. To compensate, I was forced to make some subjective decisions about how to normalize terms. Ultimately, this meant bucketing terms together. Since Annovar’s annotation categories are the most general, I used these to build my categories.
There is some hope that variant annotation in the future will start to use the standard ontology terms defined by The Sequence Ontology Project. Even with standardized terms, it is still extremely important to use precise language when classifying variants to avoid ambiguity during the discussion. For example, all the tools have a concept of a splicing variant. Using a term like splice_region_variant doesn’t really clarify the impact of the variant. The Sequence Ontology Project uses splice_region_variant as a parent category (an inner node in the sequence ontology tree). Thus, using this term all we’ve communicated is that the variant could be up to 3 bases exonic or 8 bases intronic of a splice site. The variant’s location within this interval could drastically alter the impact. When we actually want to communicate the punitive effect it is more useful to use a specific term like five_prime_cis_splice_site.
Finally, each piece of software deals with a single genomic variant producing annotations against multiple transcripts in a different way. If a gene contains more than one transcript, a variant will have multiple annotations depending on where it is located with respect to each transcript. Moreover, some genes overlap so a single variant could affect multiple transcripts on multiple genes. Both SnpEff and VEP list each way the variant was classified. Annovar instead returns only the most deleterious variant based upon a priority system. By necessity, I tried to follow this prioritization in order to compare “apples to apples”.
Although it might seem reasonable to collapse variants down to single most punitive annotation, doing so poses a multitude of problems. First, it imposes a subjective determination of what is the worst type of variant. Is a frameshift really worse than a stop gain? Maybe, maybe not. A frameshift might occur early in a coding sequence, but might be quickly offset by another frame shift. This situation is likely to arise since many variant callers (e.g. GATK’s UnifiedGenotyper) prefer to call small allelic primitives. However, it might be reasonable to argue that it makes little difference whether an indel is a frameshift or stop gain since they are both Loss of Function variants.
Second, collapsing annotations removes granularity that can be useful while filtering variants. Perhaps, your protocol only calls for looking at stop gain variants. If the classifications have been collapsed by the software, you are out of luck. I predict this collapsing of lower priority variants will become a larger issue in the future as the impact of variants outside of coding regions is discovered to be significant. In the future, a SNP in certain regulatory regions might be found to have a greater effect than an inframe substitution in a coding region.
Another interesting decision Annovar made is not to annotate variants that lead to start loss/gain mutations. While I can see the reasoning behind calling a start gain just a nonsynonymous variant, not annotating start losses is troubling. A start loss potentially results in the complete loss of a transcript’s product. In fact, SnpEff predicts the effect of start losses as high impact.
The concordance between the three algorithms in coding regions is relatively good. When we exclude variants with annotations that imply a non-coding region (ie downstream, upstream, intergenic, intron, 3 prime and 5 prime) and annotations in the “ignored” category there is 93% concordance between the algorithms. However, when these categories are included the concordance falls to 49%.
While this second number seems shocking, it’s less concerning as it is depressed by annotations that are similar, but vary in levels of precision. Additionally, if even one algorithm disagrees, it cannot be counted as a match.
Most of this variation in annotations is explained by the way each tool defines non-coding features. For example, SnpEff, uses 5kb to define upstream and downstream regions, while Annovar uses 1 kb. In many other cases, variants in non-coding regions were bucketed into the “ignored” category.
Next, note that only VEP annotated regulatory region variants. Since I used the web interface to VEP, this was done without having to hunt down additional datasets and get them into a format that the tool could use. This feature is a nice bonus for VEP users.These variants would likely be annotated as intronic or intergenic by the other algorithms decreasing non-coding concordance.
Another distinction for VEP web users to note is the available sets of transcripts. Ensembl is of course available, but the RefSeq database also contains CCDS and Expressed Sequence Tags. This caused the RefSeq comparison in my analysis to not be reliable; hence I focused on annotations using the Ensembl transcript set.
Given that the coding concordance is relatively good, but not perfect, we can say that we have a small but significant amount of disagreement. For example, in the categories of splicing, frameshift, and stop gain there is a non-trivial amount of disagreement. It is in these dark edge cases, where the demons of variant annotation hide.
Specific Variant Test Cases
Annovar and VEP classify this variant as a frameshift, while SnpEff classifies it as stop gain. Who’s right? Tough to say. It is a frameshift, but calling it a stop gain is more precise.
What is interesting about this annotation is that VEP is looking at every base affected by the indel. Thus it figures out that the T at 117105838 is the first base of this CFTR Exon and annotates the variant as a non-coding-exon variant, whereas Annovar calls it intergenic and SnpEff calls it an Exon, intergenic and upstream variant. This is a good example of SnpEff annotating all possibilities, even if two of the possibilities are mutually exclusive (exon and intergenic).
Both Annovar and VEP correctly label this variant as a stop loss. SnpEff labels it as a frameshift, which is true, but less precise.
This is an interesting and subtle case which elucidates why splicing variants displayed a lower amount of concordance. SnpEff annotates this as a splice_site_acceptor — technically wrong since it is 3 bases intronic. VEP annotates it as a splice_site_region — correct. Annovar annotates it as intron — true, but imprecise. The key takeaway is that all three are using different terms to describe the same variant, making comparative analysis extremely difficult.
One discrepancy that is worth investigating is the case of a stop codon being inserted after the last codon of an exon. VEP and Annovar call this a stop gain, while SnpEff calls it a splice site variant. If we think about how the transcription process works, we can infer that this variant would act as a stop gain. The first two intronic bases (GT) act as a signal for the intron to be spliced out. However, the stop codon is inserted just before these two bases. Thus, during translation, the ribosome will hit the stop codon and stop producing the protein rather than splicing the two exons together.
This 3 base pair deletion is a great example of a variant that causes the algorithms to use substantially different annotations. SnpEff labels it a frameshift mutation. This classification is wrong with respect to any transcript at this position. SnpEff makes this determination using the incomplete transcript (the one on top in the screenshot). VEP accurately calls it an 3_prime_UTR_variant against the incomplete transcript and an inframe deletion against the others. Annovar correctly calls it a “nonframeshift deletion” and uses the bottom transcript to make that determination.
Pulling on a Thread: Examining Transcript Sets
I wish all we needed to worry about was how different tools annotate variants. As discussed above some of the classifications disagree, but often filtering is performed on an aggregate class of variants (i.e. coding or loss of function). In these cases classifying a variant as a frameshift instead of a stop gain will not change the set of resulting variants. Even where a classification is flat out wrong, it is easily verifiable by viewing the variant in a genome browser. Additionally, with the availability of multiple variant classification algorithms, the ability to use another tool exists. Most importantly, we can talk very precisely about the variant since we assume the coordinates and transcript mappings to be accurate.
But, can we actually make this assumption that the transcripts we are using have been accurately mapped to the genome? As it turns out this mapping is the more confounding problem when classifying variants.
Transcript sets are are available from different organizations such as Ensembl and RefSeq. To understand what is included in these databases it helps to understand the origin of the data. Transcripts are usually determined by sequencing RNA or determining the primary amino acid of proteins, although some transcripts are computationally inferred. From this data, we can build up small pieces of the transcriptome. Finally, to place these pieces in genomic space we need to align these sequences to the reference sequence.
Before discussing the specific issues that affect transcript sets, it’s worth thinking about why it is such a problem. Having an accurate transcript mapping is crucial because it is used quite far upstream in the analysis pipeline. If the transcript is not placed accurately on the reference genome then all downstream analysis using this derived data will be inaccurate. Moreover, these inaccurate mappings are more difficult to spot than a misclassified variant. Finally, it makes it much harder to speak precisely about a variant since it’s context may be incorrect. Essentially, if we are unable to rely on transcript mapping, we have added a wide plane of uncertainty to our analysis.
When we look at a VCF file full of variants, we are seeing their positions in genomic coordinates — that is, we know their positions with respect to the reference sequence. However, when examining a disease causing variant, the variant’s genomic position is of secondary importance. What we care about is its biological impact. This functional impact is much more tightly linked to their position in coding space (i.e. HGVS c dot notation). Thus, we need to either translate the variant into coding coordinates or translate the transcripts’ coordinates into genomic coordinates. Unfortunately, it is impossible to translate a small variant directly into coding coordinates since there are a multitude of possible positions to which it could map. Thus, we are forced to map transcripts back to the genomic reference sequence.
This use of genomic coordinates implicitly assumes that there is such a thing as a canonical reference sequence. Although the human reference has come a long way from it’s first release, it still has many gaps, tiling issues and alleles that are rare or non-existent in humans. The alignment algorithms used to mapped transcripts back to genomic space must contend with these issues in addition to novel challenges such as the expectation of large gaps (introns) in the alignment. When mapping transcript sets such as RefSeq to the reference, these issues manifest themselves in the creation of exons and introns that are not in their correct position or break the biological rules of translation. Needless to say, the problem is difficult to solve and thus leads to imperfect alignment and ab initio gene construction.
At the recent Human Variome Project meeting in Paris, Invitae’s Reece Hart, gave an extremely important talk about just what a significant problem this mismapping is. He was able to show that the mapping for nearly 3% of transcripts in the NCBI RefSeq database that UCSC has aligned to the reference the coordinates is likely inaccurate.
NCBI reports the transcript RNA sequence,
but doesn’t provide genomic coordinates. [As Deanna pointed out below, NCBI does provide the genomic alignments]. UCSC, however, uses the RNA sequence provided by the NCBI RefSeq project and aligns it to the reference using the BLAT aligner. In some spots, BLAT makes a mistake and it is clear that the transcript mapping is biologically impossible. For example in the TNNI3, we see that BLAT doesn’t split exon 1 into two introducing the appearance of a premature stop codon 4 amino acids into the protein. Examining Ensembl’s mapping of the same transcript, we see that the Glycine codon is properly split preventing the stop codon. But it gets worse! BLAT uses an extra nucleotide in it’s alignment (UCSC reports the coding sequence length is 841 vs. the 840 that Ensembl reports). This causes the gene mapping to be frame-shifted 24 bases into the coding sequence! Thus, if your variant occurs anywhere in the remaining 816 bases it will be annotated incorrectly even by a perfect algorithm.
TNNI3 is an extreme case and was easy to spot because of the premature stop codon. Some of the other cases are not so obvious. For example, BLAT moves exon 12 in CARD9 323 bases and changes it’s length from 7 bases to 5, however, the rest of the sequence is left intact so this would only affect a variant that was in exon 12, but the likelihood of discovering this without prior knowledge is slim. Another interesting case is the EMG1 gene that I tweeted about last week.
So is the solution just to use the Ensembl transcript set? Regardless of how you answer this question, we don’t live in an ENST world and the dominance of “NM_” nomenclature means that we still need to provide RefSeq names. Furthermore, Ensembl is trying to produce a comprehensive set of transcripts for each gene, not a high confidence list of well studied transcripts. This breadth adds a lot of “noise” to an analysis pipeline. Thus, the large Ensembl transcript set might be suitable for a research environment but is less suitable in a clinical grade gene panel pipeline.
One devilish example of Ensembl’s “noise” is that transcripts with incomplete coding sequences are included. Often this causes the transcript to start out of frame. The GTF file spec handles this with a frame field. However, if a tool doesn’t handle this out of frame start, inaccurate annotations can be produced. I believe this is what happened with the classification discrepency of a single base deletion (7:117267851 -117267852 T/-). On transcript ENST00000468795, SnpEff called this a frameshift, while VEP correctly called it a stop lost. If the transcript was not adjusted it’s easy to see why SnpEff would miss it. Thus, this is a case where having a more complete data set introduced complexities into the underlying data that the tool (or data curator) must account for.
The effect of including all detected transcripts can cause additional edge cases that the data curator must be aware of. Here an incomplete CDS in Ensembl’s transcript set for CFTR starts out of frame. The top transcript is properly adjusted for the reading frame and thus ends in a stop codon. The bottom transcript ignores the frame parameter, resulting in the incorrect representation of all codons downstream of the exon which began out of frame. Without accounting for this irregularity, downstream analysis can become inaccurate.
One recent attempt to improve the accuracy and usefulness of variant annotations is the Loss of Function Transcript Effect Estimator (LOFTEE). This just-released software package is written by Konrad Karczewski and fits on top of VEP. It removes variants within a short distance (less than 15bp) intronic regions, non canonical (ie not GT AT signature) splice regions, LoF variants in the last 5% of the transcript, and variants where the LoF allele is the ancestral allele for that position. Each of these decisions can help to improve our confidence that the variant we are looking at is punitive.
Ultimately, we have to realize that there are many factors that enter into variant annotation and that it may not be in the class of problems that has a single correct solution. Instead, it may be the type of problem with many acceptable solutions whose value varies with context. I think the most important thing that you can do is to familiarize yourself with the transcript set you choose to use — know the limitations of RefSeq mapped by UCSC and how Ensembl can obfuscate results. You could consider using CCDS too. This database represents the intersection of the set of transcripts in Ensembl, RefSeq, and HAVANA. Since it only represents coding transcripts, it removes some of the less studied transcripts found in the comprehensive Ensembl. But, don’t stop with the transcript set! You should become very familiar with your classifier — study it’s documentation, run some test variants through it and study them in a genome browser. Also, think about what really matters in your analysis — can you bucket annotations in terms of impactfulness or do you need to know the specific type of variant.
If you made it this far, I want to offer my congratulations! I took an in depth look at variant classifiers over the last few weeks because I am building a new one and want to hear your feedback. What’s important in terms of annotations? How do you curate your data? How does it integrate into your current pipeline? We want to hear from you and incorporate the features you need into our next generation of the variant classifier.