Have you ever scratched your head when looking up a variant and it seems like the number you have for its position is one off from what it looks like in the file or database? You may be running into the dreaded world of 1-based versus 0-based coordinate representation!
If it’s any consolation, I can promise that all the bioinformaticians and computer scientists who are responsible for building those databases and defining those file formats have also probably scratched their heads staring at a similar disparity.
From a tool builder’s perspective, I can assure you we have the best intentions when making the calls on how to represent variants in files, websites and in backend data stores!
It turns out the precise, unambiguous, programmatically friendly way of describing a variant’s placement on the genomic reference (0-based half open intervals) is also unintuitive, potentially confusing and not as pragmatic as the alternative (1-based positions).
Let’s go through by example, because any other way makes my brain hurt on this topic.
Here are two lines of a VCF file describing some common variants in the 1000 genomes variant set:
chr1 900009 . C G
chr1 900010 . G GC
VCF is funny in the sense that the designers made the choice to require a “prefix” base for insertions and deletions rather than introduce a new symbol that meant “no-base” such as “-” (they do define “.” to mean unknown base, but that is a distinctly different concept).
If you think of each base pair in the reference as a block, which we start counting from 1, then you are intuitively using 1-based positioning. GenomeBrowse makes a point of labeling the base positions in the middle of each 1bp “box” to encourage that perception.
The representation of these variants as intervals on the reference genome is best thought of in terms of coordinates that have beginnings and endings at the “edges” of these 1bp blocks. And in fact, that is what “0-based half open” intervals do. They also provide the nice programmatic property of being easy to compare and overlap, and the width of an interval is found by simply subtracting the “Start” from the “Stop” coordinate.
Every bioinformatics tool and database (including VarSeq and GenomeBrowse) thus uses this 0-based representation internally for genomic features (including variants).
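To make the "easy to compare" point concrete, here is a minimal Python sketch of a 0-based half-open interval. The names Interval, width and overlaps are mine for illustration and are not taken from any particular tool.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A 0-based half-open interval: start is included, stop is not."""
    start: int
    stop: int


def width(iv: Interval) -> int:
    # Width is simply stop minus start; an insertion site has width 0.
    return iv.stop - iv.start


def overlaps(a: Interval, b: Interval) -> bool:
    # Two half-open intervals overlap when each starts before the other stops.
    return a.start < b.stop and b.start < a.stop


snp = Interval(900008, 900009)        # covers one reference base
insertion = Interval(900010, 900010)  # sits on an edge, covers nothing

print(width(snp), width(insertion))
print(overlaps(Interval(0, 10), Interval(10, 20)))  # abutting intervals share no base
```

Note that with this strict comparison a zero-width interval sitting exactly on the edge of another interval does not count as overlapping it. Whether an insertion should "hit" its neighboring bases is a design decision each tool has to make.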
Back to our example. Here is the 0-based chromosome, start, stop and reference/alternate alleles of our two variants (note ‘-’ is our sentinel for “no base”).
1 900008 900009 C/G
1 900010 900010 -/C
It is important to note that the insertion has the same start and stop, since it has no width in the reference genome but rather an insertion of a ‘C’ between two existing reference letters.
However, when representing variants in text files and on screens, we don’t want to use two numbers, just the starting position (since the stop can always be inferred from the number of reference bases changed, which of course is 0 for an insertion).
VCF files in fact are providing a single 1-based position for a variant. If you are mentally translating the insertion in our example to 0-based start and stop, don’t forget to strip the prefix “G” and add one accordingly to the provided position.
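If you would rather let a script do that mental translation, here is a small Python sketch of it. The function name vcf_to_zero_based is hypothetical, and it assumes a simple record whose indel alleles share exactly one prefix base (as in our example); it is not a full VCF normalizer.

```python
def vcf_to_zero_based(pos, ref, alt, no_base="-"):
    """Convert a VCF POS/REF/ALT to a 0-based half-open (start, stop, ref, alt).

    Assumes at most one shared prefix base on indels; "-" is the
    "no base" sentinel used in this post.
    """
    start = pos - 1  # 1-based position -> 0-based start
    if len(ref) != len(alt) and ref[0] == alt[0]:
        # Indel: strip the prefix base and move the start past it.
        ref, alt = ref[1:], alt[1:]
        start += 1
    stop = start + len(ref)  # width equals the number of reference bases changed
    return start, stop, ref or no_base, alt or no_base


print(vcf_to_zero_based(900009, "C", "G"))   # the SNP
print(vcf_to_zero_based(900010, "G", "GC"))  # the insertion
```

For the two example records this gives 900008-900009 C/G for the SNP and 900010-900010 -/C for the insertion, matching the 0-based rows above.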
This great visualization from a Biostars Tutorial page may also help with translating between the coordinate spaces:
While it’s straightforward and unambiguous to describe our SNP using the 1-based system as 1:900,009 C/G, what is the correct 1-based representation for our C insertion?
Well it turns out, as you can see in the ensuing discussion which I heartily participated in, there is no consensus.
To demonstrate that point, let’s look at the text output representation of the insertion when annotating these variants with three tools: VarSeq, Ensembl VEP and ANNOVAR.
They are all different.
VarSeq will by default display and export the insertion as chr1:900011 -/C, because we unconditionally add one to the 0-based coordinate to make it 1-based, meaning an insertion always starts at the left-edge of the base pair of the provided coordinate.
VEP will produce annotations that start with these two lines for our variants:
1_900009_C/G 1:900009 G
1_900011_-/C 1:900010-900011 C
While the “auto-generated” name in the left column matches VarSeq, the variant position is the pair of the two book-ending base pairs that define the location of the insertion. I’ve never seen this clever representation before, but it does a good job of unambiguously describing the positional nature of insertions being in-between 1-based blocks, versus SNPs/Deletions/MNPs starting on a full block.
ANNOVAR on the other hand takes a different route:
chr1 900009 900009 C G
chr1 900010 900010 - C
It is providing a start and a stop, and clearly using 1-based coordinates, given the SNP’s position. But I have a hard time thinking of a justification for the insertion having an interval of 900010-900010.
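To see the three conventions side by side, here is a Python sketch that starts from the 0-based start and stop and reproduces the position strings shown above. It is only an illustration of the arithmetic: the function name and the dictionary keys are made up, the rules are inferred from this one example, and none of it is taken from the actual VarSeq, VEP or ANNOVAR code.

```python
def one_based_labels(chrom, start, stop, ref, alt):
    """Render a 0-based half-open variant in three 1-based text styles.

    Rules are inferred from the example outputs in this post only.
    """
    if start == stop:
        # Insertion: zero width, sitting on the edge at `start`.
        return {
            # Add one unconditionally (the VarSeq-style display).
            "add_one": f"{chrom}:{start + 1} {ref}/{alt}",
            # The two 1-based bases flanking the edge (the VEP-style location).
            "flanking_pair": f"{chrom}:{start}-{start + 1}",
            # The 1-based base left of the edge, twice (the ANNOVAR-style row).
            "left_base_twice": f"{chrom} {start} {start} {ref} {alt}",
        }
    # SNPs, deletions and MNPs occupy whole bases: 1-based start is
    # start + 1 and the 1-based inclusive end equals the 0-based stop.
    first, last = start + 1, stop
    span = f"{first}" if first == last else f"{first}-{last}"
    return {
        "add_one": f"{chrom}:{first} {ref}/{alt}",
        "flanking_pair": f"{chrom}:{span}",
        "left_base_twice": f"{chrom} {first} {last} {ref} {alt}",
    }


for style, text in one_based_labels("chr1", 900010, 900010, "-", "C").items():
    print(style, "->", text)
```

The same 0-based insertion at 900010-900010 comes out as position 900011, the pair 900010-900011, or 900010 twice, which is exactly the discrepancy between the three outputs.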
Note that VEP, VarSeq and other tools can also output annotated variants to VCF files and will always agree on representations in that unambiguous output format.
But VCF does not make it easy for humans or ad-hoc scripts to get at those annotations, and it is not canonical in its insertion representation, given the “insertion prefix” and the position adjustment needed to accommodate it.
So when it comes to a text representation of variants (well really just insertions), you may unfortunately run into these discrepancies.
Having powered through this blog post, you should be more prepared to sort them out!
You may also find that using GenomeBrowse helps with visualizing the true relationship of variants. Just drag those VCF files in, and away you go!
Very interesting Gabe.
FYI, I will be back to my genome by the end of October.
As I learn more it seems “balance” is not so much the key as “executive function”.
“VCF is funny in the sense that the designers made the choice to require a “prefix” base for insertions and deletions”
The designers of the VCF file format are all of us. Any one of us can make suggestions to version 4.3 of the VCF file format to avoid any funny quirks.
I think it’s great that VCF has moved to a community driven format and the GA4GH collaborations on file formats in general.
But changing InDel representation at this point would be a backward-incompatible break that I definitely wouldn’t recommend. I think we are stuck with the VCF representation, for better or worse.
Note the “better” part is that it actually is very unambiguous and clear since you always start with a full base-pair of reference before adding or removing bases. So it side-steps a lot of the mental adjustments one makes when thinking about placing nucleotides in-between others.
The “worse” part is I’ve seen variants reported in the un-stripped representation (i.e. Chr1:100 A/AG), which although trivial to “normalize” also keeps that variant from being easily keyed and indexed. For example, that is how insertions were reported in my 23andMe exome report! Which is fine, it was a research project and I appreciate being part of it; I’m just pointing out the proliferation of representation choices down to the “end-user”.