What’s in a Name: The Intricacies of Identifying Variants

There’s a strong desire in the genetics community for a set of canonical transcripts. It’s a completely understandable and reasonable thing to want since it would simplify many aspects of analysis and especially the downstream communicating and reporting of variants. Unfortunately, biology isn’t so tidy as to provide a clear answer for which transcript is the important one. Consequently, there isn’t a single resource that lists the “canonical” transcript for each gene. And while there have been some attempts to label important transcripts, like Ensembl’s “Gold” identifier, ultimately the issue is punted downstream to the lab (see: Which transcript should I use?).

Genes like HK1 contain many coding transcripts which differ greatly in structure. These differences can cause the same variant to have dramatically different names.


What’s a Good Name?

One of the most important reasons to develop a set of canonical transcripts is to provide a common basis on which to name variants. With that common basis in mind, a variant should ideally have a name that is:

  1. Ubiquitous
  2. Unique
  3. Functionally tied to the underlying biology

In the academic community, the ultimate dream would be that every researcher uses the exact same name so that one text search across publications would reveal every instance where this variant has been discussed. This level of ubiquity is unfortunately unrealizable since as our understanding evolves, so too might the name. Moreover, the intricacies of the variant-transcript interaction being examined mean that the variant’s name will vary. However, resolving these discrepancies is part of the research process and is something that can be overcome.

Within a single clinic it is crucial that everyone use the same terminology so that variants are not overlooked or misinterpreted. Munz et al. proposed a framework to remove the inconsistencies in HGVS which can confound clinical analysis (http://biorxiv.org/content/early/2015/03/21/016808.1). This framework, however, doesn’t deal with the issue of assigning canonical transcripts and leaves it up to the lab to make that determination. This ambiguity can sometimes make coordination and communication between clinics difficult.

Computer Science to the Rescue?

From a computer science perspective, this issue of naming many objects uniquely and ubiquitously is a “solved problem”. GitHub’s entire business revolves around the solution, which is simply to make up a unique identifier for every object. To guarantee these properties, we are left with identifiers like “601fa1a5d09f19d10ede60576a5a6c42e33d9a51”. Unfortunately, these identifiers violate our third desirable principle (a functional tie). In many ways these hashes aren’t too dissimilar from rsIDs, which also have no functional tie, but the popularity of rsIDs lies in their goal of providing unambiguous variant representation.

The topic of determining unique identifiers for variants was recently raised in the GA4GH developers working group, and the solution of “content addressable data” was proposed. As a developer, I love this idea and think it is a step in the right direction. In this implementation, the identifier is a hashed version of stable properties of the variant, like ref/alt and genomic position. The ID unfortunately still looks like the Git hash above. This appearance makes these identifiers useful in the context of a database or API, but not suitable for day-to-day use in journal articles, clinical reports or conversation.
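To make the idea concrete, here is a minimal sketch of a content-addressed variant identifier. It is not the GA4GH proposal itself: the choice of fields (assembly, chromosome, position, ref, alt), the colon-separated layout, and the use of SHA-1 are all assumptions made for illustration, and `variant_digest` is a hypothetical helper name.

```python
import hashlib


def variant_digest(assembly: str, chrom: str, pos: int, ref: str, alt: str) -> str:
    """Hash the stable properties of a variant into a hex identifier.

    Assumption: the field set and the colon-separated layout are
    illustrative only, not a published standard.
    """
    key = f"{assembly}:{chrom}:{pos}:{ref}:{alt}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# The same variant always hashes to the same 40-character identifier,
# while changing any property (here, the alt allele) gives a different one.
first = variant_digest("GRCh37", "10", 71000000, "T", "G")
again = variant_digest("GRCh37", "10", 71000000, "T", "G")
other = variant_digest("GRCh37", "10", 71000000, "T", "C")
print(first)
print(first == again, first == other)
```

The result is stable and unique for a given set of properties, which is what a database or API needs, but it says nothing about the gene or the functional consequence.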

Or Maybe Not?

Since the functional impact of a variant will vary with the transcript it is mapped against, it might seem that we are out of luck when it comes to defining a list of canonical transcripts. However, if we allow ourselves to step back and observe that maybe the ubiquity and uniqueness qualities are more important than the exact functional tie, we can still come up with a set of transcripts for our naming conventions.

Ultimately, the question is whether it matters if we call a variant NM_12345:c.123T>G when the clinically relevant functional consequence is that it disrupts a splice donor site. It is simply a question of priorities. If our primary concern is the recognition of a common variant across labs, clinics and researchers, then this may be an acceptable allowance.

What does the Clinic Want?

We heard over and over from labs that they would highly value a single transcript representation for each variant. The natural next question is, “What transcript do you want to use?”, and the most common answer is, “Whatever ClinVar uses.” After an email exchange with ClinVar, it became clear that the complex factors they consider cannot be boiled down to an algorithmic solution. However, in order for this transcript selection to be scalable across the entire genome and internally consistent, we set out to devise a heuristic that can usually provide the desired transcript.

We used the following approach to make our determination:

  1. Order all of the overlapping transcripts by CDS length
  2. Pick the longest transcript that has an associated Locus Reference Genomic (LRG) sequence
  3. If no LRGs exist for the set of transcripts, pick the longest transcript that is coding
  4. If there is a tie, pick the transcript with the smaller accession ID number
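The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the VarSeq implementation: the `Transcript` record, the `pick_transcript` function, and the accession numbers are all hypothetical, a non-coding transcript is represented by a CDS length of 0, and returning nothing when no transcript is coding is my own choice for a case the list does not cover.

```python
import re
import sys
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Transcript:
    accession: str   # hypothetical RefSeq-style ID, e.g. "NM_000100.1"
    cds_length: int  # coding sequence length in bases; 0 if non-coding
    has_lrg: bool    # True if an LRG sequence is associated


def accession_number(accession: str) -> int:
    """Numeric part of the accession, used only as a tiebreaker."""
    match = re.search(r"_(\d+)", accession)
    return int(match.group(1)) if match else sys.maxsize


def pick_transcript(transcripts: Sequence[Transcript]) -> Optional[Transcript]:
    """Apply the four-step heuristic to the transcripts overlapping a variant."""
    with_lrg = [t for t in transcripts if t.has_lrg]
    coding = [t for t in transcripts if t.cds_length > 0]
    # Steps 2 and 3: prefer transcripts with an LRG, else fall back to coding ones.
    candidates = with_lrg or coding
    if not candidates:
        return None  # assumption: no pick when nothing has an LRG or is coding
    # Steps 1 and 4: longest CDS first, then the smaller accession number.
    return min(
        candidates,
        key=lambda t: (-t.cds_length, accession_number(t.accession)),
    )


# Made-up transcripts for illustration only.
overlapping = [
    Transcript("NM_000300.1", 2700, False),
    Transcript("NM_000200.1", 2500, True),
    Transcript("NM_000100.1", 2500, True),
    Transcript("NR_000050.1", 0, False),
]
chosen = pick_transcript(overlapping)
print(chosen.accession if chosen else "no transcript selected")
```

In this made-up example the longest transcript has no LRG, so the two LRG-backed transcripts are considered instead; they tie on CDS length, and the one with the smaller accession number is selected.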

The heuristic is based on the ACMG Guidelines for the Interpretation of Sequence Variants, which state:

“A reference transcript for each gene should be used and provided in the report when describing coding variants. The transcript should represent either the longest known transcript and/or the most clinically relevant transcript.”

Furthermore, the ClinVar paper helps clarify how they choose the preferred transcript:

“When there are multiple transcripts for a gene, ClinVar selects one HGVS expression to construct a preferred name. By default, this selection is based on the first reference standard transcript identified by the RefSeqGene/LRG (Locus Reference Genomic) collaboration” (Locus Reference Genomic sequences: an improved basis for describing human DNA variants).

We use the fact that highly studied, clinically relevant transcripts often have an associated LRG sequence. It is important to note that LRG existence alone isn’t enough to select a transcript, since some genes have multiple LRGs. Thus, we use the CDS length to prioritize which transcript will be reported. Finally, we break ties by choosing the lower accession number, since in practice these seem to be the most widely used transcripts.

One variant in the CEU Trio overlaps three transcripts in the gene HK1. Here the clinically relevant transcript is a transcript where the variant is in an exon. A nice effect of using the longest CDS region as a heuristic is that the clinically relevant transcript is likely to be the one in which the variant overlaps a coding exon.


For each gene, the clinically relevant transcript is now found in our summary section for the transcript annotation algorithm in VarSeq. The corresponding HGVS c. and p. notations are given in this summary section as well. Since Sequence Ontology and effect annotations are used to filter variants, we still provide a combined representation of the most detrimental functional annotation. This detrimental interaction may not be found on the clinically relevant transcript, so the combined effect can be inconsistent with the HGVS notation for the clinically relevant transcript. Of course, the complete set of annotations can be found in the transcript annotation output.

The choice of transcript, however, is only one factor that each variant name is premised upon. It is an important choice since it influences many of the other factors which create the inaccurate and varying names that I’ve discussed in previous posts. For users pulling VarSeq’s annotations into reports capturing their findings, annotating the clinically relevant transcript should add both consistency and simplicity to their workflow. This additional annotation evolved from feedback from our active users, so let us know how your lab has approached this problem and manages the complexity inherent in NGS reporting.

2 thoughts on “What’s in a Name: The Intricacies of Identifying Variants”

  1. Amit

    I wonder if you could use RNA-seq data from sources like GTEx to see which transcript is the most abundant in your tissue/cell of interest. Of course GTEx isn’t all-encompassing of all tissue types or biological conditions, but at least if you were studying a particular disease and knew which transcripts were the most abundant, it would help guide your decision on a set of canonical transcripts.
