In Pursuit of Longevity: Analyzing the Supercentenarian Whole Genomes with VarSeq

If you haven’t been closely watching the twittersphere or other headline sources of the genetics community, you may have missed the recent chatter about the whole genome sequencing of 17 supercentenarians (people who live beyond 110 years).

While genetics only explains 20-30% of the longevity of those with average life-spans, it turns out there are a number of good reasons to think extreme longevity has a very large genetic component (and interestingly not a lot to do with lifestyle choices like smoking, alcohol use, exercise, etc.).

The collaboration between Stanford researchers (including the visionary Dr. Leroy Hood) and the Supercentenarian Research Foundation resulted in an open access paper in PLoS One about the sequencing and analysis of 17 of the 22 supercentenarians alive in the US.

This group clearly wants their data and results to have the best chance of making an impact and being of significance to the genetics community. Along with the paper being open, there are supplemental tables of various sets of interesting variants, as well as a reference to their project’s website with directions for requesting access to the full raw variant calls for each individual.

Some Assembly Required

I honestly requested access more on a whim than with any expectation that it would be granted, but two days later I received a friendly email with permissions granted to the Google-hosted Complete Genomics “masterVarBeta” files for each of the 17 supercentenarians!

If you’re not familiar with Complete Genomics, the abbreviated history is that they pioneered whole genome sequencing and ground-breaking bioinformatics as a service, but ultimately bowed to being outspent on R&D and economy of scale by Illumina’s HiSeq high-throughput sequencer and open source informatics. They were recently sold to BGI (yes, the Chinese government subsidized behemoth of a genome center), which still provides the Complete Genomics sequencing through its cut-rate sequencing services group.

The “masterVarBeta” files I downloaded were not ready for analysis until they were converted into a single VCF file using cgatools (along with an obscure custom reference sequence file).

So with a 9.8GB VCF file in hand, I started to ponder what would be the appropriate analysis strategy to scour healthy genomes for threads of the story of why they are outliers.

Finding the “Long Life” Gene

Not many genetic research topics share the fortune cookie ring of “you will have a long life”. But explaining these cases of long life in retrospect is exactly what we would like to accomplish.

The PLoS paper is an interesting read, but if you were hoping for a break-through discovery of why some people live to be 110, then I’m sorry to drop the spoiler and say they didn’t find it.

By reading the paper, you get the sense that the genome is a vast expanse of varied terrain, most of which may be desert, but with plenty of oases to explore.

Another take-away is that the techniques used to diagnose the genetic cause of rare and Mendelian diseases are likely to be ineffective in discovering the genetic underpinnings of longevity. In the former case, one might expect a single causal mutation, with a very satisfying narrative of the sequence of biological steps it propagates to arrive at the disease or phenotype. In the latter case, the sheer variety of the common and rare diseases successfully avoided implicates a broad and diverse protective mechanism, with potentially many components to its genetic architecture.

Whole Genome Variant Analysis in VarSeq

The PLoS paper takes a couple of approaches that follow very common steps in the bioinformatics of NGS variant data, which is exactly the specialty of our new product VarSeq!

So naturally I fired up VarSeq, started a new project and ran through the import wizard for unrelated individuals with that behemoth of a VCF file.

While we have given examples and webinars of VarSeq as a variant annotation and filtering tool with exome and gene panel data, this is the first time I have had a chance to put it through its paces with whole genome data.

Not only does it work, but VarSeq is able to fulfill the initial design of supporting the “analysis at the speed of thought” user experience even at the scale of whole genomes.

Variant analysis of 17 supercentenarian whole genomes (14.5M input variants). The columns left to right are mutations not seen in any population catalog that are potentially damaging, those including variants up to 1% in population catalogs, variants annotated as Pathogenic in ClinVar that are very rare (<0.1% in 1000 Genomes) and seen in a homozygous state, and finally Pathogenic variants not seen in any population catalog.

Sure, instead of having to wait a minute to run the annotation against the 1000 genomes catalog on exomes, these genomes take a proportional 6.5 minutes to look up the 14 million imported variants. But the ability to construct filters from any imported or annotation column, move around cards, tweak filter thresholds and toggle buckets is still an amazingly responsive experience.

This is due to the inherent scalability of the TSF file format that powers VarSeq as well as the pedantic attention to performance the whole Golden Helix dev team has internalized, especially when it involves the user experience.

Peering Into the Barrel for a Glimpse of the Silver Bullet

Well, what did I do with my imported super life-extending whole genomes?

Look for the silver bullet of course!

The first step was to compare these genomes to the population catalogs and databases of clinically relevant variants.

One of the more interesting variant catalogs is our up-to-date ClinVar annotation track with its Clinical Significance classification of Pathogenic, Likely Pathogenic, Uncertain Significance, Likely Benign and Benign.

As the PLoS paper noted, many variants annotated as pathogenic in ClinVar have unknown penetrance (meaning although they are implicated as disease causing in some individuals, they may be passenger mutations in other healthy individuals).

In the filter chain screenshot, you can see there are 257 variants annotated as Pathogenic in the supercentenarians as a whole. Using the Allele Counting algorithm, we can look at the count of homozygous (mutation on both chromosomes) variants in the cohort.

Many of these variants are quite common, so we can add a strict population frequency filter of less than 0.1% (the third column in the filter screenshot).
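To make the two steps above concrete, here is a minimal sketch of counting genotypes across a cohort and then keeping only rare variants seen in a homozygous state. This is not VarSeq's Allele Counting implementation; the record layout, the `af_1kg` field name and the toy variants are all assumptions made up for illustration.

```python
# Hypothetical records: one VCF-style genotype string per sample.
variants = [
    {"id": "varA", "af_1kg": 0.25,   "genotypes": ["1/1", "0/1", "0/0"]},
    {"id": "varB", "af_1kg": 0.0,    "genotypes": ["1/1", "0/0", "0/0"]},
    {"id": "varC", "af_1kg": 0.0005, "genotypes": ["0/1", "0/1", "./."]},
]

def count_genotypes(genotypes):
    """Count homozygous-alternate and heterozygous calls, skipping no-calls."""
    hom_alt = het = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # missing call
        if "0" not in alleles and len(set(alleles)) == 1:
            hom_alt += 1  # every allele is the same non-reference allele
        elif any(a != "0" for a in alleles):
            het += 1  # at least one alternate allele, but not homozygous
    return {"hom_alt": hom_alt, "het": het}

RARE_THRESHOLD = 0.001  # the "less than 0.1%" cutoff from the text

rare_homozygous = [
    v["id"]
    for v in variants
    if v["af_1kg"] < RARE_THRESHOLD
    and count_genotypes(v["genotypes"])["hom_alt"] > 0
]
print(rare_homozygous)
```

In this toy cohort only `varB` survives: `varA` is too common and `varC` is never homozygous.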

The one homozygous variant left is in CYP2B6 (A>G mutation NP_000758.1:p.Lys262Arg) and ClinVar lists the condition of this variant as poor metabolism of Efavirenz (an antiretroviral drug used primarily in HIV treatment). Although this variant is absent in the 1000 Genomes and ESP6500 catalogs, it is actually present at 9% frequency in the recent ExAC catalog, demonstrating the importance of using more than one population catalog source during analysis.
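One simple way to guard against this kind of surprise is to filter on the highest frequency reported by any catalog rather than on a single source. The sketch below assumes a plain dictionary of catalog frequencies with `None` for "not present"; the function names are hypothetical and this is not how VarSeq represents annotations.

```python
def max_population_af(freqs):
    """Highest allele frequency across catalogs; absent entries count as 0."""
    observed = [f for f in freqs.values() if f is not None]
    return max(observed, default=0.0)

def is_rare(freqs, threshold=0.001):
    """True when no catalog reports the variant at or above the threshold."""
    return max_population_af(freqs) < threshold

# Frequencies for the CYP2B6 variant as described in the text.
two_catalogs = {"1000G": None, "ESP6500": None}
three_catalogs = {"1000G": None, "ESP6500": None, "ExAC": 0.09}

print(is_rare(two_catalogs))    # passes the rarity filter
print(is_rare(three_catalogs))  # fails once ExAC is consulted
```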

Other Pathogenic variants in our supercentenarians are completely novel, such that they are not in any population catalog, but are still in ClinVar.

The 9 variants in the fourth filter column that are novel and annotated Pathogenic in ClinVar 2014-12-02.

Right at the top of the table of these 9 variants is a frameshift insertion (extending a 7-A homopolymer by another A, NP_000242.1:p.Ala230fs) in MSH2, with a Pathogenic ClinVar classification with the rare 3 star confidence rating for Lynch syndrome (Hereditary nonpolyposis colon cancer).

One Supercentenarian (GS000014098-ASM) holds this novel frameshift variant in MSH2 annotated as Pathogenic for Lynch Syndrome (Hereditary Nonpolyposis Colon Cancer).

This variant has very high diagnostic potential, and when reviewing it I recalled that the table in the PLoS paper with the characteristics of the supercentenarians noted one individual previously had cancer.

The raw variant data I received has no phenotype information associated with it, only the anonymous sample identifiers. But as luck would have it, the one individual with cancer was also the only male in the cohort!

Immediately I fired up SVS, imported the X chromosome, ran our gender inference algorithm that looks at the heterozygosity rate of the genotype calls in X and a few seconds later had the inferred gender of the cohort:

Sure enough, our single male with a history of having cancer is the sample that contains the novel Pathogenic variant for hereditary risk of colon cancer!
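The idea behind inferring sex from X chromosome heterozygosity is that males carry a single X, so outside the pseudoautosomal regions their diploid genotype calls should almost never be heterozygous. Here is a rough sketch of that idea; it is not the SVS algorithm, and the 10% cutoff and toy genotypes are assumptions chosen only for illustration.

```python
import math

def x_het_rate(genotypes):
    """Fraction of called diploid X genotypes that are heterozygous."""
    called = het = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles or len(alleles) != 2:
            continue  # skip no-calls and haploid calls
        called += 1
        if alleles[0] != alleles[1]:
            het += 1
    return het / called if called else math.nan

def infer_sex(genotypes, male_max_het=0.1):
    """Call 'male' when X heterozygosity is at or below a hypothetical cutoff."""
    rate = x_het_rate(genotypes)
    if math.isnan(rate):
        return "unknown"
    return "male" if rate <= male_max_het else "female"

# Toy non-pseudoautosomal X genotypes for two hypothetical samples.
sample_a = ["1/1", "0/0", "1/1", "0/0", "1/1"]
sample_b = ["0/1", "1/1", "0/1", "0/0", "0/1"]

print(infer_sex(sample_a), infer_sex(sample_b))
```

A real run would use many thousands of X genotypes per sample and exclude the pseudoautosomal regions, where males are legitimately diploid.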

There are other interesting variants in that table, including one for an autosomal dominant form of night-blindness and another for an autosomal dominant form of renal structure abnormalities. But rather than go through them one by one, just download the formatted Excel sheet of this table and take a look yourselves (VarSeq’s Excel export preserves formatting and hyperlinks).

Supercentenarian Novel Pathogenic Variants XLS File

Rare Variants Enriched in the Supercentenarians

Another strategy taken in the PLoS paper was looking at variants that are in a large proportion of the supercentenarians, but rare (or novel) in the population at large.

This can be done in VarSeq by sorting on the Heterozygous Counts column computed on the individual genotypes with the filters from the first two columns of our filter chain applied.
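Outside of VarSeq, the same ranking amounts to sorting the already-filtered variants by how many samples carry them. A small sketch, with placeholder gene names and counts that are not taken from the real data:

```python
COHORT_SIZE = 17  # supercentenarians in the study

# Hypothetical variants that already passed the novel/rare damaging filters.
candidates = [
    {"gene": "GENE_A", "het_count": 3},
    {"gene": "GENE_B", "het_count": 17},
    {"gene": "GENE_C", "het_count": 9},
]

# Most widely shared variants first.
ranked = sorted(candidates, key=lambda v: v["het_count"], reverse=True)

for v in ranked:
    share = v["het_count"] / COHORT_SIZE
    print(f'{v["gene"]}: {v["het_count"]} of {COHORT_SIZE} samples ({share:.0%})')
```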

Here is the table for Novel Damaging Mutations:

You can see multiple examples of HYDIN variants with very high genotype quality in every sample. Yet they are not in any of our population catalogs!

Well, before we can get excited, it’s good to remember that things that look too good to be true usually are.

When following up with candidate genes like TSHZ3 in the PLoS paper, the authors explained that many variants present in multiple supercentenarians were not validated with Sanger and are thus false positive systematic sequencing errors.

This is almost certainly what we are looking at here.

Another way to pick through these rare damaging mutations is to look at critical disease associated genes like BRCA1/2:

Here we started with the Rare Damaging Mutations filter column, and added a table filter on the Gene Names column to show only BRCA1/BRCA2 variants.

Two of these variants occur in the population at very low allele frequencies (0.3% and 0.06% in 1000 Genomes) and have ClinVar Benign classifications for breast cancer.

That leaves the remaining 2 Loss of Function (LoF) variants in BRCA1 and the LoF and Missense in BRCA2 of uncertain interpretation.

Starting with the BRCA1 variants, we can use the built-in GenomeBrowse visualization to see their genomic context:

Two BRCA1 mutations that offset each other

As you can see, they are actually in the same individual and although they are both frameshift variants, the first inserts a base and the second deletes a base, leaving the rest of the gene to be in-frame. All told, only two amino acids should be altered in the protein sequence, and although a functional study would be needed to see how this affects BRCA1 protein levels, we do have at least one elderly woman with no breast cancer for evidence of its benignity.
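The reason a nearby insertion and deletion can cancel out is that the reading frame only depends on the net length change modulo 3. The sketch below shows this on a made-up coding sequence (not BRCA1): a one-base insertion followed by a one-base deletion changes only the codons between the two events. The sequence, positions and helper names are all hypothetical.

```python
def restores_frame(length_changes):
    """True when the combined indel length change keeps the reading frame."""
    return sum(length_changes) % 3 == 0

def codons(seq):
    """Split a coding sequence into complete codons."""
    usable = len(seq) - len(seq) % 3
    return [seq[k:k + 3] for k in range(0, usable, 3)]

# Toy coding sequence; positions are 0-based offsets into `ref`.
ref = "ATGGCCAAAGGGTTTCCCTGA"
ins_pos, ins_base = 4, "T"   # insert one base before ref[4]
del_pos = 10                 # delete the base at ref[10]

mutated = ref[:ins_pos] + ins_base + ref[ins_pos:del_pos] + ref[del_pos + 1:]

changed = [
    k for k, (a, b) in enumerate(zip(codons(ref), codons(mutated))) if a != b
]

print(restores_frame([+1, -1]))  # insertion plus deletion: frame restored
print(restores_frame([+1]))      # insertion alone: frameshift
print(changed)                   # indices of codons that differ
```

In this toy case three codons differ; how many amino acids actually change depends on how far apart the two indels are and on codon degeneracy.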

The remaining two BRCA2 variants are open for interpretation, as they are not in any clinical or population catalog, yet both overlap known pathogenic variants (in the right protein neighborhood).

Sequence Once, Revisit Analysis As We Learn

Ok, so we didn’t find a silver bullet gene that keeps cancer at bay, lifestyle choices at minimal impact levels and your body motoring along past its centennial.

Can you imagine the implications if the supercentenarian researchers did find such a gene? The doors it would open for drug development? The ethical and philosophical debates it would start? The milestone it may represent to human health?

But the thing is, we haven’t proved that there isn’t a silver bullet still in there. And if not a single bullet, maybe the fragments of dispersed buckshot that, if re-assembled, would make for a heck of a shotgun kick.

And with every advancing scientific step, every progression of our understanding of genetics and systems biology, we have the whole genomes of supercentenarians to revisit and ask more questions of.

And someday, they may surrender the answers.

@gabeinformatics
