The advances in DNA sequencing are another magnificent technological revolution that we’re all excited to be a part of. Similar to how the technology of microprocessors enabled the personalization of computers, or how the new paradigms of web 2.0 redefined how we use the internet, high-throughput sequencing machines are defining and driving a new era of biology.
Biologists, geneticists, clinicians, and pretty much any researcher with questions about our genetic code can now get DNA samples sequenced and processed more affordably and capably than ever in their pursuit of answers. Yes, this new-found technology produces unwieldy outputs of data. But thankfully, as raw data is processed down to just the differences between genomes, we are left with rich and accurate information that can easily be handled by researchers on their own terms.
In Part 1 of this series, I followed the story of how this revolution came about through the innovation of companies competing during and after the years of the Human Genome Project. In Part 2, I enumerated the tiers of analysis that start with processing the parallel measurements of the current generation’s high-throughput sequencing machines and end with the genomic variants of sequenced samples. Though more innovations are sure to come in this realm of pipelined processing of sequence data, it’s not until the final tier, tertiary analysis, that researchers are able to apply their own personal expertise to the process of extracting insight from their research data.
In this final installment of the Hitchhiker’s Guide to Next Generation Sequencing, I’m going to explore in more depth the workflows of tertiary analysis, focusing primarily on genotypic variants. Over the last three to four years, the scientific community has proposed a set of methods and tools for us to review as we explore the current landscape of solutions. So let’s examine the current state of sense-making, how the field is progressing, and the challenges that lie ahead.
Motivation Behind Next-Generation Sequencing
To properly discuss the methods and techniques of sequence analysis, we need to step back and understand the motivation behind using sequencing to search for genetic effects. We have, for the past five years, been a very productive scientific community investigating the common disease-common variant hypothesis. With significant Genome-Wide Association Study (GWAS) findings for hundreds of phenotype and disease-risk traits, we should consider this effort well spent.
Genotyping microarrays have been the workhorse of GWAS study designs, as they cheaply enable the assaying of common variants on thousands, and even tens of thousands, of samples. Of course, the more we study human populations, the harder it becomes to settle on one universal definition of what makes a common variant “common.” Indeed, because every shared variant was inherited from somewhere, even “rare” variants are not rare in some geographic regions or sub-populations. And so it follows that with better cataloging of population variants, more mileage can be had in the future with the existing GWAS study design. Even more interestingly, next generation microarrays will be able to assay around 5 million variants, as opposed to the 1-2 million of the Affy SNP 6 or the Illumina HumanOmni1. Besides trying to assay all common variants with minor allele frequencies down to 1% in diverse populations, new microarrays can be based on variant catalogs for specific populations or even biologically relevant panels such as the “MetaboChip” from Illumina.
But even with a new generation of microarrays around the corner, the track record for the GWAS study design has not been perfect, and the limited ability to assay only sets of known variants is being blamed as the culprit. The common variants identified in association studies thus far have accounted for only a small portion of the heritability of the complex traits studied. If complex diseases are driven by susceptibility alleles that are ancient common polymorphisms, as the common disease-common variant hypothesis proposes, we should be seeing a larger portion of the heritability of these traits accounted for by these associated polymorphisms.
The alternative hypothesis that must now be considered is that complex diseases are actually driven by heterogeneous collections of rare and more recent mutations, as with most Mendelian diseases! So how does one go about studying low-frequency or rare variants and how they contribute to genetic risk? Well, nothing beats the fidelity and comprehensiveness of whole-genome sequencing. But given that the cost of whole-genome sequencing may still be prohibitively expensive for larger numbers of samples (though it’s rapidly dropping), whole-exome sequencing is a good compromise.
Clearly the full set of variants provided by sequencing is the ideal, but microarrays should not be totally discounted. As our CEO, Dr. Christophe Lambert, explains in his post Missing Heritability and the Future of GWAS, there can be a productive pairing of next generation sequencing with next generation custom microarrays. Simply put, with the impending ability to customize microarrays, a small-scale preliminary sequencing of affected individuals can be done to provide an enriched panel of rare and common variants specific to the disease population. The custom microarrays could then be used affordably on a large set of cases and controls.
But no matter how you acquire your low-frequency and rare variant data, you will quickly realize that the toolbox of the traditional GWAS study is inadequate for properly ascertaining causal alleles and regions.
Sequencing and Study Design
To understand how to study sequence data, we have to better understand the nature of rare and low frequency variants. Out of the secondary analysis pipeline, you can expect an average of 20K single nucleotide variants to be detected for a single exome. That relatively small number may make the analysis of those variants seem deceptively easy. But let’s try to place those 20K variants in context with some figures from the 1000 Genomes Project on the cataloging of variants and an article by the CDC on the expected functional implications of those variants.
- 25 million unique variant sites have been identified on a set of 629 samples
- Of those 25 million, 15 million have frequencies below 2% in the population
- 7.9 million variants are in dbSNP 129 (closest thing to a database for common variants)
- You can expect roughly 4 million variants for a single sample at the whole-genome scale
- Around 20K will be in coding regions (the exome)
- 250-300 will be loss-of-function variants (biologically damaging) in annotated genes
- 50-100 will be SNPs previously implicated in diseases
With these numbers to set perspective, the scale of the problem becomes apparent. How do you use the abundance of public data, as well as good study design, to filter down to causal regions of interest? How do you distinguish the functions of variants? How do you measure the combined effect of variants and their correlation with phenotypes?
First and foremost, you need to start with good study design. In my previous post, I advocated taking advantage of the centralized expertise and economies of scale that sequencing-as-a-service providers deliver for the sequencing and variant calling of your samples. While this still holds, a word of caution: having dealt with the confounding effects of poor study design in the analysis of many GWAS studies, Dr. Lambert has made a plea to take batch effects seriously when planning the logistics of assaying your samples. There is no reason to believe sequencing studies would be immune to the same effects as GWAS studies. So be sure to work with your genomic center or sequencing service provider to ensure proper randomization of cases, controls, and family members across runs. Technical replicates should also be considered to measure the concordance of measurements across batches.
Besides proper randomization, there are other design level techniques to ensure the highest quality data for your study. Not surprisingly, if you have the luxury of including multiple related samples in your study, you can use the family structure not only to investigate inheritance patterns but to do extra quality checks on your variant calls themselves. Another neat trick is to pool the DNA of your cases and controls into groups, which you can then deeply sequence with over 100x coverage to get a highly accurate set of variants for your various groups.
So now, with proper study design and quality control checks in place, how does one go about finding causal variants?
Analysis of Rare Variants
With the hypothesis of susceptibility alleles being a heterogeneous collection of rare and low frequency alleles, the first step in the analysis of sequence data is to categorize the relative frequency of your sequenced variants. If your sample set is enriched for the disease or trait of interest or you simply don’t have enough samples to attain accurate frequencies, you will want to refer to an external catalog such as dbSNP 129 or maybe the catalog of variants from the 1000 Genomes project. Note that dbSNP 129 is often considered the last “clean” dbSNP build without many rare and unconfirmed variants from 1000 Genomes and other large scale projects being added.
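As a toy sketch of this classification step, the snippet below bins variants by minor allele frequency against a catalog. The variant IDs, the catalog frequencies, and the 5%/1% cutoffs are all illustrative assumptions, not fixed standards or real dbSNP values:

```python
# Hypothetical MAF lookup, standing in for a catalog such as dbSNP 129
# or the 1000 Genomes variant set. Absent variants are treated as novel.
CATALOG_MAF = {"rs1001": 0.32, "rs2002": 0.04, "rs3003": 0.004}

def classify_frequency(variant_id, catalog=CATALOG_MAF):
    """Bin a variant by minor allele frequency using common (but not
    universal) conventions: >=5% common, 1-5% low-frequency, <1% rare."""
    maf = catalog.get(variant_id)
    if maf is None:
        return "novel"
    if maf >= 0.05:
        return "common"
    if maf >= 0.01:
        return "low-frequency"
    return "rare"
```

In a real workflow this lookup would run over your full variant table, and a variant’s “novel” status would itself be a signal worth tracking, since novel variants are enriched for both sequencing artifacts and interesting rare alleles.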
With a classification of your variants, the next step is to search for regions where the heterogeneous burden of rare, low frequency and common variants is strongly correlated with the trait under study. Traditional association techniques used in GWAS studies do not have the power to detect associations with these rare variants individually or provide tools for measuring their compound effect. To this end, there has been a field-wide development of analytic approaches for testing disease association with sequence data.
The first round of methods focused on finding association regionally, based on accounting for the presence of at least one rare variant or a counting of the number of rare variants. The Cohort Allelic Sum Test (CAST; Cohen et al.) was developed in 2004 to simply count and compare the number of individuals with one or more mutations in a region or gene between affected and unaffected groups. In 2008, Li and Leal published the Combined Multivariate and Collapsing (CMC) method, which similarly collapses the rare variants into a single covariate that is analyzed along with all the common variants in the region using a multivariate analysis. Finally, Madsen and Browning published a method in 2009 describing a weighted-sum approach that tests group-wise association with disease while allowing rare variants to carry more weight than common variants.
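To make the flavor of these collapsing tests concrete, here is a minimal CAST-style sketch: each individual is collapsed to carrier/non-carrier status for the region, and the resulting 2x2 table is tested with a one-sided Fisher’s exact test. The genotype encoding and sample data are invented for illustration, and the published methods differ in their exact statistics and weighting:

```python
from math import comb

def cast_test(case_genotypes, control_genotypes):
    """CAST-style carrier test: compare the number of individuals with at
    least one rare variant in a region between cases and controls.
    Each genotype is a per-sample list of rare-allele counts per site."""
    a = sum(1 for g in case_genotypes if any(g))      # case carriers
    b = len(case_genotypes) - a                       # case non-carriers
    c = sum(1 for g in control_genotypes if any(g))   # control carriers
    d = len(control_genotypes) - c                    # control non-carriers
    # One-sided Fisher's exact test (hypergeometric tail) for carrier
    # enrichment among cases.
    n, carriers, draws = a + b + c + d, a + c, a + b
    p = sum(comb(carriers, k) * comb(n - carriers, draws - k)
            for k in range(a, min(draws, carriers) + 1)) / comb(n, draws)
    return (a, b, c, d), p

# Five cases (four carry a rare variant) vs. five controls (one carrier).
table, p_value = cast_test(
    [[1, 0], [1, 1], [0, 1], [1, 0], [0, 0]],
    [[0, 0], [0, 0], [1, 0], [0, 0], [0, 0]],
)
```

The collapsing step is what buys power: individually, each rare variant is far too infrequent to reach significance, but the pooled carrier counts can be compared with an ordinary contingency test.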
The second round of rare variant analysis methods attempts to address a couple of confounding issues. The first is that not all variants, regardless of their rarity, are created equal. In fact, protective, neutral, and damaging variants may all be present in the same gene or region being tested. While neutral variants generally just reduce power, they are fairly easy to filter out (see next section). The presence of both protective and damaging variants in the same region, without knowing which is which, is especially confounding. Using a very neat set of mathematical functions, called c(α), you can test for the mixture of neutral, risk, and protective variants. Benjamin Neale has presented this approach at ASHG 2010 and other venues, but the details of the method have not yet been published.
Suzanne Leal, the author behind the CMC method, has proposed an enhanced method with Dajiang Liu, which they call Kernel Based Adaptive Cluster (KBAC). This new method can account for other more general confounders such as age, sex, and population substructure in its association testing.
Almost all of these methods benefit from a second dimension of variant classification beyond population frequency: inferred or predicted functionality.
Getting More Clarity with Functional Prediction
Although we expect rare variants to be enriched for functional alleles (given the view that functional allelic variants are subject to purifying selection pressure), properly understanding and accounting for the functional significance or insignificance of variants improves the understanding of the contribution of those variants. Variants come in the form of single nucleotide variants (SNV), and insertions and deletions (often abbreviated as “indels”). When SNVs occur in coding regions, they are either synonymous (in that they code the same amino acid) or non-synonymous. Non-synonymous single base pair substitutions can then be classified into missense mutations, where one amino acid is replaced with another; nonsense mutations, where an amino acid codon is replaced with a stop codon; and splice site mutations, where signals for exon-intron splicing are created or destroyed. Small insertions or deletions may introduce frameshift errors by adding or removing a number of nucleotides that is not a multiple of three, hence entirely changing the downstream amino acid coding.
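The coding-consequence categories above can be sketched with a small classifier built on the standard genetic code. The function names are mine, and splice-site effects, which need transcript coordinates rather than codons, are omitted:

```python
# Standard genetic code, with bases ordered T, C, A, G; '*' marks stop.
CODONS = [a + b + c for a in "TCAG" for b in "TCAG" for c in "TCAG"]
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip(CODONS, AMINO))

def classify_snv(ref_codon, alt_codon):
    """Classify a coding single-base substitution by its amino acid effect."""
    ref_aa, alt_aa = CODE[ref_codon], CODE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"

def classify_indel(ref_allele, alt_allele):
    """An insertion or deletion shifts the reading frame unless the
    length difference is a multiple of three."""
    shift = abs(len(ref_allele) - len(alt_allele)) % 3
    return "frameshift" if shift else "in-frame"
```

For example, `classify_snv("TGG", "TGA")` reports a nonsense mutation (tryptophan to stop), while `classify_indel("A", "ATG")` flags a two-base insertion as a frameshift.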
Although these types of mutations seem easy enough to classify, for a missense mutation what we really care about is whether the amino acid substitution affects the protein’s function. Protein function prediction is a world of complexity itself and quickly leads to understanding entire organisms and their networks. Programs such as SIFT and PolyPhen2 are able to make a decent prediction of how likely a mutation is to be damaging by looking at things like how well a protein sequence has been conserved through evolution as a proxy for its functional importance.
If you’re trying to get a perfect picture of the functional importance of all your exonic variants, you’ll see it’s already tough. When you throw into the mix splicing mutations that alter splice sites, affect splicing enhancers, or activate cryptic splice sites, things get a bit tougher. When you try to account for all potential gene regulation sites upstream and downstream, it gets near impossible. Although many of these things can be worked out through carefully run lab experiments on a gene-by-gene basis, possibly the best we can do at a genome-wide scan level is to keep variants within reasonable distances of genes and close to splice sites in our analysis as we filter.
Along with the variant calls themselves, a secondary analysis pipeline usually provides quality scores and the depth of the pileup where each variant was called. You can use this information either to pre-filter variants down to a higher-quality set or to investigate the variants in your region of interest to ensure their biological validity. At the level of individually confirming variants, you may want to visualize the aligned sequences themselves around a variant. If things seem a bit wonky, doing a de novo assembly of that local set of reads may clean up the calls.
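A minimal pre-filtering sketch, assuming the pipeline reports a Phred-scaled quality score and a read depth per call; the field names, thresholds, and sample records here are illustrative, and real pipelines expose these as the QUAL and DP fields of a VCF file:

```python
def passes_qc(variant, min_qual=30.0, min_depth=10):
    """Keep only variants called with adequate confidence and coverage.
    A Phred quality of 30 corresponds to a 1-in-1000 error chance; the
    depth cutoff guards against calls made from thin pileups."""
    return variant["qual"] >= min_qual and variant["depth"] >= min_depth

calls = [
    {"id": "var1", "qual": 99.0, "depth": 42},
    {"id": "var2", "qual": 12.5, "depth": 35},  # low-confidence call
    {"id": "var3", "qual": 50.0, "depth": 4},   # thin pileup
]
high_quality = [v["id"] for v in calls if passes_qc(v)]
```

Of the three example calls, only the first survives both cutoffs; the other two are the sort of calls you would either drop up front or flag for manual inspection in an alignment viewer.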
Other Analyses on Sequence Data
Besides trying to nail down the elusive genetic components of complex diseases through variant analysis, DNA sequencing is an excellent tool for other study types. In the case of rare, penetrant Mendelian diseases, sequencing the whole genomes of affected individuals and their families can lead to near-instant success in discovering the causal mutations. The analysis is not one of association, but really of data mining. Simply doing set-filters between the variants of different samples helps illuminate novel or compound heterozygous mutations in just a few, easy-to-follow-up regions.
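The set-filter idea can be shown with plain Python sets, using invented variant IDs and a simple de novo/recessive filtering model; a real workflow would also track genotypes to catch compound heterozygotes:

```python
# Hypothetical per-sample variant sets, keyed by variant ID.
proband = {"v1", "v2", "v3", "v4"}             # affected child
mother = {"v1", "v5"}
father = {"v2", "v6"}
unaffected_sibling = {"v1", "v3", "v7"}

# Variants in the affected child absent from both unaffected parents:
# candidates for de novo mutations.
novel_in_proband = proband - mother - father

# Further require absence from an unaffected sibling to narrow the list.
candidates = novel_in_proband - unaffected_sibling
```

Two set subtractions reduce the proband’s variants to a single candidate, which is exactly why family-based Mendelian studies can succeed with so few samples: each relative’s variant set is a powerful filter on all the others.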
Although small SNV and indel variants are the current focus for association studies with sequence data, larger structural variants such as copy number variants (CNV) and other inherited genomic features such as runs of homozygosity (ROH) can also be studied with whole-genome and sometimes even whole exome sequence data. Many methods have been proposed to tackle these tasks, and as sequencing becomes more affordable, we expect further focus on maturing the tools to support these studies.
Finally, outside common and rare Mendelian diseases, the quest to understand the mutations driving cancer is always in need of more precise and comprehensive measurements of the genome. In the variant-abundant world of cancers, nothing beats having a whole-genome scan of your cancer and normal samples for comparative analysis.
Current and Future Solutions
As we have seen in our exploration, tertiary analysis is where things really open up to the inquisitive process researchers apply in discovering regions of importance in the vast genomes of their samples. For a lot of the filtering steps I’ve described here, programs such as ANNOVAR can take large lists of variants and help narrow them down. Although many of the collapsing and testing methods designed for this field have been around for years, it’s hard to point to any package that makes them easy to use. Meanwhile, functional prediction and investigating variants in the context of a genome browser still require variant-by-variant queries to various websites.
Golden Helix hopes to open up this realm of analysis to more researchers with integrated and powerfully extensible tools that grow to reflect the best of breed methods and workflows for tertiary analysis of sequence data. In the release of SNP and Variation Suite version 7.4 next Thursday, we plan to kick off that focus with support of many of the steps described here, and more to come. We hope you will be one of our collaborators in this exploration! … And that’s my two SNPs.
Editor’s Note: There’s more to learn about NGS. Check out:
The State of NGS Variant Calling: DON’T PANIC!! Blog post by Gabe. March 25, 2013
Knowing Your NGS Upstream: Alignment and Variants. Webcast by Gabe. March 27, 2013
Knowing Your Downstream: Functional Predictions. Webcast by Bryce. May 15, 2013
23andMe Variant Analysis of My Personal Exome. Webcast by Gabe. December 5, 2012