Increase Power and Data Quality with Advanced Genotyping and Imputation Methods

         July 30, 2010

Accuracy and completeness of genotype data are among the most important factors for a successful genome-wide association study (GWAS), and must not be taken lightly.  The Golden Helix team is always on the lookout for methods to improve data quality, and we have recently found the BEAGLE and BEAGLECALL software packages to be very useful in this regard.  BEAGLE is a Java-based tool for genetic analysis developed and maintained by Drs. Brian Browning and Sharon Browning at the University of Auckland (the Brownings have recently joined the faculty at the University of Washington).  BEAGLE is particularly useful for inferring the genotypes of missing and untested SNPs.  BEAGLECALL is a companion to BEAGLE that we have found to be the most accurate genotype calling tool available today.

BEAGLE – Imputation
Genotype imputation is the process of inferring the genotype of one or more markers based on the correlation pattern (also known as linkage disequilibrium, or LD) of the surrounding markers for which genotypes are known.  Imputation is growing in popularity and has been repeatedly shown to be very accurate.  A major benefit of imputation is the ability to harmonize data from multiple sources.  For example, the Affymetrix 6.0 SNP genotyping array includes only a portion of the SNPs included on the Illumina 660K array, but imputation allows you to fill in the missing genotypes for subjects tested with each array, so that all subjects will have genotypes for the union of the two sets of markers.  This enables researchers to merge multiple data sources for a single large analysis without the cost of re-genotyping the subjects on a common platform, and more subjects means greater statistical power.  Imputation is particularly useful when incorporating public data into an analysis, where a researcher doesn’t have control over the genotyping platform used.  As newer, denser genotyping arrays are released, the old study data that we have stored on the shelf doesn’t instantly become obsolete: BEAGLE provides a way to update that old data and use it in direct comparisons with data from the latest platforms.

Another common use for imputation is to infer the genotypes of candidate SNPs for replication studies.  For example, somebody may report that a certain SNP is associated with a particular trait.  If you are studying the same trait and would like to replicate the published finding, but do not have the exact SNP available, imputation can be used to infer that SNP from available marker data.

Imputation algorithms are often compared based on speed and accuracy.  BEAGLE performs well in both categories.  While it won’t finish a full-genome imputation during your lunch break, we have found the speed to be satisfactory, and the progress log is helpful.  For internal projects, we’ve developed an automated system to run each chromosome simultaneously on a different CPU, which of course makes a huge difference in run times for full-genome imputation.  I strongly recommend that anybody who wants to try SNP imputation begin by speaking with somebody who has experience with it.  File formats, allele/strand matching, map alignment and selection of reference panels can all make it difficult to get started.
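The per-chromosome parallelism mentioned above can be sketched in a few lines of Python.  This is not our internal system, only a minimal illustration of the idea.  The build_command function is a hypothetical placeholder that simply prints the chromosome number; you would replace it with the actual imputation command line for your own installation, which is deliberately not shown here.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_command(chrom):
    """Hypothetical placeholder: return the command to run for one chromosome.

    This stand-in just prints a message so the sketch runs anywhere.
    Swap in the real imputation command for your setup.
    """
    return [sys.executable, "-c", f"print('imputing chr{chrom}')"]


def run_chromosome(chrom):
    """Run one chromosome's job; return (chrom, exit code, captured stdout)."""
    result = subprocess.run(build_command(chrom), capture_output=True, text=True)
    return chrom, result.returncode, result.stdout.strip()


def run_all(chromosomes, max_workers=4):
    """Run the jobs concurrently, at most max_workers at a time.

    Threads are enough here because each worker only waits on an
    external process. Results come back in the input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_chromosome, chromosomes))


if __name__ == "__main__":
    for chrom, code, out in run_all(range(1, 23)):
        status = "ok" if code == 0 else f"failed ({code})"
        print(f"chr{chrom}: {status} {out}")
```

Setting max_workers to the number of available CPU cores (and keeping an eye on memory) is a reasonable starting point; checking each exit code makes it easy to rerun only the chromosomes that failed.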

One more note regarding imputation: I’ve spoken with people who have the impression that the power of a GWAS can be increased simply by imputing additional SNPs.  That is, by imputing millions of additional SNPs, you may improve the chances of finding a significant association signal.  While this is theoretically possible, I would like to urge caution on this point.  Remember that you cannot impute the genotype of a marker that is not correlated with the markers that you already have available.  That is, imputed genotypes are by definition correlated with the original genotypes.  New, previously unobserved association signals are therefore unlikely.  At best, our internal testing has shown minor gains in significance that are not sufficient to offset the increased multiple testing.  The greatest power gain from imputation does not come from adding markers, but from adding subjects, as described above.
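To put rough numbers on the multiple-testing cost, here is a small sketch using a simple Bonferroni correction.  The marker counts are made-up round figures for illustration, not results from our testing, and Bonferroni is known to be conservative when markers are correlated, so treat this as an upper bound on the penalty.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


alpha = 0.05
genotyped_markers = 500_000        # hypothetical array size
with_imputed_markers = 2_500_000   # hypothetical count after imputation

before = bonferroni_threshold(alpha, genotyped_markers)
after = bonferroni_threshold(alpha, with_imputed_markers)

# How much -log10(p) a signal must gain just to break even
extra_needed = math.log10(before) - math.log10(after)

print(f"threshold before imputation: {before:.1e}")
print(f"threshold after imputation:  {after:.1e}")
print(f"extra -log10(p) needed to break even: {extra_needed:.2f}")
```

With these assumed counts, testing five times as many markers requires a signal to improve by about 0.7 units of -log10(p) before it is any more significant than it was on the original marker set.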

BEAGLECALL – Genotype Calling
Despite the constant advances in genotyping technology and arrays, making correct genotype calls remains a problematic issue.  Experimental variability leads to noise in allele signal intensities, and a few incorrect genotype calls can quickly lead to false (positive or negative) association signals.  Conventional calling algorithms consider one marker at a time and rely exclusively on allele signal intensity clustering to make genotype calls.  In a perfect world, there would be no variability in DNA collection and processing procedures, and the signal intensities for every SNP would form clear, unambiguous clusters for AA, AB, and BB genotypes.  But real data is often very noisy.  It is not uncommon for the cluster boundaries to be very vague or to have substantial overlaps, which leads to incorrect calls and/or low call rates.


Left: Genome-wide association test results based on CRLMM genotype calls for Affymetrix 500K data.
Right: Result of the same association test for the same subjects after re-calling genotypes with BEAGLECALL.

BEAGLECALL improves on conventional genotype calling methods by considering nearby haplotype structure in addition to signal clustering.  In essence, it relies on the phasing and imputation capabilities of BEAGLE to confirm the accuracy of each genotype call.  If it is unclear whether a genotype is AA or AB based on the clustering, BEAGLECALL can use the LD structure of the surrounding markers, as determined by subjects whose genotypes are more certain, to resolve what the correct call is.  BEAGLECALL begins with a prior distribution of calls and uses multiple iterations of phasing and correcting genotypes before making a final determination (warning: this takes some time).  BEAGLECALL looks at the genotyping results, compares haplotypes across all individuals, then identifies and corrects unlikely genotypes in the same way that imputation fills in missing genotypes.  We have used this method for several internal projects using data from both Affymetrix and Illumina and have found that the call rates have improved substantially while the number of spurious association signals has dropped dramatically.  The figure below illustrates the substantial reduction in data noise that can be achieved with BEAGLECALL.
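The intuition of combining two sources of evidence can be shown with a toy calculation.  To be clear, this is not BEAGLECALL's algorithm, which iterates phasing and calling across all subjects.  It is only a one-step Bayesian update, and both the intensity likelihoods and the haplotype-based prior below are invented numbers.

```python
GENOTYPES = ("AA", "AB", "BB")


def combine(intensity_likelihood, haplotype_prior):
    """Posterior over genotypes: likelihood times prior, renormalized to sum to 1."""
    unnormalized = {
        g: intensity_likelihood[g] * haplotype_prior[g] for g in GENOTYPES
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("likelihood and prior have no genotype in common")
    return {g: value / total for g, value in unnormalized.items()}


# Invented example: the intensity clusters cannot separate AA from AB.
intensity = {"AA": 0.48, "AB": 0.48, "BB": 0.04}

# Invented example: surrounding haplotypes in confidently called
# subjects suggest this individual is most likely heterozygous.
prior = {"AA": 0.10, "AB": 0.85, "BB": 0.05}

posterior = combine(intensity, prior)
best_call = max(posterior, key=posterior.get)

print(best_call, {g: round(p, 3) for g, p in posterior.items()})
```

Clustering alone leaves AA and AB tied in this example, while adding the LD-based evidence tips the call clearly toward AB.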

Similar to other calling methods, BEAGLECALL assigns a likelihood value to each of the output genotypes, allowing researchers to incorporate uncertainty in the analysis.  One of the most powerful features of the BEAGLE package is that the likelihoods produced by BEAGLECALL can be used as an input for BEAGLE imputation.  One of the major problems with using imputation to harmonize data sets is that differences in genotype calling for the various data sets can lead to numerous spurious associations.  When the raw data is available, we recommend running BEAGLECALL for both datasets, then using the likelihood output as the starting point for the imputation process.  We believe that this workflow is the best method currently available to limit the influence of batch effects in imputed datasets.
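One common way to carry genotype uncertainty into an association test is to collapse the three genotype probabilities into an expected allele count, or dosage.  The sketch below assumes you have already parsed the probabilities into (P(AA), P(AB), P(BB)) triples; it does not read any particular BEAGLE file format, and the function name is our own.

```python
def b_allele_dosage(p_aa, p_ab, p_bb, tolerance=1e-3):
    """Expected number of B alleles (0 to 2) given genotype probabilities.

    The probabilities should sum to 1; small rounding error is absorbed
    by renormalizing, and larger deviations raise an error.
    """
    total = p_aa + p_ab + p_bb
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"probabilities sum to {total}, expected 1")
    # An AB genotype carries one B allele, a BB genotype carries two.
    return (p_ab + 2.0 * p_bb) / total


# Invented example values: (P(AA), P(AB), P(BB)) per subject
subjects = [
    (1.00, 0.00, 0.00),  # confident AA
    (0.10, 0.80, 0.10),  # probably AB
    (0.00, 0.30, 0.70),  # leaning BB
]

dosages = [b_allele_dosage(*probs) for probs in subjects]
print([round(d, 2) for d in dosages])
```

A confidently called genotype yields a dosage of exactly 0, 1, or 2, while an uncertain one lands in between, so a regression on dosage naturally gives uncertain calls less influence.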

We believe that BEAGLECALL is a valuable tool for cleaning up noisy genotype data and for eliminating batch-related data artifacts.  It does require some extra time at the beginning of the analysis, but the returns justify the investment: improved call rates mean less data is wasted, and fewer false positive association signals mean less time is spent validating results.

We would like to see more people use BEAGLE and BEAGLECALL, but we are also mindful of the fact that these are advanced tools.  Anybody using them is likely to experience some trial and error along the way.  We have add-on scripts available to help our SVS users interface with BEAGLE, and we hope to have more BEAGLE-related scripts available soon.  If you have any questions on how to use either program, let us know.  We can’t offer direct support for the BEAGLE programs, but we’d be happy to share our experience with you.  …And that’s my two SNPs.


The above image shows allele signal intensity cluster plots for two different SNPs from the same study population. The upper panels show genotypes determined by Birdseed and the lower panels show BEAGLECALL genotypes. The plots on the right are for a SNP with a clear pattern of genotype clustering. The plots on the left show a SNP with poor resolution of A_B and B_B genotype clusters and the increased clarity of genotype calls that comes from using BEAGLECALL.

Editor’s note:
For a more comprehensive look at BEAGLE and BEAGLECALL, download an exclusive GHI whitepaper, “Advanced Imputation and Genotype Calling”.

3 thoughts on “Increase Power and Data Quality with Advanced Genotyping and Imputation Methods”

  1. Dan Frost

    Well said Bryce! This post has really helped explain the role, importance, and inherent difficulty of imputation and genotype via BEAGLE and BEAGLECALL.

  2. Mohammad Akbari

    Is there any script for converting genotype probability beagle output files to allele dosage files?
    Thanks,
    Mohammad
