By Popular Request: Our BEAGLE Algorithm Gains Support for Family Structure

Earlier this year we released our own optimized and integrated BEAGLE implementation for SVS based on the BEAGLE 4.1 and optionally 4.0 algorithms.

One of the commonly requested features since that release has been to expand the implementation to take the parent-offspring relationships between samples into account, using them to inform and improve the accuracy of haplotype phasing. With this information, both the creation of phased reference panels and the imputation of samples against those panels become more accurate, and the resulting genotypes are free of Mendelian errors.

In this post, we review how phasing works in the two BEAGLE algorithm variants (4.1 and 4.0), and how we combined the original “Java BEAGLE” approach with novel techniques to bring pedigree-aware phasing to the latest SVS phasing and imputation algorithms.

Phasing in BEAGLE

In Beagle, and in the SVS implementation of the Beagle algorithms, one of the steps is phasing the data, that is, determining the haplotype pairs in the data. When creating a reference panel, phasing is the only step.

Each iteration of the Beagle phasing algorithm first creates a directed acyclic graph (DAG) encapsulating the most likely haplotypes as a localized haplotype-cluster model. This model, which is based on the haplotypes estimated by the previous iteration, is “leveled”–that is, its edges may be subdivided into sets, where each set corresponds with one marker, and its nodes may be subdivided into sets, where each node set may be thought of as lying between two markers. The entire process starts with a model based on what the haplotypes would be if there were perfect linkage equilibrium between the markers.
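To make the starting point concrete, here is a minimal Python sketch of a leveled model under perfect linkage equilibrium: each marker has one edge per allele, weighted by allele frequency, so the probability of a haplotype is simply the product of its edge weights. The function names and the list-of-dicts representation are assumptions for illustration only, not the SVS or Beagle data structures.

```python
from collections import Counter

def linkage_equilibrium_model(haplotypes):
    """Build one level per marker; each level maps allele -> frequency.

    With a single node between adjacent markers, every allele at one
    marker can be followed by every allele at the next, which is what
    perfect linkage equilibrium means.
    """
    n_markers = len(haplotypes[0])
    levels = []
    for m in range(n_markers):
        counts = Counter(h[m] for h in haplotypes)
        total = sum(counts.values())
        levels.append({allele: c / total for allele, c in counts.items()})
    return levels

def haplotype_probability(levels, haplotype):
    """Probability of a haplotype = product of the edge weights it follows."""
    p = 1.0
    for level, allele in zip(levels, haplotype):
        p *= level.get(allele, 0.0)
    return p
```

Later iterations replace this single-node-per-level picture with a model in which nodes are split wherever the estimated haplotypes show that the past is informative about the next allele.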

The next step in each iteration is to use the localized haplotype-cluster model as part of a Hidden Markov Model (HMM), and to use a forwards-backwards algorithm to sample phased haplotypes according to their probabilities as determined by the HMM and conditional on the genotypic data. The actual direction of “forwards” and “backwards” over the genome is switched between iterations.
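The forward pass followed by backward sampling can be sketched for a generic HMM as follows. This is a simplified illustration, assuming states are plain integers and that the initial, transition, and emission probabilities are given as nested lists; in Beagle the states and transitions come from the edges of the localized haplotype-cluster model, which this sketch does not attempt to reproduce.

```python
import random

def forward(init, trans, emit, observations):
    """Forward pass: alpha[t][s] is proportional to
    P(state s at position t, observations[0..t])."""
    n_states = len(init)
    alpha = [[init[s] * emit[s][observations[0]] for s in range(n_states)]]
    for obs in observations[1:]:
        prev = alpha[-1]
        alpha.append([
            emit[s][obs] * sum(prev[r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ])
    return alpha

def sample_path(alpha, trans, rng):
    """Backward sampling: draw the last state from alpha[-1], then draw each
    earlier state in proportion to alpha[t][r] * trans[r][next_state]."""
    states = list(range(len(trans)))
    path = [rng.choices(states, weights=alpha[-1])[0]]
    for t in range(len(alpha) - 2, -1, -1):
        nxt = path[-1]
        weights = [alpha[t][r] * trans[r][nxt] for r in states]
        path.append(rng.choices(states, weights=weights)[0])
    path.reverse()
    return path
```

Because the path is sampled rather than maximized, repeated calls give different haplotype configurations in proportion to their posterior probability.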

After the last iteration, the algorithm determines which are the most likely haplotypes that have been estimated.

Finally, in Beagle 4.1, this is followed by a modified HMM algorithm that takes advantage of identity-by-state (IBS) segments that exist in the data.

As a by-product, isolated missing genotypes in this data are imputed.

Phasing Using Individual Samples

Normally, and always when the Beagle 4.1 algorithm is used, the Hidden Markov Model that is used is a diploid HMM–that is, an HMM created from ordered pairs of edges at each level of the model. Also, normally the sampling is based on the genotypic data of individual samples. The ordered pairs of edges correspond to pairs of allele values for possible genotypes for the marker at that level. For each sample, the “forward algorithm” computes the probabilities of each “state” (ordered pair of edges) at each level given the sample’s genotypes, and the (“backwards”) sampling randomly selects states according to their probabilities. The sampled path of hidden states corresponds to an ordered pair of haplotypes that are consistent with the individual’s genotype.
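As a rough illustration of what a diploid state is, the sketch below enumerates the ordered pairs of edges at one level and checks which pairs are consistent with a sample's genotype. The edge labels, the dictionary layout, and the 0/1 emission rule are assumptions made for this example.

```python
from itertools import product

def diploid_states(edge_alleles):
    """All ordered pairs of edges at one level.

    edge_alleles maps a (hypothetical) edge id to the allele it carries.
    """
    return list(product(edge_alleles, repeat=2))

def emission(state, edge_alleles, genotype):
    """1.0 if the state's two alleles match the unordered genotype,
    0.0 otherwise. A missing genotype (None) is compatible with any state."""
    if genotype is None:
        return 1.0
    alleles = sorted(edge_alleles[e] for e in state)
    return 1.0 if alleles == sorted(genotype) else 0.0
```

For a level with three edges there are nine ordered pairs; a heterozygous genotype rules out every pair whose two edges carry the same allele.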

Of course, this process ignores any information that may come from the genotypes of one or both parents that could inform what the most likely haplotypes might be for the offspring, even if that information is available.

The Beagle 4.1 algorithm cannot be made pedigree aware. When the optional Beagle 4.0 algorithm is used, however, the phasing strategy can optionally be expanded to use pedigree information.

Phasing Using the BEAGLE 4.0 Pedigree Algorithm

In SVS, if the Beagle 4.0 algorithm is selected on a pedigree spreadsheet, a new option called “Use Pedigree Information” is available. When selected, the pedigree information will be used during the phasing of the samples in the spreadsheet.

Before this process begins, input data is scanned for Mendelian errors. Wherever Mendelian errors are found, both the offspring data and the parent data are set to “missing” in order to allow for other possibilities of what the actual input data is.
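The Mendelian check described above can be sketched as follows. This is a minimal illustration, assuming autosomal markers, genotypes stored as pairs of alleles, and None for a missing genotype; the function names are hypothetical and not part of SVS.

```python
def transmits(parent, allele):
    """True if the parent genotype could have supplied this allele.
    A missing parent genotype (None) can supply anything."""
    return parent is None or allele in parent

def is_mendelian_consistent(father, mother, child):
    """Check one marker: one child allele must come from each parent."""
    if child is None:
        return True
    a, b = child
    return ((transmits(father, a) and transmits(mother, b)) or
            (transmits(father, b) and transmits(mother, a)))

def mask_mendelian_errors(father, mother, child):
    """Return copies of the three per-marker genotype lists in which every
    inconsistent marker is set to missing in parents and offspring alike,
    together with the indices of the markers that were masked."""
    father, mother, child = list(father), list(mother), list(child)
    errors = []
    for m in range(len(child)):
        if not is_mendelian_consistent(father[m], mother[m], child[m]):
            father[m] = mother[m] = child[m] = None
            errors.append(m)
    return father, mother, child, errors
```

A duo can be handled by the same functions by passing a list of None values for the parent whose data is unavailable.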

For each iteration, after the localized haplotype-cluster model is created, the Beagle 4.0 pedigree algorithm uses a unique and specialized HMM for each of the following family-aware scenarios:

  1. For complete parent-offspring trios (father/mother/offspring),
  2. For one-parent-offspring duos (one parent for which data is available/offspring), and
  3. For singles (individual samples for which parent data is not available).

(A sample can be a parent in one or more duos and/or trios while being the offspring of another duo or trio.)
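One way to picture the grouping into trios, duos, and singles is the sketch below. It assumes the pedigree is a dictionary mapping each sample to a (father, mother) pair and that `genotyped` is the set of samples with data. Treating a sample as a single only when it belongs to no duo or trio is an assumption of this example; the function name and data layout are hypothetical.

```python
def classify_families(pedigree, genotyped):
    """Split samples into trios, duos, and singles.

    pedigree  : dict sample -> (father, mother); None marks an unknown parent
    genotyped : set of samples for which genotype data is available
    """
    trios, duos = [], []
    in_family = set()
    for child in sorted(genotyped):
        father, mother = pedigree.get(child, (None, None))
        parents = [p for p in (father, mother) if p in genotyped]
        if len(parents) == 2:
            trios.append((father, mother, child))
        elif len(parents) == 1:
            duos.append((parents[0], child))
        else:
            continue
        in_family.update(parents, [child])
    # Assumption: singles are the samples that appear in no duo or trio.
    singles = sorted(genotyped - in_family)
    return trios, duos, singles
```

In this layout the same sample can appear as the offspring of one unit and as a parent in another, as in a three-generation family.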

When this process is finished, the estimated haplotype pairs are pooled together for the next iteration or for the final haplotype determination.

For singles, a diploid HMM is created from ordered pairs of edges at each level, and the forward-backward algorithm is used with these individual samples, as described above.

For duos, an HMM based on ordered triples of edges at each level is created, and sampling takes place for the parent-offspring duos, treating each duo as a unit. The principle here is that one haplotype will have been passed from parent to offspring, so that the ordered triples correspond to:

  1. The one haplotype that was passed from parent to offspring,
  2. The other haplotype that the parent did not pass to the offspring, and
  3. The other haplotype of the offspring that came from the other parent.

When finished, two haplotype pairs are created for each duo that share the one haplotype that was passed from parent to offspring.
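The way a sampled triple turns into two haplotype pairs can be sketched in a few lines. The function and argument names are hypothetical, and haplotypes are represented here as simple tuples of alleles.

```python
def duo_haplotype_pairs(transmitted, parent_untransmitted, offspring_other):
    """Assemble the parent and offspring haplotype pairs from one sampled
    triple. The transmitted haplotype appears in both pairs."""
    parent = (transmitted, parent_untransmitted)
    offspring = (transmitted, offspring_other)
    return parent, offspring
```

Because the shared haplotype is the same object in both pairs, the parent and offspring cannot disagree at any marker on the transmitted haplotype.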

For trios, an HMM based on ordered quartets of edges at each level is created, and sampling takes place for the father/mother/offspring trios, treating each trio as a unit. Here, the ordered quartets correspond to:

  1. The haplotype passed from father to offspring,
  2. The father’s haplotype that is not passed to the offspring,
  3. The haplotype passed from mother to offspring, and
  4. The mother’s haplotype that is not passed to the offspring.

When finished, three haplotype pairs are created for each trio, with the offspring haplotype pair sharing haplotypes from the two parents.
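A small sketch of how a sampled quartet maps to three haplotype pairs, together with a count of how many ordered tuples of edges exist at a level, is shown below. The names are hypothetical; the count is just the number of ordered k-tuples drawn from the edges at that level.

```python
def trio_haplotype_pairs(quartet):
    """Assemble father, mother, and offspring haplotype pairs from one
    sampled quartet (father transmitted, father untransmitted,
    mother transmitted, mother untransmitted)."""
    f_trans, f_untrans, m_trans, m_untrans = quartet
    return {
        "father": (f_trans, f_untrans),
        "mother": (m_trans, m_untrans),
        "offspring": (f_trans, m_trans),
    }

def state_count(n_edges, tuple_size):
    """Number of ordered tuples of edges at a level:
    tuple_size is 2 for singles, 3 for duos, 4 for trios."""
    return n_edges ** tuple_size
```

The count grows quickly with the tuple size, which gives a sense of why sampling trios is more expensive per unit than sampling singles.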

It would be possible to create all three HMM models from the same localized haplotype-cluster model. However, the Beagle 4.0 algorithm has been fine-tuned to use one model for singles or individual samples and one different model for duos and trios, where the only difference between these models is one parameter, “modelscale”, used for their creation. This parameter is set so that for singles (0.8), haplotype clusters must be more alike before they will be merged than they need to be for duos or trios (1.0).
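The effect of the scale parameter can be pictured with the toy rule below. Only the values 0.8 and 1.0 come from the description above; the merge rule itself (compare a dissimilarity score against a scaled threshold) is a hypothetical stand-in and not the actual Beagle model-building test.

```python
# Scale values from the text; the keys are labels chosen for this example.
MODEL_SCALE = {"single": 0.8, "duo": 1.0, "trio": 1.0}

def may_merge(dissimilarity, base_threshold, unit_kind):
    """Hypothetical merge rule: two haplotype clusters are merged when their
    dissimilarity is below the threshold multiplied by the scale.
    A smaller scale means clusters must be more alike to merge."""
    return dissimilarity < MODEL_SCALE[unit_kind] * base_threshold
```

With a base threshold of 0.1, a pair of clusters with dissimilarity 0.09 would merge in the duo and trio model but stay separate in the singles model.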

Additional Outputs with Pedigree Information

When you take advantage of the PED-file-like pedigree data in your SVS spreadsheet, you receive several benefits and additional outputs:

  1. A better estimation of the haplotypes of both parent samples and offspring samples will be performed, leading to both better phasing and, when imputation is requested, better imputation.
  2. Mendelian errors will be eliminated from the output.
  3. SVS will produce a spreadsheet summarizing the Mendelian errors by sample.
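A per-sample summary of Mendelian errors could be tallied along the lines of the sketch below. The record layout and the choice to count an error against every member of the affected duo or trio are assumptions for illustration; the actual columns of the SVS spreadsheet are not described here.

```python
from collections import Counter

def summarize_errors(error_records):
    """Count Mendelian errors per sample.

    error_records: iterable of (family_members, marker) pairs, where
    family_members is a tuple of the sample names in the duo or trio.
    Returns a sorted list of (sample, error_count) rows.
    """
    counts = Counter()
    for members, _marker in error_records:
        counts.update(members)
    return sorted(counts.items())
```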

How this Compares To The Open Source Java BEAGLE

The latest BEAGLE algorithm (BEAGLE 4.1) dropped support for using PED files to provide pedigree structure to your phasing and imputation. You can go back one version to the BEAGLE 4.0 algorithm, but we were able to harmonize these approaches in our SVS implementation with the following improvements:

  1. There are some updates in the model-building algorithm between Beagle 4.0 and 4.1. SVS uses the updated Beagle 4.1 model-building algorithm.
  2. A subtle update exists in the algorithm for finding the most likely haplotypes from all haplotypes that have been estimated.
  3. If imputation was requested, SVS will always use the Li and Stephens algorithm, as implemented in Beagle 4.1, for the actual imputation step.
  4. In SVS, multi-threading is used for sampling duos and trios, as well as for singles and individual samples, whereas in Beagle 4.0, multi-threading is only used for sampling singles and individual samples.
  5. SVS will produce a spreadsheet summarizing the Mendelian errors by sample. Beagle will produce a lengthy warning file listing every individual Mendelian error for each member of each duo or trio that has the error.
  6. SVS has a convenient interface!

If you are interested in learning more about imputation, please contact us at info@goldenhelix.com.

About James Grover

James Grover was hired as Golden Helix's third employee ten years ago and is still with the company today. He is a Senior Mathematician and Computer Engineer whose job is to enhance and maintain SVS by implementing state-of-the-art algorithms as software code, both in the core product and in scripts. James has a Master's in Applied Math from the California Institute of Technology. Prior to joining Golden Helix, James did software programming for a scientific contracting company and for a non-profit organization. In his free time, he enjoys singing and playing the clarinet.
