By Popular Request: Our BEAGLE Algorithm Gains Support for Family Structure

         April 13, 2017

Earlier this year we released our own optimized and integrated BEAGLE implementation for SVS based on the BEAGLE 4.1 and optionally 4.0 algorithms.

One of the most commonly requested features since that release has been to extend the implementation so that it takes the parent-offspring relationships between samples into account, improving the accuracy of the haplotype phasing.  With this information, both the creation of phased reference panels and the imputation of samples against those reference panels become more accurate, and the resulting genotypes contain no Mendelian errors.

In this post, we review how phasing works in the two BEAGLE algorithm variants (4.1 and 4.0), and how we combined the original “Java BEAGLE” with novel techniques to bring pedigree-structure-aware phasing to the latest SVS phasing and imputation algorithms.

Phasing in BEAGLE

In Beagle, and in the SVS implementation of the Beagle algorithms, one of the steps (and the only step when creating a reference panel) is phasing the data, that is, determining the haplotype pairs in the data.

Each iteration of the Beagle phasing algorithm first creates a directed acyclic graph (DAG) encapsulating the most likely haplotypes as a localized haplotype-cluster model. This model, which is based on the haplotypes estimated by the previous iteration, is “leveled”–that is, its edges may be subdivided into sets, where each set corresponds with one marker, and its nodes may be subdivided into sets, where each node set may be thought of as lying between two markers. The entire process starts with a model based on what the haplotypes would be if there were perfect linkage equilibrium between the markers.
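To make the starting point concrete: under perfect linkage equilibrium the markers are independent, so the probability of a haplotype is simply the product of its per-marker allele frequencies. The short Python sketch below illustrates only that idea. The allele frequencies and the function name `haplotype_prob` are made up for illustration and are not part of Beagle or SVS.

```python
from itertools import product
from math import prod

# Hypothetical frequencies of allele 1 at three biallelic markers.
alt_freq = [0.2, 0.5, 0.1]


def haplotype_prob(hap, alt_freq):
    """Probability of a haplotype (tuple of 0/1 alleles) when markers are independent."""
    return prod(f if allele == 1 else 1.0 - f for allele, f in zip(hap, alt_freq))


# Every possible haplotype over the three markers, with its probability.
probs = {
    hap: haplotype_prob(hap, alt_freq)
    for hap in product((0, 1), repeat=len(alt_freq))
}
```

The eight probabilities sum to one, and no haplotype is favored by its neighbors. Later iterations replace this flat starting model with one built from the haplotypes sampled in the previous iteration.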

The next step in each iteration is to use the localized haplotype-cluster model as part of a Hidden Markov Model (HMM), and to use a forwards-backwards algorithm to sample phased haplotypes according to their probabilities as determined by the HMM and conditional on the genotypic data. The actual direction of “forwards” and “backwards” over the genome is switched between iterations.

After the last iteration, the algorithm selects the most likely haplotypes from among those that have been estimated.

Finally, in Beagle 4.1, this is followed by a modified HMM algorithm that takes advantage of identity-by-state (IBS) segments that exist in the data.

As a by-product, isolated missing genotypes in this data are imputed.

Phasing Using Individual Samples

Normally, and always when the Beagle 4.1 algorithm is used, the Hidden Markov Model that is used is a diploid HMM–that is, an HMM created from ordered pairs of edges at each level of the model. Also, normally the sampling is based on the genotypic data of individual samples. The ordered pairs of edges correspond to pairs of allele values for possible genotypes for the marker at that level. For each sample, the “forward algorithm” computes the probabilities of each “state” (ordered pair of edges) at each level given the sample’s genotypes, and the (“backwards”) sampling randomly selects states according to their probabilities. The sampled path of hidden states corresponds to an ordered pair of haplotypes that are consistent with the individual’s genotype.
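The forward pass and the backward sampling can be sketched on a toy model. This is a minimal Python illustration, not the SVS or Beagle code: the three-marker model, its probabilities, and all names (`ALLELES`, `INIT`, `TRANS`, `sample_haplotype_pair`) are hypothetical, and a plain transition table stands in for the real localized haplotype-cluster DAG. Each hidden state is an ordered pair of edges, and a state is allowed only if its two alleles match the sample's unordered genotype.

```python
import random

# Hypothetical toy model: 3 markers, 2 edges per level, edge index == allele carried.
# INIT[i] is P(edge i at the first marker); TRANS[m][i][j] is
# P(edge j at marker m+1 | edge i at marker m). All numbers are made up.
ALLELES = [[0, 1], [0, 1], [0, 1]]
INIT = [0.6, 0.4]
TRANS = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
]


def emission(state, level, genotype):
    """1.0 if the ordered edge pair matches the unordered genotype (None = missing)."""
    if genotype is None:
        return 1.0
    pair = sorted(ALLELES[level][edge] for edge in state)
    return 1.0 if pair == sorted(genotype) else 0.0


def pair_transition(level, prev, cur):
    """In this toy model the two haplotypes move through the model independently."""
    return TRANS[level][prev[0]][cur[0]] * TRANS[level][prev[1]][cur[1]]


def sample_haplotype_pair(genotypes, rng):
    n = len(genotypes)
    states = [(i, j) for i in range(2) for j in range(2)]
    # Forward pass: fwd[m][s] is proportional to P(genotypes[0..m], state s at marker m).
    fwd = [{s: INIT[s[0]] * INIT[s[1]] * emission(s, 0, genotypes[0]) for s in states}]
    for m in range(1, n):
        fwd.append({
            s: emission(s, m, genotypes[m])
            * sum(p * pair_transition(m - 1, r, s) for r, p in fwd[m - 1].items())
            for s in states
        })
    # Backward sampling: draw the last state, then each earlier state given its successor.
    path = [rng.choices(states, weights=[fwd[-1][s] for s in states])[0]]
    for m in range(n - 2, -1, -1):
        nxt = path[-1]
        weights = [fwd[m][r] * pair_transition(m, r, nxt) for r in states]
        path.append(rng.choices(states, weights=weights)[0])
    path.reverse()
    hap1 = [ALLELES[m][s[0]] for m, s in enumerate(path)]
    hap2 = [ALLELES[m][s[1]] for m, s in enumerate(path)]
    return hap1, hap2


genotypes = [(0, 1), (0, 1), (1, 1)]
hap1, hap2 = sample_haplotype_pair(genotypes, random.Random(0))
```

Whatever path is drawn, the two returned haplotypes always add up to the input genotypes. The model probabilities only decide which of the consistent phasings is sampled more often.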

Of course, this process ignores any information that may come from the genotypes of one or both parents that could inform what the most likely haplotypes might be for the offspring, even if that information is available.

The Beagle 4.1 algorithm cannot be made pedigree aware, but when the optional Beagle 4.0 algorithm is used, the phasing strategy can be expanded to use pedigree information.

Phasing Using the BEAGLE 4.0 Pedigree Algorithm

In SVS, if the Beagle 4.0 algorithm is selected on a pedigree spreadsheet, a new option called “Use Pedigree Information” is available. When selected, the pedigree information will be used during the phasing of the samples in the spreadsheet.

Before this process begins, the input data is scanned for Mendelian errors. Wherever a Mendelian error is found, both the offspring genotype and the parent genotypes are set to “missing”, so that the algorithm is free to consider other possibilities for the true genotypes.
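As a rough illustration of this pre-scan, the Python sketch below checks each offspring genotype against whichever parents are available and masks the whole family at any marker that fails. It assumes a simple data layout (genotypes as allele pairs, `None` for missing, a `{child: (father, mother)}` pedigree map), and the function names are hypothetical. How SVS actually stores the data and counts errors is not described in this post.

```python
def is_mendelian_consistent(child, father=None, mother=None):
    """Genotypes are allele pairs such as (0, 1); None means missing or unavailable."""
    if child is None:
        return True

    def can_give(parent, allele):
        # An unknown parent could have passed on any allele.
        return parent is None or allele in parent

    a, b = child
    return (can_give(father, a) and can_give(mother, b)) or (
        can_give(father, b) and can_give(mother, a)
    )


def mask_mendelian_errors(genotypes, pedigree):
    """genotypes: {sample: [genotype per marker]}; pedigree: {child: (father, mother)}.

    Returns a masked copy plus, per sample, the number of error markers it was involved in.
    Assumes every child listed in the pedigree has genotype data.
    """
    masked = {s: list(g) for s, g in genotypes.items()}
    errors = {s: 0 for s in genotypes}
    for child, (father, mother) in pedigree.items():
        family = [s for s in (child, father, mother) if s in genotypes]
        for m, child_gt in enumerate(genotypes[child]):
            f_gt = genotypes[father][m] if father in genotypes else None
            m_gt = genotypes[mother][m] if mother in genotypes else None
            if not is_mendelian_consistent(child_gt, f_gt, m_gt):
                for s in family:
                    masked[s][m] = None  # set offspring and parents to missing
                    errors[s] += 1
    return masked, errors


genotypes = {
    "dad": [(0, 0), (0, 1)],
    "mom": [(0, 0), (1, 1)],
    "kid": [(1, 1), (0, 1)],  # first marker cannot come from two 0/0 parents
}
masked, errors = mask_mendelian_errors(genotypes, {"kid": ("dad", "mom")})
```

In this example only the first marker is masked for all three family members, and the per-sample counts resemble the kind of by-sample summary described later in this post.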

For each iteration, after the localized haplotype-cluster model is created, the Beagle 4.0 pedigree algorithm uses a unique and specialized HMM for each of the following family-aware scenarios:

  1. For complete parent-offspring trios (father/mother/offspring),
  2. For one-parent-offspring duos (one parent for which data is available/offspring), and
  3. For singles (individual samples for which parent data is not available).

(A sample can be a parent in one or more duos and/or trios while being the offspring of another duo or trio.)
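The grouping into trios, duos, and singles can be sketched as follows. This is a hedged Python illustration with a hypothetical `classify_families` function and made-up sample IDs. It additionally assumes that a sample already covered as a parent in a duo or trio is not also listed as a single, which this post does not spell out.

```python
def classify_families(samples, pedigree):
    """samples: IDs with genotype data; pedigree: {child: (father, mother)}, None if unknown."""
    trios, duos = [], []
    covered = set()
    for child in sorted(samples):
        father, mother = pedigree.get(child, (None, None))
        # Only parents that actually have data in the spreadsheet count.
        parents = [p for p in (father, mother) if p in samples]
        if len(parents) == 2:
            trios.append((father, mother, child))
        elif len(parents) == 1:
            duos.append((parents[0], child))
        else:
            continue
        covered.update(parents)
        covered.add(child)
    # Assumption: singles are the samples that appear in no duo or trio at all.
    singles = [s for s in sorted(samples) if s not in covered]
    return trios, duos, singles


samples = {"gpa", "dad", "mom", "kid", "loner"}
pedigree = {
    "kid": ("dad", "mom"),
    "dad": ("gpa", "gma"),  # "gma" has no data, so this family is a duo
}
trios, duos, singles = classify_families(samples, pedigree)
```

Here "dad" is the offspring in the duo with "gpa" and also a parent in the trio with "kid", matching the note above about samples that play both roles.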

When this process is finished, the estimated haplotype pairs are pooled together for the next iteration or for the final haplotype determination.

For singles, a diploid HMM is created from ordered pairs of edges at each level, and the forward-backward algorithm is used with these individual samples, as described above.

For duos, an HMM based on ordered triples of edges at each level is created, and sampling takes place for the parent-offspring duos, treating each duo as a unit. The principle here is that one haplotype will have been passed from parent to offspring, so that the ordered triples correspond to:

  1. The one haplotype that was passed from parent to offspring,
  2. The other haplotype that the parent did not pass to the offspring, and
  3. The other haplotype of the offspring that came from the other parent.

When finished, two haplotype pairs are created for each duo that share the one haplotype that was passed from parent to offspring.
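To see why the shared haplotype helps, consider a single marker. The Python sketch below enumerates the ordered triples that are consistent with both genotypes of a duo. It works on alleles rather than model edges, and the function name `duo_states` is hypothetical, so it illustrates the constraint rather than the actual HMM state space.

```python
from itertools import product


def duo_states(parent_gt, child_gt, alleles=(0, 1)):
    """Ordered triples (transmitted, parent_untransmitted, child_other) at one marker.

    Genotypes are unordered allele pairs; None means missing.
    """
    states = []
    for transmitted, parent_other, child_other in product(alleles, repeat=3):
        parent_ok = parent_gt is None or sorted((transmitted, parent_other)) == sorted(parent_gt)
        child_ok = child_gt is None or sorted((transmitted, child_other)) == sorted(child_gt)
        if parent_ok and child_ok:
            states.append((transmitted, parent_other, child_other))
    return states


both_het = duo_states((0, 1), (0, 1))   # two triples remain possible
resolved = duo_states((0, 1), (1, 1))   # only one triple fits
impossible = duo_states((0, 0), (1, 1)) # Mendelian error: no triple fits
```

When the offspring is homozygous, the transmitted allele is fixed and the parent's phase at that marker is determined. When both are heterozygous, two triples remain and the haplotype model has to weigh them.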

For trios, an HMM based on ordered quartets of edges at each level is created, and sampling takes place for the father/mother/offspring trios, treating each trio as a unit. Here, the ordered quartets correspond to:

  1. The haplotype passed from father to offspring,
  2. The father’s haplotype that is not passed to the offspring,
  3. The haplotype passed from mother to offspring, and
  4. The mother’s haplotype that is not passed to the offspring.

When finished, three haplotype pairs are created for each trio, with the offspring haplotype pair sharing haplotypes from the two parents.
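The same single-marker view shows how a trio's quartet maps onto three haplotype pairs. As before, this Python sketch uses alleles instead of model edges, and the names `trio_states` and `haplotype_pairs` are hypothetical, so treat it as an illustration of the constraint only.

```python
from itertools import product


def trio_states(father_gt, mother_gt, child_gt, alleles=(0, 1)):
    """Ordered quartets (father_transmitted, father_other, mother_transmitted, mother_other)."""
    def matches(pair, genotype):
        return genotype is None or sorted(pair) == sorted(genotype)

    return [
        (ft, fo, mt, mo)
        for ft, fo, mt, mo in product(alleles, repeat=4)
        if matches((ft, fo), father_gt)
        and matches((mt, mo), mother_gt)
        and matches((ft, mt), child_gt)  # offspring carries both transmitted alleles
    ]


def haplotype_pairs(state):
    """Split one quartet into the father, mother, and offspring allele pairs."""
    ft, fo, mt, mo = state
    return {"father": (ft, fo), "mother": (mt, mo), "offspring": (ft, mt)}


resolved = trio_states((0, 1), (0, 0), (0, 1))  # mother must give 0, so father gave 1
all_het = trio_states((0, 1), (0, 1), (0, 1))   # still ambiguous at this marker
pairs = haplotype_pairs(resolved[0])
```

Only the all-heterozygous case stays ambiguous for a biallelic marker, which is why complete trios constrain phasing so strongly.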

It would be possible to create all three HMM models from the same localized haplotype-cluster model. However, the Beagle 4.0 algorithm has been fine-tuned to use one model for singles or individual samples and one different model for duos and trios, where the only difference between these models is one parameter, “modelscale”, used for their creation. This parameter is set so that for singles (0.8), haplotype clusters must be more alike before they will be merged than they need to be for duos or trios (1.0).

Additional Outputs with Pedigree Information

When you take advantage of the PED-file-like data in your SVS spreadsheet, you will receive several benefits and additional outputs:

  1. A better estimation of the haplotypes of both parent samples and offspring samples will be performed, leading to both better phasing and, when imputation is requested, better imputation.
  2. Mendelian errors will be eliminated from the output.
  3. SVS will produce a spreadsheet summarizing the Mendelian errors by sample.

How this Compares To The Open Source Java BEAGLE

The latest BEAGLE algorithm (BEAGLE 4.1) dropped support for using PED files to provide pedigree structure to your phasing and imputation. You can go back one version to the BEAGLE 4.0 algorithm, but we were able to harmonize these approaches in our SVS implementation with the following improvements:

  1. There are some updates in the model-building algorithm between Beagle 4.0 and 4.1. SVS uses the updated Beagle 4.1 model-building algorithm.
  2. A subtle update exists in the algorithm for finding the most likely haplotypes from all haplotypes that have been estimated.
  3. If imputation was requested, SVS will always use the Li and Stephens algorithm, as implemented in Beagle 4.1, for the actual imputation step.
  4. In SVS, multi-threading is used for sampling duos and trios, as well as for singles and individual samples, whereas in Beagle 4.0, multi-threading is only used for sampling singles and individual samples.
  5. SVS will produce a spreadsheet summarizing the Mendelian errors by sample. Beagle will produce a lengthy warning file going into all of the individual Mendel errors for each member of each duo or trio having the error.
  6. SVS has a convenient interface!

If you are interested in learning more about imputation, please contact us at info@goldenhelix.com.

About Darby Kammeraad

Darby Kammeraad is the Director of Field Application Services at Golden Helix, joining the team in April of 2017. Darby graduated in 2016 with a master’s degree in Plant Sciences from Montana State University, where he also received his bachelor’s degree in Plant Biotechnology. Darby works on customer support and training. When not in the office, Darby is learning how to play guitar, hunting, fishing, snowboarding, traveling or working on a new recipe in the kitchen.

6 thoughts on “By Popular Request: Our BEAGLE Algorithm Gains Support for Family Structure”

  1. Mike Keehan

    Awesome improvements to Beagle and a nice explanation.

    Could you show how to get read backed phasing (say from whatshap or hapcut2) into beagle before LD phasing?
    I’ve always wanted to get pedigree, read-backed phasing and LD phasing integrated into our phased genotypes used as a reference for imputation.

    1. Gabe Rudy

      Thanks for the note Mike!

      The phasing step respects any phasing that already exists in your genotypes, so if you ran another phasing tool (which I assume whatshap or hapcut2 are) and got a set of phased genotypes, you could use those for running imputation or use those as a starting point for running the BEAGLE phasing algorithm.

  2. Glib

    Thanks for the post, Darby.

    I am testing different software for phasing duos and trios, however BEAGLE seems to impute crazily across my loci and individuals. Is there any way I could switch off this option? I tried to specify impute-its=0, however this did not work.

    My data are unusual families: mothers are normal sexual individuals, but fathers show complete meiotic drive of either one of the haplogenomes in some families, or clonally pass both haplogenomes simultaneously in others.

    For instance, when the two genomes are passed clonally without recombination in the paternal germline:
    Mother Father offspring_1 offspring_2 offspring_3 offspring_4 offspring_5 offspring_6 offspring_7 offspring_8
    0/0 0/1 0/1 0/1 0/1 0/1 0/1 0/0 0/0 0/0

    And I simply want to have:
    Mother Father offspring_1 offspring_2 offspring_3 offspring_4 offspring_5 offspring_6 offspring_7 offspring_8
    0|0 0|1 0|1 0|1 0|1 0|1 0|1 0|0 0|0 0|0

    where Maternal_allele | Paternal_allele

    Instead, I have high imputation, resulting in something like this:
    Mother Father offspring_1 offspring_2 offspring_3 offspring_4 offspring_5 offspring_6 offspring_7 offspring_8
    0|1 0|1 0|1 1|0 0|1 1|0 0|1 0|0 0|1 1|0

    I would also prefer to keep missing genotypes as they are ./. without any kind of imputation: my data are rigorously filtered for missingness already.

    I would appreciate any advice!

  3. MJSW

    Thanks for the post. I am wondering whether the pedigree algorithm takes into account family relationships of individuals included in the reference panel? For example the reference consists of parents and the offspring are being phased and imputed using this reference consisting of parents. Will these parent offspring relationships be ignored?

    1. Golden Helix

      The Beagle 4.0 algorithm uses the pedigree data during the phasing step for the samples in the spreadsheet. The pedigree can be based on a complete trio, one parent plus one offspring (duo), or a single sample with no parent. A sample can also be a parent to one sample while being the offspring of another duo/trio. You can create a reference panel on the family set, but again, the pedigree data is utilized specifically during the phasing.

