Merging VCFs in an Imperfect World

         September 7, 2021

Merging variant records (VCFs) across samples is important for trio or family analysis because it ensures that hereditary relationships can be properly inferred. However, a single variant can be represented in many ways: insertions and deletions may be right- or left-aligned, shared prefixes and suffixes may be added, and adjacent variants in the same sample may be combined or split at the variant caller’s discretion. Several techniques help reconcile these representations and ensure that a variant transmitted to a child is correctly merged with the same variant in the parent.

In VarSeq, these techniques fall into three categories, which can be configured on the last page of the import wizard:

  • Left Align – Left-aligning variants moves all indels to their leftmost representation. This can make a big difference when variants are called in repeat regions. This is standard practice for Golden Helix annotations, and selecting this option on import will ensure the imported variants are correctly paired with annotation variants.
  • Allelic Primitives – This option splits multi-nucleotide variants into their SNP and indel components. These simpler forms are often more likely to appear in annotation sources.
  • Multi-Allelic Splitting – This option breaks apart variants where the sample has multiple alternate alleles into sets based on the affected samples. Splitting apart the alleles allows you to annotate and focus analysis on the effects of each allele individually.
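To make the first and third options concrete, here is a minimal Python sketch of left alignment and a simple per-allele split. It is not VarSeq's implementation: the function names, the toy reference string, and the example alleles are all hypothetical, and the left-alignment loop follows the commonly used "trim the shared last base, then extend one base to the left" normalization procedure. Unlike the VarSeq option described above, this sketch splits purely by alternate allele and does not consider which samples are affected.

```python
def left_align(pos, ref, alt, genome):
    """Left-align and trim one biallelic variant.

    pos is 1-based; genome is a plain string standing in for the
    reference sequence (hypothetical; real tools read an indexed FASTA).
    Assumes the variant does not shift all the way to position 1 with an
    empty allele, which this sketch does not handle.
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pull in one reference base on the left.
        if (not ref or not alt) and pos > 1:
            pos -= 1
            base = genome[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def split_and_normalize(pos, ref, alts, genome):
    """Split a multi-allelic record into one left-aligned record per allele."""
    return [left_align(pos, ref, alt, genome) for alt in alts]


# Toy reference with a CA repeat: G CA CA CA T (positions 1-8).
genome = "GCACACAT"

# A right-aligned record with a CA deletion and a CA insertion.
records = split_and_normalize(5, "ACA", ["A", "ACACA"], genome)
print(records)  # [(1, 'GCA', 'G'), (1, 'G', 'GCA')]
```

Both alleles slide to the start of the repeat, which is why left alignment matters most in repeat regions: callers that report the same deletion at different offsets within the repeat end up with identical records after normalization.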

For more information about these options, refer to the VarSeq manual.

The example below illustrates the interplay between these features: the father has a multi-allelic variant, and the child inherits one of those alleles.

The raw father and child VCFs in GenomeBrowse. The child variant has had the suffix trimmed, but it matches the upper allele in the father.

On import, the multi-allelic record from the father is broken into three records. The original record is preserved since the father is an affected individual, and the father has both of the alleles in the record. The record is also split into two primitives, one for each allele. The deletion in the father matches the deletion in the child, and these two records are merged together by adding the common suffix to the deletion in the child. The final step of the merge process fills missing data from other records at the same site. Since these variants have the same reference sequence, the values can be filled across them.
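The suffix-matching step can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout, the function names, and the example alleles are hypothetical (they are not read from the figure), and VarSeq's actual merge logic is not shown in this post.

```python
def add_common_suffix(record, target_ref):
    """Pad a (pos, ref, alt) record so its ref equals target_ref.

    Returns None when the record's ref is not a prefix of target_ref,
    meaning the two records cannot be reconciled by adding a suffix.
    """
    pos, ref, alt = record
    if not target_ref.startswith(ref):
        return None
    suffix = target_ref[len(ref):]
    return (pos, ref + suffix, alt + suffix)


def merge_child_into_parent(parent, child):
    """Return the child's record rewritten in the parent's representation,
    or None if it does not match any of the parent's alternate alleles.

    parent is (pos, ref, [alts]); child is (pos, ref, alt).
    """
    p_pos, p_ref, p_alts = parent
    if child[0] != p_pos:
        return None
    padded = add_common_suffix(child, p_ref)
    if padded is None or padded[2] not in p_alts:
        return None
    return padded


# Hypothetical alleles: the father carries a TT deletion and a T insertion.
father = (100, "ATTT", ["AT", "ATTTT"])
# The child's caller reported the same deletion with the trailing T trimmed.
child = (100, "ATT", "A")

print(merge_child_into_parent(father, child))  # (100, 'ATTT', 'AT')
```

Padding the shorter record rather than trimming the longer one keeps every record at the site on the same reference sequence, which is the property the fill step relies on.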

Variant with sample values filled across records imported in VarSeq
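As a rough sketch of the fill step, the Python below copies a sample's genotype into every record at the same site (same position and reference) that contains at least one of that sample's alternate alleles. The dictionary layout, the function name, and the example genotypes are hypothetical and chosen only for illustration; they are not VarSeq's data model.

```python
from collections import defaultdict


def fill_across_records(records):
    """Fill missing sample genotypes across records sharing pos and ref.

    Each record is a dict with "pos", "ref", "alts" (list of alleles) and
    "samples" (sample name -> tuple of allele sequences).
    """
    by_site = defaultdict(list)
    for rec in records:
        by_site[(rec["pos"], rec["ref"])].append(rec)

    for site_records in by_site.values():
        # Collect the genotypes actually observed at this site first,
        # so that filled values never trigger further fills.
        observed = {}
        for rec in site_records:
            for sample, genotype in rec["samples"].items():
                observed.setdefault(sample, genotype)
        for rec in site_records:
            for sample, genotype in observed.items():
                carries_allele = set(genotype) & set(rec["alts"])
                if sample not in rec["samples"] and carries_allele:
                    rec["samples"][sample] = genotype
    return records


father_gt = ("AT", "ATTTT")   # deletion / insertion
child_gt = ("ATTT", "AT")     # reference / deletion

records = [
    # Original multi-allelic record, kept for the father.
    {"pos": 100, "ref": "ATTT", "alts": ["AT", "ATTTT"],
     "samples": {"Father": father_gt}},
    # Deletion primitive, already merged with the child's record.
    {"pos": 100, "ref": "ATTT", "alts": ["AT"],
     "samples": {"Father": father_gt, "Child": child_gt}},
    # Insertion primitive.
    {"pos": 100, "ref": "ATTT", "alts": ["ATTTT"],
     "samples": {"Father": father_gt}},
]

fill_across_records(records)
print(sorted(records[0]["samples"]))  # ['Child', 'Father']
print(sorted(records[2]["samples"]))  # ['Father']
```

In this toy data the child's genotype is copied into the multi-allelic record because it carries the deletion, but not into the insertion-only record, since the child has no allele there.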

For each record where one of the child alleles is present, the data for that allele is filled in. The same process is applied for the father and mother. This allows the fields to be filled across the different representations and ensures the sample data is present if any of these records are evaluated further in the analysis. Without this step, it would be more difficult for downstream analysis to filter for and find transmitted variants like this one.

Hopefully, this helps illustrate some of the difficulties in comparing and merging variant calls, and the important steps that VarSeq takes to normalize and merge the data across different variant representations. If you have any questions, please reach out to us at support@goldenhelix.com.
