Back to Basics: Importing/Exporting Data in Imputation Program Data Formats with SVS

         December 31, 2013

In a recent blog post (Comparing BEAGLE, IMPUTE2, and Minimac Imputation Methods for Accuracy, Computation Time, and Memory Usage), Autumn Laughbaum compared three imputation programs. Data can be exported from, or imported into, SVS in the standard file formats for these and other imputation programs. This post reviews the tools available for exporting and importing data in the correct file formats. The expected workflow for analyzing imputed data is shown in Figure 1 below. Depending on whether you are running the imputation yourself, you may or may not need to perform the first three steps. Your data may also already be formatted correctly as input files for one of the imputation algorithms.
[Figure 1: Expected workflow for analyzing imputed data]

Exporting Data

BEAGLE Export Tools (BEAGLE/BEAGLECALL Scripts Package)
Export BEAGLE by Chromosome
BEAGLE input files can be created from a spreadsheet containing pedigree, phenotype, and genotype data. If a marker map is applied to the spreadsheet, one file is created per chromosome to optimize export time and memory usage. If a population stratum column is selected, separate files can be created for each unique stratum. For more information on how to run BEAGLE using the created files, see: BEAGLE Genetic Analysis Software Package.

Export BEAGLE Marker File by Chromosome
This tool creates the marker file that BEAGLE requires when multiple datasets are used for imputation. To export the marker map data, the spreadsheet must be marker mapped and contain all of the samples that will be used for imputation, namely the reference data and all of the study data.

A Note about BEAGLECALL
On the webpage that lists scripts for exporting/importing data in BEAGLE formats are two scripts for BEAGLECALL. BEAGLECALL is not an imputation software package, but it is an alternative to Birdsuite, CRLMM, or BRLMM for genotype calling or recalling. For more information on BEAGLECALL see BEAGLECALL Genetic Analysis Software Package. For a tutorial on how to import or export data in BEAGLECALL formats see: Recalling Genotypes with BEAGLECALL Tutorial.

Export Impute2 Genotype Probabilities
Impute2 requires a genotype probability file per chromosome and a sample file as input. These files can also be used in most of the Wellcome Trust software programs, including CHIAMO, HAPGEN, IMPUTE, SNPTEST and GTOOL (see Impute2 File Format for more information on this format and these programs). The genotype probability file can take two names for each marker: a SNP ID and an RS ID. If there is only one identifier for the markers, the same identifier can be chosen for both fields. After the identifiers are selected from the marker map fields or column headers, the genotypes are converted into probabilities (one probability for each genotype AA, AB, BB), with missing values getting a value of 0.333 for each genotype call. In the sample file, the row labels and row numbers are used as the two sample identifiers, and the missing value rate is also included for each sample. The script and the documentation can be obtained from Export Impute2 Genotype Probabilities.

Note, other programs refer to this format as the “Oxford” format.
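
To make the conversion described above concrete, here is a minimal Python sketch of how hard genotype calls could be turned into one line of a genotype probability file. It is not the SVS export script. The function names, the "A_G" genotype text encoding, and the column order (SNP ID, RS ID, position, allele A, allele B, then three probabilities per sample) are assumptions made for illustration; check the Impute2 File Format page before relying on the layout.

```python
from typing import Optional, Sequence, Tuple

# Value the post says is written for each genotype when the call is missing.
MISSING_PROB = 0.333


def genotype_to_probs(
    genotype: Optional[str], allele_a: str, allele_b: str
) -> Tuple[float, float, float]:
    """Return (P(AA), P(AB), P(BB)) for one hard call such as 'A_G'.

    The underscore-separated encoding is an assumption for this sketch.
    Anything that is not a clean two-allele call is treated as missing.
    """
    missing = (MISSING_PROB, MISSING_PROB, MISSING_PROB)
    if not genotype:
        return missing
    alleles = genotype.split("_")
    count_a = sum(1 for a in alleles if a == allele_a)
    count_b = sum(1 for a in alleles if a == allele_b)
    if len(alleles) != 2 or count_a + count_b != 2:
        return missing
    probs = [0.0, 0.0, 0.0]
    probs[count_b] = 1.0  # 0 copies of B -> AA, 1 -> AB, 2 -> BB
    return (probs[0], probs[1], probs[2])


def gen_line(
    snp_id: str,
    rs_id: str,
    position: int,
    allele_a: str,
    allele_b: str,
    genotypes: Sequence[Optional[str]],
) -> str:
    """Build one space-delimited marker line (assumed column order)."""
    fields = [snp_id, rs_id, str(position), allele_a, allele_b]
    for genotype in genotypes:
        fields.extend(f"{p:g}" for p in genotype_to_probs(genotype, allele_a, allele_b))
    return " ".join(fields)
```

For example, `gen_line("snp1", "rs1", 1000, "A", "G", ["A_A", "A_G", "G_G", None])` gives one marker line with four samples, the last of which is missing and therefore gets 0.333 for all three genotypes.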

Export MACH PED_DAT Files
Both MaCH and Merlin take .ped and .dat files as inputs. This tool takes a pedigree spreadsheet from SVS and exports it in the format expected by MaCH and Merlin. If a marker map is applied to the spreadsheet, the data can be exported into one file per chromosome. This is recommended, as MaCH can crash if run on a dataset with too many markers. The script and the documentation can be obtained from Export MACH PED_DAT Files.

Importing Data

Import Impute2 GWAS Files
Impute2 output files consist of one or more genotype probability files (usually one per chromosome) and a sample file. These are text files with extensions of *.gen and *.sample. Both the genotype calls and the dosage information can be imported using SVS’s Import Impute2 GWAS Files tool. This tool is shipped with the software and found in the Import menu. The file specifications can be found at: Impute2 File Format. Information about all of the import options can be found at: Importing Data: Impute2 GWAS Files.

BEAGLE Import Tools (BEAGLE/BEAGLECALL Scripts Package)
BEAGLE output files consist of genotypes, genotype probabilities, and allelic R2 values. These files can be imported into SVS as genotypes, allelic dosages, or allelic R2 values. Documentation on BEAGLE can be found at BEAGLE Genetic Analysis Software Package. A tutorial on exporting and importing BEAGLE and BEAGLECALL data to and from SVS can be found at Recalling Genotypes with BEAGLECALL Tutorial. More information on the specific import scripts is given below. The BEAGLE import tools do not create a marker map for the imported datasets. You must either create and convert a marker map to apply to the unmapped datasets, or apply an existing marker map to the datasets after import.

Import BEAGLE Allelic R2 Files
BEAGLE produces an allelic R2 file whenever a genotype probability (gprobs) file is created. The allelic R2 values range from 0 to 1 and can be used to assess the accuracy of the genotype imputation. The use of this score was nicely demonstrated in a recent GWAS webcast: Back to Basics: Using GWAS to Drive Discovery for Complex Diseases.
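
As an illustration of how the allelic R2 score might be used for quality filtering outside of SVS, the sketch below reads marker/R2 pairs and keeps the markers at or above a cutoff. It assumes a whitespace-delimited file with the marker identifier in the first column and the R2 value in the second. The function names and the 0.3 default cutoff are hypothetical choices for this example, not recommendations from BEAGLE or SVS.

```python
from typing import Dict, Iterable, List


def parse_r2_lines(lines: Iterable[str]) -> Dict[str, float]:
    """Map marker ID -> allelic R2 from 'marker value' lines (assumed layout)."""
    r2_by_marker: Dict[str, float] = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        try:
            r2_by_marker[parts[0]] = float(parts[1])
        except ValueError:
            continue  # skip lines whose second column is not numeric
    return r2_by_marker


def well_imputed(r2_by_marker: Dict[str, float], min_r2: float = 0.3) -> List[str]:
    """Return marker IDs whose R2 is at least min_r2 (hypothetical cutoff).

    NaN values compare as False and are therefore dropped.
    """
    return [marker for marker, r2 in r2_by_marker.items() if r2 >= min_r2]
```

In practice the lines would come from an open file handle, for example `parse_r2_lines(open(path))`, where `path` points at the R2 file for one chromosome.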

Import BEAGLE Files
Imputed genotype calls can be imported into SVS from either gzipped or extracted phased, unphased, or bgl files. This tool allows for importing more than one file at a time. This assumes that the samples are the same between all of the files, but the markers are different. An example would be having one BEAGLE file per chromosome.

Import BEAGLE gprobs Files as Allelic Dosage
Allelic dosage can be imported as well and analyzed directly. The genotype probabilities are converted to allelic dosage using the following formula: (0*AA + 1*AB + 2*BB). In addition, A and B alleles for each marker are available in a separate import spreadsheet.
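
The dosage formula above is simple enough to sketch in a few lines of Python. This is only an illustration of the arithmetic, not the SVS import script. The function names are made up for this example, and the input is assumed to be a flat list of probabilities with three values (AA, AB, BB) per sample.

```python
from typing import List, Sequence


def allelic_dosage(p_aa: float, p_ab: float, p_bb: float) -> float:
    """Expected number of B alleles: 0*AA + 1*AB + 2*BB."""
    return 0.0 * p_aa + 1.0 * p_ab + 2.0 * p_bb


def dosages_from_probs(probs: Sequence[float]) -> List[float]:
    """Convert a flat run of (AA, AB, BB) triples into one dosage per sample."""
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")
    return [
        allelic_dosage(probs[i], probs[i + 1], probs[i + 2])
        for i in range(0, len(probs), 3)
    ]
```

A sample that is certain to be AA gets a dosage of 0, a certain BB gets 2, and an uncertain call lands somewhere in between, which is why dosages can carry more information into an association test than hard genotype calls.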

MACH Output
MaCH output can consist of three files: mldose, mlgeno, and mlinfo. For more information about MaCH as an imputation tool, see MaCH FAQ. The mlinfo file is required, as it contains information about the markers and the imputation quality. Optionally, the mlgeno (imputed genotype calls) and mldose (allele dosage values ranging from 0 to 2) files can be imported and used for analysis. These files do not contain enough information to generate a marker map on the fly, but a marker map can be selected from existing maps and applied to the data on import. See Importing Data: MACH Output for more information on MaCH data import.

Import Minimac Output
Minimac is a variant of MaCH that expects the data to be pre-phased in order to optimize the genotype imputation algorithm. The output from Minimac is expected to be in info and dose files (these can be stored in one file per chromosome). As with MaCH data import, the info file is required and the dose file is optional. After import into SVS, the dosages can be analyzed directly or converted to genotypes for analysis. The script and the documentation can be obtained from Import Minimac Output.
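
For readers wondering what converting dosages to genotypes can look like, here is one simple possibility: assign each dosage to the nearest whole number of allele copies. This nearest-copy-number rule and the function name are assumptions for illustration only. The post does not say which rule SVS applies.

```python
def dosage_to_copy_number(dosage: float) -> int:
    """Map a dosage in [0, 2] to the nearest allele count (0, 1, or 2).

    Explicit cutoffs at 0.5 and 1.5 are used instead of round(), because
    Python's round() sends exact halves to the nearest even integer.
    """
    if not 0.0 <= dosage <= 2.0:
        raise ValueError(f"dosage out of range: {dosage}")
    if dosage < 0.5:
        return 0
    if dosage < 1.5:
        return 1
    return 2
```

A hard cutoff like this discards the uncertainty that the dosage encodes, which is one reason to consider analyzing the dosages directly.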

Import Concatenated Genotype String Format
This import tool was written for a customer working with bovine data. We later learned that this format is consistent with data created by FindHap, a haplotype and imputation program that is commonly used by agri-genomics researchers but can be used on human data as well. These files contain one row per sample, with all of the genotypes numerically encoded in a single field without a delimiter. The import tool tries to pick the most likely field for the genotype string, but the user can override the selection to pick the correct fields for both the sample names and the genotype string. The import script and the documentation can be found at: Import Concatenated Genotype String File.
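
To show what reading this kind of file involves, here is a rough Python sketch that splits one row into a sample name and a list of per-marker codes. It is not the SVS import script. It assumes whitespace-separated fields and one digit per marker, and its "longest all-digit field" guess is a hypothetical stand-in for however the real tool picks the most likely genotype field.

```python
from typing import List, Optional, Sequence, Tuple


def guess_genotype_field(fields: Sequence[str]) -> int:
    """Guess the genotype column: the longest field made only of ASCII digits."""
    candidates = [
        i for i, field in enumerate(fields) if field.isascii() and field.isdigit()
    ]
    if not candidates:
        raise ValueError("no all-digit field found in row")
    return max(candidates, key=lambda i: len(fields[i]))


def parse_row(
    line: str, sample_field: int = 0, genotype_field: Optional[int] = None
) -> Tuple[str, List[int]]:
    """Return (sample name, per-marker codes) for one row.

    Pass genotype_field explicitly to override the guess, mirroring the
    user override described above.
    """
    fields = line.split()
    if genotype_field is None:
        genotype_field = guess_genotype_field(fields)
    codes = [int(ch) for ch in fields[genotype_field]]
    return fields[sample_field], codes
```

Calling `parse_row("COW001 12 0120251")` picks the third field as the genotype string because it is the longest run of digits. What each numeric code means depends on the program that produced the file, so that mapping is deliberately left out here.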

Note About Other Imputation Programs
FindHap and FImpute are two other imputation programs for which SVS has tools. The data generated by these programs can be imported into SVS using Import Concatenated Genotype String Format. Each has its own tool for exporting data from SVS in the proper format. As these export tools are a bit more specialized, they are not currently available on our website, but they are available on request.

What About My XYZ Imputation Data Files???

Hopefully this has been a useful whirlwind tour of the tools available for exporting and importing data in the standard formats used by some of the major imputation analysis packages available today. If you have data in another format not covered here, please contact support@goldenhelix.com and we will help you get your data into (or out of) SVS for analysis.
