Running VSPipeline from a Docker Container

         November 10, 2021

As a lab or group scales the number of NGS samples analyzed, it is important to automate the sample analysis pipeline from the sequencer to the point where it is ready for a variant scientist or lab personnel to follow the interpretation workflow and draft a clinical report.

VSPipeline leverages the core VarSeq capability to create reproducible test-specific workflows through project templates. It automates the computationally expensive steps that prepare the NGS variant data for interpretation. This includes the work to import, merge, and normalize the VCF data, annotate it with specific versions of annotation sources and algorithms like the ACMG Sample Classifier, and optionally export the data back out as text, VCF, or Excel files.

Docker has become popular in the bioinformatics community for its ability to ensure that a given command line package is able to run on any Linux environment, especially highly automated and scripted environments. In this post, we will explore how to run VSPipeline using Docker containers.

Docker Fundamentals

If you have used Linux for a while, you may be familiar with the binary compatibility problem. Linux distributions define all the software, shared libraries, and runtime environment on top of the Linux kernel. While the Linux kernel is amazingly careful to provide a stable runtime environment, the Linux distributions vary greatly in their choices of startup initialization scripts, C runtime libraries, networking configuration, and core system libraries. The result: a Linux application with any reasonable set of dependencies cannot be simply copied to a given Linux distribution and executed.

In the best-case scenario, the right mix of “commonly available” libraries is present on multiple distributions, and a packager can carefully choose to bundle copies of the remaining libraries. This is the approach we have taken to support VarSeq and VSPipeline on Linux. At this time, we support Ubuntu 18.04, 20.04, and RHEL/CentOS 7 and 8 with one binary package distribution. But we still depend on per-platform system libraries that must be installed separately.

A Linux “Container” solves this problem by using a trick (originally a security feature) of the Linux kernel to run a process with a completely different “user space” environment. Everything above the Linux kernel can be swapped out, guaranteeing the program has all the dependencies it needs. The program runs in a “container” that includes its own full, minimal Linux distribution. And Docker makes running these as easy as typing a single command!

Along with removing the binary compatibility problem, Docker containers also standardize runtime configuration, networking, and security. This ability to reproduce complex Linux configurations, as you often see in production web application server environments, has made Docker extremely popular amongst developers as well as anybody who deploys to cloud-hosted or on-premises Linux servers. If you automate multiple Linux bioinformatics tools to execute an orchestrated workflow on generic Linux servers, you can imagine how useful it is to have access to complex programs that will run in any environment and have program-specific configuration baked in!

Getting Started with VSPipeline Docker

Let’s go through the process of using the VSPipeline Docker images to run VSPipeline.

As a prerequisite, I am assuming you are running on Linux or Mac and have installed Docker. This should work on Windows as well, but the path syntax for some commands will need to be changed. You can download Docker Desktop and follow along on any platform!

With docker installed and in our path, we first “pull” the image (caching it for future use) with the following command:

docker pull goldenhelix/vspipeline

Note: This will grab the “latest” version of VSPipeline. You can grab a specific version like this:

docker pull goldenhelix/vspipeline:2.2.3
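
For scripted runs, it can help to build the image reference in one place so that every command uses the same pinned version. The sketch below is one way to do that. The function name image_ref and the VSPIPELINE_VERSION variable are illustrative and are not part of VSPipeline or Docker.

```shell
# Hypothetical helper: print the image reference for a given VSPipeline version.
# Uses the first argument, then $VSPIPELINE_VERSION, then falls back to "latest".
image_ref() {
  local version="${1:-${VSPIPELINE_VERSION:-latest}}"
  printf 'goldenhelix/vspipeline:%s\n' "$version"
}

# Example (not executed here):
#   docker pull "$(image_ref 2.2.3)"
```

Pinning a version this way keeps a validated workflow from silently moving to a newer release when “latest” changes.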

Let’s see what running a command to print the current version of VSPipeline looks like using this pulled image, and then we will explain all the parameters of the command:

docker run --rm --net=host --user=$(id -u) \
  -v `pwd`/AppData:/appdata \
  -v `pwd`/RunData:/data \
  goldenhelix/vspipeline -c get_version

Which results in the output:

{
  "appName": "VarSeq",
  "version": "2.2.3",
  "platform": "Lin64",
  "release": "2021-04-21",
  "copyright": "Copyright (C) GoldenHelix 2021",
  "versionString": "VarSeq Version 2.2.3 Lin64 Released 2021-04-21 Copyright (C) GoldenHelix 2021"
}
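
In an automated pipeline you may want to confirm the reported version before processing samples. The following is a minimal sketch that assumes the output has the layout shown above, with the "version" key on its own line. The helper names parse_version and check_version are made up for this example.

```shell
# Hypothetical helper: read get_version JSON on stdin and print the "version" value.
# The pattern requires the exact key "version", so "versionString" is not matched.
parse_version() {
  sed -n 's/.*"version": *"\([^"]*\)".*/\1/p'
}

# Hypothetical check: succeed only if the reported version equals the expected one.
check_version() {
  local expected="$1" actual
  actual="$(parse_version)"
  if [ "$actual" != "$expected" ]; then
    echo "expected VSPipeline $expected but found '$actual'" >&2
    return 1
  fi
}

# Example (not executed here): pipe the get_version command above into the check:
#   docker run ... goldenhelix/vspipeline -c get_version | check_version 2.2.3
```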

The docker run command provides some per-run configuration in the form of various command arguments, and then the name of the docker image (goldenhelix/vspipeline) and any arguments you want to provide to the executable run by that image. Essentially, instead of running /path/to/vspipeline -c get_version we are running docker run <configuration> goldenhelix/vspipeline -c get_version.
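
Since the configuration portion stays the same from run to run, one option is to wrap it in a shell function so that the wrapper reads like the plain vspipeline executable. This is only a sketch: the function name and the VS_APPDATA, VS_RUNDATA, and DOCKER variables are assumptions made for illustration.

```shell
# Hypothetical wrapper: forwards its arguments to VSPipeline inside the container.
# VS_APPDATA and VS_RUNDATA default to the folders used in this post.
# DOCKER can be overridden (for example DOCKER=echo) to preview the command.
vspipeline() {
  local appdata="${VS_APPDATA:-$PWD/AppData}"
  local rundata="${VS_RUNDATA:-$PWD/RunData}"
  "${DOCKER:-docker}" run --rm --net=host --user="$(id -u)" \
    -v "$appdata:/appdata" \
    -v "$rundata:/data" \
    goldenhelix/vspipeline "$@"
}

# Example (not executed here):
#   vspipeline -c get_version
```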

The most important configuration choice needed to run the VSPipeline docker image is to provide access to two important folders:

  • Application Data (/appdata in the container): VarSeq and VSPipeline maintain a local cache of downloaded annotation files as well as user preferences like the currently logged in and activated user in this directory.
  • Input/Output Data (/data in the container): This will be the default directory for VSPipeline to read and write files based on the commands performed.

Because a container has its own file system (which does not save changes from one run to the next), we need to “mount” folders from the host operating system into the container to allow it to read and write files in a permanent fashion.

In this example, we are going to use a folder called “AppData” in the current directory for the application data, and “RunData” in the current directory for all input/output data files. The -v option in Docker maps a host directory to a directory in the container file system, in the form <host_dir>:<container_dir>. So, our mapping arguments (using `pwd` to expand to our current directory) look like:

-v `pwd`/AppData:/appdata -v `pwd`/RunData:/data

In a production environment, it is important that “AppData” is always the same directory, but “RunData” may be specific to the batch of samples currently being processed. So, for example, it may look something like:

-v /mnt/shared_drive/VSAppData:/appdata -v /mnt/shared_drive/batch001:/data
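
A per-batch run could be sketched as follows, with the application data path fixed and the batch directory passed in. The helper name run_batch and the DOCKER override are illustrative assumptions; the paths are the example paths from above.

```shell
# Fixed application-data directory, shared by every run.
VS_APPDATA="${VS_APPDATA:-/mnt/shared_drive/VSAppData}"

# Hypothetical helper: run VSPipeline against one batch directory.
# Usage: run_batch <batch_dir> <vspipeline arguments...>
run_batch() {
  if [ "$#" -lt 1 ]; then
    echo "usage: run_batch <batch_dir> [vspipeline args]" >&2
    return 2
  fi
  local batch_dir="$1"
  shift
  if [ ! -d "$batch_dir" ]; then
    echo "batch directory not found: $batch_dir" >&2
    return 1
  fi
  # DOCKER can be overridden (for example DOCKER=echo) to preview the command.
  "${DOCKER:-docker}" run --rm --net=host --user="$(id -u)" \
    -v "$VS_APPDATA:/appdata" \
    -v "$batch_dir:/data" \
    goldenhelix/vspipeline "$@"
}

# Example (not executed here):
#   run_batch /mnt/shared_drive/batch001 -c get_version
```

Checking that the batch directory exists first avoids Docker creating an empty host directory for a mistyped path.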

Finally, the other arguments we are passing define some specific runtime options:

  • --rm tells Docker to not leave the Linux container environment “running” after the command has finished. This is needed whenever running command line tools versus “servers” with Docker.
  • --net=host tells Docker to pass through networking to the host instead of the default of setting up a separate “bridged” network. Due to the way VarSeq is licensed, this is necessary to ensure each run of VSPipeline does not look like it is on a new machine.
  • --user=$(id -u) tells Docker to run the process as the current user, so the files created by VSPipeline are owned by the current user (instead of root). Note that on Mac or Windows you don’t include this option, as it’s Linux specific.
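
If the same script needs to run on both Linux and Mac, the --user option can be added conditionally. Here is a small sketch of that idea; user_flag is a made-up helper name, and the OS is detected with uname -s.

```shell
# Hypothetical helper: print the --user option only on Linux.
# The optional argument overrides the detected OS name (useful for testing).
user_flag() {
  local os="${1:-$(uname -s)}"
  if [ "$os" = "Linux" ]; then
    printf '%s\n' "--user=$(id -u)"
  fi
}

# Example (not executed here). $(user_flag) is left unquoted on purpose so that
# it expands to nothing at all on Mac:
#   docker run --rm --net=host $(user_flag) \
#     -v `pwd`/AppData:/appdata -v `pwd`/RunData:/data \
#     goldenhelix/vspipeline -c get_version
```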

Now that we have a configuration mapped out, we can start preparing VSPipeline for the automated workflow.

The first thing we will do is log in and activate our VSPipeline license:

docker run --rm --net=host --user=$(id -u) \
  -v `pwd`/AppData:/appdata \
  -v `pwd`/RunData:/data \
  goldenhelix/vspipeline \
  -c login $EMAIL $PASSWORD -c license_activate $LICENSE_KEY \
  -c license_verify

Note this will save the logged-in user and license activation state under AppData/VarSeq/User Data. You should see a “vslicense-XXXX.txt” file there after a successful activation, and the license_verify command should report the status of your current license.
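
An automated script can check for that file before starting a long run. The sketch below assumes the license file sits directly inside the User Data folder and matches the pattern vslicense-*.txt; the helper name has_license_file is illustrative.

```shell
# Hypothetical helper: succeed if a vslicense-*.txt file exists under
# <appdata_dir>/VarSeq/User Data (location assumed from the note above).
has_license_file() {
  local user_data="$1/VarSeq/User Data"
  local f
  for f in "$user_data"/vslicense-*.txt; do
    # If nothing matches, the pattern stays unexpanded and this test fails.
    [ -e "$f" ] && return 0
  done
  return 1
}

# Example (not executed here):
#   has_license_file "`pwd`/AppData" || echo "VSPipeline license not activated" >&2
```

This only confirms that the file is present. The license_verify command remains the way to check the actual license status.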

Now we are ready to run a workflow. In this example, I’m going to assume there are a couple of VCF files (file1.vcf.gz and file2.vcf.gz) inside the “RunData” folder. By running a series of VSPipeline commands (see our manual for more details), we can execute the following steps:

  • Create a project with the “Cancer Gene Panel Starter Template” under our RunData folder called “CancerProject”
  • Import our two VCF files
  • Download any annotations referenced by the project (note this only happens once, as the downloaded files will be cached under AppData/Common Data/Annotations)
  • Wait for all the algorithms and annotations to finish
  • Export each sample’s annotated variant table out as an Excel file

docker run --rm --net=host --user=$(id -u) \
  -v `pwd`/AppData:/appdata \
  -v `pwd`/RunData:/data \
  goldenhelix/vspipeline \
  -c project_create CancerProject "Cancer Gene Panel Starter Template" \
  -c import file1.vcf.gz,file2.vcf.gz \
  -c download_required_sources \
  -c task_wait \
  -c foreach_sample \
     "table_export_xlsx VariantTable '{name}_filtered_variants.xlsx'"
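
When the file names change from batch to batch, the comma-separated list passed to import can be built from the contents of the run folder. The following is a rough sketch that assumes every *.vcf.gz file in the folder belongs to the batch; vcf_import_list is a hypothetical helper name.

```shell
# Hypothetical helper: print the *.vcf.gz file names in a directory as a
# comma-separated list (names only, since the folder is mounted as /data).
vcf_import_list() {
  local dir="$1" list="" f
  for f in "$dir"/*.vcf.gz; do
    [ -e "$f" ] || continue          # skip the unexpanded pattern when nothing matches
    list="${list:+$list,}$(basename "$f")"
  done
  printf '%s\n' "$list"
}

# Example (not executed here): replace the fixed import line with
#   -c import "$(vcf_import_list "`pwd`/RunData")" \
```

An empty result means no VCF files were found, which a calling script should treat as an error before invoking VSPipeline.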

To adapt this to run your own VarSeq workflows, you will need to copy your custom project templates to AppData/VarSeq/User Data/ProjectTemplates and any custom annotation files to the AppData/Common Data/Annotations directory.
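
Those copy steps could be scripted along these lines. This is a sketch under assumptions: the helper names and the example file names are made up, and cp -R is used so that either a single file or a folder can be copied.

```shell
# Hypothetical helper: copy a custom project template into the AppData folder.
# Usage: install_template <template_path> <appdata_dir>
install_template() {
  local src="$1" appdata="$2"
  local dest="$appdata/VarSeq/User Data/ProjectTemplates"
  mkdir -p "$dest" && cp -R "$src" "$dest/"
}

# Hypothetical helper: copy a custom annotation file into the annotation cache.
# Usage: install_annotation <annotation_path> <appdata_dir>
install_annotation() {
  local src="$1" appdata="$2"
  local dest="$appdata/Common Data/Annotations"
  mkdir -p "$dest" && cp -R "$src" "$dest/"
}

# Example (not executed here; the file names are placeholders):
#   install_template ./MyPanelTemplate "`pwd`/AppData"
#   install_annotation ./MyLabAnnotation "`pwd`/AppData"
```

The quoting matters here because both destination paths contain spaces.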

That’s it! VSPipeline has been used in many different lab contexts to automate clinical analysis for germline and cancer genetic tests. The use of Docker simplifies the deployment of automated workflows using the VarSeq Suite and ensures ongoing stability as the underlying Linux operating systems continue to change.
