BLOG: Let’s build a workflow, part 1: foundational workflows in WDL

Over the past several months, we at BioSymetrics have collaborated with DNAstack on genomic and phenomic data analysis pipelines. Along the way, we gained numerous insights into building workflows and ultimately settled on constructing foundational workflows that we can compose into larger umbrella workflows. Let us share our insights with you!

What is WDL?

The Workflow Description Language (WDL) is a standard for describing data processing workflows, originally developed at the Broad Institute for genome analysis pipelines. WDL has since grown and is now overseen by the OpenWDL community.

We treat WDL as a “grand orchestrator” of individual software packages, where each package internally decides how to organize its underlying operations, whether as a monolithic program, a Nextflow or Snakemake workflow, or even a chain of underlying WDL tasks.

To illustrate the idea of WDL as a grand orchestrator, let us consider the following example workflow:

Figure: Example workflow to identify patients with similar patterns of genetic variation, starting from genome sequences.

Our example workflow aims to identify shared variants of interest between patients. First, we identify variants in each genome sequence, annotate these variants with their predicted effects, prioritize them by rarity and predicted consequence, and then look for patients with similar patterns of variation.

Each of these steps involves running distinct software packages that a grand orchestrator like WDL would chain together. Furthermore, each of these software packages may have their own software requirements and may even have their own individualized workflow in Nextflow, Snakemake, or even WDL. They may also be distributed as isolated environments via Docker. The question then becomes: how do we bring together all these requirements into a single umbrella WDL workflow, made up of individual workflows?
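Part of the answer lies in WDL’s runtime section: each task can pin the exact container image it runs in, so every step of an umbrella workflow carries its own isolated environment. Below is a minimal, hypothetical variant-calling task; the image tag and stub command are placeholders rather than the exact tooling we use.

version 1.0

task call_variants {
    input {
        File bam
        File reference
    }

    command <<<
        set -euo pipefail
        # The real caller invocation would go here; we emit a stub VCF
        touch "~{basename(bam, ".bam")}.vcf"
    >>>

    # Pin the task to an isolated container environment
    # (image tag is illustrative, not a recommendation)
    runtime {
        docker: "broadinstitute/gatk:latest"
    }

    output {
        File vcf = basename(bam, ".bam") + ".vcf"
    }
}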

A foundational workflow should run a single software package

We settled on a design philosophy of chaining individual workflows, which we term foundational workflows, each of which runs a single software package or a subset of its components. A foundational workflow can comprise several small tasks, or just one. Organizing a larger workflow as a chain of foundational workflows lets us test each individual workflow in isolation. In our example above, the software packages (and therefore the foundational workflows) map directly to the individual steps listed.

Let us consider an example foundational workflow to annotate variants:

version 1.0

workflow AnnotateVariants {
    input {
        Array[File] vcfs
    }

    scatter (vcf in vcfs) {
        call annotate_variants {
            input:
                vcf = vcf
        }
    }

    output {
        Array[File] out = annotate_variants.out
    }
}

This workflow ingests variant call format (VCF) files and calls the following task for each VCF file:

task annotate_variants {
    input {
        File vcf
    }

    # Name of the annotated output file, derived from the input VCF
    String out_name = ...

    command <<<
        set -euo pipefail
        ...
    >>>

    output {
        File out = out_name
    }
}

Foundational workflows can call more tasks upstream if additional preparatory steps are needed prior to running the relevant software packages. For example, a foundational workflow running Ensembl’s Variant Effect Predictor (VEP) might need to download a VEP cache and plugins.
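As a sketch of that pattern, the workflow below calls a preparatory task to fetch the cache before annotation runs. The cache URL, image tag, and exact VEP flags are illustrative rather than our production configuration.

version 1.0

workflow RunVep {
    input {
        File vcf
        String cache_url
    }

    # Preparatory step: fetch the VEP cache before annotating
    call download_cache {
        input:
            cache_url = cache_url
    }

    call run_vep {
        input:
            vcf = vcf,
            cache = download_cache.cache
    }

    output {
        File out = run_vep.out
    }
}

task download_cache {
    input {
        String cache_url
    }

    command <<<
        set -euo pipefail
        wget -O vep_cache.tar.gz "~{cache_url}"
    >>>

    output {
        File cache = "vep_cache.tar.gz"
    }
}

task run_vep {
    input {
        File vcf
        File cache
    }

    command <<<
        set -euo pipefail
        tar -xzf "~{cache}"
        # Annotate against the extracted cache in the working directory
        vep --input_file "~{vcf}" --output_file annotated.vcf --cache --dir_cache .
    >>>

    runtime {
        docker: "ensemblorg/ensembl-vep:latest"
    }

    output {
        File out = "annotated.vcf"
    }
}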

Compose, compose, compose

WDL allows us to call other WDL workflows from within a WDL workflow. This feature lets us compose umbrella workflows that chain multiple foundational workflows, like the steps of the example pipeline illustrated above.
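For instance, an umbrella workflow for our example pipeline might import a foundational workflow per step and chain their outputs. The file names, workflow names, and input/output names below are hypothetical, except for AnnotateVariants, which mirrors the foundational workflow defined earlier.

version 1.0

import "call_variants.wdl" as caller
import "annotate_variants.wdl" as annotator
import "prioritize_variants.wdl" as prioritizer
import "match_patients.wdl" as matcher

workflow IdentifySimilarPatients {
    input {
        Array[File] genomes
    }

    # Each call below invokes a foundational workflow, not a bare task
    call caller.CallVariants {
        input:
            genomes = genomes
    }

    call annotator.AnnotateVariants {
        input:
            vcfs = CallVariants.vcfs
    }

    call prioritizer.PrioritizeVariants {
        input:
            vcfs = AnnotateVariants.out
    }

    call matcher.MatchPatients {
        input:
            vcfs = PrioritizeVariants.out
    }

    output {
        File matches = MatchPatients.out
    }
}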

Calling and annotating variants are straightforward steps: established software packages publish official Docker images that our WDL workflows can invoke directly. However, downstream steps may require bespoke software. We devised a strategy to bundle these downstream steps into an encapsulated software suite, much like VEP and snpEff, in a way that leverages our data scientists’ existing knowledge and lowers their technical barrier to entry.

We will continue this discussion on encapsulating bespoke analyses in Part 2! In the meantime, follow us on LinkedIn!
