Introduction to CWL and Docker

automating workflows in Life Sciences

How do I run tools or workflows in parallel?

Now that we know how to write workflows, we can start using the ScatterFeatureRequirement. This feature tells the runner that you wish to run a tool or workflow multiple times over a list of inputs. The workflow takes the input(s) as an array and runs the specified step(s) on each element of the array as if it were a single input. This lets you run the same workflow on multiple inputs without generating many separate commands or input YAML files.

requirements:
  ScatterFeatureRequirement: {}

The most common reason a new user might want to use scatter is to perform the same analysis on different samples. Let’s start with a simple workflow that calls our first example tool and takes an array of strings as its input:

scatter-workflow.cwl

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  message_array: string[] 

steps:
  echo:
    run: echo.cwl
    scatter: message
    in:
      message: message_array
    out: []

outputs: []

Aside from the requirements section including ScatterFeatureRequirement, what is going on here?

inputs:
  message_array: string[]

First of all, notice that the main workflow level input here requires an array of strings.

steps:
  echo:
    run: echo.cwl
    scatter: message
    in:
      message: message_array
    out: []

Here we’ve added a new field, scatter, to the step echo. This field tells the runner that we’d like to scatter over this input for this particular step. Note that the name listed after scatter is the name of one of the step’s inputs (message), not of a workflow-level input (message_array).

For our first scatter, it’s as simple as that! Since our tool doesn’t collect any outputs, we still use outputs: [] in our workflow. If a scattered step does produce an output, however, the step now yields one value per array element, so be sure to declare the corresponding workflow output as an array type as well!
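As a sketch of that last point, the variant below collects the results of a scattered step as an array. It is not part of the original lesson: the file name scatter-collect.cwl and the output name echo_files are hypothetical, and the echo tool is embedded inline (with stdout captured) only so that the example is self-contained.

```yaml
#!/usr/bin/env cwl-runner
# scatter-collect.cwl (hypothetical file name, for illustration only)

cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  message_array: string[]

steps:
  echo:
    # Inline tool definition; assumed to behave like the echo tool
    # used in this lesson, except that it captures stdout as a File.
    run:
      class: CommandLineTool
      baseCommand: echo
      inputs:
        message:
          type: string
          inputBinding:
            position: 1
      outputs:
        echo_out:
          type: stdout
    scatter: message
    in:
      message: message_array
    out: [echo_out]

outputs:
  # The tool produces a single File, but the scattered step produces
  # one File per element of message_array, hence File[].
  echo_files:
    type: File[]
    outputSource: echo/echo_out
```

The tool itself still declares a single File output; only the workflow-level output changes to File[], because the scattered step gathers one result per array element.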

To run scatter-workflow.cwl, use the following input file:

scatter-job.yml

message_array: 
  - Hello world!
  - Hola mundo!
  - Bonjour le monde!
  - Hallo welt!

As a reminder, echo.cwl simply calls the command echo on a message. Invoking cwl-runner scatter-workflow.cwl scatter-job.yml on the command line gives:

$ cwl-runner scatter-workflow.cwl scatter-job.yml
[workflow scatter-workflow.cwl] start
[step echo] start
[job echo] /tmp/tmp0hqmg400$ echo \
    'Hello world!'
Hello world!
[job echo] completed success
[step echo] start
[job echo_2] /tmp/tmpu65_m1zw$ echo \
    'Hola mundo!'
Hola mundo!
[job echo_2] completed success
[step echo] start
[job echo_3] /tmp/tmp5cs7a2wh$ echo \
    'Bonjour le monde!'
Bonjour le monde!
[job echo_3] completed success
[step echo] start
[job echo_4] /tmp/tmp301wo7p8$ echo \
    'Hallo welt!'
Hallo welt!
[job echo_4] completed success
[step echo] completed success
[workflow scatter-workflow.cwl] completed success
{}
Final process status is success

You can see that the workflow calls echo once for each element of our message_array. OK, so what if we want to scatter over two steps in a workflow?

Let’s perform a simple echo like the one above, but capture stdout by replacing outputs: [] in the tool with the following lines:

echo-mod.cwl

outputs:
  echo_out:
    type: stdout
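For reference, a complete echo-mod.cwl might look like the sketch below. Only the outputs section comes from this lesson; the rest assumes that echo.cwl is a minimal tool taking a single string input named message, which matches how the workflow steps above call it.

```yaml
#!/usr/bin/env cwl-runner
# echo-mod.cwl: a sketch, assuming echo.cwl is a minimal echo tool
# with a single string input named `message`.

cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs:
  # Capture whatever echo prints to standard output as a File.
  echo_out:
    type: stdout
```

Because no stdout file name is set on the tool, the runner picks a name for the captured file itself.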

Then add a second step that uses wc -c to count the bytes in each file (for plain ASCII text, this is the same as the number of characters). See the tool below:

wc.cwl

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool
baseCommand: wc
arguments: ["-c"]
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
outputs: []

Now, how do we incorporate scatter? Remember the scatter field is under each step:

scatter-two-steps.cwl

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  message_array: string[] 

steps:
  echo:
    run: echo-mod.cwl
    scatter: message
    in:
      message: message_array
    out: [echo_out]
  wc:
    run: wc.cwl
    scatter: input_file
    in:
      input_file: echo/echo_out
    out: []

outputs: []

Here we have placed the scatter field under each step. This is fine for this example since it runs quickly, but if you’re running many samples through a more complex workflow, you may wish to consider an alternative. Here we are running scatter on each step independently. Counting the bytes of one message does not depend on the other messages having been echoed, yet the second step expects an array as input from the first step, so it waits until everything in step one is finished before doing anything. That means we aren’t using the scatter functionality efficiently.

Pretend that echo Hello world! takes 1 minute and wc -c on its output takes 3 minutes, while echo Hallo welt! takes 5 minutes and wc -c on its output takes 3 minutes. Even though the Hello world! message could be finished after 4 minutes (1 + 3), it will actually finish after 8 minutes (5 + 3), because the wc step cannot start until echo Hallo welt! has completed. You can see how this might not scale well.
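One possible alternative, sketched below, is to wrap both steps in a nested workflow and scatter over that workflow instead, so that each message passes through echo and then wc as a single unit. The file name scatter-nested-workflow.cwl and the step name echo_and_wc are assumptions made for illustration. Embedding a workflow inside a step requires SubworkflowFeatureRequirement in addition to ScatterFeatureRequirement.

```yaml
#!/usr/bin/env cwl-runner
# scatter-nested-workflow.cwl (hypothetical file name)

cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}

inputs:
  message_array: string[]

steps:
  echo_and_wc:                 # hypothetical step name
    # Scatter once, over the whole two-step sub-workflow.
    scatter: message
    in:
      message: message_array
    out: []
    run:
      class: Workflow
      inputs:
        message: string        # a single message, not an array
      outputs: []
      steps:
        echo:
          run: echo-mod.cwl
          in:
            message: message
          out: [echo_out]
        wc:
          run: wc.cwl
          in:
            input_file: echo/echo_out
          out: []

outputs: []
```

Inside the nested workflow the inputs are single values, so wc for one message depends only on the echo for that same message. Whether the scattered branches actually run at the same time still depends on the workflow runner and how it is configured.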

Visualizing CWL

CWL viewer

This tool visualises and lists the details of a CWL workflow with its inputs, outputs and steps and packages the files involved into a downloadable Research Object Bundle (zip file with metadata in a manifest), allowing it to be easily viewed and shared.

Rabix

Power tools for the Common Workflow Language