
Snakemake Workshop, part 2: Writing workflows

Introduction

Welcome to the second part of our Snakemake workshop! In this section, we will learn how to begin creating a Snakemake workflow from a series of shell scripts. We will learn about what types of workflows might benefit from being converted to Snakemake, how to go from a series of shell scripts to a pipeline, and how to make a pipeline configurable so that it can be used with different datasets. Keep in mind that this session is designed to give you a taste of developing Snakemake workflows, and uses a very simple example. If you want assistance with turning your own workflow into a Snakemake workflow, please reach out to us for help!

Run git pull at the beginning of the workshop

Because we have been making changes to the workshop content, if you have cloned the workshop repository before today, please run git pull in the root directory of the repository to make sure you have the latest version of the materials. If you downloaded the materials as a zip file, please redownload the zip file and extract it again.

What workflows are suitable for Snakemake?

Although we love workflow managers like Snakemake, that doesn't mean that every single thing we run on our computers gets wrapped in a Snakemake workflow. A workflow manager offers the following powerful features:

  • Resumability: If a step in a workflow fails, you can rerun from that step.
  • Reproducibility: You can rerun the same workflow with the same input data and get the same results.
  • Parallelization: You can run multiple steps in parallel, and subsequent steps run without waiting for unrelated steps to finish.
  • Scalability: Running the same workflow on 1 file or on 100 files takes the same effort.

However, there is a good amount of overhead involved in writing a Snakemake pipeline, so in order to take advantage of these features, you may also ask whether your workflow has some of the following characteristics:

  • Multiple, interrelated steps: If you have multiple file transformations that depend on each other, this is a good candidate for Snakemake. Don't use Snakemake if you have just one or two simple steps that are easily launched in a shell script.
  • Need for reruns: If you need to rerun the same steps with different parameters (e.g. for benchmarking), or if you need to compare the results of different runs, Snakemake could help make this easier and more reproducible.
  • Can benefit from simple parallelization: You have many independent steps that can be run in parallel, such as multiple input files to be processed with the same steps
  • Complex dependencies across steps: Each rule in Snakemake can run in a different environment, so this would be useful if you want to compartmentalize your software environments

A classic example of a workflow that would NOT be worth converting into a Snakemake workflow is some one-off exploratory or data cleaning script that you aren't sure you will ever run again. Another example is a super convoluted workflow that you don't fully understand or didn't write yourself, and don't have time to debug. In that case it might be better to just start fresh rather than trying to convert.

Converting shell scripts to Snakemake

Drawing your own rulegraph

Planning ahead is important when writing a Snakemake workflow. Before you start writing any code, you should understand the relationships between the different steps in your workflow. One way to do this is to visualize the rulegraph of your workflow manually.

Exercise: Read through the shell script files in the shell-scripts directory. These scripts take a set of text files in the data directory, count the number of lines and words in each file, and then combine those counts into a summary file for each input file. Draw a diagram of the input-output relationships of these scripts. For example, you might draw something like this:

data/sample1.txt  -->  01_count_lines.sh  -->  results/sample1.lines

Or it may be easier to draw on paper.

Writing the first Snakemake rule

The main goal of a Snakemake rule is to gather inputs, compile expected outputs, and run a command on the inputs to generate the outputs. So, in order to convert a command or shell script to a Snakemake rule, we need to understand the syntax of Snakemake rules.

First, a rule name is defined. For our first script in shell-scripts, 01_count_lines.sh, we might call the associated rule count_lines. This is declared with the rule keyword:

rule count_lines:

Note that the colon is required syntax.

Question: Snakemake is based on Python. In Python, what must happen on new lines after lines that end in a colon (e.g. if x < 10: or while True:)?

Click for answer

Code to be executed within that statement must be indented after a colon.

In Python, these are referred to as blocks of code. In Snakemake, we simply call these sections or specifications, because these aren't necessarily doing any computations, but rather compiling information to perform tasks.

After you've named your rule, you need to give the rule information through directives. The basic directives for a Snakemake rule are input, output, and a directive to run a command: either shell, run, or script. There are other directives that let us control many other aspects of how the command is run, and we'll talk about those later on, but to start we'll focus on these three.

Input specification

To specify the inputs of your rule, you simply type input: <name of input file> after your rule declaration. For example, if we wanted our count_lines rule to run only on sample1 in our dataset, we would type:

rule count_lines:
    input: "data/sample1.txt"

Notice the indentation, which is required. Also notice that even though input: ends with a colon, since we keep the specification on the same line, no indentation is required.

Output specification

Next, we would specify our output directive with output: <name of expected output file>.

Exercise: Create a dev-01.smk file in the develop directory. Copy what we have so far (rule declaration and input specification) into the file. Now, add an output specification assuming we are only running this rule on sample1.

Solution
rule count_lines:
    input: "data/sample1.txt"
    output: "results/sample1.lines"

Conveniently, for output files, Snakemake will create any directories specified if needed. For example, in the above output section, we say to store the files in results/ relative to our current directory. At this point, no results/ directory exists, but Snakemake will still be able to run and just create it for us. If results/ already existed when we ran our workflow, it would then check whether results/sample1.lines exists and, depending on whether the input files or script had been updated, may or may not run the command again to (re-)generate it.

Running a command with the shell directive

Now that we have the input and output files specified with our rule, we need to tell Snakemake what to do with them. For now (and commonly), we'll use the shell directive. We can just plug in how we would normally run our shell-scripts/01_count_lines.sh script:

rule count_lines:
    input: "data/sample1.txt"
    output: "results/sample1.lines"
    shell: 
        "bash shell-scripts/01_count_lines.sh {input}"

Since this line is a bit longer, we've placed the command on a different line from the shell: directive itself and because it follows a colon on a new line, we've indented it!

Syntactically, the quotes around the command are required. For multi-line commands, use Python's triple quote syntax ("""; see below).

Importantly, notice the argument we pass to the 01_count_lines.sh script: {input}. The curly brackets are the shell directive's way of accessing information from the other directives in the rule. Here, {input} means, "substitute in the value provided in the input: section above." You will use curly brackets in this way a lot, so if you have any questions about it, please ask!
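
For our hardcoded input, this substitution means the command Snakemake actually runs is equivalent to:

bash shell-scripts/01_count_lines.sh data/sample1.txt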

Incorporating our script directly in the rule

What if we didn't want to rely on the shell script? Since the script is very simple, and Snakemake's shell directive can take any shell command, we could directly put the bash command in the script into the shell directive:

rule count_lines:
    input: "data/sample1.txt"
    output: "results/sample1.lines"
    shell: 
        "wc -l {input} | awk '{{print \$1}}' > {output}"

Curly brackets in shell commands

Recall that Snakemake uses curly brackets to substitute information from the rule's other directives. However, in the command above, we also use curly brackets as part of awk's syntax. To tell Snakemake not to substitute anything into awk's curly brackets, we escape them by doubling them: {{ }}. You will need to do this any time you want literal curly brackets in your shell commands.
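
For example, after Snakemake substitutes {input} and {output} and collapses the doubled brackets, the shell receives a command equivalent to:

wc -l data/sample1.txt | awk '{print $1}' > results/sample1.lines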

In practice, whether you want to directly write the command in your Snakemake file or call an external script is up to you. In this workshop, we will directly write the commands in the Snakemake file because they are short and easy to read. Directly writing the command in the shell: section has the benefit of making the workflow more self-contained and easier to understand. On the other hand, calling an external script is useful for running a more complex command. Note that if you do find yourself making a Snakemake workflow with few rules that all refer to complex scripts, it may be worth considering breaking up that complex script into multiple steps/rules. We will discuss this later in the section "What is the proper scope of a rule?".

Rule all

By default, Snakemake executes the first rule in your script when run. If necessary, it will run other rules to create the inputs for that first rule. This works for our simple example, with one rule and hardcoded inputs and outputs. But for more complex workflows, it is natural to write rules top-to-bottom (first step first), which conflicts with this Snakemake behavior: Snakemake would only run that first rule (plus anything needed to create its inputs) and ignore the rest of the workflow.

So, to make sure Snakemake knows what the final product of your workflow is, and to keep things readable (top-to-bottom), the standard when writing Snakemake workflows is to place a rule called all at the top of your script. This rule will have no output: directive and no command to run (e.g. with shell:). Instead, it will only have an input: directive, and the inputs to this rule should be the final expected outputs of your entire workflow.

Recall that Snakemake is based in principle on GNU Make, which is used to build software. In GNU Make, the final target is the executable or binary for the software. In that context, the all rule is like the final completed executable.

rule all also helps us think about the end goal of our workflow. In practice, though, you will usually write your workflow one rule at a time and update rule all each time you add a new rule.

For now, our workflow only has one output, so we can write that as our desired final output:

rule all:
    input: "results/sample1.lines"

Your full dev-01.smk file should now look like this:

rule all:
    input: "results/sample1.lines"

rule count_lines:
    input: "data/sample1.txt"
    output: "results/sample1.lines"
    shell:
        "wc -l {input} | awk '{{print \$1}}' > {output}"

Notice that the output for rule count_lines is now the input for rule all.

Let's leave everything hard coded for now so that we can focus on getting the workflow working. Soon, we will learn about how to use wildcards so that these rules can be generalized and take multiple files.

Testing your Snakemake workflow with --dryrun

The first thing you should do to test your Snakemake workflow is to run it with the --dryrun option. This will show you what Snakemake would do without actually running any commands. You can run this command in the terminal:

snakemake --snakefile develop/dev-01.smk --dryrun

A dry run is for checking the logic of your workflow; it does not run any commands or modify any files. Its main function is to check each rule and determine whether the required input files are available or will be generated by another rule. Make sure to read the output of your dry run to check that the logic matches your intention.

Importantly, this dry run WILL NOT check whether the commands in the shell: directives are valid or will succeed. It just assumes that they will and only looks at the input and output sections. The dry run should look like this:

Building DAG of jobs...
Job stats:
job            count
-----------  -------
all                1
count_lines        1
total              2


[Tue Aug 19 15:59:10 2025]
rule count_lines:
    input: data/sample1.txt
    output: results/sample1.lines
    jobid: 1
    reason: Missing output files: results/sample1.lines
    resources: tmpdir=<TBD>
[Tue Aug 19 15:59:10 2025]
rule all:
    input: results/sample1.lines
    jobid: 0
    reason: Input files updated by another job: results/sample1.lines
    resources: tmpdir=<TBD>
Job stats:
job            count
-----------  -------
all                1
count_lines        1
total              2

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        all
    output files have to be generated:
        count_lines
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Resetting your workflow by deleting output files

Now that we'll be actively developing the workflow, it's a good idea to reset the workflow state between runs. We can do this by running rm -r results/*. This ensures that we start with a clean slate each time we run the workflow.

Checkpoint: dev-01.smk

You can catch up to here by looking at complete/dev-01.smk.

Generalizing with wildcards

Of course, in our rule count_lines, we've currently hardcoded a single file for the rule to run on. It would be pretty pointless if we had to change the file name each time we wanted to run the rule. Thankfully, Snakemake (and other workflow managers) have a way to generalize rules to many input files: wildcards.

Wildcards are placeholders that Snakemake uses to match patterns in, say, file names.

Task: Create a dev-02.smk file in the develop directory and copy over the contents of dev-01.smk. Now, we will modify the rules to incorporate wildcards. For example, we can change our count_lines rule to use a wildcard for the input file:

rule count_lines:
    input: "data/{sample}.txt"
    output: "results/{sample}.lines"
    shell:
        "wc -l {input} | awk '{{print \$1}}' > {output}"

Notice now in the input and output sections how we've replaced the actual file name with {sample}.

Recall that in shell: directives, curly brackets substitute information from the rule's other directives. In the input and output sections, curly brackets instead define wildcards: placeholder patterns within file names.

The part of the file path that we replace with the wildcard is the part that varies between files: sample1, sample2, etc. Essentially, we have defined a pattern. As Snakemake works backwards through the workflow, it will encounter this rule, match {sample} against the files requested by other rules, and fill in the values appropriately.

However, just because we created a wildcard doesn't mean that Snakemake will scan your filesystem to look for files that match that pattern. The wildcards function only as a way to define the structure of the input and output file paths. In our case, we will need to provide a list of samples for Snakemake to process. Remember, a Snakemake script is just a Python script with extra syntax. This means we can use Python code in our Snakemake script, which is especially useful for workflow setup!

In Python, you can create a list of strings (our sample names) like this:

SAMPLES = ["sample1", "sample2"]

Next, we need to substitute the wildcards with those sample names. In Snakemake, there is a special function called expand() that we can use to fill in the wildcards with the actual sample names. For example,

expand("data/{sample}.txt", sample=['sample1', 'sample2'])

will be parsed by Snakemake to

["data/sample1.txt", "data/sample2.txt"]

Exercise: Where do we put expand() functions? They are almost always used in the input section of rules, with the wildcards propagating backwards through the rules. In our simple Snakemake workflow, we only need to put the expand() function in one place, and you might need to edit the function slightly. Edit your dev-02.smk file and try to figure out where to put it and what edits need to be made. Remember that Snakemake thinks backwards!

Run rm -r results/* to reset your project directory. Then do a dry run of snakemake -s dev-02.smk --dryrun. Generate the rulegraph and the DAG for dev-02.smk. If everything looks good, run it for real with snakemake -s dev-02.smk -j 1.

Checkpoint: dev-02.smk

You can catch up to here by looking at complete/dev-02.smk & related files dev-02-rulegraph.png and dev-02-dag.png.

More advanced expanding

You can have multiple wildcards in a single expand call. For example, you can expand both the sample name and the file type:

expand("data/{sample}.{ext}", sample=['sample1', 'sample2'], ext=['txt', 'csv'])

becomes

["data/sample1.txt", "data/sample2.txt", "data/sample1.csv", "data/sample2.csv"]

Exercises

Adding a new rule 1

In the next few sections, we will gradually build new rules for our Snakemake workflow based on the bash scripts in the develop/shell-scripts directory. We will name them dev-03.smk, dev-04.smk, etc. As we test and run our Snakemake workflows, we will be removing the results directory and its contents each time. This is because when we do a dry run, Snakemake checks to see if any intermediate or final files have already been created by the rules, and if so, excludes that rule from the dry run logic. This could cause unexpected behavior if we are intending to test the entire logic of the workflow.

Exercise: Copy dev-02.smk to dev-03.smk. In dev-03.smk, try to add another rule count_words that counts the number of words in each input file. This should do the same thing as the shell script shell-scripts/02_count_words.sh. When you add the new rule, make sure to also modify the all rule to reflect the new workflow logic and the final product. If you are lost, consult your workflow diagram that you made earlier. Test your code with --dryrun and then run it for real to make sure it worked. Remember to remove the results directory before running the workflow again. Generate the rulegraph and DAG for dev-03.smk.

Checkpoint

See complete/dev-03.smk, rulegraph: complete/dev-03-rulegraph.png, DAG: complete/dev-03-dag.png

Adding a new rule 2

The next shell script we want to add to our Snakemake workflow is the 03_combine_counts.sh script, which will combine the line and word counts for each sample into a single output file for each sample. This rule will take as an input the line and word count files for each sample. The final output will now be a .summary file for each sample. As before, we will need to edit both rule all as well as add the new rule combine_counts. Additionally, we will have two inputs for the combine_counts rule: the line count file and the word count file.

To add two inputs to a rule in Snakemake, you can specify them on separate lines, separated by a comma, and name them accordingly. Then, when you need to refer to those inputs, you can use the input keyword followed by the name of that input. This also applies for the output block. For example:

rule example:
    input:
        A="{sample}.txt",
        B="{sample}.csv"
    output:
        C="{sample}.tsv"
    shell:
        "cat {input.A} {input.B} > {output.C}"

Not every input needs to be a wildcard. You may have a rule that takes a fixed input file name and then a series of variable files. For example:

rule example:
    input:
        A="fixed_input.txt",
        B="{sample}.csv"
    output:
        C="{sample}.tsv"
    shell:
        "cat {input.A} {input.B} > {output.C}"

Another concept that is important in Snakemake formatting is multi-line shell commands. If you have a long shell command, you can split it across multiple lines for better readability. You can do this by using triple quotes """ to enclose the entire command. We will be using this feature in the next exercise. For example:

rule example:
    input:
        A="{sample}.A",
        B="{sample}.B"
    output:
        C="{sample}.C"
    shell:
        """
        cat {input.A} {input.B} > {output.C}
        some_other_command here
        and so on
        """

Exercise: Copy dev-03.smk to dev-04.smk. Try making the necessary edits to add the rule to combine the outputs of the other two rules (as done in shell-scripts/03_combine_counts.sh) now in dev-04.smk. You will need to replace the input for rule all with the summary files and add a rule combine_counts. Make a rulegraph and a DAG for dev-04.smk.

Checkpoint

See complete/dev-04.smk, Rulegraph: complete/dev-04-rulegraph.png, DAG: dev-04-dag.png

Aggregation/reduction rule

We will now move on to writing the final rule of our Snakemake workflow, which will adapt the script 04_aggregate.sh. This script takes all the .summary files and combines them into a nice tab-separated file (TSV) with a header. Until now, we have only made rules that have a one-to-one correspondence between input and outputs, i.e. each input creates a unique output. This time, the rule aggregate will take many files and output a single file. Because this is a more complicated rule, let's break up the exercise into steps.

Question: First step: what is the input to rule aggregate and how would you specify it using wildcards?

Answer

The input to rule aggregate is all the .summary files. We know that all the summary files will be named following the pattern of sample1.summary, sample2.summary, etc. So we need to create an expand() function that will get all of these.

rule aggregate:
    input: expand("results/{sample}.summary", sample = SAMPLES)

Question: Second step: based on the shell script, what should the output be?

Answer

The output should be a single named file. Because it does not vary based on the input, it should not contain any wildcards.

rule aggregate:
    input: expand("results/{sample}.summary", sample = SAMPLES)
    output: "results/combined.summary"

Now we get to the difficult part of converting this script. First, notice how the input differs from previous rules' inputs. In our rule combine_counts, the input files were specified on separate lines, one for each count type (lines and words). So for each sample, the input behaves like a dictionary:

{
    "lines": "results/sample1.lines",
    "words": "results/sample1.words"
}

Then each of those files will get referenced in the shell command when we do {input.lines} and {input.words}. The shell interpreter then sees the input as "results/sample1.lines" and "results/sample1.words", respectively.

In contrast, when we use the expand() function in the rule aggregate, we create one list of files that looks like this:

["results/sample1.summary", "results/sample2.summary"]

That means instead of one file being passed to the rule, a whole list of files is passed at once. When we refer to {input} in the shell command in this case, the shell interpreter sees a single string with all the file names separated by spaces.

So the shell directive below

shell:
    "cat {input}"

becomes

cat results/sample1.summary results/sample2.summary

Therefore, in order to work with individual files, we will need to use a loop to iterate over the list of input files. If you look at the 04_aggregate.sh script, you will see that it already has the for loop written. Now we need to figure out where to put {input} and {output}.

To recap:

  • One-to-one rules (like combine_counts) use wildcards to process one set of files per sample.
  • Many-to-one rules (like aggregate) use an explicit input list (from expand()), combining all files in a single rule execution.

Exercise: Third step: Copy dev-04.smk to dev-05.smk. Complete the rule aggregate and adapt the shell command. Save this as dev-05.smk.

Solution
rule aggregate:
    input:
        expand("results/{sample}.summary", sample = SAMPLES)
    output:
        "results/aggregate-summary.tsv"
    shell:
        """
        echo -e "sample\tlines\twords" > {output}
        for summary_file in {input}; do
            SAMPLE_NAME=$(basename "$summary_file" .summary)
            LINES=$(grep -e "^lines\t" "$summary_file" | cut -f2)
            WORDS=$(grep -e "^words\t" "$summary_file" | cut -f2)
            echo -e "$SAMPLE_NAME\t$LINES\t$WORDS" >> {output}
        done
        """

Exercise: Fourth step: Modify the rule all to the correct final file and then test the workflow using dry-run. Finally, create the rulegraph and DAG for dev-05.smk.

Solution
rule all:
    input:
        "results/aggregate-summary.tsv"

In the following sections, we will introduce some more advanced topics related to Snakemake that allow you to better customize how your workflow runs. It will be lighter on coding exercises because there is a lot to cover.

Checkpoint: dev-05.smk

You can catch up to here by looking at complete/dev-05.smk, rulegraph: complete/dev-05-rulegraph.png, DAG: dev-05-dag.png.

Other ways to specify inputs

Automatic file discovery using glob

So far, we've used wildcards and hardcoded sample names to specify input files.

Here's how we did the hard-coded sample name:

rule count_lines:
    input: "data/sample1.txt"

Here's how we did the wildcard sample names:

SAMPLES = ["sample1", "sample2"]
rule all:
    input: expand("results/{sample}.lines", sample=SAMPLES)

Real-world workflows need more flexible ways to specify inputs. Let's look at a more advanced way of finding files using glob() and glob_wildcards(). glob is a Python module whose functions find files and directories by pattern matching, using shell-style wildcards such as *. Below is an example of using glob to find all files that match the pattern *.txt in the data folder and then extracting the sample name. This can be plugged directly into the dev-05.smk file in place of the first line.

import glob
import os

SAMPLES = []
for txt_file in glob.glob("data/*.txt"):
    sample_name = os.path.basename(txt_file).replace(".txt", "")
    SAMPLES.append(sample_name)

Command breakdown:

  • import glob: imports the glob module, which provides functions for file pattern matching.
  • import os: imports the os module, which provides functions for interacting with the operating system.
  • SAMPLES = []: initializes an empty list called SAMPLES to store sample names.
  • for txt_file in glob.glob("data/*.txt"): uses glob to find all files in the data directory that match the pattern, then iterates over each file.
  • sample_name = os.path.basename(txt_file).replace(".txt", ""): extracts the base name of the file (removing the directory path) and strips the .txt extension.
  • SAMPLES.append(sample_name): adds the extracted sample name to the SAMPLES list.

Another way to glob files is to use the snakemake-specific glob function called glob_wildcards(). This one lets you do in one line what we did in a for loop above.

SAMPLES = glob_wildcards("data/{sample}.txt").sample

Command breakdown:

  • glob_wildcards("data/{sample}.txt"): searches for all files matching the pattern in the data directory.
  • {sample}: captures the variable part of each matched file path.

glob_wildcards() returns a named tuple with one list per wildcard; the .sample attribute gives the list of values matched by {sample}, i.e. the file names without the directory or the .txt extension.

You can use this method to generate multiple sample names from a directory of files without having to list them all explicitly. You can glob more than one thing in your filename. For example:

# Extract both sample name and condition from filenames
SAMPLES, CONDITIONS = glob_wildcards("data/{sample}_{condition}.txt")

# Create all combinations
rule all:
    input: 
        expand("results/{sample}_{condition}.lines", 
               zip, sample=SAMPLES, condition=CONDITIONS)

rule count_lines:
    input: "data/{sample}_{condition}.txt"
    output: "results/{sample}_{condition}.lines"
    shell:
        "wc -l {input} | awk '{{print \$1}}' > {output}"

When you are writing your own glob patterns, it's important to double check what is happening by having print statements at each step. You can even create a python file just to practice globbing.

Activity: For example, let's create a file named glob_test.py with the following code to make sure that what we're globbing is correct. Then in your terminal, run python glob_test.py to double check that you are getting the right wild cards.

import glob
import os
from snakemake.io import glob_wildcards

# Test globbing
SAMPLES = glob_wildcards("data/{sample}.txt").sample
print(f"Found samples using glob_wildcards: {SAMPLES}")

SAMPLES = []
for txt_file in glob.glob("data/*.txt"):
    sample_name = os.path.basename(txt_file).replace(".txt", "")
    SAMPLES.append(sample_name)

print(f"Found samples using regular python glob: {SAMPLES}")

If you have a more complicated file pattern that you want to match, LLMs are decent at generating regular expressions to capture the parts of paths that you are interested in. A combination of feeding the LLMs some sample file paths, the part you want to extract, and trial and error with a test script, is a good way to do more advanced globbing.

Using samplesheets

While globbing for your files is effective, it does have some drawbacks.

  1. It requires your files to be in an organized pattern that can be easily matched with wildcards, so it can be less flexible when dealing with complex directory structures or varying file names.
  2. There is no record of the original file paths, which can make debugging more difficult. If you accidentally add a file that you didn't want, but that matches the glob, it'll have downstream effects.
  3. Files that don't match the glob are silently skipped over, which can lead to incomplete analysis if you're not careful.

Another way to manage your inputs is by listing your samples in a samplesheet file and reading that in. Because Snakemake is written in Python, you can use any Python code to read in your samplesheet file.

Activity: To follow along, create a samplesheet called samplesheet.txt in the develop directory with the following contents:

sample1
sample2

Now, we can use base python to read that text file at the top of our snakefile. Create a new snakefile based off of dev-05.smk and call it dev-samplesheet.smk. Replace the first line SAMPLES = ["sample1", "sample2"] with:

SAMPLES = []
with open("samplesheet.txt", "r") as f:
    SAMPLES = [line.strip() for line in f.readlines()]

What this does is read the samplesheet.txt file line by line, strips away whitespace (line.strip()), and stores the result in the SAMPLES list. This SAMPLES list is identical to ["sample1", "sample2"], except that it was generated dynamically from the contents of the samplesheet.txt file.

Question: What do you think would happen if we had sample3 in our samplesheet.txt file? Try it out!

For more complicated samplesheets, you can use the pandas library to read in a CSV or other delimited file. This is useful if you have a collection of metadata associated with each sample, or a column of sample names plus columns of file paths. pandas comes pre-installed when you install Snakemake. Below is an example CSV of forward and reverse reads and how it might be parsed into a workflow:

sample,R1,R2,condition
sample1,/path/to/data/sample1_R1.fastq,/path/to/data/sample1_R2.fastq,sample
sample2,/path/to/data/sample2_R1.fastq,/path/to/data/sample2_R2.fastq,sample

Now we read it using pandas and then parse out the sample column to a list. For the file paths, we create a dictionary where the key is the sample name, and the value is the R1 or R2 file path. Then, in the input section, we use a lambda function to dynamically retrieve the correct file paths for each sample.

import pandas as pd

df = pd.read_csv("samplesheet.txt")
SAMPLES = df["sample"].tolist()
R1_FILES = dict(zip(df["sample"], df["R1"]))
R2_FILES = dict(zip(df["sample"], df["R2"]))

rule all:
    input:
        expand("results/{sample}.lines", sample=SAMPLES)

rule process_sample:
    input:
        R1=lambda wildcards: R1_FILES[wildcards.sample],
        R2=lambda wildcards: R2_FILES[wildcards.sample]
    output:
        "results/{sample}.lines"
    shell:
        "cat {input.R1} {input.R2} | wc -l > {output}"

Note though that in order to propagate information about your filenames back through other rules, we are still using wildcards ({sample}). So while you can be less organized when using samplesheets, good file naming conventions are still crucial.

script: Passing information to external scripts

So far we have just used shell commands in our rules. Sometimes, we might want to use external scripts in python, R, or other languages. To do this, we can use the script: directive in our rules in place of shell:. Within python or R scripts, you can access Snakemake input, output, and other objects using the snakemake object that will be passed to that script's environment. For example:

Here is what it might look like for writing a rule based on a python script

rule run_python_script:
    input: "data.csv"
    output: "results.txt"
    script: "scripts/my_script.py"
import pandas as pd

df = pd.read_csv(snakemake.input[0])

# ...
# Do some processing
# ...

df.to_csv(snakemake.output[0])

Similarly, this is how it might look inside an R script:

library(dplyr)

df <- read.csv(snakemake@input[[1]])

# ...
# Do some processing
# ...

write.csv(df, snakemake@output[[1]])

In the following sections, we'll learn about more customization Snakemake can do. Things like params, configs, etc, are also passed to the external scripts through the snakemake object and can be accessed in the same way.

Additional directives: customizing how rules are run

In this section we will learn important concepts for controlling the execution of rules in Snakemake. Most bioinformatics workflows are complex and require parameters, compute resources, and software dependencies. In addition to the typical input, output, and shell/script/run directives, Snakemake has other directives to use within rules.

Params

A params directive can be used to specify parameters for a rule. These parameters can be accessed within the shell command or script and can be used to customize the behavior of the rule. Params are typically used to define options for tools or scripts that are being called within the rule.

For example, if I wanted to change my shell script to take a parameter, it might look something like this:

rule param_shell:
    input: "data.csv"
    output: "results.txt"
    params: param1="value"
    shell:
        "some_command --param1 {params.param1} {input} > {output}"

If I had an external R script that was running a function, I could pass parameters to it like this:

rule run_r_script:
    input: "data.csv"
    output: "results.txt"
    params: param1="value"
    script: "scripts/my_script.R"

And my R script might look like this:

param1 <- snakemake@params[["param1"]]
some_function(snakemake@input[[1]], param1)

Under the hood, the snakemake object has all the directive information of the rule and those directives (such as params, input, output, etc) can be accessed with the @ symbol and then keywords within double square brackets.

In the above, I gave my parameter the name param1 and now I can access it by that name in the R script. You can have as many parameters as you need, separated by commas, and they can all be accessed via keyword within the script.

If instead of an R script, I had a python script, I would access the params like this:

param1 = snakemake.params["param1"]
some_function(snakemake.input[0], param1)

The snakemake object in python has the same structure, but you access the directives using . instead of @ and single brackets for the keywords.

Often, parameters are the - or -- options in command-line tools. For example, in bwa mem, you have the option of specifying -t for number of threads and -v for verbosity, among others. The rule might look something like this:

rule align_sequences:
    input: "raw_data/{sample}.fastq"
    output: "aligned/{sample}.bam"
    params:
        verbosity=4,
        num_threads=4,
    shell:
        "bwa mem -t {params.num_threads} -v {params.verbosity} reference.fasta {input} > {output}"

Because of the way we named these parameters in the rule, the command line actually becomes more readable. Be sure to separate parameters with commas; that is required Snakemake syntax.

Importantly, just because we put the number of threads as a parameter, that doesn't mean that the rule will actually run with that many threads. Parameters are just keywords that you tell the tool to use; they don't enforce any behavior on their own. In the next section, we will see how to specify the actual compute resources that a rule should use. If we try to run this rule on a machine that doesn't have the required number of CPUs, it may run more slowly because bwa is expecting 4 CPUs.

Params can also be global, meaning they can be defined outside of a rule and apply to the entire workflow. For example, you might want to have a param called workdir which points to where you want your working directory and all your files to be written. If your workflow has many parameters, it is a good idea to collect them all into a config file, which we will discuss later.
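
As a minimal sketch of that idea (the RESULTS_DIR variable here is just an illustration, not part of our workshop files), a workflow-wide value can be defined at the top of the snakefile and used inside rules:

RESULTS_DIR = "results"   # a workflow-wide setting defined once at the top

rule count_lines:
    input: "data/{sample}.txt"
    output: RESULTS_DIR + "/{sample}.lines"
    shell:
        "wc -l {input} | awk '{{print $1}}' > {output}"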

Resources

Resource management is crucial when running your workflow on a shared resource. If resources are left unspecified, a workflow run locally will simply have access to all of your computer's resources, while on an HPC cluster (like Cannon) the default resources allocated to each job are too low even for the smallest data analyses.

Specifying CPUs

Snakemake has a built-in directive called threads that both handles CPU allocation and lets you pass that number to the shell directive. You can specify the number of threads a rule should use like this:

rule my_rule:
    input: "data.csv"
    output: "results.txt"
    threads: 4
    shell:
        "some_command {input} -t {threads} > {output}"

If used on a cluster with an executor plugin (e.g. the SLURM plugin), the threads directive automatically sets the cpus_per_task option under resources to the same number and also passes it to the shell command.
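
For example, assuming the SLURM executor plugin is installed, a run might be submitted like this (a sketch; the exact flags depend on your cluster setup and profile):

snakemake --snakefile develop/dev-05.smk --executor slurm --jobs 10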

The resources: directive

We can also reserve arbitrary resources like memory using the resources directive:

rule my_rule:
    input: "data.csv"
    output: "results.txt"
    threads: 4
    resources:
        mem_mb=24000,   # 24 GB
        runtime=60      # walltime in minutes (1 hour)
    shell:
        "some_command {input} -t {threads} > {output}"

These resources will be used locally or on a per-job basis if submitted to the cluster.

To learn more about the resources directive, see the Snakemake docs.

Cluster resources

When implementing your pipeline to work with a cluster executor, there may be additional resources you can specify. For the SLURM executor, they include things like slurm_partition, and nodes. Many of these won't be necessary for simple pipelines, but you should be aware they exist. Besides slurm_partition, you will still mainly use the threads: directive to specify CPUs and the other default resource options discussed above.
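
As a sketch of what this can look like (the partition name and numbers below are assumptions, not values for any particular cluster), a rule submitted through the SLURM executor might request:

rule big_job:
    input: "data.csv"
    output: "results.txt"
    threads: 8
    resources:
        mem_mb=16000,                # memory in MB
        runtime=120,                 # walltime in minutes
        slurm_partition="shared"     # assumed partition name; check your cluster
    shell:
        "some_command {input} -t {threads} > {output}"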

Software

Another important component of a workflow is software dependencies. So far, we have been using basic bash or python code (or pseudocode), but in practice, we will be using specific software tools and libraries. One of the benefits of workflow managers is that you can specify a totally separate software environment for each rule, including different versions of software or even different programming languages. This allows you to keep your rules modular and contained so that they can be easily reused or modified without affecting other parts of the workflow.

If you already have a tool installed in your environment (i.e. you can type <tool_name> and not get an error), then you can simply use that tool in the shell: directive. For example, knowing that mafft is already installed, you might have a rule that looks like this:

rule mafft_aln:
    input: "{locus_id}.fa"
    output: "{locus_id}.aln"
    shell: "mafft {input} > {output}"

However, in Snakemake, you can also specify which software environment to use for each rule, using either the conda directive or the container directive. This means users don't necessarily have to install the software themselves; they only need the environment specifications, which you can easily provide.

Conda

You can use the conda directive in a few ways. The first way is to specify an existing conda environment that has already been created. You can reference the environment by name or by the path to the environment.

rule some_rule:
    input: "data.csv"
    output: "output.txt"
    conda: "my_env_name"
    script: "scripts/my_script.R"

Or using the full path to the environment's folder:

rule some_rule:
    input: "data.csv"
    output: "output.txt"
    conda: "/path/to/my_env_name"
    script: "scripts/my_script.R"

However, doing this means that you need to manage the conda environment yourself. This is an issue if you want to run this script on a different computer where the environment doesn't yet exist. Another way to specify a conda environment is to use a yaml file that contains the requirements for the environment you want the rule to run in. You can get the yaml file by writing it yourself, like:

name: my_env_name
channels:
  - conda-forge
dependencies:
  - pandas=1.3.0
  - numpy=1.21.0
  - scikit-learn=0.24.2

You would then save that yaml file in your project directory. Perhaps under a folder named envs. And then you would use it in your Snakemake rule like this:

rule some_rule:
    input: "data.csv"
    output: "output.txt"
    conda: "envs/my_env_name.yaml"
    script: "scripts/my_script.py"

This causes Snakemake to create a temporary conda environment based on the specifications in the yaml file. This environment will be activated whenever the rule is executed, ensuring that all the required dependencies are available. This environment will be cached so that any other rule that uses it or any subsequent execution of the same rule can reuse it without having to recreate it.

The major benefit of using this yaml file is that it will always travel with your project so that it essentially makes the environment documented and portable. By specifying the version numbers of each software package, you can better ensure that your workflow will have consistent behavior.

Use the screen command when creating conda environments

Creation of the conda environments can take a while the first time you run a workflow. If you don't want to get stuck waiting for it to finish, you can use the screen command in your terminal to create a detachable session. This way, you can work on other things while your environments are created. (This mostly applies to development. In production, you should submit your workflow using an SBATCH script.)
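
A minimal usage sketch (the session name is arbitrary):

screen -S snakemake-envs        # start a named, detachable session
# ... run your snakemake command here, then detach with Ctrl-a d ...
screen -r snakemake-envs        # reattach to the session later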

Containers

Containers are a way of packaging all the requirements of a piece of software into one image file. Containers are run by software such as Docker or Singularity. You can usually find container images on Docker Hub or the BioContainers registry. Here is an example of a rule using the software mafft from the BioContainers registry.

rule mafft:
    input: "data.fasta"
    output: "data_aligned.fasta"
    container: "biocontainers/mafft:7.475--hdfd78af_0"
    shell:
        "mafft {input} > {output}"

Once you have defined your rule with the container directive, you can then run Snakemake with the option --sdm conda singularity ("sdm" stands for "software deployment method"). Then, depending on whether you've used the conda or container directive in each rule, Snakemake will automatically create and manage the necessary environments for you.
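
For example, once rules have conda: directives, a run of our development workflow might be launched like this (a sketch; add whichever deployment methods your rules actually use):

snakemake --snakefile develop/dev-customization.smk -j 1 --sdm conda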

Remember the workflow profile!

Exercise

Exercise: Let's go back to our combine_counts rule. Imagine that your collaborator decided that they wanted to combine counts in a different way, using the Python library pandas. Also, your collaborator wants to add an option to return a CSV instead of a TSV. Thankfully, your collaborator has provided you with a Python file, which can be found in scripts/combine_counts.py. Additionally, you've been given a yaml file for a conda environment, which can be found in envs/combine_counts.yml.

As a refresher, this is the original combine_counts rule:

rule combine_counts:
    input:
        lines="results/{sample}.lines",
        words="results/{sample}.words"
    output:
        summary="results/{sample}.summary"
    shell:
        """
        echo -n "lines\t" > {output.summary}
        cat {input.lines} >> {output.summary}
        echo -n "words\t" >> {output.summary}
        cat {input.words} >> {output.summary}
        """

Your task is to modify the combine_counts rule to do the following:

  1. Create a new snakefile called dev-customization.smk and copy over the contents of dev-05.smk. Then we will work with the combine_counts rule for the next items.
  2. Modify the shell command to call the combine_counts.py script instead of using shell commands.
  3. Add a parameter called output_format with a default value of tsv.
  4. Add a conda environment directive that points to the envs/combine_counts.yml file.
  5. Add a default resource of 1 thread and 1024 MB of memory
  6. Run the workflow for real to make sure it works
Solution

See complete/dev-customization.smk.

Logs

Our final topic in this section is the log directive. This allows you to record a log file for each individual rule. In fact, as we will see soon, it allows you to record a log file for each instance of a rule (i.e. each time a rule runs on a different piece of data). But first, let's discuss how to implement logging in our Snakemake workflow.

The first thing we need to do is to add the directive log to the rule. Then we need to specify a place for the log to go. Typically, this is a folder called "logs" that you create in your project directory.

rule aggregate:
    input:
        expand("results/{sample}.summary", sample = SAMPLES)
    output:
        "results/aggregate-summary.tsv"
    log:
        "logs/aggregate.log"
    shell:
        """
        echo -e "sample\tlines\twords" > {output}
        for summary_file in {input}; do
            SAMPLE_NAME=$(basename "$summary_file" .summary)
            LINES=$(grep -e "^lines\t" "$summary_file" | cut -f2)
            WORDS=$(grep -e "^words\t" "$summary_file" | cut -f2)
            echo -e "$SAMPLE_NAME\t$LINES\t$WORDS" >> {output}
        done &> {log}
        """

We have also modified the shell command, because just declaring the log file is not enough: you need to make sure that the stdout and stderr streams of your command are redirected to it. stdout (standard output) is what would be printed to the screen if you ran the command in a terminal window, and stderr (standard error) is where warnings and error messages are printed. By adding &> {log} to the end of the command, you ensure that both streams are captured in the log file.

Another way of incorporating a log into your run is to use the {log} variable directly if your command-line tool already has an option for writing a log file. For example:

rule abc:
    input: "input.txt"
    output: "output.txt"
    log: "logs/abc.log"
    shell: "somecommand --log {log} {input} {output}"

If you are running a script, you can use the same redirect to write any print statements to the log:

rule print_log:
    input: "input.txt"
    output: "output.txt"
    log: "logs/print_log.log"
    shell: "python scripts/print_log.py {input} {output} &> {log}"

Another way to do this is to use the built-in logging library in python or the logger package in R to write to the log files. So within your python or R script, you would have lines that are like print statements except they will be time-stamped and written to your choice of std out or std err. This gives you more fine-grained control over how your script records itself.
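
For instance, a minimal sketch of using Python's built-in logging module inside a script called by a rule (the message and format are just illustrations), so that time-stamped messages end up in the rule's log file via the &> {log} redirect:

import logging
import sys

# send time-stamped messages to stderr, which the rule's redirect captures in {log}
logging.basicConfig(stream=sys.stderr, level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logging.info("starting to process input files")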

In the above rule, aggregate only runs once, so we can just name the log file based on the rule. However, the other rules run multiple times, based on the number of samples. We can add wildcards to the log file name so that each instance of the rule gets its own log file.

rule count_lines:
    input: "data/{sample}.txt"
    output: "results/{sample}.lines"
    log: "logs/{sample}_count_lines.log"
    shell:
        "wc -l {input} | awk '{{print \$1}}' > {output} 2> {log}"

It is good practice to name your logs based on the rule, so that you know where it is coming from. In a more complex workflow, you can instead have directories for each rule's logs.

Getting values from a config file

The final basic Snakemake topic we'll cover is about using a config file to store parameters and other values that you want to use in your workflow. This is useful for keeping your workflow modular and easy to modify. A config file is typically a yaml file that contains key-value pairs, where the key is the name of the parameter you use in your workflow. For example, if in my config file I have a key : value pair such as input_dir: data/, then I can access that value in my snakefile using the dictionary lookup config["input_dir"]. The config object is a built-in dictionary that Snakemake automatically creates when you provide a config file.

Config values can be accessed anywhere in the snakefile, including in rules, input functions, the params directive, or even before any rules during any sort of pre-processing. Scalar values keep their YAML types: unquoted numbers are read as integers or floats, while quoted values are read as strings.

Config values can also contain lists. This can be done with inline syntax, as in my_list: [item1, item2, item3], or with block YAML syntax:

my_list:
    - item1
    - item2
    - item3

Likewise, config values can contain dictionaries, which can be nested. For example:

my_dict:
    key1: value1
    key2: value2
    key3:
        nested_key1: nested_value1
        nested_key2: nested_value2

These would be read by the Snakemake script as regular Python data structures, so you would access the list as config["my_list"] and the dictionary as config["my_dict"]["key1"] or config["my_dict"]["key3"]["nested_key1"].
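
For example, given a hypothetical config file with the keys below (these names are illustrative, not the ones used in the exercise), the values could be read near the top of the snakefile:

# config.yaml (hypothetical):
# min_quality: 30
# samples:
#   - sample1
#   - sample2

MIN_QUALITY = config["min_quality"]   # an integer, as written in the YAML
SAMPLES = config["samples"]           # a Python list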

Exercise: Copy the contents of dev-05.smk to a new file called dev-config.smk. Then, create a config file called dev-config.yaml in the same directory. Modify the Snakemake script and config file so that the following values are stored in the config file and accessed in the Snakemake script: sample_list, input_dir, and output_dir. Then assign the values in the config file, changing the output directory if desired. Make sure the workflow still runs correctly with either a dryrun or a full run. Recall that to run Snakemake with a config file, you need to provide the --configfile option in the command line.

Tip

While you can replace every instance of a hardcoded value with a config look-up (e.g. config["input_dir"]), it is often easier to first assign the config values to variables at the top of your snakefile, as we have done with the SAMPLES list. This may require other changes to how directories are defined in the context of wildcards (e.g. input: INPUT_DIR + "/{sample}.txt" instead of input: "data/{sample}.txt", or some other method of joining strings that doesn't interfere with the wildcards).

Solution

See complete/dev-config.smk and complete/config.yaml.

Now you are ready to share your workflow with collaborators so they can replicate your analysis, or use it on their own data. Just be sure to provide a config template!


Beyond One-to-One: Aggregating, Splitting, and Filtering Rules

Many Snakemake rules take one or many input files and produce the same number of output files. These are one-to-one rules, for example our count_lines rule:

rule count_lines:
    input: "data/{sample}.txt"
    output: "results/{sample}.lines"
    shell:
        "wc -l {input} | awk '{{print $1}}' > {output}"

This rule takes as many .txt files as samples we have and produces the same number of .lines files. This is accomplished simply by having the same wildcard(s) in the input and output sections, in this case {sample}.

Aggregation: many-to-one

We've also already covered combining multiple input files to a single output file, or aggregation, in our aggregate rule:

rule aggregate:
    input: expand("results/{sample}.summary", sample = SAMPLES)
    output: "results/combined.summary"

Here, we're using the expand() function to define many input files in the input section, all of which are used in our shell command, which produces a single output file. Here, the wildcard {sample} is defined in the input section, but not used in the output section. The {sample} wildcard is also propagated back to the previous rule(s) to be used in their input and output sections.

Filtering: many-to-fewer

A very common scenario in data analysis is a filtering step, where because of some quality metric or heuristic or experimental requirement, the number of output files is fewer than the number of input files in a rule. This is complicated for Snakemake, because the whole philosophy of it is that you know the expected outputs ahead of time. But when filtering, you often do not know which samples will pass your filtering thresholds.

Fortunately, it is possible, but requires us to introduce two slightly more advanced concepts: input functions and checkpoints.

Input functions

Up until now, we've been specifying our input sections either as hardcoded individual files (as in rule all), which Snakemake treats as a list of length 1, or as lists of files defined by wildcards. However, we always want to re-iterate that a Snakemake script is at its core a Python script, so we can do Python things like write functions. Sometimes you may want to use more complex logic to determine the input files for a rule. In these cases, you can write a function that returns a list of files to be used in the input section of a rule. These are called input functions and are a common way to generate lists of input files.

For example, in our rule aggregate we define our list of input files based on the samples stored in the SAMPLES list, with {sample} becoming our wildcard:

rule aggregate:
    input:
        expand("results/{sample}.summary", sample = SAMPLES)

Since this is just one line of code, we can just stick this in the input section. But if we really wanted to, we could instead put this in an input function:

def get_aggregate_input(wildcards):
    my_samples = expand("results/{sample}.summary", sample = SAMPLES)
    return my_samples

rule aggregate:
    input:
        get_aggregate_input

Now, in our input section, instead of the expand() function directly, we have a call to our function, get_aggregate_input.

Question: What do you notice that's different about this function call compared to other Python function calls?

Answer

Here, there are no trailing parentheses, (), which are usually required after a function call in Python. This is because we are not directly calling the function, but rather telling Snakemake to call the function for us. Relatedly, you'll also notice that in our function definition, one argument, wildcards, is required. This is automatically passed to the function by Snakemake and so it must be included as an argument in all input function definitions. The wildcards object in the function is then a class object with which we can access our defined wildcards, as you'll see below.

Tip

However, you can still pass other arguments to an input function. To do so, wrap the call in a lambda so that Snakemake can still supply the wildcards object, e.g. input: lambda wildcards: get_aggregate_input(wildcards, my_arg1, my_arg2). The lambda receives wildcards from Snakemake as its first argument and forwards it, along with your extra arguments, to the input function, whose definition must then be def get_aggregate_input(wildcards, my_arg1, my_arg2).

The wildcards argument

As mentioned in the answer above, the wildcards argument is always passed by Snakemake to the input functions, so your input functions must be defined to accept it. However, in our example above, we didn't actually use it (because at that point in our workflow we didn't have any wildcards defined yet). To use it, you can access the current value of a given wildcard by typing wildcards.<wildcard name>. So, for example, if we made an input function for the count_lines rule, we may do something like this:

def get_count_lines_input(wildcards):
    current_sample = wildcards.sample
    current_file = f"data/{current_sample}.txt"
    return current_file

rule count_lines:
    input: get_count_lines_input

The rest of the rule remains unchanged.

Here, the input function get_count_lines_input is called each time the rule is invoked for a different wildcard ({sample}), and wildcards.sample returns the value of the current sample, either sample1 or sample2 in our small example.

Exercise: Copy the contents of dev-05.smk to dev-06.smk. Modify dev-06.smk so that the rule combine_counts takes a single input function in its input section. Because input directives expect files to be returned from an input function as a list, you will also need to modify the shell command.

Hint

Recall that the input object that Snakemake receives is also a list of files. For rules with just a single input file (e.g. count_lines and count_words), this is a list of length 1. For combine_counts it is a list of two files, a .words file and a .lines file. Since the easiest thing is to return a list from our input function, and since the shell directive currently accesses the inputs by name, we'll have to modify the command to use list indices instead of names.

Alternatively, our input function could return a dictionary, and we could keep using named inputs in the shell directive, if we wrap our input function call in unpack().
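
For example, a dict-returning version might look roughly like this (the results/ paths and the input names lines and words are assumptions; match them to whatever dev-05.smk actually uses):

def get_combine_counts_input(wildcards):
    # The keys of the returned dict become the named inputs of the rule
    return {
        "lines": f"results/{wildcards.sample}.lines",
        "words": f"results/{wildcards.sample}.words",
    }

rule combine_counts:
    input:
        unpack(get_combine_counts_input)

The rest of the rule, including any {input.lines} and {input.words} references in the shell command, could then stay as it is.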

Solution

See complete/dev-06.smk

Checkpointing

A rule can be converted into a checkpoint if you don't know exactly the files that will be output from it. This is especially helpful in our scenario of filtering samples from a dataset. However, checkpoints can be complicated.

Activity: Take a look at complete/checkpoint-01.smk.

This is the same workflow as our initial dev-05.smk. It takes text files, counts the number of lines and words in them, combines those counts within each sample, then aggregates the counts across samples so we get a single output file, aggregate-summary.csv.

Except, now we've inserted a checkpoint perform_qc:

checkpoint perform_qc:
    input:
        expand("results/{sample}.summary", sample = SAMPLES)
    output:
        "results/qc-manifest.txt"
    run:
        with open(output[0], "w") as manifest:
            for input_file in input:
                sample_name = input_file.split("/")[-1].replace(".summary", "")

                second_line = open(input_file).readlines()[1].strip()
                num_words = int(second_line.split("\t")[1])
                print("NUM WORDS:", num_words)
                if num_words <= 7:
                    manifest.write(f"{sample_name}\n")

This checkpoint acts like a rule, and we use a run: directive to run Python code directly within the Snakemake script. The run block looks at all the .summary files output by combine_counts and writes a manifest file containing the names of the samples that pass our filter. In this case, our filtering logic is simple: only samples with 7 or fewer words are written to the manifest file.

This checkpoint is paired with an input function in the next rule, get_aggregate_input:

def get_aggregate_input(wildcards):
    manifest = checkpoints.perform_qc.get().output[0] # Wait for checkpoint and get output path
    with open(manifest) as mf:
        samples = [line.strip() for line in mf if line.strip()]
    return [f"results/{sample}.summary" for sample in samples]

The key line here uses the checkpoints object to get the output of our perform_qc checkpoint:

manifest = checkpoints.perform_qc.get().output[0]

This means that when Snakemake works backwards from rule aggregate, it knows it first has to generate the output of perform_qc. Then we have some logic that opens that manifest file, reads the sample names, and compiles the expected inputs after filtering.

Exercise: Do a dryrun and generate a rulegraph and DAG for this workflow. Note: be sure to move or remove any old outputs or else nothing will happen!

# The dryrun:
snakemake -j 1 -s complete/checkpoint-01.smk --dryrun

# Create rulegraph:
snakemake -j 1 -s complete/checkpoint-01.smk --rulegraph | dot -Tpng > checkpoint-01-rulegraph.png

# Create DAG
snakemake -j 1 -s complete/checkpoint-01.smk --dag | dot -Tpng > checkpoint-01-dag.png

What do you notice about the outputs of these commands?

One answer

The rulegraph is basically the same as dev-05.smk, just with the addition of the perform_qc checkpoint.

Exercise: Run the workflow. What do you notice about the final output that is different than before? Note: be sure to move or remove any old outputs or else nothing will happen!

This checkpoint is relatively easy to understand because the filtering is done right before a step that aggregates the outputs. In other words, the checkpoint and rule requiring the input function are adjacent steps. However, things can seem more complicated with intervening rules.

Filtering without aggregating

We can filter without immediately aggregating as well, using wildcards both before and after the filter step.

Activity: Take a look at complete/checkpoint-02.smk.

Here, we've moved the perform_qc checkpoint so that it runs before the combine_counts rule instead of after it. It now accepts the .lines and .words files as input, and combine_counts in turn accepts the manifest file as an additional input. The filtering logic remains the same, as does the input function to aggregate.
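
As a rough sketch of the key change (the paths and input names here are illustrative rather than copied from the file; see complete/checkpoint-02.smk for the real thing), combine_counts now lists the manifest alongside its per-sample inputs:

rule combine_counts:
    input:
        lines = "results/{sample}.lines",
        words = "results/{sample}.words",
        manifest = "results/qc-manifest.txt"  # produced by the perform_qc checkpoint

Listing the manifest as an input is what ties each combine_counts job to the checkpoint; the rest of the rule stays as before.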

Exercise: Do a dryrun and generate a rulegraph and DAG for this workflow. Note: be sure to move or remove any old outputs or else nothing will happen!

# The dryrun:
snakemake -j 1 -s complete/checkpoint-02.smk --dryrun

# Create rulegraph:
snakemake -j 1 -s complete/checkpoint-02.smk --rulegraph | dot -Tpng > checkpoint-02-rulegraph.png

# Create DAG
snakemake -j 1 -s complete/checkpoint-02.smk --dag | dot -Tpng > checkpoint-02-dag.png

What do you notice about the outputs of these commands?

Some answers
  • In the dryrun, combine_counts doesn't show up!
  • In the dryrun, the text "DAG of jobs will be updated after completion." is displayed after the perform_qc checkpoint.
  • combine_counts doesn't appear in either the rulegraph or the DAG!

This is because Snakemake can't determine exactly what the inputs to combine_counts will be until the checkpoint has run, so it can't report anything about it beforehand.

Also note that we use expand() in the perform_qc checkpoint to define the wildcards before filtering, which are propagated back to count_lines and count_words. However, we also use {sample} in combine_counts. The list of wildcard values after filtering is defined in the input function get_aggregate_input and propagated backwards after the checkpoint is run. Unfortunately, you can't change the name of the wildcard after filtering, which would aid in clarity...

Optional: You can run this workflow if you want! Since we didn't change the filtering logic, the output should be the same as for complete/checkpoint-01.smk.

Splitting: one-to-many

Splitting one file into many, so the parts can be processed separately, can be a great way to speed up a workflow. How straightforward it is depends on whether you know the splits beforehand.

Deterministic splitting

If you have a list of the splits you want to make beforehand (say, splitting a genome up by chromosomes), then splitting is similar to aggregating, but in reverse (complete/split-01.smk):

SPLITS = ["apple", "cherry", "dragonfruit"]

rule all:
    input:
        expand("split/{part}.txt", part=SPLITS)

rule split_file:
    input:
        "complete/split-data.txt"
    output:
        expand("split/{part}.txt", part=SPLITS)
    run:
        # Read full file contents:
        for line in open(input[0]):
            data = line.strip()
            if data in SPLITS:
                with open(f"split/{data}.txt", "w") as out:
                    out.write(data + "\n")

The input file, complete/split-data.txt, looks like this:

apple
banana
cherry
dragonfruit

This is a very simple workflow that writes each line of the input file to its own output file, but only if that line appears in our pre-defined list: SPLITS = ["apple", "cherry", "dragonfruit"].

Could we split in parallel?

The way we've written this rule, we expect it to run exactly once, and the code accommodates this: in the run block, we read the whole input file and check each line against the whole SPLITS list. This is also reflected in the use of expand() in the output directive, meaning the output is one big list of files rather than a single file defined by a wildcard.

However, we could imagine writing this another way, in which the rule runs once per element in SPLITS. In that case, we'd have to change the logic in our run block and use a wildcard in the output. The trade-off is that we'd be opening and reading the whole input file once for every part we want to split, which can be slow for large files, so be mindful of this when you write your rules. A sketch of that version is shown below.
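
Here is a minimal sketch of that per-part version (not one of the workshop files):

rule split_file:
    input:
        "complete/split-data.txt"
    output:
        "split/{part}.txt"
    run:
        # Runs once per value of {part}, so the input file is re-read for every part
        with open(output[0], "w") as out:
            for line in open(input[0]):
                if line.strip() == wildcards.part:
                    out.write(line.strip() + "\n")

rule all stays the same, still requesting all the files in SPLITS via expand(); Snakemake would then run one split_file job per part, potentially in parallel.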

Non-deterministic splitting

Splitting when you don't know the splits beforehand (say, splitting a genome at runs of Ns, or writing a general pipeline for a new genome where you don't yet have a list of chromosomes) is more complex and is handled similarly to filtering: with a checkpoint and an input function. Here is a small example (complete/split-02.smk):

# Input function to discover splits.
def get_split_outputs(wildcards):
    manifest = checkpoints.split.get().output[1]
    with open(manifest) as f:
        splits = [line.strip() for line in f if line.strip()]
    return [f"splits/{split}.txt" for split in splits]

rule all:
    input: get_split_outputs

# Checkpoint splits the file and writes a manifest.
checkpoint split:
    input:
        "complete/split-02-data.txt"
    output:
        directory("splits"),
        "manifest/splits.txt"
    run:
        import os

        os.makedirs("splits", exist_ok=True)
        splits = []
        with open(input[0]) as infile:
            lines = [line.strip() for line in infile if line.strip()]
            for l in lines:
                outfile = f"splits/{l}.txt"
                with open(outfile, "w") as out:
                    out.write(l + "\n")
                splits.append(l)
        os.makedirs("manifest", exist_ok=True)
        with open(output[1], "w") as mf:
            for s in splits:
                mf.write(s + "\n")

This does the same as the previous example, but instead of being provided a list of splits (SPLITS = ["apple", "cherry", "dragonfruit"]), we want a general rule that captures every line in the file, even if we don't know the file's contents beforehand. This means that, as with filtering, we won't know the outputs of our splitting until the rule is run, hence the need for the checkpoint and input function.
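
If you want to explore it, the same commands as before should work, just pointed at this file, for example:

# The dryrun:
snakemake -j 1 -s complete/split-02.smk --dryrun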

End Part 2

Let us know if you have any questions!