Being reproducible with Snakemake

Learning outcomes

After having completed this chapter you will be able to:

Create and use an input function
Deploy a conda environment and run scripts within it
Deploy a Docker/Singularity container and run scripts within it

Material

Reproducibility in Snakemake:

Download the presentation

Additional advanced concepts:

Download the presentation

Exercises

In this series of exercises, we will create the last two rules of the workflow. Each rule will execute a script, one in Python and one in R, and both rules will have dedicated environment that you will need to take into account in the snakefiles. These last two rules are quite different from the previous ones, so it would be a good idea to implement them in a new snakefile in workflow/rules, for example called analysis.smk.

Development and back-up

During this session, we will modify our snakefiles quite heavily, so it may be a good idea to start by making a back-up: cp -r worklow/ worklow_backup. As a general rule, if you have a doubt on the code you are developing, do not hesitate to make a back-up.

Hint

This is not a programming course, so you won’t need to write the scripts: they were already prepared for you!

Creating a rule to gather read count files

To perform a Differential Expression Analysis (DEA), it is easier to have a single file gathering all the read counts of the different samples.

Exercise: Implement a rule to list and merge read count files (coming from rule reads_quantification_genes) into a single file using the Python script provided [here]https://raw.githubusercontent.com/sib-swiss/containers-snakemake-training/main/scripts/solutions/day2/session4/workflow/scripts/count_table.py).

Information on the script to compute the table

You can download the script with wget https://raw.githubusercontent.com/sib-swiss/containers-snakemake-training/main/scripts/solutions/day2/session4/workflow/scripts/count_table.py
It goes in a special directory in your workflow
It is written in Python
It takes a list of files as input, each file being the read count output of featureCounts
It produces one output, a tab-separated table containing all the read counts of the different sample gathered by gene

Hint

While the goal of this rule is quite easy to grasp, setting it up requires using several advanced notions of Snakemake, so here is a little outline of the steps you should take:

Build the basic structure of your rule: name, output, log, benchmark
- Memory should be set at 500 MB
- Threads should be set at 1
- Don’t focus on the inputs for now
Think about the directive you want to use to run the script
- Looking at the script length with wc -l <path/to/script> could help you decide
Think about the location/path of the script
Check the beginning of the script to see if you need any special Python packages. You can do that with head <path/to/script>. If you see lines containing import <package_name>, it means that the script is using external Python packages
- If the script is using external packages, think on how can you provide them
Finally, identify the inputs your rule needs
- Think on how you can easily list and provide the inputs. Does it remind you of something you saw in the course?

Now, let’s solve these problems one by one!

Building the rule structure

You have done that a few times already, so it should not be too difficult.

Exercise: Set the output, log, benchmark, resources and thread values.

Answer

rule count_table:
    '''
    This rule merges all the gene count tables of an assembly into one table.
    '''
    input:
        ?
    output:
        count_table = 'results/total_count_table.tsv'
    log:
        'logs/total_count_table.log'
    benchmark:
        'benchmarks/total_count_table.txt'
    resources:
        mem_mb = 500
    threads: 1
    ?:
        ?

Note that we do not need to use wildcards in the output name: only one file will be created, and its name will not change depending on the sample name, because we use all the samples to create it.

Getting the Python script and running it

Exercise: Download the script and place it the proper folder. Remember that per the official documentation, scripts should be stored in a subfolder workflow/scripts.

Answer

wget https://raw.githubusercontent.com/sib-swiss/containers-snakemake-training/main/scripts/solutions/day2/session4/workflow/scripts/count_table.py  # Download the script
mkdir -p workflow/scripts  # Create the appropriate folder
mv count_table.py workflow/scripts  # Move the script in the newly created folder

Exercise: Check how long the script is.

Answer

If you run wc -l workflow/scripts/count_table.py, you will see that the script is 67 lines long. This is too much to use a run directive, so we will use the script directive instead. This means that we need to add the following to our rule count_table:

script:
    '../scripts/count_table.py'

Script path

If you placed included files in subfolders (like rules/analysis.smk), you need to change relative paths for external script files, hence the ../ in the script path.

Hint

In many cases, it would be nice to have a script that can be called by Snakemake but also work with standard Python, so that the code can be reused in other projects. There are several ways to do that:

You could implement most of the functionalities in a module and use this module in a simple script called by Snakemake
You could test for the existence of a snakemake object and handle parameter values differently (e.g. command-line arguments) if the object does not exist

Inside the script, you have access to an object snakemake that provides access to the same objects that are available in the run and shell directives (input, output, params, wildcards, log, threads, resources, config), e.g. you can use snakemake.input[0] to access the first input file of a rule, or snakemake.input.input_name to access a specific named input.

Exercise: Check the script content to see whether it requires specific packages.

Answer

If you run head workflow/scripts/count_table.py, you will see several import commands at the start of the script, including import pandas as pd. pandas is a great package, but it is not part of the default packages natively shipped with Python. This means that we need to find a solution to provide it to the rule. The easiest way to do that is to create a conda environment dedicated to the rule. Conda environments should be stored in a subfolder workflow/envs.

Create the appropriate folder: mkdir -p workflow/envs. Then, write the following configuration in the environment file, workflow/envs/py.yaml:

# Environment file to perform data processing with python
name : python
channels :
    - conda-forge
    - bioconda
dependencies :
    - python >= 3.10
    - pandas == 1.4.3

This means that you need to add the following to your rule count_table:

conda:
    '../envs/py.yaml'

Environment file path

If you placed included files in subfolders (like rules/analysis.smk), you need to change relative paths for conda environments files as well, hence the ../ in the environment file path.

Using conda environments improves reproducibility for many reasons, including version control and the fact that users do not need to manually manage software dependencies. Note: the first execution of the workflow after adding Conda environments will take some time, because Conda will have to download and install all the software.

Identifying and listing the input files

The before-last step is the most complex one: identifying all the inputs of the rule and gathering them in a list. Here, there are only 6 samples, so in theory, we could list them directly… However, by now you should that is not a good solution. Fortunately, there is a much more elegant and convenient way to do this: an input function, which provides the added benefit of scaling up very well if the number of samples increase.

We already wrote the input function for you:

# Input function used in rule count_table
def get_gene_counts(wildcards):
    '''
    This function lists all the gene count tables of samples in the config file
    '''
    # Note that here {sample} is not a wildcard, it is an f-string variable!
    return [f"results/{sample}/{sample}_genes_read_quantification.tsv"
            for sample in config['samples']]

This function will loop over the list of samples in the config file, replace ‘{sample}’ with the current sample name of the iteration to create a string which is the output path from the rule reads_quantification_genes of said sample. Then, it will aggregate all the paths in a list and return this list.

Details on input functions

Input functions take the wildcards global object as single argument
You can access wildcard values inside an input function with the syntax {wildcards.wildcards_name}
Input and output functions can return a list of files, which will then be automatically handled like multiple inputs or outputs by Snakemake. These functions can also return a dictionary; in this case, the function should be called with the syntax input: unpack(<function_name>). The dictionary’s keys will be interpreted as input/output names and the dictionary’s values will be interpreted as input/output values
Functions are evaluated before the workflow is executed. As a consequence, they cannot be used to list the content of an output directory, since the directory does not exist before the workflow is executed!

Exercise: Insert the function in the proper Snakefile and adapt the input value of the rule accordingly.

Answer

There are two things to do: * Insert the input function in workflow/rules/analysis.smk, before the rule, otherwise you will get a name 'get_gene_counts' is not defined error (the function needs to be defined before Snakemake looks for it when it parses the rule input). * Use the function name as value for the input directive

Your rule and function should resemble this:

# Input function used in rule count_table
def get_gene_counts(wildcards):
    '''
    This function lists all the gene count tables of samples in the config file
    '''
    return [f"results/{sample}/{sample}_genes_read_quantification.tsv"
            for sample in config['samples']]

rule count_table:
    '''
    This rule merges all the gene count tables of an assembly into one table.
    '''
    input:
        get_gene_counts
    output:
        count_table = 'results/total_count_table.tsv'
    log:
        'logs/total_count_table.log'
    benchmark:
        'benchmarks/total_count_table.txt'
    conda:
        '../envs/py.yaml'
    resources:
        mem_mb = 500
    threads: 1
    script:
        '../scripts/count_table.py'

Input functions usage

You don’t need to use parentheses or specify any argument when you call an input function in the input directive.

Adapting the Snakefile and running the rule

Now, all that is left is to run the rule to create the table.

Exercise: Which command should you use to create the output? Is there anything else to do beforehand?

Answer

It turns out that we cannnot launch the workflow directly: we need to include the new rule file in the Snakefile and adapt the output of the rule all! Your Snakefile should now resemble this:

'''
Main Snakefile of the RNAseq analysis workflow. This workflow can clean and
map reads, and perform Differential Expression Analyses.
'''

# Path of the config file
configfile: 'config/config.yaml'

# Rules to execute the workflow
include: 'rules/read_mapping.smk'
include: 'rules/analyses.smk'

# Master rule that launches the workflow
rule all:
    '''
    Dummy rule to automatically generate the required outputs.
    '''
    input:
        'results/total_count_table.tsv'

Finally, run the workflow with snakemake --cores 4 -r -p --use-conda

–use-conda

Do not forget --use-conda otherwise Snakemake will not use the environments!!!

Creating a rule to detect Differentially Expressed Genes

It is now time to write the final rule of the workflow. This rule will perform the DEA using the global count table we previously created.

Exercise: Implement a rule to perform DEA using the R script provided here.

Information on the script to compute the table

You can download the script with wget https://raw.githubusercontent.com/sib-swiss/containers-snakemake-training/main/scripts/solutions/day2/session4/workflow/scripts/DESeq2.R
It goes in a special directory in your workflow
It is written in R
It takes a global read counts table as input
It produces two outputs:
1. A tab-separated table containing the DEG and the associated statistical results
2. A pdf file containing control plots of the analysis

Hint

While not being trivial, this rule is much easier than the previous one and some things work similarly. Still, here is a little outline of the steps you should take:

Build the basic structure of your rule: name, input, outputs, log, benchmark
- Memory should be set at 1 GB
- Threads should be set at 2
Think about the directive you want to use to run the script
- Remember than there is only one way to run an R script
Think about the location/path of the script
This script is using a lot of external packages. Fortunately, all these packages are available in a certain Docker image you worked with yesterday

Now, let’s write this last rule!

Building the rule structure

You have done that a few times already, so it should not be too difficult.

Exercise: Set the input, outputs, log, benchmark, resources and thread values.

Answer

rule differential_expression:
    '''
    This rule detects DEGs and plots associated visual control graphs (PCA,
    heatmaps...).
    '''
    input:
        table = rules.count_table.output.table
    output:
        deg = 'results/deg_list.tsv',
        pdf = 'results/deg_plots.pdf'
    log:
        'logs/differential_expression.log'
    benchmark:
        'benchmarks/differential_expression.txt'
    ?:
        ?
    resources:
        mem_gb = 1
    threads: 2
    ?:
        ?

Note that we do not need to use wildcards in this rule, because all the files are precisely defined.

Getting the R script and running it

Exercise: Download the script and place it the proper folder. Remember that per the official documentation, scripts should be stored in a subfolder workflow/scripts.

Answer

wget https://raw.githubusercontent.com/sib-swiss/containers-snakemake-training/main/scripts/solutions/day2/session4/workflow/scripts/DESeq2.R  # Download the script
mv DESeq2.R workflow/scripts  # Move the script in the newly created folder

Exercise: Find a way to run the script.

Answer

There is only one way to run an R script: use the script directive. This means that we need to add the following to our rule differential_expression:

script:
    '../scripts/DESeq2.R'

Script path

If you placed included files in subfolders (like rules/analysis.smk), you need to change relative paths for external script files, hence the ../ in the script path.

Hint

Inside the script, an S4 object named snakemake analogous to the Python case available and allows access to input and output files and other parameters. Here the syntax follows that of S4 classes with attributes that are R lists, e.g. you can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. snakemake@input[["myfile"]].

Exercise: Find an efficient way to create a computing environment for the rule.

Hint

Remember what you did during Day 1, session 3 “Working with Dockerfiles”!

Answer

During Day 1, you built your own docker image, called deseq2. This image actually contains everything we need to run DEA, so let’s use it again, but with Snakemake this time! This means that you need to add the following to your rule differential_expression:

container:
    'docker://geertvangeest/deseq2:v1'

Your own Docker image

First try with your own image. If it doesn’t work, then you can use Geert’s image: geertvangeest/deseq2:v1.

After all these modifcations, this is what your final rule should look like:

Answer

rule differential_expression:
    '''
    This rule detects DEGs and plots associated visual control graphs (PCA,
    heatmaps...).
    '''
    input:
        table = rules.count_table.output.table
    output:
        deg = 'results/deg_list.tsv',
        pdf = 'results/deg_plots.pdf'
    log:
        'logs/differential_expression.log'
    benchmark:
        'benchmarks/differential_expression.txt'
    container:
        'docker://geertvangeest/deseq2:v1'
    resources:
        mem_gb = 1
    threads: 2
    script:
        '../scripts/DESeq2.R'

Adapting the Snakefile and running the rule

Now, all that is left is to run the rule to create the DEG list.

Exercise: Which command should you use to create the output? Is there anything else to do beforehand?

Answer

It turns out that we cannnot launch the workflow directly: we need to include the new rule file in the Snakefile and adapt the output of the rule all! Your Snakefile should now resemble this:

'''
Main Snakefile of the RNAseq analysis workflow. This workflow can clean and
map reads, and perform Differential Expression Analyses.
'''

# Path of the config file
configfile: 'config/config.yaml'

# Rules to execute the workflow
include: 'rules/read_mapping.smk'
include: 'rules/analyses.smk'

# Master rule that launches the workflow
rule all:
    '''
    Dummy rule to automatically generate the required outputs.
    '''
    input:
        'results/deg_list.tsv'

Finally, run the workflow with snakemake --cores 4 -r -p --use-singularity. You should see new Snakemake information messages:

Rscript --vanilla /path/to/snakemake_rnaseq/.snakemake/scripts/tmpge97d_lz.DESeq2.R
Activating singularity image /path/to/snakemake_rnaseq/.snakemake/singularity/8bfdbe93244feb95887ab5d33a705017.simg
INFO:    squashfuse not found, will not be able to mount SIF
INFO:    fuse2fs not found, will not be able to mount EXT3 filesystems
INFO:    gocryptfs not found, will not be able to use gocryptfs
INFO:    Converting SIF file to temporary sandbox...
INFO:    Cleaning up image...

–use-singularity

Do not forget --use-singularity otherwise Snakemake will not pull the image!!!

Hint

If you want to see how a Snakemake-generated Dockerfile looks like, use: snakemake --cores 1 --containerize > Dockerfile

Exercise: How many DEG were detected?

Answer

Have a look at the list that was just created: cat results/deg_list.tsv. 8 genes are differentially expressed!

Exercise: If you had to re-run the entire workflow from scratch, what command would you use?

Answer

You would need to execute snakemake --cores 4 -r -p --use-conda --use-singularity -F. * -F is to force the execution of the entire workflow * Don’t forget --use-conda --use-singularity! Otherwise, you will lack some software and packages and the workflow will crash!

Exercise: Visualise the DAG of the entire workflow.

Answer

You should now be used to this. snakemake --cores 1 -r -p -F --dag | dot -T png > images/total_dag.png

This is the DAG you should see:

Congratulations, you are now able to create a Snakemake workflow and make it reproducible thanks to conda/mamba and Docker/Singularity! To make things even better, have a look at Snakemake’s best practices!