Running Snakemake in an HPC environment

Learning outcomes

After having completed this chapter, you will be able to:
- Understand resource usage in each rule with the benchmark directive
- Optimise resource usage in a workflow with the resources directive
- Run a Snakemake workflow in a SLURM-based HPC environment
- Set up workflow-specific configuration profiles
Material
Workflow from previous session
If you didn’t finish the previous part or didn’t do the optional exercises, you can restart from a fully commented workflow with an additional quality-check rule for the .fastq files and multithreading implemented in all rules. You can download all the files (workflow and config) here or copy them after cloning the course repository locally:
git clone https://github.com/sib-swiss/containers-snakemake-training.git # Clone repository
cd containers-snakemake-training/ # Go in repository
cp -r docs/solutions/session4 destination_path # Copy folder where needed; adapt destination_path to required path
Exercises
In this last series of exercises, you will learn how to get a better idea of the resources used by your workflow and how to optimise their usage, as well as the command-line arguments, files and rule directives needed to run your workflow in a SLURM-based High Performance Computing (HPC) environment. In addition, you will also learn how to set up configuration profiles and further develop the Snakefile to define rule-specific settings.
Benchmarking a workflow
Knowing how many resources are being used by each job can be very useful to optimise your workflow for both local and remote execution. The benchmark directive allows you to do exactly that: adding it to your rule definitions will trigger the recording of several metrics for every job generated by your workflow execution. Each job will have an associated benchmark file with the following information:
| s | h:m:s | max_rss | max_vms | max_uss | max_pss | io_in | io_out | mean_load | cpu_time |
|---|---|---|---|---|---|---|---|---|---|
| 31.3048 | 0:00:31 | 763.04 | 904.29 | 757.89 | 758.37 | 1.81 | 230.18 | 37.09 | 11.78 |
- s and h:m:s give the job wall clock time (in seconds and in hours:minutes:seconds, respectively), which is the actual time taken from the start of a software to its end. You can use these results to infer a safe value for the runtime keyword
- Likewise, you can use max_rss (shown in megabytes) to figure out how much memory was used by the job and reuse this value in the mem_mb keyword
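For instance, based on the benchmark above, you could request resources with a comfortable safety margin. The rule below is a minimal, hypothetical sketch (rule name, output and shell command are placeholders); runtime is expressed in minutes and mem_mb in megabytes:

rule example_from_benchmark:
    output:
        'results/example.txt'
    resources:
        runtime = 5,    # minutes; a safe margin above the ~31 s wall clock time measured above
        mem_mb = 1000   # megabytes; a safe margin above the ~763 MB max_rss measured above
    shell:
        'touch {output}'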
What are the other columns?
In case you are wondering about the other columns of the table, the official documentation has detailed explanations about their content.
Exercise: Add the benchmark directive to the rules. You can follow the same logic as you did with the log directive.
Benchmarks and wildcards
The benchmark directive must contain the same wildcards as the output directive, here sample.
Answer
The code snippet below shows the answer for the reads_quantification_genes rule. You can do the same with the other rules.
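This is a minimal sketch: file paths, thread number and the featureCounts call are indicative and should be adapted to your own workflow.

rule reads_quantification_genes:
    input:
        bam = 'results/{sample}/{sample}_sorted.bam'
    output:
        counts = 'results/{sample}/{sample}_genes_read_quantification.tsv'
    log:
        'logs/{sample}/{sample}_genes_read_quantification.log'
    benchmark:
        'benchmarks/{sample}/{sample}_genes_read_quantification.txt'  # same {sample} wildcard as the output
    threads: 2
    shell:
        'featureCounts -T {threads} -a resources/annotation.gtf '
        '-o {output.counts} {input.bam} &> {log}'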
Benchmark metrics are only recorded while a job runs, which means a workflow has to run at least once to produce benchmark values. This leads to a “vicious circle”: to choose sensible resource values, you need benchmark information, but to get that information, you need to run the workflow. A good practice is to make a first run with only a small part of the samples and sensible resource requirements to allow the benchmarking to happen.
Controlling memory usage and runtime
Controlling resources such as the amount of memory or the runtime of each job is a very good way to optimise resource usage in a workflow. This ensures that your instance (computer, HPC cluster…) won’t run out of memory during computation (which could interrupt jobs or even crash the instance) and that your jobs will run in a reasonable amount of time (a job taking much longer than usual might be a sign that something went wrong).
Resource usage and schedulers
Optimising resource usage is especially important when submitting jobs to a scheduler (for instance on a cluster), as it allows a better and more precise definition of your job priority: jobs with low threads/memory/runtime requirements often have a higher priority than heavy jobs, which means they will often start first.
Controlling memory usage and runtime in Snakemake is very simple: you only need to use the resources directive with the mem_mb (or mem) and runtime keywords. For most software, you don’t even need to specify the amount of available memory via a dedicated parameter.
Here are some suggested memory values for the rules of the current workflow:
- fastq_trim: 500 MB
- read_mapping: 2 GB
- sam_to_bam: 250 MB
- reads_quantification_genes: 500 MB
Exercise: Implement memory usage limits in the workflow.
Two ways to declare memory values
There are two ways to declare memory values in resources:
- mem_<unit> = n
- mem = 'n<unit>': in this case, you must pass a string, so you have to enclose n<unit> in quotes ''

<unit> is a unit in [B, KB, MB, GB, TB, PB, KiB, MiB, GiB, TiB, PiB] and n is a float.
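As a quick illustration (a hypothetical rule using the value suggested above for fastq_trim), the two forms are equivalent:

rule fastq_trim_example:
    output:
        'results/trim_example.done'
    resources:
        mem_mb = 500   # unit in the keyword name; equivalent to writing: mem = '500MB'
    shell:
        'touch {output}'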
Answer
We implemented memory usage control in all the rules so that you can check everything; all memory definitions use mem_mb.
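For instance, the sam_to_bam rule could look like the sketch below (paths and the exact samtools command are indicative); the same pattern applies to the other rules with the values suggested above:

rule sam_to_bam:
    input:
        sam = 'results/{sample}/{sample}_mapped.sam'
    output:
        bam = 'results/{sample}/{sample}_sorted.bam'
    log:
        'logs/{sample}/{sample}_sam_to_bam.log'
    benchmark:
        'benchmarks/{sample}/{sample}_sam_to_bam.txt'
    threads: 2
    resources:
        mem_mb = 250   # value suggested above for this rule
    shell:
        'samtools sort -@ {threads} -O bam -o {output.bam} {input.sam} &> {log}'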
In the following section, we will see that the values established in resources will be very useful to communicate with job schedulers.
Setting up Snakemake to send jobs via the SLURM job scheduler
Running jobs remotely on an HPC environment usually requires interaction with job schedulers such as SLURM. Thankfully, adapting our workflow to send jobs via a scheduler does not require a lot of changes. In this section, you will use the cluster-generic plugin to send jobs via SLURM.
Snakemake plugins
In Snakemake v8+, interaction with job schedulers requires the installation of certain plugins. There are multiple plugins available for different schedulers. You can find them here.
In this tutorial, you will use the cluster-generic plugin and configure it to send jobs through SLURM. While there is a SLURM-specific plugin, we will focus on the cluster-generic one because it can be used with any scheduler.
Among others, the cluster-generic plugin accepts the following settings, which can be passed in different ways:
- --cluster-generic-submit-cmd: command used to submit each job to the cluster. Since we are using SLURM, the command is sbatch. This string can contain any valid sbatch arguments and use values from the Snakefile (e.g. using the threads specified in the Snakefile as a value for the --cpus-per-task argument)
- --cluster-generic-cancel-cmd: command used to cancel jobs. This is important if you want to stop your workflow while jobs are running and also cancel those running jobs
Then, you will need to tell Snakemake to run the workflow through SLURM. An important parameter when doing so is --jobs, which tells Snakemake how many jobs it can submit concurrently. In addition, you will also need to specify which executor plugin to use and its required settings:
snakemake \
--executor cluster-generic \
--cluster-generic-submit-cmd \
'sbatch <slurm_arguments>' \
--cluster-generic-cancel-cmd 'scancel' \
--jobs 2
The command above will:
- Tell Snakemake to use the cluster-generic plugin
- Use the sbatch command to submit jobs
- Use the scancel command to cancel submitted jobs if you stop the execution of the workflow
- Set the maximum number of simultaneously running jobs to 2
As you can see, the command to launch the workflow can get quite long. Thankfully, Snakemake allows you to store all of these settings in a configuration profile, which increases reproducibility and simplifies running your workflow. The next section shows how to set up such a profile.
Using configuration profiles
Configuration profiles allow you to specify settings regarding the execution of your workflow. Currently, Snakemake supports two types of profiles: global and workflow-specific. In this tutorial we will focus on workflow-specific profiles.
Exercise: Create a workflow profile named slurm_profile by:
- Creating a directory named slurm_profile in the project directory
- Creating a config.yaml file inside slurm_profile
File structure example
Your file structure should resemble this:
├── config
│   └── config.yaml
├── slurm_profile
│   └── config.yaml
└── workflow
    ├── Snakefile
    ├── envs
    │   ├── tool1.yaml
    │   └── tool2.yaml
    ├── rules
    │   ├── module1.smk
    │   └── module2.smk
    └── scripts
        ├── script1.py
        └── script2.R
The config.yaml file inside the slurm_profile directory can contain multiple settings related to the execution of the workflow. Below you can find an example configuration file:
executor: cluster-generic
cluster-generic-submit-cmd: 'sbatch --job-name={rule}_{wildcards} --cpus-per-task={threads}'
jobs: 2
Exercise: What do --job-name={rule}_{wildcards} and --cpus-per-task={threads} do?
Answer
These settings bind information defined in the workflow to SLURM arguments such as --job-name or --cpus-per-task. For example, in a rule where threads: 2, --cpus-per-task={threads} will become --cpus-per-task=2, telling SLURM that the job should have 2 CPUs available.
Exercise: Add the required SLURM arguments to:
- Log the SLURM output of each job to a file in slurm_logs/{rule}/{rule}_{wildcards}.log
- Use the thread number specified in each rule during job submission
- Use the scancel command to cancel running jobs
SLURM arguments
You can find the SLURM arguments to use by running sbatch -h or by checking this SLURM cheatsheet.
Make sure slurm_logs exists!
In older SLURM versions, the slurm_logs directory needed to exist before running the workflow! Therefore, to be able to save the logs into a directory, you need to create the folder before running Snakemake. There are two ways to go about this:
- Manually create the folder before running the workflow
- Add the directory creation command to cluster-generic-submit-cmd. This option ensures that no problems arise when you run the workflow for the very first time
Answer
Your config file should look like this:
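One possible config.yaml is sketched below, based on the settings explained underneath (YAML folds the line breaks inside the quoted submit command into spaces, so it is read as a single command line):

executor: cluster-generic
cluster-generic-submit-cmd:
  'mkdir -p slurm_logs/{rule} &&
  sbatch
  --job-name={rule}_{wildcards}
  --cpus-per-task={threads}
  --output=slurm_logs/{rule}/{rule}_{wildcards}.log'
cluster-generic-cancel-cmd: scancel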
- mkdir -p slurm_logs/{rule}: create the directory where the logs will be saved
- --job-name={rule}_{wildcards}: set the job name to the rule name and wildcards
- --cpus-per-task={threads}: set the number of CPUs to the number of threads defined in the rule
- --output=slurm_logs/{rule}/{rule}_{wildcards}.log: set the output file for job logs to slurm_logs/{rule}/{rule}_{wildcards}.log
- cluster-generic-cancel-cmd: scancel: set the command used to cancel jobs to scancel
Passing rule-specific resources to the job scheduler
As shown before, it is possible to specify rule-specific memory usage with the resources directive to gain better control over how the workflow uses the available resources:
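A minimal sketch (the rule name comes from the text below; inputs, outputs and command are placeholders):

rule myrule:
    input:
        'data/input.txt'
    output:
        'results/output.txt'
    resources:
        mem_mb = 100   # jobs spawned from this rule are limited to 100 MB of RAM
    shell:
        'sort {input} > {output}'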
This ensures that jobs spawned from rule myrule never use more than 100 MB of RAM. Note that a job will crash if it exceeds this limit, so try to account for some wiggle room when you set resources.
Exercise: Update the configuration profile to set the amount of memory per job to the values specified in each rule through the resources directive. To simplify things, make sure all rule memory requirements are set in megabytes.
Think about the SLURM parameter you need to use and how to set its value
Answer
The SLURM parameter to use is --mem. Its value should be set to the mem_mb keyword defined in the resources directive:
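Building on the previous profile, the submit command could look like this (a sketch; adapt the paths and arguments to your own setup):

executor: cluster-generic
cluster-generic-submit-cmd:
  'mkdir -p slurm_logs/{rule} &&
  sbatch
  --job-name={rule}_{wildcards}
  --cpus-per-task={threads}
  --mem={resources.mem_mb}
  --output=slurm_logs/{rule}/{rule}_{wildcards}.log'
cluster-generic-cancel-cmd: scancel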
In addition to remote execution parameters, a configuration profile allows you to set anything that can be passed through the command line. This can significantly simplify launching a workflow when there are a lot of arguments to pass. For example, the extra lines below indicate that we should run a maximum of 2 jobs at a time and use conda and singularity:
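A possible complete profile is sketched below (apptainer is the value used by recent Snakemake versions for Singularity/Apptainer containers; adapt it to your own setup):

executor: cluster-generic
cluster-generic-submit-cmd:
  'mkdir -p slurm_logs/{rule} &&
  sbatch
  --job-name={rule}_{wildcards}
  --cpus-per-task={threads}
  --mem={resources.mem_mb}
  --output=slurm_logs/{rule}/{rule}_{wildcards}.log'
cluster-generic-cancel-cmd: scancel
jobs: 2
software-deployment-method:
  - conda
  - apptainer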
Exercise: Add the jobs and software-deployment-method parameters to your configuration profile.
Adapting the workflow to the available resources
Before launching a workflow, it is very important to quantify the resources available on the machine that will run it. This information is typically provided by the HPC user guide. In our case, you can use the following commands to get the total number of cores and the amount of memory available on the server:
nproc --all # Number of cores
free -g # Amount of memory in GB in row "Mem" and column "total"
Exercise: Based on the available resources, figure out if you need to update any of the rule resources.
Answer
The cluster has 48 CPUs and 96 GB of memory, so there is no need to change the resource settings. It is also more than enough to run several jobs in parallel without any problem.
Always check the available resources
A very important role of schedulers is to ensure that running jobs do not exceed the total amount of available resources. When a job asks for more resources than the machine has (e.g. the job asks for 50 CPUs, but the machine only has 20), a properly configured HPC will prevent Snakemake from submitting the job and the workflow will fail. However, if the HPC is not configured properly, or if the workflow is running on another type of instance without such safety checks, Snakemake will accept the job but display the message Waiting for more resources. to indicate that the job is waiting until enough resources become available, which will never happen. In this case, Snakemake will get stuck and never manage to launch the job.
Final exercise
Exercise: To conclude, run the workflow in the HPC environment with the following command:
snakemake --profile slurm_profile
You can then see which jobs are running by using the squeue command in combination with watch to check the status of your workflow at regular intervals (here, 10 s):
watch -n 10 squeue -u <username>
This will give you information such as the job ID, the job status, how long it has been running, and the reason if the status is PD (pending).
| JOBID | PARTITION | NAME | USER | ST | TIME | NODES | NODELIST(REASON) |
|---|---|---|---|---|---|---|---|
| 91 | local | read_map | user1 | PD | 0:00 | 1 | (Resources) |
| 90 | local | read_map | user1 | R | 0:30 | 1 | localhost |
Congratulations, you made it to the end! You are now able to create a Snakemake workflow, make it reproducible thanks to Conda and Docker/Apptainer, and even run it on an HPC cluster! This is a great time to get yourself a coffee/tea and celebrate!
To make things even better, have a look at some additional concepts and Snakemake’s best practices!