
Server login + UNIX refresher

To conduct the practicals of this course, we will be using a dedicated High Performance Computing cluster. This matches the reality of most NGS workflows, which cannot be completed in a reasonable time on a single machine.

To interact with this cluster, you will have to log in to a remote head node. From there you will be able to distribute your computational tasks across the cluster using a job scheduler called Slurm.

This page will cover our first contact with the remote cluster.

You will learn to:

  • understand a typical computer cluster architecture.
  • connect to the server.
  • use the command line to perform basic operations on the head node.
  • exchange files between the server and your own machine.
  • submit a job to the cluster.

Note

If you are doing this course on your own, the remote server provided within the course will not be available. Feel free to ignore or adapt the following steps to your own situation.

The computing cluster

A computing cluster follows an architecture that lets several users distribute their computational tasks across many machines, which share a number of resources such as a common file system.

(Figure: overview of the cluster architecture)

Users do not access each machine individually, but rather connect to a head node. From there, they can interact with the cluster through the job scheduler (here, Slurm). The job scheduler’s role is to decide where and when to run the jobs of all users, so that waiting time is minimized and resource usage is optimized.

Warning

Everyone is connected to the same head node. Do not perform compute-intensive tasks on it or you will slow everyone down!

Connect to the server

Say you want to connect to a cluster whose address is xx.xx.xx.xx and that your login is login.

Warning

If you are doing this course with a teacher, use the link, login and password provided before or during the course.

The first step will be to open a terminal.

  • On Linux or macOS: open a terminal, for instance with the application Xterm or XQuartz, or simply open a new terminal window.
  • On Windows: open the application MobaXterm (or any SSH-enabling terminal application you prefer) and click on “Start a local Terminal”.


In the terminal, type the following command:

ssh login@xx.xx.xx.xx

When prompted for your password, type it and press Enter.

Note

There is no cursor or ‘●’ character appearing while you type your password. This is normal.

After a few seconds, you should be logged into the head node and ready to begin.
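If you expect to reconnect often, you can optionally store the connection details in the SSH configuration on your own machine. This is a minimal sketch, assuming the placeholder address and login used above; the alias course-cluster is just an example name:

# content of ~/.ssh/config on your local machine (not on the cluster)
Host course-cluster      # arbitrary example alias, pick any name you like
    HostName xx.xx.xx.xx # the cluster address provided for the course
    User login           # your cluster login

With this in place, ssh course-cluster is equivalent to the full command above.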

Using the command line on the cluster

Now that you are on the head node, it is time to get acquainted with your environment and to prepare for the upcoming practicals. We will also use this as a short reminder of the UNIX command line.

You can also refer to this nice Linux Command Line Cheat Sheet.


At any time, you can get the location (folder) your terminal is currently in by typing the “print working directory” command:

pwd

When you start a session on a remote computer, you are placed in your home directory, so the cluster should return something like:

/shared/home/<login>

Creating a directory

Use the command line to create a directory called day1 where you will put all materials relating to this first day.

Answer
mkdir day1

Move to that directory.

Answer
cd day1

The directory /shared/data/ contains data and solutions for most practicals. Check the content of that directory.

Answer
ls /shared/data/

Note

You don’t need to move to that directory to list its contents!

Copy the script fastqc_Liu2015_SRR1272187_1.sh from /shared/data/Solutions/Liu2015 into your current directory.

Answer
cp /shared/data/Solutions/Liu2015/fastqc_Liu2015_SRR1272187_1.sh .

Print the content of this script to the screen.

Answer

more fastqc_Liu2015_SRR1272187_1.sh
output:
#!/usr/bin/bash
#SBATCH --job-name=fastqc
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH -o fastqc_Liu2015.o
#SBATCH -e fastqc_Liu2015.e

ml fastqc

dataDir=/shared/data/DATA/Liu2015

mkdir -p FASTQC_Liu2015/
fastqc -o FASTQC_Liu2015/ $dataDir/SRR1272187_1.fastq.gz

We’ll see what all this means soon.

Creating and editing a file

To edit files on the remote server, we will use the command line editor nano. It is far from the most complete or efficient editor, but it can be found on most servers and is arguably among the easiest to start with.

Note

Alternatively, feel free to use any other CLI editor you prefer, such as vi.

To start editing a file named test.txt, type:

    nano test.txt

You will be taken to the nano interface:

(Figure: the nano editor interface)

Type in your favorite movie quote, then exit by pressing Ctrl+x (control+x on a Mac keyboard), and then press y and Enter when prompted to save the modifications you just made.

You can check that your modifications were saved by typing:

more test.txt

Exchanging files with the server

Whether you want to transfer some data to the cluster or retrieve the results of your latest computation, it is important to be able to exchange files with the remote server.

Several alternatives exist, depending on your platform and preferences.

We will use scp.

To copy a file from the server to your machine, use the following syntax in a terminal on your local machine (open a new terminal if necessary):

scp <login>@<server-address>:/path/to/file/on/server/file.txt /local/destination/

For example, to copy the file test.txt you just created in the folder day1/ to your current (local) working directory:

scp login@xx.xx.xx.xx:~/day1/test.txt .

(NB: here ~ is interpreted as your home directory on the server; this is a useful, time-saving shorthand.)

To copy a file from your machine to the server:

scp /path/to/local/file.txt <login>@<server-address>:/destination/on/server/
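For example, to upload a local file named mydata.txt (a hypothetical name, replace it with your own file) into the day1/ folder you created on the server:

# mydata.txt is a placeholder: substitute any file present on your machine
scp mydata.txt login@xx.xx.xx.xx:~/day1/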

There are also nice, free programs with a graphical user interface, such as FileZilla, that help you manage exchanges with the remote server. Feel free to install one and experiment with it during the break.

(Figure: the FileZilla main window)

If you are using MobaXterm, the left sidebar provides a graphical SFTP browser which allows you to browse the remote file system and drag and drop files directly from/to the remote server.

(Figure: the MobaXterm SFTP sidebar)

Submitting jobs

Jobs can be submitted to the compute cluster using sbatch scripts, which contain two parts:

  • information for the job scheduler:

    • how much RAM and how many CPUs do I need?
    • where should the logs of my job be written?
  • bash commands corresponding to your task

But an example is worth a thousand words:

#!/usr/bin/bash
#SBATCH --job-name=test
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH -o test_log.o


echo "looking at the size of the elements of /shared/data/"
sleep 10 # making the script wait for 10 seconds - this is just so we can see it later on. 
# `du` is "disk usage", a command that returns the size of a folder structure.
du -h -d 2 /shared/data/

The lines beginning with #SBATCH specify options for the job scheduler (a variant requesting more resources is sketched after this list):

  • The first line (#!/usr/bin/bash) indicates which shell should be used to interpret the commands.
  • #SBATCH --job-name=test : the job name
  • #SBATCH --time=00:30:00 : the time reserved for the job, here 30 min
  • #SBATCH --cpus-per-task=1 : the number of CPUs reserved for the job
  • #SBATCH --mem=1G : the memory reserved for the job
  • #SBATCH -o test_log.o : the file where output and error messages are written
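As an illustration only (the values below are placeholders, not recommendations for any particular tool), the header of a job needing 4 CPUs and 8 GB of RAM for 2 hours would look like this; the number of reserved CPUs can then be passed to your command through the $SLURM_CPUS_PER_TASK variable that Slurm sets for each task:

#!/usr/bin/bash
#SBATCH --job-name=bigger_job
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH -o bigger_job_log.o

# hypothetical tool with a --threads option, shown only to illustrate the variable
# some_tool --threads $SLURM_CPUS_PER_TASK input_file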

Warning

Your job will be killed as soon as it takes more time or RAM than requested. You might need a few test runs to find appropriate values.
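One way to calibrate these values, assuming Slurm accounting is enabled on the cluster (it usually is, but this is an assumption), is to ask Slurm how much time and memory a finished job actually used:

# replace <jobid> with the id printed by sbatch (also shown by squeue)
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State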

Copy this script into a new file named mySbatchScript.sh, then submit it to the job scheduler using:

    sbatch mySbatchScript.sh

Afterward, use the command squeue to monitor the jobs submitted to the cluster. Locate your job and wait for it to start running (RUNNING status), and then to complete (the job disappears from the output of squeue).

Check the output of your job in the output file.

Note

When there are a lot of jobs, squeue -u <username> will limit the list to those of the specified user.
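If you realize you submitted a job by mistake (wrong script, wrong parameters), you can remove it from the queue. The job id is printed by sbatch and shown in the first column of squeue:

squeue -u $USER    # list only your own jobs
scancel <jobid>    # cancel the job with this id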

Advanced cluster usage: loading modules

During our various analyses, we will call on numerous software tools.

Fortunately, in most cases we do not have to install these ourselves on the cluster: they have already been packaged and made available to you and your code.

However, by default these packages are not loaded, and you have to explicitly load the module containing the software you want in your script (or in your interactive shell session).

Question: why aren’t all the modules pre-loaded?

Answer

Many toolsets depend on different, sometimes incompatible, libraries. Packaging each tool independently and loading modules separately circumvents this: you only load what you need, and you can always unload a toolset if you need to load another, incompatible, one.

Modules are managed with the module command.

Basic commands are (a short combined example follows this list):

  • module list : lists currently loaded modules
  • module load <modulename> (alias: ml <modulename>) : loads module <modulename>
  • module unload <modulename> : unloads module <modulename>
  • module purge : unloads all loaded modules
  • module avail : lists all modules available for loading
  • module keyword <KW> : lists all available modules whose name or description contains <KW>
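Here is a short sketch of how a few of these are typically combined at the top of an sbatch script or at the start of a session (fastqc is used only because it appears later in this course):

module purge          # start from a clean environment
module load fastqc    # or the shortcut: ml fastqc
module list           # check what is currently loaded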

Try it for yourself: soon, we will need the fastqc software. If we type in the terminal:

fastqc

This should give you an error, telling you there is no such command.

Now, if we load it first:

ml fastqc # shortcut for "module load fastqc"
fastqc

This time you should not get any error.

Note: our module provider is ComputeCanada, which offers a large catalogue of software. To avoid storing all of it on our cluster, a module is fetched from the ComputeCanada servers the first time it is loaded, so loading a module for the first time can take a little while.

Advanced cluster usage: job arrays

Often, we have to repeat a similar analysis on a number of files, or for a number of different parameters. Rather than writing each sbatch script individually, we can rely on job arrays to facilitate our task.

Say you want to execute a command on 10 files (for example, map the reads of 10 samples).

You first create a file containing the names of your files (one per line); let’s call it readFiles.txt.
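One quick way to build such a file, assuming for illustration that your read files are the fastq.gz files of the Liu2015 dataset used above, is to redirect the output of ls:

# adapt the path and pattern to your own files
ls /shared/data/DATA/Liu2015/*.fastq.gz > readFiles.txt
wc -l readFiles.txt    # count the lines: this is the number of tasks to put in --array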

Then, you write an sbatch array job script:

#!/usr/bin/bash
#SBATCH --job-name=test_array
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH -o test_array_log.%a.o
#SBATCH --array 1-10%5


echo "job array id" $SLURM_ARRAY_TASK_ID

# sed -n <X>p <file> : retrieve line <X> of file
# so the next line grabs the file name corresponding to our job array task id and stores it in the variable ReadFileName 
ReadFileName=`sed -n ${SLURM_ARRAY_TASK_ID}p readFiles.txt`

# here we would put the mapping command or whatever
echo $ReadFileName

Some things have changed compared to the previous sbatch script:

  • #SBATCH --array 1-10%5 : will spawn independent tasks with ids from 1 to 10, and will manage them so that at most 5 run at the same time.
  • #SBATCH -o test_array_log.%a.o : the %a will take the value of the array task id. So we will have 1 log file per task (so 10 files).
  • $SLURM_ARRAY_TASK_ID : changes value between the different tasks. This is what we use to execute the same script on different files (using sed -n ${SLURM_ARRAY_TASK_ID}p); you can test this construction interactively, as sketched below.
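Concretely, here is how you can test it on the head node by setting the variable yourself; this sketch assumes readFiles.txt exists in your current directory:

SLURM_ARRAY_TASK_ID=3    # pretend we are task number 3
ReadFileName=$(sed -n ${SLURM_ARRAY_TASK_ID}p readFiles.txt)    # grab line 3 of the file
echo $ReadFileName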