Read alignment - basics

Learning outcomes

After having completed this chapter you will be able to:

Describe the general workflow of library preparation and sequencing with an Illumina sequencer
Explain how the fastq format stores sequence and base quality information
Calculate probability from phred quality and the other way around
Explain why base quality and mapping quality are important for detecting variants
Illustrate the difference between short-read and long-read sequencing
Explain which type of invention led to development of long-read sequencing
Explain what impact long read sequencing can have on variant analysis
Describe how alignment information is stored in a sequence alignment (.sam) file
Define a duplicate alignment and explain how alignment duplicates can affect variant analysis
Perform an alignment of genomic reads with bwa mem
Generate and interpret the alignment statistics from samtools flagstat

Material

Exercises

Running into problems during the exercises?

Use the “Comments” box at the bottom of each page 👇 for asking questions or giving feedback. It requires a github account.

1. Download data and prepare the reference genome

Let’s start with the first script of our ‘pipeline’. We will use it to download and unpack the course data. Use the code snippet below to create a script called A01_download_course_data.sh. Store it in ~/workdir/scripts/A-prepare_references/, and run it.

A01_download_course_data.sh

#!/usr/bin/env bash
cd ~/workdir

wget https://ngs-variants-training.s3.eu-central-1.amazonaws.com/ngs-variants-training.tar.gz
tar -xvf ngs-variants-training.tar.gz
rm ngs-variants-training.tar.gz

Exercise: This will create the directory data. Check out what’s in there.

Answer

The directory data contains the following:

data
├── fastq
│   ├── father_R1.fastq.gz
│   ├── father_R2.fastq.gz
│   ├── mother_R1.fastq.gz
│   ├── mother_R2.fastq.gz
│   ├── son_R1.fastq.gz
│   └── son_R2.fastq.gz
├── reference
│   └── Homo_sapiens.GRCh38.dna.chromosome.20.fa
└── variants
    ├── 1000g_gold_standard.indels.filtered.vcf
    ├── GCF.38.filtered.renamed.vcf
    ├── NA12878.vcf.gz
    └── NA12878.vcf.gz.tbi

3 directories, 11 files

These are:

input reads (at fastq)
a part of the human reference genome (at reference)
some vcfs with variants for calibration and evaluation (at variants)

Use data only for input

The directory data that you have just downloaded, contains only input files for the exercises. So, don’t write output (except for indexes) to this directory.

In order to index the reference sequence we are going to need some bioinformatics tools. All required tools are pre-installed in a conda environment called ngs-tools. In order to use them, every time you open a new terminal, you will have to load the environment:

conda activate ngs-tools

activate ngs-tools

Every time you open a new terminal you will have to activate the environment again.

The software bwa is in this environment. We will use it for the alignment. Like all alignment software, it requires an index of the reference genome. You can make an index like this:

bwa index <reference.fa>

Make an index of the reference sequence of chromosome 20 of the human genome. You can find the fasta file in ~/workdir/data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa. Do it with a script called A02_create_bwa_index.sh. Also store this in the directory A-prepare_references.

Answer

A02_create_bwa_index.sh

#!/usr/bin/env bash

cd ~/workdir/data/reference/
bwa index Homo_sapiens.GRCh38.dna.chromosome.20.fa

2. Read alignment

Check out the synopsis and manual of bwa mem. We’ll be using paired-end reads of three samples that can be found at ~/workdir/data/fastq. If we run bwa mem with default options, which three arguments do we need?

Answer

The manual says:

bwa mem [-aCHMpP] [-t nThreads] [-k minSeedLen] ... db.prefix reads.fq [mates.fq]

So, we’ll need:

a database prefix (db.prefix)
forward reads (reads.fq)
reverse reads (mates.fq)

For our reference sequence a command would look like:

cd ~/workdir/

bwa mem \
data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
<forward_reads.fq> \
<reverse_reads.fq> \
> <alignment.sam>

We will now go through all the steps concerning alignment for the sample mother. To store the results of these steps, we will create a directory within ~/workdir called results. For the alignment, make a script called B01_alignment.sh. Since we will perform a similar analysis later on for all samples, we store this script in~/workdir/scripts/B-mother_only.

Your directory ~/workdir/scripts should now like this:

scripts
├── A-prepare_references
│   ├── A01_download_course_data.sh
│   ├── A02_create_bwa_index.sh
├── B-mother_only
│   └── B01_alignment.sh
└── C-all_samples

In B01_alignment.sh write the commands to perform an alignment with bwa mem of the reads from the mother (mother_R1.fastq and mother_R2.fastq) against chromosome 20. Write the resulting .sam file to a directory in ~/workdir/results called alignments.

Index prefix is the same a reference filename

With default values, the name of the index of a reference for bwa mem is the same as the name of the reference itself. In this case, this would be Homo_sapiens.GRCh38.dna.chromosome.20.fa.

Answer

B01_read_alignment.sh

#!/usr/bin/env bash

cd ~/workdir/
mkdir -p results/alignments

bwa mem \
data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
data/fastq/mother_R1.fastq.gz \
data/fastq/mother_R2.fastq.gz \
> results/alignments/mother.sam

3. Alignment statistics

Exercise: Check out the statistics of the alignment by using samtools flagstat. Write the output of samtools flagstat to a file called mother.sam.flagstat. Do this by creating a script called B02_get_alignment_statistics.sh, and add this script to ~/workdir/scripts/B-mother_only. Find the documentation of samtools flagstat here. Any duplicates in there?

Answer

B02_get_alignment_statistics.sh

#!/usr/bin/env bash

cd ~/workdir/results/alignments
samtools flagstat mother.sam > mother.sam.flagstat

This should result in:

133477 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
317 + 0 supplementary
0 + 0 duplicates
132892 + 0 mapped (99.56% : N/A)
133160 + 0 paired in sequencing
66580 + 0 read1
66580 + 0 read2
131470 + 0 properly paired (98.73% : N/A)
131990 + 0 with itself and mate mapped

585 + 0 singletons (0.44% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

No duplicates were found (0 + 0 duplicates). The aligner doesn’t automatically flag duplicates. This needs to be done after the alignment.

4. Sorting and compression

Many downstream analyses require a coordinate sorted alignment file. Now, your alignment file is in the same order as the fastq file. You can coordinate sort an alignment file with samtools sort. You can find the documentation here.

Exercise: Sort the alignment file according to coordinate. In order to do this, create a script called B03_sort_alignment.sh (in ~/workdir/scripts/B-mother_only).

Answer

B03_sort_alignment.sh

#!/usr/bin/env bash

cd ~/workdir/results

samtools sort -o alignments/mother.sorted.sam alignments/mother.sam

Tip: samtools sort and samtools view can write to stdout

Like bwa mem, samtools sort and samtools view can write its output to stdout. This means that you need to redirect your output to a file with > or use the the output option -o.

The command samtools view is very versatile. It takes an alignment file and writes a filtered or processed alignment to the output. You can for example use it to compress your SAM file into a BAM file. Let’s start with that.

Exercise: compress our SAM file into a BAM file and include the header in the output. For this, use the -b and -h options. Perform the calculation from a script called B04_compress_alignment.sh (in ~/workdir/scripts/B-mother_only). Find the required documentation here. How much was the disk space reduced by compressing the file?

Answer

B04_compress_alignment.sh

#!/usr/bin/env bash

cd ~/workdir/results

samtools view -bh alignments/mother.sorted.sam > alignments/mother.bam

By using ls -lh, you can find out that mother.sorted.sam has a size of 55 Mb, while mother.bam is only 16 Mb.

5. Recap

You have now performed:

alignment
sorting
compression
flag statistics

On the sample mother. Your scripts directory should look like this:

scripts
├── A-prepare_references
│   ├── A01_download_course_data.sh
│   └── A02_create_bwa_index.sh
├── B-mother_only
│   ├── B01_alignment.sh
│   ├── B02_get_alignment_statistics.sh
│   ├── B03_sort_alignment.sh
│   └── B04_compress_alignment.sh
└── C-all_samples

Read alignment - basics

Learning outcomes

Material

Exercises

1. Download data and prepare the reference genome

2. Read alignment

3. Alignment statistics

4. Sorting and compression

5. Recap

Questions & Answers