Read alignment - basics

Learning outcomes

After having completed this chapter you will be able to:

Describe the general workflow of library preparation and sequencing with an Illumina sequencer
Explain how the fastq format stores sequence and base quality information
Calculate probability from phred quality and the other way around
Explain why base quality and mapping quality are important for detecting variants
Illustrate the difference between short-read and long-read sequencing
Explain which type of invention led to development of long-read sequencing
Explain what impact long read sequencing can have on variant analysis
Describe how alignment information is stored in a sequence alignment (.sam) file
Define a duplicate alignment and explain how alignment duplicates can affect variant analysis
Perform an alignment of genomic reads with bwa mem
Generate and interpret the alignment statistics from samtools flagstat

Material

Exercises

1. Prepare the reference genome

Download and unpack the data files in your working directory (~/workdir).

cd ~/workdir
wget https://ngs-variants-training.s3.eu-central-1.amazonaws.com/ngs-variants-training.tar.gz
tar -xvf ngs-variants-training.tar.gz
rm ngs-variants-training.tar.gz

Exercise: This will create the directory data. Check out what’s in there.

Answer

The directory data contains the following:

data
├── fastq
│   ├── father_R1.fastq.gz
│   ├── father_R2.fastq.gz
│   ├── mother_R1.fastq.gz
│   ├── mother_R2.fastq.gz
│   ├── son_R1.fastq.gz
│   └── son_R2.fastq.gz
├── reference
│   └── Homo_sapiens.GRCh38.dna.chromosome.20.fa
└── variants
    ├── 1000g_gold_standard.indels.filtered.vcf
    ├── GCF.38.filtered.renamed.vcf
    ├── NA12878.vcf.gz
    └── NA12878.vcf.gz.tbi

3 directories, 11 files

These are:

input reads (at fastq)
a part of the human reference genome (at reference)
some vcfs with variants for calibration and evaluation (at variants)

Use data only for input

The directory data that you have just downloaded, contains only input files for the exercises. So, don’t write output (except for indexes) to this directory.

We’ll use bwa mem for the alignment. Like all alignment software, it requires an index of the reference genome. You can make an index like this:

bwa index <reference.fa>

Make an index of the reference sequence of chromosome 20 of the human genome. You can find the fasta file in ~/workdir/data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa.

Answer

cd ~/workdir/data/reference/
bwa index Homo_sapiens.GRCh38.dna.chromosome.20.fa

2. Read alignment

Check out the synopsis and manual of bwa mem. We’ll be using paired-end reads of three samples that can be found at ~/workdir/data/fastq. If we run bwa mem with default options, which three arguments do we need?

Answer

The manual says:

bwa mem [-aCHMpP] [-t nThreads] [-k minSeedLen] ... db.prefix reads.fq [mates.fq]

So, we’ll need:

a database prefix (db.prefix)
forward reads (reads.fq)
reverse reads (mates.fq)

For our reference sequence a command would look like:

cd ~/workdir/

bwa mem \
data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
<forward_reads.fq> \
<reverse_reads.fq> \
> <alignment.sam>

Perform an alignment with bwa mem of the reads from the mother (mother_R1.fastq and mother_R2.fastq) against chromosome 20. Write the alignment file to a directory in ~/workdir called alignment.

Index prefix is the same a reference filename

With default values, the name of the index of a reference for bwa mem is the same as the name of the reference itself. In this case, this would be Homo_sapiens.GRCh38.dna.chromosome.20.fa.

Answer

We’ll first make the alignment directory:

cd ~/workdir/
mkdir alignment

Then, we run the alignment:

bwa mem \
data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
data/fastq/mother_R1.fastq.gz \
data/fastq/mother_R2.fastq.gz \
> alignment/mother.sam

3. Alignment statistics

Exercise: Check out the statistics of the alignment by using samtools flagstat. Find the documentation here. Any duplicates in there?

Answer

cd ~/workdir/alignment
samtools flagstat mother.sam

Should give:

133477 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
317 + 0 supplementary
0 + 0 duplicates
132892 + 0 mapped (99.56% : N/A)
133160 + 0 paired in sequencing
66580 + 0 read1
66580 + 0 read2
131470 + 0 properly paired (98.73% : N/A)
131990 + 0 with itself and mate mapped

585 + 0 singletons (0.44% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

No duplicates were found (0 + 0 duplicates). The aligner doesn’t automatically flag duplicates. This needs to be done after the alignment.

4. Sorting and compression

Many downstream analyses require a coordinate sorted alignment file. Now, your alignment file is in the same order as the fastq file. You can coordinate sort an alignment file with samtools sort. You can find the documentation here.

Exercise: Sort the alignment file according to coordinate.

Answer

samtools sort -o alignment/mother.sorted.sam alignment/mother.sam

Tip: samtools sort and samtools view can write to stdout

Like bwa mem, samtools sort and samtools view can write its output to stdout. This means that you need to redirect your output to a file with > or use the the output option -o.

The command samtools view is very versatile. It takes an alignment file and writes a filtered or processed alignment to the output. You can for example use it to compress your SAM file into a BAM file. Let’s start with that.

Exercise: compress our SAM file into a BAM file and include the header in the output. For this, use the -b and -h options. Find the required documentation here. How much was the disk space reduced by compressing the file?

Answer

samtools view -bh mother.sorted.sam > mother.bam

By using ls -lh, you can find out that mother.sorted.sam has a size of 55 Mb, while mother.bam is only 16 Mb.