Read alignment - basics
Learning outcomes
After having completed this chapter you will be able to:
- Describe the general workflow of library preparation and sequencing with an Illumina sequencer
- Explain how the fastq format stores sequence and base quality information
- Calculate probability from phred quality and the other way around
- Explain why base quality and mapping quality are important for detecting variants
- Illustrate the difference between short-read and long-read sequencing
- Explain which type of invention led to development of long-read sequencing
- Explain what impact long read sequencing can have on variant analysis
- Describe how alignment information is stored in a sequence alignment (
.sam
) file - Define a duplicate alignment and explain how alignment duplicates can affect variant analysis
- Perform an alignment of genomic reads with
bwa mem
- Generate and interpret the alignment statistics from
samtools flagstat
Material
Exercises
1. Prepare the reference genome
Download and unpack the data files in your working directory (~/workdir
).
cd ~/workdir
wget https://ngs-variants-training.s3.eu-central-1.amazonaws.com/ngs-variants-training.tar.gz
tar -xvf ngs-variants-training.tar.gz
rm ngs-variants-training.tar.gz
Exercise: This will create the directory data
. Check out what’s in there.
Answer
The directory data contains the following:
data
├── fastq
│ ├── father_R1.fastq.gz
│ ├── father_R2.fastq.gz
│ ├── mother_R1.fastq.gz
│ ├── mother_R2.fastq.gz
│ ├── son_R1.fastq.gz
│ └── son_R2.fastq.gz
├── reference
│ └── Homo_sapiens.GRCh38.dna.chromosome.20.fa
└── variants
├── 1000g_gold_standard.indels.filtered.vcf
├── GCF.38.filtered.renamed.vcf
├── NA12878.vcf.gz
└── NA12878.vcf.gz.tbi
3 directories, 11 files
These are:
- input reads (at
fastq
) - a part of the human reference genome (at
reference
) - some vcfs with variants for calibration and evaluation (at
variants
)
Use data
only for input
The directory data
that you have just downloaded, contains only input files for the exercises. So, don’t write output (except for indexes) to this directory.
We’ll use bwa mem
for the alignment. Like all alignment software, it requires an index of the reference genome. You can make an index like this:
bwa index <reference.fa>
Make an index of the reference sequence of chromosome 20 of the human genome. You can find the fasta file in ~/workdir/data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa
.
Answer
cd ~/workdir/data/reference/
bwa index Homo_sapiens.GRCh38.dna.chromosome.20.fa
2. Read alignment
Check out the synopsis and manual of bwa mem
. We’ll be using paired-end reads of three samples that can be found at ~/workdir/data/fastq
. If we run bwa mem
with default options, which three arguments do we need?
Answer
The manual says:
bwa mem [-aCHMpP] [-t nThreads] [-k minSeedLen] ... db.prefix reads.fq [mates.fq]
- a database prefix (
db.prefix
) - forward reads (
reads.fq
) - reverse reads (
mates.fq
)
For our reference sequence a command would look like:
cd ~/workdir/
bwa mem \
data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
<forward_reads.fq> \
<reverse_reads.fq> \
> <alignment.sam>
Perform an alignment with bwa mem
of the reads from the mother (mother_R1.fastq
and mother_R2.fastq
) against chromosome 20. Write the alignment file to a directory in ~/workdir
called alignment
.
Index prefix is the same a reference filename
With default values, the name of the index of a reference for bwa mem
is the same as the name of the reference itself. In this case, this would be Homo_sapiens.GRCh38.dna.chromosome.20.fa
.
Answer
We’ll first make the alignment directory:
cd ~/workdir/
mkdir alignment
bwa mem \
data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
data/fastq/mother_R1.fastq.gz \
data/fastq/mother_R2.fastq.gz \
> alignment/mother.sam
3. Alignment statistics
Exercise: Check out the statistics of the alignment by using samtools flagstat
. Find the documentation here. Any duplicates in there?
Answer
cd ~/workdir/alignment
samtools flagstat mother.sam
Should give:
133477 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
317 + 0 supplementary
0 + 0 duplicates
132892 + 0 mapped (99.56% : N/A)
133160 + 0 paired in sequencing
66580 + 0 read1
66580 + 0 read2
131470 + 0 properly paired (98.73% : N/A)
131990 + 0 with itself and mate mapped
585 + 0 singletons (0.44% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
No duplicates were found (0 + 0 duplicates
). The aligner doesn’t automatically flag duplicates. This needs to be done after the alignment.
4. Sorting and compression
Many downstream analyses require a coordinate sorted alignment file. Now, your alignment file is in the same order as the fastq file. You can coordinate sort an alignment file with samtools sort
. You can find the documentation here.
Exercise: Sort the alignment file according to coordinate.
Answer
samtools sort -o alignment/mother.sorted.sam alignment/mother.sam
Tip: samtools sort
and samtools view
can write to stdout
Like bwa mem
, samtools sort
and samtools view
can write its output to stdout. This means that you need to redirect your output to a file with >
or use the the output option -o
.
The command samtools view
is very versatile. It takes an alignment file and writes a filtered or processed alignment to the output. You can for example use it to compress your SAM file into a BAM file. Let’s start with that.
Exercise: compress our SAM file into a BAM file and include the header in the output. For this, use the -b
and -h
options. Find the required documentation here. How much was the disk space reduced by compressing the file?
Answer
samtools view -bh mother.sorted.sam > mother.bam
ls -lh
, you can find out that mother.sorted.sam
has a size of 55 Mb, while mother.bam
is only 16 Mb.