Read alignment - advanced
Learning outcomes
After having completed this chapter you will be able to:
- Use
samtools
to mark duplicates from an alignment file - Use
samtools
to add readgroups to an alignment file - Use a for loop in bash to perform the same operation on a range of files
- Use
samtools
in a pipe to efficiently do multiple operations on an alignment file in a single command
Material
samtools
documentation
Exercises
1. Marking duplicates
For variant analysis, it’s important to mark reads that possibly originated from PCR duplication. We can do that with samtools markdup
. However, we can not directly run that on our .sam
file nor on our compressed .bam
file.
Exercise: Which samtools commands would we need to run to mark duplicates?
Hint
More info on this at the samtools markdup
documentation.
Also: the reads are already collated (i.e. forward and reverse are grouped) after the alignment, so no need to run samtools collate
.
Answer
The commands would be:
samtools fixmate
(with the-m
option)samtools sort
samtools markdup
Exercise: Run the three commands that are required to mark duplicates.
Answer
cd ~/workdir/alignment
samtools fixmate -m mother.bam mother.fixmate.bam
samtools sort -o mother.positionsort.bam mother.fixmate.bam
samtools markdup mother.positionsort.bam mother.markdup.bam
Exercise: Run samtools flagstat
on the alignment file with marked duplicates. How many reads were marked as duplicate?
Answer
samtools flagstat mother.markdup.bam
Gives:
133477 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
317 + 0 supplementary
17329 + 0 duplicates
132892 + 0 mapped (99.56% : N/A)
133160 + 0 paired in sequencing
66580 + 0 read1
66580 + 0 read2
131470 + 0 properly paired (98.73% : N/A)
131990 + 0 with itself and mate mapped
585 + 0 singletons (0.44% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
2. Adding read groups
For variant analysis, it’s important to know which read came from which sample. Right now, that’s easy. All reads come from one individual. But this can become less trivial if you are combining samples. Therefore we add a tag to each read specifying its origin.
You can add a readgroup to your marked alignment file like this:
samtools addreplacerg \
-r ID:mother \
-r SM:mother \
-r PL:ILLUMINA \
-o mother.markdup.rg.bam \
mother.markdup.bam
This command modifies the sam header and read tags.
Exercise: Run the samtools addreplacerg
command and compare the header and first alignments of mother.markdup.bam
and mother.markdup.rg.bam
. Notice any differences?
Hint
You can view the header with
samtools view -H <alignment.bam>
And the first few alignments with
samtools view <alignment.bam> | head
Answer
Compared to the header of mother.markdup.bam
, the header of mother.markdup.rg.bam
contains an extra line starting with @RG
:
@RG ID:mother SM:mother PL:ILLUMINA
In the alignment records, a tag was added at the very end of each line: RG:Z:mother
.
3. Indexing
To look up specific alignments, it is convenient to have your alignment file indexed. An indexing can be compared to a kind of ‘phonebook’ of your sequence alignment file. Indexing can be done with samtools
as well, but it first needs to be sorted on coordinate (i.e. the alignment location). You can do it like this:
samtools index mother.markdup.rg.bam
4. Piping and looping
Samtools can quite easily be used in a UNIX pipeline. This has the advantage that you don’t need to write many intermediate files. However, the developers have not been very consistent with managing input and output (I’m sure they had their reasons). To use samtools in a pipe, the input argument needs to replaced with a -
. Also, some commands do not write by default to stdout, but to a specified file (this is the case for e.g. samtools fixmate
and samtools markdup
). In that case, also the output argument should be replaced with -
.
Example
For the command samtools addreplacerg
the samtools documentation provides the following synopsis:
samtools addreplacerg [-r rg-line | -R rg-ID] [-m mode] [-l level] [-o out.bam] in.bam
Meaning that it requires in.bam
, and can write to out.bam
if option -o
is provided. By default, it writes to stdout. So, if you pipe to samtools addreplacerg
you would only need to replace the input file with a -
:
some_command | samtools addreplacerg [options] - > output.sam
For the command samtools fixmate
, the samtools documentation provides this synopsis:
samtools fixmate [-rpcm] [-O format] in.nameSrt.bam out.bam
Meaning that it requires both the input file in.nameSrt.bam
and the output file out.bam
. So, if you pipe to samtools fixmate
and you want to write to stdout (so piping from), you’ll need to replace both the input and output with a -
:
some_command | samtools fixmate [options] - - > output.sam
The most frequently used samtools commands don’t require an input nor an output file, and therefore behave like many UNIX commands. An example of this is samtools sort
. The synopsis is:
samtools sort [-l level] [-m maxMem] [-o out.bam] [-O format] [-n] [-t tag] [-T tmpprefix] [-@ threads] [in.sam|in.bam|in.cram]
Telling us that both the input file and output (with option -o
) file are optional. If the input file is absent, it reads from stdin. So, you could use it without a -
replacing input or output files:
some_command | samtools sort > output.sam
Let’s put everything we’ve done so far in a pipe and loop over our three samples.
The command below loops over the strings father
, mother
and son
, and performs these tasks:
- Create a variable to work on data of mother, father and son separately
- Perform the alignment
- Fill in the mate coordinates and sort on coordinate
- Mark duplicates
- Add readgroups
- Compress the output
- Create an index
#!/usr/bin/env bash
cd ~/workdir
for sample in mother father son
do
bwa mem data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
data/fastq/"$sample"_R1.fastq.gz \
data/fastq/"$sample"_R2.fastq.gz \
| samtools fixmate -m - - \
| samtools sort \
| samtools markdup -s - - \
| samtools addreplacerg -r ID:$sample -r SM:$sample -r PL:ILLUMINA - \
| samtools view -bh > alignment/$sample.bam
samtools index alignment/$sample.bam
done
Exercise: For each task (1-7), figure out which part of the script performs that task. After that, run it to get the alignments of all three samples.
Answer
Creating variables (1):
for sample in mother father son
do
...
done
Perform the alignment (2):
bwa mem data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
data/fastq/"$sample"_R1.fastq.gz \
data/fastq/"$sample"_R2.fastq.gz
Fill in the mate coordinates and sort on coordinate (3):
samtools fixmate -m <INPUT> <OUTPUT> \
| samtools sort
Mark duplicates (4):
samtools markdup -s <INPUT> <OUTPUT>
Add readgroups (5):
samtools addreplacerg -r ID:$sample -r SM:$sample -r PL:ILLUMINA <INPUT>
Compress the output (6):
samtools view -bh > alignment/$sample.bam
Create an index (7):
samtools index alignment/$sample.bam
5. Additional exercise: Add readgroups and mark duplicates with picard
Prepare a sorted .bam (without RG and marked duplicates) from mother.sam (the current mother.bam has readgroups and marked dups):
samtools sort mother.sam | samtools view -bh > mother.nrg.bam
Then, run gatk AddOrReplaceReadGroups
on the file mother.nrg.bam . After that run: gatk MarkDuplicates
on the alignment file with readgroups added.
Find the documentation here:
Exercise: do you see any differences with the output we have generated with samtools
?
Answer
Your script should look like this:
#!/bin/bash
cd ~/workdir
samtools sort alignment/mother.sam \
| samtools view -bh > alignment/mother.nrg.bam
gatk AddOrReplaceReadGroups \
--INPUT alignment/mother.nrg.bam \
--OUTPUT alignment/mother.nrg.rg.bam \
--RGID H0164.2 \
--RGLB lib1 \
--RGPU H0164ALXX140820.2 \
--RGPL ILLUMINA \
--RGSM mother
gatk MarkDuplicates \
--INPUT alignment/mother.nrg.rg.bam \
--OUTPUT alignment/mother.nrg.rg.md.bam \
--METRICS_FILE alignment/marked_dup_metrics_mother.txt
More information on read groups here. In this documentation it is stated how to specify RGID
and RGPU
.
Both samtools
and gatk MarkDuplicates
mark the same number of duplicates. However, gatk
forces you to add readgroups that are appropriate for further downstream processing with gatk
.