After having completed this chapter you will be able to:
samtools flagstatto get general statistics on the flags stored in a sam/bam file
- compress a sam file into a bam file
- filter on sam flags
- count alignments
- filter out a region
samtools sortto sort an alignment file based on coordinate
samtools indexto create an index of a sorted sam/bam file
- Use the pipe (
|) symbol to pipe alignments directly to
samtoolsto perform sorting and filtering
- Explain sam flags tool
Exercise: Check out the statistics of the E. coli alignment by using
samtools flagstat. Find the documentation here. Anything that draws your attention?
cd ~/workdir/alignment_output/ samtools flagstat SRR519926.sam
631808 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 627753 + 0 mapped (99.36% : N/A) 631808 + 0 paired in sequencing 315904 + 0 read1 315904 + 0 read2 302430 + 0 properly paired (47.87% : N/A) 624508 + 0 with itself and mate mapped 3245 + 0 singletons (0.51% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
Of the reads, 47.87% is properly paired. The rest isn’t. Proper pairing is quite hard to interpret. It usually means that the 0x2 flag (each segment properly aligned according to the aligner) is false. In this case it means that the insert size is high for a lot of sequences. That is because the insert size distribution is very wide. You can find info on insert size distribution like this:
samtools stats SRR519926.sam | grep ^SN | cut -f 2,3
Now look at
insert size average and
insert size standard deviation. You can see the standard deviation is higher than the average, suggesting a wide distribution.
Compression, sorting and indexing
samtools view is very versatile. It takes an alignment file and writes a filtered or processed alignment to the output. You can for example use it to compress your SAM file into a BAM file. Let’s start with that.
Exercise: compress our SAM file into a BAM file and include the header in the output. For this, use the
-h options. Find the required documentation here. How much was the disk space reduced by compressing the file?
Tip: Samtools writes to stdout
By default, samtools writes it’s output to stdout. This means that you need to redirect your output to a file with
> or use the the output option
samtools view -bh SRR519926.sam > SRR519926.bam
ls -lh, you can find out that
SRR519926.samhas a size of 264 Mb, while
SRR519926.bamis only 77 Mb.
To look up specific alignments, it is convenient to have your alignment file indexed. An indexing can be compared to a kind of ‘phonebook’ of your sequence alignment file. Indexing can be done with
samtools as well, but it first needs to be sorted on coordinate (i.e. the alignment location). You can do it like this:
samtools sort SRR519926.bam > SRR519926.sorted.bam samtools index SRR519926.sorted.bam
samtools view you can easily filter your alignment file based on flags. One thing that might be sensible to do at some point is to filter out unmapped reads.
Exercise: Check out the flag that you would need to filter for mapped reads. It’s at page 7 of the SAM documentation.
You will need the 0x4 flag.
Filtering against unmapped reads (leaving only mapped reads) with
samtools view would look like this:
samtools view -bh -F 0x4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam
samtools view -bh -F 4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam
Exercise: Write a command that outputs only the unmapped reads (so the opposite of the example). How many reads are in there? Is that the same as what we expect based on the output of
Check out the
-c options of
Filter like this:
samtools view -bh -f 0x4 SRR519926.sorted.bam > SRR519926.sorted.unmapped.bam
Counting like this:
samtools view -c SRR519926.sorted.unmapped.bam
samtools flagstat(631808 - 627753 = 4055)
samtools view also enables you to filter alignments in a specific region. This can be convenient if you don’t want to work with huge alignment files and if you’re only interested in alignments in a particular region. Region filtering only works for sorted and indexed alignment files.
Exercise: Filter our sorted and indexed BAM file for the region between 2000 and 2500 kb, and output it as a BAM file with a header.
Tip: Specifying a region
Our E. coli genome has only one chromosome, because only one line starts with
> in the fasta file
cd ~/workdir/ref_genome grep ">" ecoli-strK12-MG1655.fasta
>U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome
The part after the first space in the title is cut off for the alignment reference. So the code for specifying a region would be:
cd ~/workdir/alignment_output samtools view -bh SRR519926.sorted.bam U00096.3:2000000-2500000 > SRR519926.sorted.region.bam
Samtools is easy to use in a pipe. In this case you can replace the input file with a
-. For example, you can sort and compress the output of your alignment software in a pipe like this:
my_alignment_command \ | samtools sort - \ | samtools view -bh - \ > alignment.bam
The use of
In the modern versions of samtools, the use of
- is not needed for most cases, so without an input file it reads from stdin. However, if you’re not sure, it’s better to be safe than sorry.
Exercise: Write a script that maps the reads with bowtie2 (see chapter 2 of read alignment), sorts them, takes only the mapped reads, and outputs them as a BAM file with a header.
##!/usr/bin/env bash TRIMMED_DIR=~/workdir/trimmed_data REFERENCE_DIR=~/workdir/ref_genome ALIGNED_DIR=~/workdir/alignment_output bowtie2 \ -x $REFERENCE_DIR/ecoli-strK12-MG1655.fasta \ -1 $TRIMMED_DIR/paired_trimmed_SRR519926_1.fastq \ -2 $TRIMMED_DIR/paired_trimmed_SRR519926_2.fastq \ | samtools sort - \ | samtools view -F 0x4 -bh - \ > $ALIGNED_DIR/SRR519926.sorted.mapped.frompipe.bam