Read alignment - advanced

Learning outcomes

After having completed this chapter you will be able to:

Use samtools to mark duplicates from an alignment file
Use samtools to add readgroups to an alignment file
Use a for loop in bash to perform the same operation on a range of files
Use samtools in a pipe to efficiently do multiple operations on an alignment file in a single command

Material

samtools documentation

Exercises

1. Adding readgroups

During several steps of variant calling gatk uses read group information. For each read, this gives information on the sequencing platform, the library, the lane and of course the sample. Have a look at the description of the different levels of read group information gatk uses here.

Exercise: The documentation mentions several read group fields that are used by gatk. Have a look at the fastq header. Does that give you the information that is required? Do we have that information for our sample? Can you specify it for our sample?

Hint

You can have a look at the first few entries in the fastq file with:

zcat mother_R1.fastq.gz | head

Answer

Most of the information you should now based on the experimental design, the rest you can find in the fastq header:

PL: the platform. Should be quite obvious; you usually you have this information. For us, this would be ILLUMINA
SM: the sample. All alignments that have reads coming from the same individual should have the same identifier in this field. For us, this would be mother.
LB: library identifier. Molecular duplicates only exist within a library. If a single library was sequenced on multiple lanes, it is important to track this information. In our case, we have sequenced only one library, so you can specify it with e.g. lib1.
PU: platform unit. This field is used to identify the sequencing lane. The documentation tells us we should specify it as [FLOWCELL].[LANE].[SAMPLE BARCODE]. The header of the first entry in our fastq file looks like this: @H0164ALXX140820:2:1101:2136:40460/1. Where the flowcell ID is H0164 and the lane 2. This formatting is specific to Broad Genomic Services pipelines, and not very common nowadays. Here the sample barcode is added to the flowcell ID, and is therefore specified as ALXX140820. We can therefore specify it with H0164.2.ALXX140820.
ID: read group id. If you don’t have specific information on the flowcell and lane (specified with PU), you can use this field to specify a unique unit that is used for e.g. base quality score recalibration. This often a combination of a flow cell identifier and a lane. In our case this could be H0164.2

Note

More modern output of an Illumina sequencer looks e.g. like this (example on Wikipedia):

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

Here, e.g. the PU field would be FC706VJ.2.ATCACG

Exercise: Have a look at the documentation of AddOrReplaceReadGroups. Specify the required arguments, and run the command.

Answer

We can use the answers of the previous exercise, and use them in the command:

gatk AddOrReplaceReadGroups \
--INPUT alignment/mother.bam \
--OUTPUT alignment/mother.rg.bam \
--RGLB lib1 \
--RGPU H0164.2.ALXX140820 \
--RGPL ILLUMINA \
--RGSM mother \
--RGID H0164.2

Exercise: Compare the header and first alignments of mother.bam and mother.rg.bam. Notice any differences?

Hint

You can view the header with

samtools view -H <alignment.bam>

And the first few alignments with

samtools view <alignment.bam> | head

Answer

Compared to the header of mother.markdup.bam, the header of mother.markdup.rg.bam contains an extra line starting with @RG:

@RG     ID:H0164.2      LB:lib1 PL:ILLUMINA     SM:mother       PU:H0164.2.ALXX140820

In the alignment records, a tag was added at the very end of each line: RG:Z:H0164.2. Note that all fields (LB, PU, etc.) are related to ID. So for each read only ID is specified and all other fields can be deducted from that.

2. Mark duplicates

Now that we have specified read groups, we can mark the duplicates with gatk MarkDuplicates.

Exercise: Have a look at the documentation, and run gatk MarkDuplicates with the three required arguments.

Answer

gatk MarkDuplicates \
--INPUT alignment/mother.rg.bam \
--OUTPUT alignment/mother.rg.md.bam \
--METRICS_FILE alignment/marked_dup_metrics_mother.txt

Exercise: Run samtools flagstat on the alignment file with marked duplicates. How many reads were marked as duplicate?

Answer

samtools flagstat mother.rg.md.bam

Gives:

133477 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
317 + 0 supplementary
17329 + 0 duplicates
132892 + 0 mapped (99.56% : N/A)
133160 + 0 paired in sequencing
66580 + 0 read1
66580 + 0 read2
131470 + 0 properly paired (98.73% : N/A)
131990 + 0 with itself and mate mapped

585 + 0 singletons (0.44% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Which tells us that 17329 reads were marked as duplicate.

3. Indexing

To look up specific alignments, it is convenient to have your alignment file indexed. An indexing can be compared to a kind of ‘phonebook’ of your sequence alignment file. Indexing can be done with samtools as well, but it first needs to be sorted on coordinate (i.e. the alignment location). You can do it like this:

samtools index mother.rg.md.bam

4. Piping and looping

Exercise Generate a tab-delimited file in which each line represents a sample (mother, father and son), and where you specify the SM, LB, PU and ID fields. E.g., the first line (for ‘mother’) would look like:

mother  lib1    H0164.2.ALXX140820  H0164.2

Answer

Your file should look like this:

mother  lib1    H0164.2.ALXX140820  H0164.2
father  lib2    H0164.3.ALXX140820  H0164.3
son lib3    H0164.6.ALXX140820  H0164.6

Exercise: Below you can find a script that loops over each line in a file and creates a shell variable for each column in that line. Figure out where in the script each of the following tasks is performed:

adding read groups
alignment
indexing
compression
marking duplicates
sorting by coordinate

Exercise: Use the tab-delimited file you created as input for the ‘while loop’ to generate bam files for each sample that are sorted, compressed, have read groups added, duplicates marked and indexed. In the example the tab-delimited file is called sample_rg_fields.txt. In addition, add a command in which you generate the alignments statistics with samtools flagstat of the bam file with marked duplicates.

cat sample_rg_fields.txt | while read sample lb pu id
do
    bwa mem data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
    data/fastq/"$sample"_R1.fastq.gz \
    data/fastq/"$sample"_R2.fastq.gz \
    | samtools sort \
    | samtools view -bh > alignment/$sample.bam

    gatk AddOrReplaceReadGroups \
    --INPUT alignment/$sample.bam \
    --OUTPUT alignment/$sample.rg.bam \
    --RGLB $lb \
    --RGPU $pu \
    --RGPL ILLUMINA \
    --RGSM $sample \
    --RGID $id

    gatk MarkDuplicates \
    --INPUT alignment/$sample.rg.bam \
    --OUTPUT alignment/$sample.rg.md.bam \
    --METRICS_FILE alignment/marked_dup_metrics_$sample.txt 

    samtools index alignment/$sample.bam
done < sample_rg_fields.txt

Answer

cat sample_rg_fields.txt | while read sample lb pu id
do
    bwa mem data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
    data/fastq/"$sample"_R1.fastq.gz \
    data/fastq/"$sample"_R2.fastq.gz \
    | samtools sort \
    | samtools view -bh > alignment/$sample.bam

    gatk AddOrReplaceReadGroups \
    --INPUT alignment/$sample.bam \
    --OUTPUT alignment/$sample.rg.bam \
    --RGLB $lb \
    --RGPU $pu \
    --RGPL ILLUMINA \
    --RGSM $sample \
    --RGID $id

    gatk MarkDuplicates \
    --INPUT alignment/$sample.rg.bam \
    --OUTPUT alignment/$sample.rg.md.bam \
    --METRICS_FILE alignment/marked_dup_metrics_$sample.txt 

    samtools index alignment/$sample.bam

    samtools flagstat alignment/$sample.rg.md.bam > $sample.rg.md.stats
done