Alignment advanced
Learning outcomes¶
After having completed this chapter you will be able to:
- Use
samtoolsto mark duplicates from an alignment file - Use
samtoolsto add readgroups to an alignment file - Use a for loop in bash to perform the same operation on a range of files
- Use
samtoolsin a pipe to efficiently do multiple operations on an alignment file in a single command
Material¶
samtoolsdocumentation
Exercises¶
1. Adding readgroups¶
During several steps of variant calling gatk uses read group information. For each read, this gives information on the sequencing platform, the library, the lane and of course the sample. Have a look at the description of the different levels of read group information gatk uses here.
Exercise: The documentation mentions several read group fields that are used by gatk. Have a look at the fastq header. Does that give you the information that is required? Do we have that information for our sample? Can you specify it for our sample?
Hint
You can have a look at the first few entries in the fastq file with:
Answer
Most of the information you should now based on the experimental design, the rest you can find in the fastq header:
PL: the platform. Should be quite obvious; you usually you have this information. For us, this would beILLUMINASM: the sample. All alignments that have reads coming from the same individual should have the same identifier in this field. For us, this would bemother.LB: library identifier. Molecular duplicates only exist within a library. If a single library was sequenced on multiple lanes, it is important to track this information. In our case, we have sequenced only one library, so you can specify it with e.g.lib1.PU: platform unit. This field is used to identify the sequencing lane. The documentation tells us we should specify it as[FLOWCELL].[LANE].[SAMPLE BARCODE]. The header of the first entry in our fastq file looks like this:@H0164ALXX140820:2:1101:2136:40460/1. Where the flowcell ID isH0164and the lane2. This formatting is specific to Broad Genomic Services pipelines, and not very common nowadays. Here the sample barcode is added to the flowcell ID, and is therefore specified as ALXX140820. We can therefore specify it withH0164.2.ALXX140820.ID: read group id. This is a unique identifier for the read groups. It can be a combination of the flow cell, lane and library (sample). So, for exampleH0164.2.mother. Of course in stead ofmotheryou can also use the sample barcode.
Note
More modern output of an Illumina sequencer looks e.g. like this (example on Wikipedia):
Here, e.g. the PU field would be FC706VJ.2.ATCACG
Exercise: Have a look at the documentation of AddOrReplaceReadGroups. Specify the required arguments, and run the command. Do this from a script called B05_add_readgroups.sh (in ~/project/scripts/B-mother_only).
Answer
We can use the answers of the previous exercise, and use them in the command:
Exercise: Compare the header and first alignments of mother.bam and mother.rg.bam. Notice any differences?
Hint
You can view the header with
And the first few alignments with
Answer
Compared to the header of mother.bam, the header of mother.rg.bam contains an extra line starting with @RG:
In the alignment records, a tag was added at the very end of each line: RG:Z:H0164.2.mother. Note that all fields (LB, PU, etc.) are related to ID. So for each read only ID is specified and all other fields can be deducted from that.
2. Mark duplicates¶
Now that we have specified read groups, we can mark the duplicates with gatk MarkDuplicates.
Exercise: Have a look at the documentation, and run gatk MarkDuplicates with the three required arguments. Do this from a script called B06_mark_duplicates.sh (in ~/project/scripts/B-mother_only).
Answer
Exercise: Run samtools flagstat on the alignment file with marked duplicates, and write the output to a file called mother.rg.md.bam.flagstat. Create a script called B07_get_alignment_stats_after_md.sh (in ~/project/scripts/B-mother_only). How many reads were marked as duplicate?
Answer
#!/usr/bin/env bash
cd ~/project/results/alignments
samtools flagstat mother.rg.md.bam > mother.rg.md.bam.flagstat
Gives:
133477 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
317 + 0 supplementary
17329 + 0 duplicates
132892 + 0 mapped (99.56% : N/A)
133160 + 0 paired in sequencing
66580 + 0 read1
66580 + 0 read2
131470 + 0 properly paired (98.73% : N/A)
131990 + 0 with itself and mate mapped
585 + 0 singletons (0.44% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
3. Indexing¶
To look up specific alignments, it is convenient to have your alignment file indexed. An indexing can be compared to a kind of 'phonebook' of your sequence alignment file. Indexing can be done with samtools as well, but it first needs to be sorted on coordinate (i.e. the alignment location). You can do it like this:
Exercise: Create a script called B08_index_alignment.sh (in ~/project/scripts/B-mother_only) to perform the alignment.
Answer
4. Recap - mother only¶
Now we have performed now the following steps on the sample mother:
- Read alignment
- Adding readgroups
- Marking of duplicates
- Indexing
Your scripts directory should look like this:
├── A-prepare_references
│ ├── A01_download_course_data.sh
│ └── A02_create_bwa_index.sh
├── B-mother_only
│ ├── B01_alignment.sh
│ ├── B02_get_alignment_statistics.sh
│ ├── B03_sort_alignment.sh
│ ├── B04_compress_alignment.sh
│ ├── B05_add_readgroups.sh
│ ├── B06_mark_duplicates.sh
│ ├── B07_get_alignment_stats_after_md.sh
│ └── B08_index_alignment.sh
└── C-all_samples
5. Apply it on all three samples with pipes and loops¶
We now apply these steps to all three samples. In order to do that, we combine the alignment with the sorting and compression in one command. We can do that with piping the output of bwa to samtools sort and samtools view, like this:
SAMPLE="mother"
bwa mem data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
data/fastq/"$SAMPLE"_R1.fastq.gz \
data/fastq/"$SAMPLE"_R2.fastq.gz \
| samtools sort \
| samtools view -bh > results/alignments/"$SAMPLE".bam
Exercise: Make a directory in the scripts directory C-all_samples (so ~/project/scripts/C-all_samples). In here, create a script called C01_alignment_sorting_compression.sh. Within that script use the above snippet to make a loop that performs the alignment, sorting and compression for all three samples (i.e. mother, father and son).
Answer
Your scripts directory should look like:
scripts
├── A-prepare_references
│ ├── A01_download_course_data.sh
│ └── A02_create_bwa_index.sh
├── B-mother_only
│ ├── B01_alignment.sh
│ ├── B02_get_alignment_statistics.sh
│ ├── B03_sort_alignment.sh
│ ├── B04_compress_alignment.sh
│ ├── B05_add_readgroups.sh
│ ├── B06_mark_duplicates.sh
│ ├── B07_get_alignment_stats_after_md.sh
│ └── B08_index_alignment.sh
└── C-all_samples
└── C01_alignment_sorting_compression.sh
And the script:
#!/usr/bin/env bash
cd ~/project
for SAMPLE in mother father son
do
bwa mem data/reference/Homo_sapiens.GRCh38.dna.chromosome.20.fa \
data/fastq/"$SAMPLE"_R1.fastq.gz \
data/fastq/"$SAMPLE"_R2.fastq.gz \
| samtools sort \
| samtools view -bh > results/alignments/"$SAMPLE".bam
done
Now we continue with adding the readgroups. For each sample, we have to add specific information to the different readgroup fields. We can do that by looping over a tab delimited file with sample-specific information in each row. Let's create that tab-delimited file.
Exercise Generate a tab-delimited file called sample_rg_fields.txt and store it in ~/project/results/. In this file, each line should represent a sample (mother, father and son), and you specify the SM, LB, PU and ID fields. E.g., the first line (for 'mother') would look like:
Warning
Make sure to add a newline (Enter) at the end of the file. Otherwise a loop will stop at the second-last line.
Answer
Your file should look like this:
Exercise Generate a script called C02_add_readgroups.sh (in ~/project/scripts/C-all_samples) to loop over the tab-delimited file (have a look at the last exercise in Setup), and add the correct readgroups to the bam file of each sample with gatk AddOrReplaceReadGroups.
Hint
Try to just print the variables from a loop in order to check to see whether the loop performs according to your expectation. E.g.:
Answer
#!/usr/bin/env bash
cd ~/project/results
cat sample_rg_fields.txt | while read SAMPLE LB PU ID
do
gatk AddOrReplaceReadGroups \
--INPUT alignments/"$SAMPLE".bam \
--OUTPUT alignments/"$SAMPLE".rg.bam \
--RGLB "$LB" \
--RGPU "$PU" \
--RGPL ILLUMINA \
--RGSM "$SAMPLE" \
--RGID "$ID"
done
As final step, we will mark the duplicates and perform the indexing for the three samples.
Exercise: Generate two scripts called C03_mark_duplicates.sh and C04_index_alignments.sh, in which you loop over the sample names and perform the respective calculations. You can use B06_mark_duplicates.sh and B08_index_alignment.sh as a template.
Answer
6. Recap - all samples¶
Now, we have performed on all three samples:
- Alignment
- Sorting
- Compression
- Adding readgroups
- Marking of duplicates
- Indexing
Your scripts directory should look like this:
scripts
├── A-prepare_references
│ ├── A01_download_course_data.sh
│ └── A02_create_bwa_index.sh
├── B-mother_only
│ ├── B01_alignment.sh
│ ├── B02_get_alignment_statistics.sh
│ ├── B03_sort_alignment.sh
│ ├── B04_compress_alignment.sh
│ ├── B05_add_readgroups.sh
│ ├── B06_mark_duplicates.sh
│ ├── B07_get_alignment_stats_after_md.sh
│ └── B08_index_alignment.sh
└── C-all_samples
├── C01_alignment_sorting_compression.sh
├── C02_add_readgroups.sh
├── C03_mark_duplicates.sh
└── C04_index_alignment.sh