Skip to content

Peak Calling

Overview

After quality control, we will identify accessible chromatin regions (peaks) using different approaches and fragment size filters.

We will compare peak calling results using:
1. Only nucleosome-free fragments
2. All fragments combined
3. HMMRATAC (ATAC-seq specific peak caller)

Learning Objectives

  • Understand nucleosome-free vs. nucleosome-associated fragments
  • Use MACS3 for peak calling with different fragment size filters
  • Compare peak calling results from different approaches
  • Learn about ATAC-seq specific peak caller HMMRATAC

1. Fragment size filtering

Before calling peaks with MACS3, we will separate fragments based on their size:

  • Nucleosome-free (NF) fragments (< 100 bp): Represent open chromatin regions
  • Nucleosome-associated fragments (> 100 bp): Represent regions with positioned nucleosomes

Task 1: Filter for nucleosome-free (NF) fragments

We can do that using samtools view as we have seen in previous exercises, and filter reads based on fragment length.

  • Which column from the .bam file contains fragment length information?
Answer

In SAM format, column 9 is described as “TLEN: signed observed Template LENgth”, which corresponds to the insert size length.

With the following code, you will:
- Create output directory: results/01_filtered_bams/NF_bams.
- Use samtool view to filter BAM files to keep only fragments with length 1-100 bp.
- Save filtered files as: results/01_filtered_bams/NF_bams/${sample_name}_NF.bam.
- Maintain BAM file headers during filtering.

Code

# create new directory for peaks
mkdir -p results/01_filtered_bams/NF_bams

# filter for NF fragments only

for sample_name in "${samples[@]}"; do
    echo "Processing sample: $sample_name"

    # run command
    samtools view -h $path_bams/$sample_name.qc_bl_filt.sorted.bam | awk 'substr($0,1,1)=="@" || ($9>= 1 && $9<=100) || ($9<=-1 && $9>=-100)' | \
    samtools view -b > $path_bams/NF_bams/${sample_name}_NF.bam 
done 

Parameter explanations:
- -h: Keep header in output
- -b: Output in BAM format
- $9: Insert size (TLEN field in SAM format)
- Filter logic: Keep fragments with insert size between 1 and 100bp (nucleosome-free regions)

Task 2: sort and index BAM files

  • Sort and index the bams files for next step
  • Remove the unsorted file
for sample_name in "${samples[@]}"; do
    echo "Processing sample: $sample_name"

    # run command
    samtools sort -o $path_bams/NF_bams/${sample_name}_NF.sorted.bam $path_bams/NF_bams/${sample_name}_NF.bam
    samtools index $path_bams/NF_bams/${sample_name}_NF.sorted.bam
done

Warning

Important before deleting:
Run this command only if you have named your files exactly as specified in this tutorial and have completed all sorting and indexing steps from the previous command.

Remove intermediate files

rm results/01_filtered_bams/NF_bams/*_NF.bam

2. Peak calling strategies

2.1 MACS3 Peak Calling

MACS3 is a widely used peak calling tool that can handle both narrow and broad peaks. We’ll start using MACS3 function “callpeak” and compare results using different fragment size filters.

Peak Calling with Nucleosome-Free Fragments

We will first call peaks using MACS3 on NF reads only, focusing on fragments most likely to represent open chromatin regions.

Task 3: MACS3 on NF fragments

  • Create a new folder named: results/03_peak_calling
  • Create a subfolder inside called: NF_peaks
  • Call peaks on NF reads using MACS3 callpeak function. Save the results inside results/03_peak_calling/NF_peaks/ and name the files as: NF_peaks_${sample_name}

Macs3

You can have a look at the MACS3 documentation for more details on the parameters used here

Hint

For ATAC-seq data, we will use the BAMPE format, which is suitable for paired-end data and we will set the genome size to “mm” for mouse. We will also set a q-value cutoff of 0.01 to control the false discovery rate.

Solution

mkdir -p results/03_peak_calling/NF_peaks
path_peaks="results/03_peak_calling/"

for sample_name in "${samples[@]}"; do
    echo "Processing sample: $sample_name"

    # run command
    macs3 callpeak -f BAMPE -t $path_bams/NF_bams/${sample_name}_NF.sorted.bam -g mm -q 0.01 --name ${sample_name}_NF --outdir results/03_peak_calling/NF_peaks/NF_peaks_${sample_name}/
done
parameters explanation:
-f BAMPE: input file format is BAM paired-end
-t: input file (BAM file)
-g mm: It’s the mappable genome size or effective genome size (some are pre-computed, like mouse, and you can specify “mm” -q 0.01: q-value cutoff for peak detection
--outdir: output directory for peak files

Peak Calling with all Fragments

We will do the same, but using all fragments

Task 4: MACS3 on all fragments

  • Create a new folder named: results/03_peak_calling/all_peaks
  • Call peaks on all filtered reads using MACS3 callpeak function. Save the results inside results/03_peak_calling/all_peaks/ and name the files as: all_peaks_${sample_name}
Solution
mkdir -p results/03_peak_calling/all_peaks

for sample_name in "${samples[@]}"; do
    echo "Processing sample: $sample_name"

    # run command
    macs3 callpeak -f BAMPE -t $path_bams/${sample_name}.qc_bl_filt.sorted.bam -g mm -q 0.01 --name ${sample_name}_all --outdir $path_peaks/all_peaks/all_peaks_${sample_name}/ 2> $path_peaks/all_peaks/${sample_name}_macs3.log
done

2.2 HMMRATAC Peak Calling

HMMRATAC is specifically designed for ATAC-seq data and uses a Hidden Markov Model to identify accessible chromatin regions. It models the distribution of fragment lengths to distinguish between nucleosome-free regions and nucleosome-bound regions, to provide a more accurate identification of open chromatin regions.

We will use the filtered BAM files (all fragments) for this analysis.

Bonus Task

  • Create a new directory named: results/03_peak_calling/hmmratac_peaks.
  • Have a look at the HMMRATAC documentation for more details on the parameters used.
  • This step can take a bit longer, while waiting you can move to the next section of the tutorial.
Solution

mkdir -p results/03_peak_calling/hmmratac_peaks

for sample_name in "${samples[@]}"; do
    echo "Processing sample: $sample_name"
    macs3 hmmratac -i $path_bams/${sample_name}.qc_bl_filt.sorted.bam -f BAMPE --name ${sample_name}_hmmratac --outdir $path_peaks/hmmratac_peaks/hmmratac_peaks_${sample_name}/
done
Parameters explanation:
-i: input file (BAM file)
-f BAMPE: input file format is BAM paired-end
--outdir: output directory for peak files

3. Compare Peak Calling Results

Now let’s visualise and compare the different peak calling results using the Integrative Genomics Viewer (IGV) to understand how each method performs.

Task 5: Load traks into IGV

For Cerebrum_rep1 sample, download and load the following files in IGV:

Peak Files:

  • MACS3 NF peaks: results/03_peak_calling/NF_peaks/NF_peaks_Cerebrum_rep1/Cerebrum_rep1_NF_peaks.narrowPeak
  • MACS3 all peaks: results/03_peak_calling/all_peaks/all_peaks_Cerebrum_rep1/Cerebrum_rep1_all_peaks.narrowPeak
  • HMMRATAC peaks: results/03_peak_calling/hmmratac_peaks/hmmratac_peaks_Cerebrum_rep1/Cerebrum_rep1_hmmratac_accessible_regions.narrowPeak (if you completed the bonus task, if you didn’t and you are curious, you will find the files in /data/Solutions/)

BAM Files for Read Coverage:

  • All fragments BAM: results/01_filtered_bams/Cerebrum_rep1.qc_bl_filt.sorted.bam
  • Nucleosome-free fragments BAM: results/01_filtered_bams/NF_bams/Cerebrum_rep1_NF.sorted.bam

Task 6: Compare results

  • Do you observe big differences between methods?
  • Which method would work better to study TF binding sites? and for DA analysis?
  • Can you find an example of a gene where the nucleosomal pattern at TSS can be percieved?