Quantify Peak Accessibility
Overview
After building a consensus peak set, we need to quantify the number of reads overlapping each peak in each sample. This creates a count matrix for downstream differential accessibility analysis.
Learning Objectives
- Convert peak coordinates to appropriate format for read counting
- Use featureCounts to quantify reads in peaks across samples
- Generate comprehensive quality control reports
1. Count reads on peaks
After defining a consensus peak set, we will count the number of reads that overlap each peak in each sample. This step can be seen as counting reads in annotated genes for RNA-seq data, where instead of genes, we are counting reads in peaks.
There are different tools that can be used for this purpose, such as featureCounts from the Subread package or bedtools coverage.
In this example, we will use featureCounts, which is a widely used tool for counting reads in genomic features.
1.1 Prepare Peak Annotation for Quantification
FeatureCounts requires annotation in GTF or SAF (Simplified Annotation Format).
The SAF format is a tab-delimited file with the following columns:
GeneID, Chr, Start, End, Strand
However, your consensus peak is in BED format with only 3 columns:
# Check file format
head results/04_consensus_peaks/consensus_peaks.bed
Task1: Compare BED to SAF format
- Which are the 2 columns you are missing to convert your BED file into SAF format?
- Can you think of a way to provide values to those fields?
Answer
GeneID and Strand fields are missing.
Strand is missing since the dataset has no strand information.
GeneID would correspond to Interval_ID, but this information has been lost while creating the consensus set, given that peaks have been merged.
Strand: Information is not mandatory, you can provide a . instead.
GeneID: You can generate sequential Interval_ID with incremental numbers (e.g., “Interval_1”, “Interval_2”).
Task 2: Convert Consensus Peaks to SAF Format
- Create a new folder named:
results/05_counts_reads - Use awk or an alternative way to convert the bed file to SAF format. Don’t forget to add a header “GeneID Chr Start End Strand”.
Hint
Use awk to:
- Add sequential peak IDs (e.g., “Interval_1”, “Interval_2”).
- Convert BED coordinates to SAF format.
- Add strand information (use “.” for unstranded).
Solution
# Create output directory
mkdir results/05_counts_reads
# Create SAF file with header
echo "GeneID Chr Start End Strand" > results/04_consensus_peaks/consensus_peaks.saf
# Convert BED to SAF format
awk '{OFS = "\t"} {print "Interval_"NR,$1,$2,$3,"."}' results/04_consensus_peaks/consensus_peaks.bed >> results/04_consensus_peaks/consensus_peaks.saf
# check the new file format
head results/04_consensus_peaks/consensus_peaks.saf
-
echo -e: Creates header with tab separators.-
awk OFS="\t": Sets output field separator to tab.-
"Interval_"NR: Creates unique ID using row number.-
$1,$2,$3: BED coordinates (chr, start, end).-
".": Unstranded (appropriate for ATAC-seq peaks).
Now we can use featureCounts to count the reads in each peak for each sample
Task 3: Count reads in peaks
- Run featureCounts to quantify how many filtered reads (in
*qc_bl_filt.sorted.bam) overlap consensus peaks in each sample. - Use paired-end counting mode
- Count all filtered BAM files simultaneously
Featurecounts
-F SAF: specify that the annotation file is in SAF format
-p: specify that the input files are paired-end
-a: specify the annotation file
-o: specify the output file
These are only some of all parameteres one can apply. You can also adjust the parameters of featureCounts based on your specific requirements, such as setting a minimum mapping quality or handling multi-mapping reads.
Solution
path_bams="results/01_filtered_bams"
featureCounts -F SAF -T 2 -p -a results/04_consensus_peaks/consensus_peaks.saf -o results/05_counts_reads/feature_Counts.txt $path_bams/*qc_bl_filt.sorted.bam 2> results/05_counts_reads/featureCounts.log
The output file will contain the counts of reads in each peak for each sample, which can be used for downstream analysis such as differential accessibility analysis.
Task 4: Examine the results
Have a look to the output, which file contains the counts matrix?:
ls results/05_counts_reads/
Answer
FeatureCounts generates:
- feature_Counts.txt: Main count matrix.
- feature_Counts.txt.summary: Counting statistics.
- featureCounts.log: Processing log and any warnings.
File results/05_counts_reads/feature_Counts.txt contains the counts table with:
- Rows: Annotated peaks.
- Columns: peaks annotation information (from SAF file) + Samples (counts in each sample)
- Check how many peaks contains the counts matrix, is it the same number you had in your consensus peak set?
# Count number of peaks quantified
wc -l results/05_counts_reads/feature_Counts.txt
Answer
Yes, it is the same number of peaks, although not same number of rows because feature_Counts.txt has 2 lines header.
2. Quality Control Assessment
After filtering, peak calling and counting reads in peaks, we can run MultiQC to aggregate the QC metrics from different steps and generate a comprehensive report.
This will help us to assess further the overall quality of the ATAC-seq data and identify any potential issues that may need to be addressed before proceeding with downstream analysis.
Task 3: Generate MultiQC Report
- Run multiqc on the entire
resultsdirectory
Solution
multiqc --outdir results/multiQC_report --title "ATAC-seq_Pipeline_Summary" results/
-
--outdir: Output directory for the report.-
--title: Custom title for the report.-
results/: Search this directory for QC files.
Download and open the report and explore the QC metrics:
- How many peaks were successfully quantified?
- What percentage of reads were assigned to peaks in each sample?
- Are there differences in counting efficiency between samples?
- Do any samples show concerning quality metrics?