Read counting refers to the quantification of an “expression level”, or abundance, from reads mapped onto a reference genome/transcriptome. This expression level can take several forms, such as a count, or a fraction (RPKM/TPM), and concern different entities (exon, transcript, genes) depending on your biological application.
During this lesson, you will learn to:
- differentiate between different levels of counting and their relevance for different questions.
- perform read counting at the gene level for Differential Gene expression.
Read counting with featureCounts
The featureCount website provides several useful command-line examples to get started. For more details on the algorithm behavior (with multi/overlapping reads for instance), you can refer to the package’s User’s guide (go to the read summarization chapter).
- Decide which parameters are appropriate for counting reads from the Ruhland dataset. Assume you are interested in determining which genes are differentially expressed.
- Count the reads from one of your BAM files using featureCount.
How do the featureCount-derived counts compare to those from STAR ?
You can find bam files at
- featureCount requirements : 400M RAM / BAM file
- featureCount requirements : 2 min CPU time / BAM file
#!/usr/bin/bash #SBATCH --job-name=featurecount #SBATCH --time=00:30:00 #SBATCH --cpus-per-task=8 #SBATCH --mem=4G #SBATCH -o count.o #SBATCH -e count.e G_GTF=/shared/data/DATA/Mus_musculus.GRCm39.105.gtf inFOLDER=/shared/data/Solutions/Ruhland2016/STAR_Ruhland2016 outFOLDER=featureCOUNT_Ruhland2016 ml subread mkdir -p $outFOLDER featureCounts -T 8 -a $G_GTF -t exon -g gene_id -o featureCounts_Ruhland2016.counts.txt \ $inFOLDER/SRR3180535_EtOH1_1.fastq.gzAligned.sortedByCoord.out.bam \ $inFOLDER/SRR3180536_EtOH2_1.fastq.gzAligned.sortedByCoord.out.bam \ $inFOLDER/SRR3180537_EtOH3_1.fastq.gzAligned.sortedByCoord.out.bam \ $inFOLDER/SRR3180538_TAM1_1.fastq.gzAligned.sortedByCoord.out.bam \ $inFOLDER/SRR3180539_TAM2_1.fastq.gzAligned.sortedByCoord.out.bam \ $inFOLDER/SRR3180540_TAM3_1.fastq.gzAligned.sortedByCoord.out.bam
comparison with STAR counts
You can use this little R script to check they are the same :
fc = read.table( "featureCounts_SRR3180535_EtOH1_1.counts.txt" , header =T) rownames( fc ) = fc$Geneid head( fc ) star = read.table( "SRR3180535_EtOH1_1.fastq.gzReadsPerGene.out.tab") rownames( star ) = star$V1 head( star ) star_count = star[ rownames( fc ) , 'V2' ] fC_count = fc$STAR_Ruhland2016.SRR3180535_EtOH1_1.fastq.gzAligned.sortedByCoord.out.bam plot(log10( star_count + 1), log10(fC_count+1) ) quantile( star_count - fC_count)