Skip to content

Read counting

Read counting refers to the quantification of an “expression level”, or abundance, from reads mapped onto a reference genome/transcriptome. This expression level can take several forms, such as a count, or a fraction (RPKM/TPM), and concern different entities (exon, transcript, genes) depending on your biological application.

During this lesson, you will learn to:

  • differentiate between different levels of counting and their relevance for different questions.
  • perform read counting at the gene level for Differential Gene expression.

Material

Download the presentation

featureCounts website

Read counting with featureCounts

The featureCount website provides several useful command-line examples to get started. For more details on the algorithm behavior (with multi/overlapping reads for instance), you can refer to the package’s User’s guide (go to the read summarization chapter).

Task :

  • Decide which parameters are appropriate for counting reads from the Ruhland dataset. Assume you are interested in determining which genes are differentially expressed.
  • Count the reads from one of your BAM files using featureCount.
  • How do the featureCount-derived counts compare to those from STAR ?

  • You can find bam files at /shared/data/Solutions/Liu2015/STAR_Liu2015 and /shared/data/Solutions/Ruhland2016/STAR_Ruhland2016

  • featureCount requirements : 400M RAM / BAM file
  • featureCount requirements : 2 min CPU time / BAM file
featureCounts script
#!/usr/bin/bash
#SBATCH --job-name=featurecount
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=4G
#SBATCH -o count.o
#SBATCH -e count.e


G_GTF=/shared/data/DATA/Mus_musculus.GRCm39.105.gtf

inFOLDER=/shared/data/Solutions/Ruhland2016/STAR_Ruhland2016
outFOLDER=featureCOUNT_Ruhland2016

ml subread

mkdir -p $outFOLDER

featureCounts -T 8 -a $G_GTF -t exon -g gene_id -o featureCounts_Ruhland2016.counts.txt \
                                    $inFOLDER/SRR3180535_EtOH1_1.fastq.gzAligned.sortedByCoord.out.bam \
                                    $inFOLDER/SRR3180536_EtOH2_1.fastq.gzAligned.sortedByCoord.out.bam \
                                    $inFOLDER/SRR3180537_EtOH3_1.fastq.gzAligned.sortedByCoord.out.bam \
                                    $inFOLDER/SRR3180538_TAM1_1.fastq.gzAligned.sortedByCoord.out.bam \
                                    $inFOLDER/SRR3180539_TAM2_1.fastq.gzAligned.sortedByCoord.out.bam \
                                    $inFOLDER/SRR3180540_TAM3_1.fastq.gzAligned.sortedByCoord.out.bam
comparison with STAR counts

You can use this little R script to check they are the same :

fc = read.table( "featureCounts_SRR3180535_EtOH1_1.counts.txt" , header =T)
rownames( fc ) = fc$Geneid
head( fc )

star = read.table( "SRR3180535_EtOH1_1.fastq.gzReadsPerGene.out.tab")
rownames( star ) = star$V1
head( star )


star_count = star[ rownames( fc ) , 'V2' ]
fC_count = fc$STAR_Ruhland2016.SRR3180535_EtOH1_1.fastq.gzAligned.sortedByCoord.out.bam
plot(log10( star_count + 1),
    log10(fC_count+1) )

quantile( star_count  - fC_count)