Skip to content

Enrichment analysis



This chapter uses the gbm dataset

Load the following packages:


If the FindMarkers or FindAllMarkers functions were used, we have a table containing only the significant genes, but we don’t have any information for the non-significant genes. Therefore, we can use the over-representation analysis which is a threshold-based method. Using our list of significant genes, we can test if any gene set is over-represented in our data or not using a test similar to a Fisher test to compare differences in proportions.

The clusterProfiler package provides functions for over-representation analysis of Gene Ontology gene sets (among other functions) or KEGG gene sets.

Genes can be labeled using different types of labels, eg symbol, ensembl ID, Entrez ID. To list the allowed label types use:

BiocManager::install("", update = FALSE)

Let’s select a set of genes that are upregulated in the astrocytes compared to the macrophages:

AC_up_DEG <- subset(DEG_astro_vs_macro,
                    DEG_astro_vs_macro$avg_log2FC > 0 &
                    DEG_astro_vs_macro$p_val_adj < 0.05)
AC_up_genes <- rownames(AC_up_DEG)

We can do a gene ontology term enrichment analysis based on this set of genes:

AC_MAC_GO <- clusterProfiler::enrichGO(AC_up_genes, # vector of up regulated genes
                    "", # orgdb= package that contains gene label types correspondances
                    keyType = "SYMBOL", # indicate that genes are labeled using symbols
                    ont = "BP", # which of the GO categories to test, here the "Biological Processes"
                    minGSSize = 50) # exclude gene sets that contain less than 50 genes

The results are stored in the @result slot:


We can quite easily generate an enrichment map with the enrichplot package:

                     showCategory = 30, cex_label_category = 0.5)

In stead of testing for gene ontology terms, we can also test for other gene set collections. For example the hallmark collection from MSigDB:

gmt <- clusterProfiler::read.gmt("data/gbm_dataset/h.all.v7.2.symbols.xls")

We can use the function enricher to test for enrichment of any set of genes. But we would have to test it against a “universe”, i.e. the background genes:

AC_MAC_enrich <- clusterProfiler::enricher(gene = AC_up_genes,
                                           universe = rownames(gbm),
                                           pAdjustMethod = "BH",
                                           pvalueCutoff  = 0.05,
                                           qvalueCutoff  = 0.05,
                                           TERM2GENE = gmt)

The most signifcantly enriched group of genes is HALLMARK_MYC_TARGETS_V1:


You can get a vector of gene symbols that are in the set of MYC targets like this:

myc_target_genes <- gmt$gene[gmt$term=="HALLMARK_MYC_TARGETS_V1"]

Exercise: Calculate a module score for each cell for the MYC target genes by using the function AddModuleScore, and generate a violin plot of the gbm dataset to view differences of this score between SingleR annotations.


Generating module scores:

gbm <- Seurat::AddModuleScore(gbm,
                              features = list(myc_target_genes=myc_target_genes),
                              name = "myc_target_genes")

Generate violing plot. Note that the scores generated by AddModuleScore are stored in gbm$my_target_genes1:

Seurat::VlnPlot(gbm, "myc_target_genes1", = "SingleR_annot")


Showing that the score is on average higher in the cells annotated as astrocyte compared to cells annotated as macrophage.

Save the dataset and clear environment

Now, save the dataset so you can use it later today:

saveRDS(gbm, "gbm_day3.rds")

Clear your environment:

rm(list = ls())