Enrichment analysis
Material
- MSigDB
clusterProfiler
vignette- Revigo
- Signaling Pathway Impact Analysis (SPIA)
- Original paper on GSEA
- STRING for protein-protein interactions
- GO figure! for plotting GO terms and the associated paper
Exercises
This chapter uses the
gbm
dataset
Load the following packages:
library(clusterProfiler)
library(enrichplot)
If the FindMarkers
or FindAllMarkers
functions were used,
we have a table containing only the significant genes,
but we don’t have any information for the non-significant
genes. Therefore, we can use the over-representation analysis
which is a threshold-based method.
Using our list of significant genes, we can test
if any gene set is over-represented in our data or not using a test
similar to a Fisher test to compare differences in proportions.
The clusterProfiler
package provides functions for over-representation
analysis of Gene Ontology gene sets (among other functions) or KEGG gene sets.
Genes can be labeled using different types of labels, eg symbol, ensembl ID, Entrez ID. To list the allowed label types use:
BiocManager::install("org.Hs.eg.db", update = FALSE)
library(org.Hs.eg.db)
AnnotationDbi::keytypes(org.Hs.eg.db)
Let’s select a set of genes that are upregulated in the astrocytes compared to the macrophages:
AC_up_DEG <- subset(DEG_astro_vs_macro,
DEG_astro_vs_macro$avg_log2FC > 0 &
DEG_astro_vs_macro$p_val_adj < 0.05)
AC_up_genes <- rownames(AC_up_DEG)
We can do a gene ontology term enrichment analysis based on this set of genes:
AC_MAC_GO <- clusterProfiler::enrichGO(AC_up_genes, # vector of up regulated genes
"org.Hs.eg.db", # orgdb= package that contains gene label types correspondances
keyType = "SYMBOL", # indicate that genes are labeled using symbols
ont = "BP", # which of the GO categories to test, here the "Biological Processes"
minGSSize = 50) # exclude gene sets that contain less than 50 genes
The results are stored in the @result
slot:
View(AC_MAC_GO@result)
We can quite easily generate an enrichment map with the enrichplot
package:
enrichplot::emapplot(enrichplot::pairwise_termsim(AC_MAC_GO),
showCategory = 30, cex_label_category = 0.5)
In stead of testing for gene ontology terms, we can also test for other gene set collections. For example the hallmark collection from MSigDB:
gmt <- clusterProfiler::read.gmt("data/gbm_dataset/h.all.v7.2.symbols.xls")
head(gmt)
We can use the function enricher
to test for enrichment of any set of genes. But we would have to test it against a “universe”, i.e. the background genes:
AC_MAC_enrich <- clusterProfiler::enricher(gene = AC_up_genes,
universe = rownames(gbm),
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
TERM2GENE = gmt)
The most signifcantly enriched group of genes is HALLMARK_MYC_TARGETS_V1
:
View(AC_MAC_enrich@result)
You can get a vector of gene symbols that are in the set of MYC targets like this:
myc_target_genes <- gmt$gene[gmt$term=="HALLMARK_MYC_TARGETS_V1"]
Exercise: Calculate a module score for each cell for the MYC target genes by using the function AddModuleScore
, and generate a violin plot of the gbm
dataset to view differences of this score between SingleR
annotations.
Answer
Generating module scores:
gbm <- Seurat::AddModuleScore(gbm,
features = list(myc_target_genes=myc_target_genes),
name = "myc_target_genes")
Generate violing plot. Note that the scores generated by AddModuleScore
are stored in gbm$my_target_genes1
:
Seurat::VlnPlot(gbm, "myc_target_genes1", group.by = "SingleR_annot")
Returning:
Showing that the score is on average higher in the cells annotated as astrocyte compared to cells annotated as macrophage.
Save the dataset and clear environment
Now, save the dataset so you can use it later today:
saveRDS(gbm, "gbm_day3.rds")
Clear your environment:
rm(list = ls())
gc()
.rs.restartR()