Enrichment analysis

Material

MSigDB
clusterProfiler vignette
Revigo
Signaling Pathway Impact Analysis (SPIA)
Original paper on GSEA
STRING for protein-protein interactions
GO figure! for plotting GO terms and the associated paper

Exercises

tum_vs_norm <- readRDS("day3/tum_vs_norm_day3-2.rds")
limma_de <- readRDS("day3/limma_de_day3-2.rds")
proB <- readRDS("day3/proB_day3-2.rds")

Load the following packages:

library(clusterProfiler)
library(ggiraph)
library(enrichplot)

If the FindMarkers or FindAllMarkers functions were used, we obtained a table listing only the significant genes, but we don’t have any information of fold change for the non-significant genes. Therefore, we can use the over-representation analysis which is a threshold-based method. Using our list of significant genes, we can test if any gene set is over-represented among significant genes or not using a test similar to a Fisher test to compare differences in proportions.

The clusterProfiler package provides functions for over-representation analysis of Gene Ontology gene sets (among other functions, including functions for actual GSEA) or KEGG gene sets.

Genes can be labeled using different types of labels, eg symbol, Ensembl ID, Entrez ID. To list the allowed label types use:

# BiocManager::install("org.Hs.eg.db", update = FALSE)
library(org.Hs.eg.db)
AnnotationDbi::keytypes(org.Hs.eg.db)

About OrgDb

For other organisms, you can find available OrgDbs at bioconductor

Let’s select a set of genes that are downregulated in the tumor cells compared to normal:

tum_down  <- subset(limma_de,
                    limma_de$logFC < -1 
                      & limma_de$adj.P.Val <  0.05)
tum_down_genes <- rownames(tum_down)

We can do a Gene Ontology term over-representation analysis based on this set of genes. Make sure you check out the help of this function to understand its arguments:

?enrichGO

tum_vs_norm_go <- clusterProfiler::enrichGO(tum_down_genes,
                                            "org.Hs.eg.db",
                                            keyType = "SYMBOL",
                                            ont = "BP",
                                            minGSSize = 50)

The results are stored in the @result slot:

View(tum_vs_norm_go@result)

	ID	Description	GeneRatio	BgRatio	RichFactor	FoldEnrichment	zScore
GO:0007059	GO:0007059	chromosome segregation	105/804	434/18860	0.2419355	5.675253	20.79271
GO:0000070	GO:0000070	mitotic sister chromatid segregation	65/804	192/18860	0.3385417	7.941413	20.39981
GO:0098813	GO:0098813	nuclear chromosome segregation	82/804	323/18860	0.2538700	5.955209	18.95484
GO:0000819	GO:0000819	sister chromatid segregation	70/804	233/18860	0.3004292	7.047381	19.59975
GO:0140014	GO:0140014	mitotic nuclear division	73/804	281/18860	0.2597865	6.093996	18.15420
GO:0000280	GO:0000280	nuclear division	90/804	453/18860	0.1986755	4.660473	16.64069

The columns GeneRatio and BgRatio

The columns GeneRatio and BgRatio that are in the enrichResult object represent the numbers that are used as input for the Fisher’s exact test.

The two numbers (M/N) in the GeneRatio column are:

M: Number of genes of interest (in our case tum_down_genes) that are in the GO set
N: Number of genes of interest with any GO annoation.

The two numbers (k/n) in the BgRatio column are:

k: Number of genes in the universe that are in the GO set
n: Number of genes in the universe with any GO annoation

A low p-value resulting from the Fisher’s exact means that M/N is signficantly greater than k/n.

Some GO terms seem redundant because they contain many of the same genes, which is a characteristic of Gene Ontology gene sets. We can simplify this list by removing redundant gene sets:

enr_go <- clusterProfiler::simplify(tum_vs_norm_go)

View(enr_go@result)

	ID	Description	GeneRatio	BgRatio	RichFactor	FoldEnrichment	zScore
GO:0007059	GO:0007059	chromosome segregation	105/804	434/18860	0.2419355	5.675253	20.79271
GO:0000070	GO:0000070	mitotic sister chromatid segregation	65/804	192/18860	0.3385417	7.941413	20.39981
GO:0098813	GO:0098813	nuclear chromosome segregation	82/804	323/18860	0.2538700	5.955209	18.95484
GO:0000280	GO:0000280	nuclear division	90/804	453/18860	0.1986755	4.660473	16.64069
GO:1905818	GO:1905818	regulation of chromosome separation	32/804	76/18860	0.4210526	9.876931	16.36260
GO:1901987	GO:1901987	regulation of cell cycle phase transition	75/804	468/18860	0.1602564	3.759249	12.75486

We can quite easily generate a plot called an enrichment map with the enrichplot package:

enrichplot::emapplot(enrichplot::pairwise_termsim(enr_go),
                     showCategory = 30)

Instead of testing for Gene Ontology terms, we can also test for other gene set collections. For example the Hallmark collection from MSigDB:

library(msigdbr)
library(msigdbdf)
gmt <- msigdbr::msigdbr(species = "human", category = "H")

Warning: The `category` argument of `msigdbr()` is deprecated as of msigdbr 10.0.0.
ℹ Please use the `collection` argument instead.

We can use the function enricher to test for over-representation of any set of genes of the Hallmark collection. We have to include the “universe”, i.e. the full list of background, non significant genes, against which to test for differences in proportions:

tum_vs_norm_enrich <- clusterProfiler::enricher(gene = tum_down_genes,
                                                universe = rownames(proB),
                                                pAdjustMethod = "BH",
                                                pvalueCutoff  = 0.05,
                                                qvalueCutoff  = 0.05,
                                                TERM2GENE = gmt[,c("gs_name", "gene_symbol")])

When using the genes down-regulated in tumor, among the over-represented Hallmark gene sets, we have HALLMARK_G2M_CHECKPOINT, which includes genes involved in the G2/M checkpoint in the progression through the cell division cycle.

View(tum_vs_norm_enrich@result[tum_vs_norm_enrich@result$p.adjust < 0.05,])

	ID	Description	GeneRatio	BgRatio	RichFactor	FoldEnrichment	zScore
HALLMARK_E2F_TARGETS	HALLMARK_E2F_TARGETS	HALLMARK_E2F_TARGETS	80/344	195/3857	0.4102564	4.599881	16.141910
HALLMARK_G2M_CHECKPOINT	HALLMARK_G2M_CHECKPOINT	HALLMARK_G2M_CHECKPOINT	67/344	187/3857	0.3582888	4.017209	13.234324
HALLMARK_MITOTIC_SPINDLE	HALLMARK_MITOTIC_SPINDLE	HALLMARK_MITOTIC_SPINDLE	48/344	197/3857	0.2436548	2.731909	7.807757
HALLMARK_MYC_TARGETS_V1	HALLMARK_MYC_TARGETS_V1	HALLMARK_MYC_TARGETS_V1	33/344	193/3857	0.1709845	1.917113	4.090088
HALLMARK_ESTROGEN_RESPONSE_LATE	HALLMARK_ESTROGEN_RESPONSE_LATE	HALLMARK_ESTROGEN_RESPONSE_LATE	28/344	164/3857	0.1707317	1.914280	3.743863

Clear environment

Clear your environment:

rm(list = ls())
gc()
.rs.restartR()