Skip to content

Cell annotation


Download the presentation


This chapter uses the gbm dataset

Load the gbm dataset you have created earlier today:

gbm <- readRDS("gbm_day2_part1.rds")

And load the following packages:


In the last exercise we saw that probably clustering at a resolution of 0.2 gave the most sensible results. Let’s therefore set the default identity of each cell based on this clustering:

gbm <- Seurat::SetIdent(gbm, value = gbm$RNA_snn_res.0.2)


From now on, grouping (e.g. for plotting) is done by the active identity (set at @active.ident) by default.

Based on the UMAP we have generated, we can visualize expression for a gene in each cluster:

Seurat::FeaturePlot(gbm, "PMP2")

Based on expression of sets of genes you can do a manual cell type annotation. If you know the marker genes for some cell types, you can check whether they are up-regulated in one or the other cluster. Here we have some marker genes for two different cell types:

immune_genes<-c("GZMA", "CD3E", "CD3D")
microglia_genes<-c("CCL4", "CCL3", "P2RY12", "C1QB", "CSF1ER", "CY3CR1")

Let’s have a look at the expression of the three immune genes:

Seurat::FeaturePlot(gbm, immune_genes, ncol=2)

These cells are almost all in cluster 6. Which becomes clearer when looking at the violin plot:

                features = immune_genes,
                ncol = 2)

Exercise: Have a look at the microglia genes as well. Which cluster contains probably microglial cells?



Seurat::FeaturePlot(gbm, microglia_genes, ncol=2)


Corresponding mainly to cluster 1 (and cluster 4):

                features = microglia_genes,
                ncol = 2)

We can also automate this with the function AddModuleScore. For each cell, an expression score for a group of genes is calcuated:

gbm <- Seurat::AddModuleScore(gbm,
                              features = list(immune_genes),
                              name = "immune_genes")

Exercise: After running AddModuleScore, a column was added to

A. What is the name of that column? What kind of data is in there?

B. Generate a UMAP with color accoding to this column and a violinplot grouped by cluster. Is this according to what we saw in the previous exercise?


A. The new column is called immune_genes1. It contains the module score for each cell (which is basically the average expression of the set of genes).

B. You can plot the UMAP with

Seurat::FeaturePlot(gbm, "immune_genes1")



Which indeed shows these genes are mainly expressed in cluster 6:

Cell type annotation using SingleR

To do a fully automated annoation, we need a reference dataset of primary cells. Any reference could be used. The package scRNAseq in Bioconductor includes several scRNAseq datasets that can be used as reference to SingleR. One could also use a reference made of bulk RNA seq data. Here we are using the Human Primary Cell Atlas dataset from celldex. Check out what’s in there: <- celldex::HumanPrimaryCellAtlasData()

Now SingleR compares our normalized count data to a reference set, and finds the most probable annation:

gbm_SingleR <- SingleR::SingleR(test = Seurat::GetAssayData(gbm, slot = "data"),
                                ref =,
                                labels =$label.main)

See what’s in there by using head:

Visualize singleR score quality scores:


In order to visualize it in our UMAP, we have to add the annotation to

gbm$SingleR_annot <- gbm_SingleR$labels

Seurat::DimPlot(gbm, = "SingleR_annot", label = T, repel = T)

Exercise: Compare our manual annotation (based on the set of immune genes) to the annotation with SingleR. Do they correspond?


We can have a look at the mean module score for each SingleR annotation like this:

mean_scores <- tapply(gbm$immune_genes1, gbm$SingleR_annot, mean)
mean_scores[order(mean_scores, decreasing = TRUE)[1:6]]


    T_cells             NK_cell                  DC            Monocyte          Macrophage        Chondrocytes
1.015605822         0.679592135         0.017018871        -0.005414908        -0.011926460        -0.011975024

Showing that T-cells and NK-cells have a high module score based on our set of immune genes, which makes sense.

Of course, it was also already clear from the UMAP plots that cluster 6 (the cluster with the high module score for the immune genes) contained the T-cells and NK-cells.

Save the dataset and clear environment

Now, save the dataset so you can use it tomorrow:

saveRDS(gbm, "gbm_day2_part2.rds")

Clear your environment:

rm(list = ls())