Skip to content

Differential gene expression

Material

Download the presentation

Exercises

Find all markers for each cluster

This part uses the gbm dataset

Load the gbm dataset you have created yesterday:

gbm <- readRDS("gbm_day2_part2.rds")

And load the following packages:

library(Seurat)
library(edgeR)
library(limma)

The function FindAllMarkers performs a Wilcoxon plot to determine the genes differentially expressed between each cluster and the rest of the cells. Other types of tests than the Wilcoxon test are available. Check it out by running ?Seurat::FindAllMarkers.

Now run analysis:

de_genes <- Seurat::FindAllMarkers(gbm,  min.pct = 0.25)

Time for coffee

This takes a while. Have a break.

We are usually only interested in significant marker genes, so let’s filter based on an adjusted p-value smaller than 0.05:

de_genes <- subset(de_genes, de_genes$p_val_adj < 0.05)
View(de_genes)

Exercise: What are significant marker genes in cluster 6? Are the immune genes in there?

Hint

You can re-load the vector with immune genes with:

immune_genes <- c("GZMA", "CD3E", "CD3D")
Answer

Running

de_genes[de_genes$gene %in% immune_genes,]

Returns:

              p_val avg_log2FC pct.1 pct.2     p_val_adj cluster gene
GZMA 4.015094e-221   2.441962 0.331 0.005 9.781973e-217       6 GZMA
CD3E  0.000000e+00   2.113342 0.474 0.006  0.000000e+00       6 CD3E
CD3D  0.000000e+00   2.507228 0.519 0.006  0.000000e+00       6 CD3D

So, yes, the immune genes are highly significant markers for cluster 6.

Differential expression between clusters

This part uses the gbm dataset

The FindMarkers function allows to test for differential gene expression analysis specifically between 2 clusters, i.e. perform pairwise comparisons, eg between cells of cluster 0 vs cluster 2, or between cells annotated as astrocytes and macrophages.

First we can set the default cell identity to the cell types defined by SingleR:

gbm <- Seurat::SetIdent(gbm, value = "SingleR_annot")

Run the differential gene expression analysis (runs for a couple of minutes):

DEG_astro_vs_macro <- Seurat::FindMarkers(gbm,
                                           ident.1 = "Astrocyte",
                                           ident.2 = "Macrophage",
                                           group.by = gbm$SingleR_annot,
                                           test.use = "wilcox")

Exercise: What is the top 10 differentially expressed genes? What does the sign (i.e. positive or negative) mean in the log fold change values? Check your answer by generating a violin plot of a top differentially expressed gene.

Answer

You can look at the top 10 differentially expressed genes with:

top_order <- order(DEG_astro_vs_macro$p_val_adj)
DEG_astro_vs_macro[top_order[1:10],]

Returning:

          p_val avg_log2FC pct.1 pct.2 p_val_adj
SLC2A5       0  -1.481971 0.080 0.656         0
TNFRSF1B     0  -1.529484 0.041 0.694         0
CAMK2N1      0   2.622397 0.907 0.106         0
C1QA         0  -4.843247 0.088 0.964         0
C1QC         0  -4.374756 0.075 0.943         0
C1QB         0  -4.940663 0.099 0.946         0
LAPTM5       0  -3.685947 0.075 0.987         0
MARCKSL1     0   2.690161 0.911 0.221         0
CNN3         0   2.940659 0.818 0.110         0
CD53         0  -1.861605 0.035 0.804         0

For an explanation of the log fold change have a look at ?Seurat::FindMarkers. At Value it says:

avg_logFC: log fold-chage of the average expression between the two groups. Positive values indicate that the gene is more highly expressed in the first group

Plotting the top gene SLC2A5:

Seurat::VlnPlot(gbm, features = "SLC2A5")

Returning:

Differential expression analysis including batch as covariates

This part uses the pancreas dataset

The Wilcoxon test implemented in FindMarkers does not allow to test for complex design (eg factorial experiments) or to include batch as a covariate.

We can use edgeR or limma which are designed for microarray or bulk RNA seq data and provide a design matrix that includes covariates for example.

We will go back to the pancreas cells sequenced with different technologies, analyze differentially expressed genes between 2 clusters of cells using the technologies as covariates. Similar approaches can be used to analyze differentially expressed genes between conditions, eg sick vs healthy, wild type versus knockout, etc, and including batches in the model if they are present.

We will load the pancreas.integrated object we have created yesterday:

pancreas.integrated <- readRDS("pancreas.integrated.rds")

Since we will start wit differential gene expression, we set the default assay back to “RNA”. Also, we set the default identity to the cell type:

Seurat::DefaultAssay(pancreas.integrated) <- "RNA"
Seurat::Idents(pancreas.integrated) <- pancreas.integrated$celltype

Let’s have a look at the UMAP (again), coloured by celltype:

Seurat::DimPlot(pancreas.integrated)

Let’s say we are specifically interested to test for differential gene expression between two cell types.

Note

Here we could also test for e.g. healthy versus diseased within a celltype/cluster.

Now we will run differential expression analysis between cell type delta and gamma using the technology as a covariate by using limma.

First, we will subset the pancreas.integrated object, only leaving the delta and gamma cells:

pancreas.dg <- subset(pancreas.integrated, idents = c("delta", "gamma"))

Get the count matrix and keep only genes that are expressed in at least one cell:

counts <- Seurat::GetAssayData(pancreas.dg, slot = "counts")
counts <- counts[rowSums(counts) != 0,]

Generate a DGEList object to use as input for limma:

dge <- edgeR::DGEList(counts = counts)
dge <- edgeR::calcNormFactors(dge)  

Generate a design matrix:

design <- model.matrix(~ 0 + celltype + tech, data = pancreas.dg@meta.data)
colnames(design)<-c("delta", "gamma", "celseq2", "fluidigmc1", "smartseq2")

Specify which contrasts to check:

contrast.mat <- limma::makeContrasts(delta - gamma,
                                     levels = design)

Now limma can perform the transformation with voom, fit the model, compute the contrasts and compute test statistics with eBayes:

vm <- limma::voom(dge, design = design, plot = TRUE)
fit <- limma::lmFit(vm, design = design)
fit.contrasts <- limma::contrasts.fit(fit, contrast.mat)
fit.contrasts <- limma::eBayes(fit.contrasts)

We can use topTable to get the most significantly differentially expressed genes:

limma::topTable(fit.contrasts, number = 10, sort.by = "P")

And we can check whether this corresponds to the counts by generating a violin plot:

Seurat::VlnPlot(pancreas.dg, "PPY", split.by = "tech")
Seurat::VlnPlot(pancreas.dg, "RBP4", split.by = "tech")