Differential gene expression
- More information on pseudobulk analysis
- Muscat for pseudobulk DGE.
- Paper on the robustness of different differential expression analysis methods
Find all markers for each cluster
This part uses the
gbm dataset you have created yesterday:
gbm <- readRDS("gbm_day2_part2.rds")
And load the following packages:
library(Seurat) library(edgeR) library(limma)
FindAllMarkers performs a Wilcoxon plot to determine the genes differentially expressed between each cluster and the rest of the cells. Other types of tests than the Wilcoxon test are available. Check it out by running
Now run analysis:
de_genes <- Seurat::FindAllMarkers(gbm, min.pct = 0.25)
Time for coffee
This takes a while. Have a break.
We are usually only interested in significant marker genes, so let’s filter based on an adjusted p-value smaller than 0.05:
de_genes <- subset(de_genes, de_genes$p_val_adj < 0.05) View(de_genes)
Exercise: What are significant marker genes in cluster 6? Are the immune genes in there?
You can re-load the vector with immune genes with:
immune_genes <- c("GZMA", "CD3E", "CD3D")
de_genes[de_genes$gene %in% immune_genes,]
p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene GZMA 4.015094e-221 2.441962 0.331 0.005 9.781973e-217 6 GZMA CD3E 0.000000e+00 2.113342 0.474 0.006 0.000000e+00 6 CD3E CD3D 0.000000e+00 2.507228 0.519 0.006 0.000000e+00 6 CD3D
So, yes, the immune genes are highly significant markers for cluster 6.
Differential expression between clusters
This part uses the
FindMarkers function allows to test for differential gene expression analysis specifically between 2 clusters, i.e. perform pairwise comparisons, eg between cells of cluster 0 vs cluster 2, or between cells annotated as astrocytes and macrophages.
First we can set the default cell identity to the cell types defined by
gbm <- Seurat::SetIdent(gbm, value = "SingleR_annot")
Run the differential gene expression analysis (runs for a couple of minutes):
DEG_astro_vs_macro <- Seurat::FindMarkers(gbm, ident.1 = "Astrocyte", ident.2 = "Macrophage", group.by = gbm$SingleR_annot, test.use = "wilcox")
Exercise: What is the top 10 differentially expressed genes? What does the sign (i.e. positive or negative) mean in the log fold change values? Check your answer by generating a violin plot of a top differentially expressed gene.
You can look at the top 10 differentially expressed genes with:
top_order <- order(DEG_astro_vs_macro$p_val_adj) DEG_astro_vs_macro[top_order[1:10],]
p_val avg_log2FC pct.1 pct.2 p_val_adj SLC2A5 0 -1.481971 0.080 0.656 0 TNFRSF1B 0 -1.529484 0.041 0.694 0 CAMK2N1 0 2.622397 0.907 0.106 0 C1QA 0 -4.843247 0.088 0.964 0 C1QC 0 -4.374756 0.075 0.943 0 C1QB 0 -4.940663 0.099 0.946 0 LAPTM5 0 -3.685947 0.075 0.987 0 MARCKSL1 0 2.690161 0.911 0.221 0 CNN3 0 2.940659 0.818 0.110 0 CD53 0 -1.861605 0.035 0.804 0
For an explanation of the log fold change have a look at
?Seurat::FindMarkers. At Value it says:
avg_logFC: log fold-chage of the average expression between the two groups. Positive values indicate that the gene is more highly expressed in the first group
Plotting the top gene
Seurat::VlnPlot(gbm, features = "SLC2A5")
Differential expression analysis including batch as covariates
This part uses the
The Wilcoxon test implemented in
FindMarkers does not allow to test for complex design (eg factorial experiments) or to include batch as a covariate.
We can use
limma which are designed for microarray or bulk RNA seq data and provide a design matrix that includes covariates for example.
We will go back to the pancreas cells sequenced with different technologies, analyze differentially expressed genes between 2 clusters of cells using the technologies as covariates. Similar approaches can be used to analyze differentially expressed genes between conditions, eg sick vs healthy, wild type versus knockout, etc, and including batches in the model if they are present.
We will load the
pancreas.integrated object we have created yesterday:
pancreas.integrated <- readRDS("pancreas.integrated.rds")
Since we will start wit differential gene expression, we set the default assay back to “RNA”. Also, we set the default identity to the cell type:
Seurat::DefaultAssay(pancreas.integrated) <- "RNA" Seurat::Idents(pancreas.integrated) <- pancreas.integrated$celltype
Let’s have a look at the UMAP (again), coloured by celltype:
Let’s say we are specifically interested to test for differential gene expression between two cell types.
Here we could also test for e.g. healthy versus diseased within a celltype/cluster.
Now we will run differential expression analysis between cell type delta and gamma using the technology as a covariate by using
First, we will subset the
pancreas.integrated object, only leaving the delta and gamma cells:
pancreas.dg <- subset(pancreas.integrated, idents = c("delta", "gamma"))
Get the count matrix and keep only genes that are expressed in at least one cell:
counts <- Seurat::GetAssayData(pancreas.dg, slot = "counts") counts <- counts[rowSums(counts) != 0,]
DGEList object to use as input for
dge <- edgeR::DGEList(counts = counts) dge <- edgeR::calcNormFactors(dge)
Generate a design matrix:
design <- model.matrix(~ 0 + celltype + tech, data = firstname.lastname@example.org) colnames(design)<-c("delta", "gamma", "celseq2", "fluidigmc1", "smartseq2")
Specify which contrasts to check:
contrast.mat <- limma::makeContrasts(delta - gamma, levels = design)
limma can perform the transformation with
voom, fit the model, compute the contrasts and compute test statistics with
vm <- limma::voom(dge, design = design, plot = TRUE) fit <- limma::lmFit(vm, design = design) fit.contrasts <- limma::contrasts.fit(fit, contrast.mat) fit.contrasts <- limma::eBayes(fit.contrasts)
We can use
topTable to get the most significantly differentially expressed genes:
limma::topTable(fit.contrasts, number = 10, sort.by = "P")
And we can check whether this corresponds to the counts by generating a violin plot:
Seurat::VlnPlot(pancreas.dg, "PPY", split.by = "tech") Seurat::VlnPlot(pancreas.dg, "RBP4", split.by = "tech")