The following milestones and tasks are what you will need to perform in order to complete the project-based learning outcomes for this course. Go through them in order and refer to the glossary for more detailed information on each individual step.
The data for this project is available at the course GitHub repository, which can be downloaded like so:
git clone https://github.com/sib-swiss/SchoolRNA2020.git
SchoolRNA2020
; the data is residing in the single_cell/data/
directoryYou can either read the data from there (check the glossary for details) or copy the data to some other location, if you prefer. You can run the above command either on the command line interface or from within RStudio (more details in the glossary).
Data download:
cd ~/Downloads
git clone https://github.com/sib-swiss/SchoolRNA2020.git
cd SchoolRNA2020
mv single_cell/ ~/RProjects/sibsinglecell/single_cell
Data location:
data_dir <- "~/RProjects/sibsinglecell/single_cell/data"
The data GSM3396167 is located at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3396167. Paper: 10.1172/jci.insight.124928.
We have 3 samples from 3 different donors in the sparse matrix format (.mtx) with corresponding barcodes and feature names in .tsv files. We want to load them separately to not loose any cells (min.cells = 0
and min.features=0$
) and hold the information of the sample name. Then, we create seurat objects separately and we merge them. Like this, we end up with one object that contains the information of the samples.
On the merged object we perform the analysis (normalisation, scaling, dimensionality reduction, visualisation, etc.) with functions implemented in the Seurat package.
library(Seurat)
sparse_matrix_A <- Seurat::Read10X(
data.dir = file.path(data_dir, "marrow_data_GSE120221", "GSM3396167_matrix_A"))
sparse_matrix_E <- Seurat::Read10X(
data.dir = file.path(data_dir, "marrow_data_GSE120221", "GSM3396167_matrix_E"))
sparse_matrix_F <- Seurat::Read10X(
data.dir = file.path(data_dir, "marrow_data_GSE120221", "GSM3396167_matrix_F"))
seurat_A <- CreateSeuratObject(counts = sparse_matrix_A,
min.cells = 0, min.features = 0,
assay = "RNA",
project = "matrix_A")
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
seurat_E <- CreateSeuratObject(counts = sparse_matrix_E,
min.cells = 0, min.features = 0,
assay = "RNA",
project = "matrix_E")
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
seurat_F <- CreateSeuratObject(counts = sparse_matrix_F,
min.cells = 0, min.features = 0,
assay = "RNA",
project = "matrix_F")
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
rm(sparse_matrix_A, sparse_matrix_E, sparse_matrix_F)
SeuratObject <- merge(seurat_A, c(seurat_E, seurat_F),
add.cell.ids = c("matrix_A", "matrix_E", "matrix_F"))
rm(seurat_A, seurat_E, seurat_F)
Final object with 10679 cells and 33694 features (i.e. genes).
SeuratObject
## An object of class Seurat
## 33694 features across 10679 samples within 1 assay
## Active assay: RNA (33694 features, 0 variable features)
SeuratObject@assays$RNA@counts[1:6,1:6]
## 6 x 6 sparse Matrix of class "dgCMatrix"
## matrix_A_AAACCTGCAAGTAATG-1 matrix_A_AAACCTGCAGCGAACA-1
## RP11-34P13.3 . .
## FAM138A . .
## OR4F5 . .
## RP11-34P13.7 . .
## RP11-34P13.8 . .
## RP11-34P13.14 . .
## matrix_A_AAACCTGCATGTCCTC-1 matrix_A_AAACCTGGTCGACTGC-1
## RP11-34P13.3 . .
## FAM138A . .
## OR4F5 . .
## RP11-34P13.7 . .
## RP11-34P13.8 . .
## RP11-34P13.14 . .
## matrix_A_AAACCTGGTCGCTTCT-1 matrix_A_AAACCTGTCCCGACTT-1
## RP11-34P13.3 . .
## FAM138A . .
## OR4F5 . .
## RP11-34P13.7 . .
## RP11-34P13.8 . .
## RP11-34P13.14 . .
Which QC metrics should you calculate? What does each QC metric mean?
Number of UMI counts: important to know the distribution of the counts for the normalisation and scaling steps.
number of features/genes: how many genes are found per cell; important to identify doublets and empty droplets.
percentage of mitochondrial genes: the higher the mitochondrial fraction, the cells are more stressed and/or prone to apoptosis.
ribosomal genes: inversly correlated to mitochondrial content; the higher the fraction, the more the cell is metabolic activated (stress or cell-cycle).
the gene biotypes (e.g. coding protein): QC on library preparation and mapping. Also used to filter for not needed genes.
chromosome distributions: check biases for certain chromsomes, e.g. X chromosome in male, female studies.
gene QC: find most expressed genes across cells. Identify housekeeping genes (mitochondrial, ribosomal and structural genes).
cell cycle scoring: find if cells are in the desired cell cycle states. Also used to identify batch effects between samples, otherwise the main signal comes from the cell cycle.
VlnPlot(SeuratObject,
group.by= "orig.ident",
features = c("nFeature_RNA","nCount_RNA"),
pt.size = 0.1,
ncol = 4) + NoLegend()
#### Gene QC
#Compute the relative expression of each gene per cell
rel_expression <- Matrix::t( Matrix::t(SeuratObject@assays$RNA@counts) / Matrix::colSums(SeuratObject@assays$RNA@counts)) * 100
most_expressed <- sort(Matrix::rowSums( rel_expression ),T) / ncol(SeuratObject)
#Plot the relative expression of each gene per cell
par(mfrow=c(1,3),mar=c(4,6,1,1))
boxplot( as.matrix(Matrix::t(rel_expression[names(most_expressed[30:1]),])),cex=.1, las=1, xlab="% total count per cell",col=scales::hue_pal()(90)[30:1],horizontal=TRUE,ylim=c(0,8))
boxplot( as.matrix(Matrix::t(rel_expression[names(most_expressed[60:31]),])),cex=.1, las=1, xlab="% total count per cell",col=scales::hue_pal()(90)[60:31],horizontal=TRUE,ylim=c(0,8))
boxplot( as.matrix(Matrix::t(rel_expression[names(most_expressed[90:61]),])),cex=.1, las=1, xlab="% total count per cell",col=scales::hue_pal()(90)[90:61],horizontal=TRUE,ylim=c(0,8))
#### Percentage mitochondrial and ribosomal genes
# Calculating % mitochondrial genes
SeuratObject <- PercentageFeatureSet(
object = SeuratObject,
pattern = "^MT-",
assay = "RNA",
col.name = "percent_mito")
# Calculating % ribosomal genes
SeuratObject <- PercentageFeatureSet(
SeuratObject,
pattern = "^RP[SL]",
col.name = "percent_ribo")
VlnPlot(SeuratObject,
group.by= "orig.ident",
features = c("percent_mito","percent_ribo"),
pt.size = 0.1,
ncol = 2) + NoLegend()
#### Percentage gene biotype
library(biomaRt)
# Retrieve mouse gene annotation from ENSEMBL
mart = biomaRt::useMart(
biomart = "ensembl",
dataset = "hsapiens_gene_ensembl",
host = "apr2020.archive.ensembl.org")
# Retrieve the selected attributes mouse gene annotation
annot <- biomaRt::getBM(
attributes = c(
"external_gene_name",
"gene_biotype",
"chromosome_name"),
mart = mart)
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `filter_()` is deprecated as of dplyr 0.7.0.
## Please use `filter()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Cache found
# Match the gene names with theit respective gene biotype
item <- annot[match(rownames(SeuratObject@assays$RNA@counts) , annot[,1]),"gene_biotype"]
item[is.na(item)] <- "unknown"
# Calculate the percentage of each gene biotype
perc <- rowsum(as.matrix(SeuratObject@assays$RNA@counts) , group=item)
perc <- (t(perc)/Matrix::colSums(SeuratObject@assays$RNA@counts))
o <- order(apply(perc,2,median),decreasing = F)
perc <- perc[,o]
# Plot percentage of each gene biotype
par(mar=c(4,20,2,0))
boxplot( perc*100,outline=F,las=2,main="% reads per cell",col=scales::hue_pal()(100),horizontal=T)
# Add table to the object
gene_biotype_table <- setNames(as.data.frame((perc*100)[,names(sort(table(item),decreasing = T))]),paste0("percent_",names(sort(table(item),decreasing = T))))
SeuratObject@meta.data <- SeuratObject@meta.data[,!(colnames(SeuratObject@meta.data) %in% colnames(gene_biotype_table))]
SeuratObject@meta.data <- cbind(
SeuratObject@meta.data,
gene_biotype_table)
# Match the gene names with their respective chromosome location
item <- annot[match(rownames(SeuratObject@assays$RNA@counts) , annot[,1]),"chromosome_name"]
item[is.na(item)] <- "unknown"
item[! item %in% as.character(c(1:23,"X","Y","MT")) ] <- "other"
perc <- rowsum(as.matrix(SeuratObject@assays$RNA@counts) , group=item)
perc <- (t(perc)/Matrix::colSums(SeuratObject@assays$RNA@counts))
o <- order(apply(perc,2,median),decreasing = F)
perc <- perc[,o]
boxplot( perc*100,outline=F,las=2,main="% reads per cell",col=scales::hue_pal()(100),horizontal=T)
# Add table to the object
gene_biotype_table <- setNames(as.data.frame((perc*100)[,names(sort(table(item),decreasing = T))]),paste0("percent_",names(sort(table(item),decreasing = T))))
SeuratObject@meta.data <- SeuratObject@meta.data[,!(colnames(SeuratObject@meta.data) %in% colnames(gene_biotype_table))]
SeuratObject@meta.data <- cbind(
SeuratObject@meta.data,
gene_biotype_table)
SeuratObject <- CellCycleScoring(
object = SeuratObject,
g2m.features = cc.genes$g2m.genes,
s.features = cc.genes$s.genes)
## Warning: The following features are not present in the object: MLF1IP, not
## searching for symbol synonyms
SeuratObject$G1.Score <- 1 - SeuratObject$S.Score - SeuratObject$G2M.Score
VlnPlot(SeuratObject,
group.by= "orig.ident",
features = c("G1.Score","S.Score", "G2M.Score"),
pt.size = 0.1,
ncol = 4) + NoLegend()
Combine all metadata (QC parameters) and perform PCA in order to check the differences between the samples. This is a better visualisation than comparing each QC metric on its own.
# Calculate PCA using selected metadata parameters
metadata_use <- grep("perc",colnames(SeuratObject@meta.data),value = T)
metadata_use <- c("nCount_RNA","nFeature_RNA","S.Score","G2M.Score",metadata_use)
#remove columns in the metadata that are all 0s
metadata_use <- metadata_use [ colSums( SeuratObject@meta.data[,metadata_use] != 0 ) > 0 ]
#Run PCA
PC <- prcomp( SeuratObject@meta.data[,metadata_use] ,center = T, scale. = T)
# Add the PCA (ran on the METADATA) in the object
SeuratObject@reductions[["pca_metadata"]] <- CreateDimReducObject(
embeddings = PC$x,
key = "metadataPC_",
assay = "RNA")
# Plot the PCA ran on the METADATA
DimPlot(SeuratObject,
reduction = "pca_metadata",
dims = c(1,2),
group.by = "orig.ident" )
Which QC metric needs filtering in your data? Which filtering parameters and respective thresholds suit your data?
nFeature_RNA: number of genes <= 4500 to account for sequencing depth and >= 100 (maybe set 200) to remove low quality cells.
nCount_RNA: number of counts <= 50000 to remove duplets.
percent_mito: keep cells with <= 15 percent.
Filter cells with percentage mitochondria > XX%
SeuratObject <- subset(SeuratObject,
subset = percent_mito <= 15 & nFeature_RNA <= 4500 & nCount_RNA <= 50000 & nFeature_RNA >= 100)
Normalisation is needed to account for varying sequencing depth. Normalisation to the library size per cell, and then the ln + 1 is taken.
UMI counts, percentage mitochondria, percentage ribosome, cell cycle
In the metadata PCA, the samples are similar in that cluster together meaning that there are no big biases in the QC metrics.
No, we are interested in protein coding genes and genes that are expressed in the samples.
#remove genes with zero variance
SeuratObject <- SeuratObject[ Matrix::rowSums(SeuratObject) > 0, ]
# remove non coding proteins
sel <- annot[match(rownames(SeuratObject) ,
annot[,1]),2] == "protein_coding"
genes_use <- rownames(SeuratObject)[sel]
genes_use <- as.character(na.omit(genes_use))
SeuratObject <- SeuratObject[genes_use,]
SeuratObject <- NormalizeData(
object = SeuratObject,
scale.factor = 10000,
normalization.method = "LogNormalize")
We select the 3000 most variable genes.
SeuratObject <- FindVariableFeatures(
object = SeuratObject,
nfeatures = 3000,
selection.method = "vst",
verbose = FALSE,
assay = "RNA")
Visualise the top20 variable genes.
top20 <- head(VariableFeatures(SeuratObject, assay = "RNA"), 20)
LabelPoints(plot = VariableFeaturePlot(SeuratObject, assay = "RNA"), points = top20, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
####Scaling and Centering
hist(SeuratObject@meta.data$nCount_RNA, breaks = 100)
hist(Matrix::colSums(SeuratObject@assays$RNA@data), breaks = 100)
Based on the histogram above, we could use negative binomial or poisson scaling. (If we had something to regress).
if(!file.exists("SeuratObject_scaled.RData")){
SeuratObject <- ScaleData(
object = SeuratObject,
features = VariableFeatures(SeuratObject),
assay = "RNA",
vars.to.regress = NULL,
do.scale = T,
do.center = T)
save(SeuratObject, file = "SeuratObject_scaled.RData")
} else{
load("SeuratObject_scaled.RData")
}
## Centering and scaling data matrix
SeuratObject@assays$RNA@scale.data[1:6,1:6]
## matrix_A_AAACCTGCAAGTAATG-1 matrix_A_AAACCTGCAGCGAACA-1
## HES4 -0.11697461 8.34084953
## ISG15 -0.41439954 3.76623184
## TNFRSF18 -0.12533438 -0.12533438
## TNFRSF4 -0.14284681 -0.14284681
## TAS1R3 -0.04254550 -0.04254550
## MXRA8 -0.04197874 -0.04197874
## matrix_A_AAACCTGCATGTCCTC-1 matrix_A_AAACCTGGTCGACTGC-1
## HES4 -0.11697461 -0.11697461
## ISG15 -0.41439954 -0.41439954
## TNFRSF18 -0.12533438 -0.12533438
## TNFRSF4 -0.14284681 -0.14284681
## TAS1R3 -0.04254550 -0.04254550
## MXRA8 -0.04197874 -0.04197874
## matrix_A_AAACCTGGTCGCTTCT-1 matrix_A_AAACCTGTCCCGACTT-1
## HES4 -0.11697461 -0.11697461
## ISG15 1.51893746 -0.41439954
## TNFRSF18 -0.12533438 -0.12533438
## TNFRSF4 -0.14284681 -0.14284681
## TAS1R3 -0.04254550 -0.04254550
## MXRA8 -0.04197874 -0.04197874
PCA for dimensionality reduction and elbow plot to find the top N dimensions that contain the most information of data. Based on the elbow plot we use the first 15 PCs.
tSNE and UMAP for visualisation in 2D. * Random seed influences the shape of the plot. * The number of PCs. * other parameters:
?RunUMAP()
SeuratObject <- RunPCA(object = SeuratObject,
features = VariableFeatures(SeuratObject),
assay = "RNA",
npcs = 100,
verbose = FALSE,
reduction.name = "pca")
ElbowPlot(SeuratObject, ndims = 30, reduction = "pca")
Choose dimensionality
top_PCs <- 15
SeuratObject <- RunUMAP(object = SeuratObject,
assay = "RNA",
reduction = "pca",
dims = 1:top_PCs,
n.components = 2,
n.neighbors = 20, # used for local approximations
spread = 1, # effective scale of embedded points
repulsion.strength = 1, # weighting
min.dist= .001,
verbose = F,
n.epochs = 200,
metric = "euclidean",
seed.use = 42,
reduction.name="umap")
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
SeuratObject <- RunTSNE(object = SeuratObject,
assay = "RNA",
reduction = "pca",
perplexity=30,
max_iter=1000,
theta=0.5,
eta=200,
exaggeration_factor=12,
dims.use = 1:top_PCs,
verbose = F,
num_threads=0,
seed.use = 42,
reduction.name = "tsne")
Where in the Seurat object is your reductions stored?
SeuratObject@reductions
## $pca_metadata
## A dimensional reduction object with key metadataPC_
## Number of dimensions: 58
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: FALSE
## Computed using assay: RNA
##
## $pca
## A dimensional reduction object with key PC_
## Number of dimensions: 100
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: FALSE
## Computed using assay: RNA
##
## $umap
## A dimensional reduction object with key UMAP_
## Number of dimensions: 2
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: FALSE
## Computed using assay: RNA
##
## $tsne
## A dimensional reduction object with key tSNE_
## Number of dimensions: 2
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: FALSE
## Computed using assay: RNA
Plot PCA, UMAP
DimPlot(SeuratObject, reduction = "pca", dims = c(1,2))
DimPlot(SeuratObject, reduction = "pca", dims = c(3,4))
DimPlot(SeuratObject, reduction = "umap")
DimPlot(SeuratObject, reduction = "tsne")
How some of your QC parameters and datasets influence the separation of your cells?
FeaturePlot(SeuratObject, reduction = "umap", features = c("percent_mito", "percent_ribo", "nCount_RNA"), split.by = "orig.ident")
Based on the PCA, and UMAP plots, yes, there appears to be a batch effect.
No overlap of samples/batches in 2D representation in individual clusters.
Probably not, since the data is high dimensional.
set.seed(1)
# Load additional libraries
library(harmony)
## Loading required package: Rcpp
library(SeuratWrappers)
##
## Attaching package: 'SeuratWrappers'
## The following objects are masked from 'package:Seurat':
##
## ALRAChooseKPlot, RunALRA
SeuratObject <- RunHarmony(
SeuratObject,
reduction = "pca", dims.use = 1:100,
nclust = NULL,
group.by.vars = "orig.ident",
reduction.save = "harmony",
project.dim = TRUE)
## Harmony 1/10
## Harmony 2/10
## Harmony 3/10
## Harmony converged after 3 iterations
## Warning: Invalid name supplied, making object name syntactically valid. New
## object name is Seurat..ProjectDim.RNA.harmony; see ?make.names for more details
## on syntax validity
DimPlot(SeuratObject, reduction = "harmony", dims = c(1,2))
DimPlot(SeuratObject, reduction = "harmony", dims = c(3,4))
If there is reasonable overlap in 2D representation.
corrected data matrix (genes x samples)
tsne_harmony and umap_harmony
ElbowPlot(SeuratObject, ndims = 30, reduction = "harmony")
SeuratObject <- RunUMAP(object = SeuratObject,
reduction = "harmony",
dims = 1:top_PCs,
n.components = 2,
n.neighbors = 20,
spread = 1,
repulsion.strength = 1,
min.dist= .1,
verbose = F,
n.epochs = 200,
metric = "euclidean",
seed.use = 42,
reduction.name="umap_harmony",
reduction.key = "umap.harmony_")
## Warning: Keys should be one or more alphanumeric characters followed by an
## underscore, setting key from umap.harmony_ to umapharmony_
## Warning: All keys should be one or more alphanumeric characters followed by an
## underscore '_', setting key to umapharmony_
SeuratObject <- RunTSNE(object = SeuratObject,
reduction = "harmony",
perplexity=30,
max_iter=1000,
theta=0.5,
eta=200,
exaggeration_factor=12,
dims.use = 1:top_PCs,
verbose = F,
num_threads=0,
seed.use = 42,
reduction.name = "tsne_harmony")
## Warning: Cannot add objects with duplicate keys (offending key: tSNE_), setting
## key to 'tsne_harmony_'
library(gridExtra)
p <- DimPlot(SeuratObject, reduction = "umap_harmony")
q <- DimPlot(SeuratObject, reduction = "umap")
grid.arrange(p, q, ncol = 2)
p <- DimPlot(SeuratObject, reduction = "tsne_harmony")
q <- DimPlot(SeuratObject, reduction = "tsne")
grid.arrange(p, q, ncol = 2)
SeuratObject@reductions$harmony
## A dimensional reduction object with key harmony_
## Number of dimensions: 100
## Projected dimensional reduction calculated: TRUE
## Jackstraw run: FALSE
## Computed using assay: RNA
A graph is a diagram showing the relation between variables. It consists of nodes (cells) and edges, the connections betweens nodes.
A graph that connects closely related cells and has little connection to unrelated cells. We use Shared Nearest Neighbours (SNN) to eliminate random correlations between cells.
set.seed(42)
SeuratObject <- FindNeighbors(SeuratObject,
assay = "RNA",
compute.SNN = T,
reduction = "harmony" ,
dims = 1:top_PCs,
graph.name="SNN",
prune.SNN = 1/15,
k.param = 20,
force.recalc = T)
## Computing nearest neighbor graph
## Computing SNN
SeuratObject@graphs$SNN[1:2,1:2]
## 2 x 2 sparse Matrix of class "dgCMatrix"
## matrix_A_AAACCTGCAAGTAATG-1
## matrix_A_AAACCTGCAAGTAATG-1 1
## matrix_A_AAACCTGCAGCGAACA-1 .
## matrix_A_AAACCTGCAGCGAACA-1
## matrix_A_AAACCTGCAAGTAATG-1 .
## matrix_A_AAACCTGCAGCGAACA-1 1
So that we can differentiate and annotate the cell types.
Resolution = 0.8 (how many clusters are obtained) and Louvain algorithm.
SeuratObject <- FindClusters(
object = SeuratObject,
graph.name = "SNN",
resolution = 0.8,
algorithm = 1,
random.seed = 42) #algorithim 1 = Louvain
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 10495
## Number of edges: 98635
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8709
## Number of communities: 28
## Elapsed time: 0 seconds
## 9 singletons identified. 19 final clusters.
If nearby cells on the UMAP or tSNE plots are separated by different cluster annotation, the resolution is too high. If the resolution is too low, faraway cells on the 2D plot are clustered together.
Yes.
DimPlot(SeuratObject, reduction = "umap_harmony", group.by = "seurat_clusters", label = TRUE)
DimPlot(SeuratObject, reduction = "tsne_harmony", group.by = "seurat_clusters", label = TRUE)
See plot, each cluster appears in each sample in different fractions.
DimPlot(SeuratObject, reduction = "umap_harmony", group.by = "seurat_clusters", label = TRUE, split.by = "orig.ident")
DimPlot(SeuratObject, reduction = "tsne_harmony", group.by = "seurat_clusters", label = TRUE, split.by = "orig.ident")
Number of cells per cluster.
table(SeuratObject@meta.data$seurat_clusters)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2486 1173 936 907 879 782 553 478 422 417 350 287 287 185 136 108
## 16 17 18
## 105 2 2
Which genes specifically drive the differentiation of Neutrophils.
What genes are expressed differentially in each cluster? what is the cluster representing in terms of cell type?
Are you interested in comparing all cells or using a specific cluster?
First, in all cells, then later maybe only in a subset.
Which clustering resolution would you run your differential expression?
Which test did you choose for differential expression?
Wilcox. Identifies differentially expressed genes between two groups of cells using a Wilcoxon Rank Sum test.
if(!file.exists("all_markers.RData")){
markers <- FindAllMarkers( object = SeuratObject,
assay = "RNA",
logfc.threshold = 0.25,
test.use = "wilcox",
slot = "data",
min.pct = 0.1,
min.diff.pct = 0.2,
only.pos = TRUE,
max.cells.per.ident = 100,
random.seed = 42,
latent.vars = NULL,
min.cells.feature = 3,
min.cells.group = 3,
pseudocount.use = 1,
return.thresh = 0.01 )
save(markers, file = "all_markers.RData")
} else{
load("all_markers.RData")
}
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
## Calculating cluster 13
## Calculating cluster 14
## Calculating cluster 15
## Calculating cluster 16
## Calculating cluster 17
## Calculating cluster 18
Cluster expressing CD34 and G0S2 because they are markers for progenitors and neutrophils.
library(ggplot2)
DimPlot(SeuratObject, reduction = "umap_harmony", order = TRUE, label = TRUE)
FeaturePlot(SeuratObject, reduction = "umap_harmony", order = TRUE, label = TRUE, features = c("CD34", "G0S2"))
DotPlot(object = SeuratObject,
features = c("CD34", "G0S2"),
group.by = "seurat_clusters",
assay = "RNA") + coord_flip()
VlnPlot(object = SeuratObject,
features = c("CD34", "G0S2"),
ncol = 2,
group.by = "seurat_clusters",
assay = "RNA")
We use the top5 differentially expressed genes per cluster.
Based on marker gene list one can identify the cell type based on the found marker genes.
Wit heatmap (see below) or with a dotplot for each marker gene and cell type.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:biomaRt':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
top5 <- markers %>% group_by(cluster) %>% top_n(-3, p_val_adj)
DoHeatmap(object = SeuratObject,
features = as.character(unique(top5$gene)),
group.by = "seurat_clusters",
assay = "RNA")
## Warning in DoHeatmap(object = SeuratObject, features =
## as.character(unique(top5$gene)), : The following features were omitted as they
## were not found in the scale.data slot for the RNA assay: CD52, CD3D, JUNB, RPL22
DotPlot(object = SeuratObject,
features = as.character(unique(top5$gene)),
group.by = "seurat_clusters",
assay = "RNA") + coord_flip()
Clusters with the marker genes:
markers %>% filter(gene %in% c("CD34", "G0S2"))
## p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
## G0S2 6.699020e-17 2.2068970 0.650 0.059 1.015839e-12 5 G0S2
## CD34 1.822133e-17 0.6994332 0.557 0.012 2.763082e-13 10 CD34
## G0S21 6.733057e-08 1.1788146 0.415 0.094 1.021001e-03 12 G0S2
QC
FeaturePlot(SeuratObject, reduction = "umap_harmony", features = c("percent_mito", "percent_ribo", "nCount_RNA"), label = TRUE)
Which genes specifically drive the differentiation of Neutrophils.
Based on the UMAP, yes. Because the clusters are closely together. There is one cluster inbetween the clusters with marker gene which we include for trajectory analysis.
We are only interested in clusters involved in differentiation of neutrophils.
umap_harmony because no extra steps for visualisation needed.
Which differential expression test are you interested in? which genes drive differentiation? and what is the difference in the two lineages?
How would you visualize your results?
cells_to_use <- rownames(filter(SeuratObject@meta.data,
seurat_clusters %in% c(10,7,5,12)))
dimred <- SeuratObject@reductions[['umap_harmony']]@cell.embeddings
dimred <- dimred[cells_to_use,]
clustering <- factor(SeuratObject@meta.data[cells_to_use, "seurat_clusters"])
Slingshot
library(slingshot)
## Loading required package: princurve
# Run default Slingshot lineage identification
set.seed(42)
lineages <- getLineages(
data = dimred,
clusterLabels = clustering,
# end.clus = c("12"), #define how many branches/lineages to consider
start.clus = "10" #define where to start the trajectories
)
## Using full covariance matrix
lineages
## class: SlingshotDataSet
##
## Samples Dimensions
## 1897 2
##
## lineages: 1
## Lineage1: 10 7 5 12
##
## curves: 0
# Plot the lineages
par(mfrow=c(1,2))
plot(dimred[,1:2], col = clustering, cex=.5,pch = 16)
for(i in levels(clustering)){
text( mean(dimred[clustering==i,1]),
mean(dimred[clustering==i,2]), labels = i,font = 2) }
plot(dimred[,1:2], col = clustering,cex=.5, pch = 16)
lines(lineages, lwd = 3, col = 'black')
curves <- getCurves(lineages,
approx_points = 300,
thresh = 0.01,
stretch = .8,
allow.breaks = FALSE,
shrink = .99)
curves
## class: SlingshotDataSet
##
## Samples Dimensions
## 1897 2
##
## lineages: 1
## Lineage1: 10 7 5 12
##
## curves: 1
## Curve1: Length: 20.114 Samples: 1897
plot(dimred, col = clustering, asp = 1, pch = 16)
lines(curves, lwd = 3, col = 'black')
Find differentially expressed genes along trajectories
BiocParallel::register(BiocParallel::SerialParam())
library(tradeSeq)
## tradeSeq has been updated to accommodate singleCellExperiment objects as output, making it much more memory efficient. Please check the news file and the updated vignette for details.
counts <- as.matrix( SeuratObject@assays$RNA@counts[ SeuratObject@assays$RNA@var.features , ] )
# filter counts to cells of interest
counts <- counts[,cells_to_use]
# Removing some genes to speed up the computations for this tutorial
filt_counts <- counts [ rowSums(counts > 5) > ncol(counts)/100, ]
dim(filt_counts)
## [1] 338 1897
# Fitting a Gamma distribution
sce <- fitGAM( counts = as.matrix(filt_counts),
sds = curves,
nknots = 6)
plotGeneCount(curve = curves,
counts = filt_counts,
clusters = clustering,
models = sce)
#### Genes that change with pseudotime
pseudotime_association <- associationTest(sce)
pseudotime_association$fdr <- p.adjust(pseudotime_association$pvalue, method = "fdr")
pseudotime_association <- pseudotime_association[ order(pseudotime_association$pvalue), ]
pseudotime_association$feature_id <- rownames(pseudotime_association)
write.table(pull(pseudotime_association, feature_id), file = "genes_pseudotime.txt", sep = "\t",
row.names = FALSE, quote = FALSE, col.names = FALSE)
# Define function to plot
plot_differential_expression <- function(feature_id) {
# cowplot::plot_grid(
plotGeneCount(curve = curves,
counts = filt_counts,
gene=feature_id[1],
clusters = clustering,
models = sce)
# plotSmoothers(models = sce,
# counts=filt_counts,
# gene = feature_id[1])
}
# Get genes and plot
feature_id <- pseudotime_association %>%
filter(fdr < 0.05) %>%
top_n(40, waldStat)
for(i in 1:40){
print(plot_differential_expression(feature_id[i, "feature_id"]))
}
plot_smoothers <- function(seuratObject,
SlingshotCurves,
gene,
clustering_use,
plotOrder=T,
col=pal,
colorByPseudotime=T,
spar=.9,
pt.size=.3,
factorScale=1.3){
pt <- slingshot::slingPseudotime(curves)
pt <- apply(pt,1,function(x) min(x,na.rm = T))
ooo <- order(pt)[order(pt)]
if(plotOrder){
x <- 1:length(pt)
y <- seuratObject@assays$RNA@data[gene,order(pt)]
col <- col[seuratObject@meta.data[order(pt),clustering_use]]
} else {
x <- pt
y <- seuratObject@assays$RNA@data[gene,]
col <- col[seuratObject@meta.data[,clustering_use]]
}
plot(x,y,col=col,cex=pt.size,pch=16,xlab="Pseudotime",ylab="",las=1,main=gene)
sm <- smooth.spline( x,y , spar = spar)
lines(sm$x,sm$y*factorScale,lwd=2)
}
pal <- scales::hue_pal()(length(unique((SeuratObject@meta.data[cells_to_use, "seurat_clusters"]))))
temp <- SeuratObject[,cells_to_use]
## Warning: Keys should be one or more alphanumeric characters followed by an
## underscore, setting key from tsne_harmony_ to tsneharmony_
temp$seurat_clusters <- factor(temp$seurat_clusters,levels = c("10","7","5","12") )
for(i in 1:40){
plot_smoothers(seuratObject = temp,
SlingshotCurves = curves,
clustering_use = "seurat_clusters",
gene = feature_id[i, "feature_id"],
plotOrder = TRUE,
col=pal,
spar=.9,
pt.size=.3,
factorScale=1)
}
Gene enrichment
library(enrichR)
## Welcome to enrichR
## Checking connection ... Connection is Live!
# Check available databases to perform enrichment (then choose one)
# enrichR::listEnrichrDbs()
my_genes <- pseudotime_association %>% filter(fdr < 0.05) %>% pull(feature_id)
my_database <- "GO_Biological_Process_2018"
# perform enrichment
enrich_results <- enrichr(
genes = my_genes,
databases = my_database)[[1]]
## Uploading data to Enrichr... Done.
## Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
library(rafalib)
mypar(1,1,mar=c(3,20,2,1))
barplot( height = -log10(enrich_results$P.value)[20:1],
names.arg = enrich_results$Term[20:1],
horiz = T,
las=2,
border=F,
cex.names = .6 )
abline( v=c( -log10(0.05) ),lty=2 )
abline( v=0,lty=1 )