Integration
Material
Exercises
This chapter uses the
pancreas
dataset
The gbm
dataset does not contain any samples, treatments or methods to integrate. Therefore for these exercises we will use a different dataset that is described in Comprehensive Integration of Single CellData. It is a dataset comprising of four different single cell experiment performed by using four different methods.
pancreas.data <- readRDS(file = "data/pancreas_dataset/pancreas_expression_matrix.rds")
metadata <- readRDS(file = "data/pancreas_dataset/pancreas_metadata.rds")
Create a Seurat object with all datasets.
pancreas <- Seurat::CreateSeuratObject(pancreas.data, meta.data = metadata)
The object pancreas
is now of class Seurat
and comparable with the object gbm
that we have used in the previous exercises.
Exercise: Have a look at the object. How many cells are in there? And how many features? What kind of information is in the meta.data
slot?
Answer
Just by running pancreas
, we get the following information:
An object of class Seurat
34363 features across 6321 samples within 1 assay
Active assay: RNA (34363 features, 0 variable features)
Running head(pancreas@meta.data)
gives you:
orig.ident nCount_RNA nFeature_RNA tech celltype
D101_5 SeuratProject 4615.810 1986 celseq gamma
D101_43 SeuratProject 11711.506 3942 celseq gamma
D101_93 SeuratProject 5567.659 2418 celseq gamma
D102_4 SeuratProject 6804.533 2846 celseq gamma
D172444_23 SeuratProject 5541.101 2436 celseq gamma
D172444_68 SeuratProject 4301.892 2015 celseq gamma
So, apparently there is also information in there about the technology (tech
) and cell type annotation (celltype
)
Now, we can repeat what we have learned in the previous chapters. Let’s assume we don’t have to filter for e.g. mitochondrial UMIs and number of features, and we can directly proceed to the normalization and scaling.
Exercise: To perform normalization, scaling, PCA and UMAP, run the following functions (with sensible parameters) on the pancreas
object:
Seurat::NormalizeData
Seurat::FindVariableFeatures
Seurat::ScaleData
Seurat::RunPCA
Seurat::RunUMAP
Answer
pancreas <- Seurat::NormalizeData(pancreas)
pancreas <- Seurat::FindVariableFeatures(pancreas, selection.method = "vst", nfeatures = 2000)
pancreas <- Seurat::ScaleData(pancreas)
pancreas <- Seurat::RunPCA(pancreas, npcs = 30)
pancreas <- Seurat::RunUMAP(pancreas, reduction = "pca", dims = 1:30)
Now we plot the UMAP based on technology. There is clearly clustering according to technology:
Seurat::DimPlot(pancreas, reduction = "umap", group.by = "tech")
Exercise: Generate the same UMAP plot but now grouped by celltype
. Does cell type group correctly together?
Answer
Generate the plot:
Seurat::DimPlot(pancreas, reduction = "umap", group.by = "celltype")
This shows that within a techology cell types cluster together, but not between technology (this is e.g. very clear if you look at the clusters annotated as “alpha”)
To perform the integration, we split the combined object into a list, with each dataset as an element. We perform standard preprocessing (log-normalization), and identify variable features individually for each dataset based on a variance stabilizing transformation ("vst"
).
pancreas.list <- Seurat::SplitObject(pancreas, split.by = "tech")
for (i in 1:length(pancreas.list)) {
pancreas.list[[i]] <- Seurat::NormalizeData(pancreas.list[[i]])
pancreas.list[[i]] <- Seurat::FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000,
verbose = FALSE)
}
After this, we prepare the integration by selecting integration anchors:
pancreas.anchors <- Seurat::FindIntegrationAnchors(object.list = pancreas.list, dims = 1:30)
And finally perform the integration:
pancreas.integrated <- Seurat::IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
After running IntegrateData
, the Seurat
object will contain an additional element of class Assay
with the integrated (or ‘batch-corrected’) expression matrix. This new Assay
is called integrated
, and exists next to the already existing RNA
element with class Assay
.
Warning
Use the Assay
integrated
only for clustering and visualisation. It will give unexpected results during e.g. differential gene expression analysis. Therefore, use the RNA
element for other analyses.
We can then use this new integrated matrix for clustering and visualization. Now, we can scale the integrated data, run PCA, and visualize the results with UMAP.
Note
No need to re-run FindVariableFeatures
, these were automatically set by calling IntegrateData
.
First, switch the default Assay
to integrated
(in stead of RNA
).
Seurat::DefaultAssay(pancreas.integrated) <- "integrated"
Exercise: In order to redo the clustering, scale the integrated data, run the PCA and the UMAP again (using the function ScaleData
, RunPCA
and RunUMAP
). After that, generate the same two UMAP plots (grouped by tech
and by celltype
). Did the integration perform well?
Answer
Performing the scaling, PCA and UMAP:
pancreas.integrated <- Seurat::ScaleData(pancreas.integrated)
pancreas.integrated <- Seurat::RunPCA(pancreas.integrated, npcs = 30)
pancreas.integrated <- Seurat::RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
Plotting the UMAP:
Seurat::DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
Seurat::DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE)
So, yes, integration performed well. Clustering is now not according to technology, but according to cell type.
Save the dataset and clear environment
Finally, store the integrated dataset as an .rds
file. We will use it tomorrow:
saveRDS(pancreas.integrated, "pancreas.integrated.rds")
Clear your environment:
rm(list = ls())
gc()
.rs.restartR()