Research Analysis Tasks

The following milestones and tasks are what you will need to perform in order to complete the project-based learning outcomes for this course. Go through them in order and refer to the glossary for more detailed information on each individual step.

Milestone 1

Download the data

The data for this project is available at the course GitHub repository, which can be downloaded like so:

Run git clone https://github.com/sib-swiss/SchoolRNA2020.git
A new directory will be created named SchoolRNA2020; the data is residing in the single_cell/data/ directory

You can either read the data from there (check the glossary for details) or copy the data to some other location, if you prefer. You can run the above command either on the command line interface or from within RStudio (more details in the glossary).

Data download:

cd ~/Downloads
git clone https://github.com/sib-swiss/SchoolRNA2020.git
cd SchoolRNA2020
mv single_cell/ ~/RProjects/sibsinglecell/single_cell

Data location:

data_dir <- "~/RProjects/sibsinglecell/single_cell/data"

The data GSM3396167 is located at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3396167. Paper: 10.1172/jci.insight.124928.

Load and merge datasets

Which file format do we have the data in?

We have 3 samples from 3 different donors in the sparse matrix format (.mtx) with corresponding barcodes and feature names in .tsv files. We want to load them separately to not loose any cells (min.cells = 0 and min.features=0$) and hold the information of the sample name. Then, we create seurat objects separately and we merge them. Like this, we end up with one object that contains the information of the samples.

Why do we need to create a Seurat object?

On the merged object we perform the analysis (normalisation, scaling, dimensionality reduction, visualisation, etc.) with functions implemented in the Seurat package.

library(Seurat)

sparse_matrix_A <- Seurat::Read10X(
  data.dir = file.path(data_dir, "marrow_data_GSE120221", "GSM3396167_matrix_A"))
sparse_matrix_E <- Seurat::Read10X(
  data.dir = file.path(data_dir, "marrow_data_GSE120221", "GSM3396167_matrix_E"))
sparse_matrix_F <- Seurat::Read10X(
  data.dir = file.path(data_dir, "marrow_data_GSE120221", "GSM3396167_matrix_F"))

seurat_A <- CreateSeuratObject(counts = sparse_matrix_A,
                               min.cells = 0, min.features = 0,
                               assay = "RNA",
                               project = "matrix_A")

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

seurat_E <- CreateSeuratObject(counts = sparse_matrix_E,
                               min.cells = 0, min.features = 0,
                               assay = "RNA",
                               project = "matrix_E")

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

seurat_F <- CreateSeuratObject(counts = sparse_matrix_F,
                               min.cells = 0, min.features = 0,
                               assay = "RNA",
                               project = "matrix_F")

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

rm(sparse_matrix_A, sparse_matrix_E, sparse_matrix_F)

SeuratObject <- merge(seurat_A, c(seurat_E, seurat_F),
                       add.cell.ids = c("matrix_A", "matrix_E", "matrix_F"))
rm(seurat_A, seurat_E, seurat_F)

Final object with 10679 cells and 33694 features (i.e. genes).

SeuratObject

## An object of class Seurat 
## 33694 features across 10679 samples within 1 assay 
## Active assay: RNA (33694 features, 0 variable features)

Where in the Seurat object is your counts stored?

SeuratObject@assays$RNA@counts[1:6,1:6]

## 6 x 6 sparse Matrix of class "dgCMatrix"
##               matrix_A_AAACCTGCAAGTAATG-1 matrix_A_AAACCTGCAGCGAACA-1
## RP11-34P13.3                            .                           .
## FAM138A                                 .                           .
## OR4F5                                   .                           .
## RP11-34P13.7                            .                           .
## RP11-34P13.8                            .                           .
## RP11-34P13.14                           .                           .
##               matrix_A_AAACCTGCATGTCCTC-1 matrix_A_AAACCTGGTCGACTGC-1
## RP11-34P13.3                            .                           .
## FAM138A                                 .                           .
## OR4F5                                   .                           .
## RP11-34P13.7                            .                           .
## RP11-34P13.8                            .                           .
## RP11-34P13.14                           .                           .
##               matrix_A_AAACCTGGTCGCTTCT-1 matrix_A_AAACCTGTCCCGACTT-1
## RP11-34P13.3                            .                           .
## FAM138A                                 .                           .
## OR4F5                                   .                           .
## RP11-34P13.7                            .                           .
## RP11-34P13.8                            .                           .
## RP11-34P13.14                           .                           .

Compute QC

Which QC metrics should you calculate? What does each QC metric mean?
Number of UMI counts: important to know the distribution of the counts for the normalisation and scaling steps.
number of features/genes: how many genes are found per cell; important to identify doublets and empty droplets.
percentage of mitochondrial genes: the higher the mitochondrial fraction, the cells are more stressed and/or prone to apoptosis.
ribosomal genes: inversly correlated to mitochondrial content; the higher the fraction, the more the cell is metabolic activated (stress or cell-cycle).
the gene biotypes (e.g. coding protein): QC on library preparation and mapping. Also used to filter for not needed genes.
chromosome distributions: check biases for certain chromsomes, e.g. X chromosome in male, female studies.
gene QC: find most expressed genes across cells. Identify housekeeping genes (mitochondrial, ribosomal and structural genes).
cell cycle scoring: find if cells are in the desired cell cycle states. Also used to identify batch effects between samples, otherwise the main signal comes from the cell cycle.

Total number of features and UMI counts per sample in the violin plot.

VlnPlot(SeuratObject,
        group.by= "orig.ident",
        features = c("nFeature_RNA","nCount_RNA"),
        pt.size = 0.1,
        ncol = 4) + NoLegend()

#### Gene QC

#Compute the relative expression of each gene per cell
rel_expression <- Matrix::t( Matrix::t(SeuratObject@assays$RNA@counts) / Matrix::colSums(SeuratObject@assays$RNA@counts)) * 100
most_expressed <- sort(Matrix::rowSums( rel_expression ),T) / ncol(SeuratObject)

#Plot the relative expression of each gene per cell
par(mfrow=c(1,3),mar=c(4,6,1,1))
boxplot( as.matrix(Matrix::t(rel_expression[names(most_expressed[30:1]),])),cex=.1, las=1, xlab="% total count per cell",col=scales::hue_pal()(90)[30:1],horizontal=TRUE,ylim=c(0,8))
boxplot( as.matrix(Matrix::t(rel_expression[names(most_expressed[60:31]),])),cex=.1, las=1, xlab="% total count per cell",col=scales::hue_pal()(90)[60:31],horizontal=TRUE,ylim=c(0,8))
boxplot( as.matrix(Matrix::t(rel_expression[names(most_expressed[90:61]),])),cex=.1, las=1, xlab="% total count per cell",col=scales::hue_pal()(90)[90:61],horizontal=TRUE,ylim=c(0,8))

#### Percentage mitochondrial and ribosomal genes

# Calculating % mitochondrial genes
SeuratObject <- PercentageFeatureSet(
  object = SeuratObject,
  pattern = "^MT-",
  assay = "RNA",
  col.name = "percent_mito")
# Calculating % ribosomal genes
SeuratObject <- PercentageFeatureSet(
  SeuratObject,
  pattern = "^RP[SL]",
  col.name = "percent_ribo")
VlnPlot(SeuratObject,
        group.by= "orig.ident",
        features = c("percent_mito","percent_ribo"),
        pt.size = 0.1,
        ncol = 2) + NoLegend()

#### Percentage gene biotype

library(biomaRt)

# Retrieve mouse gene annotation from ENSEMBL
mart = biomaRt::useMart(
  biomart = "ensembl",
  dataset = "hsapiens_gene_ensembl",
  host = "apr2020.archive.ensembl.org")

# Retrieve the selected attributes mouse gene annotation
annot <- biomaRt::getBM(
  attributes = c(
    "external_gene_name",
    "gene_biotype",
    "chromosome_name"),
  mart = mart)

## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Warning: `filter_()` is deprecated as of dplyr 0.7.0.
## Please use `filter()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Cache found

# Match the gene names with theit respective gene biotype
item <- annot[match(rownames(SeuratObject@assays$RNA@counts) , annot[,1]),"gene_biotype"]
item[is.na(item)] <- "unknown"

# Calculate the percentage of each gene biotype
perc <- rowsum(as.matrix(SeuratObject@assays$RNA@counts) , group=item)
perc <- (t(perc)/Matrix::colSums(SeuratObject@assays$RNA@counts))
o <- order(apply(perc,2,median),decreasing = F)
perc <- perc[,o]

# Plot percentage of each gene biotype
par(mar=c(4,20,2,0))
boxplot( perc*100,outline=F,las=2,main="% reads per cell",col=scales::hue_pal()(100),horizontal=T)

# Add table to the object
gene_biotype_table <- setNames(as.data.frame((perc*100)[,names(sort(table(item),decreasing = T))]),paste0("percent_",names(sort(table(item),decreasing = T))))
SeuratObject@meta.data <- SeuratObject@meta.data[,!(colnames(SeuratObject@meta.data) %in% colnames(gene_biotype_table))]

SeuratObject@meta.data <- cbind(
  SeuratObject@meta.data,
  gene_biotype_table)

Percentage chromosome location

# Match the gene names with their respective chromosome location
item <- annot[match(rownames(SeuratObject@assays$RNA@counts) , annot[,1]),"chromosome_name"]
item[is.na(item)] <- "unknown"
item[! item %in% as.character(c(1:23,"X","Y","MT")) ] <- "other"

perc <- rowsum(as.matrix(SeuratObject@assays$RNA@counts) , group=item)
perc <- (t(perc)/Matrix::colSums(SeuratObject@assays$RNA@counts))
o <- order(apply(perc,2,median),decreasing = F)
perc <- perc[,o]

boxplot( perc*100,outline=F,las=2,main="% reads per cell",col=scales::hue_pal()(100),horizontal=T)

# Add table to the object
gene_biotype_table <- setNames(as.data.frame((perc*100)[,names(sort(table(item),decreasing = T))]),paste0("percent_",names(sort(table(item),decreasing = T))))
SeuratObject@meta.data <- SeuratObject@meta.data[,!(colnames(SeuratObject@meta.data) %in% colnames(gene_biotype_table))]

SeuratObject@meta.data <- cbind(
  SeuratObject@meta.data,
  gene_biotype_table)

Cell cycle scoring

SeuratObject <- CellCycleScoring(
  object = SeuratObject,
  g2m.features = cc.genes$g2m.genes,
  s.features = cc.genes$s.genes)

## Warning: The following features are not present in the object: MLF1IP, not
## searching for symbol synonyms

SeuratObject$G1.Score <- 1 - SeuratObject$S.Score - SeuratObject$G2M.Score

VlnPlot(SeuratObject,
        group.by= "orig.ident",
        features = c("G1.Score","S.Score", "G2M.Score"),
        pt.size = 0.1,
        ncol = 4) + NoLegend()

PCA on metadata

Combine all metadata (QC parameters) and perform PCA in order to check the differences between the samples. This is a better visualisation than comparing each QC metric on its own.

# Calculate PCA using selected metadata parameters
metadata_use <- grep("perc",colnames(SeuratObject@meta.data),value = T)
metadata_use <- c("nCount_RNA","nFeature_RNA","S.Score","G2M.Score",metadata_use)

#remove columns in the metadata that are all 0s
metadata_use <- metadata_use [ colSums( SeuratObject@meta.data[,metadata_use] != 0 ) > 0 ]

#Run PCA
PC <- prcomp( SeuratObject@meta.data[,metadata_use] ,center = T, scale. = T)

# Add the PCA (ran on the METADATA) in the object
SeuratObject@reductions[["pca_metadata"]] <- CreateDimReducObject(
  embeddings = PC$x,
  key = "metadataPC_",
  assay = "RNA")

# Plot the PCA ran on the METADATA
DimPlot(SeuratObject,
        reduction = "pca_metadata",
        dims = c(1,2),
        group.by = "orig.ident" )

Define appropriate filtering thresholds

Which QC metric needs filtering in your data? Which filtering parameters and respective thresholds suit your data?
nFeature_RNA: number of genes <= 4500 to account for sequencing depth and >= 100 (maybe set 200) to remove low quality cells.
nCount_RNA: number of counts <= 50000 to remove duplets.
percent_mito: keep cells with <= 15 percent.

Filter cells with percentage mitochondria > XX%

SeuratObject <- subset(SeuratObject, 
                       subset = percent_mito <= 15 & nFeature_RNA <= 4500 & nCount_RNA <= 50000 & nFeature_RNA >= 100)

Milestone 2

Normalization and scaling

Why do we need to normalize the data? What exactly are we normalizing?

Normalisation is needed to account for varying sequencing depth. Normalisation to the library size per cell, and then the ln + 1 is taken.

Where in the Seurat object is your normalized data stored?

SeuratObject@assays$RNA@data

Which data covariates could potentially influence the interpretation of the results?

UMI counts, percentage mitochondria, percentage ribosome, cell cycle

Following the question above, are there any covariates that need to be regressed out?

In the metadata PCA, the samples are similar in that cluster together meaning that there are no big biases in the QC metrics.

Are all genes equally important for your analysis? Justify.

No, we are interested in protein coding genes and genes that are expressed in the samples.

Filtering genes

#remove genes with zero variance
SeuratObject <- SeuratObject[ Matrix::rowSums(SeuratObject) > 0, ]

# remove non coding proteins
sel <- annot[match(rownames(SeuratObject) ,
                   annot[,1]),2] == "protein_coding"
genes_use <- rownames(SeuratObject)[sel]
genes_use <- as.character(na.omit(genes_use))
SeuratObject <- SeuratObject[genes_use,]

Normalisation

SeuratObject <- NormalizeData(
  object = SeuratObject,
  scale.factor = 10000,
  normalization.method = "LogNormalize")

Feature selection

We select the 3000 most variable genes.

SeuratObject <- FindVariableFeatures(
  object = SeuratObject,
  nfeatures = 3000,
  selection.method = "vst",
  verbose = FALSE,
  assay = "RNA")

Visualise the top20 variable genes.

top20 <- head(VariableFeatures(SeuratObject, assay = "RNA"), 20)
LabelPoints(plot = VariableFeaturePlot(SeuratObject, assay = "RNA"), points = top20, repel = TRUE)

## When using repel, set xnudge and ynudge to 0 for optimal results

####Scaling and Centering

hist(SeuratObject@meta.data$nCount_RNA, breaks = 100)
hist(Matrix::colSums(SeuratObject@assays$RNA@data), breaks = 100)

Based on the histogram above, we could use negative binomial or poisson scaling. (If we had something to regress).

if(!file.exists("SeuratObject_scaled.RData")){
  SeuratObject <- ScaleData(
    object = SeuratObject,
    features = VariableFeatures(SeuratObject),
    assay = "RNA",
    vars.to.regress = NULL,
    do.scale = T,
    do.center = T)
  save(SeuratObject, file = "SeuratObject_scaled.RData")
} else{
  load("SeuratObject_scaled.RData")
}

## Centering and scaling data matrix

Where in the Seurat object is your scaled data stored?

SeuratObject@assays$RNA@scale.data[1:6,1:6]

##          matrix_A_AAACCTGCAAGTAATG-1 matrix_A_AAACCTGCAGCGAACA-1
## HES4                     -0.11697461                  8.34084953
## ISG15                    -0.41439954                  3.76623184
## TNFRSF18                 -0.12533438                 -0.12533438
## TNFRSF4                  -0.14284681                 -0.14284681
## TAS1R3                   -0.04254550                 -0.04254550
## MXRA8                    -0.04197874                 -0.04197874
##          matrix_A_AAACCTGCATGTCCTC-1 matrix_A_AAACCTGGTCGACTGC-1
## HES4                     -0.11697461                 -0.11697461
## ISG15                    -0.41439954                 -0.41439954
## TNFRSF18                 -0.12533438                 -0.12533438
## TNFRSF4                  -0.14284681                 -0.14284681
## TAS1R3                   -0.04254550                 -0.04254550
## MXRA8                    -0.04197874                 -0.04197874
##          matrix_A_AAACCTGGTCGCTTCT-1 matrix_A_AAACCTGTCCCGACTT-1
## HES4                     -0.11697461                 -0.11697461
## ISG15                     1.51893746                 -0.41439954
## TNFRSF18                 -0.12533438                 -0.12533438
## TNFRSF4                  -0.14284681                 -0.14284681
## TAS1R3                   -0.04254550                 -0.04254550
## MXRA8                    -0.04197874                 -0.04197874

Data Visualization

Which method would you use to capture most significant information out of the data? Following the question above, which parameters would you choose? Why?

PCA for dimensionality reduction and elbow plot to find the top N dimensions that contain the most information of data. Based on the elbow plot we use the first 15 PCs.

Which method would you choose for visualization of the differences between your cells? Following the question above, how the parameters in this method influence your visual representation?

tSNE and UMAP for visualisation in 2D. * Random seed influences the shape of the plot. * The number of PCs. * other parameters:

?RunUMAP()

PCA

SeuratObject <- RunPCA(object = SeuratObject,
                       features = VariableFeatures(SeuratObject),
                       assay = "RNA",
                       npcs = 100,
                       verbose = FALSE,
                       reduction.name = "pca")
ElbowPlot(SeuratObject, ndims = 30, reduction = "pca")

Choose dimensionality

top_PCs <- 15

Run UMAP

SeuratObject <- RunUMAP(object = SeuratObject,
                        assay = "RNA",
                        reduction = "pca",
                        dims = 1:top_PCs,
                        n.components = 2,
                        n.neighbors = 20, # used for local approximations
                        spread = 1, # effective scale of embedded points
                        repulsion.strength = 1, # weighting
                        min.dist= .001,
                        verbose = F,
                        n.epochs = 200,
                        metric = "euclidean",
                        seed.use = 42,
                        reduction.name="umap")

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

Run tSNE

SeuratObject <- RunTSNE(object = SeuratObject,
                        assay = "RNA",
                        reduction = "pca",
                        perplexity=30,
                        max_iter=1000,
                        theta=0.5,
                        eta=200,
                        exaggeration_factor=12,
                        dims.use = 1:top_PCs,
                        verbose = F,
                        num_threads=0,
                        seed.use = 42,
                        reduction.name = "tsne")

Where in the Seurat object is your reductions stored?

SeuratObject@reductions

## $pca_metadata
## A dimensional reduction object with key metadataPC_ 
##  Number of dimensions: 58 
##  Projected dimensional reduction calculated:  FALSE 
##  Jackstraw run: FALSE 
##  Computed using assay: RNA 
## 
## $pca
## A dimensional reduction object with key PC_ 
##  Number of dimensions: 100 
##  Projected dimensional reduction calculated:  FALSE 
##  Jackstraw run: FALSE 
##  Computed using assay: RNA 
## 
## $umap
## A dimensional reduction object with key UMAP_ 
##  Number of dimensions: 2 
##  Projected dimensional reduction calculated:  FALSE 
##  Jackstraw run: FALSE 
##  Computed using assay: RNA 
## 
## $tsne
## A dimensional reduction object with key tSNE_ 
##  Number of dimensions: 2 
##  Projected dimensional reduction calculated:  FALSE 
##  Jackstraw run: FALSE 
##  Computed using assay: RNA

Plot PCA, UMAP

DimPlot(SeuratObject, reduction = "pca", dims = c(1,2))

DimPlot(SeuratObject, reduction = "pca", dims = c(3,4))

DimPlot(SeuratObject, reduction = "umap")

DimPlot(SeuratObject, reduction = "tsne")

How some of your QC parameters and datasets influence the separation of your cells?

FeaturePlot(SeuratObject, reduction = "umap", features = c("percent_mito", "percent_ribo", "nCount_RNA"), split.by = "orig.ident")

Milestone 3

Dataset integration

Are there any batch effects in the data? Is batch correction / dataset integration necessary?

Based on the PCA, and UMAP plots, yes, there appears to be a batch effect.

Following the question above, which parameters allow you to say that batch effects are present?

No overlap of samples/batches in 2D representation in individual clusters.

Would a simple linear regression be sufficient for removing batch effects on your dataset?

Probably not, since the data is high dimensional.

Which method for batch effect would you choose?

Harmony

set.seed(1)
# Load additional libraries
library(harmony)

## Loading required package: Rcpp

library(SeuratWrappers)

## 
## Attaching package: 'SeuratWrappers'

## The following objects are masked from 'package:Seurat':
## 
##     ALRAChooseKPlot, RunALRA

SeuratObject <- RunHarmony(
  SeuratObject,
  reduction = "pca", dims.use = 1:100,
  nclust = NULL,
  group.by.vars = "orig.ident",
  reduction.save = "harmony",
  project.dim = TRUE)

## Harmony 1/10

## Harmony 2/10

## Harmony 3/10

## Harmony converged after 3 iterations

## Warning: Invalid name supplied, making object name syntactically valid. New
## object name is Seurat..ProjectDim.RNA.harmony; see ?make.names for more details
## on syntax validity

DimPlot(SeuratObject, reduction = "harmony", dims = c(1,2))

DimPlot(SeuratObject, reduction = "harmony", dims = c(3,4))

How could you tell the batch correction procedure worked?

If there is reasonable overlap in 2D representation.

After batch correction, do you have corrected data matrix (genes x samples) or a matrix in a reduction embedding (samples x dimensions).

corrected data matrix (genes x samples)

Where should you put the embedding results?

tsne_harmony and umap_harmony

Visualize your results using the new matricies.

ElbowPlot(SeuratObject, ndims = 30, reduction = "harmony")

SeuratObject <- RunUMAP(object = SeuratObject,
                        reduction = "harmony",
                        dims = 1:top_PCs,
                        n.components = 2,
                        n.neighbors = 20,
                        spread = 1,
                        repulsion.strength = 1,
                        min.dist= .1,
                        verbose = F,
                        n.epochs = 200,
                        metric = "euclidean",
                        seed.use = 42,
                        reduction.name="umap_harmony",
                        reduction.key = "umap.harmony_")

## Warning: Keys should be one or more alphanumeric characters followed by an
## underscore, setting key from umap.harmony_ to umapharmony_

## Warning: All keys should be one or more alphanumeric characters followed by an
## underscore '_', setting key to umapharmony_

SeuratObject <- RunTSNE(object = SeuratObject,
                        reduction = "harmony",
                        perplexity=30,
                        max_iter=1000,
                        theta=0.5,
                        eta=200,
                        exaggeration_factor=12,
                        dims.use = 1:top_PCs,
                        verbose = F,
                        num_threads=0,
                        seed.use = 42,
                        reduction.name = "tsne_harmony")

## Warning: Cannot add objects with duplicate keys (offending key: tSNE_), setting
## key to 'tsne_harmony_'

library(gridExtra)
p <- DimPlot(SeuratObject, reduction = "umap_harmony")
q <- DimPlot(SeuratObject, reduction = "umap")
grid.arrange(p, q, ncol = 2)

p <- DimPlot(SeuratObject, reduction = "tsne_harmony")
q <- DimPlot(SeuratObject, reduction = "tsne")
grid.arrange(p, q, ncol = 2)

Where in the Seurat object is your integrated data stored?

SeuratObject@reductions$harmony

## A dimensional reduction object with key harmony_ 
##  Number of dimensions: 100 
##  Projected dimensional reduction calculated:  TRUE 
##  Jackstraw run: FALSE 
##  Computed using assay: RNA

Milestone 4

Cell Clustering

What is a graph?

A graph is a diagram showing the relation between variables. It consists of nodes (cells) and edges, the connections betweens nodes.

Which kind of graph is the most robust to represent your data?

A graph that connects closely related cells and has little connection to unrelated cells. We use Shared Nearest Neighbours (SNN) to eliminate random correlations between cells.

set.seed(42)
SeuratObject <- FindNeighbors(SeuratObject,
                              assay = "RNA",
                              compute.SNN = T,
                              reduction = "harmony" ,
                              dims = 1:top_PCs,
                              graph.name="SNN",
                              prune.SNN = 1/15,
                              k.param = 20,
                              force.recalc = T)

## Computing nearest neighbor graph

## Computing SNN

Where in the Seurat object is your graph stored?

SeuratObject@graphs$SNN[1:2,1:2]

## 2 x 2 sparse Matrix of class "dgCMatrix"
##                             matrix_A_AAACCTGCAAGTAATG-1
## matrix_A_AAACCTGCAAGTAATG-1                           1
## matrix_A_AAACCTGCAGCGAACA-1                           .
##                             matrix_A_AAACCTGCAGCGAACA-1
## matrix_A_AAACCTGCAAGTAATG-1                           .
## matrix_A_AAACCTGCAGCGAACA-1                           1

Why do we need to cluster our cells?

So that we can differentiate and annotate the cell types.

Which parameters have you chosen when clustering?

Resolution = 0.8 (how many clusters are obtained) and Louvain algorithm.

SeuratObject <- FindClusters(
  object = SeuratObject,
  graph.name = "SNN",
  resolution = 0.8,
  algorithm = 1,
  random.seed = 42) #algorithim 1 = Louvain

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 10495
## Number of edges: 98635
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8709
## Number of communities: 28
## Elapsed time: 0 seconds

## 9 singletons identified. 19 final clusters.

How can you tell which clustering resolution is best?

If nearby cells on the UMAP or tSNE plots are separated by different cluster annotation, the resolution is too high. If the resolution is too low, faraway cells on the 2D plot are clustered together.

Do the clustering reflect the cell separation seen by the visualization method you are using?

Yes.

DimPlot(SeuratObject, reduction = "umap_harmony", group.by = "seurat_clusters", label = TRUE)

DimPlot(SeuratObject, reduction = "tsne_harmony", group.by = "seurat_clusters", label = TRUE)

How are your clusters distributed across the samples, groups, experimental conditions, etc.?

See plot, each cluster appears in each sample in different fractions.

DimPlot(SeuratObject, reduction = "umap_harmony", group.by = "seurat_clusters", label = TRUE, split.by = "orig.ident")

DimPlot(SeuratObject, reduction = "tsne_harmony", group.by = "seurat_clusters", label = TRUE, split.by = "orig.ident")

Where in the Seurat object is your clustering data stored?

Number of cells per cluster.

table(SeuratObject@meta.data$seurat_clusters)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2486 1173  936  907  879  782  553  478  422  417  350  287  287  185  136  108 
##   16   17   18 
##  105    2    2

Milestone 5

Differential expression

Which biological question(s) do you want to answer with differential expression?

Which genes specifically drive the differentiation of Neutrophils.

What genes are expressed differentially in each cluster? what is the cluster representing in terms of cell type?
Are you interested in comparing all cells or using a specific cluster?

First, in all cells, then later maybe only in a subset.

Which clustering resolution would you run your differential expression?
Which test did you choose for differential expression?

Wilcox. Identifies differentially expressed genes between two groups of cells using a Wilcoxon Rank Sum test.

What parameters did you set for computing differential expression? Justify each one

if(!file.exists("all_markers.RData")){
  markers <- FindAllMarkers( object = SeuratObject,
                             assay = "RNA",
                             logfc.threshold = 0.25,
                             test.use = "wilcox",
                             slot = "data",
                             min.pct = 0.1,
                             min.diff.pct = 0.2,
                             only.pos = TRUE,
                             max.cells.per.ident = 100,
                             random.seed = 42,
                             latent.vars = NULL,
                             min.cells.feature = 3,
                             min.cells.group = 3,
                             pseudocount.use = 1,
                             return.thresh = 0.01 )
  save(markers, file = "all_markers.RData")
} else{
  load("all_markers.RData")
}

## Calculating cluster 0

## Calculating cluster 1

## Calculating cluster 2

## Calculating cluster 3

## Calculating cluster 4

## Calculating cluster 5

## Calculating cluster 6

## Calculating cluster 7

## Calculating cluster 8

## Calculating cluster 9

## Calculating cluster 10

## Calculating cluster 11

## Calculating cluster 12

## Calculating cluster 13

## Calculating cluster 14

## Calculating cluster 15

## Calculating cluster 16

## Calculating cluster 17

## Calculating cluster 18

If you are interested in a particular cluster, which cluster, why?

Cluster expressing CD34 and G0S2 because they are markers for progenitors and neutrophils.

library(ggplot2)
DimPlot(SeuratObject, reduction = "umap_harmony", order = TRUE, label = TRUE)

FeaturePlot(SeuratObject, reduction = "umap_harmony", order = TRUE, label = TRUE, features = c("CD34", "G0S2"))

DotPlot(object = SeuratObject,
        features = c("CD34", "G0S2"),
        group.by = "seurat_clusters",
        assay = "RNA") + coord_flip()

VlnPlot(object = SeuratObject,
        features = c("CD34", "G0S2"),
        ncol = 2,
        group.by = "seurat_clusters",
        assay = "RNA")

Which marker genes can separate each of the cell clusters in your data?

We use the top5 differentially expressed genes per cluster.

Which cell types do they represent?

Based on marker gene list one can identify the cell type based on the found marker genes.

How would you visualize the list of differentially expressed genes?

Wit heatmap (see below) or with a dotplot for each marker gene and cell type.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following object is masked from 'package:biomaRt':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

top5 <- markers %>% group_by(cluster) %>% top_n(-3, p_val_adj)
DoHeatmap(object = SeuratObject,
          features = as.character(unique(top5$gene)),
          group.by = "seurat_clusters",
          assay = "RNA")

## Warning in DoHeatmap(object = SeuratObject, features =
## as.character(unique(top5$gene)), : The following features were omitted as they
## were not found in the scale.data slot for the RNA assay: CD52, CD3D, JUNB, RPL22

DotPlot(object = SeuratObject,
        features = as.character(unique(top5$gene)),
        group.by = "seurat_clusters",
        assay = "RNA") + coord_flip()

Clusters with the marker genes:

markers %>% filter(gene %in% c("CD34", "G0S2"))

##              p_val avg_logFC pct.1 pct.2    p_val_adj cluster gene
## G0S2  6.699020e-17 2.2068970 0.650 0.059 1.015839e-12       5 G0S2
## CD34  1.822133e-17 0.6994332 0.557 0.012 2.763082e-13      10 CD34
## G0S21 6.733057e-08 1.1788146 0.415 0.094 1.021001e-03      12 G0S2

FeaturePlot(SeuratObject, reduction = "umap_harmony", features = c("percent_mito", "percent_ribo", "nCount_RNA"), label = TRUE)

Trajectory inference analysis

Which biological question(s) do you want to answer with trajectory?