The following milestones and tasks are what you will need to perform in order to complete the project-based learning outcomes for this course. Go through them in order and refer to the glossary for more detailed information on each individual step.
The data for this project is available at the course GitHub repository, which can be downloaded like so:
git clone https://github.com/sib-swiss/SchoolRNA2020.git
SchoolRNA2020
; the data is residing in the single_cell/data/
directoryYou can either read the data from there (check the glossary for details) or copy the data to some other location, if you prefer. You can run the above command either on the command line interface or from within RStudio (more details in the glossary).
H5.
Reading data:
cov.17 <- Seurat::Read10X_h5(
filename = "data/covid_data_GSE149689/nCoV_PBMC_17.h5",
use.names = T)
cov.15 <- Seurat::Read10X_h5(
filename = "data/covid_data_GSE149689/nCoV_PBMC_15.h5",
use.names = T)
ctl.14 <- Seurat::Read10X_h5(
filename = "data/covid_data_GSE149689/Normal_PBMC_14.h5",
use.names = T)
ctl.13 <- Seurat::Read10X_h5(
filename = "data/covid_data_GSE149689/Normal_PBMC_13.h5",
use.names = T)
Because it’s a useful object containing methods for single cell data analysis.
Creating Seurat objects for each sample:
library("Seurat", lib.loc="~/sibsinglecell/lib/R/library")
## Warning: package 'Seurat' was built under R version 3.6.3
cov17.seurat <- CreateSeuratObject(
counts = cov.17,
assay = "RNA",
project = "covid_17")
cov15.seurat <- CreateSeuratObject(
counts = cov.15,
assay = "RNA",
project = "covid_15")
ctl14.seurat <- CreateSeuratObject(
counts = ctl.14,
assay = "RNA",
project = "control_14")
ctl13.seurat <- CreateSeuratObject(
counts = ctl.13,
assay = "RNA",
project = "control_13")
Combining datasets:
CombinedSeuratObject <- merge(
x = cov15.seurat,
y = c( cov17.seurat ,
ctl13.seurat,
ctl14.seurat),
add.cell.ids = c('covid_15','covid_17','control_13','control_14'))
The following is to create a column with the group (control or covid):
CombinedSeuratObject$group <-sub("_.*","",CombinedSeuratObject$orig.ident)
#CombinedSeuratObject$nCount_RNA
In this step we read the count data and create a Seurat object, a simple R object for scRNAseq data. We then combine all datasets, creating a metadata column that specificies sample group (control or covid).
A standard approach is to filter cells with low amount of reads as well as genes that are present in at least a certain amount of cells. Here we will only consider cells with at least 200 detected genes and genes need to be expressed in at least 3 cells. Please note that those values are highly dependent on the library preparation method used. Extremely high number of detected genes could indicate doublets. However, depending on the cell type composition in your sample, you may have cells with higher number of genes (and also higher counts) from one cell type.
VlnPlot(CombinedSeuratObject,
group.by= "orig.ident",
features = c("nFeature_RNA","nCount_RNA"),
pt.size = 0.1,
ncol = 4) + NoLegend()
On the left figure each dot is a cell and we see the number of genes expressed in each cell. On the right plot, the number of reads obtained from each cell (UMI have been merged already). All samples seem to have a similar distribution in #genes and #counts, except sample covid_15 which shows a few cells with high #counts, that could indicate doublets (we don’t really know).
In single cell, the most detected genes usually belong to housekeeping gene families, such as mitochondrial (MT-), ribossomal (RPL and RPS) and other structural proteins (i.e., ACTB, TMSB4X, B2M, EEF1A1).
#Compute the relative expression of each gene per cell
rel_expression <- Matrix::t( Matrix::t(CombinedSeuratObject@assays$RNA@counts) / Matrix::colSums(CombinedSeuratObject@assays$RNA@counts)) * 100
most_expressed <- sort(Matrix::rowSums( rel_expression ),T) / ncol(CombinedSeuratObject)
#Plot the relative expression of each gene per cell
par(mfrow=c(1,3),mar=c(4,6,1,1))
boxplot( as.matrix(Matrix::t(rel_expression[names(most_expressed[30:1]),])),cex=.1, las=1, xlab="% total count per cell",col=scales::hue_pal()(90)[30:1],horizontal=TRUE,ylim=c(0,8))
boxplot( as.matrix(Matrix::t(rel_expression[names(most_expressed[60:31]),])),cex=.1, las=1, xlab="% total count per cell",col=scales::hue_pal()(90)[60:31],horizontal=TRUE,ylim=c(0,8))
boxplot( as.matrix(Matrix::t(rel_expression[names(most_expressed[90:61]),])),cex=.1, las=1, xlab="% total count per cell",col=scales::hue_pal()(90)[90:61],horizontal=TRUE,ylim=c(0,8))
par(mfrow=c(1,1))
This plot shows the highest expressed genes in the combined samples. We see the usually high expressed mt and RP genes, and probably house keeping genes like Malat1 (lncRNA highly expressed in PBMC cells).
Having the data in a suitable format, we can start calculating some quality metrics. We can for example calculate the percentage of mitocondrial and ribosomal genes per cell and add to the metadata. This will be helpfull to visualize them across different metadata parameteres (i.e. datasetID and chemistry version). There are several ways of doing this. Here is an example of how to manually calculate the proportion of mitochondrial reads and add to the metadata table.
Citing from “Simple Single Cell” workflows (Lun, McCarthy & Marioni, 2017): “High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.”
(PS: non-linear relationship)
# create seurat object of MT- genes:
CombinedSeuratObject <- PercentageFeatureSet(
object = CombinedSeuratObject,
pattern = "^MT-",
assay = "RNA",
col.name = "percent_mito")
# Feature plot of mt genes
VlnPlot(CombinedSeuratObject,
group.by= "orig.ident",
features = c("nFeature_RNA","nCount_RNA","percent_mito"),
pt.size = 0.1,
ncol = 4) + NoLegend()
In the right plot we see the percentage of mt-genes per cell. Sample covid_15, we see large number of cells where high proportion of genes are mt. We will need filter out cells that have more that 25% mt genes.
In the same manner we will calculate the proportion gene expression that comes from ribosomal proteins. Ribosomal genes are the also among the top expressed genes in any cell and, on the contrary to mitochondrial genes, are inversely proportional to the mitochondrial content: the higher the mitochondrial content, the lower is the detection of ribosomal genes (PS: non-linear relationship).
# create seurat object of RP- genes:
CombinedSeuratObject <- PercentageFeatureSet(
object = CombinedSeuratObject,
pattern = "^RP[SL]",
assay = "RNA",
col.name = "percent_ribo")
# Feature plot of rp genes
VlnPlot(CombinedSeuratObject,
group.by= "orig.ident",
features = c("nFeature_RNA","nCount_RNA","percent_ribo"),
pt.size = 0.1,
ncol = 4) + NoLegend()
We see the anticorrelation of cells with high mt genes having low % ribo genes. Sample covid_15 seems the one with poorest quality.
In RNA-sequencing, genes can be categorized into different groups depending on their RNA biotype. For example, “coding”, “non-coding”, “VDJ region genes” are “small interefering RNA” common gene biotypes. Besides, having information about chromossomal location might be usefull to identify bacth effects driven by sex chromossomes.
Depending on the desired type of analysis, some gene categories can be filtered out if not of interest. For single cell specifically, cell libraries are usually constructed using poly-A enrichment and therefore enriching for “protein-coding proteins”, which usually contitutes around 80-90% of all available genes.
How to run it:
library(biomaRt)
# get names of species:
#biomaRt::listDatasets(mart)[,"dataset"]
# get versin names, to use latest
#biomaRt::listEnsemblArchives()
# Retrieve human gene annotation from ENSEMBL
mart = biomaRt::useMart(
biomart = "ensembl",
dataset = "hsapiens_gene_ensembl",
host = "aug2020.archive.ensembl.org")
# Retrieve the selected attributes mouse gene annotation
annot <- biomaRt::getBM(
attributes = c(
"external_gene_name",
"gene_biotype",
"chromosome_name"),
mart = mart)
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `filter_()` is deprecated as of dplyr 0.7.0.
## Please use `filter()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Cache found
# Match the gene names with their respective gene biotype
item <- annot[match(rownames(CombinedSeuratObject@assays$RNA@counts) , annot[,1]),"gene_biotype"]
item[is.na(item)] <- "unknown"
# Calculate the percentage of each gene biotype
perc <- rowsum(as.matrix(CombinedSeuratObject@assays$RNA@counts) , group=item)
perc <- (t(perc)/Matrix::colSums(CombinedSeuratObject@assays$RNA@counts))
o <- order(apply(perc,2,median),decreasing = F)
perc <- perc[,o]
# Plot percentage of each gene biotype
boxplot( perc*100,outline=F,las=2,main="% reads per cell",col=scales::hue_pal()(100),horizontal=T)
# Add table to the object
gene_biotype_table <- setNames(as.data.frame((perc*100)[,names(sort(table(item),decreasing = T))]),paste0("percent_",names(sort(table(item),decreasing = T))))
CombinedSeuratObject@meta.data <- CombinedSeuratObject@meta.data[,!(colnames(CombinedSeuratObject@meta.data) %in% colnames(gene_biotype_table))]
CombinedSeuratObject@meta.data <- cbind(
CombinedSeuratObject@meta.data,
gene_biotype_table)
Same is done with chromosome_name
# Match the gene names with their respective chromosome location
item <- annot[match(rownames(CombinedSeuratObject@assays$RNA@counts) , annot[,1]),"chromosome_name"]
item[is.na(item)] <- "unknown"
item[! item %in% as.character(c(1:23,"X","Y","MT")) ] <- "other"
# Calculate the percentage of each gene biotype
perc <- rowsum(as.matrix(CombinedSeuratObject@assays$RNA@counts) , group=item)
perc <- (t(perc)/Matrix::colSums(CombinedSeuratObject@assays$RNA@counts))
o <- order(apply(perc,2,median),decreasing = F)
perc <- perc[,o]
# Plot percentage of each gene biotype
boxplot( perc*100,outline=F,las=2,main="% reads per cell",col=scales::hue_pal()(100),horizontal=T)
# Add table to the object
gene_biotype_table <- setNames(as.data.frame((perc*100)[,names(sort(table(item),decreasing = T))]),paste0("percent_",names(sort(table(item),decreasing = T))))
CombinedSeuratObject@meta.data <- CombinedSeuratObject@meta.data[,!(colnames(CombinedSeuratObject@meta.data) %in% colnames(gene_biotype_table))]
CombinedSeuratObject@meta.data <- cbind(
CombinedSeuratObject@meta.data,
gene_biotype_table)
If you want to focus the analysis on only protein-coding genes, for example, you can do it like so:
dim(CombinedSeuratObject)
## [1] 33538 6800
sel <- annot[match(rownames(CombinedSeuratObject) ,
annot[,1]),2] == "protein_coding"
genes_use <- rownames(CombinedSeuratObject)[sel]
genes_use <- as.character(na.omit(genes_use))
CombinedSeuratObject <- CombinedSeuratObject[genes_use,]
dim(CombinedSeuratObject)
## [1] 19397 6800
We here perform cell cycle scoring. To score a gene list, the algorithm calculates the difference of mean expression of the given list and the mean expression of reference genes. To build the reference, the function randomly chooses a bunch of genes matching the distribution of the expression of the given list. Cell cycle scoring with Seurat adds three slots in data, a score for S phase, a score for G2M phase and the predicted cell cycle phase. The Seurat package provides a list of human G2M and S phase genes in cc.genes.
How to run it:
CombinedSeuratObject <- CellCycleScoring(
object = CombinedSeuratObject,
g2m.features = cc.genes$g2m.genes,
s.features = cc.genes$s.genes)
## Warning: The following features are not present in the object: MLF1IP, not
## searching for symbol synonyms
## Warning: The following features are not present in the object: FAM64A, HN1, not
## searching for symbol synonyms
CombinedSeuratObject$G1.Score <- 1 - CombinedSeuratObject$S.Score - CombinedSeuratObject$G2M.Score
VlnPlot(CombinedSeuratObject,
group.by= "orig.ident",
features = c("S.Score","G1.Score","G2M.Score"),
pt.size = 0.1,
ncol = 4) + NoLegend()
From the plots, we can see that cell cycle is not different between samples, so it won’t have an impact in the analysis.
Features and Gene QC: A sample with low # genes or los # counts could indicate a problem with sequence depth or poor sample quality. We need to filter out cells with low number of features for further analysis, and features that are not expressed in a certain number of samples (e.g. min in 3 samples).
Mitocondrial genes: cells with high proportion of mt-genes are indicative of poor quality and should be removed. High expression of mitochondrial genes could be indicative of several processes. Cells that are stressed/damaged/lysed will lose their cytoplasmic components, whereas the mitochdria is still intact during the damage, therefore most of the genes from such cells will be mitochondrial genes. If these cells are used for differential analysis, we will falsely obtain genes involved in cell damage/apoptosis processes. Therefore, they need to be filtered out.
Cell cycle: we need to assess if cells are dividing, as this could affect differential expression analysis.
In this ste, we performed QC control of the data to remove low quality cells or features that could confound downstream analysis.
Mitocondrial content. Cells with low features. Features present in too few cells.
Filtering out genes that are expressed in less than 3 samples, as well as cells with less than 200 genes expressed.
CombinedSeuratObject <- CombinedSeuratObject[rowSums(as.matrix(CombinedSeuratObject@assays$RNA@counts)>0)>=3 ,
colSums(as.matrix(CombinedSeuratObject@assays$RNA@counts)>0)>=200]
We removed cells with high mt-gene content; cells with less than 200 genes expressed, and genes that are expressed in less than 3 samples.
To make it comparable. To remove the effect of sequencing depth, and gene length.
#remove genes with zero variance
CombinedSeuratObject <- CombinedSeuratObject[ Matrix::rowSums(CombinedSeuratObject) > 0, ]
#remove genes with zero variance
CombinedSeuratObject <- NormalizeData(
object = CombinedSeuratObject,
scale.factor = 10000,
normalization.method = "LogNormalize")
CombinedSeuratObject <- FindVariableFeatures(
object = CombinedSeuratObject,
nfeatures = 3000,
selection.method = "vst",
verbose = FALSE,
assay = "RNA")
#Variable gene plot:
top20 <- head(VariableFeatures(CombinedSeuratObject), 20)
LabelPoints(plot = VariableFeaturePlot(CombinedSeuratObject), points = top20, repel = TRUE)
## Warning: Using `as.character()` on a quosure is deprecated as of rlang 0.3.0.
## Please use `as_label()` or `as_name()` instead.
## This warning is displayed once per session.
## When using repel, set xnudge and ynudge to 0 for optimal results
#CombinedSeuratObject@assays$RNA@data
Mitocondrial genes, number of features, etc.
Percent_mito (mt-genes), S.score (cell cycle), nFeature_RNA, nCount_RNA.
# Linear scaling and centering (faster than poisson)
CombinedSeuratObject <- ScaleData(
object = CombinedSeuratObject,
vars.to.regress = c("nCount_RNA","nFeature_RNA","percent_mito","S.Score"),
model.use = "linear",
do.scale = T,
do.center = T)
## Regressing out nCount_RNA, nFeature_RNA, percent_mito, S.Score
## Centering and scaling data matrix
No. We are interested in genes that are highly variable between samples, because they are most interesting biologically.
# CombinedSeuratObject@assays$RNA@scale.data
We select features that are most informative (most variable); we normalize data (for sequencing depth differences), and we scale the data, regressing out covariates that could confound downstream analysis.
PCA
#run pca
CombinedSeuratObject <- RunPCA(object = CombinedSeuratObject,
assay = "RNA",
npcs = 100,
verbose = FALSE )
The number of principal components.
UMAP and tSNE (with the first 50 PC).
# tSNE
CombinedSeuratObject <- RunTSNE(object = CombinedSeuratObject,
reduction = "pca",
perplexity=30,
max_iter=1000,
theta=0.5,
eta=200,
exaggeration_factor=12,
dims.use = 1:50,
verbose = T,
num_threads=0)
## Read the 5931 x 5 data matrix successfully!
## OpenMP is working. 8 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 1.20 seconds (sparsity = 0.020042)!
## Learning embedding...
## Iteration 50: error is 91.748648 (50 iterations in 2.86 seconds)
## Iteration 100: error is 73.893670 (50 iterations in 3.03 seconds)
## Iteration 150: error is 71.120532 (50 iterations in 3.14 seconds)
## Iteration 200: error is 70.200473 (50 iterations in 2.97 seconds)
## Iteration 250: error is 69.733782 (50 iterations in 3.06 seconds)
## Iteration 300: error is 2.211945 (50 iterations in 2.97 seconds)
## Iteration 350: error is 1.798589 (50 iterations in 2.69 seconds)
## Iteration 400: error is 1.582548 (50 iterations in 2.75 seconds)
## Iteration 450: error is 1.449213 (50 iterations in 2.94 seconds)
## Iteration 500: error is 1.359214 (50 iterations in 2.90 seconds)
## Iteration 550: error is 1.296529 (50 iterations in 3.27 seconds)
## Iteration 600: error is 1.250634 (50 iterations in 3.85 seconds)
## Iteration 650: error is 1.220024 (50 iterations in 3.60 seconds)
## Iteration 700: error is 1.202863 (50 iterations in 3.83 seconds)
## Iteration 750: error is 1.193324 (50 iterations in 3.97 seconds)
## Iteration 800: error is 1.185357 (50 iterations in 3.69 seconds)
## Iteration 850: error is 1.177214 (50 iterations in 3.70 seconds)
## Iteration 900: error is 1.169584 (50 iterations in 3.72 seconds)
## Iteration 950: error is 1.161636 (50 iterations in 3.67 seconds)
## Iteration 1000: error is 1.154700 (50 iterations in 3.68 seconds)
## Fitting performed in 66.29 seconds.
# uMAP
CombinedSeuratObject <- RunUMAP(object = CombinedSeuratObject,
reduction = "pca",
dims = 1:50,
n.components = 2,
n.neighbors = 20,
spread = .3,
repulsion.strength = 1,
min.dist= .001,
verbose = T,
num_threads=0,
n.epochs = 200,
metric = "euclidean",
seed.use = 42,
reduction.name="umap")
## Warning: The following arguments are not used: num_threads
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 09:09:48 UMAP embedding parameters a = 12.99 b = 0.7939
## 09:09:48 Read 5931 rows and found 50 numeric columns
## 09:09:48 Using Annoy for neighbor search, n_neighbors = 20
## 09:09:48 Building Annoy index with metric = euclidean, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 09:09:48 Writing NN index file to temp file /tmp/RtmpygNzsq/file15597d5f138
## 09:09:48 Searching Annoy index using 1 thread, search_k = 2000
## 09:09:49 Annoy recall = 100%
## 09:09:50 Commencing smooth kNN distance calibration using 1 thread
## 09:09:50 Initializing from normalized Laplacian + noise
## 09:09:51 Commencing optimization for 200 epochs, with 169892 positive edges
## 09:09:53 Optimization finished
# plot tSNE
DimPlot(object = CombinedSeuratObject,
# group.by = c("DATASET"),
reduction = "tsne",
dims = c(1,2),
pt.size = .1,
label = T,
ncol = 3)
# plot UMAP
DimPlot(object = CombinedSeuratObject,
# group.by = c("DATASET"),
reduction = "umap",
dims = c(1,2),
pt.size = .1,
label = T,
ncol = 3)
# plot tSNE
FeaturePlot(object = CombinedSeuratObject,
features = c("percent_mito", "nCount_RNA", "nFeature_RNA"),
reduction = "tsne",
dims = c(1,2),
order = T,
pt.size = .1,
ncol = 3)
I we choose the first few PCs, that would explain the majority of the variance. If we’d choose more PCs, clusters will be less well separated because we are considering more variablity of the data.
If we don’t remove poor quality cells, these could cluster together without giving us biologically-relevant information.
# CombinedSeuratObject@reductions
We have visualized the data as a PCA ,to reduce the complexity.
Patient. They are separated. We need to integrate them to be able to separate cell types.
We saw in in the UMAP/tSNE (covid_15 is apart.)
We do scaling again adding the sample replicates Then do PCA, tSNE and plot that again.
# Linear scaling and centering (faster than poisson)
CombinedSeuratObject <- ScaleData(
object = CombinedSeuratObject,
vars.to.regress = c("nCount_RNA","nFeature_RNA","percent_mito","S.Score","orig.ident"),
model.use = "linear",
do.scale = T,
do.center = T)
## Regressing out nCount_RNA, nFeature_RNA, percent_mito, S.Score, orig.ident
## Centering and scaling data matrix
# pca
CombinedSeuratObject <- RunPCA(object = CombinedSeuratObject,
assay = "RNA",
npcs = 100,
verbose = FALSE )
# tSNE
CombinedSeuratObject <- RunTSNE(object = CombinedSeuratObject,
reduction = "pca",
perplexity=30,
max_iter=1000,
theta=0.5,
eta=200,
exaggeration_factor=12,
dims.use = 1:50,
verbose = T,
num_threads=0)
## Read the 5931 x 5 data matrix successfully!
## OpenMP is working. 8 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.98 seconds (sparsity = 0.019822)!
## Learning embedding...
## Iteration 50: error is 91.845550 (50 iterations in 3.16 seconds)
## Iteration 100: error is 74.787813 (50 iterations in 2.93 seconds)
## Iteration 150: error is 72.091245 (50 iterations in 2.93 seconds)
## Iteration 200: error is 71.337837 (50 iterations in 3.09 seconds)
## Iteration 250: error is 71.066622 (50 iterations in 3.25 seconds)
## Iteration 300: error is 2.196712 (50 iterations in 2.91 seconds)
## Iteration 350: error is 1.760298 (50 iterations in 3.36 seconds)
## Iteration 400: error is 1.537295 (50 iterations in 3.14 seconds)
## Iteration 450: error is 1.401632 (50 iterations in 2.89 seconds)
## Iteration 500: error is 1.310989 (50 iterations in 3.78 seconds)
## Iteration 550: error is 1.248310 (50 iterations in 3.66 seconds)
## Iteration 600: error is 1.204516 (50 iterations in 3.77 seconds)
## Iteration 650: error is 1.175341 (50 iterations in 6.15 seconds)
## Iteration 700: error is 1.155939 (50 iterations in 4.27 seconds)
## Iteration 750: error is 1.142352 (50 iterations in 3.83 seconds)
## Iteration 800: error is 1.131431 (50 iterations in 3.84 seconds)
## Iteration 850: error is 1.123433 (50 iterations in 3.83 seconds)
## Iteration 900: error is 1.115514 (50 iterations in 4.27 seconds)
## Iteration 950: error is 1.107891 (50 iterations in 3.83 seconds)
## Iteration 1000: error is 1.101574 (50 iterations in 3.79 seconds)
## Fitting performed in 72.66 seconds.
# plot tSNE
DimPlot(object = CombinedSeuratObject,
# group.by = c("DATASET"),
reduction = "tsne",
dims = c(1,2),
pt.size = .1,
label = T,
ncol = 3)
No it’s not good enough to do simple linear regression.
CCA.
SeuratObject.list <- SplitObject(
object = CombinedSeuratObject,
split.by = "orig.ident")
SeuratObject.anchors <- FindIntegrationAnchors(
object.list = SeuratObject.list,
dims = 1:30)
## Computing 2000 integration features
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 3161 anchors
## Filtering anchors
## Retained 2330 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 3487 anchors
## Filtering anchors
## Retained 2410 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 3652 anchors
## Filtering anchors
## Retained 2621 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 3375 anchors
## Filtering anchors
## Retained 2441 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 3652 anchors
## Filtering anchors
## Retained 2658 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 4116 anchors
## Filtering anchors
## Retained 3596 anchors
CombinedSeuratObject <- IntegrateData(
anchorset = SeuratObject.anchors,
dims = 1:30,
new.assay.name = "cca")
## Merging dataset 3 into 4
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 2 into 4 3
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 1 into 4 3 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Warning: Adding a command log without an assay associated with it
Scaling, PCA, tSNE and plotting.
# Linear scaling and centering (faster than poisson)
CombinedSeuratObject <- ScaleData(
object = CombinedSeuratObject,
assay = "cca",
# vars.to.regress = c("nCount_RNA","nFeatures_RNA","percent_mito","S.Score"),
model.use = "linear",
do.scale = T,
do.center = T)
## Centering and scaling data matrix
# pca
CombinedSeuratObject <- RunPCA(object = CombinedSeuratObject,
assay = "cca",
npcs = 100,
verbose = FALSE )
# tSNE
CombinedSeuratObject <- RunTSNE(object = CombinedSeuratObject,
reduction = "pca",
perplexity=30,
max_iter=1000,
theta=0.5,
eta=200,
exaggeration_factor=12,
dims.use = 1:50,
verbose = T,
num_threads=0)
## Read the 5931 x 5 data matrix successfully!
## OpenMP is working. 8 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.96 seconds (sparsity = 0.020564)!
## Learning embedding...
## Iteration 50: error is 91.422225 (50 iterations in 3.12 seconds)
## Iteration 100: error is 71.879686 (50 iterations in 3.31 seconds)
## Iteration 150: error is 68.618122 (50 iterations in 2.99 seconds)
## Iteration 200: error is 67.433919 (50 iterations in 3.24 seconds)
## Iteration 250: error is 66.819658 (50 iterations in 3.26 seconds)
## Iteration 300: error is 2.174736 (50 iterations in 3.41 seconds)
## Iteration 350: error is 1.787636 (50 iterations in 3.66 seconds)
## Iteration 400: error is 1.572558 (50 iterations in 3.71 seconds)
## Iteration 450: error is 1.437932 (50 iterations in 3.66 seconds)
## Iteration 500: error is 1.347387 (50 iterations in 3.69 seconds)
## Iteration 550: error is 1.284705 (50 iterations in 3.80 seconds)
## Iteration 600: error is 1.239576 (50 iterations in 3.78 seconds)
## Iteration 650: error is 1.207996 (50 iterations in 3.78 seconds)
## Iteration 700: error is 1.184813 (50 iterations in 3.84 seconds)
## Iteration 750: error is 1.169460 (50 iterations in 3.85 seconds)
## Iteration 800: error is 1.159007 (50 iterations in 3.80 seconds)
## Iteration 850: error is 1.150471 (50 iterations in 3.82 seconds)
## Iteration 900: error is 1.142919 (50 iterations in 3.95 seconds)
## Iteration 950: error is 1.136613 (50 iterations in 3.81 seconds)
## Iteration 1000: error is 1.131145 (50 iterations in 3.82 seconds)
## Fitting performed in 72.29 seconds.
# plot tSNE
DimPlot(object = CombinedSeuratObject,
# group.by = c("DATASET"),
reduction = "tsne",
dims = c(1,2),
pt.size = .1,
label = T,
ncol = 3)
# plot tSNE with categories
FeaturePlot(object = CombinedSeuratObject,
features = c("percent_mito", "nCount_RNA", "nFeature_RNA","S.Score"),
reduction = "tsne",
dims = c(1,2),
order = T,
pt.size = .1,
ncol = 3)
After CCA samples have been well integrated. Only a cluster on covid-15 that hasnt been integrated, but it doesn’t correspond to cells with high mt-genes proportion.
But we do see a cluster of cells with high mt-genes content, that we should remove.
# CombinedSeuratObject <- CombinedSeuratObject[,CombinedSeuratObject$percent_mito < 25]
Run the whole analysis again, from combining datasets onwards.
CombinedSeuratObject <- merge(
x = cov15.seurat,
y = c( cov17.seurat ,
ctl13.seurat,
ctl14.seurat),
add.cell.ids = c('covid_15','covid_17','control_13','control_14'))
# creating a metadata object with the group (control/covid), taken from sample name (everything before the "_")
CombinedSeuratObject$group <-sub("_.*","",CombinedSeuratObject$orig.ident)
# create seurat object of MT- genes:
CombinedSeuratObject <- PercentageFeatureSet(
object = CombinedSeuratObject,
pattern = "^MT-",
assay = "RNA",
col.name = "percent_mito")
# remove cells that have more than 25% mt genes
#rows are genes; columns are cells
CombinedSeuratObject <- CombinedSeuratObject[,CombinedSeuratObject$percent_mito < 25]
# ribo genes
# create seurat object of ribo- genes:
CombinedSeuratObject <- PercentageFeatureSet(
object = CombinedSeuratObject,
pattern = "^RP[SL]",
assay = "RNA",
col.name = "percent_ribo")
# Feature plot of mt genes
VlnPlot(CombinedSeuratObject,
group.by= "orig.ident",
features = c("nFeature_RNA","nCount_RNA", "percent_mito","percent_ribo"),
pt.size = 0.1,
ncol = 4) + NoLegend()
# % Gene biotype and chromossome location
library(biomaRt)
# get names of species:
#biomaRt::listDatasets(mart)[,"dataset"]
# get versin names, to use latest
#biomaRt::listEnsemblArchives()
# Retrieve human gene annotation from ENSEMBL
mart = biomaRt::useMart(
biomart = "ensembl",
dataset = "hsapiens_gene_ensembl",
host = "aug2020.archive.ensembl.org")
# Retrieve the selected attributes mouse gene annotation
annot <- biomaRt::getBM(
attributes = c(
"external_gene_name",
"gene_biotype",
"chromosome_name"),
mart = mart)
## Cache found
# Match the gene names with their respective gene biotype
item <- annot[match(rownames(CombinedSeuratObject@assays$RNA@counts) , annot[,1]),"gene_biotype"]
item[is.na(item)] <- "unknown"
# Calculate the percentage of each gene biotype
perc <- rowsum(as.matrix(CombinedSeuratObject@assays$RNA@counts) , group=item)
perc <- (t(perc)/Matrix::colSums(CombinedSeuratObject@assays$RNA@counts))
o <- order(apply(perc,2,median),decreasing = F)
perc <- perc[,o]
# Add table to the object
gene_biotype_table <- setNames(as.data.frame((perc*100)[,names(sort(table(item),decreasing = T))]),paste0("percent_",names(sort(table(item),decreasing = T))))
CombinedSeuratObject@meta.data <- CombinedSeuratObject@meta.data[,!(colnames(CombinedSeuratObject@meta.data) %in% colnames(gene_biotype_table))]
CombinedSeuratObject@meta.data <- cbind(
CombinedSeuratObject@meta.data,
gene_biotype_table)
#Same is done with chromosome_name
# Match the gene names with their respective chromosome location
item <- annot[match(rownames(CombinedSeuratObject@assays$RNA@counts) , annot[,1]),"chromosome_name"]
item[is.na(item)] <- "unknown"
item[! item %in% as.character(c(1:23,"X","Y","MT")) ] <- "other"
# Calculate the percentage of each gene biotype
perc <- rowsum(as.matrix(CombinedSeuratObject@assays$RNA@counts) , group=item)
perc <- (t(perc)/Matrix::colSums(CombinedSeuratObject@assays$RNA@counts))
o <- order(apply(perc,2,median),decreasing = F)
perc <- perc[,o]
# Add table to the object
gene_biotype_table <- setNames(as.data.frame((perc*100)[,names(sort(table(item),decreasing = T))]),paste0("percent_",names(sort(table(item),decreasing = T))))
CombinedSeuratObject@meta.data <- CombinedSeuratObject@meta.data[,!(colnames(CombinedSeuratObject@meta.data) %in% colnames(gene_biotype_table))]
CombinedSeuratObject@meta.data <- cbind(
CombinedSeuratObject@meta.data,
gene_biotype_table)
# keep only protein coding genes
dim(CombinedSeuratObject)
## [1] 33538 5479
sel <- annot[match(rownames(CombinedSeuratObject) ,
annot[,1]),2] == "protein_coding"
genes_use <- rownames(CombinedSeuratObject)[sel]
genes_use <- as.character(na.omit(genes_use))
CombinedSeuratObject <- CombinedSeuratObject[genes_use,]
dim(CombinedSeuratObject)
## [1] 19397 5479
# cell cycle
CombinedSeuratObject <- CellCycleScoring(
object = CombinedSeuratObject,
g2m.features = cc.genes$g2m.genes,
s.features = cc.genes$s.genes)
## Warning: The following features are not present in the object: MLF1IP, not
## searching for symbol synonyms
## Warning: The following features are not present in the object: FAM64A, HN1, not
## searching for symbol synonyms
CombinedSeuratObject$G1.Score <- 1 - CombinedSeuratObject$S.Score - CombinedSeuratObject$G2M.Score
# filtering
# Filtering out genes that are expressed in less than 3 samples, as well as cells with less than 200 genes expressed.
CombinedSeuratObject <- CombinedSeuratObject[rowSums(as.matrix(CombinedSeuratObject@assays$RNA@counts)>0)>=3 ,
colSums(as.matrix(CombinedSeuratObject@assays$RNA@counts)>0)>=200]
# feature selection
CombinedSeuratObject <- FindVariableFeatures(
object = CombinedSeuratObject,
nfeatures = 3000,
selection.method = "vst",
verbose = FALSE,
assay = "RNA")
# Normalization
#remove genes with zero variance
CombinedSeuratObject <- CombinedSeuratObject[ Matrix::rowSums(CombinedSeuratObject) > 0, ]
#remove genes with zero variance
CombinedSeuratObject <- NormalizeData(
object = CombinedSeuratObject,
scale.factor = 10000,
normalization.method = "LogNormalize")
#Variable gene plot:
top20 <- head(VariableFeatures(CombinedSeuratObject), 20)
LabelPoints(plot = VariableFeaturePlot(CombinedSeuratObject), points = top20, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
# Linear scaling and centering (faster than poisson)
CombinedSeuratObject <- ScaleData(
object = CombinedSeuratObject,
vars.to.regress = c("nCount_RNA","nFeatures_RNA","percent_mito","S.Score"),
model.use = "linear",
do.scale = T,
do.center = T)
## Warning: Requested variables to regress not in object: nFeatures_RNA
## Regressing out nCount_RNA, percent_mito, S.Score
## Centering and scaling data matrix
# pca
CombinedSeuratObject <- RunPCA(object = CombinedSeuratObject,
assay = "RNA",
npcs = 100,
verbose = FALSE )
# tSNE
CombinedSeuratObject <- RunTSNE(object = CombinedSeuratObject,
reduction = "pca",
perplexity=30,
max_iter=1000,
theta=0.5,
eta=200,
exaggeration_factor=12,
dims.use = 1:50,
verbose = T,
num_threads=0)
## Read the 5080 x 5 data matrix successfully!
## OpenMP is working. 8 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.88 seconds (sparsity = 0.023303)!
## Learning embedding...
## Iteration 50: error is 89.855855 (50 iterations in 2.66 seconds)
## Iteration 100: error is 72.346303 (50 iterations in 2.71 seconds)
## Iteration 150: error is 69.827760 (50 iterations in 2.57 seconds)
## Iteration 200: error is 69.000762 (50 iterations in 2.51 seconds)
## Iteration 250: error is 68.377956 (50 iterations in 2.42 seconds)
## Iteration 300: error is 2.079499 (50 iterations in 2.39 seconds)
## Iteration 350: error is 1.659317 (50 iterations in 2.50 seconds)
## Iteration 400: error is 1.447837 (50 iterations in 2.45 seconds)
## Iteration 450: error is 1.320678 (50 iterations in 2.57 seconds)
## Iteration 500: error is 1.237463 (50 iterations in 2.95 seconds)
## Iteration 550: error is 1.182178 (50 iterations in 3.08 seconds)
## Iteration 600: error is 1.147065 (50 iterations in 3.14 seconds)
## Iteration 650: error is 1.124953 (50 iterations in 3.13 seconds)
## Iteration 700: error is 1.109929 (50 iterations in 3.16 seconds)
## Iteration 750: error is 1.100050 (50 iterations in 3.17 seconds)
## Iteration 800: error is 1.091429 (50 iterations in 3.18 seconds)
## Iteration 850: error is 1.083330 (50 iterations in 3.23 seconds)
## Iteration 900: error is 1.075522 (50 iterations in 3.18 seconds)
## Iteration 950: error is 1.068569 (50 iterations in 3.21 seconds)
## Iteration 1000: error is 1.063398 (50 iterations in 3.17 seconds)
## Fitting performed in 57.37 seconds.
# plot tSNE
DimPlot(object = CombinedSeuratObject,
# group.by = c("DATASET"),
reduction = "tsne",
dims = c(1,2),
pt.size = .1,
label = T,
ncol = 3)
# uMAP
CombinedSeuratObject <- RunUMAP(object = CombinedSeuratObject,
reduction = "pca",
dims = 1:50,
n.components = 2,
n.neighbors = 20,
spread = .3,
repulsion.strength = 1,
min.dist= .001,
verbose = T,
num_threads=0,
n.epochs = 200,
metric = "euclidean",
seed.use = 42,
reduction.name="umap")
## Warning: The following arguments are not used: num_threads
## 09:13:26 UMAP embedding parameters a = 12.99 b = 0.7939
## 09:13:26 Read 5080 rows and found 50 numeric columns
## 09:13:26 Using Annoy for neighbor search, n_neighbors = 20
## 09:13:26 Building Annoy index with metric = euclidean, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 09:13:26 Writing NN index file to temp file /tmp/RtmpygNzsq/file15595f11491a
## 09:13:26 Searching Annoy index using 1 thread, search_k = 2000
## 09:13:27 Annoy recall = 100%
## 09:13:27 Commencing smooth kNN distance calibration using 1 thread
## 09:13:28 Initializing from normalized Laplacian + noise
## 09:13:28 Commencing optimization for 200 epochs, with 143780 positive edges
## 09:13:30 Optimization finished
# plot UMAP
DimPlot(object = CombinedSeuratObject,
# group.by = c("DATASET"),
reduction = "umap",
dims = c(1,2),
pt.size = .1,
label = T,
ncol = 3)
# plot tSNE
FeaturePlot(object = CombinedSeuratObject,
features = c("percent_mito", "nCount_RNA", "nFeature_RNA"),
reduction = "tsne",
dims = c(1,2),
order = T,
pt.size = .1,
ncol = 3)
# plot umap
FeaturePlot(object = CombinedSeuratObject,
features = c("percent_mito", "nCount_RNA", "nFeature_RNA"),
reduction = "umap",
dims = c(1,2),
order = T,
pt.size = .1,
ncol = 3)
# remove batch with cca
SeuratObject.list <- SplitObject(
object = CombinedSeuratObject,
split.by = "orig.ident")
SeuratObject.anchors <- FindIntegrationAnchors(
object.list = SeuratObject.list,
dims = 1:30)
## Computing 2000 integration features
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 2926 anchors
## Filtering anchors
## Retained 2149 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 3090 anchors
## Filtering anchors
## Retained 1983 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 3408 anchors
## Filtering anchors
## Retained 2469 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 2957 anchors
## Filtering anchors
## Retained 2099 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 3333 anchors
## Filtering anchors
## Retained 2431 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 3523 anchors
## Filtering anchors
## Retained 3046 anchors
CombinedSeuratObject <- IntegrateData(
anchorset = SeuratObject.anchors,
dims = 1:30,
new.assay.name = "cca")
## Merging dataset 4 into 3
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 1 into 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 2 1 into 3 4
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Warning: Adding a command log without an assay associated with it
# Linear scaling and centering (faster than poisson)
CombinedSeuratObject <- ScaleData(
object = CombinedSeuratObject,
assay = "cca",
# vars.to.regress = c("nCount_RNA","nFeatures_RNA","percent_mito","S.Score"),
model.use = "linear",
do.scale = T,
do.center = T)
## Centering and scaling data matrix
# pca
CombinedSeuratObject <- RunPCA(object = CombinedSeuratObject,
assay = "cca",
npcs = 100,
verbose = FALSE )
# tSNE
CombinedSeuratObject <- RunTSNE(object = CombinedSeuratObject,
reduction = "pca",
perplexity=30,
max_iter=1000,
theta=0.5,
eta=200,
exaggeration_factor=12,
dims.use = 1:50,
verbose = T,
num_threads=0)
## Read the 5080 x 5 data matrix successfully!
## OpenMP is working. 8 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.85 seconds (sparsity = 0.023741)!
## Learning embedding...
## Iteration 50: error is 89.616038 (50 iterations in 3.39 seconds)
## Iteration 100: error is 71.253908 (50 iterations in 2.86 seconds)
## Iteration 150: error is 68.214660 (50 iterations in 2.72 seconds)
## Iteration 200: error is 67.097111 (50 iterations in 2.83 seconds)
## Iteration 250: error is 66.572063 (50 iterations in 3.35 seconds)
## Iteration 300: error is 2.037887 (50 iterations in 3.08 seconds)
## Iteration 350: error is 1.651030 (50 iterations in 3.03 seconds)
## Iteration 400: error is 1.452680 (50 iterations in 3.07 seconds)
## Iteration 450: error is 1.331552 (50 iterations in 3.09 seconds)
## Iteration 500: error is 1.255069 (50 iterations in 3.10 seconds)
## Iteration 550: error is 1.202225 (50 iterations in 3.15 seconds)
## Iteration 600: error is 1.166650 (50 iterations in 3.18 seconds)
## Iteration 650: error is 1.143866 (50 iterations in 3.21 seconds)
## Iteration 700: error is 1.126984 (50 iterations in 3.33 seconds)
## Iteration 750: error is 1.113643 (50 iterations in 3.50 seconds)
## Iteration 800: error is 1.103519 (50 iterations in 3.22 seconds)
## Iteration 850: error is 1.095353 (50 iterations in 3.29 seconds)
## Iteration 900: error is 1.088870 (50 iterations in 3.41 seconds)
## Iteration 950: error is 1.082467 (50 iterations in 3.21 seconds)
## Iteration 1000: error is 1.076996 (50 iterations in 4.37 seconds)
## Fitting performed in 64.38 seconds.
# plot tSNE
DimPlot(object = CombinedSeuratObject,
# group.by = c("DATASET"),
reduction = "tsne",
dims = c(1,2),
pt.size = .1,
label = T,
ncol = 3)
# plot tSNE with categories
FeaturePlot(object = CombinedSeuratObject,
features = c("percent_mito", "nCount_RNA", "nFeature_RNA","S.Score"),
reduction = "tsne",
dims = c(1,2),
order = T,
pt.size = .1,
ncol = 3)
# CombinedSeuratObject@assays$cca
CCA: corrected matrix.
Scanorama and harmony: reduction embedding.
In CCA, we create a new assay (see above), but the embeddings (PCA) need to be calculated
But in scanorama and harmony, it creates already the embeddings (PCA), so pca doesnt need to run again, one can directly plot UMAP/tSNE.
See above.
We integrate the data to remove batch effects.
Representation of data in a simplified way.
SNN, because it removes spurious connections.
## SNN
CombinedSeuratObject <- FindNeighbors(CombinedSeuratObject,
assay = "cca",
compute.SNN = T,
reduction = "pca" ,
dims = 1:50,
graph.name="SNN",
prune.SNN = 1/15,
k.param = 20,
force.recalc = T)
## Computing nearest neighbor graph
## Computing SNN
# Louvain
CombinedSeuratObject <- FindClusters(
object = CombinedSeuratObject,
graph.name = "SNN",
resolution = 0.2, # to get 8 clusters
algorithm = 1) #algorithim 1 = Louvain
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 5080
## Number of edges: 52597
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9416
## Number of communities: 16
## Elapsed time: 0 seconds
## 8 singletons identified. 8 final clusters.
DimPlot(object = CombinedSeuratObject,
group.by = c("seurat_clusters"),
reduction = "tsne",
dims = c(1,2),
pt.size = .1,
label = T,
ncol = 3)
# CombinedSeuratObject@graphs$SNN
Ti identify the cell types.
We choose a resolution (0.2) that gives us sufficient number of clusters to distinguish the main blood cell types (see also later the identification of marker genes).
We choose Louvain because we don’t have the package for Leiden.
By trying and see what makes sense from biological point of view. We could create more clusters but they have the same expression profile.
Yes.
# comparison of clusters. Are the clusters homogeneously distributed among samples.
comparison_table <- table(list(
CombinedSeuratObject@meta.data[,"orig.ident"],
CombinedSeuratObject@meta.data[,"seurat_clusters"]
))
#Transform data to percentages:
comparison_table <- t(t(comparison_table)/colSums(comparison_table))*100
#Barplot
barplot(comparison_table,
col=scales::hue_pal()(nrow(comparison_table))[1:nrow(comparison_table)],
border=NA,
las=2, legend = T)
# comparison of clusters. Are the clusters homogeneously distributed among samples.
comparison_table <- table(list(
CombinedSeuratObject@meta.data[,"group"],
CombinedSeuratObject@meta.data[,"seurat_clusters"]
))
#Transform data to percentages:
comparison_table <- t(t(comparison_table)/colSums(comparison_table))*100
#Barplot
barplot(comparison_table,
col=scales::hue_pal()(nrow(comparison_table))[1:nrow(comparison_table)],
border=NA,
las=2, legend = T)
We see a few clusters that have mostly covid samples.
# CombinedSeuratObject@meta.data$seurat_clusters
To identify cell clusters.
#Find markers:
markers <- FindAllMarkers( object = CombinedSeuratObject,
assay = "cca",
logfc.threshold = 0.25,
test.use = "wilcox",
slot = "data",
min.pct = 0.1,
min.diff.pct = -Inf,
only.pos = FALSE,
max.cells.per.ident = 51,
latent.vars = NULL,
min.cells.feature = 3,
min.cells.group = 3,
pseudocount.use = 1,
return.thresh = 0.01 )
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
#plot
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:biomaRt':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
top5 <- markers %>% group_by(cluster) %>% top_n(-5, p_val_adj)
DoHeatmap(object = CombinedSeuratObject,
features = as.character(unique(top5$gene)),
group.by = "seurat_clusters",
assay = "cca")
Specific cluster.
We are interested in clusters that have DEG between conditions. Those are cluster 0 and 6.
After trying all 8 clusters, these 2 show DEG between covid and control samples.
# DEG within a cluster.
cell_selection <- CombinedSeuratObject[,CombinedSeuratObject$seurat_clusters == 0]
cell_selection <- SetIdent(cell_selection, value = "group")
# Compute differentiall expression
DGE_cell_selection <- FindAllMarkers(
object = cell_selection,
assay = "RNA",
logfc.threshold = 0.2,
test.use = "wilcox",
min.pct = 0.1,
min.diff.pct = -Inf,
only.pos = FALSE,
max.cells.per.ident = 51, # number cells of smallest cluster
latent.vars = NULL,
min.cells.feature = 3,
min.cells.group = 3,
pseudocount.use = 1,
return.thresh = 0.01 )
## Calculating cluster covid
## Calculating cluster control
# plot top 5 genes
top5dg4 <- DGE_cell_selection %>% group_by(cluster) %>% top_n(-5, p_val_adj)
# heatmap
DoHeatmap(object = cell_selection,
features = as.character(unique(top5dg4$gene)),
group.by = "group",
assay = "RNA",slot = "data")
# DEG within a cluster.
cell_selection <- CombinedSeuratObject[,CombinedSeuratObject$seurat_clusters == 6]
cell_selection <- SetIdent(cell_selection, value = "group")
# Compute differentiall expression
DGE_cell_selection <- FindAllMarkers(
object = cell_selection,
assay = "RNA",
logfc.threshold = 0.2,
test.use = "wilcox",
min.pct = 0.1,
min.diff.pct = -Inf,
only.pos = FALSE,
max.cells.per.ident = 51, # number cells of smallest cluster
latent.vars = NULL,
min.cells.feature = 3,
min.cells.group = 3,
pseudocount.use = 1,
return.thresh = 0.01 )
## Calculating cluster covid
## Calculating cluster control
# plot top 5 genes
top5dg4 <- DGE_cell_selection %>% group_by(cluster) %>% top_n(-5, p_val_adj)
# heatmap
DoHeatmap(object = cell_selection,
features = as.character(unique(top5dg4$gene)),
group.by = "group",
assay = "RNA",slot = "data")
From cluster 6, we have IFI6 and IFI27, two genes related to Interferon alpha signalling and innate immune response. These two are downregulated in covid patient samples (they could be interesting for follow-up studies)
Clustering with resolution = 0.2 in Louvain to get 8 clusters that are representative of one cell type.
Wilcox
What parameters did you set for computing differential expression? Justify each one
logfc.threshold = 0.2, # we will detect small differences (2^0.2= 1.15) as a first step; and later one could find stronger changes.
min.pct = 0.1, # means the gene should be expressed in at least 10% of cells.
min.diff.pct = -Inf, # to consider all
only.pos = FALSE, # for up and down-regulated
max.cells.per.ident = 51, # number cells of smallest cluster, so it’s scaled down to the smallest cluster.
latent.vars = NULL, # we don’t want to correct confounding factors, we already did.
min.cells.feature = 3, # we want genes that are expressed in at least 3 cells (we did this already before)
min.cells.group = 3, # Every group should have at least 3 cells.
return.thresh = 0.01 # significance threshold to consider.
Which marker genes can separate each of the cell clusters in your data?
We plot known marker genes for each blood cell type, to try to know which cells identify each cluster. This will help later to know in which cells we have DEG.
#plotting clusters
DimPlot(object = CombinedSeuratObject,
group.by = c("seurat_clusters"),
reduction = "tsne",
dims = c(1,2),
pt.size = .1,
label = T,
ncol = 3)
# plotting known marker genes for different cell types
FeaturePlot(object = CombinedSeuratObject,
features = c("CD4","CD8B","CD3G","CD14","HBA1","MS4A1"),
reduction = "tsne",
dims = c(1,2),
order = T,
pt.size = .1,
ncol = 3)
With this we see that the clusters 0 and 6 having DEG between covid and control correspond to CD4 and monocytes cells.
“CD4”,“CD8B”,“CD3G”,“CD14”,“HBA1”,“MS4A1”, representing CD4, CD8, NK, monocytes, red blood cells, B cells, respectively.
Heatmap.
We identify DEG between conditions and we associate them to the cell type using known marker genes.
# from cluster 6 (CD4 T cells)
# functional enrichment
library(enrichR)
## Welcome to enrichR
## Checking connection ... Connection is Live!
# Check available databases to perform enrichment (then choose one)
enrichR::listEnrichrDbs()
## geneCoverage genesPerTerm libraryName
## 1 13362 275 Genome_Browser_PWMs
## 2 27884 1284 TRANSFAC_and_JASPAR_PWMs
## 3 6002 77 Transcription_Factor_PPIs
## 4 47172 1370 ChEA_2013
## 5 47107 509 Drug_Perturbations_from_GEO_2014
## 6 21493 3713 ENCODE_TF_ChIP-seq_2014
## 7 1295 18 BioCarta_2013
## 8 3185 73 Reactome_2013
## 9 2854 34 WikiPathways_2013
## 10 15057 300 Disease_Signatures_from_GEO_up_2014
## 11 4128 48 KEGG_2013
## 12 34061 641 TF-LOF_Expression_from_GEO
## 13 7504 155 TargetScan_microRNA
## 14 16399 247 PPI_Hub_Proteins
## 15 12753 57 GO_Molecular_Function_2015
## 16 23726 127 GeneSigDB
## 17 32740 85 Chromosome_Location
## 18 13373 258 Human_Gene_Atlas
## 19 19270 388 Mouse_Gene_Atlas
## 20 13236 82 GO_Cellular_Component_2015
## 21 14264 58 GO_Biological_Process_2015
## 22 3096 31 Human_Phenotype_Ontology
## 23 22288 4368 Epigenomics_Roadmap_HM_ChIP-seq
## 24 4533 37 KEA_2013
## 25 10231 158 NURSA_Human_Endogenous_Complexome
## 26 2741 5 CORUM
## 27 5655 342 SILAC_Phosphoproteomics
## 28 10406 715 MGI_Mammalian_Phenotype_Level_3
## 29 10493 200 MGI_Mammalian_Phenotype_Level_4
## 30 11251 100 Old_CMAP_up
## 31 8695 100 Old_CMAP_down
## 32 1759 25 OMIM_Disease
## 33 2178 89 OMIM_Expanded
## 34 851 15 VirusMINT
## 35 10061 106 MSigDB_Computational
## 36 11250 166 MSigDB_Oncogenic_Signatures
## 37 15406 300 Disease_Signatures_from_GEO_down_2014
## 38 17711 300 Virus_Perturbations_from_GEO_up
## 39 17576 300 Virus_Perturbations_from_GEO_down
## 40 15797 176 Cancer_Cell_Line_Encyclopedia
## 41 12232 343 NCI-60_Cancer_Cell_Lines
## 42 13572 301 Tissue_Protein_Expression_from_ProteomicsDB
## 43 6454 301 Tissue_Protein_Expression_from_Human_Proteome_Map
## 44 3723 47 HMDB_Metabolites
## 45 7588 35 Pfam_InterPro_Domains
## 46 7682 78 GO_Biological_Process_2013
## 47 7324 172 GO_Cellular_Component_2013
## 48 8469 122 GO_Molecular_Function_2013
## 49 13121 305 Allen_Brain_Atlas_up
## 50 26382 1811 ENCODE_TF_ChIP-seq_2015
## 51 29065 2123 ENCODE_Histone_Modifications_2015
## 52 280 9 Phosphatase_Substrates_from_DEPOD
## 53 13877 304 Allen_Brain_Atlas_down
## 54 15852 912 ENCODE_Histone_Modifications_2013
## 55 4320 129 Achilles_fitness_increase
## 56 4271 128 Achilles_fitness_decrease
## 57 10496 201 MGI_Mammalian_Phenotype_2013
## 58 1678 21 BioCarta_2015
## 59 756 12 HumanCyc_2015
## 60 3800 48 KEGG_2015
## 61 2541 39 NCI-Nature_2015
## 62 1918 39 Panther_2015
## 63 5863 51 WikiPathways_2015
## 64 6768 47 Reactome_2015
## 65 25651 807 ESCAPE
## 66 19129 1594 HomoloGene
## 67 23939 293 Disease_Perturbations_from_GEO_down
## 68 23561 307 Disease_Perturbations_from_GEO_up
## 69 23877 302 Drug_Perturbations_from_GEO_down
## 70 15886 9 Genes_Associated_with_NIH_Grants
## 71 24350 299 Drug_Perturbations_from_GEO_up
## 72 3102 25 KEA_2015
## 73 31132 298 Gene_Perturbations_from_GEO_up
## 74 30832 302 Gene_Perturbations_from_GEO_down
## 75 48230 1429 ChEA_2015
## 76 5613 36 dbGaP
## 77 9559 73 LINCS_L1000_Chem_Pert_up
## 78 9448 63 LINCS_L1000_Chem_Pert_down
## 79 16725 1443 GTEx_Tissue_Sample_Gene_Expression_Profiles_down
## 80 19249 1443 GTEx_Tissue_Sample_Gene_Expression_Profiles_up
## 81 15090 282 Ligand_Perturbations_from_GEO_down
## 82 16129 292 Aging_Perturbations_from_GEO_down
## 83 15309 308 Aging_Perturbations_from_GEO_up
## 84 15103 318 Ligand_Perturbations_from_GEO_up
## 85 15022 290 MCF7_Perturbations_from_GEO_down
## 86 15676 310 MCF7_Perturbations_from_GEO_up
## 87 15854 279 Microbe_Perturbations_from_GEO_down
## 88 15015 321 Microbe_Perturbations_from_GEO_up
## 89 3788 159 LINCS_L1000_Ligand_Perturbations_down
## 90 3357 153 LINCS_L1000_Ligand_Perturbations_up
## 91 12668 300 L1000_Kinase_and_GPCR_Perturbations_down
## 92 12638 300 L1000_Kinase_and_GPCR_Perturbations_up
## 93 8973 64 Reactome_2016
## 94 7010 87 KEGG_2016
## 95 5966 51 WikiPathways_2016
## 96 15562 887 ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X
## 97 17850 300 Kinase_Perturbations_from_GEO_down
## 98 17660 300 Kinase_Perturbations_from_GEO_up
## 99 1348 19 BioCarta_2016
## 100 934 13 HumanCyc_2016
## 101 2541 39 NCI-Nature_2016
## 102 2041 42 Panther_2016
## 103 5209 300 DrugMatrix
## 104 49238 1550 ChEA_2016
## 105 2243 19 huMAP
## 106 19586 545 Jensen_TISSUES
## 107 22440 505 RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO
## 108 8184 24 MGI_Mammalian_Phenotype_2017
## 109 18329 161 Jensen_COMPARTMENTS
## 110 15755 28 Jensen_DISEASES
## 111 10271 22 BioPlex_2017
## 112 10427 38 GO_Cellular_Component_2017
## 113 10601 25 GO_Molecular_Function_2017
## 114 13822 21 GO_Biological_Process_2017
## 115 8002 143 GO_Cellular_Component_2017b
## 116 10089 45 GO_Molecular_Function_2017b
## 117 13247 49 GO_Biological_Process_2017b
## 118 21809 2316 ARCHS4_Tissues
## 119 23601 2395 ARCHS4_Cell-lines
## 120 20883 299 ARCHS4_IDG_Coexp
## 121 19612 299 ARCHS4_Kinases_Coexp
## 122 25983 299 ARCHS4_TFs_Coexp
## 123 19500 137 SysMyo_Muscle_Gene_Sets
## 124 14893 128 miRTarBase_2017
## 125 17598 1208 TargetScan_microRNA_2017
## 126 5902 109 Enrichr_Libraries_Most_Popular_Genes
## 127 12486 299 Enrichr_Submissions_TF-Gene_Coocurrence
## 128 1073 100 Data_Acquisition_Method_Most_Popular_Genes
## 129 19513 117 DSigDB
## 130 14433 36 GO_Biological_Process_2018
## 131 8655 61 GO_Cellular_Component_2018
## 132 11459 39 GO_Molecular_Function_2018
## 133 19741 270 TF_Perturbations_Followed_by_Expression
## 134 27360 802 Chromosome_Location_hg19
## 135 13072 26 NIH_Funded_PIs_2017_Human_GeneRIF
## 136 13464 45 NIH_Funded_PIs_2017_Human_AutoRIF
## 137 13787 200 Rare_Diseases_AutoRIF_ARCHS4_Predictions
## 138 13929 200 Rare_Diseases_GeneRIF_ARCHS4_Predictions
## 139 16964 200 NIH_Funded_PIs_2017_AutoRIF_ARCHS4_Predictions
## 140 17258 200 NIH_Funded_PIs_2017_GeneRIF_ARCHS4_Predictions
## 141 10352 58 Rare_Diseases_GeneRIF_Gene_Lists
## 142 10471 76 Rare_Diseases_AutoRIF_Gene_Lists
## 143 12419 491 SubCell_BarCode
## 144 19378 37 GWAS_Catalog_2019
## 145 6201 45 WikiPathways_2019_Human
## 146 4558 54 WikiPathways_2019_Mouse
## 147 3264 22 TRRUST_Transcription_Factors_2019
## 148 7802 92 KEGG_2019_Human
## 149 8551 98 KEGG_2019_Mouse
## 150 12444 23 InterPro_Domains_2019
## 151 9000 20 Pfam_Domains_2019
## 152 7744 363 DepMap_WG_CRISPR_Screens_Broad_CellLines_2019
## 153 6204 387 DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019
## 154 13420 32 MGI_Mammalian_Phenotype_Level_4_2019
## 155 14148 122 UK_Biobank_GWAS_v1
## 156 9813 49 BioPlanet_2019
## 157 1397 13 ClinVar_2019
## 158 9116 22 PheWeb_2019
## 159 17464 63 DisGeNET
## 160 394 73 HMS_LINCS_KinomeScan
## 161 11851 586 CCLE_Proteomics_2020
## 162 8189 421 ProteomicsDB_2020
## 163 18704 100 lncHUB_lncRNA_Co-Expression
## 164 5605 39 Virus-Host_PPI_P-HIPSTer_2020
## 165 5718 31 Elsevier_Pathway_Collection
## 166 14156 40 Table_Mining_of_CRISPR_Studies
## 167 16979 295 COVID-19_Related_Gene_Sets
## link
## 1 http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/
## 2 http://jaspar.genereg.net/html/DOWNLOAD/
## 3
## 4 http://amp.pharm.mssm.edu/lib/cheadownload.jsp
## 5 http://www.ncbi.nlm.nih.gov/geo/
## 6 http://genome.ucsc.edu/ENCODE/downloads.html
## 7 https://cgap.nci.nih.gov/Pathways/BioCarta_Pathways
## 8 http://www.reactome.org/download/index.html
## 9 http://www.wikipathways.org/index.php/Download_Pathways
## 10 http://www.ncbi.nlm.nih.gov/geo/
## 11 http://www.kegg.jp/kegg/download/
## 12 http://www.ncbi.nlm.nih.gov/geo/
## 13 http://www.targetscan.org/cgi-bin/targetscan/data_download.cgi?db=vert_61
## 14 http://amp.pharm.mssm.edu/X2K
## 15 http://www.geneontology.org/GO.downloads.annotations.shtml
## 16 http://genesigdb.org/genesigdb/downloadall.jsp
## 17 http://software.broadinstitute.org/gsea/msigdb/index.jsp
## 18 http://biogps.org/downloads/
## 19 http://biogps.org/downloads/
## 20 http://www.geneontology.org/GO.downloads.annotations.shtml
## 21 http://www.geneontology.org/GO.downloads.annotations.shtml
## 22 http://www.human-phenotype-ontology.org/
## 23 http://www.roadmapepigenomics.org/
## 24 http://amp.pharm.mssm.edu/lib/keacommandline.jsp
## 25 https://www.nursa.org/nursa/index.jsf
## 26 http://mips.helmholtz-muenchen.de/genre/proj/corum/
## 27 http://amp.pharm.mssm.edu/lib/keacommandline.jsp
## 28 http://www.informatics.jax.org/
## 29 http://www.informatics.jax.org/
## 30 http://www.broadinstitute.org/cmap/
## 31 http://www.broadinstitute.org/cmap/
## 32 http://www.omim.org/downloads
## 33 http://www.omim.org/downloads
## 34 http://mint.bio.uniroma2.it/download.html
## 35 http://www.broadinstitute.org/gsea/msigdb/collections.jsp
## 36 http://www.broadinstitute.org/gsea/msigdb/collections.jsp
## 37 http://www.ncbi.nlm.nih.gov/geo/
## 38 http://www.ncbi.nlm.nih.gov/geo/
## 39 http://www.ncbi.nlm.nih.gov/geo/
## 40 https://portals.broadinstitute.org/ccle/home\n
## 41 http://biogps.org/downloads/
## 42 https://www.proteomicsdb.org/
## 43 http://www.humanproteomemap.org/index.php
## 44 http://www.hmdb.ca/downloads
## 45 ftp://ftp.ebi.ac.uk/pub/databases/interpro/
## 46 http://www.geneontology.org/GO.downloads.annotations.shtml
## 47 http://www.geneontology.org/GO.downloads.annotations.shtml
## 48 http://www.geneontology.org/GO.downloads.annotations.shtml
## 49 http://www.brain-map.org/
## 50 http://genome.ucsc.edu/ENCODE/downloads.html
## 51 http://genome.ucsc.edu/ENCODE/downloads.html
## 52 http://www.koehn.embl.de/depod/
## 53 http://www.brain-map.org/
## 54 http://genome.ucsc.edu/ENCODE/downloads.html
## 55 http://www.broadinstitute.org/achilles
## 56 http://www.broadinstitute.org/achilles
## 57 http://www.informatics.jax.org/
## 58 https://cgap.nci.nih.gov/Pathways/BioCarta_Pathways
## 59 http://humancyc.org/
## 60 http://www.kegg.jp/kegg/download/
## 61 http://pid.nci.nih.gov/
## 62 http://www.pantherdb.org/
## 63 http://www.wikipathways.org/index.php/Download_Pathways
## 64 http://www.reactome.org/download/index.html
## 65 http://www.maayanlab.net/ESCAPE/
## 66 http://www.ncbi.nlm.nih.gov/homologene
## 67 http://www.ncbi.nlm.nih.gov/geo/
## 68 http://www.ncbi.nlm.nih.gov/geo/
## 69 http://www.ncbi.nlm.nih.gov/geo/
## 70 https://grants.nih.gov/grants/oer.htm\n
## 71 http://www.ncbi.nlm.nih.gov/geo/
## 72 http://amp.pharm.mssm.edu/Enrichr
## 73 http://www.ncbi.nlm.nih.gov/geo/
## 74 http://www.ncbi.nlm.nih.gov/geo/
## 75 http://amp.pharm.mssm.edu/Enrichr
## 76 http://www.ncbi.nlm.nih.gov/gap
## 77 https://clue.io/
## 78 https://clue.io/
## 79 http://www.gtexportal.org/
## 80 http://www.gtexportal.org/
## 81 http://www.ncbi.nlm.nih.gov/geo/
## 82 http://www.ncbi.nlm.nih.gov/geo/
## 83 http://www.ncbi.nlm.nih.gov/geo/
## 84 http://www.ncbi.nlm.nih.gov/geo/
## 85 http://www.ncbi.nlm.nih.gov/geo/
## 86 http://www.ncbi.nlm.nih.gov/geo/
## 87 http://www.ncbi.nlm.nih.gov/geo/
## 88 http://www.ncbi.nlm.nih.gov/geo/
## 89 https://clue.io/
## 90 https://clue.io/
## 91 https://clue.io/
## 92 https://clue.io/
## 93 http://www.reactome.org/download/index.html
## 94 http://www.kegg.jp/kegg/download/
## 95 http://www.wikipathways.org/index.php/Download_Pathways
## 96
## 97 http://www.ncbi.nlm.nih.gov/geo/
## 98 http://www.ncbi.nlm.nih.gov/geo/
## 99 http://cgap.nci.nih.gov/Pathways/BioCarta_Pathways
## 100 http://humancyc.org/
## 101 http://pid.nci.nih.gov/
## 102 http://www.pantherdb.org/pathway/
## 103 https://ntp.niehs.nih.gov/drugmatrix/
## 104 http://amp.pharm.mssm.edu/Enrichr
## 105 http://proteincomplexes.org/
## 106 http://tissues.jensenlab.org/
## 107 http://www.ncbi.nlm.nih.gov/geo/
## 108 http://www.informatics.jax.org/
## 109 http://compartments.jensenlab.org/
## 110 http://diseases.jensenlab.org/
## 111 http://bioplex.hms.harvard.edu/
## 112 http://www.geneontology.org/
## 113 http://www.geneontology.org/
## 114 http://www.geneontology.org/
## 115 http://www.geneontology.org/
## 116 http://www.geneontology.org/
## 117 http://www.geneontology.org/
## 118 http://amp.pharm.mssm.edu/archs4
## 119 http://amp.pharm.mssm.edu/archs4
## 120 http://amp.pharm.mssm.edu/archs4
## 121 http://amp.pharm.mssm.edu/archs4
## 122 http://amp.pharm.mssm.edu/archs4
## 123 http://sys-myo.rhcloud.com/
## 124 http://mirtarbase.mbc.nctu.edu.tw/
## 125 http://www.targetscan.org/
## 126 http://amp.pharm.mssm.edu/Enrichr
## 127 http://amp.pharm.mssm.edu/Enrichr
## 128 http://amp.pharm.mssm.edu/Enrichr
## 129 http://tanlab.ucdenver.edu/DSigDB/DSigDBv1.0/
## 130 http://www.geneontology.org/
## 131 http://www.geneontology.org/
## 132 http://www.geneontology.org/
## 133 http://www.ncbi.nlm.nih.gov/geo/
## 134 http://hgdownload.cse.ucsc.edu/downloads.html
## 135 https://www.ncbi.nlm.nih.gov/pubmed/
## 136 https://www.ncbi.nlm.nih.gov/pubmed/
## 137 https://amp.pharm.mssm.edu/geneshot/
## 138 https://www.ncbi.nlm.nih.gov/gene/about-generif
## 139 https://www.ncbi.nlm.nih.gov/pubmed/
## 140 https://www.ncbi.nlm.nih.gov/pubmed/
## 141 https://www.ncbi.nlm.nih.gov/gene/about-generif
## 142 https://amp.pharm.mssm.edu/geneshot/
## 143 http://www.subcellbarcode.org/
## 144 https://www.ebi.ac.uk/gwas
## 145 https://www.wikipathways.org/
## 146 https://www.wikipathways.org/
## 147 https://www.grnpedia.org/trrust/
## 148 https://www.kegg.jp/
## 149 https://www.kegg.jp/
## 150 https://www.ebi.ac.uk/interpro/
## 151 https://pfam.xfam.org/
## 152 https://depmap.org/
## 153 https://depmap.org/
## 154 http://www.informatics.jax.org/
## 155 https://www.ukbiobank.ac.uk/tag/gwas/
## 156 https://tripod.nih.gov/bioplanet/
## 157 https://www.ncbi.nlm.nih.gov/clinvar/
## 158 http://pheweb.sph.umich.edu/
## 159 https://www.disgenet.org
## 160 http://lincs.hms.harvard.edu/kinomescan/
## 161 https://portals.broadinstitute.org/ccle
## 162 https://www.proteomicsdb.org/
## 163 https://amp.pharm.mssm.edu/lnchub/
## 164 http://phipster.org/
## 165 http://www.transgene.ru/disease-pathways/
## 166
## 167 https://amp.pharm.mssm.edu/covid19
## numTerms
## 1 615
## 2 326
## 3 290
## 4 353
## 5 701
## 6 498
## 7 249
## 8 78
## 9 199
## 10 142
## 11 200
## 12 269
## 13 222
## 14 385
## 15 1136
## 16 2139
## 17 386
## 18 84
## 19 96
## 20 641
## 21 5192
## 22 1779
## 23 383
## 24 474
## 25 1796
## 26 1658
## 27 84
## 28 71
## 29 476
## 30 6100
## 31 6100
## 32 90
## 33 187
## 34 85
## 35 858
## 36 189
## 37 142
## 38 323
## 39 323
## 40 967
## 41 93
## 42 207
## 43 30
## 44 3906
## 45 311
## 46 941
## 47 205
## 48 402
## 49 2192
## 50 816
## 51 412
## 52 59
## 53 2192
## 54 109
## 55 216
## 56 216
## 57 476
## 58 239
## 59 125
## 60 179
## 61 209
## 62 104
## 63 404
## 64 1389
## 65 315
## 66 12
## 67 839
## 68 839
## 69 906
## 70 32876
## 71 906
## 72 428
## 73 2460
## 74 2460
## 75 395
## 76 345
## 77 33132
## 78 33132
## 79 2918
## 80 2918
## 81 261
## 82 286
## 83 286
## 84 261
## 85 401
## 86 401
## 87 312
## 88 312
## 89 96
## 90 96
## 91 3644
## 92 3644
## 93 1530
## 94 293
## 95 437
## 96 104
## 97 285
## 98 285
## 99 237
## 100 152
## 101 209
## 102 112
## 103 7876
## 104 645
## 105 995
## 106 1842
## 107 1302
## 108 5231
## 109 2283
## 110 1811
## 111 3915
## 112 636
## 113 972
## 114 3166
## 115 816
## 116 3271
## 117 10125
## 118 108
## 119 125
## 120 352
## 121 498
## 122 1724
## 123 1135
## 124 3240
## 125 683
## 126 121
## 127 1722
## 128 12
## 129 4026
## 130 5103
## 131 446
## 132 1151
## 133 1958
## 134 36
## 135 5687
## 136 12558
## 137 3725
## 138 2244
## 139 12558
## 140 5684
## 141 2244
## 142 3725
## 143 104
## 144 1737
## 145 472
## 146 176
## 147 571
## 148 308
## 149 303
## 150 1071
## 151 608
## 152 558
## 153 325
## 154 5261
## 155 857
## 156 1510
## 157 182
## 158 1161
## 159 9828
## 160 148
## 161 378
## 162 913
## 163 3729
## 164 6715
## 165 1721
## 166 802
## 167 205
# perform enrichment
enrich_results <- enrichr(
genes = top5dg4$gene[ top5dg4$cluster == "covid" ],
databases = "GO_Biological_Process_2017b" )[[1]]
## Uploading data to Enrichr... Done.
## Querying GO_Biological_Process_2017b... Done.
## Parsing results... Done.
# Some databases of interest:
#
# GO_Biological_Process_2017b
# KEGG_2019_Human
# KEGG_2019_Mouse
# WikiPathways_2019_Human
# WikiPathways_2019_Mouse
enrich_results$Term <- sub("GO.*","",enrich_results$Term)
par(1,1,mar=c(3,20,2,1))
barplot( height = -log10(enrich_results$P.value)[20:1],
names.arg = enrich_results$Term[20:1],
horiz = T,
las=2,
border=F,
cex.names = .6 )
abline( v=c( -log10(0.05) ),lty=2 )
abline( v=0,lty=1 )
GO analysis of DEG in covid samples of cluster 6 shows and enrichment for the interferon signalling pathway, a pathway that has recently been shown to be impaired in severe COVID-19 patients (Hadjadj et al. Science 2020)