Research Analysis Tasks

The following milestones and tasks are what you will need to perform in order to complete the project-based learning outcomes for this course. Go through them in order and refer to the glossary for more detailed information on each individual step.

Milestone 1

Download the data

The data for this project is available at the course GitHub repository, which can be downloaded like so:

Run git clone https://github.com/sib-swiss/SchoolRNA2020.git
A new directory will be created named SchoolRNA2020; the data is residing in the single_cell/data/ directory

You can either read the data from there (check the glossary for details) or copy the data to some other location, if you prefer. You can run the above command either on the command line interface or from within RStudio (more details in the glossary).

Load and merge datasets

Consult the Glossary or additional sources for help
Which file format do we have the data in?

H5.

Reading data:

cov.17 <- Seurat::Read10X_h5(
  filename = "data/covid_data_GSE149689/nCoV_PBMC_17.h5",
  use.names = T)
cov.15 <- Seurat::Read10X_h5(
  filename = "data/covid_data_GSE149689/nCoV_PBMC_15.h5",
  use.names = T)
ctl.14 <- Seurat::Read10X_h5(
  filename = "data/covid_data_GSE149689/Normal_PBMC_14.h5",
  use.names = T)
ctl.13 <- Seurat::Read10X_h5(
  filename = "data/covid_data_GSE149689/Normal_PBMC_13.h5",
  use.names = T)

Why do we need to create a Seurat object?

Because it’s a useful object containing methods for single cell data analysis.

Creating Seurat objects for each sample:

library("Seurat", lib.loc="~/sibsinglecell/lib/R/library")

## Warning: package 'Seurat' was built under R version 3.6.3

cov17.seurat <- CreateSeuratObject(
  counts = cov.17,
  assay = "RNA",
  project = "covid_17")
cov15.seurat <- CreateSeuratObject(
  counts = cov.15,
  assay = "RNA",
  project = "covid_15")
ctl14.seurat <- CreateSeuratObject(
  counts = ctl.14,
  assay = "RNA",
  project = "control_14")
ctl13.seurat <- CreateSeuratObject(
  counts = ctl.13,
  assay = "RNA",
  project = "control_13")

Combining datasets:

CombinedSeuratObject <- merge(
  x = cov15.seurat,
  y = c( cov17.seurat ,
         ctl13.seurat,
         ctl14.seurat),
  add.cell.ids = c('covid_15','covid_17','control_13','control_14'))

The following is to create a column with the group (control or covid):

CombinedSeuratObject$group <-sub("_.*","",CombinedSeuratObject$orig.ident)

Where in the Seurat object is your counts stored?

#CombinedSeuratObject$nCount_RNA

Describe in form of text the rational for this step in your markdown report.

In this step we read the count data and create a Seurat object, a simple R object for scRNAseq data. We then combine all datasets, creating a metadata column that specificies sample group (control or covid).

Compute QC

Consult the Glossary or additional sources for help
Which QC metrics should you calculate?

Total number of features:

A standard approach is to filter cells with low amount of reads as well as genes that are present in at least a certain amount of cells. Here we will only consider cells with at least 200 detected genes and genes need to be expressed in at least 3 cells. Please note that those values are highly dependent on the library preparation method used. Extremely high number of detected genes could indicate doublets. However, depending on the cell type composition in your sample, you may have cells with higher number of genes (and also higher counts) from one cell type.

VlnPlot(CombinedSeuratObject,
        group.by= "orig.ident",
        features = c("nFeature_RNA","nCount_RNA"),
        pt.size = 0.1,
        ncol = 4) + NoLegend()

On the left figure each dot is a cell and we see the number of genes expressed in each cell. On the right plot, the number of reads obtained from each cell (UMI have been merged already). All samples seem to have a similar distribution in #genes and #counts, except sample covid_15 which shows a few cells with high #counts, that could indicate doublets (we don’t really know).

Gene QC

In single cell, the most detected genes usually belong to housekeeping gene families, such as mitochondrial (MT-), ribossomal (RPL and RPS) and other structural proteins (i.e., ACTB, TMSB4X, B2M, EEF1A1).

#Compute the relative expression of each gene per cell
rel_expression <- Matrix::t( Matrix::t(CombinedSeuratObject@assays$RNA@counts) / Matrix::colSums(CombinedSeuratObject@assays$RNA@counts)) * 100
most_expressed <- sort(Matrix::rowSums( rel_expression ),T) / ncol(CombinedSeuratObject)

#Plot the relative expression of each gene per cell
par(mfrow=c(1,3),mar=c(4,6,1,1))
boxplot( as.matrix(Matrix::t(rel_expression[names(most_expressed[30:1]),])),cex=.1, las=1, xlab="% total count per cell",col=scales::hue_pal()(90)[30:1],horizontal=TRUE,ylim=c(0,8))
boxplot( as.matrix(Matrix::t(rel_expression[names(most_expressed[60:31]),])),cex=.1, las=1, xlab="% total count per cell",col=scales::hue_pal()(90)[60:31],horizontal=TRUE,ylim=c(0,8))
boxplot( as.matrix(Matrix::t(rel_expression[names(most_expressed[90:61]),])),cex=.1, las=1, xlab="% total count per cell",col=scales::hue_pal()(90)[90:61],horizontal=TRUE,ylim=c(0,8))

par(mfrow=c(1,1))

This plot shows the highest expressed genes in the combined samples. We see the usually high expressed mt and RP genes, and probably house keeping genes like Malat1 (lncRNA highly expressed in PBMC cells).

Mitocondrial genes

Having the data in a suitable format, we can start calculating some quality metrics. We can for example calculate the percentage of mitocondrial and ribosomal genes per cell and add to the metadata. This will be helpfull to visualize them across different metadata parameteres (i.e. datasetID and chemistry version). There are several ways of doing this. Here is an example of how to manually calculate the proportion of mitochondrial reads and add to the metadata table.

Citing from “Simple Single Cell” workflows (Lun, McCarthy & Marioni, 2017): “High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.”

(PS: non-linear relationship)

# create seurat object of MT- genes:
CombinedSeuratObject <- PercentageFeatureSet(
  object = CombinedSeuratObject,
  pattern = "^MT-",
  assay = "RNA",
  col.name = "percent_mito")
# Feature plot of mt genes
VlnPlot(CombinedSeuratObject,
        group.by= "orig.ident",
        features = c("nFeature_RNA","nCount_RNA","percent_mito"),
        pt.size = 0.1,
        ncol = 4) + NoLegend()

In the right plot we see the percentage of mt-genes per cell. Sample covid_15, we see large number of cells where high proportion of genes are mt. We will need filter out cells that have more that 25% mt genes.

Ribosomal genes

In the same manner we will calculate the proportion gene expression that comes from ribosomal proteins. Ribosomal genes are the also among the top expressed genes in any cell and, on the contrary to mitochondrial genes, are inversely proportional to the mitochondrial content: the higher the mitochondrial content, the lower is the detection of ribosomal genes (PS: non-linear relationship).

# create seurat object of RP- genes:
CombinedSeuratObject <- PercentageFeatureSet(
  object = CombinedSeuratObject,
  pattern = "^RP[SL]",
  assay = "RNA",
  col.name = "percent_ribo")
# Feature plot of rp genes
VlnPlot(CombinedSeuratObject,
        group.by= "orig.ident",
        features = c("nFeature_RNA","nCount_RNA","percent_ribo"),
        pt.size = 0.1,
        ncol = 4) + NoLegend()

We see the anticorrelation of cells with high mt genes having low % ribo genes. Sample covid_15 seems the one with poorest quality.

% Gene biotype and chromossome location

In RNA-sequencing, genes can be categorized into different groups depending on their RNA biotype. For example, “coding”, “non-coding”, “VDJ region genes” are “small interefering RNA” common gene biotypes. Besides, having information about chromossomal location might be usefull to identify bacth effects driven by sex chromossomes.

Depending on the desired type of analysis, some gene categories can be filtered out if not of interest. For single cell specifically, cell libraries are usually constructed using poly-A enrichment and therefore enriching for “protein-coding proteins”, which usually contitutes around 80-90% of all available genes.

How to run it:

library(biomaRt)
# get names of species:
#biomaRt::listDatasets(mart)[,"dataset"]
# get versin names, to use latest
#biomaRt::listEnsemblArchives()

# Retrieve human gene annotation from ENSEMBL
mart = biomaRt::useMart(
  biomart = "ensembl",
  dataset = "hsapiens_gene_ensembl",
  host = "aug2020.archive.ensembl.org")

# Retrieve the selected attributes mouse gene annotation
annot <- biomaRt::getBM(
  attributes = c(
    "external_gene_name",
    "gene_biotype",
    "chromosome_name"),
  mart = mart)

## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Warning: `filter_()` is deprecated as of dplyr 0.7.0.
## Please use `filter()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Cache found

# Match the gene names with their respective gene biotype
item <- annot[match(rownames(CombinedSeuratObject@assays$RNA@counts) , annot[,1]),"gene_biotype"]
item[is.na(item)] <- "unknown"

# Calculate the percentage of each gene biotype
perc <- rowsum(as.matrix(CombinedSeuratObject@assays$RNA@counts) , group=item)
perc <- (t(perc)/Matrix::colSums(CombinedSeuratObject@assays$RNA@counts))
o <- order(apply(perc,2,median),decreasing = F)
perc <- perc[,o]

# Plot percentage of each gene biotype
boxplot( perc*100,outline=F,las=2,main="% reads per cell",col=scales::hue_pal()(100),horizontal=T)

# Add table to the object
gene_biotype_table <- setNames(as.data.frame((perc*100)[,names(sort(table(item),decreasing = T))]),paste0("percent_",names(sort(table(item),decreasing = T))))

CombinedSeuratObject@meta.data <- CombinedSeuratObject@meta.data[,!(colnames(CombinedSeuratObject@meta.data) %in% colnames(gene_biotype_table))]

CombinedSeuratObject@meta.data <- cbind(
  CombinedSeuratObject@meta.data,
  gene_biotype_table)

Same is done with chromosome_name

# Match the gene names with their respective chromosome location
item <- annot[match(rownames(CombinedSeuratObject@assays$RNA@counts) , annot[,1]),"chromosome_name"]
item[is.na(item)] <- "unknown"
item[! item %in% as.character(c(1:23,"X","Y","MT")) ] <- "other"

# Calculate the percentage of each gene biotype
perc <- rowsum(as.matrix(CombinedSeuratObject@assays$RNA@counts) , group=item)
perc <- (t(perc)/Matrix::colSums(CombinedSeuratObject@assays$RNA@counts))
o <- order(apply(perc,2,median),decreasing = F)
perc <- perc[,o]

# Plot percentage of each gene biotype
boxplot( perc*100,outline=F,las=2,main="% reads per cell",col=scales::hue_pal()(100),horizontal=T)

# Add table to the object
gene_biotype_table <- setNames(as.data.frame((perc*100)[,names(sort(table(item),decreasing = T))]),paste0("percent_",names(sort(table(item),decreasing = T))))

CombinedSeuratObject@meta.data <- CombinedSeuratObject@meta.data[,!(colnames(CombinedSeuratObject@meta.data) %in% colnames(gene_biotype_table))]

CombinedSeuratObject@meta.data <- cbind(
  CombinedSeuratObject@meta.data,
  gene_biotype_table)

If you want to focus the analysis on only protein-coding genes, for example, you can do it like so:

dim(CombinedSeuratObject)

## [1] 33538  6800

sel <- annot[match(rownames(CombinedSeuratObject) ,
                   annot[,1]),2] == "protein_coding"
genes_use <- rownames(CombinedSeuratObject)[sel]
genes_use <- as.character(na.omit(genes_use))
CombinedSeuratObject <- CombinedSeuratObject[genes_use,]
dim(CombinedSeuratObject)

## [1] 19397  6800

Cell cycle scoring

We here perform cell cycle scoring. To score a gene list, the algorithm calculates the difference of mean expression of the given list and the mean expression of reference genes. To build the reference, the function randomly chooses a bunch of genes matching the distribution of the expression of the given list. Cell cycle scoring with Seurat adds three slots in data, a score for S phase, a score for G2M phase and the predicted cell cycle phase. The Seurat package provides a list of human G2M and S phase genes in cc.genes.

How to run it:

CombinedSeuratObject <- CellCycleScoring(
  object = CombinedSeuratObject,
  g2m.features = cc.genes$g2m.genes,
  s.features = cc.genes$s.genes)

## Warning: The following features are not present in the object: MLF1IP, not
## searching for symbol synonyms

## Warning: The following features are not present in the object: FAM64A, HN1, not
## searching for symbol synonyms

CombinedSeuratObject$G1.Score <- 1 - CombinedSeuratObject$S.Score - CombinedSeuratObject$G2M.Score

VlnPlot(CombinedSeuratObject,
        group.by= "orig.ident",
        features = c("S.Score","G1.Score","G2M.Score"),
        pt.size = 0.1,
        ncol = 4) + NoLegend()

From the plots, we can see that cell cycle is not different between samples, so it won’t have an impact in the analysis.

What does each QC metric mean?

Features and Gene QC: A sample with low # genes or los # counts could indicate a problem with sequence depth or poor sample quality. We need to filter out cells with low number of features for further analysis, and features that are not expressed in a certain number of samples (e.g. min in 3 samples).

Mitocondrial genes: cells with high proportion of mt-genes are indicative of poor quality and should be removed. High expression of mitochondrial genes could be indicative of several processes. Cells that are stressed/damaged/lysed will lose their cytoplasmic components, whereas the mitochdria is still intact during the damage, therefore most of the genes from such cells will be mitochondrial genes. If these cells are used for differential analysis, we will falsely obtain genes involved in cell damage/apoptosis processes. Therefore, they need to be filtered out.

Cell cycle: we need to assess if cells are dividing, as this could affect differential expression analysis.

Describe in form of text the rationale for this step in your markdown report.

In this ste, we performed QC control of the data to remove low quality cells or features that could confound downstream analysis.

Define appropriate filtering thresholds

Consult the Glossary or additional sources for help
Which QC metric needs filtering in your data?

Mitocondrial content. Cells with low features. Features present in too few cells.

Which filtering parameters and respective thresholds suit your data?

Filtering out genes that are expressed in less than 3 samples, as well as cells with less than 200 genes expressed.

CombinedSeuratObject <- CombinedSeuratObject[rowSums(as.matrix(CombinedSeuratObject@assays$RNA@counts)>0)>=3 ,
                                             colSums(as.matrix(CombinedSeuratObject@assays$RNA@counts)>0)>=200]

Describe in form of text the rationale for this step in your markdown report

We removed cells with high mt-gene content; cells with less than 200 genes expressed, and genes that are expressed in less than 3 samples.

Milestone 2

Normalization and scaling

Consult the Glossary or additional sources for help
Why do we need to normalize the data? What exactly are we normalizing?

To make it comparable. To remove the effect of sequencing depth, and gene length.

Normalization

#remove genes with zero variance
CombinedSeuratObject <- CombinedSeuratObject[ Matrix::rowSums(CombinedSeuratObject) > 0, ]

#remove genes with zero variance
CombinedSeuratObject <- NormalizeData(
  object = CombinedSeuratObject,
  scale.factor = 10000,
  normalization.method = "LogNormalize")

feature selection

CombinedSeuratObject <- FindVariableFeatures(
  object = CombinedSeuratObject,
  nfeatures = 3000,
  selection.method = "vst",
  verbose = FALSE,
  assay = "RNA")

#Variable gene plot:

top20 <- head(VariableFeatures(CombinedSeuratObject), 20)
LabelPoints(plot = VariableFeaturePlot(CombinedSeuratObject), points = top20, repel = TRUE)

## Warning: Using `as.character()` on a quosure is deprecated as of rlang 0.3.0.
## Please use `as_label()` or `as_name()` instead.
## This warning is displayed once per session.

## When using repel, set xnudge and ynudge to 0 for optimal results

Where in the Seurat object is your normalized data stored?

#CombinedSeuratObject@assays$RNA@data

Which data covariates could potentially influence the interpretation of the results?

Mitocondrial genes, number of features, etc.

Following the question above, are there any covariates that need to be regressed out?

Percent_mito (mt-genes), S.score (cell cycle), nFeature_RNA, nCount_RNA.

Scaling and centering

# Linear scaling and centering (faster than poisson)
CombinedSeuratObject <- ScaleData(
  object = CombinedSeuratObject,
  vars.to.regress = c("nCount_RNA","nFeature_RNA","percent_mito","S.Score"),
  model.use = "linear",
  do.scale = T,
  do.center = T)

## Regressing out nCount_RNA, nFeature_RNA, percent_mito, S.Score

## Centering and scaling data matrix

Are all genes equally important for your analysis? Justify.

No. We are interested in genes that are highly variable between samples, because they are most interesting biologically.

Where in the Seurat object is your scaled data stored?

# CombinedSeuratObject@assays$RNA@scale.data

Describe in form of text the rational for this step in your markdown report

We select features that are most informative (most variable); we normalize data (for sequencing depth differences), and we scale the data, regressing out covariates that could confound downstream analysis.

Data Visualization

Consult the Glossary or additional sources for help
Which method would you use to capture most significant information out of the data?

PCA

#run pca
CombinedSeuratObject <- RunPCA(object = CombinedSeuratObject,
                       assay = "RNA",
                       npcs = 100,
                       verbose = FALSE )

Following the question above, which parameters would you choose? Why?

The number of principal components.

Which method would you choose for visualization of the differences between your cells?

UMAP and tSNE (with the first 50 PC).

# tSNE
CombinedSeuratObject <- RunTSNE(object = CombinedSeuratObject,
                        reduction = "pca",
                        perplexity=30,
                        max_iter=1000,
                        theta=0.5,
                        eta=200,
                        exaggeration_factor=12,
                        dims.use = 1:50,
                        verbose = T,
                        num_threads=0)

## Read the 5931 x 5 data matrix successfully!
## OpenMP is working. 8 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 1.20 seconds (sparsity = 0.020042)!
## Learning embedding...
## Iteration 50: error is 91.748648 (50 iterations in 2.86 seconds)
## Iteration 100: error is 73.893670 (50 iterations in 3.03 seconds)
## Iteration 150: error is 71.120532 (50 iterations in 3.14 seconds)
## Iteration 200: error is 70.200473 (50 iterations in 2.97 seconds)
## Iteration 250: error is 69.733782 (50 iterations in 3.06 seconds)
## Iteration 300: error is 2.211945 (50 iterations in 2.97 seconds)
## Iteration 350: error is 1.798589 (50 iterations in 2.69 seconds)
## Iteration 400: error is 1.582548 (50 iterations in 2.75 seconds)
## Iteration 450: error is 1.449213 (50 iterations in 2.94 seconds)
## Iteration 500: error is 1.359214 (50 iterations in 2.90 seconds)
## Iteration 550: error is 1.296529 (50 iterations in 3.27 seconds)
## Iteration 600: error is 1.250634 (50 iterations in 3.85 seconds)
## Iteration 650: error is 1.220024 (50 iterations in 3.60 seconds)
## Iteration 700: error is 1.202863 (50 iterations in 3.83 seconds)
## Iteration 750: error is 1.193324 (50 iterations in 3.97 seconds)
## Iteration 800: error is 1.185357 (50 iterations in 3.69 seconds)
## Iteration 850: error is 1.177214 (50 iterations in 3.70 seconds)
## Iteration 900: error is 1.169584 (50 iterations in 3.72 seconds)
## Iteration 950: error is 1.161636 (50 iterations in 3.67 seconds)
## Iteration 1000: error is 1.154700 (50 iterations in 3.68 seconds)
## Fitting performed in 66.29 seconds.

# uMAP
CombinedSeuratObject <- RunUMAP(object = CombinedSeuratObject,
                        reduction = "pca",
                        dims = 1:50,
                        n.components = 2,
                        n.neighbors = 20,
                        spread = .3,
                        repulsion.strength = 1,
                        min.dist= .001,
                        verbose = T,
                        num_threads=0,
                        n.epochs = 200,
                        metric = "euclidean",
                        seed.use = 42,
                        reduction.name="umap")

## Warning: The following arguments are not used: num_threads

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 09:09:48 UMAP embedding parameters a = 12.99 b = 0.7939

## 09:09:48 Read 5931 rows and found 50 numeric columns

## 09:09:48 Using Annoy for neighbor search, n_neighbors = 20

## 09:09:48 Building Annoy index with metric = euclidean, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 09:09:48 Writing NN index file to temp file /tmp/RtmpygNzsq/file15597d5f138
## 09:09:48 Searching Annoy index using 1 thread, search_k = 2000
## 09:09:49 Annoy recall = 100%
## 09:09:50 Commencing smooth kNN distance calibration using 1 thread
## 09:09:50 Initializing from normalized Laplacian + noise
## 09:09:51 Commencing optimization for 200 epochs, with 169892 positive edges
## 09:09:53 Optimization finished

# plot tSNE
DimPlot(object = CombinedSeuratObject,
#        group.by = c("DATASET"),
        reduction = "tsne",
        dims = c(1,2),
        pt.size = .1,
        label = T,
        ncol = 3)

# plot UMAP
DimPlot(object = CombinedSeuratObject,
#        group.by = c("DATASET"),
        reduction = "umap",
        dims = c(1,2),
        pt.size = .1,
        label = T,
        ncol = 3)

# plot tSNE
FeaturePlot(object = CombinedSeuratObject,
            features = c("percent_mito", "nCount_RNA", "nFeature_RNA"),
            reduction = "tsne",
            dims = c(1,2),
            order = T,
            pt.size = .1,
            ncol = 3)

Following the question above, how the parameters in this method influence your visual representation?

I we choose the first few PCs, that would explain the majority of the variance. If we’d choose more PCs, clusters will be less well separated because we are considering more variablity of the data.

How some of your QC parameters and datasets influence the separation of your cells?

If we don’t remove poor quality cells, these could cluster together without giving us biologically-relevant information.

Where in the Seurat object is your reductions stored?

# CombinedSeuratObject@reductions

Describe in form of text the rational for this step in your markdown report

We have visualized the data as a PCA ,to reduce the complexity.

Milestone 3

Dataset integration

Consult the Glossary or additional sources for help
Are there any batch effects in the data? Is batch correction / dataset integration necessary?

Patient. They are separated. We need to integrate them to be able to separate cell types.

Following the question above, which parameters allow you to say that batch effects are present?

We saw in in the UMAP/tSNE (covid_15 is apart.)

Would a simple linear regression be sufficient for removing batch effects on your dataset?

We do scaling again adding the sample replicates Then do PCA, tSNE and plot that again.

# Linear scaling and centering (faster than poisson)
CombinedSeuratObject <- ScaleData(
  object = CombinedSeuratObject,
  vars.to.regress = c("nCount_RNA","nFeature_RNA","percent_mito","S.Score","orig.ident"),
  model.use = "linear",
  do.scale = T,
  do.center = T)

## Regressing out nCount_RNA, nFeature_RNA, percent_mito, S.Score, orig.ident

## Centering and scaling data matrix

# pca
CombinedSeuratObject <- RunPCA(object = CombinedSeuratObject,
                               assay = "RNA",
                               npcs = 100,
                               verbose = FALSE )
# tSNE
CombinedSeuratObject <- RunTSNE(object = CombinedSeuratObject,
                                reduction = "pca",
                                perplexity=30,
                                max_iter=1000,
                                theta=0.5,
                                eta=200,
                                exaggeration_factor=12,
                                dims.use = 1:50,
                                verbose = T,
                                num_threads=0)

## Read the 5931 x 5 data matrix successfully!
## OpenMP is working. 8 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.98 seconds (sparsity = 0.019822)!
## Learning embedding...
## Iteration 50: error is 91.845550 (50 iterations in 3.16 seconds)
## Iteration 100: error is 74.787813 (50 iterations in 2.93 seconds)
## Iteration 150: error is 72.091245 (50 iterations in 2.93 seconds)
## Iteration 200: error is 71.337837 (50 iterations in 3.09 seconds)
## Iteration 250: error is 71.066622 (50 iterations in 3.25 seconds)
## Iteration 300: error is 2.196712 (50 iterations in 2.91 seconds)
## Iteration 350: error is 1.760298 (50 iterations in 3.36 seconds)
## Iteration 400: error is 1.537295 (50 iterations in 3.14 seconds)
## Iteration 450: error is 1.401632 (50 iterations in 2.89 seconds)
## Iteration 500: error is 1.310989 (50 iterations in 3.78 seconds)
## Iteration 550: error is 1.248310 (50 iterations in 3.66 seconds)
## Iteration 600: error is 1.204516 (50 iterations in 3.77 seconds)
## Iteration 650: error is 1.175341 (50 iterations in 6.15 seconds)
## Iteration 700: error is 1.155939 (50 iterations in 4.27 seconds)
## Iteration 750: error is 1.142352 (50 iterations in 3.83 seconds)
## Iteration 800: error is 1.131431 (50 iterations in 3.84 seconds)
## Iteration 850: error is 1.123433 (50 iterations in 3.83 seconds)
## Iteration 900: error is 1.115514 (50 iterations in 4.27 seconds)
## Iteration 950: error is 1.107891 (50 iterations in 3.83 seconds)
## Iteration 1000: error is 1.101574 (50 iterations in 3.79 seconds)
## Fitting performed in 72.66 seconds.

# plot tSNE
DimPlot(object = CombinedSeuratObject,
        #        group.by = c("DATASET"),
        reduction = "tsne",
        dims = c(1,2),
        pt.size = .1,
        label = T,
        ncol = 3)

No it’s not good enough to do simple linear regression.

Which method for batch effect would you choose?

CCA.

SeuratObject.list <- SplitObject(
  object = CombinedSeuratObject,
  split.by = "orig.ident")

SeuratObject.anchors <- FindIntegrationAnchors(
  object.list = SeuratObject.list,
  dims = 1:30)

## Computing 2000 integration features

## Scaling features for provided objects

## Finding all pairwise anchors

## Running CCA

## Merging objects

## Finding neighborhoods

## Finding anchors

##  Found 3161 anchors

## Filtering anchors

##  Retained 2330 anchors

## Running CCA

## Merging objects

## Finding neighborhoods

## Finding anchors

##  Found 3487 anchors

## Filtering anchors

##  Retained 2410 anchors

## Running CCA

## Merging objects

## Finding neighborhoods

## Finding anchors

##  Found 3652 anchors

## Filtering anchors

##  Retained 2621 anchors

## Running CCA

## Merging objects

## Finding neighborhoods

## Finding anchors

##  Found 3375 anchors

## Filtering anchors

##  Retained 2441 anchors

## Running CCA

## Merging objects

## Finding neighborhoods

## Finding anchors

##  Found 3652 anchors

## Filtering anchors

##  Retained 2658 anchors

## Running CCA

## Merging objects

## Finding neighborhoods

## Finding anchors

##  Found 4116 anchors

## Filtering anchors

##  Retained 3596 anchors

CombinedSeuratObject <- IntegrateData(
  anchorset = SeuratObject.anchors,
  dims = 1:30,
  new.assay.name = "cca")

## Merging dataset 3 into 4

## Extracting anchors for merged samples

## Finding integration vectors

## Finding integration vector weights

## Integrating data

## Merging dataset 2 into 4 3

## Extracting anchors for merged samples

## Finding integration vectors

## Finding integration vector weights

## Integrating data

## Merging dataset 1 into 4 3 2

## Extracting anchors for merged samples

## Finding integration vectors

## Finding integration vector weights

## Integrating data

## Warning: Adding a command log without an assay associated with it

How could you tell the batch correction procedure worked?

Scaling, PCA, tSNE and plotting.

# Linear scaling and centering (faster than poisson)
CombinedSeuratObject <- ScaleData(
  object = CombinedSeuratObject,
  assay = "cca",
#  vars.to.regress = c("nCount_RNA","nFeatures_RNA","percent_mito","S.Score"),
  model.use = "linear",
  do.scale = T,
  do.center = T)

## Centering and scaling data matrix

# pca
CombinedSeuratObject <- RunPCA(object = CombinedSeuratObject,
                               assay = "cca",
                               npcs = 100,
                               verbose = FALSE )
# tSNE
CombinedSeuratObject <- RunTSNE(object = CombinedSeuratObject,
                                reduction = "pca",
                                perplexity=30,
                                max_iter=1000,
                                theta=0.5,
                                eta=200,
                                exaggeration_factor=12,
                                dims.use = 1:50,
                                verbose = T,
                                num_threads=0)

## Read the 5931 x 5 data matrix successfully!
## OpenMP is working. 8 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.96 seconds (sparsity = 0.020564)!
## Learning embedding...
## Iteration 50: error is 91.422225 (50 iterations in 3.12 seconds)
## Iteration 100: error is 71.879686 (50 iterations in 3.31 seconds)
## Iteration 150: error is 68.618122 (50 iterations in 2.99 seconds)
## Iteration 200: error is 67.433919 (50 iterations in 3.24 seconds)
## Iteration 250: error is 66.819658 (50 iterations in 3.26 seconds)
## Iteration 300: error is 2.174736 (50 iterations in 3.41 seconds)
## Iteration 350: error is 1.787636 (50 iterations in 3.66 seconds)
## Iteration 400: error is 1.572558 (50 iterations in 3.71 seconds)
## Iteration 450: error is 1.437932 (50 iterations in 3.66 seconds)
## Iteration 500: error is 1.347387 (50 iterations in 3.69 seconds)
## Iteration 550: error is 1.284705 (50 iterations in 3.80 seconds)
## Iteration 600: error is 1.239576 (50 iterations in 3.78 seconds)
## Iteration 650: error is 1.207996 (50 iterations in 3.78 seconds)
## Iteration 700: error is 1.184813 (50 iterations in 3.84 seconds)
## Iteration 750: error is 1.169460 (50 iterations in 3.85 seconds)
## Iteration 800: error is 1.159007 (50 iterations in 3.80 seconds)
## Iteration 850: error is 1.150471 (50 iterations in 3.82 seconds)
## Iteration 900: error is 1.142919 (50 iterations in 3.95 seconds)
## Iteration 950: error is 1.136613 (50 iterations in 3.81 seconds)
## Iteration 1000: error is 1.131145 (50 iterations in 3.82 seconds)
## Fitting performed in 72.29 seconds.

# plot tSNE
DimPlot(object = CombinedSeuratObject,
        #        group.by = c("DATASET"),
        reduction = "tsne",
        dims = c(1,2),
        pt.size = .1,
        label = T,
        ncol = 3)

# plot tSNE with categories
FeaturePlot(object = CombinedSeuratObject,
            features = c("percent_mito", "nCount_RNA", "nFeature_RNA","S.Score"),
            reduction = "tsne",
            dims = c(1,2),
            order = T,
            pt.size = .1,
            ncol = 3)

After CCA samples have been well integrated. Only a cluster on covid-15 that hasnt been integrated, but it doesn’t correspond to cells with high mt-genes proportion.

But we do see a cluster of cells with high mt-genes content, that we should remove.

# CombinedSeuratObject <- CombinedSeuratObject[,CombinedSeuratObject$percent_mito < 25]

Run the whole analysis again, from combining datasets onwards.

CombinedSeuratObject <- merge(
  x = cov15.seurat,
  y = c( cov17.seurat ,
         ctl13.seurat,
         ctl14.seurat),
  add.cell.ids = c('covid_15','covid_17','control_13','control_14'))

# creating a metadata object with the group (control/covid), taken from sample name (everything before the "_")
CombinedSeuratObject$group <-sub("_.*","",CombinedSeuratObject$orig.ident)

# create seurat object of MT- genes:
CombinedSeuratObject <- PercentageFeatureSet(
  object = CombinedSeuratObject,
  pattern = "^MT-",
  assay = "RNA",
  col.name = "percent_mito")

# remove cells that have more than 25% mt genes
#rows are genes; columns are cells

CombinedSeuratObject <- CombinedSeuratObject[,CombinedSeuratObject$percent_mito < 25]

 # ribo genes
# create seurat object of ribo- genes:
CombinedSeuratObject <- PercentageFeatureSet(
  object = CombinedSeuratObject,
  pattern = "^RP[SL]",
  assay = "RNA",
  col.name = "percent_ribo")
# Feature plot of mt genes
VlnPlot(CombinedSeuratObject,
        group.by= "orig.ident",
        features = c("nFeature_RNA","nCount_RNA", "percent_mito","percent_ribo"),
        pt.size = 0.1,
        ncol = 4) + NoLegend()

#  % Gene biotype and chromossome location

library(biomaRt)
# get names of species:
#biomaRt::listDatasets(mart)[,"dataset"]
# get versin names, to use latest
#biomaRt::listEnsemblArchives()

# Retrieve human gene annotation from ENSEMBL
mart = biomaRt::useMart(
  biomart = "ensembl",
  dataset = "hsapiens_gene_ensembl",
  host = "aug2020.archive.ensembl.org")

# Retrieve the selected attributes mouse gene annotation
annot <- biomaRt::getBM(
  attributes = c(
    "external_gene_name",
    "gene_biotype",
    "chromosome_name"),
  mart = mart)

## Cache found

# Match the gene names with their respective gene biotype
item <- annot[match(rownames(CombinedSeuratObject@assays$RNA@counts) , annot[,1]),"gene_biotype"]
item[is.na(item)] <- "unknown"

# Calculate the percentage of each gene biotype
perc <- rowsum(as.matrix(CombinedSeuratObject@assays$RNA@counts) , group=item)
perc <- (t(perc)/Matrix::colSums(CombinedSeuratObject@assays$RNA@counts))
o <- order(apply(perc,2,median),decreasing = F)
perc <- perc[,o]

# Add table to the object
gene_biotype_table <- setNames(as.data.frame((perc*100)[,names(sort(table(item),decreasing = T))]),paste0("percent_",names(sort(table(item),decreasing = T))))

CombinedSeuratObject@meta.data <- CombinedSeuratObject@meta.data[,!(colnames(CombinedSeuratObject@meta.data) %in% colnames(gene_biotype_table))]

CombinedSeuratObject@meta.data <- cbind(
  CombinedSeuratObject@meta.data,
  gene_biotype_table)

#Same is done with chromosome_name

# Match the gene names with their respective chromosome location
item <- annot[match(rownames(CombinedSeuratObject@assays$RNA@counts) , annot[,1]),"chromosome_name"]
item[is.na(item)] <- "unknown"
item[! item %in% as.character(c(1:23,"X","Y","MT")) ] <- "other"

# Calculate the percentage of each gene biotype
perc <- rowsum(as.matrix(CombinedSeuratObject@assays$RNA@counts) , group=item)
perc <- (t(perc)/Matrix::colSums(CombinedSeuratObject@assays$RNA@counts))
o <- order(apply(perc,2,median),decreasing = F)
perc <- perc[,o]

# Add table to the object
gene_biotype_table <- setNames(as.data.frame((perc*100)[,names(sort(table(item),decreasing = T))]),paste0("percent_",names(sort(table(item),decreasing = T))))

CombinedSeuratObject@meta.data <- CombinedSeuratObject@meta.data[,!(colnames(CombinedSeuratObject@meta.data) %in% colnames(gene_biotype_table))]

CombinedSeuratObject@meta.data <- cbind(
  CombinedSeuratObject@meta.data,
  gene_biotype_table)


# keep only protein coding genes
dim(CombinedSeuratObject)

## [1] 33538  5479

sel <- annot[match(rownames(CombinedSeuratObject) ,
                   annot[,1]),2] == "protein_coding"
genes_use <- rownames(CombinedSeuratObject)[sel]
genes_use <- as.character(na.omit(genes_use))
CombinedSeuratObject <- CombinedSeuratObject[genes_use,]
dim(CombinedSeuratObject)

## [1] 19397  5479

# cell cycle
CombinedSeuratObject <- CellCycleScoring(
  object = CombinedSeuratObject,
  g2m.features = cc.genes$g2m.genes,
  s.features = cc.genes$s.genes)

## Warning: The following features are not present in the object: MLF1IP, not
## searching for symbol synonyms

## Warning: The following features are not present in the object: FAM64A, HN1, not
## searching for symbol synonyms

CombinedSeuratObject$G1.Score <- 1 - CombinedSeuratObject$S.Score - CombinedSeuratObject$G2M.Score

# filtering 
# Filtering out genes that are expressed in less than 3 samples, as well as cells with less than 200 genes expressed.

CombinedSeuratObject <- CombinedSeuratObject[rowSums(as.matrix(CombinedSeuratObject@assays$RNA@counts)>0)>=3 ,
                                             colSums(as.matrix(CombinedSeuratObject@assays$RNA@counts)>0)>=200]


# feature selection
CombinedSeuratObject <- FindVariableFeatures(
  object = CombinedSeuratObject,
  nfeatures = 3000,
  selection.method = "vst",
  verbose = FALSE,
  assay = "RNA")
# Normalization
#remove genes with zero variance
CombinedSeuratObject <- CombinedSeuratObject[ Matrix::rowSums(CombinedSeuratObject) > 0, ]

#remove genes with zero variance
CombinedSeuratObject <- NormalizeData(
  object = CombinedSeuratObject,
  scale.factor = 10000,
  normalization.method = "LogNormalize")

#Variable gene plot:
top20 <- head(VariableFeatures(CombinedSeuratObject), 20)
LabelPoints(plot = VariableFeaturePlot(CombinedSeuratObject), points = top20, repel = TRUE)

## When using repel, set xnudge and ynudge to 0 for optimal results

# Linear scaling and centering (faster than poisson)
CombinedSeuratObject <- ScaleData(
  object = CombinedSeuratObject,
  vars.to.regress = c("nCount_RNA","nFeatures_RNA","percent_mito","S.Score"),
  model.use = "linear",
  do.scale = T,
  do.center = T)

## Warning: Requested variables to regress not in object: nFeatures_RNA

## Regressing out nCount_RNA, percent_mito, S.Score

## Centering and scaling data matrix

# pca
CombinedSeuratObject <- RunPCA(object = CombinedSeuratObject,
                               assay = "RNA",
                               npcs = 100,
                               verbose = FALSE )
# tSNE
CombinedSeuratObject <- RunTSNE(object = CombinedSeuratObject,
                                reduction = "pca",
                                perplexity=30,
                                max_iter=1000,
                                theta=0.5,
                                eta=200,
                                exaggeration_factor=12,
                                dims.use = 1:50,
                                verbose = T,
                                num_threads=0)

## Read the 5080 x 5 data matrix successfully!
## OpenMP is working. 8 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.88 seconds (sparsity = 0.023303)!
## Learning embedding...
## Iteration 50: error is 89.855855 (50 iterations in 2.66 seconds)
## Iteration 100: error is 72.346303 (50 iterations in 2.71 seconds)
## Iteration 150: error is 69.827760 (50 iterations in 2.57 seconds)
## Iteration 200: error is 69.000762 (50 iterations in 2.51 seconds)
## Iteration 250: error is 68.377956 (50 iterations in 2.42 seconds)
## Iteration 300: error is 2.079499 (50 iterations in 2.39 seconds)
## Iteration 350: error is 1.659317 (50 iterations in 2.50 seconds)
## Iteration 400: error is 1.447837 (50 iterations in 2.45 seconds)
## Iteration 450: error is 1.320678 (50 iterations in 2.57 seconds)
## Iteration 500: error is 1.237463 (50 iterations in 2.95 seconds)
## Iteration 550: error is 1.182178 (50 iterations in 3.08 seconds)
## Iteration 600: error is 1.147065 (50 iterations in 3.14 seconds)
## Iteration 650: error is 1.124953 (50 iterations in 3.13 seconds)
## Iteration 700: error is 1.109929 (50 iterations in 3.16 seconds)
## Iteration 750: error is 1.100050 (50 iterations in 3.17 seconds)
## Iteration 800: error is 1.091429 (50 iterations in 3.18 seconds)
## Iteration 850: error is 1.083330 (50 iterations in 3.23 seconds)
## Iteration 900: error is 1.075522 (50 iterations in 3.18 seconds)
## Iteration 950: error is 1.068569 (50 iterations in 3.21 seconds)
## Iteration 1000: error is 1.063398 (50 iterations in 3.17 seconds)
## Fitting performed in 57.37 seconds.

# plot tSNE
DimPlot(object = CombinedSeuratObject,
        #        group.by = c("DATASET"),
        reduction = "tsne",
        dims = c(1,2),
        pt.size = .1,
        label = T,
        ncol = 3)

# uMAP
CombinedSeuratObject <- RunUMAP(object = CombinedSeuratObject,
                                reduction = "pca",
                                dims = 1:50,
                                n.components = 2,
                                n.neighbors = 20,
                                spread = .3,
                                repulsion.strength = 1,
                                min.dist= .001,
                                verbose = T,
                                num_threads=0,
                                n.epochs = 200,
                                metric = "euclidean",
                                seed.use = 42,
                                reduction.name="umap")

## Warning: The following arguments are not used: num_threads

## 09:13:26 UMAP embedding parameters a = 12.99 b = 0.7939

## 09:13:26 Read 5080 rows and found 50 numeric columns

## 09:13:26 Using Annoy for neighbor search, n_neighbors = 20

## 09:13:26 Building Annoy index with metric = euclidean, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 09:13:26 Writing NN index file to temp file /tmp/RtmpygNzsq/file15595f11491a
## 09:13:26 Searching Annoy index using 1 thread, search_k = 2000
## 09:13:27 Annoy recall = 100%
## 09:13:27 Commencing smooth kNN distance calibration using 1 thread
## 09:13:28 Initializing from normalized Laplacian + noise
## 09:13:28 Commencing optimization for 200 epochs, with 143780 positive edges
## 09:13:30 Optimization finished

# plot UMAP
DimPlot(object = CombinedSeuratObject,
        #        group.by = c("DATASET"),
        reduction = "umap",
        dims = c(1,2),
        pt.size = .1,
        label = T,
        ncol = 3)

# plot tSNE
FeaturePlot(object = CombinedSeuratObject,
            features = c("percent_mito", "nCount_RNA", "nFeature_RNA"),
            reduction = "tsne",
            dims = c(1,2),
            order = T,
            pt.size = .1,
            ncol = 3)

# plot umap
FeaturePlot(object = CombinedSeuratObject,
            features = c("percent_mito", "nCount_RNA", "nFeature_RNA"),
            reduction = "umap",
            dims = c(1,2),
            order = T,
            pt.size = .1,
            ncol = 3)

# remove batch with cca
SeuratObject.list <- SplitObject(
  object = CombinedSeuratObject,
  split.by = "orig.ident")

SeuratObject.anchors <- FindIntegrationAnchors(
  object.list = SeuratObject.list,
  dims = 1:30)

## Computing 2000 integration features
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 2926 anchors
## Filtering anchors
##  Retained 2149 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 3090 anchors
## Filtering anchors
##  Retained 1983 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 3408 anchors
## Filtering anchors
##  Retained 2469 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 2957 anchors
## Filtering anchors
##  Retained 2099 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 3333 anchors
## Filtering anchors
##  Retained 2431 anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 3523 anchors
## Filtering anchors
##  Retained 3046 anchors

CombinedSeuratObject <- IntegrateData(
  anchorset = SeuratObject.anchors,
  dims = 1:30,
  new.assay.name = "cca")

## Merging dataset 4 into 3
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 1 into 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
## Merging dataset 2 1 into 3 4
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data

## Warning: Adding a command log without an assay associated with it

# Linear scaling and centering (faster than poisson)
CombinedSeuratObject <- ScaleData(
  object = CombinedSeuratObject,
  assay = "cca",
#  vars.to.regress = c("nCount_RNA","nFeatures_RNA","percent_mito","S.Score"),
  model.use = "linear",
  do.scale = T,
  do.center = T)

## Centering and scaling data matrix

# pca
CombinedSeuratObject <- RunPCA(object = CombinedSeuratObject,
                               assay = "cca",
                               npcs = 100,
                               verbose = FALSE )
# tSNE
CombinedSeuratObject <- RunTSNE(object = CombinedSeuratObject,
                                reduction = "pca",
                                perplexity=30,
                                max_iter=1000,
                                theta=0.5,
                                eta=200,
                                exaggeration_factor=12,
                                dims.use = 1:50,
                                verbose = T,
                                num_threads=0)

## Read the 5080 x 5 data matrix successfully!
## OpenMP is working. 8 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.85 seconds (sparsity = 0.023741)!
## Learning embedding...
## Iteration 50: error is 89.616038 (50 iterations in 3.39 seconds)
## Iteration 100: error is 71.253908 (50 iterations in 2.86 seconds)
## Iteration 150: error is 68.214660 (50 iterations in 2.72 seconds)
## Iteration 200: error is 67.097111 (50 iterations in 2.83 seconds)
## Iteration 250: error is 66.572063 (50 iterations in 3.35 seconds)
## Iteration 300: error is 2.037887 (50 iterations in 3.08 seconds)
## Iteration 350: error is 1.651030 (50 iterations in 3.03 seconds)
## Iteration 400: error is 1.452680 (50 iterations in 3.07 seconds)
## Iteration 450: error is 1.331552 (50 iterations in 3.09 seconds)
## Iteration 500: error is 1.255069 (50 iterations in 3.10 seconds)
## Iteration 550: error is 1.202225 (50 iterations in 3.15 seconds)
## Iteration 600: error is 1.166650 (50 iterations in 3.18 seconds)
## Iteration 650: error is 1.143866 (50 iterations in 3.21 seconds)
## Iteration 700: error is 1.126984 (50 iterations in 3.33 seconds)
## Iteration 750: error is 1.113643 (50 iterations in 3.50 seconds)
## Iteration 800: error is 1.103519 (50 iterations in 3.22 seconds)
## Iteration 850: error is 1.095353 (50 iterations in 3.29 seconds)
## Iteration 900: error is 1.088870 (50 iterations in 3.41 seconds)
## Iteration 950: error is 1.082467 (50 iterations in 3.21 seconds)
## Iteration 1000: error is 1.076996 (50 iterations in 4.37 seconds)
## Fitting performed in 64.38 seconds.

# plot tSNE
DimPlot(object = CombinedSeuratObject,
        #        group.by = c("DATASET"),
        reduction = "tsne",
        dims = c(1,2),
        pt.size = .1,
        label = T,
        ncol = 3)

# plot tSNE with categories
FeaturePlot(object = CombinedSeuratObject,
            features = c("percent_mito", "nCount_RNA", "nFeature_RNA","S.Score"),
            reduction = "tsne",
            dims = c(1,2),
            order = T,
            pt.size = .1,
            ncol = 3)

Where in the Seurat object is your integrated data stored?

# CombinedSeuratObject@assays$cca

After batch correction, do you have corrected data matrix (genes x samples) or a matrix in a reduction embedding (samples x dimensions).

CCA: corrected matrix.

Scanorama and harmony: reduction embedding.

Where should you put the embedding results?

In CCA, we create a new assay (see above), but the embeddings (PCA) need to be calculated

But in scanorama and harmony, it creates already the embeddings (PCA), so pca doesnt need to run again, one can directly plot UMAP/tSNE.

Visualize your results using the new matricies.

See above.

Describe in form of text the rational for this step in your markdown report

We integrate the data to remove batch effects.

Milestone 4

Cell Clustering

Consult the Glossary or additional sources for help
What is a graph?

Representation of data in a simplified way.

Which kind of graph is the most robust to represent your data?

SNN, because it removes spurious connections.

## SNN
CombinedSeuratObject <- FindNeighbors(CombinedSeuratObject,
                              assay = "cca",
                              compute.SNN = T,
                              reduction = "pca" ,
                              dims = 1:50,
                              graph.name="SNN",
                              prune.SNN = 1/15,
                              k.param = 20,
                              force.recalc = T)

## Computing nearest neighbor graph

## Computing SNN

# Louvain
CombinedSeuratObject <- FindClusters(
  object = CombinedSeuratObject,
  graph.name = "SNN",
  resolution = 0.2, # to get 8 clusters
  algorithm = 1) #algorithim 1 = Louvain

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 5080
## Number of edges: 52597
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9416
## Number of communities: 16
## Elapsed time: 0 seconds

## 8 singletons identified. 8 final clusters.

DimPlot(object = CombinedSeuratObject,
        group.by = c("seurat_clusters"),
        reduction = "tsne",
        dims = c(1,2),
        pt.size = .1,
        label = T,
        ncol = 3)

Where in the Seurat object is your graph stored?

# CombinedSeuratObject@graphs$SNN

Why do we need to cluster our cells?

Ti identify the cell types.

Which parameters have you chosen when clustering?

We choose a resolution (0.2) that gives us sufficient number of clusters to distinguish the main blood cell types (see also later the identification of marker genes).

We choose Louvain because we don’t have the package for Leiden.

How can you tell which clustering resolution is best?

By trying and see what makes sense from biological point of view. We could create more clusters but they have the same expression profile.

Do the clustering reflect the cell separation seen by the visualization method you are using?

Yes.

How are your clusters distributed across the samples, groups, experimental conditions, etc.?

# comparison of clusters. Are the clusters homogeneously distributed among samples.
comparison_table <- table(list(
  CombinedSeuratObject@meta.data[,"orig.ident"],
  CombinedSeuratObject@meta.data[,"seurat_clusters"]
))
#Transform data to percentages:
comparison_table <- t(t(comparison_table)/colSums(comparison_table))*100
#Barplot
barplot(comparison_table,
        col=scales::hue_pal()(nrow(comparison_table))[1:nrow(comparison_table)],
        border=NA,
        las=2, legend = T)

# comparison of clusters. Are the clusters homogeneously distributed among samples.
comparison_table <- table(list(
  CombinedSeuratObject@meta.data[,"group"],
  CombinedSeuratObject@meta.data[,"seurat_clusters"]
))
#Transform data to percentages:
comparison_table <- t(t(comparison_table)/colSums(comparison_table))*100
#Barplot
barplot(comparison_table,
        col=scales::hue_pal()(nrow(comparison_table))[1:nrow(comparison_table)],
        border=NA,
        las=2, legend = T)

We see a few clusters that have mostly covid samples.

Where in the Seurat object is your clustering data stored?

# CombinedSeuratObject@meta.data$seurat_clusters

Describe in form of text the rationale for this step in your markdown report

To identify cell clusters.

Milestone 5

Differential expression

Consult the Glossary or additional sources for help
Which biological question(s) do you want to answer with differential expression?

Detect genes that have higher expression in one cluster of cells compared to another (find cluster-specific markers):

#Find markers:

markers <- FindAllMarkers( object = CombinedSeuratObject,
                           assay = "cca",
                           logfc.threshold = 0.25,
                           test.use = "wilcox",
                           slot = "data",
                           min.pct = 0.1,
                           min.diff.pct = -Inf,
                           only.pos = FALSE,
                           max.cells.per.ident = 51,
                           latent.vars = NULL,
                           min.cells.feature = 3,
                           min.cells.group = 3,
                           pseudocount.use = 1,
                           return.thresh = 0.01 )

## Calculating cluster 0

## Calculating cluster 1

## Calculating cluster 2

## Calculating cluster 3

## Calculating cluster 4

## Calculating cluster 5

## Calculating cluster 6

## Calculating cluster 7

#plot
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:biomaRt':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

top5 <- markers %>% group_by(cluster) %>% top_n(-5, p_val_adj)

DoHeatmap(object = CombinedSeuratObject,
          features = as.character(unique(top5$gene)),
          group.by = "seurat_clusters",
          assay = "cca")

which genes are differentially expressed within a cluster between covid vs. control, code below.

Are you interested in comparing all cells or using a specific cluster?

Specific cluster.

If you are interested in a particular cluster, which cluster, why?

We are interested in clusters that have DEG between conditions. Those are cluster 0 and 6.

After trying all 8 clusters, these 2 show DEG between covid and control samples.

# DEG within a cluster.
cell_selection <- CombinedSeuratObject[,CombinedSeuratObject$seurat_clusters == 0]
cell_selection <- SetIdent(cell_selection, value = "group")

# Compute differentiall expression
DGE_cell_selection <- FindAllMarkers(
  object = cell_selection,
  assay = "RNA",
  logfc.threshold = 0.2,
  test.use = "wilcox",
  min.pct = 0.1,
  min.diff.pct = -Inf,
  only.pos = FALSE,
  max.cells.per.ident = 51, # number cells of smallest cluster
  latent.vars = NULL,
  min.cells.feature = 3,
  min.cells.group = 3,
  pseudocount.use = 1,
  return.thresh = 0.01 )

## Calculating cluster covid

## Calculating cluster control

# plot top 5 genes
top5dg4 <- DGE_cell_selection %>% group_by(cluster) %>% top_n(-5, p_val_adj)
# heatmap
DoHeatmap(object = cell_selection,
          features = as.character(unique(top5dg4$gene)),
          group.by = "group",
          assay = "RNA",slot = "data")

# DEG within a cluster.
cell_selection <- CombinedSeuratObject[,CombinedSeuratObject$seurat_clusters == 6]
cell_selection <- SetIdent(cell_selection, value = "group")

# Compute differentiall expression
DGE_cell_selection <- FindAllMarkers(
  object = cell_selection,
  assay = "RNA",
  logfc.threshold = 0.2,
  test.use = "wilcox",
  min.pct = 0.1,
  min.diff.pct = -Inf,
  only.pos = FALSE,
  max.cells.per.ident = 51, # number cells of smallest cluster
  latent.vars = NULL,
  min.cells.feature = 3,
  min.cells.group = 3,
  pseudocount.use = 1,
  return.thresh = 0.01 )

## Calculating cluster covid
## Calculating cluster control

# plot top 5 genes
top5dg4 <- DGE_cell_selection %>% group_by(cluster) %>% top_n(-5, p_val_adj)
# heatmap
DoHeatmap(object = cell_selection,
          features = as.character(unique(top5dg4$gene)),
          group.by = "group",
          assay = "RNA",slot = "data")

From cluster 6, we have IFI6 and IFI27, two genes related to Interferon alpha signalling and innate immune response. These two are downregulated in covid patient samples (they could be interesting for follow-up studies)

Which clustering resolution would you run your differential expression?

Clustering with resolution = 0.2 in Louvain to get 8 clusters that are representative of one cell type.

Which test did you choose for differential expression?

Wilcox

What parameters did you set for computing differential expression? Justify each one

logfc.threshold = 0.2, # we will detect small differences (2^0.2= 1.15) as a first step; and later one could find stronger changes.

min.pct = 0.1, # means the gene should be expressed in at least 10% of cells.

min.diff.pct = -Inf, # to consider all

only.pos = FALSE, # for up and down-regulated

max.cells.per.ident = 51, # number cells of smallest cluster, so it’s scaled down to the smallest cluster.

latent.vars = NULL, # we don’t want to correct confounding factors, we already did.

min.cells.feature = 3, # we want genes that are expressed in at least 3 cells (we did this already before)

min.cells.group = 3, # Every group should have at least 3 cells.

return.thresh = 0.01 # significance threshold to consider.
Which marker genes can separate each of the cell clusters in your data?

We plot known marker genes for each blood cell type, to try to know which cells identify each cluster. This will help later to know in which cells we have DEG.

#plotting clusters
DimPlot(object = CombinedSeuratObject,
        group.by = c("seurat_clusters"),
        reduction = "tsne",
        dims = c(1,2),
        pt.size = .1,
        label = T,
        ncol = 3)

# plotting known marker genes for different cell types
FeaturePlot(object = CombinedSeuratObject,
            features = c("CD4","CD8B","CD3G","CD14","HBA1","MS4A1"),
            reduction = "tsne",
            dims = c(1,2),
            order = T,
            pt.size = .1,
            ncol = 3)

With this we see that the clusters 0 and 6 having DEG between covid and control correspond to CD4 and monocytes cells.

Which cell types do they represent?

“CD4”,“CD8B”,“CD3G”,“CD14”,“HBA1”,“MS4A1”, representing CD4, CD8, NK, monocytes, red blood cells, B cells, respectively.

How would you visualize the list of differentially expressed genes?

Heatmap.

Describe in form of text the rationale for this step in your markdown report

We identify DEG between conditions and we associate them to the cell type using known marker genes.

Functional enrichment analysis

# from cluster 6 (CD4 T cells)


# functional enrichment 
library(enrichR)

## Welcome to enrichR
## Checking connection ... Connection is Live!

# Check available databases to perform enrichment (then choose one)
enrichR::listEnrichrDbs()

##     geneCoverage genesPerTerm                                       libraryName
## 1          13362          275                               Genome_Browser_PWMs
## 2          27884         1284                          TRANSFAC_and_JASPAR_PWMs
## 3           6002           77                         Transcription_Factor_PPIs
## 4          47172         1370                                         ChEA_2013
## 5          47107          509                  Drug_Perturbations_from_GEO_2014
## 6          21493         3713                           ENCODE_TF_ChIP-seq_2014
## 7           1295           18                                     BioCarta_2013
## 8           3185           73                                     Reactome_2013
## 9           2854           34                                 WikiPathways_2013
## 10         15057          300               Disease_Signatures_from_GEO_up_2014
## 11          4128           48                                         KEGG_2013
## 12         34061          641                        TF-LOF_Expression_from_GEO
## 13          7504          155                               TargetScan_microRNA
## 14         16399          247                                  PPI_Hub_Proteins
## 15         12753           57                        GO_Molecular_Function_2015
## 16         23726          127                                         GeneSigDB
## 17         32740           85                               Chromosome_Location
## 18         13373          258                                  Human_Gene_Atlas
## 19         19270          388                                  Mouse_Gene_Atlas
## 20         13236           82                        GO_Cellular_Component_2015
## 21         14264           58                        GO_Biological_Process_2015
## 22          3096           31                          Human_Phenotype_Ontology
## 23         22288         4368                   Epigenomics_Roadmap_HM_ChIP-seq
## 24          4533           37                                          KEA_2013
## 25         10231          158                 NURSA_Human_Endogenous_Complexome
## 26          2741            5                                             CORUM
## 27          5655          342                           SILAC_Phosphoproteomics
## 28         10406          715                   MGI_Mammalian_Phenotype_Level_3
## 29         10493          200                   MGI_Mammalian_Phenotype_Level_4
## 30         11251          100                                       Old_CMAP_up
## 31          8695          100                                     Old_CMAP_down
## 32          1759           25                                      OMIM_Disease
## 33          2178           89                                     OMIM_Expanded
## 34           851           15                                         VirusMINT
## 35         10061          106                              MSigDB_Computational
## 36         11250          166                       MSigDB_Oncogenic_Signatures
## 37         15406          300             Disease_Signatures_from_GEO_down_2014
## 38         17711          300                   Virus_Perturbations_from_GEO_up
## 39         17576          300                 Virus_Perturbations_from_GEO_down
## 40         15797          176                     Cancer_Cell_Line_Encyclopedia
## 41         12232          343                          NCI-60_Cancer_Cell_Lines
## 42         13572          301       Tissue_Protein_Expression_from_ProteomicsDB
## 43          6454          301 Tissue_Protein_Expression_from_Human_Proteome_Map
## 44          3723           47                                  HMDB_Metabolites
## 45          7588           35                             Pfam_InterPro_Domains
## 46          7682           78                        GO_Biological_Process_2013
## 47          7324          172                        GO_Cellular_Component_2013
## 48          8469          122                        GO_Molecular_Function_2013
## 49         13121          305                              Allen_Brain_Atlas_up
## 50         26382         1811                           ENCODE_TF_ChIP-seq_2015
## 51         29065         2123                 ENCODE_Histone_Modifications_2015
## 52           280            9                 Phosphatase_Substrates_from_DEPOD
## 53         13877          304                            Allen_Brain_Atlas_down
## 54         15852          912                 ENCODE_Histone_Modifications_2013
## 55          4320          129                         Achilles_fitness_increase
## 56          4271          128                         Achilles_fitness_decrease
## 57         10496          201                      MGI_Mammalian_Phenotype_2013
## 58          1678           21                                     BioCarta_2015
## 59           756           12                                     HumanCyc_2015
## 60          3800           48                                         KEGG_2015
## 61          2541           39                                   NCI-Nature_2015
## 62          1918           39                                      Panther_2015
## 63          5863           51                                 WikiPathways_2015
## 64          6768           47                                     Reactome_2015
## 65         25651          807                                            ESCAPE
## 66         19129         1594                                        HomoloGene
## 67         23939          293               Disease_Perturbations_from_GEO_down
## 68         23561          307                 Disease_Perturbations_from_GEO_up
## 69         23877          302                  Drug_Perturbations_from_GEO_down
## 70         15886            9                  Genes_Associated_with_NIH_Grants
## 71         24350          299                    Drug_Perturbations_from_GEO_up
## 72          3102           25                                          KEA_2015
## 73         31132          298                    Gene_Perturbations_from_GEO_up
## 74         30832          302                  Gene_Perturbations_from_GEO_down
## 75         48230         1429                                         ChEA_2015
## 76          5613           36                                             dbGaP
## 77          9559           73                          LINCS_L1000_Chem_Pert_up
## 78          9448           63                        LINCS_L1000_Chem_Pert_down
## 79         16725         1443  GTEx_Tissue_Sample_Gene_Expression_Profiles_down
## 80         19249         1443    GTEx_Tissue_Sample_Gene_Expression_Profiles_up
## 81         15090          282                Ligand_Perturbations_from_GEO_down
## 82         16129          292                 Aging_Perturbations_from_GEO_down
## 83         15309          308                   Aging_Perturbations_from_GEO_up
## 84         15103          318                  Ligand_Perturbations_from_GEO_up
## 85         15022          290                  MCF7_Perturbations_from_GEO_down
## 86         15676          310                    MCF7_Perturbations_from_GEO_up
## 87         15854          279               Microbe_Perturbations_from_GEO_down
## 88         15015          321                 Microbe_Perturbations_from_GEO_up
## 89          3788          159             LINCS_L1000_Ligand_Perturbations_down
## 90          3357          153               LINCS_L1000_Ligand_Perturbations_up
## 91         12668          300          L1000_Kinase_and_GPCR_Perturbations_down
## 92         12638          300            L1000_Kinase_and_GPCR_Perturbations_up
## 93          8973           64                                     Reactome_2016
## 94          7010           87                                         KEGG_2016
## 95          5966           51                                 WikiPathways_2016
## 96         15562          887         ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X
## 97         17850          300                Kinase_Perturbations_from_GEO_down
## 98         17660          300                  Kinase_Perturbations_from_GEO_up
## 99          1348           19                                     BioCarta_2016
## 100          934           13                                     HumanCyc_2016
## 101         2541           39                                   NCI-Nature_2016
## 102         2041           42                                      Panther_2016
## 103         5209          300                                        DrugMatrix
## 104        49238         1550                                         ChEA_2016
## 105         2243           19                                             huMAP
## 106        19586          545                                    Jensen_TISSUES
## 107        22440          505 RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO
## 108         8184           24                      MGI_Mammalian_Phenotype_2017
## 109        18329          161                               Jensen_COMPARTMENTS
## 110        15755           28                                   Jensen_DISEASES
## 111        10271           22                                      BioPlex_2017
## 112        10427           38                        GO_Cellular_Component_2017
## 113        10601           25                        GO_Molecular_Function_2017
## 114        13822           21                        GO_Biological_Process_2017
## 115         8002          143                       GO_Cellular_Component_2017b
## 116        10089           45                       GO_Molecular_Function_2017b
## 117        13247           49                       GO_Biological_Process_2017b
## 118        21809         2316                                    ARCHS4_Tissues
## 119        23601         2395                                 ARCHS4_Cell-lines
## 120        20883          299                                  ARCHS4_IDG_Coexp
## 121        19612          299                              ARCHS4_Kinases_Coexp
## 122        25983          299                                  ARCHS4_TFs_Coexp
## 123        19500          137                           SysMyo_Muscle_Gene_Sets
## 124        14893          128                                   miRTarBase_2017
## 125        17598         1208                          TargetScan_microRNA_2017
## 126         5902          109              Enrichr_Libraries_Most_Popular_Genes
## 127        12486          299           Enrichr_Submissions_TF-Gene_Coocurrence
## 128         1073          100        Data_Acquisition_Method_Most_Popular_Genes
## 129        19513          117                                            DSigDB
## 130        14433           36                        GO_Biological_Process_2018
## 131         8655           61                        GO_Cellular_Component_2018
## 132        11459           39                        GO_Molecular_Function_2018
## 133        19741          270           TF_Perturbations_Followed_by_Expression
## 134        27360          802                          Chromosome_Location_hg19
## 135        13072           26                 NIH_Funded_PIs_2017_Human_GeneRIF
## 136        13464           45                 NIH_Funded_PIs_2017_Human_AutoRIF
## 137        13787          200          Rare_Diseases_AutoRIF_ARCHS4_Predictions
## 138        13929          200          Rare_Diseases_GeneRIF_ARCHS4_Predictions
## 139        16964          200    NIH_Funded_PIs_2017_AutoRIF_ARCHS4_Predictions
## 140        17258          200    NIH_Funded_PIs_2017_GeneRIF_ARCHS4_Predictions
## 141        10352           58                  Rare_Diseases_GeneRIF_Gene_Lists
## 142        10471           76                  Rare_Diseases_AutoRIF_Gene_Lists
## 143        12419          491                                   SubCell_BarCode
## 144        19378           37                                 GWAS_Catalog_2019
## 145         6201           45                           WikiPathways_2019_Human
## 146         4558           54                           WikiPathways_2019_Mouse
## 147         3264           22                 TRRUST_Transcription_Factors_2019
## 148         7802           92                                   KEGG_2019_Human
## 149         8551           98                                   KEGG_2019_Mouse
## 150        12444           23                             InterPro_Domains_2019
## 151         9000           20                                 Pfam_Domains_2019
## 152         7744          363     DepMap_WG_CRISPR_Screens_Broad_CellLines_2019
## 153         6204          387    DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019
## 154        13420           32              MGI_Mammalian_Phenotype_Level_4_2019
## 155        14148          122                                UK_Biobank_GWAS_v1
## 156         9813           49                                    BioPlanet_2019
## 157         1397           13                                      ClinVar_2019
## 158         9116           22                                       PheWeb_2019
## 159        17464           63                                          DisGeNET
## 160          394           73                              HMS_LINCS_KinomeScan
## 161        11851          586                              CCLE_Proteomics_2020
## 162         8189          421                                 ProteomicsDB_2020
## 163        18704          100                       lncHUB_lncRNA_Co-Expression
## 164         5605           39                     Virus-Host_PPI_P-HIPSTer_2020
## 165         5718           31                       Elsevier_Pathway_Collection
## 166        14156           40                    Table_Mining_of_CRISPR_Studies
## 167        16979          295                        COVID-19_Related_Gene_Sets
##                                                                          link
## 1                    http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/
## 2                                    http://jaspar.genereg.net/html/DOWNLOAD/
## 3                                                                            
## 4                              http://amp.pharm.mssm.edu/lib/cheadownload.jsp
## 5                                            http://www.ncbi.nlm.nih.gov/geo/
## 6                                http://genome.ucsc.edu/ENCODE/downloads.html
## 7                         https://cgap.nci.nih.gov/Pathways/BioCarta_Pathways
## 8                                 http://www.reactome.org/download/index.html
## 9                     http://www.wikipathways.org/index.php/Download_Pathways
## 10                                           http://www.ncbi.nlm.nih.gov/geo/
## 11                                          http://www.kegg.jp/kegg/download/
## 12                                           http://www.ncbi.nlm.nih.gov/geo/
## 13  http://www.targetscan.org/cgi-bin/targetscan/data_download.cgi?db=vert_61
## 14                                              http://amp.pharm.mssm.edu/X2K
## 15                 http://www.geneontology.org/GO.downloads.annotations.shtml
## 16                             http://genesigdb.org/genesigdb/downloadall.jsp
## 17                   http://software.broadinstitute.org/gsea/msigdb/index.jsp
## 18                                               http://biogps.org/downloads/
## 19                                               http://biogps.org/downloads/
## 20                 http://www.geneontology.org/GO.downloads.annotations.shtml
## 21                 http://www.geneontology.org/GO.downloads.annotations.shtml
## 22                                   http://www.human-phenotype-ontology.org/
## 23                                         http://www.roadmapepigenomics.org/
## 24                           http://amp.pharm.mssm.edu/lib/keacommandline.jsp
## 25                                      https://www.nursa.org/nursa/index.jsf
## 26                        http://mips.helmholtz-muenchen.de/genre/proj/corum/
## 27                           http://amp.pharm.mssm.edu/lib/keacommandline.jsp
## 28                                            http://www.informatics.jax.org/
## 29                                            http://www.informatics.jax.org/
## 30                                        http://www.broadinstitute.org/cmap/
## 31                                        http://www.broadinstitute.org/cmap/
## 32                                              http://www.omim.org/downloads
## 33                                              http://www.omim.org/downloads
## 34                                  http://mint.bio.uniroma2.it/download.html
## 35                  http://www.broadinstitute.org/gsea/msigdb/collections.jsp
## 36                  http://www.broadinstitute.org/gsea/msigdb/collections.jsp
## 37                                           http://www.ncbi.nlm.nih.gov/geo/
## 38                                           http://www.ncbi.nlm.nih.gov/geo/
## 39                                           http://www.ncbi.nlm.nih.gov/geo/
## 40                             https://portals.broadinstitute.org/ccle/home\n
## 41                                               http://biogps.org/downloads/
## 42                                              https://www.proteomicsdb.org/
## 43                                  http://www.humanproteomemap.org/index.php
## 44                                               http://www.hmdb.ca/downloads
## 45                                ftp://ftp.ebi.ac.uk/pub/databases/interpro/
## 46                 http://www.geneontology.org/GO.downloads.annotations.shtml
## 47                 http://www.geneontology.org/GO.downloads.annotations.shtml
## 48                 http://www.geneontology.org/GO.downloads.annotations.shtml
## 49                                                  http://www.brain-map.org/
## 50                               http://genome.ucsc.edu/ENCODE/downloads.html
## 51                               http://genome.ucsc.edu/ENCODE/downloads.html
## 52                                            http://www.koehn.embl.de/depod/
## 53                                                  http://www.brain-map.org/
## 54                               http://genome.ucsc.edu/ENCODE/downloads.html
## 55                                     http://www.broadinstitute.org/achilles
## 56                                     http://www.broadinstitute.org/achilles
## 57                                            http://www.informatics.jax.org/
## 58                        https://cgap.nci.nih.gov/Pathways/BioCarta_Pathways
## 59                                                       http://humancyc.org/
## 60                                          http://www.kegg.jp/kegg/download/
## 61                                                    http://pid.nci.nih.gov/
## 62                                                  http://www.pantherdb.org/
## 63                    http://www.wikipathways.org/index.php/Download_Pathways
## 64                                http://www.reactome.org/download/index.html
## 65                                           http://www.maayanlab.net/ESCAPE/
## 66                                     http://www.ncbi.nlm.nih.gov/homologene
## 67                                           http://www.ncbi.nlm.nih.gov/geo/
## 68                                           http://www.ncbi.nlm.nih.gov/geo/
## 69                                           http://www.ncbi.nlm.nih.gov/geo/
## 70                                    https://grants.nih.gov/grants/oer.htm\n
## 71                                           http://www.ncbi.nlm.nih.gov/geo/
## 72                                          http://amp.pharm.mssm.edu/Enrichr
## 73                                           http://www.ncbi.nlm.nih.gov/geo/
## 74                                           http://www.ncbi.nlm.nih.gov/geo/
## 75                                          http://amp.pharm.mssm.edu/Enrichr
## 76                                            http://www.ncbi.nlm.nih.gov/gap
## 77                                                           https://clue.io/
## 78                                                           https://clue.io/
## 79                                                 http://www.gtexportal.org/
## 80                                                 http://www.gtexportal.org/
## 81                                           http://www.ncbi.nlm.nih.gov/geo/
## 82                                           http://www.ncbi.nlm.nih.gov/geo/
## 83                                           http://www.ncbi.nlm.nih.gov/geo/
## 84                                           http://www.ncbi.nlm.nih.gov/geo/
## 85                                           http://www.ncbi.nlm.nih.gov/geo/
## 86                                           http://www.ncbi.nlm.nih.gov/geo/
## 87                                           http://www.ncbi.nlm.nih.gov/geo/
## 88                                           http://www.ncbi.nlm.nih.gov/geo/
## 89                                                           https://clue.io/
## 90                                                           https://clue.io/
## 91                                                           https://clue.io/
## 92                                                           https://clue.io/
## 93                                http://www.reactome.org/download/index.html
## 94                                          http://www.kegg.jp/kegg/download/
## 95                    http://www.wikipathways.org/index.php/Download_Pathways
## 96                                                                           
## 97                                           http://www.ncbi.nlm.nih.gov/geo/
## 98                                           http://www.ncbi.nlm.nih.gov/geo/
## 99                         http://cgap.nci.nih.gov/Pathways/BioCarta_Pathways
## 100                                                      http://humancyc.org/
## 101                                                   http://pid.nci.nih.gov/
## 102                                         http://www.pantherdb.org/pathway/
## 103                                     https://ntp.niehs.nih.gov/drugmatrix/
## 104                                         http://amp.pharm.mssm.edu/Enrichr
## 105                                              http://proteincomplexes.org/
## 106                                             http://tissues.jensenlab.org/
## 107                                          http://www.ncbi.nlm.nih.gov/geo/
## 108                                           http://www.informatics.jax.org/
## 109                                        http://compartments.jensenlab.org/
## 110                                            http://diseases.jensenlab.org/
## 111                                           http://bioplex.hms.harvard.edu/
## 112                                              http://www.geneontology.org/
## 113                                              http://www.geneontology.org/
## 114                                              http://www.geneontology.org/
## 115                                              http://www.geneontology.org/
## 116                                              http://www.geneontology.org/
## 117                                              http://www.geneontology.org/
## 118                                          http://amp.pharm.mssm.edu/archs4
## 119                                          http://amp.pharm.mssm.edu/archs4
## 120                                          http://amp.pharm.mssm.edu/archs4
## 121                                          http://amp.pharm.mssm.edu/archs4
## 122                                          http://amp.pharm.mssm.edu/archs4
## 123                                               http://sys-myo.rhcloud.com/
## 124                                        http://mirtarbase.mbc.nctu.edu.tw/
## 125                                                http://www.targetscan.org/
## 126                                         http://amp.pharm.mssm.edu/Enrichr
## 127                                         http://amp.pharm.mssm.edu/Enrichr
## 128                                         http://amp.pharm.mssm.edu/Enrichr
## 129                             http://tanlab.ucdenver.edu/DSigDB/DSigDBv1.0/
## 130                                              http://www.geneontology.org/
## 131                                              http://www.geneontology.org/
## 132                                              http://www.geneontology.org/
## 133                                          http://www.ncbi.nlm.nih.gov/geo/
## 134                             http://hgdownload.cse.ucsc.edu/downloads.html
## 135                                      https://www.ncbi.nlm.nih.gov/pubmed/
## 136                                      https://www.ncbi.nlm.nih.gov/pubmed/
## 137                                      https://amp.pharm.mssm.edu/geneshot/
## 138                           https://www.ncbi.nlm.nih.gov/gene/about-generif
## 139                                      https://www.ncbi.nlm.nih.gov/pubmed/
## 140                                      https://www.ncbi.nlm.nih.gov/pubmed/
## 141                           https://www.ncbi.nlm.nih.gov/gene/about-generif
## 142                                      https://amp.pharm.mssm.edu/geneshot/
## 143                                            http://www.subcellbarcode.org/
## 144                                                https://www.ebi.ac.uk/gwas
## 145                                             https://www.wikipathways.org/
## 146                                             https://www.wikipathways.org/
## 147                                          https://www.grnpedia.org/trrust/
## 148                                                      https://www.kegg.jp/
## 149                                                      https://www.kegg.jp/
## 150                                           https://www.ebi.ac.uk/interpro/
## 151                                                    https://pfam.xfam.org/
## 152                                                       https://depmap.org/
## 153                                                       https://depmap.org/
## 154                                           http://www.informatics.jax.org/
## 155                                     https://www.ukbiobank.ac.uk/tag/gwas/
## 156                                         https://tripod.nih.gov/bioplanet/
## 157                                     https://www.ncbi.nlm.nih.gov/clinvar/
## 158                                              http://pheweb.sph.umich.edu/
## 159                                                  https://www.disgenet.org
## 160                                  http://lincs.hms.harvard.edu/kinomescan/
## 161                                   https://portals.broadinstitute.org/ccle
## 162                                             https://www.proteomicsdb.org/
## 163                                        https://amp.pharm.mssm.edu/lnchub/
## 164                                                      http://phipster.org/
## 165                                 http://www.transgene.ru/disease-pathways/
## 166                                                                          
## 167                                        https://amp.pharm.mssm.edu/covid19
##     numTerms
## 1        615
## 2        326
## 3        290
## 4        353
## 5        701
## 6        498
## 7        249
## 8         78
## 9        199
## 10       142
## 11       200
## 12       269
## 13       222
## 14       385
## 15      1136
## 16      2139
## 17       386
## 18        84
## 19        96
## 20       641
## 21      5192
## 22      1779
## 23       383
## 24       474
## 25      1796
## 26      1658
## 27        84
## 28        71
## 29       476
## 30      6100
## 31      6100
## 32        90
## 33       187
## 34        85
## 35       858
## 36       189
## 37       142
## 38       323
## 39       323
## 40       967
## 41        93
## 42       207
## 43        30
## 44      3906
## 45       311
## 46       941
## 47       205
## 48       402
## 49      2192
## 50       816
## 51       412
## 52        59
## 53      2192
## 54       109
## 55       216
## 56       216
## 57       476
## 58       239
## 59       125
## 60       179
## 61       209
## 62       104
## 63       404
## 64      1389
## 65       315
## 66        12
## 67       839
## 68       839
## 69       906
## 70     32876
## 71       906
## 72       428
## 73      2460
## 74      2460
## 75       395
## 76       345
## 77     33132
## 78     33132
## 79      2918
## 80      2918
## 81       261
## 82       286
## 83       286
## 84       261
## 85       401
## 86       401
## 87       312
## 88       312
## 89        96
## 90        96
## 91      3644
## 92      3644
## 93      1530
## 94       293
## 95       437
## 96       104
## 97       285
## 98       285
## 99       237
## 100      152
## 101      209
## 102      112
## 103     7876
## 104      645
## 105      995
## 106     1842
## 107     1302
## 108     5231
## 109     2283
## 110     1811
## 111     3915
## 112      636
## 113      972
## 114     3166
## 115      816
## 116     3271
## 117    10125
## 118      108
## 119      125
## 120      352
## 121      498
## 122     1724
## 123     1135
## 124     3240
## 125      683
## 126      121
## 127     1722
## 128       12
## 129     4026
## 130     5103
## 131      446
## 132     1151
## 133     1958
## 134       36
## 135     5687
## 136    12558
## 137     3725
## 138     2244
## 139    12558
## 140     5684
## 141     2244
## 142     3725
## 143      104
## 144     1737
## 145      472
## 146      176
## 147      571
## 148      308
## 149      303
## 150     1071
## 151      608
## 152      558
## 153      325
## 154     5261
## 155      857
## 156     1510
## 157      182
## 158     1161
## 159     9828
## 160      148
## 161      378
## 162      913
## 163     3729
## 164     6715
## 165     1721
## 166      802
## 167      205

# perform enrichment
enrich_results <- enrichr(
  genes = top5dg4$gene[ top5dg4$cluster == "covid" ],
  databases = "GO_Biological_Process_2017b" )[[1]]

## Uploading data to Enrichr... Done.
##   Querying GO_Biological_Process_2017b... Done.
## Parsing results... Done.

# Some databases of interest:
#   
#   GO_Biological_Process_2017b
# KEGG_2019_Human
# KEGG_2019_Mouse
# WikiPathways_2019_Human
# WikiPathways_2019_Mouse

enrich_results$Term <- sub("GO.*","",enrich_results$Term)

par(1,1,mar=c(3,20,2,1))
barplot( height = -log10(enrich_results$P.value)[20:1],
         names.arg = enrich_results$Term[20:1],
         horiz = T,
         las=2,
         border=F,
         cex.names = .6 )
abline( v=c( -log10(0.05) ),lty=2 )
abline( v=0,lty=1 )

GO analysis of DEG in covid samples of cluster 6 shows and enrichment for the interferon signalling pathway, a pathway that has recently been shown to be impaired in severe COVID-19 patients (Hadjadj et al. Science 2020)