# Dimensionality reduction

## Material

- Making sense of PCA
- Understanding t-SNE
- t-SNE explained by Josh Starmer
- Understanding UMAP
- Video by one of the UMAP authors
- More info on UMAP parameters

## Exercises

This chapter uses the `gbm` dataset.

Load the `gbm` dataset you created yesterday:

```
gbm <- readRDS("gbm_day1.rds")
```

And load the following packages:

```
library(Seurat)
library(clustree)
```

Once the data is normalized, scaled and variable features have been identified, we can start to reduce the dimensionality of the data.
For the PCA, only the previously determined variable features are used as input by default. If you wish to use a different set of genes, pass a vector of gene names to the `features` argument. You can check the variable features with `VariableFeatures(gbm)`.

```
gbm <- Seurat::RunPCA(gbm)
```
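As a minimal sketch of the `features` argument, the call below runs the PCA on an explicitly chosen gene vector. The subset (the first 1000 entries of the variable features) is arbitrary and purely illustrative, and the names `pca_subset` and `PCsub_` are hypothetical, chosen so that the default `pca` reduction is not overwritten. It assumes the `gbm` object loaded above and that the chosen genes are present in the scaled data.

```r
# Variable features that RunPCA() uses by default
hvg <- Seurat::VariableFeatures(gbm)
length(hvg)

# Illustrative only: run the PCA on an explicit gene vector.
# "pca_subset" and "PCsub_" are hypothetical names, used here so that
# the default "pca" reduction stays untouched.
gbm <- Seurat::RunPCA(gbm,
                      features = head(hvg, 1000),
                      reduction.name = "pca_subset",
                      reduction.key = "PCsub_")

# Plot the alternative reduction
Seurat::DimPlot(gbm, reduction = "pca_subset")
```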

To view the PCA plot:

```
Seurat::DimPlot(gbm, reduction = "pca")
```

We can colour the PCA plot according to any factor that is present in `@meta.data`. For example, we can take the column `Phase` (i.e. predicted cell cycle phase):

```
Seurat::DimPlot(gbm, reduction = "pca", group.by = "Phase")
```

Note

Coming back to the cell cycle analysis, we can check how the different cell cycle phases are distributed over the PCA and, if needed, regress the cell cycle effect out using the `ScaleData()` function. Here, however, the cells do not seem to separate according to cell cycle phase in the PCA.
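If the cell cycle did drive the PCA, the regression could look roughly like the sketch below. It assumes that the cell cycle scores were stored in `@meta.data` under the column names `S.Score` and `G2M.Score` (the names written by `CellCycleScoring()`). The result is stored in a separate, hypothetically named object `gbm_cc` so that `gbm` itself is left unchanged.

```r
# Assumption: cell cycle scores are stored as S.Score and G2M.Score
stopifnot(all(c("S.Score", "G2M.Score") %in% colnames(gbm@meta.data)))

# Rescale the data while regressing out both scores.
# gbm_cc is a hypothetical name; gbm is not modified.
gbm_cc <- Seurat::ScaleData(gbm,
                            vars.to.regress = c("S.Score", "G2M.Score"))

# Rerun the PCA on the regressed data and compare with the plot above
gbm_cc <- Seurat::RunPCA(gbm_cc)
Seurat::DimPlot(gbm_cc, reduction = "pca", group.by = "Phase")
```

Regressing out variables can take a while on large datasets, so it is worth doing only when the PCA clearly separates cells by phase.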

We can generate heatmaps of the genes that correlate most strongly with each dimension of our PCA:

```
Seurat::DimHeatmap(gbm, dims = 1:12, cells = 500, balanced = TRUE)
```

The elbow plot can help you determine how many PCs to use for downstream analyses such as UMAP:

```
Seurat::ElbowPlot(gbm, ndims = 40)
```

The elbow plot ranks principal components based on the percentage of variance explained by each one. Where we observe an “elbow”, i.e. where the curve flattens, the majority of the true signal is captured by that number of PCs, e.g. around 25 PCs for the gbm dataset.

Including too many PCs usually has little effect on the results, whereas including too few PCs can affect them strongly.
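To put numbers on the elbow, the standard deviations of the PCs can be converted into percentages of variance. The helper `pct_variance()` below is a hypothetical function written for this illustration, not part of Seurat, and the toy vector only shows how it behaves. Note that the percentages are relative to the PCs that were computed, not to the total variance in the data.

```r
# Hypothetical helper: percentage of variance per component,
# relative to the sum over the supplied components
pct_variance <- function(stdev) {
  stopifnot(is.numeric(stdev), all(stdev >= 0))
  variance <- stdev^2
  100 * variance / sum(variance)
}

# Toy standard deviations, for illustration only
toy_stdev <- c(4, 2, 1, 1)
pct <- pct_variance(toy_stdev)

# Cumulative percentage across components
cumsum(pct)

# With the course object, the same helper could be applied as:
# pct <- pct_variance(Seurat::Stdev(gbm, reduction = "pca"))
```

A cumulative curve that has nearly levelled off at a given PC is another indication that later components add little.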

UMAP: The goal of algorithms such as UMAP and t-SNE is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space.

```
gbm <- Seurat::RunUMAP(gbm, dims = 1:25)
```

To view the UMAP plot:

```
Seurat::DimPlot(gbm, reduction = "umap")
```
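One way to check how sensitive the embedding is to the number of PCs is to run UMAP several times and store each result under its own name. This is a sketch only: the values 5, 25 and 40 are arbitrary, the reduction names and keys are hypothetical, and it assumes the PCA holds at least 40 PCs (`RunPCA()` computes 50 by default).

```r
# Run UMAP with different numbers of PCs.
# The reduction names and keys are hypothetical, chosen so that the
# default "umap" reduction is not overwritten.
for (n_pcs in c(5, 25, 40)) {
  gbm <- Seurat::RunUMAP(gbm,
                         dims = 1:n_pcs,
                         reduction.name = paste0("umap_", n_pcs, "pcs"),
                         reduction.key = paste0("UMAP", n_pcs, "pcs_"))
}

# Compare the embeddings
Seurat::DimPlot(gbm, reduction = "umap_5pcs")
Seurat::DimPlot(gbm, reduction = "umap_25pcs")
Seurat::DimPlot(gbm, reduction = "umap_40pcs")
```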

Cells can be coloured according to cell cycle phase. Is there a group of cells that contains a high proportion of cells in G2/M phase?

```
Seurat::DimPlot(gbm, reduction = "umap", group.by = "Phase")
```