Project 1

Project 1 is a single-cell sequencing project on the zebrafish retina. Photoreceptors were damaged with MNU, and the response was investigated using transgenic fish carrying a construct in which a non-coding regulatory element (careg) is attached to an EGFP transcript, so that EGFP can be used as a marker of regenerative activation. For single-cell transcriptomics analysis, cell suspensions were prepared from retinal cells and processed with the 10x 3’ kit.

Available data

The data have been downloaded from GEO (GSE202212) and prepared for you. The count matrices were created with cellranger. To create the count tables, the EGFP sequence was added to the reference genome. The gene name of EGFP is EGFP.

In order to download the data, run:

wget https://single-cell-transcriptomics.s3.eu-central-1.amazonaws.com/projects/data/project1.tar.gz
tar -xvf project1.tar.gz
rm project1.tar.gz

After extracting, a directory project1 appears with the following content:

.
├── data
│   ├── 10dp1
│   │   ├── filtered_feature_bc_matrix
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   └── web_summary.html
│   ├── 10dp2
│   │   ├── filtered_feature_bc_matrix
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   └── web_summary.html
│   ├── 3dp1
│   │   ├── filtered_feature_bc_matrix
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   └── web_summary.html
│   ├── 3dp2
│   │   ├── filtered_feature_bc_matrix
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   └── web_summary.html
│   ├── 7dp1
│   │   ├── filtered_feature_bc_matrix
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   └── web_summary.html
│   ├── 7dp2
│   │   ├── filtered_feature_bc_matrix
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   └── web_summary.html
│   ├── ctrl1
│   │   ├── filtered_feature_bc_matrix
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   └── web_summary.html
│   └── ctrl2
│       ├── filtered_feature_bc_matrix
│       │   ├── barcodes.tsv.gz
│       │   ├── features.tsv.gz
│       │   └── matrix.mtx.gz
│       └── web_summary.html
└── paper.pdf

17 directories, 33 files

This shows that we have four treatments with two replicates each:

ctrl: controls
3dp: 3 days post injury
7dp: 7 days post injury
10dp: 10 days post injury

Now create a new project in the project1 directory (Project (None) > New Project …), and create a Seurat object from the count matrices:

library(Seurat)

# vector of paths to all sample directories
datadirs <- list.files(path = "data", full.names = TRUE) 

# get the sample names
# replace underscores with hyphen to correctly extract sample names later on
samples <- basename(datadirs) |> gsub("_", "-", x = _)

# the count files are in the filtered_feature_bc_matrix subdirectory of each sample
datadirs <- paste(datadirs, "filtered_feature_bc_matrix", sep = "/")

names(datadirs) <- samples

# create a large sparse matrix from all count data
sparse_matrix <- Seurat::Read10X(data.dir = datadirs)

# create a seurat object from sparse matrix
seu <- Seurat::CreateSeuratObject(counts = sparse_matrix,
                                  project = "Zebrafish")
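
The underscore replacement in the code above matters because, when Read10X is given a named vector of directories, it prefixes every cell barcode with the sample name followed by an underscore, and CreateSeuratObject by default takes the text before the first underscore as the sample identity (orig.ident). The base R sketch below only imitates that splitting step; the barcode strings are made up for illustration and no Seurat function is called.

```r
# Hypothetical barcodes, as they would look after being prefixed with the sample name
barcodes <- c("ctrl1_AAACCCAAGAAACACT-1",
              "3dp2_AAACCCAAGAAACCAT-1",
              "10dp1_AAACCCAAGAAACCCA-1")

# Keep the field before the first underscore
sample_of_cell <- vapply(strsplit(barcodes, "_", fixed = TRUE),
                         function(parts) parts[1], character(1))
print(table(sample_of_cell))

# A sample name that itself contains an underscore would be cut short
truncated <- strsplit("ctrl_1_AAACCCAAGAAACACT-1", "_", fixed = TRUE)[[1]][1]
print(truncated)
```

The sample names in this project contain no underscores, so the gsub call is a safeguard rather than a necessity here.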

Project exercise

With this dataset, go through the steps we have performed during the course, and try to answer the following questions. Pay specific attention to quality control, clustering and annotation.

Project exercises and questions

Look at the quality of the dataset, understand possible pitfalls and decide how to best filter the data

Have a look at the Seurat object before starting your analysis. How many cells and genes are present? Is the data already normalised?
Assess the quality of the samples. Are there any important issues to consider? Would you remove genes?
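
As a rough illustration of the per-cell QC percentages these questions refer to, the sketch below computes the share of counts coming from mitochondrial, ribosomal and hemoglobin genes on a tiny made-up count matrix, using the patterns listed under Tips. The gene names, cell names and counts are hypothetical, and percent_matching is a helper defined here, not a Seurat function. On the real object, Seurat::PercentageFeatureSet(seu, pattern = "^mt-") should give the corresponding per-cell value.

```r
# Toy count matrix: genes in rows, cells in columns (all values are made up)
counts <- matrix(
  c(10,  0,
     5,  5,
     5, 15,
    30, 80),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("mt-nd1", "rps3", "hbba1", "rho"),
                  c("cell1", "cell2"))
)

# Percentage of each cell's counts that come from genes matching a pattern
percent_matching <- function(counts, pattern) {
  hits <- grepl(pattern, rownames(counts))
  100 * colSums(counts[hits, , drop = FALSE]) / colSums(counts)
}

qc <- data.frame(
  percent_mito   = percent_matching(counts, "^mt-"),
  percent_ribo   = percent_matching(counts, "^rp[sl]"),
  percent_globin = percent_matching(counts, "^hb[^(p)]")
)
print(qc)
```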

Normalise, integrate and visualise the data in order to cluster the cells into biologically meaningful populations

Which clustering resolution would you choose?
Inspect the clustering (e.g. percentage of mitochondrial counts, number of features). Would you filter out some populations? Do you need to change your resolution?
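
One way to inspect clusters for quality problems is to summarise the QC metrics per cluster. The following is a minimal sketch on a made-up metadata table; the column names cluster and percent_mito and all values are assumptions for illustration. On the real data, the table would be seu@meta.data, with whatever column names you created during QC and clustering.

```r
# Hypothetical per-cell metadata (values are made up)
meta <- data.frame(
  cluster      = c("0", "0", "0", "1", "1", "1"),
  percent_mito = c(2, 3, 4, 30, 35, 40),
  nFeature_RNA = c(1500, 1800, 2100, 250, 300, 350)
)

# Median of each QC metric within each cluster
per_cluster <- aggregate(cbind(percent_mito, nFeature_RNA) ~ cluster,
                         data = meta, FUN = median)
print(per_cluster)
```

A cluster that combines a high mitochondrial percentage with few detected features, like the hypothetical cluster 1 here, may be a candidate for removal, but that remains a judgment call on the real data.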

The next step is to identify the different cell types, in particular Müller glia (MG), rods and cones. MG is the cell population in which we want to evaluate whether EGFP is expressed during regeneration (as a reporter of careg).

Annotate cells (at least rods, cones and MG) and describe EGFP behavior in the different conditions (ctrl, 3dp, 7dp, 10dp)
Can you identify a subpopulation of MG cells expressing EGFP?
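
To describe EGFP per condition, one simple summary is the fraction of MG cells with at least one EGFP count. The sketch below shows the calculation on a made-up table; the column names and all values are hypothetical and say nothing about the real dataset. On the real object, the EGFP counts would come from the gene named EGFP and the cell types from your own annotation.

```r
# Hypothetical per-cell table (values are made up)
cells <- data.frame(
  condition = factor(c("ctrl", "ctrl", "3dp", "3dp", "7dp", "7dp", "10dp", "10dp"),
                     levels = c("ctrl", "3dp", "7dp", "10dp")),
  cell_type = c("MG", "rod", "MG", "MG", "MG", "cone", "MG", "MG"),
  EGFP      = c(0, 0, 4, 0, 2, 0, 0, 1)
)

# Restrict to MG cells, then compute the fraction of EGFP-positive cells per condition
mg <- cells[cells$cell_type == "MG", ]
frac_egfp_pos <- tapply(mg$EGFP > 0, mg$condition, mean)
print(frac_egfp_pos)
```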

Bonus: can you identify differences in expression profile between injured and uninjured rods? And between injured and uninjured cones? How do your results compare to the ones from the paper?

Tips

For mitochondrial genes, ribosomal genes and hemoglobin genes you can use the following patterns: "^mt-", "^rp[sl]" and "^hb[^(p)]".
Work iteratively: based on the results of an analysis, go back and adjust the previous steps. For example, if the clustering does not follow cell types, try adjusting the number of components or the resolution.
You can plot the samples by condition (keeping the 2 replicates together). For that, try adding a new metadata column that summarises the information you need.
For cell type annotation, have a look at the methods section of the original paper; the authors provide a supplementary table with retina gene markers.
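
For the tip on plotting by condition, a condition label can be derived from the sample names by stripping the trailing replicate number. This is a minimal base R sketch using the sample names from the directory listing; the variable name condition is an assumption.

```r
# Sample names as they appear in the data directory
samples <- c("10dp1", "10dp2", "3dp1", "3dp2", "7dp1", "7dp2", "ctrl1", "ctrl2")

# Remove the trailing replicate number: "10dp1" -> "10dp", "ctrl2" -> "ctrl"
condition <- sub("[0-9]+$", "", samples)

# Order the levels along the time course rather than alphabetically
condition <- factor(condition, levels = c("ctrl", "3dp", "7dp", "10dp"))
print(table(condition))
```

On the Seurat object, the same idea could be applied per cell, for example seu$condition <- sub("[0-9]+$", "", seu$orig.ident), assuming orig.ident holds the sample names.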