Analysis tools and quality control (QC)

Download Presentation: Introduction to scanpy

In this exercise, you will begin to learn about the standard workflow for analyzing scRNA-seq count data in Python. Because single-cell data are complex and analyses are often tailored to the particular experimental design, there is no single “correct” approach to analyzing these data. However, certain steps have become accepted as a sort of standard “best practice.”

A useful overview of current best practices is found in the articles below, which we also borrow from in this tutorial. We thank the authors for compiling such handy resources!

Current best practices in single-cell RNA-seq analysis are explained in a recent Nature Review

Accompanying this review is an online webpage, which is still under development but can be quite handy nonetheless:

Learning outcomes

After having completed this chapter you will be able to:

  • Load single cell data into Python.
  • Explain the basic structure of an AnnData object and extract count data and metadata.
  • Calculate and visualize quality measures based on:
    • mitochondrial genes
    • ribosomal genes
    • hemoglobin genes
    • relative gene expression
  • Interpret the above quality measures per cell.
  • Perform cell filtering based on user-selected quality thresholds.

Loading scRNAseq data

After the generation of the count matrices with cellranger, the next step is the data analysis. The scanpy package is currently the most popular software in Python to do this. To start working with scanpy, you must import the package into your Jupyter notebook as follows:

import scanpy as sc

An excellent resource for documentation on scanpy can be found on the software page at the following link.

Some supplemental packages for data handling and visualization are also very useful to import into your notebook.

import pandas as pd # for handling data frames (i.e. data tables)
import numpy as np # for handling numbers, arrays, and matrices
import matplotlib.pyplot as plt # plotting package

First, we will load a file specifying the different samples, and create a dictionary “datadirs” specifying the location of the count data:

sample_info = pd.read_csv("course_data/sample_info_course.csv")

datadirs = {}
for sample_name in sample_info["SampleName"]:
    if "PBMMC" in sample_name:
        datadirs[sample_name] = "course_data/count_matrices/" + sample_name + "/outs/filtered_feature_bc_matrix"

To run through a typical scanpy analysis, we will use the files that are in the directory outs/filtered_feature_bc_matrix. This directory is part of the output generated by CellRanger.
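Before loading, it can be worth confirming that each directory actually contains the files the reader expects. The sketch below assumes the gzipped three-file layout written by Cell Ranger version 3 and later (barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz). The helper missing_10x_files is hypothetical and not part of scanpy.

```python
from pathlib import Path

# Assumption: Cell Ranger v3+ layout. Older versions write uncompressed
# barcodes.tsv, genes.tsv and matrix.mtx instead.
EXPECTED_FILES = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")

def missing_10x_files(matrix_dir):
    """Return the expected 10x files that are absent from matrix_dir."""
    matrix_dir = Path(matrix_dir)
    return [name for name in EXPECTED_FILES if not (matrix_dir / name).is_file()]
```

With the datadirs dictionary created above, missing_10x_files(datadirs[sample_name]) should return an empty list for every sample if nothing is missing.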

We will use the dictionary of file paths generated in the previous step to load each sample into a separate AnnData object. We will then store all of those samples in a list called adatas, and combine them into a single AnnData object for our analysis.

adatas = []
for sample in datadirs.keys():
    print("Loading: ", sample)
    curr_adata = sc.read_10x_mtx(datadirs[sample]) # load file into an AnnData object
    curr_adata.obs["sample"] = sample # record which sample each cell came from
    curr_adata.X = curr_adata.X.toarray() # convert the sparse count matrix to a dense array
    adatas.append(curr_adata)
    
adata = sc.concat(adatas) # combine all samples into a single AnnData object
adata.obs_names_make_unique() # make sure each cell barcode has a unique identifier
Loading:  PBMMC_1
Loading:  PBMMC_2
Loading:  PBMMC_3
/home/alex/anaconda3/envs/sctp/lib/python3.8/site-packages/anndata/_core/anndata.py:1838: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")

The AnnData object is similar to a detailed spreadsheet! Some basic commands to view the object are shown below. For a new dataset, there will be little to no metadata other than cell IDs and gene names, but as you perform analyses, the metadata fields will be populated with more detail.

Exercise 1: Check what’s in the adata object, by typing adata in the Python console. How many gene features are in there? And how many cells?