Exercise 2

Quality control at spot-level

In this second exercise, we will focus on the critical step of quality control (QC) for spatial transcriptomics data. This step focuses on removing low-quality spots or technical artifacts from the dataset to avoid biases into downstream analysis.

Learning objectives

By the end of this exercise, you will be able to:

  • Calculate per-spot QC metrics.
  • Identify local outliers based on various QC metrics.
  • Detect spatial artifacts using SpotSweeper.
  • Visualize QC metrics and detected artifacts.

Libraries

Calculate QC metrics

We will start by calculating QC metrics that are also commonly used in scRNA-seq data analysis (number of UMI counts per cell, number of detected genes per cell, percentage of UMIs originating from mitochondrial genes). Here these metrics will be calculated per sport, and used to identify low-quality spots.

We will start by loading our SpatialExperiment object, which was saved in the previous exercise, and prepare it for quality control analysis.

# Load the SpatialExperiment object
spe <- loadHDF5SummarizedExperiment(dir="results/day1/", prefix="01.1_spe")

# Identify mitochondrial genes (genes with symbol starting with "MT-")
is.mito <- rownames(spe)[grepl("^MT-", rownames(spe))]

Now, we will calculate per-spot QC metrics using the scuttle package. The function addPerCellQCMetrics adds metrics mitochondrial gene percentage to the colData of the SpatialExperiment object.

Run the following code to compute these metrics and inspect the results.

spe <- addPerCellQCMetrics(spe, subsets = list(Mito = is.mito))
Exercise 1

Check which metadata has been added to colData. Do you recognize the different metrics?

# Display the colData to see the newly added QC metrics.
colData(spe)
DataFrame with 14207 rows and 12 columns
                                    barcode in_tissue array_row array_col
                                <character> <integer> <integer> <integer>
s_016um_00144_00175-1 s_016um_00144_00175-1         1       144       175
s_016um_00204_00145-1 s_016um_00204_00145-1         1       204       145
s_016um_00191_00159-1 s_016um_00191_00159-1         1       191       159
s_016um_00111_00233-1 s_016um_00111_00233-1         1       111       233
s_016um_00202_00235-1 s_016um_00202_00235-1         1       202       235
...                                     ...       ...       ...       ...
s_016um_00217_00238-1 s_016um_00217_00238-1         1       217       238
s_016um_00192_00176-1 s_016um_00192_00176-1         1       192       176
s_016um_00214_00240-1 s_016um_00214_00240-1         1       214       240
s_016um_00193_00227-1 s_016um_00193_00227-1         1       193       227
s_016um_00109_00223-1 s_016um_00109_00223-1         1       109       223
                         bin_size   sample_id       sum  detected
                      <character> <character> <numeric> <integer>
s_016um_00144_00175-1         016    sample01       322       202
s_016um_00204_00145-1         016    sample01       253       203
s_016um_00191_00159-1         016    sample01      1546       982
s_016um_00111_00233-1         016    sample01       818       665
s_016um_00202_00235-1         016    sample01       227       195
...                           ...         ...       ...       ...
s_016um_00217_00238-1         016    sample01       291       246
s_016um_00192_00176-1         016    sample01       626       557
s_016um_00214_00240-1         016    sample01       330       271
s_016um_00193_00227-1         016    sample01       802       684
s_016um_00109_00223-1         016    sample01      1020       794
                      subsets_Mito_sum subsets_Mito_detected
                             <numeric>             <integer>
s_016um_00144_00175-1              109                    10
s_016um_00204_00145-1               37                     7
s_016um_00191_00159-1              230                    11
s_016um_00111_00233-1               51                     8
s_016um_00202_00235-1                7                     5
...                                ...                   ...
s_016um_00217_00238-1               10                     6
s_016um_00192_00176-1                7                     6
s_016um_00214_00240-1               10                     6
s_016um_00193_00227-1               19                     7
s_016um_00109_00223-1               67                    10
                      subsets_Mito_percent     total
                                 <numeric> <numeric>
s_016um_00144_00175-1             33.85093       322
s_016um_00204_00145-1             14.62451       253
s_016um_00191_00159-1             14.87710      1546
s_016um_00111_00233-1              6.23472       818
s_016um_00202_00235-1              3.08370       227
...                                    ...       ...
s_016um_00217_00238-1              3.43643       291
s_016um_00192_00176-1              1.11821       626
s_016um_00214_00240-1              3.03030       330
s_016um_00193_00227-1              2.36908       802
s_016um_00109_00223-1              6.56863      1020
  • The sum column contains the total number of unique molecular identifiers (UMIs) for each spot
  • The detected column contains the number of unique genes detected per spot
  • The subsets_mito_percent contains the percentage of transcripts mapping to mitochondrial genes per spot

Global outlier detection

In order to detect low-quality spots and filter them out, QC methods adapted from scRNA-seq analysis can be applied. A simple option is to apply a fixed (upper or lower-bound) threshold to a metric across all spots, and remove the spots that do not pass the filtering criteria applied.

To set up these threshold we can look for information from 10X and the community, for example https://github.com/10XGenomics/HumanColonCancer_VisiumHD/issues/28. Based on this discussion the authors of the OSTA book wrote (see “Remove bins overlaying empty tissue” section in https://lmweber.org/OSTA/pages/seq-workflow-visium-hd-bin.html#dependencies):

Based on a discussion with the original author at 10x Genomics, in this dataset, 8 µm bins with a library size below 100 are removed. Since we expect a 4-fold increase in library size across all bins at 16 µm, we can set the filtering threshold to 400.

We can also have a look and plot distributions of the values to decide which is the best cutoff:

# Plot library size vs. detected genes
p1 <- ggcells(spe, aes(x=detected, y=sum)) + 
  geom_point() +
  geom_density_2d() +
  scale_x_log10() +
  scale_y_log10() +
  geom_xsidehistogram() + 
  geom_ysidehistogram() + 
  geom_vline(xintercept = 300, col="red", lty=2) +
  geom_hline(yintercept = 400, col="red", lty=2) +
  labs(y="Number of UMIs", x="Number of detected genes") +
  ggtitle("Library size vs. detected")

# Plot mito proportion vs. detected genes
p2 <- ggcells(spe, aes(x=detected, y=subsets_Mito_percent)) + 
  geom_point() +
  geom_density_2d() +
  scale_x_log10() +
  geom_xsidehistogram() + 
  geom_ysidehistogram() + 
  geom_vline(xintercept = 300, col="red", lty=2) +
  geom_hline(yintercept = 20, col="red", lty=2) +
  labs(y="% mitochondrial reads", x="Number of detected genes") +
  ggtitle("Mito proportion vs. genes detected")

p1 + p2

And further check how many spots would be removed with the given thresholds:

# Flag spots based on fixed thresholds
spe$qc_lib_size <- spe$sum < 400
spe$qc_detected <- spe$detected < 300
spe$qc_mito_prop <- spe$subsets_Mito_percent > 20

# Tabulate number of spots flagged by each
qc <- grep("^qc", names(colData(spe)))
sapply(colData(spe)[qc], table)
      qc_lib_size qc_detected qc_mito_prop
FALSE       11021       11263        12148
TRUE         3186        2944         2059
Exercise 2

What do you think about this intial filtering?

Have a look at the mitochondrial percentage numbers on the tissue section using plotVisium() function. Which spots are more likely to be excluded?

plotVisium(spe, annotate = "subsets_Mito_percent", zoom = TRUE)

Visualize spatial pattern of QC metrics (Global outliers)

Very importantly, we want to have a look into the spatial pattern of using a fixed threshold for QC metrics :

# check spatial pattern of discarded spots
p1 <- plotObsQC(spe, 
                plot_type="spot", annotate="qc_lib_size") + 
  ggtitle("Library size (< 400 UMI)")

p2 <- plotObsQC(spe, 
                plot_type="spot", annotate="qc_detected") + 
  ggtitle("Detected genes (< 300 genes)")

p3 <- plotObsQC(spe, 
                plot_type="spot", annotate="qc_mito_prop") + 
  ggtitle("Mito proportion (> 20%)")

# plot the three together
gridExtra::grid.arrange(p1, p2, p3, ncol=2)

Local Outlier Detection

To avoid discarding whole layers of tissues, another approach to identify low-quality spots is to use local outlier detection methods that take into account the spatial context of each spot.

SpotSweeper package provides functions to identify common QC metrics (such as the ones we have seen in the previous section: library size, number of detected genes, and mitochondrial percentage) and detect local outliers based on these metrics.

We will use localOutliers from SpotSweeper which helps identifying spots that deviate significantly from their local neighborhood.

# Identify local outliers based on library size ("sum" of counts).
# Spots with unusually low library size compared to their neighbors will be flagged.
spe <- localOutliers(spe,
  metric = "sum",
  direction = "lower",
  log = TRUE
)

# Identify local outliers based on the number of unique genes detected ("detected").
# Spots with an unusually low number of detected genes will be flagged.
spe <- localOutliers(spe,
  metric = "detected",
  direction = "lower",
  log = TRUE
)

# Identify local outliers based on the mitochondrial gene percentage.
# Spots with an unusually high mitochondrial percentage will be flagged.
spe <- localOutliers(spe,
  metric = "subsets_Mito_percent",
  direction = "higher",
  log = FALSE
)

# Combine all individual outlier flags into a single "local_outliers" column.
# A spot is considered a local outlier if it's flagged by any of the above metrics.
spe$local_outliers <- as.logical(spe$sum_outliers) |
  as.logical(spe$detected_outliers) |
  as.logical(spe$subsets_Mito_percent_outliers)
Exercise 3

How many spots were identified as local outliers based on the combined criteria?

As we have seen before, visualizing the QC metrics is crucial for understanding the quality of your spatial transcriptomics data and for deciding on appropriate filtering strategies, so look at their spatial localization using the plotObsQC function.

What do you think?

# Count the number of spots flagged as local outliers.
table(spe$local_outliers)

FALSE  TRUE 
14036   171 
plotObsQC(spe, 
          plot_type="spot", annotate="local_outliers") + 
  ggtitle("All local outliers")

Even better, we can visualize the spatial distribution of the local outliers together with the QC metrics, for example plot the spatial distribution of local outliers for mitochondrial percentage.

plotQCmetrics(spe,
  metric = "subsets_Mito_percent", outliers = "subsets_Mito_percent_outliers",
  point_size = 0.4,
  stroke = 0.75
) +
  ggtitle("Local outliers for %MT")

Exercise 4

Plot the spatial distribution of local outliers for library size and number of detected genes. What do you observe?

# Plot the spatial distribution of local outliers for library size.
plotQCmetrics(spe,
  metric = "sum_log", outliers = "sum_outliers", point_size = 0.4,
  stroke = 0.75
) +
  ggtitle("Local outliers for number of UMIs")

# Plot the spatial distribution of local outliers for number of detected genes.
plotQCmetrics(spe,
  metric = "detected_log", outliers = "detected_outliers", point_size = 0.4,
  stroke = 0.75
) +
  ggtitle("Local outliers for number of UMIs")

Very few spots seem to be flagged as outliers!

We can visualisaze the z-scores from the local outlier detection test, and compare it to the global threshold used earlier. How do you interpret this?

plotObsQC(spe, plot_type = "violin", x_metric = "subsets_Mito_percent_z", annotate = "qc_mito_prop")

Compare and combine global and local outliers and apply filtering

Based on the comparison of global and local outliers, we choose here to adjust the global cutoffs to be less stringent:

Adjust global filtering cutoffs based on previous plots (apply less stringent cutoffs)

# adjust global filtering cutoff
spe$qc_lib_size <- spe$sum < 120
spe$qc_detected <- spe$detected < 100
spe$qc_mito_prop <- spe$subsets_Mito_percent > 35

Compare global and local outliers being flagged

spe$global_outliers <- spe$qc_lib_size | spe$qc_detected | spe$qc_mito_prop
p1 <- plotObsQC(spe, plot_type="spot", annotate="global_outliers")
p2 <- plotObsQC(spe, plot_type="spot", annotate="local_outliers")
gridExtra::grid.arrange(p1, p2, ncol=2)

Combine both:

# combine local and global outliers to create a final discard flag
spe$discard <- spe$global_outliers | spe$local_outliers
Exercise 5

Visualise the spots that will be discarded based on the combined criteria. How many spots will be removed?

# check how many spots will be discarded
table(spe$discard)

FALSE  TRUE 
12615  1592 
# have a final look to the spots that we will discard
plotVisium(spe, zoom=T, annotate="discard")

Technical artifacts

Technical artifacts can arise during tissue preparation, for example tissue dissection can cause mechanical damage, leading to large areas with artificially low biological heterogeneity. These can be detected with the findArtifacts function of the spotSweeper package.

# find artifacts using SpotSweeper
spe <- findArtifacts(spe,
    mito_percent = "subsets_Mito_percent",
    mito_sum = "subsets_Mito_sum",
    n_order = 5,
    name = "artifact"
)
Exercise 6

What is the findArtifacts function performing and what are the (potentially problematic) assumptions?

Exercise 7

Plot the spatial distribution of the artifact column identified by SpotSweeper. What do you observe?

# Plot the spatial distribution of the detected artifacts.
plotQCmetrics(spe,
    metric = "subsets_Mito_percent", # You can choose any relevant metric to display
    outliers = "artifact", 
    point_size = 1.1
) +
    ggtitle("Detected Spatial Artifacts")

Filtering

We decide to filter out the low-quality spots from the SpatialExperiment object based on the combined criteria of global thresholds and local outliers. Additionally, we will remove any features (genes) that have zero counts across all remaining spots.

# remove the spots considered of low quality based on both global and local metrics
spe <- spe[, !spe$discard]

# remove features with all 0 counts
spe <- spe[rowSums(counts(spe)) > 0, ]

What is the size of the dataset after filtering?

Save the object

Save the filtered SpatialExperiment object for the next steps:

saveHDF5SummarizedExperiment(spe, dir="results/day1", prefix="01.2_spe", replace=TRUE,
                                chunkdim=NULL, level=NULL, as.sparse=NA,
                                verbose=NA)

Clear your environment:

rm(list = ls())
gc()
.rs.restartR()
Important

Key Takeaways: