Exercise 1A: SpatialExperiment - VisiumHD binned

Learning Objectives

By the end of this exercise, you will be able to:

Understand the structure of a SpatialExperiment object.
Access and interpret spatialCoords and imgData.
Inspect assay, row, and column metadata.
Perform basic subsetting operations on SpatialExperiment objects.
Visualize Visium HD binned data on the tissue image.

Libraries

library(SpatialExperiment)
library(VisiumIO)
library(ggspavis)
library(patchwork)
library(dplyr)
library(scuttle)
library(HDF5Array)

Data for the course

We will work on a human colorectal cancer study by Oliveira et al. The dataset contains normal adjacent tissue (NAT) and colorectal carcinoma (CRC) from 5 patients. We will focus on data from a patient called “P2”, available from the 10X website.

For this sample, the study provides Visium HD data at 2, 8, and 16 um resolution and the per-cell output after cell segmentation. Matching Xenium data generated from serial sections of the same sample is also available and can be used for cross-platform comparisons.

In this exercise we will work with Visium HD data and focus on the “binned output” available as an output of the Space Ranger pipeline (version 3.0). The full output is very large, so for practical reasons we use a region of interest, selected to include an interesting part of the tissue, because it contains gland-like epithelial structures together with stromal or lower-density areas:

# Define region of interest (roi)
roi_visium <- c(
  xmin = 49000,
  xmax = 58000,
  ymin = 7500,
  ymax = 16000
)
roi_visium

 xmin  xmax  ymin  ymax 
49000 58000  7500 16000

Note: In Exercise 1B we focus on the matching Xenium dataset from a serial section of the same P2CRC sample, subsetted to an approximately matching region of interest.

Download P2 Visium HD data:

options(timeout = 600)

data_dir <- "data"
dir.create(data_dir, showWarnings = FALSE)

folders <- c(
  "day3",
  "FLEX",
  "Human_Colon_Cancer_P2",
  "segmented_outputs"
)

local_paths <- file.path(data_dir, "Human_Colon_Cancer_P2")

# CASE 1: already present locally (downloaded OR symlinked)
if (file.exists(local_paths)) {

  message("Data already available in ./data — nothing to do")

# CASE 2: available in /data → create symlinks
} else if (all(file.exists(file.path("/data", folders)))) {

  message("Data found in /data — creating symlinks")

  for (f in folders) {
    src <- file.path("/data", f)
    dest <- file.path(data_dir, f)

    if (!file.exists(dest)) {
      file.symlink(from = src, to = dest)
    }
  }

# CASE 3: missing → download
} else {

  message("Data not found — downloading into ./data")

  download.file(
    url = "https://intro-spatial-transcriptomics-training.s3.eu-central-1.amazonaws.com/spatial_course_data.tar.gz",
    destfile = file.path(data_dir, "spatial_course_data.tar.gz"),
    mode = "wb"
  )

  untar(
    tarfile = file.path(data_dir, "spatial_course_data.tar.gz"),
    exdir = data_dir
  )

  file.remove(file.path(data_dir, "spatial_course_data.tar.gz"))
}

We will first import the 16 um binned Visium HD output into a SpatialExperiment object:

# Import data into a SpatialExperiment object
spe <- TENxVisiumHD(
  spacerangerOut = "data/Human_Colon_Cancer_P2/",
  processing = "filtered",
  format = "h5",
  images = "lowres",
  bin_size = "016"
) |>
  import()

# Use unique Symbol for rownames
rownames(spe) <- uniquifyFeatureNames(
  ID = rowData(spe)$ID,
  names = rowData(spe)$Symbol
)

spe

class: SpatialExperiment 
dim: 18085 137051 
metadata(2): resources spatialList
assays(1): counts
rownames(18085): SAMD11 NOC2L ... MT-ND6 MT-CYB
rowData names(3): ID Symbol Type
colnames(137051): s_016um_00052_00082-1 s_016um_00010_00367-1 ...
  s_016um_00037_00193-1 s_016um_00144_00329-1
colData names(6): barcode in_tissue ... bin_size sample_id
reducedDimNames(0):
mainExpName: Gene Expression
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor

Then, we will subset it to the region of interest. The coordinates are in the native Visium HD image coordinate system returned by spatialCoords().

# Explore the spatial coordinates
head(spatialCoords(spe))

                      pxl_col_in_fullres pxl_row_in_fullres
s_016um_00052_00082-1           45434.83          19379.250
s_016um_00010_00367-1           62050.15          22051.927
s_016um_00163_00399-1           64033.98          13137.745
s_016um_00238_00388-1           63447.51           8747.905
s_016um_00144_00175-1           50935.66          14076.263
s_016um_00347_00254-1           55702.21           2278.720

# Subset to the region of interest based on spatial coordinates
spe <- spe[, spatialCoords(spe)[, 1] >= roi_visium[["xmin"]] &
  spatialCoords(spe)[, 1] <= roi_visium[["xmax"]] &
  spatialCoords(spe)[, 2] >= roi_visium[["ymin"]] &
  spatialCoords(spe)[, 2] <= roi_visium[["ymax"]]]

spe

class: SpatialExperiment 
dim: 18085 22419 
metadata(2): resources spatialList
assays(1): counts
rownames(18085): SAMD11 NOC2L ... MT-ND6 MT-CYB
rowData names(3): ID Symbol Type
colnames(22419): s_016um_00144_00175-1 s_016um_00204_00145-1 ...
  s_016um_00231_00291-1 s_016um_00193_00227-1
colData names(6): barcode in_tissue ... bin_size sample_id
reducedDimNames(0):
mainExpName: Gene Expression
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor

Exercise 1

Have a look at the paper and the 10X Genomics website.

How are transcript molecules measured in Visium HD?
What does the 16 um binned output represent?
Why might it be a good idea to use 16 um bins instead of the smallest 2 um bins? What might be the issues?
Which report from the Space Ranger tool can you inspect to get an overview of sample and processing quality?

Answer

The measurements of RNA molecules are indirect: targeted probe pairs are hybridized, ligated, captured on the Visium HD slide, and sequenced. The binned Visium HD outputs aggregate molecules over square spatial bins.

The smallest Visium HD bins are 2x2 um. Larger bins are pools of neighboring 2 um bins: an 8 um bin aggregates 16 smaller bins, and a 16 um bin aggregates 64 smaller bins. This increases the number of UMIs per gene and per bin and reduces the sparsity of the data. This is of course at the cost of spatial resolution, and it should be kept in mind that a 16 um may captures material from multiple cells. For this course, 16 um bins are a practical compromise. They make the object smaller and faster to plot and process while still preserving tissue-level spatial structure.

The Space Ranger web summary and metrics summary are useful reports for checking processing quality before using the sample in the course.

Exploring the object

In this exercise, we will inspect the SpatialExperiment class used to store the data. SpatialExperiment extends SingleCellExperiment, classically used for scRNA-seq analysis, so familiar accessors such as colData(), rowData(), assay(), and reducedDims() still apply.

Overview of the `SpatialExperiment` class structure

Note: The practical Exercise 1B uses a Xenium dataset stored into a SpatialFeatureExperiment class, which is an extension of the SpatialExperiment incorporating geometries, allowing to store cell segmentation polygons for example.

Exercise 2

Check the outputs of:

colData()
rowData()
assay()
reducedDims()

Then check the spatial-specific accessors:

Questions:

What kind of data is stored in each slot?
How many bins and genes are retained in the subsetted spe object?
Are all retained bins within the tissue area?
Bonus: what class is used to store the UMI count matrix? What is special about it?

Answer

The slot colData contains metadata for each spatial bin:

colData(spe) |> head()

DataFrame with 6 rows and 6 columns
                                    barcode in_tissue array_row array_col
                                <character> <integer> <integer> <integer>
s_016um_00144_00175-1 s_016um_00144_00175-1         1       144       175
s_016um_00204_00145-1 s_016um_00204_00145-1         1       204       145
s_016um_00237_00273-1 s_016um_00237_00273-1         1       237       273
s_016um_00191_00159-1 s_016um_00191_00159-1         1       191       159
s_016um_00245_00233-1 s_016um_00245_00233-1         1       245       233
s_016um_00116_00279-1 s_016um_00116_00279-1         1       116       279
                         bin_size   sample_id
                      <character> <character>
s_016um_00144_00175-1         016    sample01
s_016um_00204_00145-1         016    sample01
s_016um_00237_00273-1         016    sample01
s_016um_00191_00159-1         016    sample01
s_016um_00245_00233-1         016    sample01
s_016um_00116_00279-1         016    sample01

ncol(spe)

[1] 22419

We can check whether bins are covered by tissue:

colData(spe) |>
  as.data.frame() |>
  group_by(in_tissue) |>
  summarise(number = n())

# A tibble: 1 × 2
  in_tissue number
      <int>  <int>
1         1  22419

The slot rowData contains gene metadata:

rowData(spe) |> head()

DataFrame with 6 rows and 3 columns
                     ID      Symbol            Type
            <character> <character>        <factor>
SAMD11  ENSG00000187634      SAMD11 Gene Expression
NOC2L   ENSG00000188976       NOC2L Gene Expression
KLHL17  ENSG00000187961      KLHL17 Gene Expression
PLEKHN1 ENSG00000187583     PLEKHN1 Gene Expression
PERM1   ENSG00000187642       PERM1 Gene Expression
HES4    ENSG00000188290        HES4 Gene Expression

nrow(spe)

[1] 18085

The assay stores the UMI counts matrix:

assay(spe)

<18085 x 22419> sparse DelayedMatrix object of type "integer":
        s_016um_00144_00175-1 ... s_016um_00193_00227-1
 SAMD11                     0   .                     0
  NOC2L                     0   .                     0
 KLHL17                     0   .                     0
PLEKHN1                     0   .                     0
  PERM1                     0   .                     0
    ...                     .   .                     .
MT-ND4L                     1   .                     0
 MT-ND4                     2   .                     8
 MT-ND5                     0   .                     1
 MT-ND6                     0   .                     0
 MT-CYB                     0   .                     2

class(assay(spe))

[1] "DelayedMatrix"
attr(,"package")
[1] "DelayedArray"

The DelayedArray class allows to store out-of-memory representations of count matrices, which are saved as .h5 files on the disk. This is very useful for efficient processing of large datasets (100,000s of cells or spots)

The reducedDims slot is empty at this stage:

reducedDims(spe)

List of length 0
names(0):

The spatialCoords slot stores image coordinates for each bin:

spatialCoords(spe) |> head()

                      pxl_col_in_fullres pxl_row_in_fullres
s_016um_00144_00175-1           50935.66          14076.263
s_016um_00204_00145-1           49228.67          10548.476
s_016um_00237_00273-1           56729.41           8718.557
s_016um_00191_00159-1           50036.54          11318.543
s_016um_00245_00233-1           54399.11           8220.724
s_016um_00116_00279-1           56989.11          15791.601

The imgData slot lists the linked tissue image and scale factor:

imgData(spe)

DataFrame with 1 row and 4 columns
    sample_id    image_id   data scaleFactor
  <character> <character> <list>   <numeric>
1    sample01      lowres   ####  0.00797342

The image can be exported as a raster:

imgRaster(spe) |> plot()

Visualizing the tissue region

To get a visual overview, we use plotVisium() function from the ggspavis package. First we plot the tissue image without bins:

plotVisium(spe, spots = FALSE)

Then we overlay the 16 um bins to see which part of the tissue was covered by the array:

plotVisium(spe, point_shape = 22, point_size = 0.5)

Exercise 3

Why do we see only a specific region of the slide?

Answer

The full Visium HD output is much larger than needed for this introductory exercise. At the start of this exercise, we subsetted the object to one representative tissue region so that it remains fast to plot and process.

It is possible to colour bins according to a gene or a column in colData, for example PIGR. We renamed the rows to gene symbols after import, which is why we can refer to this marker as "PIGR" rather than by its Ensembl ID. At this stage we plot the number of UMIs since log-normalized counts will be introduced later in the course.

plotVisium(spe,
  annotate = "PIGR",
  assay = "counts",
  zoom = TRUE,
  point_size = 1,
  point_shape = 22
)

Exercise 4

Check the usage of plotVisium with ?plotVisium. Create a plot that zooms into the selected region and colors the bins according to column array_row.

In which slot is array_row stored, and what does it represent?

Answer

The column array_row is stored in colData. It represents the row position of each bin in the Visium HD grid, so the plot shows a smooth coordinate gradient across the selected region. This is technical spatial metadata, not a gene expression pattern.

plotVisium(
  spe,
  annotate = "array_row",
  zoom = TRUE,
  point_shape = 22
)

Retain only observations that map to tissue

Exercise 5

Subset the SpatialExperiment object to retain only bins that are within the tissue area.

How many bins were present in spe, and how many are left after selecting bins overlapping the tissue?

Answer

table(colData(spe)$in_tissue)


    1 
22419

## all spots, although it's clear that some do not overlap the tissue...

## let's assume that the spots not overlapping the tissue will not collect many counts
quantile(colSums(counts(spe)), prob=0.01)

1% 
32

quantile(colSums(counts(spe)), prob=0.02)

2% 
50

sub <- colSums(counts(spe)) > 50
table(sub)

sub
FALSE  TRUE 
  453 21966

plotVisium(
  spe[, sub],
  annotate = "array_row",
  zoom = TRUE,
  point_shape = 22
)

Filtering on the library size thus seems to remove a few”empty” bins not overlapping the tissue. Let’s keep this in mind for the Exercise 2 where we perform QC filtering!

Save the object

We save the filtered SpatialExperiment object for the next steps. Remember that the count matrix is stored as an out-of-memory DelayedArray object, so the object cannot be saved with the usual saveRDS command. The following command saves the assay matrices as .h5 files along with the SpatialExperiment object as an .rds file.

dir.create("results/day1", showWarnings = FALSE, recursive = TRUE)
saveHDF5SummarizedExperiment(spe,
  dir = "results/day1", prefix = "01.1a_spe_", replace = TRUE,
  chunkdim = NULL, level = NULL, as.sparse = NA,
  verbose = NA
)

Clear your environment:

rm(list = ls())
gc()
.rs.restartR()

Important

Key Takeaways:

SpatialExperiment is a convenient container for spot- or bin-level spatial transcriptomics data.
Visium HD binned outputs can be represented naturally as a SpatialExperiment.
Course subsets are chosen to balance biological realism with practical runtime.