Exercise 8 : Multi-scale analysis

Learning objectives

Learn the basic principles of exploratory spatial statistics
Compute and visualise local and global measures of spatial autocorrelation
Compute and interpret point pattern analysis metrics on cell-based data
Primer to analysis of multicellular anatomical structures
Compare spatial statistics measures across samples using linear mixed models
This chapter is based on the OSTA online book chapter on spatial statistics and uses heavily the Voyager spatial statistics library (Moses et al. 2023)

Libraries

library("SpatialExperiment")
library("ggplot2")
library("Voyager")
library("HDF5Array")
library("tidyr")
library("dplyr")
library("lme4")
library("lmerTest")
library("spatialFDA")
library("scrapper")
library("dittoSeq")
library("sosta")
library("alabaster.sfe")
set.seed(1234)

knitr::opts_chunk$set(cache.lazy = FALSE)

Load data

# Load the SpatialExperiment objectq
sfe <- readObject("results/day2/02.4_sfe/")

>>> Reading SpatialExperiment

>>> Reading colgeometries

# set imageData to NULL as we do not need it anymore
imgData(sfe) <- NULL

# Let's define the color palette for plotting
cls <- levels(factor(colData(sfe)$SingleR_label))
ctc <- setNames(unname(dittoColors())[seq_along(cls)], cls)

In spatial statistics, we often compare categorical labels across space. Useful categorical labels for spatial omics datasets include non-spatial clusters or cell types (Aran et al. 2019).

Data modalities of spatial statistics

In spatial transcriptomics, we differentiate between imaging-based and sequencing-based technologies, which give rise to very different data modalities (Rao et al. 2021).

Data Modalities in Spatial Transcriptomics (Emons et al. 2025)

Our dataset was acquired using 10x Visium HD. The data modality of this technology can be described by a regular lattice, which means that the observations are taken at regularly spaced intervals. This is very different to imaging-based approaches where observations can be thought of as a stochastic data generating process that is related to some underlying biological mechanism.

With spots of \(2 \mu m^2\), Visium HD can provide subcellular resolution. We can segment cells from this regular lattice of pixels and approximate the cells by their centroid. This in turn can be described as an irregular lattice of the cell types in space or even be analysed as a point patterns using the cell centroid locations.

In what follows, we will perform three types of analyses on different spatial scales of the dataset.

1. cell-type-free analysis: This is an irregular lattice and we analyse it thus with tools from lattice data analysis
1. cell-level analysis: This is a point pattern pattern of cell centroids and we will analyse it with point pattern analysis.
1. multi-cellular analysis: This is a structure-based analysis of anatomical structures.

Cell type-free analysis

Lattice-based analysis is focused on comparing a numerical value across the lattice of measurements. Such a lattice is constructed by defining the neighbourhood of each cell. First, we will define a neighbourhood on which we want to compute spatial statistics metrics.

In binned 10x Visium, it is very natural to consider the direct neighbours of the bins in the Visium lattice, for example with the function Voyager::findVisiumGraph().

Here we create a knn graph with the function SpatialFeatureExperiment::findSpatialNeighbors(), considering the \(k = 5\) nearest neighbourhood of each cell.

(g <- findSpatialNeighbors(sfe,
    method = "knearneigh", # wraps the spdep function with the same name
    k = 5,
    zero.policy = TRUE,
    sample_id = "all"
  ))

Characteristics of weights list object:
Neighbour list object:
Number of regions: 33763 
Number of nonzero links: 168815 
Percentage nonzero weights: 0.01480911 
Average number of links: 5 
Non-symmetric neighbours list

Weights style: W 
Weights constants summary:
      n         nn    S0       S1       S2
W 33763 1139940169 33763 12383.72 137520.7

colGraph(sfe, "knn5") <- g

plotColGraph(sfe,
  colGraphName = "knn5"
) + theme_void()

We notice two things:

The \(y\)-coordinate is flipped compared to the image. This is due to the way that Visium data is acquired. The Visium slide is placed on top of the tissue but the H&E image is acquired from below the tissue.
The irregular lattice is clearly visible.

Global indicators of spatial association

Now that we have defined both the neighbourhood of a cell, we can turn to measures of spatial association.

In general, in a global association metric, we compute for each point, \(i\), a metric for its neighbourhood \(j\). The neighbourhood we quantified before as the regular hexagonal grid and is passed to the functions as a weight matrix \(w_{ij}\). The association measure is then a function \(f(x_i,x_j)\) and could e.g. quantify spatial correlation

\[ \sum_i \sum_j f(x_i,x_j) w_{ij} \]

A famous measure of global spatial autocorrelation is called Moran’s \(I\), which assesses correlation of a gene with itself given a defined weight matrix \(w\) (Moran 1950).

Exercise

Extract the top 200 highly variable genes in this dataset with scrapper::modelGeneVariances() and scrapper::chooseHighlyVariableGenes
Calculate the global Moran’s \(I\) coefficient with the function Voyager::runUnivariate() on these 200 highly variable genes. You can find information and examples on how to run this function by typing ?runUnivariate in your console
Extract the top 3 genes with the highest Moran’s \(I\) value
Plot the expression of these three genes in space using Voyager::plotSpatialFeature

Answer

First, we model the variance on a subset of the genes for computational reasons Then, we extract the 200 most highly variable genes

gene.var <- modelGeneVariances(logcounts(sfe), 
                               num.threads = 1)
# chooseHighlyVariableGenes() expects the residual variance statistics
hvg.index <- chooseHighlyVariableGenes(gene.var$statistics$residuals, 
                                       top = 100)

#add the hvgs to the `sfe` - adapted from https://github.com/libscran/scrapper/blob/master/R/se_chooseRnaHvgs.R
is.hvg <- logical(nrow(gene.var$statistics))
is.hvg[hvg.index] <- TRUE
gene.var[["hvg"]] <- is.hvg
rowData(sfe) <- cbind(rowData(sfe), gene.var)
#extract the gene names of the hvgs
hvg <- rownames(sfe)[is.hvg]

Next, we calculate global Moran’s I for these 200 genes

sfe <- runUnivariate(sfe, 
    type="moran", 
    features=hvg, 
    colGraphName="knn5")

We order them and plot the gene expression top 3 genes wiht the plotSpatialFeature function from the Voyager package.

I <- rowData(sfe)$moran_sample01
o <- order(I, decreasing=TRUE)
topGenes <- rownames(sfe)[head(o, 3)]
plotSpatialFeature(sfe, topGenes, ncol=3)

Local indicators of spatial association

In the section above, we have computed a global measure of spatial association, meaning that we get one number per field-of-view. This is an average of the local contributions and neglects the potential underlying heterogeneity completely.

Therefore, we will investigate local indicators of spatial association (Anselin 1995).

In the case of Moran’s \(I\), we will now not calculate one measure for the entire field-of-view but rather one metric per location/spot \(i\): \(I_i\).

Exercise

Calculate local Moran’s \(I\) for the top three global Moran’s \(I\) genes from above using the function Voyager::runUnivariate()
Plot the local Moran’s \(I\) values in space using Voyager::plotLocalResult

Answer

First, we have to calculate the metric on the first most autocorrelated genes.

sfe <- runUnivariate(sfe,
                     features = topGenes,
                     colGraphName = "knn5",
                     type = "localmoran")

Then, we plot the local Moran’s \(I\) values in space.

plotLocalResult(sfe, "localmoran",
                features = topGenes, 
                ncol = 3,
                colGeometryName = "centroids",
                divergent = TRUE, 
                diverge_center = 0)

Discretising regions with Moran’s scatterplot

The continuous local Moran’s \(I_i\) measurements can be difficult to interpret. A simplification is Moran’s scatter plot (Anselin 2019). The idea is to compare the Moran’s \(I\) of the spot \(i\) itself with its neighbours \(j\). Like this we obtain four options: high \(I_i\) next to neighbours that have high \(I_j\), high \(I_i\) next to neighbours that have low \(I_j\) etc. This separates outliers (high-low and low-high) from values within more homogeneous regions (high-high and low-low).

Bonus Exercise

Calculate Moran’s scatterplot with runUnivariate on the top three global Moran’s \(I\) genes from above
Plot the Moran’s scatterplot results in space using Voyager::plotLocalResult

Answer

First, we have to calculate the metric on the first three most variable HVGs.

sfe <- runUnivariate(
    sfe,
    features =  topGenes,
    colGraphName = "knn5",
    type = "moran.plot"
  )

We will show the Moran’s scatter plot for the gene PIGR.

# we will show one Moran's scatterplot
moranPlot(sfe, topGenes[1], graphName = "knn5", swap_rownames = "symbol") + theme_classic()

The grey dashed lines indicate the mean expression of PIGR in the spots themselves and among their neighbours. According to these means, the spots are categorised into 4 categories, which can be plotted back in space. We notice that there is an interesting part of the tissue that constitutes as “high-high” for all three genes in the same region (bottom left).

plotLocalResult(
    sfe,
    name = "localmoran",
    features = topGenes,
    ncol = 3,
    attribute = "mean",
    colGeometryName = "centroids"
)

We can also filter on the p-values for non-significant cells

localResults(sfe)[["localmoran"]][["PIGR"]] <- 
  localResults(sfe)[["localmoran"]][["PIGR"]] |>
  mutate(locClust = ifelse(`-log10p_adj` > -log10(0.05), as.character(mean), "non-siginificant"))

plotLocalResult(
    sfe,
    name = "localmoran",
    features = "PIGR",
    ncol = 1,
    attribute = "locClust",
    colGeometryName = "centroids"
)

Cell type-based analysis

In the cell type-based analysis chapter, we will compare the distribution of categorical cell types in space using point pattern analysis. Let us first plot the cell types in space

plotSpatialFeature(sfe, 
    features="SingleR_label",
    colGeometryName="centroids",
    size = 1) +
    guides(col=guide_legend(override.aes=list(size=2))) +
    scale_color_manual(values = ctc)

Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.

Point pattern analysis

Next, we will compute a measure of spatial correlation across scales for all pairwise cell type combinations called Ripley’s cross \(K\) function using the R/Bioconductor package spatialFDA (Emons et al. 2026, 2025). Ripley’s cross \(K\) function is defined as (Baddeley et al. 2016).

\[ K(r) = \frac{1}{\lambda_j} \mathbb{E} [t(u,r,X^{j}) \mid u \in X^{i}], \]

Ripley’s cross \(K\) function can be variance stabilised, which is then called Besag’s cross \(L\) (Baddeley et al. 2016). This is the measure we will compute for our dataset.

\[ L(r) = \sqrt{\frac{K(r)}{\pi}} \]

We will focus on three cell types to make the results more interpretable

resCross <- calcCrossMetricPerFov(
  sfe,
  correction = "iso",
  selection = c("Tumor", "T cells", "B cells"),
  subsetby = "sample_id",
  fun = "Lcross", 
  marks = "SingleR_label",
  rSeq = seq(0, 100, length.out = 100), by = c("sample_id"),
  ncores = 1
)

plotCrossMetricPerFov(
  resCross,
  theo = TRUE,
  correction = "iso",
  x = "r",
  imageId = 'sample_id',
  linewidth = 1
)

[[1]]

Exercise

What do you learn from these plots in combination with the distribution of cell types in space above?

Answer

We note a few things:

All cells cluster among themselves and across scales (the diagonal of the plot)
All cells show spacing (below the dashed line) of radii \(r\) smaller than 15 \(\mu\)m \(\rightarrow\) cell diameter
Immune cells build aggregates
Tumor cells show spacing in comparison to immune cells, there seems to be no immune invasion into the tumor

Multicellular analysis

Reconstruction of the Tumor region

Next, we will continue with multicellular analysis. The R/Bioconductor package sosta is designed for structure-based analysis of multicellular anatomical structures. For example in spatial omics, “structures” could be germinal centers (GCs) in the tonsil, metastases or tertiary lymphoid structures (TLS) in cancer (Gunz et al. 2025).

In sosta, the reconstruction of anatomical structures is based on the density profile of a marked point pattern. Based on an estimated threshold of the density profile the structures are segmented.

First, we will have a look at the point pattern intensity of the mark of interest. The histogram shows the distribution of intensity values of the pixel image on the left.

# Some of the later functions need to have an colData column with a image name identifier.
# We will define this here.
sfe$imageName <- "image1"

# Let us first look
shapeIntensityImage(
    sfe,
    marks = "SingleR_label",
    markSelect = "Tumor"
)

We can use the following function to segment the Tumor region. The result is an sf polygon. Internally, the functions estimate the bandwidth of the smoothing kernel (bndw) and the intensity threshold (thres) based on the distribution showed above.

tumor <- reconstructShapeDensitySPE(
  sfe,
  marks = "SingleR_label",
  markSelect = "Tumor",
  imageCol = "imageName")

We can directly plot the resulting polygon using geom_sf. Note that the argument inherit.aes = FALSE is important for the function to work correctly.

plotSpatialFeature(sfe, features = "SingleR_label", colGeometryName = "centroids") +
  guides(col = guide_legend(override.aes = list(size = 3))) +
  geom_sf(data = tumor, fill = NA, color = "black", linewidth = 0.75,
          inherit.aes = FALSE) +
  scale_color_manual(values = ctc)

Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.

Exercise

Do you think this is a good reconstruction / segmentation?

Answer

Overall the segmentation seems to have worked fine. However, we note a few things: 1. There is one very small regions with a few Tumor cells present. 2. There might be some under-segmentation, i.e., there are some Myeloid and Endothelial cells included in the Tumor regions. As they are part of the Tumor micro-environment we could also include them in our segmentation / definition of tumor regions. In general, the correct segmentation depends heavily on the research question. Using sosta we have a data driven way of segmentation that is not influenced by subjective annotation bias.

Cell-based analysis

For later analyses, we will assign the cells to each tumor region.

# Define unique colnames by numbering the cells
colnames(sfe) <- paste0("cell_", c(1:dim(sfe)[2]))
assign <- assingCellsToStructures(sfe, tumor, imageCol = "imageName")

# Assign using the correct order of the columns in the spe object
sfe$tumorAssign <- assign[colnames(sfe)]

Distance to tumor border

We can now find the distance to the closest structure / tumor boundary using the function minBoundaryDistances. As the function is defined to work with multiple samples, we have to give the imageCol that we defined above.

sfe$minDist <- minBoundaryDistances(
    spe = sfe, imageCol = "imageName",
    structColumn = "tumorAssign", allStructs = tumor
)

Again, we can plot the result. Note that we have boundary effects due to the FOV, cells that are close to the field-of-view boundary might actually be deep inside the tumor.

plotSpatialFeature(sfe, features="minDist", colGeometryName="centroids") +
  scale_colour_gradient2(low = "darkred", high = "navy",)

Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.

We can now visualize the density distribution with respect to the tumor borders in a violin plot. Negative values correspond to cells within a tumor structure.

colData(sfe) |>
  as.data.frame() |>
  mutate(SingleR_label = reorder(SingleR_label, minDist, FUN = median)) |>
  ggplot(aes(x = minDist, y = SingleR_label, fill = SingleR_label)) +
  geom_violin() +
  theme_bw() +
  geom_vline(xintercept = 0, linetype = "dashed") +
  scale_fill_manual(values = ctc) +
  theme(legend.position = "none")

Exercise

What do you notice?

Answer

As expected, almost all of the tumor cells have negative distances (within the tumor structures). There are some Myeloid, Endothelial and T cells within the tumor regions and in the near vicinity. This hints at the composition of the tumor microenvironment. As outlined above, the values depend on the exact segmentation and definition of the tumor region.

Cell-type proportions

Next, we look at the absolute and relative cell type proportions in each tumor region.

sel <- colData(sfe) |>
  as.data.frame() |>
  count(tumorAssign) |>
  filter(!is.na(tumorAssign), n > 50) |> # select regions with at least 50 cells
  pull(tumorAssign)

df <- colData(sfe) |>
  as.data.frame() |>
  filter(tumorAssign %in% sel) |>
  count(tumorAssign, SingleR_label)

p0 <- ggplot(df, aes(reorder(tumorAssign, n),
                     x = n, fill = SingleR_label)) +
  theme_bw() +
  scale_fill_manual(values = ctc)

p1 <- p0 +
  geom_col() +
  labs(x = "# cells", y = "region")

p2 <- p0 +
  geom_col(position = "fill") +
  labs(x = "% cells", y = "region") +
  scale_x_continuous(labels = scales::percent)

patchwork::wrap_plots(p1, p2, guides = "collect")

Parts of the multi-cellular section were adapted from the following vignette.

Additional Material

Multi-sample analysis

This chapter so far showed how to perform lattice-data analysis for one sample. Lattice data analysis is not yet very common in multi-sample analyses. One option is to compute a measure of global spatial association giving one numerical value per field-of-view. First, we will load 4 slides from the dataset (2 healthy slides and 2 cancerous slides).

sfeMult <- readObject("data/day3/03.1_sfeMultNorm")

>>> Reading SpatialExperiment

>>> Reading colgeometries

The loaded data is already preprocessed. The chunk below shows how this was done, you don’t have to rerun it.

Don’t run

# identify mitochondrial genes
gn <- rowData(sfeMult)$Symbol
mt <- grepl("^MT-", gn, ignore.case = TRUE)
table(mt)
# remove them
sfeMult <- sfeMult[!mt, ]

#calculate the size factors
sf <- centerSizeFactors(colSums(assay(sfeMult)))
#normalise the data
normed <- scrapper::normalizeCounts(assay(sfeMult), size.factors = sf)
logcounts(sfeMult) <- normed

Next, we will define a neighbourhood on which we want to compute spatial statistics metrics. For computational reasons, we will consider a \(k\) nearest neighbourhood around the cells.

allGraphs <- findSpatialNeighbors(sfeMult,
    method = "knearneigh", # wraps the spdep function with the same name
    k = 5,
    zero.policy = TRUE,
    sample_id = "all"
  )

colGraphs(sfeMult, "knn5", sample_id = "all") <- allGraphs
sfeMult

class: SpatialFeatureExperiment 
dim: 18034 56120 
metadata(0):
assays(2): counts logcounts
rownames(18034): SAMD11 NOC2L ... KDM5D EIF1AY
rowData names(3): ID Symbol Type
colnames(56120): s_016um_00145_00029-1.Normal_P5
  s_016um_00165_00109-1.Normal_P5 ... s_016um_00122_00096-1.Normal_P3
  s_016um_00127_00062-1.Normal_P3
colData names(7): row col ... array_row array_col
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor

unit: 
Geometries:
colGeometries: centroids (POINT) 

Graphs:
Normal_P5: col: knn5
Cancer_P5: col: knn5
Cancer_P1: col: knn5
Normal_P3: col: knn5

Question

Given you have now a SpatialFeatureExperiment object with four samples, how would you compare a global indicator of spatial association across conditions, possibly with multiple samples? As an example, calculate global univariate Moran’s \(I\) for the gene “PIGR”.
An important flag in this question is sample_id = "all". For reasons of having isolated cell patches, consider a \(k\)-NN neighbourhood this time instead of findVisiumGraph.

Answer

First, we need to calculate univariate Moran’s \(I\) for all samples

sfeMult <- runUnivariate(sfeMult,
                     features = "PIGR",
                     colGraphName = "knn5",
                     exprs_values = "logcounts",
                     sample_id = "all",
                     type = "moran.mc",
                     nsim = 200)

res <- rowData(sfeMult)["PIGR",]
res

DataFrame with 1 row and 27 columns
                  ID      Symbol            Type moran.mc_statistic_Normal_P5
         <character> <character>        <factor>                    <numeric>
PIGR ENSG00000162896        PIGR Gene Expression                     0.665933
     moran.mc_parameter_Normal_P5 moran.mc_p.value_Normal_P5
                        <numeric>                  <numeric>
PIGR                          201                 0.00497512
     moran.mc_alternative_Normal_P5 moran.mc_method_Normal_P5
                        <character>               <character>
PIGR                        greater    Monte-Carlo simulati..
                         moran.mc_res_Normal_P5 moran.mc_statistic_Cancer_P5
                                         <list>                    <numeric>
PIGR  7.56633e-05, 1.03790e-02,-2.57316e-03,...                     0.662601
     moran.mc_parameter_Cancer_P5 moran.mc_p.value_Cancer_P5
                        <numeric>                  <numeric>
PIGR                          201                 0.00497512
     moran.mc_alternative_Cancer_P5 moran.mc_method_Cancer_P5
                        <character>               <character>
PIGR                        greater    Monte-Carlo simulati..
                      moran.mc_res_Cancer_P5 moran.mc_statistic_Cancer_P1
                                      <list>                    <numeric>
PIGR 0.006536053,0.001089384,0.000175518,...                     0.839516
     moran.mc_parameter_Cancer_P1 moran.mc_p.value_Cancer_P1
                        <numeric>                  <numeric>
PIGR                          201                 0.00497512
     moran.mc_alternative_Cancer_P1 moran.mc_method_Cancer_P1
                        <character>               <character>
PIGR                        greater    Monte-Carlo simulati..
                   moran.mc_res_Cancer_P1 moran.mc_statistic_Normal_P3
                                   <list>                    <numeric>
PIGR 0.00194370,0.00335518,0.00273481,...                     0.554004
     moran.mc_parameter_Normal_P3 moran.mc_p.value_Normal_P3
                        <numeric>                  <numeric>
PIGR                          201                 0.00497512
     moran.mc_alternative_Normal_P3 moran.mc_method_Normal_P3
                        <character>               <character>
PIGR                        greater    Monte-Carlo simulati..
                      moran.mc_res_Normal_P3
                                      <list>
PIGR  0.00208629,-0.01153171, 0.00286547,...

In order to model these measures, we have to convert the data into long format

df_long <- res |> as.data.frame() |>
  tibble::rownames_to_column("gene") |>
  pivot_longer(
    cols = starts_with("moran.mc_statistic"),
    names_to = c("measure", "sample"),
    names_pattern = "(.*)statistic_(.*)",
    values_to = "morans"
  ) |> separate(sample, into = c("condition", "replicate"), sep = "_") |> 
  select(c( "gene", "condition",  "replicate", "measure", "morans"))

df_long

# A tibble: 4 × 5
  gene  condition replicate measure   morans
  <chr> <chr>     <chr>     <chr>      <dbl>
1 PIGR  Normal    P5        moran.mc_  0.666
2 PIGR  Cancer    P5        moran.mc_  0.663
3 PIGR  Cancer    P1        moran.mc_  0.840
4 PIGR  Normal    P3        moran.mc_  0.554

table(df_long$condition, df_long$replicate)

        
         P1 P3 P5
  Cancer  1  0  1
  Normal  0  1  1

We see that we have a nested variance structure in this data. We have two conditions “Normal” and “Cancer”. We note that the patient identifiers are not unique, meaning that we have multiple samples from the same patient.

In order to account for this dependence, we have to model the data with a mixed-effects model. One option is the lme4 package in R.

#convert condition to a factor and relevel to have Normal as comparison group
df_long$condition <- as.factor(df_long$condition) |>
  relevel(ref = "Normal")
mdl <- lmer(morans ~ condition + (1|replicate), data = df_long)

boundary (singular) fit: see help('isSingular')

summary(mdl)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: morans ~ condition + (1 | replicate)
   Data: df_long

REML criterion at convergence: -2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-0.8451 -0.6123  0.0000  0.6123  0.8451 

Random effects:
 Groups    Name        Variance Std.Dev.
 replicate (Intercept) 0.00000  0.0000  
 Residual              0.01096  0.1047  
Number of obs: 4, groups:  replicate, 3

Fixed effects:
                Estimate Std. Error      df t value Pr(>|t|)  
(Intercept)      0.60997    0.07402 2.00000   8.241   0.0144 *
conditionCancer  0.14109    0.10467 2.00000   1.348   0.3101  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)
conditnCncr -0.707
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')

We note that in this example there is no difference in the global Moran’s \(I\) value for the gene “PIGR”. This not surprising since a model with 3 independent samples is strongly under-powered.

An important step in interpreting a statistical model is looking at some model diagnostics. We will plot the QQ plot of this mixed effects model. We note that the four samples follow the theoretical line well. But we notice again, that having only four samples is a great limitation of this dataset.

qqnorm(resid(mdl), pch = 16)
qqline(resid(mdl))

Gene expression within structures

Now we will subset to only the tumor regions.

sfe_sub <- sfe[, sfe$tumorAssign %in% sel]

Warning in sn2listw(df, style = style, zero.policy = zero.policy,
from_mat2listw = TRUE): neighbour object has 7 sub-graphs

Warning in mat2listw(m_wts, style = style, zero.policy = zero.policy):
neighbour object has 7 sub-graphs

We will pseudo-bulk within each tumor and do PCA.

# pseudo-bulk at the structure level
agg <- scrapper::aggregateAcrossCells(counts(sfe_sub), list(groups = sfe_sub$tumorAssign))
# normalize counts
nc <- normalizeCounts(agg$sums, size.factors = colSums(agg$sums))
# run PCA
pcaTum <- runPca(nc)

Let’s look at the three top features (genes) for each principle component.

# Top features for PCs 1 to 3
apply(pcaTum$rotation[, 1:3], 2, \(.) {
    . <- sort(., decreasing=TRUE)
    head(names(.), 10)
})

      [,1]      [,2]      [,3]     
 [1,] "COL3A1"  "COL1A1"  "PYGB"   
 [2,] "COL1A1"  "COL3A1"  "MT-ND4" 
 [3,] "SELENOP" "TMSB4X"  "MT-CYB" 
 [4,] "SPARC"   "MT-CO1"  "FABP1"  
 [5,] "VIM"     "MT-CYB"  "CEACAM5"
 [6,] "GPCPD1"  "FTH1"    "MT-ATP6"
 [7,] "CD74"    "MT-ATP6" "GPCPD1" 
 [8,] "COL1A2"  "MT-ND4"  "MT-ND3" 
 [9,] "FN1"     "COL1A2"  "MT-CO1" 
[10,] "PYGB"    "SPARC"   "SELENOP"

Let’s look at some features in the subset and the whole FOV.

plotSpatialFeature(sfe_sub, c("COL3A1", "FTH1", "PYGB"), ncol=3)

plotSpatialFeature(sfe, c("COL3A1", "FTH1", "PYGB"), ncol=3)

Exercise

What do you notice?

Answer

Interestingly, several top-loaded genes for these PCs are strongly associated with non-tumor cell types. This suggests that the PCA is capturing variance that is driven by non-tumor cells within the tumor regions. Because the expression to these cells deviate so drastically from the dominant pseudo-bulk tumor profile, PCA identifies them as an important axis of variation. This could either result from the over-segementation discussed above or indicate some interesting biology or mixing patterns at the border (as we have seen in the distance density plots). This would be interesting to investigate further, e.g., check which cell types dominantly express these markers.

Bonus Exercise

Rerun the segmentation with a higher intensity threshold and rerun the PCA analysis. Do you notice a change in the top genes?

Answer

To generate a more refined segmentation, the intensity threshold in reconstructShapeDensitySPE has to be increased to around 2.75E-4. The rest follows as above.

reconstructShapeDensitySPE(
  sfe,
  marks = "SingleR_label",
  markSelect = "Tumor",
  imageCol = "imageName",
  thres = 2.75E-4
)

You should see some change, but the non-tumor genes are still picked up. This further indicates that we might see an interesting biological pattern.

Save the object

alabaster.sfe::saveObject(sfe, "results/day3/03.1_sfe")

References

Anselin, Luc. 1995. “Local Indicators of Spatial Association—LISA.” Geographical Analysis 27 (2): 93–115.

Anselin, Luc. 2019. “The Moran Scatterplot as an ESDA Tool to Assess Local Instability in Spatial Association.” In Spatial Analytical Perspectives on GIS. Routledge.

Aran, Dvir, Agnieszka P Looney, Leqian Liu, et al. 2019. “Reference-Based Analysis of Lung Single-Cell Sequencing Reveals a Transitional Profibrotic Macrophage.” Nature Immunology 20 (2): 163–72.

Baddeley, Adrian, Ege Rubak, Rolf Turner, et al. 2016. Spatial Point Patterns: Methodology and Applications with r. Vol. 1. CRC press Boca Raton.

Emons, Martin, Samuel Gunz, Helena L Crowell, et al. 2025. “Harnessing the Potential of Spatial Statistics for Spatial Omics Data with Pasta.” Nucleic Acids Research 53 (17): gkaf870.

Emons, Martin, Fabian Scheipl, Samuel Gunz, Elizabeth Purdom, and Mark D Robinson. 2026. “Differential Co-Localisation Analysis of Multi-Sample and Multi-Condition Experiments with spatialFDA.” bioRxiv, 2026–04.

Gunz, Samuel, Helena L Crowell, and Mark D Robinson. 2025. “Analysis of Anatomical Multi-Cellular Structures from Spatial Omics Data Using Sosta.” bioRxiv, 2025–10.

Moran, Patrick AP. 1950. “Notes on Continuous Stochastic Phenomena.” Biometrika 37 (1/2): 17–23.

Moses, Lambda, Pétur Helgi Einarsson, Kayla Jackson, et al. 2023. “Voyager: Exploratory Single-Cell Genomics Data Analysis with Geospatial Statistics.” BioRxiv.

Rao, Anjali, Dalia Barkley, Gustavo S França, and Itai Yanai. 2021. “Exploring Tissue Architecture Using Spatial Transcriptomics.” Nature 596 (7871): 211–20.