Exercise 6

Spatial Variable Genes (SVGs) and Feature-set Signatures

In this practical exercise, we focus on spatially structured gene expression, pathway-level signatures. We will calculate SVGs using two families of methods, compare SVGs to HVGs, understand how spatial information influences downstream analysis, and quantify pathway activities using gene-set scoring.

Learning Objectives

By the end of this exercise, you will be able to:

  • Identify SVGs with two types of methods
  • Compare SVGs to HVGs
  • Score cells/spots with gene-set signatures and interpret spatial pathway patterns

Libraries

library(HDF5Array) 
library(SpatialExperiment)
library(scran) # HVG selection

library(nnSVG)   # SVGs
library(DESpace) # SVGs 

library(AUCell)  # Gene set scoring
library(msigdbr) # Gene sets
library(pals)    # Color palette

library(ggplot2)
library(ggspavis)
library(patchwork)
library(pheatmap)

Load and Prepare Data

We begin by loading the SpatialExperiment object prepared in the previous exercise containing clustering results.

spe <- loadHDF5SummarizedExperiment(dir="results/day1", prefix="01.4_spe")

Spatially variable genes (SVGs)

In Day 1, we performed the analyses using the set of most highly variable genes (HVGs), which capture overall transcriptional variability and are useful for identifying major sources of variation and reducing dimensionality. HVGs reflect global heterogeneity, but they do not use spatial information.

In spatial transcriptomics, however, we are often interested in genes whose expression changes across tissue in structured spatial patterns. These are called spatially variable genes (SVGs), which conceptually generalize HVGs by adding spatial information.

SVG inference methods differ in how they use spatial coordinates and what types of spatial structure they detect. They can be broadly grouped into three categories (Yan, Hua, and Li 2025):

  • Overall SVGs detect any spatial pattern across the tissue; can be used for feature selection for downstream analyses (e.g., spatially-aware clustering)

  • Spatial domain-specific SVGs identify genes whose expression differs between spatial domains, using domains as a proxy for spatial information

  • Cell type-specific SVGs use external cell type annotations to identify spatially structures expression within cell types.

In this exercise, we will focus on the first two categories, using nnSVG(Weber et al. 2023) for overall SVGs and DESpace(Cai, Robinson, and Tiberi 2024) for domain-specific SVGs.

set.seed(123)
# Sample 100 spots to decrease runtime in this exercise
n <- 100
sub <- spe[, sample(ncol(spe), n)]
rowData(sub)$gene_name <- rowData(sub)$Symbol

# Subset to a subset of 200 HVGs, to further decrease runtime
hvg_idx <- which(rowData(spe)$hvg)[seq_len(200)]
sub <- sub[hvg_idx, ]

# Remove genes with all zeros on this subset of spots
keep_genes <- rowSums(counts(sub)) > 0
sub <- sub[keep_genes, ]

# Remove lowly expressed genes (expressed in less than 1% of the spots)
keep_genes2 <- rowMeans(counts(sub) > 0) >= 0.01 
sub <- sub[keep_genes2, ]
dim(sub)
[1] 199 100
## Renormalize on this subset of genes 
sub <- logNormCounts(sub)

nnSVG

nnSVG detects SVGs by modeling spatial gene expression using Gaussian processes, capturing smooth gradients or spatial trends.

Note: nnSVG may take up to ~10 minutes to run. Feel free to take a short break and grab a coffee! While it is running, take a look at the nnSVG paper and the package vignette to better understand how the method works, including its handling of covariates and multiple samples.

set.seed(123)
sub <- nnSVG(sub) 
Exercise 1

Inspect the nnSVG output stored in rowData(sub).

  • What do the result columns represent?
  • Select the top SVGs detected by nnSVG.
# Select top SVGs
res_nnSVG <- rowData(sub) 
res_nnSVG <- res_nnSVG[order(res_nnSVG$pval), ]
head(res_nnSVG, 3)
DataFrame with 3 rows and 20 columns
                    ID      Symbol            Type subsets_Mito       hvg
           <character> <character>        <factor>    <logical> <logical>
PIGR   ENSG00000162896        PIGR Gene Expression        FALSE      TRUE
LEFTY1 ENSG00000243709      LEFTY1 Gene Expression        FALSE      TRUE
CXCR4  ENSG00000121966       CXCR4 Gene Expression        FALSE      TRUE
         gene_name  sigma.sq      tau.sq       phi    loglik   runtime
       <character> <numeric>   <numeric> <numeric> <numeric> <numeric>
PIGR          PIGR  2.392157 2.39216e-08   6.99998 -149.7420     0.033
LEFTY1      LEFTY1  0.192993 1.92993e-09  10.94084  -37.4109     0.026
CXCR4        CXCR4  0.802186 3.41954e-01   1.92394 -115.1814     0.023
            mean       var     spcov   prop_sv loglik_lm   LR_stat      rank
       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
PIGR    0.715634  2.327622   2.16124  1.000000 -183.6337   67.7834         1
LEFTY1  0.141431  0.230258   3.10618  1.000000  -67.9637   61.1056         2
CXCR4   0.677274  1.028289   1.32243  0.701126 -142.7862   55.2095         3
              pval        padj
         <numeric>   <numeric>
PIGR   1.88738e-15 3.75588e-13
LEFTY1 5.38458e-14 5.35766e-12
CXCR4  1.02662e-12 6.80993e-11
top_nnSVG <- res_nnSVG$gene_name[seq_len(3)]

The main output is in rowData of the SpatialExperiment object, which includes: the likelihood ratio statistic (LR_stat), the gene’s rank based on this statistic (rank), raw p-values from a chi-squared test with 2 degrees of freedom (pval), p-values adjusted for multiple testing (padj), and the estimated effect size (prop_sv), defined as the proportion of spatial variance relative to total variance.

If you did not run nnSVG due to runtime constraints, you may use these pre-computed top SVGs:

top_nnSVG <- c("PIGR", "LEFTY1", "CXCR4")

DESpace

DESpace detects genes that show significantly higher/lower expression in one spatial domain compared to the rest of the tissue. To run DESpace, we need cluster labels or manual annotation. Here we use pre-computed spatial clusters from the day1, exercise 4.

res <- svg_test(sub, cluster_col="Banksy", verbose = TRUE)
using 'svg_test' for spatial gene/pattern detection.
Filter low quality genes: 
min_counts = 20; min_non_zero_spots = 10.
The number of genes that pass filtering is 57.
single sample test
Exercise 2

Have a look at the DESpace results.

  • What do each element of the returned list represent?
  • Select top SVGs detected by DESpace.
head(res_DESpace <- res$gene_results, 3)
       gene_id       LR   logCPM       PValue          FDR
PIGR      PIGR 370.9333 17.04892 2.159632e-74 1.230990e-72
COL3A1  COL3A1 211.9558 17.26972 1.026703e-40 2.926105e-39
EPCAM    EPCAM 108.7656 16.29632 2.622142e-19 4.982069e-18
top_DESpace <- res_DESpace$gene_id[seq_len(3)]

The main output is in gene_results: a data.frame containing: gene name (gene_id), likelihood ratio test statistics (LR), average (across spots) log-2 counts per million (logCPM), raw p-values (PValue) and Benjamini-Hochberg adjusted p-values (FDR).

Comparing HVGs and SVGs

As we did yesterday, we first model gene variance using a Poisson noise model with scran::modelGeneVarByPoisson() and select the top HVGs.

dec <- modelGeneVarByPoisson(spe)
top_HVG <- row.names(dec[order(dec$bio, decreasing = T), ])[1:3]
top_HVG
[1] "IGKC"   "MT-CO3" "MT-CO2"
Exercise 3
  • Visualize the expression patterns of the top genes from each list via ggspavis::plotCoords().
  • What types of biological signals does each category capture?
gs <- c(
    nnSVG=top_nnSVG,
    DESpace=top_DESpace,
    HVGs=top_HVG)
# Expression plots for each top gene
ps <- lapply(seq_along(gs), \(.) {
    plotCoords(spe,
        point_size=0,
        annotate=gs[.],
        assay_name="logcounts",
        feature_names="Symbol")
})
wrap_plots(ps, nrow=3) & theme(
    legend.key.width=unit(0.4, "lines"),
    legend.key.height=unit(0.8, "lines")) &
    scale_color_gradientn(colors = pals::parula())
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.

Bonus question

Try using SVGs for dimensionality reduction (e.g., PCA) and clustering (e.g., Leiden clustering, BayesSpace, Banksy) as seen yesterday (exercises 3 and 4). How do clustering results differ compared to using HVGs?

Feature-set signatures

So far, we have focused on single genes. We now consider gene set signatures, which summarize activity of pathways or biological program.

Here we will use AUCell(Aibar et al. 2017), which works in two main steps: (i) rank genes for every cell/spot, and (ii) compute an AUC score per (spot, gene set) pair, roughly reflecting the fraction of high-ranking genes that belong to a given set. High AUC values correspond to high coordinated expression of the gene set.

Retrieve MSigDB Hallmark gene sets

# Retrieve hallmark gene sets from 'MSigDB'
db <- msigdbr(species="Homo sapiens", category="H")
Warning: The `category` argument of `msigdbr()` is deprecated as of msigdbr 10.0.0.
ℹ Please use the `collection` argument instead.
# Get list of gene symbols, one element per set
gs <- split(db$ensembl_gene, db$gs_name)

# Simplify set identifiers (drop prefix, use lower case)
names(gs) <- tolower(gsub("HALLMARK_", "", names(gs)))
Exercise 4

How many gene sets are there? What is the range of gene counts across these sets?

length(gs)
[1] 50
range(sapply(gs, length))
[1]  32 201

Gene set scoring with AUCell

# Realize (sparse) gene expression matrix
mtx <- as(logcounts(spe), "dgCMatrix") 

# Use ensembl identifiers as feature names
rownames(mtx) <- rowData(spe)$ID

# Filter for genes represented in panel
.gs <- lapply(gs, intersect, rownames(mtx))
# Keep only those with at least 5 genes
.gs <- .gs[sapply(.gs, length) >= 5]

# Build per-spot gene rankings
rnk <- AUCell_buildRankings(mtx, plotStats=FALSE, verbose=FALSE)

# Calculate AUC for each gene set in each spot
auc <- AUCell_calcAUC(geneSets=.gs, rankings=rnk, verbose=FALSE)

# Add results as spot metadata
colData(spe)[rownames(auc)] <- res <- t(assay(auc)) 
Exercise 5
  • Calculate the percentage of spots with non-zero AUC scores for each gene set (Hint: use rowMeans()).
  • Which gene sets are rarely detected? Which ones are mostly detected?
  • Identify the gene set with the highest score variability across spots (Hint: corVars()).
  • Visualize this top-scoring set in tissue space via ggspavis::plotCoords().
  • Visualize how signature scores correlated with one another (Hint: compute Spearman correlations and visualize it with pheatmap()).
# The percentage of spots with non-zero score across signatures
fq <- rowMeans(assay(auc) > 0)
fq <- sort(round(100*fq, 2))
## Rarely detected
head(fq)
       pancreas_beta_cells            notch_signaling 
                     80.86                      83.17 
        hedgehog_signaling wnt_beta_catenin_signaling 
                     88.61                      91.68 
            apical_surface               angiogenesis 
                     95.42                      95.66 
## Mostly detected
tail(fq)
oxidative_phosphorylation               p53_pathway   tnfa_signaling_via_nfkb 
                      100                       100                       100 
           uv_response_dn            uv_response_up     xenobiotic_metabolism 
                      100                       100                       100 
# Subset on highest score variability across spots
## Variance across spots
var <- colVars(res) 
## Top sets
top <- names(tail(sort(var), 1)) 

# Scale the rows (genes) for better visualization in the heatmap.
spe[[top]] <- scale(spe[[top]]) 
plotCoords(spe, annotate = top) 

cm <- cor(t(assay(auc)), method="spearman")
pheatmap(cm, 
    breaks=seq(-1, 1, 0.1), 
    color=pals::coolwarm(20), 
    cellwidth=10, cellheight=10)

Highly correlated sets will exhibit similar spatial patterns.

Clear your environment:

rm(list = ls())
gc()
.rs.restartR()
Important

Key Takeaways:

  • HVGs and SVGs emphasize different aspects of the data: HVGs (global heterogeneity); SVGs (spatially structured patterns).
  • Different SVG methods detect different types of structure.
  • Gene set signatures summarize coordinated activity of biological pathways. They can be mapped across the tissue and compared between regions or clusters.

References

Aibar, Sara, Carmen Bravo González-Blas, Thomas Moerman, Vân Anh Huynh-Thu, Hana Imrichova, Gert Hulselmans, Florian Rambow, et al. 2017. “SCENIC: Single-Cell Regulatory Network Inference and Clustering.” Nature Methods 14 (11): 1083–86.
Cai, Peiying, Mark D Robinson, and Simone Tiberi. 2024. “DESpace: Spatially Variable Gene Detection via Differential Expression Testing of Spatial Clusters.” Bioinformatics 40 (2): btae027.
Weber, Lukas M, Arkajyoti Saha, Abhirup Datta, Kasper D Hansen, and Stephanie C Hicks. 2023. “nnSVG for the Scalable Identification of Spatially Variable Genes Using Nearest-Neighbor Gaussian Processes.” Nature Communications 14 (1): 4059.
Yan, Guanao, Shuo Harper Hua, and Jingyi Jessica Li. 2025. “Categorization of 34 Computational Methods to Detect Spatially Variable Genes from Spatially Resolved Transcriptomics Data.” Nature Communications 16 (1): 1141.