library("SpatialExperiment")
library("ggplot2")
library("Voyager")
library("HDF5Array")
library("tidyr")
library("dplyr")
library("lme4")
library("lmerTest")
library("spatialFDA")
library("scrapper")
library("dittoSeq")
library("sosta")
library("alabaster.sfe")
set.seed(1234)Exercise 8 : Multi-scale analysis
Learning objectives
- Learn the basic principles of exploratory spatial statistics
- Compute and visualise local and global measures of spatial autocorrelation
- Compute and interpret point pattern analysis metrics on cell-based data
- Primer to analysis of multicellular anatomical structures
- Compare spatial statistics measures across samples using linear mixed models
- This chapter is based on the
OSTAonline book chapter on spatial statistics and uses heavily theVoyagerspatial statistics library (Moses et al. 2023)
Libraries
knitr::opts_chunk$set(cache.lazy = FALSE)Load data
# Load the SpatialExperiment objectq
sfe <- readObject("results/day2/02.4_sfe/")>>> Reading SpatialExperiment
>>> Reading colgeometries
In spatial statistics, we often compare categorical labels across space. Useful categorical labels for spatial omics datasets include non-spatial clusters or cell types (Aran et al. 2019).
Data modalities of spatial statistics
In spatial transcriptomics, we differentiate between imaging-based and sequencing-based technologies, which give rise to very different data modalities (Rao et al. 2021).
Our dataset was acquired using 10x Visium HD. The data modality of this technology can be described by a regular lattice, which means that the observations are taken at regularly spaced intervals. This is very different to imaging-based approaches where observations can be thought of as a stochastic data generating process that is related to some underlying biological mechanism.
With spots of \(2 \mu m^2\), Visium HD can provide subcellular resolution. We can segment cells from this regular lattice of pixels and approximate the cells by their centroid. This in turn can be described as an irregular lattice of the cell types in space or even be analysed as a point patterns using the cell centroid locations.
In what follows, we will perform three types of analyses on different spatial scales of the dataset.
- cell-type-free analysis: This is an irregular lattice and we analyse it thus with tools from lattice data analysis
- cell-level analysis: This is a point pattern pattern of cell centroids and we will analyse it with point pattern analysis.
- multi-cellular analysis: This is a structure-based analysis of anatomical structures.
Cell type-free analysis
Lattice-based analysis is focused on comparing a numerical value across the lattice of measurements. Such a lattice is constructed by defining the neighbourhood of each cell. First, we will define a neighbourhood on which we want to compute spatial statistics metrics.
In binned 10x Visium, it is very natural to consider the direct neighbours of the bins in the Visium lattice, for example with the function Voyager::findVisiumGraph().
Here we create a knn graph with the function SpatialFeatureExperiment::findSpatialNeighbors(), considering the \(k = 5\) nearest neighbourhood of each cell.
(g <- findSpatialNeighbors(sfe,
method = "knearneigh", # wraps the spdep function with the same name
k = 5,
zero.policy = TRUE,
sample_id = "all"
))Characteristics of weights list object:
Neighbour list object:
Number of regions: 33763
Number of nonzero links: 168815
Percentage nonzero weights: 0.01480911
Average number of links: 5
Non-symmetric neighbours list
Weights style: W
Weights constants summary:
n nn S0 S1 S2
W 33763 1139940169 33763 12383.72 137520.7
colGraph(sfe, "knn5") <- g
plotColGraph(sfe,
colGraphName = "knn5"
) + theme_void()We notice two things:
The \(y\)-coordinate is flipped compared to the image. This is due to the way that Visium data is acquired. The Visium slide is placed on top of the tissue but the H&E image is acquired from below the tissue.
The irregular lattice is clearly visible.
Global indicators of spatial association
Now that we have defined both the neighbourhood of a cell, we can turn to measures of spatial association.
In general, in a global association metric, we compute for each point, \(i\), a metric for its neighbourhood \(j\). The neighbourhood we quantified before as the regular hexagonal grid and is passed to the functions as a weight matrix \(w_{ij}\). The association measure is then a function \(f(x_i,x_j)\) and could e.g. quantify spatial correlation
\[ \sum_i \sum_j f(x_i,x_j) w_{ij} \]
A famous measure of global spatial autocorrelation is called Moran’s \(I\), which assesses correlation of a gene with itself given a defined weight matrix \(w\) (Moran 1950).
- Extract the top 200 highly variable genes in this dataset with
scrapper::modelGeneVariances()andscrapper::chooseHighlyVariableGenes - Calculate the global Moran’s \(I\) coefficient with the function
Voyager::runUnivariate()on these 200 highly variable genes. You can find information and examples on how to run this function by typing?runUnivariatein your console - Extract the top 3 genes with the highest Moran’s \(I\) value
- Plot the expression of these three genes in space using
Voyager::plotSpatialFeature
First, we model the variance on a subset of the genes for computational reasons Then, we extract the 200 most highly variable genes
gene.var <- modelGeneVariances(logcounts(sfe),
num.threads = 1)
# chooseHighlyVariableGenes() expects the residual variance statistics
hvg.index <- chooseHighlyVariableGenes(gene.var$statistics$residuals,
top = 100)
#add the hvgs to the `sfe` - adapted from https://github.com/libscran/scrapper/blob/master/R/se_chooseRnaHvgs.R
is.hvg <- logical(nrow(gene.var$statistics))
is.hvg[hvg.index] <- TRUE
gene.var[["hvg"]] <- is.hvg
rowData(sfe) <- cbind(rowData(sfe), gene.var)
#extract the gene names of the hvgs
hvg <- rownames(sfe)[is.hvg]Next, we calculate global Moran’s I for these 200 genes
sfe <- runUnivariate(sfe,
type="moran",
features=hvg,
colGraphName="knn5")We order them and plot the gene expression top 3 genes wiht the plotSpatialFeature function from the Voyager package.
I <- rowData(sfe)$moran_sample01
o <- order(I, decreasing=TRUE)
topGenes <- rownames(sfe)[head(o, 3)]
plotSpatialFeature(sfe, topGenes, ncol=3)Local indicators of spatial association
In the section above, we have computed a global measure of spatial association, meaning that we get one number per field-of-view. This is an average of the local contributions and neglects the potential underlying heterogeneity completely.
Therefore, we will investigate local indicators of spatial association (Anselin 1995).
In the case of Moran’s \(I\), we will now not calculate one measure for the entire field-of-view but rather one metric per location/spot \(i\): \(I_i\).
Calculate local Moran’s \(I\) for the top three global Moran’s \(I\) genes from above using the function
Voyager::runUnivariate()Plot the local Moran’s \(I\) values in space using
Voyager::plotLocalResult
First, we have to calculate the metric on the first most autocorrelated genes.
sfe <- runUnivariate(sfe,
features = topGenes,
colGraphName = "knn5",
type = "localmoran")Then, we plot the local Moran’s \(I\) values in space.
plotLocalResult(sfe, "localmoran",
features = topGenes,
ncol = 3,
colGeometryName = "centroids",
divergent = TRUE,
diverge_center = 0)Discretising regions with Moran’s scatterplot
The continuous local Moran’s \(I_i\) measurements can be difficult to interpret. A simplification is Moran’s scatter plot (Anselin 2019). The idea is to compare the Moran’s \(I\) of the spot \(i\) itself with its neighbours \(j\). Like this we obtain four options: high \(I_i\) next to neighbours that have high \(I_j\), high \(I_i\) next to neighbours that have low \(I_j\) etc. This separates outliers (high-low and low-high) from values within more homogeneous regions (high-high and low-low).
Calculate Moran’s scatterplot with
runUnivariateon the top three global Moran’s \(I\) genes from abovePlot the Moran’s scatterplot results in space using
Voyager::plotLocalResult
First, we have to calculate the metric on the first three most variable HVGs.
sfe <- runUnivariate(
sfe,
features = topGenes,
colGraphName = "knn5",
type = "moran.plot"
)We will show the Moran’s scatter plot for the gene PIGR.
# we will show one Moran's scatterplot
moranPlot(sfe, topGenes[1], graphName = "knn5", swap_rownames = "symbol") + theme_classic()The grey dashed lines indicate the mean expression of PIGR in the spots themselves and among their neighbours. According to these means, the spots are categorised into 4 categories, which can be plotted back in space. We notice that there is an interesting part of the tissue that constitutes as “high-high” for all three genes in the same region (bottom left).
plotLocalResult(
sfe,
name = "localmoran",
features = topGenes,
ncol = 3,
attribute = "mean",
colGeometryName = "centroids"
)We can also filter on the p-values for non-significant cells
localResults(sfe)[["localmoran"]][["PIGR"]] <-
localResults(sfe)[["localmoran"]][["PIGR"]] |>
mutate(locClust = ifelse(`-log10p_adj` > -log10(0.05), as.character(mean), "non-siginificant"))plotLocalResult(
sfe,
name = "localmoran",
features = "PIGR",
ncol = 1,
attribute = "locClust",
colGeometryName = "centroids"
)Cell type-based analysis
In the cell type-based analysis chapter, we will compare the distribution of categorical cell types in space using point pattern analysis. Let us first plot the cell types in space
plotSpatialFeature(sfe,
features="SingleR_label",
colGeometryName="centroids",
size = 1) +
guides(col=guide_legend(override.aes=list(size=2))) +
scale_color_manual(values = ctc)Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Point pattern analysis
Next, we will compute a measure of spatial correlation across scales for all pairwise cell type combinations called Ripley’s cross \(K\) function using the R/Bioconductor package spatialFDA (Emons et al. 2026, 2025). Ripley’s cross \(K\) function is defined as (Baddeley et al. 2016).
\[ K(r) = \frac{1}{\lambda_j} \mathbb{E} [t(u,r,X^{j}) \mid u \in X^{i}], \]
Ripley’s cross \(K\) function can be variance stabilised, which is then called Besag’s cross \(L\) (Baddeley et al. 2016). This is the measure we will compute for our dataset.
\[ L(r) = \sqrt{\frac{K(r)}{\pi}} \]
We will focus on three cell types to make the results more interpretable
resCross <- calcCrossMetricPerFov(
sfe,
correction = "iso",
selection = c("Tumor", "T cells", "B cells"),
subsetby = "sample_id",
fun = "Lcross",
marks = "SingleR_label",
rSeq = seq(0, 100, length.out = 100), by = c("sample_id"),
ncores = 1
)plotCrossMetricPerFov(
resCross,
theo = TRUE,
correction = "iso",
x = "r",
imageId = 'sample_id',
linewidth = 1
) [[1]]
What do you learn from these plots in combination with the distribution of cell types in space above?
We note a few things:
- All cells cluster among themselves and across scales (the diagonal of the plot)
- All cells show spacing (below the dashed line) of radii \(r\) smaller than 15 \(\mu\)m \(\rightarrow\) cell diameter
- Immune cells build aggregates
- Tumor cells show spacing in comparison to immune cells, there seems to be no immune invasion into the tumor
Multicellular analysis
Reconstruction of the Tumor region
Next, we will continue with multicellular analysis. The R/Bioconductor package sosta is designed for structure-based analysis of multicellular anatomical structures. For example in spatial omics, “structures” could be germinal centers (GCs) in the tonsil, metastases or tertiary lymphoid structures (TLS) in cancer (Gunz et al. 2025).
In sosta, the reconstruction of anatomical structures is based on the density profile of a marked point pattern. Based on an estimated threshold of the density profile the structures are segmented.
First, we will have a look at the point pattern intensity of the mark of interest. The histogram shows the distribution of intensity values of the pixel image on the left.
# Some of the later functions need to have an colData column with a image name identifier.
# We will define this here.
sfe$imageName <- "image1"
# Let us first look
shapeIntensityImage(
sfe,
marks = "SingleR_label",
markSelect = "Tumor"
)We can use the following function to segment the Tumor region. The result is an sf polygon. Internally, the functions estimate the bandwidth of the smoothing kernel (bndw) and the intensity threshold (thres) based on the distribution showed above.
tumor <- reconstructShapeDensitySPE(
sfe,
marks = "SingleR_label",
markSelect = "Tumor",
imageCol = "imageName")We can directly plot the resulting polygon using geom_sf. Note that the argument inherit.aes = FALSE is important for the function to work correctly.
plotSpatialFeature(sfe, features = "SingleR_label", colGeometryName = "centroids") +
guides(col = guide_legend(override.aes = list(size = 3))) +
geom_sf(data = tumor, fill = NA, color = "black", linewidth = 0.75,
inherit.aes = FALSE) +
scale_color_manual(values = ctc)Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Do you think this is a good reconstruction / segmentation?
Overall the segmentation seems to have worked fine. However, we note a few things: 1. There is one very small regions with a few Tumor cells present. 2. There might be some under-segmentation, i.e., there are some Myeloid and Endothelial cells included in the Tumor regions. As they are part of the Tumor micro-environment we could also include them in our segmentation / definition of tumor regions. In general, the correct segmentation depends heavily on the research question. Using sosta we have a data driven way of segmentation that is not influenced by subjective annotation bias.
Cell-based analysis
For later analyses, we will assign the cells to each tumor region.
Distance to tumor border
We can now find the distance to the closest structure / tumor boundary using the function minBoundaryDistances. As the function is defined to work with multiple samples, we have to give the imageCol that we defined above.
sfe$minDist <- minBoundaryDistances(
spe = sfe, imageCol = "imageName",
structColumn = "tumorAssign", allStructs = tumor
)Again, we can plot the result. Note that we have boundary effects due to the FOV, cells that are close to the field-of-view boundary might actually be deep inside the tumor.
plotSpatialFeature(sfe, features="minDist", colGeometryName="centroids") +
scale_colour_gradient2(low = "darkred", high = "navy",)Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
We can now visualize the density distribution with respect to the tumor borders in a violin plot. Negative values correspond to cells within a tumor structure.
colData(sfe) |>
as.data.frame() |>
mutate(SingleR_label = reorder(SingleR_label, minDist, FUN = median)) |>
ggplot(aes(x = minDist, y = SingleR_label, fill = SingleR_label)) +
geom_violin() +
theme_bw() +
geom_vline(xintercept = 0, linetype = "dashed") +
scale_fill_manual(values = ctc) +
theme(legend.position = "none")What do you notice?
As expected, almost all of the tumor cells have negative distances (within the tumor structures). There are some Myeloid, Endothelial and T cells within the tumor regions and in the near vicinity. This hints at the composition of the tumor microenvironment. As outlined above, the values depend on the exact segmentation and definition of the tumor region.
Cell-type proportions
Next, we look at the absolute and relative cell type proportions in each tumor region.
df <- colData(sfe) |>
as.data.frame() |>
filter(tumorAssign %in% sel) |>
count(tumorAssign, SingleR_label)
p0 <- ggplot(df, aes(reorder(tumorAssign, n),
x = n, fill = SingleR_label)) +
theme_bw() +
scale_fill_manual(values = ctc)
p1 <- p0 +
geom_col() +
labs(x = "# cells", y = "region")
p2 <- p0 +
geom_col(position = "fill") +
labs(x = "% cells", y = "region") +
scale_x_continuous(labels = scales::percent)
patchwork::wrap_plots(p1, p2, guides = "collect")Parts of the multi-cellular section were adapted from the following vignette.
Additional Material
This chapter so far showed how to perform lattice-data analysis for one sample. Lattice data analysis is not yet very common in multi-sample analyses. One option is to compute a measure of global spatial association giving one numerical value per field-of-view. First, we will load 4 slides from the dataset (2 healthy slides and 2 cancerous slides).
sfeMult <- readObject("data/day3/03.1_sfeMultNorm")>>> Reading SpatialExperiment
>>> Reading colgeometries
The loaded data is already preprocessed. The chunk below shows how this was done, you don’t have to rerun it.
Don’t run
# identify mitochondrial genes
gn <- rowData(sfeMult)$Symbol
mt <- grepl("^MT-", gn, ignore.case = TRUE)
table(mt)
# remove them
sfeMult <- sfeMult[!mt, ]
#calculate the size factors
sf <- centerSizeFactors(colSums(assay(sfeMult)))
#normalise the data
normed <- scrapper::normalizeCounts(assay(sfeMult), size.factors = sf)
logcounts(sfeMult) <- normedNext, we will define a neighbourhood on which we want to compute spatial statistics metrics. For computational reasons, we will consider a \(k\) nearest neighbourhood around the cells.
allGraphs <- findSpatialNeighbors(sfeMult,
method = "knearneigh", # wraps the spdep function with the same name
k = 5,
zero.policy = TRUE,
sample_id = "all"
)
colGraphs(sfeMult, "knn5", sample_id = "all") <- allGraphs
sfeMultclass: SpatialFeatureExperiment
dim: 18034 56120
metadata(0):
assays(2): counts logcounts
rownames(18034): SAMD11 NOC2L ... KDM5D EIF1AY
rowData names(3): ID Symbol Type
colnames(56120): s_016um_00145_00029-1.Normal_P5
s_016um_00165_00109-1.Normal_P5 ... s_016um_00122_00096-1.Normal_P3
s_016um_00127_00062-1.Normal_P3
colData names(7): row col ... array_row array_col
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor
unit:
Geometries:
colGeometries: centroids (POINT)
Graphs:
Normal_P5: col: knn5
Cancer_P5: col: knn5
Cancer_P1: col: knn5
Normal_P3: col: knn5
Given you have now a
SpatialFeatureExperimentobject with four samples, how would you compare a global indicator of spatial association across conditions, possibly with multiple samples? As an example, calculate global univariate Moran’s \(I\) for the gene “PIGR”.An important flag in this question is
sample_id = "all". For reasons of having isolated cell patches, consider a \(k\)-NN neighbourhood this time instead offindVisiumGraph.
First, we need to calculate univariate Moran’s \(I\) for all samples
sfeMult <- runUnivariate(sfeMult,
features = "PIGR",
colGraphName = "knn5",
exprs_values = "logcounts",
sample_id = "all",
type = "moran.mc",
nsim = 200)res <- rowData(sfeMult)["PIGR",]
resDataFrame with 1 row and 27 columns
ID Symbol Type moran.mc_statistic_Normal_P5
<character> <character> <factor> <numeric>
PIGR ENSG00000162896 PIGR Gene Expression 0.665933
moran.mc_parameter_Normal_P5 moran.mc_p.value_Normal_P5
<numeric> <numeric>
PIGR 201 0.00497512
moran.mc_alternative_Normal_P5 moran.mc_method_Normal_P5
<character> <character>
PIGR greater Monte-Carlo simulati..
moran.mc_res_Normal_P5 moran.mc_statistic_Cancer_P5
<list> <numeric>
PIGR 7.56633e-05, 1.03790e-02,-2.57316e-03,... 0.662601
moran.mc_parameter_Cancer_P5 moran.mc_p.value_Cancer_P5
<numeric> <numeric>
PIGR 201 0.00497512
moran.mc_alternative_Cancer_P5 moran.mc_method_Cancer_P5
<character> <character>
PIGR greater Monte-Carlo simulati..
moran.mc_res_Cancer_P5 moran.mc_statistic_Cancer_P1
<list> <numeric>
PIGR 0.006536053,0.001089384,0.000175518,... 0.839516
moran.mc_parameter_Cancer_P1 moran.mc_p.value_Cancer_P1
<numeric> <numeric>
PIGR 201 0.00497512
moran.mc_alternative_Cancer_P1 moran.mc_method_Cancer_P1
<character> <character>
PIGR greater Monte-Carlo simulati..
moran.mc_res_Cancer_P1 moran.mc_statistic_Normal_P3
<list> <numeric>
PIGR 0.00194370,0.00335518,0.00273481,... 0.554004
moran.mc_parameter_Normal_P3 moran.mc_p.value_Normal_P3
<numeric> <numeric>
PIGR 201 0.00497512
moran.mc_alternative_Normal_P3 moran.mc_method_Normal_P3
<character> <character>
PIGR greater Monte-Carlo simulati..
moran.mc_res_Normal_P3
<list>
PIGR 0.00208629,-0.01153171, 0.00286547,...
In order to model these measures, we have to convert the data into long format
df_long <- res |> as.data.frame() |>
tibble::rownames_to_column("gene") |>
pivot_longer(
cols = starts_with("moran.mc_statistic"),
names_to = c("measure", "sample"),
names_pattern = "(.*)statistic_(.*)",
values_to = "morans"
) |> separate(sample, into = c("condition", "replicate"), sep = "_") |>
select(c( "gene", "condition", "replicate", "measure", "morans"))
df_long# A tibble: 4 × 5
gene condition replicate measure morans
<chr> <chr> <chr> <chr> <dbl>
1 PIGR Normal P5 moran.mc_ 0.666
2 PIGR Cancer P5 moran.mc_ 0.663
3 PIGR Cancer P1 moran.mc_ 0.840
4 PIGR Normal P3 moran.mc_ 0.554
table(df_long$condition, df_long$replicate)
P1 P3 P5
Cancer 1 0 1
Normal 0 1 1
We see that we have a nested variance structure in this data. We have two conditions “Normal” and “Cancer”. We note that the patient identifiers are not unique, meaning that we have multiple samples from the same patient.
In order to account for this dependence, we have to model the data with a mixed-effects model. One option is the lme4 package in R.
boundary (singular) fit: see help('isSingular')
summary(mdl)Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: morans ~ condition + (1 | replicate)
Data: df_long
REML criterion at convergence: -2
Scaled residuals:
Min 1Q Median 3Q Max
-0.8451 -0.6123 0.0000 0.6123 0.8451
Random effects:
Groups Name Variance Std.Dev.
replicate (Intercept) 0.00000 0.0000
Residual 0.01096 0.1047
Number of obs: 4, groups: replicate, 3
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.60997 0.07402 2.00000 8.241 0.0144 *
conditionCancer 0.14109 0.10467 2.00000 1.348 0.3101
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
conditnCncr -0.707
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
We note that in this example there is no difference in the global Moran’s \(I\) value for the gene “PIGR”. This not surprising since a model with 3 independent samples is strongly under-powered.
An important step in interpreting a statistical model is looking at some model diagnostics. We will plot the QQ plot of this mixed effects model. We note that the four samples follow the theoretical line well. But we notice again, that having only four samples is a great limitation of this dataset.
Now we will subset to only the tumor regions.
sfe_sub <- sfe[, sfe$tumorAssign %in% sel]Warning in sn2listw(df, style = style, zero.policy = zero.policy,
from_mat2listw = TRUE): neighbour object has 7 sub-graphs
Warning in mat2listw(m_wts, style = style, zero.policy = zero.policy):
neighbour object has 7 sub-graphs
We will pseudo-bulk within each tumor and do PCA.
# pseudo-bulk at the structure level
agg <- scrapper::aggregateAcrossCells(counts(sfe_sub), list(groups = sfe_sub$tumorAssign))
# normalize counts
nc <- normalizeCounts(agg$sums, size.factors = colSums(agg$sums))
# run PCA
pcaTum <- runPca(nc)Let’s look at the three top features (genes) for each principle component.
[,1] [,2] [,3]
[1,] "COL3A1" "COL1A1" "PYGB"
[2,] "COL1A1" "COL3A1" "MT-ND4"
[3,] "SELENOP" "TMSB4X" "MT-CYB"
[4,] "SPARC" "MT-CO1" "FABP1"
[5,] "VIM" "MT-CYB" "CEACAM5"
[6,] "GPCPD1" "FTH1" "MT-ATP6"
[7,] "CD74" "MT-ATP6" "GPCPD1"
[8,] "COL1A2" "MT-ND4" "MT-ND3"
[9,] "FN1" "COL1A2" "MT-CO1"
[10,] "PYGB" "SPARC" "SELENOP"
Let’s look at some features in the subset and the whole FOV.
plotSpatialFeature(sfe_sub, c("COL3A1", "FTH1", "PYGB"), ncol=3)plotSpatialFeature(sfe, c("COL3A1", "FTH1", "PYGB"), ncol=3)What do you notice?
Interestingly, several top-loaded genes for these PCs are strongly associated with non-tumor cell types. This suggests that the PCA is capturing variance that is driven by non-tumor cells within the tumor regions. Because the expression to these cells deviate so drastically from the dominant pseudo-bulk tumor profile, PCA identifies them as an important axis of variation. This could either result from the over-segementation discussed above or indicate some interesting biology or mixing patterns at the border (as we have seen in the distance density plots). This would be interesting to investigate further, e.g., check which cell types dominantly express these markers.
Rerun the segmentation with a higher intensity threshold and rerun the PCA analysis. Do you notice a change in the top genes?
To generate a more refined segmentation, the intensity threshold in reconstructShapeDensitySPE has to be increased to around 2.75E-4. The rest follows as above.
reconstructShapeDensitySPE(
sfe,
marks = "SingleR_label",
markSelect = "Tumor",
imageCol = "imageName",
thres = 2.75E-4
)You should see some change, but the non-tumor genes are still picked up. This further indicates that we might see an interesting biological pattern.
Save the object
alabaster.sfe::saveObject(sfe, "results/day3/03.1_sfe")