Cell Type Annotation in Spatial Transcriptomics

Updated 19 April 2026

Cell type annotation in spatial transcriptomics is the process of assigning meaningful cellular identities to spatially resolved gene expression data, linking tissue architecture with gene function.
It integrates spatial proximity, gene regulatory networks, and morphology using advanced computational frameworks like graph neural networks and transformer models to overcome noise and resolution challenges.
Recent models such as CellScape and HEIST demonstrate high accuracy in resolving cellular heterogeneity, offering powerful tools for understanding microenvironment interactions.

Cell type annotation in spatial transcriptomics (ST) is the systematic process of assigning biologically meaningful cell identities to spatially resolved gene expression profiles. This process underpins the elucidation of tissue organization, cellular heterogeneity, and microenvironmental interactions by linking single-cell or multi-cellular transcriptomic data to anatomical context. The advent of high-resolution spatial transcriptomics technologies has catalyzed the development of computational frameworks that address the integration of spatial proximity, gene regulatory networks, and multimodal information—including morphology and protein landscape—for robust, interpretable cell type annotation.

1. Conceptual Framework and Challenges

Cell identity in tissues is influenced by both intrinsic genomic programs and extrinsic spatial context. Spatial transcriptomics data are characterized by high dimensionality, spatial correlations, technical noise, cell or spot heterogeneity, and platform-specific limitations (e.g., spot-based vs. single-cell resolution). Comprehensive annotation demands models that capture spatial neighborhood effects, gene–gene regulatory structure, and sometimes morphological context, while remaining scalable, interpretable, and robust to the heterogeneity and artifacts inherent to tissue environments (Yan et al., 13 Feb 2026).

A key challenge is the “entanglement” of technical and biological variation: cells of the same type may appear transcriptionally distinct across microenvironments, while adjacent but unrelated cell types may share overlapping gene expression patterns due to spatial gradients, technical diffusion, or partial volume effects. Additionally, spot-based technologies (e.g., 10x Visium) capture mixed cell populations in each spot, necessitating deconvolution or latent mixture modeling (Koo et al., 9 Nov 2025).

2. Dual-Graph and Hierarchical Representation Models

Recent state-of-the-art approaches leverage dual-branch or hierarchical graph models to capture both spatial and regulatory interactions. For example, CellScape is a dual-branch framework in which spatial adjacency is captured via a Graph Attention Network (GAT) on a cell–cell graph, while gene–gene regulatory relationships are encoded via a convolutional neural network on a co-expression–derived affinity map (Yan et al., 13 Feb 2026). Each branch produces a low-dimensional embedding; a contrastive loss on the spatial branch enforces similarity among neighboring cells’ embeddings, while a nonlinear reconstruction loss on the gene branch ensures fidelity to gene co-regulatory structure.
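The two inputs that such a dual-branch model consumes can be illustrated with a minimal NumPy sketch. It is not CellScape's implementation: the k-nearest-neighbor rule, the use of absolute Pearson correlation as the affinity, and the function names `knn_adjacency` and `coexpression_affinity` are all assumptions made for illustration.

```python
import numpy as np

def knn_adjacency(coords, k):
    """Symmetric k-nearest-neighbour adjacency (no self loops) from cell coordinates."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # a cell is not its own neighbour
    nearest = np.argsort(d, axis=1)[:, :k]       # indices of the k closest cells
    adj = np.zeros(d.shape, dtype=bool)
    rows = np.repeat(np.arange(len(coords)), k)
    adj[rows, nearest.ravel()] = True
    return adj | adj.T                           # keep an edge if either end chose it

def coexpression_affinity(expr):
    """Gene-by-gene absolute Pearson correlation; expr is a cells x genes matrix."""
    corr = np.corrcoef(expr, rowvar=False)
    return np.abs(np.nan_to_num(corr))           # constant genes yield NaN, mapped to 0
```

The adjacency matrix would feed a graph attention layer and the affinity map a convolutional encoder. Both choices (k, correlation measure) are tunable and would need to be matched to the platform's resolution.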

HEIST extends this paradigm with a true hierarchy: a tissue-level cell–cell spatial graph (neighborhood defined by Voronoi tessellation) is coupled with per-cell gene regulatory graphs, with cross-level message passing (cell ↔ gene), intra-level transformers, and spatially-aware contrastive and masked auto-encoding pretraining (Madhu et al., 11 Jun 2025). Ablation studies in HEIST demonstrate that omitting either spatial or regulatory hierarchy leads to substantial drops in F1 performance (from ≈0.995 to ≈0.18–0.20 in SEA-AD), underscoring the criticality of integrated modeling.

3. Spatial Graph Neural Networks and Foundation Models

Graph neural networks (GNNs) and transformer-derived architectures have been leveraged extensively as foundation models for spatial transcriptomics, harnessing the natural graph structure of spatial data. SAGE-FM utilizes a lightweight GCN trained via a masked central spot prediction objective, in which masked gene expression in each spot is imputed using the local spatial context provided by a fixed-radius neighborhood graph (Zhan et al., 21 Jan 2026). The resulting 1024-dimensional spot embeddings outperform non-spatial and factor analysis baselines (e.g., MOFA) in clustering, classification, and biological heterogeneity preservation (e.g., achieving 81% accuracy and macro-F1 of 0.62 in oropharyngeal squamous cell carcinoma).
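The masked central spot objective can be sketched in a few lines. The example below is a deliberately simplified stand-in: instead of a trained GCN it uses a single parameter-free mean aggregation over a fixed-radius neighborhood, and the function names (`radius_graph`, `masked_center_prediction`, `masked_mse`) are hypothetical rather than taken from SAGE-FM.

```python
import numpy as np

def radius_graph(coords, radius):
    """Boolean adjacency linking spots whose distance is at most `radius`."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    adj = d <= radius
    np.fill_diagonal(adj, False)                 # the masked spot cannot see itself
    return adj

def masked_center_prediction(expr, adj, center):
    """Impute the masked expression of spot `center` as the mean of its neighbours."""
    neighbours = np.flatnonzero(adj[center])
    if neighbours.size == 0:
        return np.zeros(expr.shape[1])           # isolated spot: no context available
    return expr[neighbours].mean(axis=0)

def masked_mse(expr, adj):
    """Mean squared error when each spot in turn is masked and imputed."""
    errors = [np.mean((masked_center_prediction(expr, adj, i) - expr[i]) ** 2)
              for i in range(len(expr))]
    return float(np.mean(errors))
```

In a learned model the mean aggregation would be replaced by graph convolution layers with trainable weights, and `masked_mse` would serve as the training loss. The sketch only shows how spatial context alone can carry information about a hidden spot.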

Multi-scale models such as SToFM further enhance annotation fidelity by extracting and fusing gene-level, cellular neighborhood (micro-scale), and tissue (macro-scale) features using SE(2)-equivariant transformers and virtual cell aggregation (Zhao et al., 15 Jul 2025). The multi-scale pipeline is organized into three components: (i) gene-scale domain-adapted transformers, (ii) macro-scale clusters collapsed into virtual cells, and (iii) micro-scale tiling of the spatial field. When benchmarked for cell type annotation in mouse brain datasets, SToFM achieves a macro-F1 of 0.4951, exceeding the prior transformer architectures it was compared against.

4. Deconvolution, Mixture Models, and Spot-Based Annotation

Spot-based platforms (e.g., Visium) necessitate deconvolution, wherein the per-spot gene expression is modeled as a convex mixture of reference cell type gene expression profiles. DUET formulates this as a penalized likelihood with (i) Poisson or Negative Binomial spot count likelihood, (ii) per-spot composition vectors $\theta_i$ constrained to the probability simplex, and (iii) a convex clustering penalty that fuses spot compositions across spatial neighbors as a function of their similarity (Koo et al., 9 Nov 2025). Proximal ADMM and coordinate descent are employed to optimize both spot size factors and compositions, yielding both spatial domain assignments and per-domain cell type compositions.
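The simplex constraint on per-spot compositions can be made concrete with a small sketch. It is not DUET: the Poisson or Negative Binomial likelihood is swapped for plain least squares, the convex clustering penalty and size factors are omitted, and the optimizer is projected gradient descent rather than proximal ADMM. The names `project_to_simplex` and `estimate_composition` are illustrative.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {theta : theta >= 0, sum(theta) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]   # last index kept in the support
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)

def estimate_composition(spot, reference, n_iter=2000):
    """Fit spot ~ reference @ theta by least squares with theta on the simplex.

    spot: length-genes expression vector (assumed already normalised).
    reference: genes x cell_types matrix of reference profiles.
    """
    k = reference.shape[1]
    theta = np.full(k, 1.0 / k)                        # start from a uniform mixture
    step = 1.0 / np.linalg.norm(reference, 2) ** 2     # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = reference.T @ (reference @ theta - spot)
        theta = project_to_simplex(theta - step * grad)
    return theta
```

Running `estimate_composition` independently per spot ignores spatial structure. A fusion penalty across neighboring spots, as described above, is what couples the per-spot problems and yields spatial domains.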

Performance benchmarks indicate that DUET attains lower deconvolution error and higher ARI in simulated and real datasets compared to existing methods (SPOTlight, CARD, Seurat+deconv), especially as domain smoothness and composition overlap increase.

Huang et al. (2016) present a related mixture model for in situ hybridization (ISH) data, applying marked spatial point process mixtures coupled with non-negative matrix factorization and variational LDA for cell-type annotation, and enabling extraction of spatial density maps, gene profiles, and morphological statistics per type.

5. Multimodal and Morphology-Integrated Annotation Methods

Joint modeling of transcriptomic and morphological data is realized in frameworks such as CellSymphony, which processes single-cell spatial transcriptomics data (e.g., Xenium) alongside paired morphology embeddings from high-resolution H&E images (Acosta et al., 13 Aug 2025). Separate transformer-based encoders process molecular and image features, linearly projecting them into a common latent space before joint fusion by a lightweight transformer. Supervised cross-entropy loss is combined with contrastive alignment to ensure internal consistency across modalities.
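The projection-plus-alignment step can be sketched with NumPy. The symmetric InfoNCE form, the temperature value, the random projection matrices `W_gene` and `W_img`, and all dimensions below are assumptions for illustration. The source does not specify which contrastive loss CellSymphony uses.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE: row i of z_a should match row i of z_b and no other row."""
    logits = l2_normalize(z_a) @ l2_normalize(z_b).T / temperature

    def diagonal_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

rng = np.random.default_rng(0)
n_cells, d_gene, d_img, d_latent = 8, 20, 12, 4       # hypothetical sizes
gene_feat = rng.normal(size=(n_cells, d_gene))        # stand-in molecular encoder output
img_feat = rng.normal(size=(n_cells, d_img))          # stand-in morphology encoder output
W_gene = rng.normal(size=(d_gene, d_latent))          # linear projection to shared space
W_img = rng.normal(size=(d_img, d_latent))
z_gene, z_img = gene_feat @ W_gene, img_feat @ W_img
alignment_loss = info_nce(z_gene, z_img)
```

In training, this alignment term would be added to the supervised cross-entropy loss, so that the two modalities of the same cell land close together before the fusion transformer combines them.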

Empirically, combining gene and morphology inputs improves F1 scores across various cancer types relative to unimodal baselines. For example, in lung cancer T cell annotation, the reported F1 scores are 0.93 for a joint model, 0.94 for the unimodal transformer, and 0.97 for a multi-input model integrating explicit spatial tokens. Morphological context sharpens detection of spatially organized microenvironmental niches, such as immune rings or stromal compartments.

6. Topological and Multiscale Approaches

Persistent homology and topological data analysis offer a complementary paradigm for cell-type annotation in subcellular or high-density spatial transcriptomics. TopACT constructs a multiscale classifier referencing single-cell RNA-seq data and applies multiscale aggregation over increasing radii to assign spot-level cell types (Benjamin et al., 2022). Central to TopACT is the combination of a robust probabilistic classifier (e.g., a linear SVM with Platt scaling) and adaptive window aggregation, followed by blob detection to extract individual cell loci.
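The idea of aggregating over increasing radii until a classifier becomes confident can be sketched as follows. This is an illustration under stated assumptions, not TopACT's code: the classifier is a toy multinomial likelihood model rather than an SVM with Platt scaling, blob detection is omitted, and `multiscale_annotate` and `make_multinomial_classifier` are hypothetical names.

```python
import numpy as np

def multiscale_annotate(coords, counts, predict_proba, radii, min_confidence=0.9):
    """Label each spot at the smallest radius whose pooled counts classify confidently.

    Returns -1 for spots where no radius reaches `min_confidence`.
    """
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    labels = np.full(len(coords), -1)
    for r in sorted(radii):
        pooled = (d <= r).astype(float) @ counts      # sum counts within radius r
        proba = predict_proba(pooled)
        confident = (proba.max(axis=1) >= min_confidence) & (labels == -1)
        labels[confident] = proba.argmax(axis=1)[confident]
    return labels

def make_multinomial_classifier(profiles):
    """Toy classifier: softmax of multinomial log-likelihoods under reference profiles.

    profiles: cell_types x genes matrix with strictly positive rows summing to 1.
    """
    log_p = np.log(profiles)

    def predict_proba(x):
        ll = x @ log_p.T
        ll = ll - ll.max(axis=1, keepdims=True)       # numerical stability
        p = np.exp(ll)
        return p / p.sum(axis=1, keepdims=True)

    return predict_proba
```

Stopping at the smallest confident radius keeps labels local where the signal is strong and borrows from neighbors only where individual spots are too sparse, which is the trade-off the window definition controls.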

Topological features, such as the Betti numbers of the Vietoris–Rips complex over detected loci, quantify tissue-level organizational principles. The use of multiparameter persistence landscapes enables quantification of higher-order arrangements, such as immune cell peripheral rings in glomeruli, with statistical validation against immunofluorescence measurements in lupus nephritis tissue.
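As a concrete instance of a Betti number, the zeroth Betti number of a Vietoris–Rips complex is its number of connected components, which can be computed with a union-find structure. The sketch below assumes the convention that two points are joined when their distance is at most `scale` (some libraries use twice the radius instead). The function name `betti0_rips` is illustrative.

```python
import numpy as np

def betti0_rips(points, scale):
    """Connected components of the Vietoris-Rips complex at the given scale."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i

    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Each pair (i < j) within `scale` is an edge; merge the two components.
    for i, j in zip(*np.nonzero(np.triu(d <= scale, k=1))):
        parent[find(int(i))] = find(int(j))
    return len({find(i) for i in range(n)})

loci = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
components_by_scale = [betti0_rips(loci, s) for s in (0.5, 1.5, 20.0)]
```

Tracking how this count changes as the scale grows is the degree-zero part of persistent homology. Higher Betti numbers (loops, voids) require a dedicated library and are not covered by this sketch.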

7. Comparative Benchmarks and Case Studies

Performance comparisons across annotation methods are summarized in the following tables, illustrating quantitative metrics for cell type classification across several datasets:

Dataset	Method	Accuracy	Macro F1	ARI
Slide-tags cortex	CellScape	0.92	0.88	0.85
	SpaGCN	0.79	0.69	0.65
	GraphST	0.81	0.72	0.68
STARmap PLUS	CellScape	0.90	0.85	0.82
	SEDR	0.75	0.62	0.58
Slide-seqV2 bulb	CellScape	0.88	0.83	0.80
	STAGATE	0.70	0.60	0.55

Dataset	Model	Macro F1
SEA-AD	HEIST	0.995 ± 0.016
	CellPLM	0.67 ± 0.08
	scGPT-spatial	0.59 ± 0.003
	STAGATE	0.33 ± 0.06
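Because the tables report macro F1 alongside accuracy, a short reference computation may help in reading them. This is a minimal sketch that averages per-class F1 over the classes present in the ground truth. Library implementations may also include classes that appear only in the predictions, so values can differ in that edge case.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))
```

Since every cell type contributes equally regardless of abundance, a model can have high accuracy and a much lower macro F1 when rare types are misclassified.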

Case studies in brain (Slide-tags), Alzheimer's disease (STARmap PLUS), and olfactory bulb (Slide-seqV2/Stereo-seq) corroborate that model architectures integrating spatial, regulatory, and sometimes morphological contextualization resolve known anatomical and pathological features, including fine subtype separations (e.g., astrocyte subclusters; microglial DAM expansion in AD; layer-specific markers).

8. Implementation Recommendations and Limitations

Critical hyperparameters include neighborhood radius or adjacency criteria (for spatial graphs), masking ratios in training, number of clusters or dimensionality reduction targets, and loss balancing weights (e.g., λ for contrastive vs. reconstruction losses). Efficient batching, parallelization (GPU or multi-node), and scalable optimization (proximal ADMM, block coordinate descent, early stopping) are necessary for high-throughput datasets.

Limitations include domain- and platform-specific tuning requirements (e.g., for morphologically rich vs. poor tissues), potential over-smoothing or domain merging under aggressive spatial penalties, and the difficulty of marker-based mapping when the reference set of cell types is incomplete. For topological and multiscale methods, threshold and window definitions require careful selection to avoid over- or under-segmentation.

References

"Uncovering spatial tissue domains and cell types in spatial omics through cross-scale profiling of cellular and genomic interactions" (Yan et al., 13 Feb 2026)
"SAGE-FM: A lightweight and interpretable spatial transcriptomics foundation model" (Zhan et al., 21 Jan 2026)
"Discovering Neuronal Cell Types and Their Gene Expression Profiles Using a Spatial Point Process Mixture Model" (Huang et al., 2016)
"A unified approach to spatial domain detection and cell-type deconvolution in spot-based spatial transcriptomics" (Koo et al., 9 Nov 2025)
"CellSymphony: Deciphering the molecular and phenotypic orchestration of cells with single-cell pathomics" (Acosta et al., 13 Aug 2025)
"HEIST: A Graph Foundation Model for Spatial Transcriptomics and Proteomics Data" (Madhu et al., 11 Jun 2025)
"SToFM: a Multi-scale Foundation Model for Spatial Transcriptomics" (Zhao et al., 15 Jul 2025)
"Multiscale topology classifies and quantifies cell types in subcellular spatial transcriptomics" (Benjamin et al., 2022)