HoloTea: 3D Tissue Expression Inpainting
- HoloTea is a 3D-aware framework that integrates adjacent histology sections for volumetric reconstruction of spatial transcriptomic profiles.
- The method employs flow-matching denoising with count-specific and spatial-empirical priors alongside ControlNet modulation to inject precise 3D context.
- Empirical results demonstrate reduced MSE and improved gene-level correlation, validating its potential for unbiased tissue profiling and clinical applications.
Holographic Tissue Expression Inpainting and Analysis (HoloTea) is a 3D-aware generative framework for imputing spot-level gene expression in spatial transcriptomics (ST) from stacks of H&E-stained serial histology sections. Unlike earlier methods that treat each slide independently or that lack scalability and generative capability, HoloTea explicitly conditions on adjacent tissue sections, enabling accurate volumetric reconstruction of spatial transcriptomic profiles. The approach combines geometry-aware retrieval, flow-matching generative modeling, count-specific initialization, and scalable attention mechanisms to produce anatomically coherent, high-resolution 3D molecular atlases suitable for large tissue volumes and diverse biological contexts (Sanian et al., 18 Nov 2025).
1. Core Architecture and Workflow
HoloTea’s pipeline operates on stacks of serial H&E-stained tissue sections, each discretized into a set of planar “spots” with known XY coordinates and a section index $z$. For each spot on section $z$, the system relies on three modules:
- Spot Encodings: Each spot receives a learned image embedding derived from its associated H&E tile using a frozen pathology foundation encoder (e.g., UNI).
- Adjacent‐Slide Retrieval: For each spot, a candidate pool is formed from its $k$ nearest neighbors (kNN) by planar proximity on the adjacent tissue sections. These candidates are re-ranked using both embedding-space cosine similarity and spatial affinity, then aggregated with attention-style weighting to produce a cross-sectional context token.
- Flow‐Matching Denoiser with ControlNet and Global Attention: A time-conditioned network transports samples from a biologically motivated start distribution to the true expression profile using learned flows. Cross-section cues are injected via a lightweight ControlNet, and context is propagated both locally (kNN Graph Attention) and globally (Global Set Attention block).
The explicit conditioning on adjacent sections enforces anatomical continuity along the $z$-axis, surpassing prior 2D and non-generative 3D baselines (Sanian et al., 18 Nov 2025).
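The adjacent-slide retrieval step can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the Gaussian form of the spatial affinity, the linear blend of the two scores, and the values of `k`, `blend`, `temperature`, and `sigma` are all assumptions chosen for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def retrieve_context(query_xy, query_emb, neighbors, k=3, blend=0.5,
                     temperature=0.1, sigma=1.0):
    """Build a cross-section context token for one spot.

    neighbors: list of (xy, embedding) pairs taken from the adjacent sections.
    k, blend, temperature, and sigma are hypothetical values for illustration.
    """
    # Step 1: candidate pool = k nearest spots by planar (XY) distance.
    pool = sorted(neighbors, key=lambda n: math.dist(query_xy, n[0]))[:k]

    # Step 2: re-rank by blending embedding similarity with spatial affinity.
    scores = []
    for xy, emb in pool:
        affinity = math.exp(-math.dist(query_xy, xy) ** 2 / (2 * sigma ** 2))
        scores.append(blend * cosine(query_emb, emb) + (1 - blend) * affinity)

    # Step 3: attention-style (softmax) weights over the pool.
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Step 4: the context token is the weighted sum of candidate embeddings.
    token = [sum(w * emb[d] for w, (_, emb) in zip(weights, pool))
             for d in range(len(query_emb))]
    return token, weights


# Toy data: four spots on an adjacent section, each with a 2D embedding.
neighbors = [((0.0, 0.1), [1.0, 0.0]),
             ((0.2, 0.0), [0.9, 0.1]),
             ((3.0, 3.0), [0.0, 1.0]),
             ((9.0, 9.0), [1.0, 0.0])]
token, weights = retrieve_context((0.0, 0.0), [1.0, 0.0], neighbors)
```

In this toy example the farthest spot is dropped by the kNN step even though its embedding matches the query exactly, which shows why planar proximity is applied before the similarity-based re-ranking.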
2. 3D-Consistent Flow Matching and Priors
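The generative backbone of HoloTea is rectified (displacement) flow matching, which interpolates from a start distribution to the ground-truth gene expression for each spot. In the standard rectified-flow formulation, with $x_0$ a sample from the start distribution and $x_1$ the ground-truth expression, this corresponds to:
$$x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1].$$
The time-conditioned denoiser $v_\theta$ minimizes the objective:
$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,x_1}\left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2,$$
with $c$ encapsulating image embeddings, positional features, local kNN summaries, and adjacent-section context.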
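As a rough illustration, the sketch below computes a rectified flow-matching loss for a single spot in its textbook form: linear interpolation between the start sample and the target, with the network regressing the constant displacement. The function names and the `velocity_fn(x_t, t, cond)` signature are hypothetical, and the source may parameterize the denoiser differently.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from start sample x0 to target x1 at time t."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t, cond):
    """Mean squared error between predicted and true displacement (x1 - x0).

    velocity_fn is a hypothetical stand-in for the time-conditioned denoiser.
    """
    x_t = interpolate(x0, x1, t)
    target = [b - a for a, b in zip(x0, x1)]
    pred = velocity_fn(x_t, t, cond)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(target)


x0 = [0.0, 1.0, 2.0]   # start sample (e.g., drawn from a prior)
x1 = [1.0, 3.0, 2.0]   # ground-truth expression for three genes


def zero_velocity(x_t, t, cond):
    """An untrained model that predicts no movement at all."""
    return [0.0 for _ in x_t]


def oracle_velocity(x_t, t, cond):
    """Toy oracle that is handed the target as `cond` and points straight at it."""
    return [(g - x) / (1 - t) for x, g in zip(x_t, cond)]


loss_untrained = flow_matching_loss(zero_velocity, x0, x1, 0.25, None)
loss_oracle = flow_matching_loss(oracle_velocity, x0, x1, 0.25, x1)
```

The oracle recovers the displacement exactly because every point on a straight path satisfies $(x_1 - x_t)/(1 - t) = x_1 - x_0$, so its loss is zero up to rounding.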
To ensure biological plausibility, HoloTea incorporates two 3D-consistent priors for the start distribution:
- Learned ZINB Prior: A parameterization based on scVI, where a neural network predicts zero-inflated negative binomial (ZINB) parameters per gene, pretrained via negative log ZINB likelihood. Flow-matching initialization uses log-transformed ZINB random samples.
- Spatial-Empirical Prior: Spots are initialized by smoothing expression counts over adjacent sections with attention-based weighting, directly anchoring generative transport to observed local data.
Combining these priors ensures both count-awareness and 3D anatomical consistency throughout the flow process (Sanian et al., 18 Nov 2025).
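For the learned ZINB prior, the sketch below spells out the standard ZINB negative log-likelihood under the mean/dispersion parameterization ($\mu$, $\theta$, dropout probability $\pi$) and draws log-transformed start samples. It is a minimal sketch under assumptions: the parameters are fixed numbers here rather than network outputs, and `log1p` is assumed as the log transform.

```python
import math
import random


def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under ZINB(mean mu, dispersion theta, dropout pi)."""
    log_nb_zero = theta * math.log(theta / (theta + mu))  # log NB(0)
    if x == 0:
        # A zero comes either from dropout or from the NB component itself.
        return -math.log(pi + (1 - pi) * math.exp(log_nb_zero))
    log_nb = (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
              + log_nb_zero + x * math.log(mu / (theta + mu)))
    return -(math.log(1 - pi) + log_nb)


def sample_zinb(mu, theta, pi, rng):
    """Draw one ZINB count via the Gamma-Poisson mixture."""
    if rng.random() < pi:
        return 0
    rate = rng.gammavariate(theta, mu / theta)  # shape theta, scale mu/theta
    # Poisson draw: count unit-rate exponential arrivals that land before `rate`.
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < rate:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def sample_start(mus, thetas, pis, rng):
    """Log-transformed ZINB sample per gene (log1p is an assumed transform)."""
    return [math.log1p(sample_zinb(m, th, p, rng))
            for m, th, p in zip(mus, thetas, pis)]


rng = random.Random(0)
start = sample_start([3.0, 0.5, 10.0], [2.0, 1.0, 5.0], [0.2, 0.6, 0.05], rng)
```

Exponentiating the negated likelihood over all counts sums to one, which is a convenient sanity check when implementing the loss.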
3. Cross-Sectional Conditioning with Spot-Wise ControlNet
HoloTea employs a lightweight adaptation of ControlNet for per-spot injection of 3D contextual cues while preserving the main denoiser architecture. The mechanism comprises:
- GeneMapBuilder: Constructs a coarse 2D grid over each slide from the current latent states.
- Control Token Extraction: A CNN/MLP, conditioned on the flow time $t$, processes the GeneMap into a feature map. For each spot, a differentiable grid sample produces a control token.
- Transformer Modulation: At selected layers, activations are modulated in residual fashion: the control token is added to the hidden activation, scaled by a gate that follows a warm-up schedule.
This approach follows ControlNet’s residual-modulation strategy but interfaces directly with flow-matching denoisers at the granularity of individual transcriptomic spots (Sanian et al., 18 Nov 2025).
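The three steps above can be sketched as follows. This is an illustration under simplifying assumptions rather than the source's implementation: coordinates are assumed normalized to $[0, 1)$, the CNN/MLP stage is omitted (the averaged map is sampled directly), the control token is assumed to share the hidden dimension, and the linear warm-up gate with `warmup_steps=1000` is hypothetical.

```python
def build_gene_map(coords, latents, grid=4):
    """Average the latent vectors of the spots falling in each cell of a grid x grid map."""
    dim = len(latents[0])
    sums = [[[0.0] * dim for _ in range(grid)] for _ in range(grid)]
    counts = [[0] * grid for _ in range(grid)]
    for (x, y), z in zip(coords, latents):
        col, row = min(int(x * grid), grid - 1), min(int(y * grid), grid - 1)
        counts[row][col] += 1
        for d in range(dim):
            sums[row][col][d] += z[d]
    return [[[s / max(counts[r][c], 1) for s in sums[r][c]]
             for c in range(grid)] for r in range(grid)]


def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate the map at (x, y), treating cell centers as sample points."""
    grid = len(fmap)
    gx = min(max(x * grid - 0.5, 0.0), grid - 1.0)
    gy = min(max(y * grid - 0.5, 0.0), grid - 1.0)
    c0, r0 = int(gx), int(gy)
    c1, r1 = min(c0 + 1, grid - 1), min(r0 + 1, grid - 1)
    fx, fy = gx - c0, gy - r0
    return [(1 - fx) * (1 - fy) * fmap[r0][c0][d] + fx * (1 - fy) * fmap[r0][c1][d]
            + (1 - fx) * fy * fmap[r1][c0][d] + fx * fy * fmap[r1][c1][d]
            for d in range(len(fmap[0][0]))]


def modulate(hidden, control, step, warmup_steps=1000):
    """Residual modulation: hidden + gate * control, with a linear warm-up gate."""
    gate = min(step / warmup_steps, 1.0)
    return [h + gate * c for h, c in zip(hidden, control)]


# Two spots with 1D latent states, placed at the centers of opposite corner cells.
coords = [(0.125, 0.125), (0.875, 0.875)]
latents = [[1.0], [3.0]]
gene_map = build_gene_map(coords, latents)
control = bilinear_sample(gene_map, 0.125, 0.125)
modulated = modulate([0.2], control, step=500)
```

Because the gate starts at zero, the control branch is a no-op at the start of training and is phased in gradually, so the base denoiser is not disturbed early on.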
4. Scalable Global Context Propagation
To propagate context efficiently across large slides, HoloTea employs a Global Set Attention (GSA) block, a two-stage inducing-point multi-head attention mechanism:
- Stage 1 (Read): A small set of learned inducing points attends to all spot embeddings, compressing them into a fixed-size summary that captures global structure.
- Stage 2 (Write): All spot tokens attend to these summarizers, propagating information globally.
The resulting computational cost is $O(NM)$ for $N$ spots and $M$ inducing points, versus $O(N^2)$ for full attention, which supports whole-slide analysis with hundreds of thousands of spots without prohibitive resource demand (Sanian et al., 18 Nov 2025).
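A stripped-down sketch of the read/write pattern follows. It assumes a single head with no learned projections or normalization layers, so it shows the data flow and the cost argument rather than the full GSA block; the inducing-point count of 64 used in the cost comparison is a hypothetical value.

```python
import math
import random


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[d] for e, v in zip(exps, values))
                    for d in range(len(values[0]))])
    return out


def global_set_attention(spots, inducing):
    """Two-stage inducing-point attention (projections omitted for brevity)."""
    summary = attend(inducing, spots, spots)   # read: M queries over N spots
    return attend(spots, summary, summary)     # write: N queries over M summaries


def score_counts(n_spots, n_inducing):
    """Attention scores computed: (inducing-point scheme, full self-attention)."""
    return 2 * n_spots * n_inducing, n_spots ** 2


rng = random.Random(0)
spots = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
inducing = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]
updated = global_set_attention(spots, inducing)

# 100,000 spots with 64 inducing points (hypothetical count).
induced_cost, full_cost = score_counts(100_000, 64)
```

With these numbers the two-stage scheme computes about 12.8 million attention scores, against 10 billion for full self-attention.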
5. Training and Computational Considerations
HoloTea training proceeds in two distinct phases:
- Phase A: ZINB Pretraining
- Compute image embeddings.
- Predict and fit ZINB parameters; optimize negative log-likelihood via AdamW.
- Freeze the ZINB prior network post-convergence.
- Phase B: Flow Matching
- Sample spot-time pairs; initialize from frozen priors.
- Interpolate flow steps, compute kNN and adjacent-slide tokens.
- Extract and inject ControlNet tokens.
- Run the denoising transformer with GSA; backpropagate the flow-matching loss.
Hyperparameters include batch sizes of around 2048 spots, the inducing token count, a blend weight, a temperature, and up to 20 flow steps. A single NVIDIA H100 (80 GB) GPU suffices at slide-atlas scale (training: $333$ GFLOPs, $26.9$ GB peak RAM; per-slide inference: $3.7$ GB). Only two neighboring sections are required at inference, which limits the memory footprint. The ASIGN 3D baseline cannot run on the same hardware because it runs out of memory (Sanian et al., 18 Nov 2025).
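At inference, a flow-matching model is typically sampled by integrating the learned velocity field over a small number of steps. The sketch below uses plain Euler integration with 20 steps. The integrator choice and the `velocity_fn(x, t, cond)` signature are assumptions, and a toy velocity field stands in for the trained denoiser.

```python
def euler_sample(velocity_fn, x0, cond, steps=20):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the trained denoiser: points straight at `target`."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


start = [0.0, 2.0]      # sample from the start distribution
target = [1.0, -1.0]    # expression profile the toy field transports to
sample = euler_sample(toy_velocity, start, target, steps=20)
```

Because rectified flows aim for straight transport paths, few steps are usually enough; in this toy case the path is exactly straight, so Euler integration reaches the target for any step count.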
6. Empirical Performance and Biological Relevance
Benchmarking across whole-slide HVG panels and custom marker panels (HER2, ST-Data, in-house embryo) demonstrates HoloTea’s superiority over UNI and STFlow (2D flows):
- ST-Data top-250 HVG: MSE 0.364 (vs 0.443 for STFlow), gene-level PCC 0.612 (vs 0.540)
- Custom marker (211 genes): MSE 0.330 (vs 0.387 STFlow, 0.412 UNI), PCC 0.726
Imputed expression closely matches ground truth, accurately recapitulating spatial molecular niches (e.g., basal plate, cardinal vein). Clustering of imputed data yields higher neighbor consistency (0.88 vs 0.78 for ground truth) and normalized mutual information of 0.524 on unseen sections. Ablation studies verify the complementary contributions of the ZINB prior, ControlNet, and cross-slide cosine branch to overall accuracy (Sanian et al., 18 Nov 2025).
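The two headline metrics can be computed as in the sketch below. It assumes that gene-level PCC means the Pearson correlation across spots computed per gene and then averaged over genes; the source's exact aggregation may differ. The input values are toy numbers, not results from the paper.

```python
import math


def mse(pred, true):
    """Mean squared error over all spots and genes (rows = spots, columns = genes)."""
    n = sum(len(row) for row in true)
    return sum((p - t) ** 2
               for p_row, t_row in zip(pred, true)
               for p, t in zip(p_row, t_row)) / n


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    denom = math.sqrt(var_u * var_v)
    return cov / denom if denom > 0 else 0.0


def gene_level_pcc(pred, true):
    """Average over genes of the per-gene correlation across spots."""
    n_genes = len(true[0])
    per_gene = [pearson([row[g] for row in pred], [row[g] for row in true])
                for g in range(n_genes)]
    return sum(per_gene) / n_genes


# Toy matrices: 3 spots x 2 genes.
true = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
pred = [[2.0, 6.0], [4.0, 4.0], [6.0, 2.0]]
toy_mse = mse(pred, true)
toy_pcc = gene_level_pcc(pred, true)
```

In the toy data the first gene is perfectly correlated and the second perfectly anti-correlated, so the gene-level PCC averages to zero even though one gene is predicted up to scale. This is why MSE and PCC are reported together.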
7. Applications and Prospects for Volumetric Tissue Profiling
HoloTea enables cost-effective, high-resolution 3D tissue profiling from partially sequenced histological stacks. Key implications include:
- Unbiased Mapping of 3D Cell–Cell Communication: Critical for understanding tumor microenvironments, development, and organ zonation.
- Accelerated Biomarker Discovery: Supports stratification of tissue niches in both preclinical and clinical settings, such as resolving tumor-stroma interfaces in HER2+ breast cancer or cardiac tissue substructures.
- Compatibility with Clinical Workflows: H&E stacks can be computationally augmented with virtual ST measurements, integrating seamlessly with digital pathology platforms.
HoloTea thereby establishes a framework for geometry-aware retrieval, generative transport with count-aware priors, and scalable global context integration at slide-atlas scale, offering a principled pathway to volumetric molecular tissue atlases (Sanian et al., 18 Nov 2025).