HEST-Library: Unified Spatial Omics Analysis
- HEST-Library is a Python package that unifies spatial transcriptomics and histological imaging data from diverse sources for integrated multi-omics analysis.
- It employs advanced segmentation, image alignment, and batch-effect correction methods to ensure accurate data harmonization across varied samples.
- The package streamlines workflows for downloading, preprocessing, visualization, and benchmarking, enabling reproducible and scalable multimodal research.
HEST-Library is a Python package for unifying, preprocessing, and analyzing spatial transcriptomics (ST) and histological image data. It is designed to operate on the HEST-1k dataset, a resource of 1,229 ST profiles paired with H&E whole slide images (WSIs) that spans diverse tissue types, species, and cancer states. It enables streamlined data access, multimodal alignment, quantitative morphomolecular analyses, and foundation model benchmarking, with utilities for downloading, preprocessing, visualization, and batch-effect correction in spatial multi-omics research (Jaume et al., 2024).
1. Design Objectives and Scope
HEST-Library was developed to address the integrative and computational demands of large-scale multimodal studies involving legacy and contemporary ST datasets combined with digital pathology images. The principal goals are:
- Data assembly and harmonization: Aggregating heterogeneous ST and histology data from diverse sources (153 cohorts, 26 organs, two species, 25 cancer types), seamlessly wrapping transcriptomics, WSIs, and metadata.
- Unified I/O: Metadata-driven sample download, image conversion to pyramidal TIFF (for scalable viewing), and multimodal spot-to-image alignment.
- Preprocessing utilities: Automated workflows for tissue and nuclear segmentation using DeepLabV3 and CellViT, patch extraction emulating Visium/Xenium layouts, magnification inference, and batch-effect mitigation (ComBat, Harmony, MNN).
- Support for advanced analysis: Enabling downstream applications such as foundation model benchmarking (“HEST-Benchmark”), biomarker/morphology-gene exploration, and multimodal representation learning.
2. Architecture and Module Organization
The HEST-Library structure is organized under the top-level hest namespace with modular subcomponents, facilitating both end-to-end workflows and granular task execution. The principal submodules and provided APIs are:
| Module | Key Functions/Classes | Purpose |
|---|---|---|
| `hest.io` | `download_hest`, `list_samples`, `load_sample` | Data access and sample loading |
| `hest.core` | `HestSample`, `to_anndata`, `to_pyramidal_tiff`, etc. | Core sample representation |
| `hest.preprocess` | `align_visium`, `tissue_segmentation`, `tile_patches`, `normalize_expression`, `compute_spatial_stats` | Preprocessing and stats |
| `hest.batch` | `plot_batch_effect`, `correct_batch_effect` | Batch-effect exploration/correction |
| `hest.benchmark` | `run_hest_benchmark`, `BenchmarkResult` | Foundation model evaluation |
| `hest.utils` | `find_spot_under_patch`, `visualize_overlay` | Utility functions |
Section 4 and Appendix Figure A1 of Jaume et al. (2024) provide a full schematic of these modules and their interactions.
3. Principal Functionalities and API Patterns
HEST-Library exposes a high-level API for typical spatial omics workflows:
- Sample enumeration and download: Retrieve metadata (`list_samples`) and download filtered subsets by species, organ, or pathology.
```python
from pathlib import Path

from hest.io import list_samples, download_hest

# Metadata table describing all available samples
meta_df = list_samples()

# Download only the samples that match the metadata filter
download_hest(
    {'species': 'Homo sapiens', 'organ': 'Breast', 'cancer_type': 'IDC'},
    local_dir=Path('/data/hest1k/'),
)
```
- Sample loading and inspection: Encapsulated in the `HestSample` class, which integrates WSI objects, AnnData transcriptomics, alignment, contours, and nuclei segmentation:
```python
from pathlib import Path

from hest.io import load_sample

sample = load_sample('TENX111', data_dir=Path('/data/hest1k/'))
adata = sample.to_anndata()          # expression matrix as AnnData
slide = sample.to_pyramidal_tiff()   # WSI as pyramidal TIFF
```
- Expression normalization and filtering: Total-count and log1p normalization of AnnData; gene filtering via Scanpy.
```python
from hest.preprocess import normalize_expression

adata = normalize_expression(adata, method='total_count')
adata = normalize_expression(adata, method='log1p')
```
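For intuition about what the two methods do, here is a minimal, library-independent sketch of total-count scaling followed by a log1p transform on a tiny spots-by-genes matrix. The function names and the `target_sum` parameter are illustrative assumptions and are not part of HEST-Library's API.

```python
import math

def total_count_normalize(counts, target_sum=10_000.0):
    """Scale each spot (row) so that its counts sum to target_sum."""
    normalized = []
    for row in counts:
        total = sum(row)
        # All-zero spots stay zero instead of triggering a division by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([value * scale for value in row])
    return normalized

def log1p_transform(matrix):
    """Apply log(1 + x) element-wise to compress the dynamic range."""
    return [[math.log1p(value) for value in row] for row in matrix]

# Toy matrix: 3 spots x 3 genes (hypothetical counts)
counts = [[10, 30, 60], [1, 1, 2], [0, 0, 0]]
scaled = total_count_normalize(counts, target_sum=100.0)
logged = log1p_transform(scaled)
```

After scaling, spots with very different sequencing depth (rows one and two) become directly comparable, which is the usual motivation for running total-count normalization before the log transform.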
- Tissue and patch extraction: Automatic segmentation and Visium/Xenium-like patch assignment.
```python
from hest.preprocess import tissue_segmentation, tile_patches

mask = tissue_segmentation(sample)
patches = tile_patches(sample, size_px=224, mag=20.0)
```
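The geometry behind spot-centered patching can be sketched without the library. The example below assumes the common convention that 40x corresponds to roughly 0.25 µm/px (so the target resolution is `10 / mag` µm/px). The helper `patch_boxes` and its arguments are hypothetical and only illustrate how a 224 px patch at 20x maps to a read window at the slide's native resolution.

```python
def patch_boxes(spot_centers, src_um_per_px, size_px=224, mag=20.0):
    """Return (left, top, width, height) read windows at the source resolution.

    spot_centers: (x, y) pixel coordinates of spot centers at the source level.
    src_um_per_px: pixel size of the source image in micrometers.
    """
    # Assumed convention: 40x ~ 0.25 um/px, hence 20x ~ 0.5 um/px.
    target_um_per_px = 10.0 / mag
    # Window side (in source pixels) that covers size_px pixels at the target mag.
    src_size = round(size_px * target_um_per_px / src_um_per_px)
    half = src_size // 2
    return [
        (int(round(x)) - half, int(round(y)) - half, src_size, src_size)
        for x, y in spot_centers
    ]

# A slide scanned at 0.25 um/px: a 224 px patch at 20x spans 448 source pixels.
boxes = patch_boxes([(1000, 2000)], src_um_per_px=0.25)
print(boxes)  # [(776, 1776, 448, 448)]
```

Each window would then be read from the slide and resized to `size_px` so that all patches share the same physical scale regardless of scanner resolution.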
- Nuclear feature quantification: Extraction of per-nucleus morphometrics (area, perimeter, eccentricity).
```python
masks, classes = sample.nuclei.load()
df_feats = sample.nuclei.compute_features(classes_of_interest=['neoplastic'], features=['area'])
```
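As an illustration of how such morphometrics can be derived from a segmented nucleus contour, the sketch below computes area (shoelace formula) and perimeter for a polygon. It is a generic geometric example, not the library's implementation; the function names are hypothetical and eccentricity is omitted for brevity.

```python
import math

def polygon_area(vertices):
    """Shoelace formula; vertices are (x, y) pairs ordered around the contour."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def polygon_perimeter(vertices):
    """Sum of edge lengths, including the closing edge."""
    n = len(vertices)
    return sum(math.dist(vertices[i], vertices[(i + 1) % n]) for i in range(n))

# Toy contour: a 4 x 3 rectangle in pixel units
contour = [(0, 0), (4, 0), (4, 3), (0, 3)]
area = polygon_area(contour)            # 12.0
perimeter = polygon_perimeter(contour)  # 14.0
```

Multiplying by the squared (for area) or plain (for perimeter) pixel size converts these values to physical units.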
- Spatial-molecular correlation: Quantify relationships (e.g., PCC ~0.47 between GATA3 expression and nuclear area) and spatial statistics (e.g., Moran’s I).
```python
from hest.preprocess import compute_spatial_stats

morans_i = compute_spatial_stats(adata, gene='GATA3', neighbors=8, metric='morans_i')
```
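To make the statistic concrete, here is a minimal pure-Python sketch of Moran's I with binary k-nearest-neighbor weights, I = (N / W) · Σᵢⱼ wᵢⱼ (xᵢ − x̄)(xⱼ − x̄) / Σᵢ (xᵢ − x̄)². The function `morans_i_knn` is a hypothetical helper written for illustration. How the library builds its neighbor graph is not specified here, so the kNN weighting is an assumption.

```python
def morans_i_knn(coords, values, k=8):
    """Moran's I using binary weights on each point's k nearest neighbors."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return float("nan")  # undefined for a constant signal
    num = 0.0
    w_total = 0
    for i, (xi, yi) in enumerate(coords):
        # Squared distances to every other point; ties are broken by index.
        dists = sorted(
            ((xi - xj) ** 2 + (yi - yj) ** 2, j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        for _, j in dists[:k]:
            num += dev[i] * dev[j]
            w_total += 1
    return (n / w_total) * (num / denom)

# 4 x 4 grid of spots with two toy expression patterns
grid = [(x, y) for x in range(4) for y in range(4)]
clustered = [1.0 if x >= 2 else 0.0 for x, y in grid]  # left/right halves
checker = [float((x + y) % 2) for x, y in grid]        # alternating pattern

print(morans_i_knn(grid, clustered, k=4))  # positive: similar values cluster
print(morans_i_knn(grid, checker, k=4))    # negative: neighbors differ
```

Positive values indicate that nearby spots carry similar expression, values near zero indicate no spatial structure, and negative values indicate alternating patterns.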
- Visualization: Overlay gene expression or segmentation masks on WSIs for interpretability.
```python
from hest.utils import visualize_overlay

# `coords` and `expr` are assumed to hold spot coordinates and per-spot expression values
fig = visualize_overlay(slide, coords, expr, cmap='coolwarm', alpha=0.6)
```
4. End-to-End Analytical Workflows
The library supports comprehensive, protocolized analyses, with exemplar workflows described in Sections 6 and 7 of Jaume et al. (2024):
- Biomarker exploration: Identify histomorphological correlates of expression in carcinoma samples by segmenting nuclei, averaging per-spot features, and correlating with transcript abundance (e.g., GATA3: nuclear area PCC ≈ 0.47).
- Multimodal representation learning: Construction of paired patch-expression datasets, used to fine-tune vision-language foundation models (e.g., CONCH) with contrastive losses (InfoNCE), enabling subsequent transfer and evaluation on external image cohorts for biomarker classification tasks.
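The contrastive objective mentioned for multimodal representation learning can be sketched in plain Python. The example below implements a symmetric InfoNCE loss over paired patch and expression embeddings. Whether the original work uses the symmetric form, and the temperature value of 0.07, are assumptions made for illustration. Embeddings are assumed to be non-zero vectors.

```python
import math

def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def info_nce(image_emb, expr_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th image and i-th expression vector are a positive pair."""
    img = [l2_normalize(v) for v in image_emb]
    exp = [l2_normalize(v) for v in expr_emb]
    n = len(img)
    # Cosine-similarity logits between every image/expression combination.
    logits = [
        [sum(a * b for a, b in zip(img[i], exp[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def mean_cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_denom = m + math.log(sum(math.exp(z - m) for z in row))
            total += log_denom - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-expression and expression-to-image directions.
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(transposed))

patches = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(patches, [[1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce(patches, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < misaligned)  # True
```

The loss is small when each patch embedding is closest to its own expression embedding and grows when the pairing is scrambled, which is the signal that drives fine-tuning.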
These workflows are implemented with minimal boilerplate, leveraging HEST-Library’s integration with AnnData and major deep learning and visualization frameworks. In Jaume et al. (2024), Section 5 details the model benchmarking interfaces (`hest.benchmark`), and Section 6 (Figure 1) illustrates biomarker studies.
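The core computation of the biomarker exploration workflow, averaging a nuclear feature per spot and correlating it with expression, can be sketched as follows. The data are toy values and the helper names are hypothetical. In practice the nucleus-to-spot assignment would come from the aligned segmentation.

```python
import math
from collections import defaultdict

def mean_feature_per_spot(nucleus_spot_ids, nucleus_values):
    """Average a per-nucleus feature over all nuclei assigned to each spot."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for spot, value in zip(nucleus_spot_ids, nucleus_values):
        sums[spot] += value
        counts[spot] += 1
    return {spot: sums[spot] / counts[spot] for spot in sums}

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Toy data: five nuclei assigned to three spots, plus per-spot expression
spot_ids = ["a", "a", "b", "c", "c"]
areas = [10.0, 20.0, 30.0, 40.0, 50.0]
expression = {"a": 1.0, "b": 2.0, "c": 3.0}

spot_area = mean_feature_per_spot(spot_ids, areas)
spots = sorted(expression)
pcc = pearson([spot_area[s] for s in spots], [expression[s] for s in spots])
```

With real data the coefficient would be computed per gene and per feature, and the strongest associations examined further.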
5. Data Handling, Dependencies, and Performance Considerations
HEST-Library is optimized for high-throughput, interactive, and reproducible workflows:
- Data formats: WSIs are converted to pyramidal TIFF via OpenSlide; transcript matrices are stored in AnnData objects compatible with `scanpy>=1.9`.
- Alignment pipelines: Spot-to-tissue registration employs YOLOv8 for Visium (“faster_fiducial”) and VALIS for Xenium (Appendix Figure A2).
- Preprocessing backends: Tissue and nuclear segmentation utilize DeepLabV3 and CellViT.
- Batch correction: ComBat (`pycombat`), Harmony (`harmonypy`), and MNN (`scanpy.external.pp.mnn_correct`) are implemented for normalization across sample batches.
- Scalability: Lazy WSI loading via OpenSlide; patch extraction and feature quantification support multiprocessing via `num_workers` or `joblib.Parallel`.
- Software stack: Dependencies include `torch>=1.10`, `torchvision`, `yolov8`, `openslide-python`, `scikit-image`, `geopandas`, `scikit-learn`, and `xgboost`.
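To give a feel for what batch correction does, the sketch below applies per-batch mean centering to a single gene. This is a deliberately simplified stand-in that only removes additive batch shifts. It is not ComBat, Harmony, or MNN, and the function name is hypothetical.

```python
from collections import defaultdict

def center_batches(values, batches):
    """Shift each batch so that its mean matches the global mean (one gene)."""
    global_mean = sum(values) / len(values)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, batch in zip(values, batches):
        sums[batch] += value
        counts[batch] += 1
    # Additive offset of every batch relative to the global mean.
    offsets = {b: sums[b] / counts[b] - global_mean for b in sums}
    return [value - offsets[batch] for value, batch in zip(values, batches)]

# Batch B is shifted by +4 relative to batch A in this toy example
values = [1.0, 3.0, 5.0, 7.0]
batches = ["A", "A", "B", "B"]
corrected = center_batches(values, batches)
print(corrected)  # [3.0, 5.0, 3.0, 5.0]
```

Dedicated methods additionally model batch-specific variance or work in a shared embedding space, which matters when batches differ in composition and not only in offset.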
A plausible implication is that the modularity and lazy evaluation scheme favor interactive as well as large-scale batch analyses for both computational biologists and machine learning practitioners.
6. References, Utility, and Broader Impact
In Jaume et al. (2024), HEST-Library is described in Section 4 (“HEST-Library”), with schematic and pipeline details in Appendix Figures A1–A2. It directly supports research in spatial genomics, digital pathology, and multimodal learning, as evidenced by its use in the HEST-Benchmark (Section 5), biomarker analyses (Section 6), and multimodal foundation model research (Section 7, Table 7). The resource is fully open and accessible, with tutorials and code at https://github.com/mahmoodlab/hest, serving as a backbone for reproducible research and method development in spatial multi-omics (Jaume et al., 2024).