HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis

Published 23 Jun 2024 in cs.CV | (2406.16192v2)

Abstract: Spatial transcriptomics enables interrogating the molecular composition of tissue with ever-increasing resolution and sensitivity. However, costs, rapidly evolving technology, and lack of standards have constrained computational methods in ST to narrow tasks and small cohorts. In addition, the underlying tissue morphology, as reflected by H&E-stained whole slide images (WSIs), encodes rich information often overlooked in ST studies. Here, we introduce HEST-1k, a collection of 1,229 spatial transcriptomic profiles, each linked to a WSI and extensive metadata. HEST-1k was assembled from 153 public and internal cohorts encompassing 26 organs, two species (Homo sapiens and Mus musculus), and 367 cancer samples from 25 cancer types. HEST-1k processing enabled the identification of 2.1 million expression–morphology pairs and over 76 million nuclei. To support its development, we additionally introduce the HEST-Library, a Python package designed to perform a range of actions with HEST samples. We test HEST-1k and Library on three use cases: (1) benchmarking foundation models for pathology (HEST-Benchmark), (2) biomarker exploration, and (3) multimodal representation learning. HEST-1k, HEST-Library, and HEST-Benchmark can be freely accessed at https://github.com/mahmoodlab/hest.




Summary

The paper introduces HEST-1k, a comprehensive dataset linking 1,229 spatial transcriptomic profiles with H&E-stained whole slide images across 26 organs from two species.
It details an automated alignment pipeline using YOLOv8 and VALIS methods to standardize data processing and enable robust multimodal analysis.
Benchmark tasks demonstrate a logarithmic scaling law for gene expression prediction and show that fine-tuning large models enhances multimodal representation learning.

An Overview of HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis

The paper "HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis" introduces HEST-1k, an extensive dataset curated to support research at the intersection of spatial transcriptomics (ST) and histology. The dataset addresses three constraints that have limited computational methods in ST to narrow tasks and small cohorts: high costs, rapidly evolving technology, and a lack of standards.

Dataset Composition and Characteristics

HEST-1k consists of 1,229 spatial transcriptomic profiles, each linked to a corresponding hematoxylin and eosin (H&E)-stained whole slide image (WSI) and to extensive metadata. The profiles were assembled from 153 public and internal cohorts covering 26 organs and two species (Homo sapiens and Mus musculus), and include 367 cancer samples from 25 cancer types. By processing these samples, the authors identified 2.1 million expression–morphology pairs and over 76 million nuclei. The dataset is accompanied by the HEST-Library, a Python package for working with and analyzing HEST samples.
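To make the notion of an expression–morphology pair concrete, the following is a minimal sketch of pairing each spot's expression vector with the image patch centred on it. It uses plain NumPy arrays rather than the HEST-Library; the function name `extract_pairs`, the array layout, and the default patch size of 224 pixels are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def extract_pairs(image, spot_xy, expression, patch_size=224):
    """Pair each spot's expression vector with the patch centred on the spot.

    image:      (H, W, 3) array, a hypothetical stand-in for a WSI region
    spot_xy:    (n_spots, 2) spot centres in pixel (x, y) coordinates
    expression: (n_spots, n_genes) expression values
    Spots whose patch would cross the image border are skipped.
    """
    half = patch_size // 2
    height, width = image.shape[:2]
    pairs = []
    for (x, y), counts in zip(spot_xy, expression):
        x, y = int(round(x)), int(round(y))
        inside = (x - half >= 0 and y - half >= 0
                  and x + half <= width and y + half <= height)
        if not inside:
            continue
        # Rows index y and columns index x in image arrays.
        patch = image[y - half:y + half, x - half:x + half]
        pairs.append((patch, counts))
    return pairs
```

Skipping border spots keeps every patch the same shape, which is convenient when patches are later batched for a vision model.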

Applications and Implications

The authors illustrate three primary applications of HEST-1k: benchmarking foundation models for histology with the HEST-Benchmark, biomarker exploration, and multimodal representation learning. The HEST-Benchmark comprises nine tasks for gene expression prediction from histology, evaluated with eleven state-of-the-art models. Models are scored by the Pearson correlation between predicted and measured expression. Predictions come from a ridge regression fitted on PCA-reduced model embeddings, which keeps the comparison fair across models with different embedding dimensions. The benchmark revealed that performance scales logarithmically with the number of trainable parameters, which points to the potential of fine-tuning large models on high-quality data to achieve robust downstream performance.
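The evaluation recipe described here (PCA reduction, ridge regression, Pearson correlation) can be sketched in a few lines of NumPy. This is an illustrative sketch on synthetic data, not the HEST-Benchmark code: the function names, the number of components, and the regularization strength `alpha` are all assumptions.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the training mean and the top principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T

def ridge_fit(Z, Y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    z_mean, y_mean = Z.mean(axis=0), Y.mean(axis=0)
    Zc, Yc = Z - z_mean, Y - y_mean
    W = np.linalg.solve(Zc.T @ Zc + alpha * np.eye(Z.shape[1]), Zc.T @ Yc)
    return W, y_mean - z_mean @ W

def pearson_per_gene(Y_true, Y_pred):
    """Pearson correlation computed independently for each gene (column)."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom

def evaluate(emb_train, y_train, emb_test, y_test, n_components=16, alpha=1.0):
    """Mean per-gene Pearson correlation of PCA + ridge predictions."""
    mean, comps = pca_fit(emb_train, n_components)
    W, b = ridge_fit((emb_train - mean) @ comps, y_train, alpha)
    pred = ((emb_test - mean) @ comps) @ W + b
    return float(pearson_per_gene(y_test, pred).mean())

# Synthetic stand-ins for model embeddings and expression targets.
rng = np.random.default_rng(0)
emb = rng.normal(size=(300, 64))
expr = emb @ rng.normal(size=(64, 5)) + 0.1 * rng.normal(size=(300, 5))
print(evaluate(emb[:200], expr[:200], emb[200:], expr[200:], n_components=32))
```

Fitting the PCA on the training split only, and reusing its mean and components on the test split, avoids leaking test information into the reduced features.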

For biomarker exploration, the authors use the dataset to link morphological characteristics, such as nuclear shape, to gene expression patterns. Using CellViT nuclear segmentation, they found notable correlations between nuclear size features and the expression of genes such as GATA3 in breast cancer samples, suggesting potential prognostic biomarkers.
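The kind of analysis described here can be illustrated with a small, self-contained sketch: average a nuclear size feature over the nuclei in each spot, then correlate it with a gene's expression across spots. The input layout and function names below are hypothetical; the sketch does not call CellViT or the HEST-Library, and it assumes nuclei have already been segmented and assigned to spots.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

def mean_area_per_spot(nuclei):
    """nuclei: iterable of (spot_id, nuclear_area) from a segmentation step."""
    totals = {}
    for spot_id, area in nuclei:
        total, count = totals.get(spot_id, (0.0, 0))
        totals[spot_id] = (total + area, count + 1)
    return {spot: total / count for spot, (total, count) in totals.items()}

def correlate_area_with_gene(nuclei, gene_expression):
    """Correlate mean nuclear area with one gene over spots present in both."""
    areas = mean_area_per_spot(nuclei)
    spots = sorted(set(areas) & set(gene_expression))
    return pearson([areas[s] for s in spots],
                   [gene_expression[s] for s in spots])
```

A correlation computed this way is only descriptive; it does not account for confounders such as tissue type or batch.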

HEST-1k also supports multimodal representation learning by providing spatially resolved expression–morphology pairs. Fine-tuning CONCH (a state-of-the-art model) on this dataset improved its ability to encode molecularly relevant information, and the fine-tuned model outperformed the original version on specific tasks.
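This summary does not state which training objective was used for fine-tuning. One common way to align two modalities from paired data is a symmetric contrastive (InfoNCE-style) loss, sketched below in NumPy as an assumption rather than as the authors' method. The function names and the temperature value are illustrative.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(img_emb, expr_emb, temperature=0.07):
    """Row i of img_emb and row i of expr_emb are treated as a matched pair.

    All other rows in the batch act as negatives, in both directions
    (image -> expression and expression -> image).
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    expr = expr_emb / np.linalg.norm(expr_emb, axis=1, keepdims=True)
    logits = img @ expr.T / temperature  # cosine similarities, scaled
    idx = np.arange(len(logits))
    loss_img_to_expr = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_expr_to_img = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float((loss_img_to_expr + loss_expr_to_img) / 2)
```

The loss is low when each patch embedding is closest to its own expression embedding, and it rises when pairs are mismatched, which is the signal a fine-tuning step would minimize.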

Methodological Contributions and Challenges

A significant methodological contribution is the automatic alignment of ST data with WSIs. The HEST-Library automates this alignment with methods such as YOLOv8-based fiducial detection for Visium samples and the VALIS pipeline for Xenium samples, thereby standardizing the data processing workflow.
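The summary does not describe how detected fiducials are converted into spot positions on the image. One plausible step, shown here purely as an assumption, is a least-squares affine fit that maps known fiducial positions in the array's coordinate frame to their detected pixel positions, and then applies the same transform to the spot coordinates. The sketch uses NumPy only and does not call YOLOv8, VALIS, or the HEST-Library.

```python
import numpy as np

def fit_affine(src_xy, dst_xy):
    """Least-squares affine transform mapping src points onto dst points.

    src_xy: (n, 2) fiducial positions in the array frame
    dst_xy: (n, 2) matching detected positions in image pixels
    Returns a (3, 2) parameter matrix; at least 3 non-collinear points needed.
    """
    src = np.asarray(src_xy, dtype=float)
    dst = np.asarray(dst_xy, dtype=float)
    if src.shape != dst.shape or len(src) < 3:
        raise ValueError("need at least 3 matched 2D points")
    design = np.hstack([src, np.ones((len(src), 1))])  # homogeneous coords
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)
    return params

def apply_affine(params, xy):
    """Map (n, 2) points through the fitted affine transform."""
    xy = np.asarray(xy, dtype=float)
    return np.hstack([xy, np.ones((len(xy), 1))]) @ params
```

With more than three fiducials the fit is overdetermined, so small detection errors are averaged out rather than propagated directly to the spot positions.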

Despite these contributions, the authors acknowledge several limitations. The inherent noise in transcriptomics measurements and batch effects due to technological and procedural variations across cohorts can be substantial. The authors suggest exploring batch effect mitigation further, facilitated by tools in the HEST-Library, to enhance data consistency and model performance.

Future Directions and Conclusion

The HEST-1k dataset is a substantial resource for exploring multimodal interactions in tissue samples, predicting gene expression from histology, and studying the biology underlying tissue morphology in disease. The dataset is designed to grow, with updates expected as additional publicly available cohorts are included. This expansion creates opportunities for better understanding the tumor microenvironment and for developing predictive models of clinical outcomes.

Overall, HEST-1k and its associated tools hold promise for significantly advancing research in spatial transcriptomics and computational pathology, offering insights into the integration of morphological and molecular data at a previously uncharted scale.
