- The paper introduces scCello, a transcriptome foundation model that leverages cell ontology during pre-training to enhance gene co-expression learning.
- It employs masked gene prediction, supervised contrastive loss, and Personalized PageRank for relational alignment, resulting in superior zero-shot clustering and marker gene prediction.
- The model shows robust performance in classifying novel cell types and predicting cancer drug response, highlighting its potential in genomics and personalized medicine.
Cell-ontology guided transcriptome foundation model
The paper "Cell-ontology guided transcriptome foundation model" introduces scCello, a transcriptome foundation model (TFM) that leverages cell ontology during pre-training to improve the biological relevance of learned representations and broaden the model's range of applications. Current TFMs treat cells as independent samples and ignore the taxonomic relationships captured in cell ontology graphs; the authors address this limitation with a novel approach that improves the learning of gene co-expression patterns.
Introduction
Modern single-cell RNA sequencing (scRNA-seq) techniques have enabled the generation of large-scale gene expression data at single-cell resolution. These data present an opportunity to develop TFMs that can decipher the transcriptomic language underlying various cell functions. Despite the richness of scRNA-seq data, existing TFMs do not integrate the taxonomic relationships between cell types, which are crucial for understanding the biological context of gene interactions.
Methodology
scCello Architecture
The scCello model introduces three primary components to incorporate cell ontology knowledge into its pre-training process:
- Masked Gene Prediction (MGP):
- The model uses a masked token prediction loss, akin to BERT's masked language modeling objective, to predict masked genes within a cell's gene expression profile. This objective captures gene co-expression patterns.
- Intra-Cellular Ontology Coherence:
- A supervised contrastive loss encourages the representations of cells of the same cell type to cluster together. To avoid class collapse, this coherence is further regularized by an affine transformation constraint on the cell-type representation space.
- Inter-Cellular Relational Alignment:
- Leveraging Personalized PageRank (PPR) scores to quantify structural similarities between cell types, this component ensures that the learned cell representations reflect the ontology relationships. Cells with closely related cell types are represented closer together in the latent space.
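The relational alignment component relies on Personalized PageRank scores over the ontology graph. A minimal sketch of how such structural similarities can be computed, using a hypothetical toy ontology (the node names, edges, and teleport parameter below are illustrative, not the paper's actual setup):

```python
import networkx as nx

# Hypothetical toy cell ontology: two T-cell subtypes are structurally
# closer to each other than either is to a B cell.
ontology = nx.DiGraph()
ontology.add_edges_from([
    ("cd4_t_cell", "t_cell"),
    ("cd8_t_cell", "t_cell"),
    ("t_cell", "lymphocyte"),
    ("b_cell", "lymphocyte"),
    ("lymphocyte", "cell"),
])

def ppr_similarity(graph, source, alpha=0.85):
    """Personalized PageRank scores seeded at `source`.

    Random walks restart at `source` with probability (1 - alpha), so
    higher scores indicate cell types structurally closer to `source`.
    """
    return nx.pagerank(graph.to_undirected(), alpha=alpha,
                       personalization={source: 1.0})

scores = ppr_similarity(ontology, "cd4_t_cell")
# The sibling subtype (shared parent "t_cell") scores higher than the
# more distant B cell, so representations of CD4 and CD8 T cells would
# be pulled closer together in the latent space.
print(scores["cd8_t_cell"] > scores["b_cell"])
```

In scCello these graph-derived similarities supervise pairwise distances between cell representations, so ontology structure is reflected in the latent space without hand-crafted distance thresholds.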
Pre-training and Data Preparation
The scCello model was pre-trained on 22 million cells from the CellxGene database, with cell-type labels aligned to the Open Biological and Biomedical Ontology (OBO) Foundry's cell ontology graph. The authors propose a unified pre-training objective that integrates the three components mentioned above, ensuring a comprehensive and biologically informed learning process.
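The intra-cellular coherence term in the unified objective is a supervised contrastive loss. A hedged NumPy sketch of such a loss, assuming L2-normalized embeddings and a temperature-scaled softmax over in-batch similarities (this is a generic formulation, not the paper's exact implementation):

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of cell embeddings.

    Cells sharing a cell-type label act as positives and are pulled
    together; all other cells in the batch act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    mask_self = np.eye(n, dtype=bool)
    # Exclude self-similarity, then take a row-wise log-softmax.
    sim_masked = np.where(mask_self, -np.inf, sim)
    log_prob = sim_masked - np.log(np.exp(sim_masked).sum(axis=1, keepdims=True))
    losses = []
    for i in range(n):
        positives = (labels == labels[i]) & ~mask_self[i]
        if positives.any():
            losses.append(-log_prob[i, positives].mean())
    return float(np.mean(losses))
```

Batches in which same-type cells are already close in embedding space yield a lower loss, which is exactly the aggregation behavior the intra-cellular coherence term rewards.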
Experiments
Zero-shot Cell Type Clustering
The scCello model was evaluated on in-distribution (ID) datasets and out-of-distribution (OOD) datasets, covering scenarios such as unseen cell types, tissues, and donors. The model demonstrated superior performance in zero-shot cell type clustering tasks, significantly outperforming other TFMs like Geneformer, scGPT, scTab, and UCE across various metrics such as normalized mutual information (NMI), adjusted rand index (ARI), and average silhouette width (ASW).
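The zero-shot protocol above (cluster frozen embeddings, then compare cluster assignments with ground-truth cell-type labels) can be sketched with standard scikit-learn metrics; the synthetic two-cluster embeddings below stand in for the model's actual outputs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)

rng = np.random.default_rng(0)
# Stand-in for frozen TFM embeddings: two well-separated cell types.
embeddings = np.vstack([rng.normal(0.0, 0.1, (50, 8)),
                        rng.normal(3.0, 0.1, (50, 8))])
true_types = np.array([0] * 50 + [1] * 50)

# Zero-shot evaluation: no fine-tuning, just cluster the embeddings.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

nmi = normalized_mutual_info_score(true_types, pred)   # label agreement
ari = adjusted_rand_score(true_types, pred)            # pair agreement
asw = silhouette_score(embeddings, true_types)         # cluster separation
print(f"NMI={nmi:.2f}  ARI={ari:.2f}  ASW={asw:.2f}")
```

NMI and ARI measure agreement between predicted clusters and true labels, while ASW measures how well the true cell types separate in the embedding space itself.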
Fine-tuning for Cell Type Classification
In fine-tuning experiments, scCello maintained its advantage. When fine-tuned on subsets of the pre-training data, it outperformed competing TFMs in both classification and clustering metrics on the ID dataset, illustrating strong transferability and generalization.
Novel Cell Type Classification
A key strength of scCello is its ability to classify novel cell types in a zero-shot fashion. By utilizing the cell ontology graph and pre-trained representations, scCello achieves high accuracy in annotating novel cell types, outperforming other TFMs by a significant margin in tasks involving the classification of numerous unseen cell types.
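One simple way to realize zero-shot annotation of unseen types is nearest-prototype matching: assign each cell to the ontology cell-type node whose representative embedding is most similar. The function, prototype dictionary, and names below are hypothetical illustrations, not the paper's interface:

```python
import numpy as np

def classify_by_prototype(cell_embedding, type_prototypes):
    """Assign a cell to the cell-type prototype with the highest
    cosine similarity to its embedding (a zero-shot sketch)."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(type_prototypes, key=lambda t: cosine(cell_embedding, type_prototypes[t]))

# Prototypes could be, e.g., mean embeddings of reference cells per
# ontology node, including novel types never seen during pre-training.
prototypes = {
    "plasma_cell": np.array([0.9, 0.1, 0.0]),
    "nk_cell": np.array([0.0, 0.2, 0.9]),
}
label = classify_by_prototype(np.array([0.8, 0.2, 0.1]), prototypes)
```

Because the prototype set is defined by the ontology rather than by a fixed classification head, new cell types can be added at inference time without retraining.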
Marker Gene Prediction
In binary classification tasks for predicting cell-type-specific marker genes, scCello again excelled, demonstrating a higher area under the receiver operating characteristic curve (AUROC) compared to other TFMs. This highlights the model's capability to capture and generalize biologically meaningful gene interactions.
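The evaluation reduces to a standard binary-classification AUROC: for each cell type, genes are labeled marker or non-marker and ranked by a model-derived score. A small sketch with illustrative labels and scores (the scoring function itself is the model's, not shown here):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative per-gene data for one cell type: 1 = known marker gene.
is_marker = np.array([1, 1, 0, 0, 1, 0])
# Hypothetical model scores; higher should mean "more marker-like".
model_scores = np.array([0.9, 0.8, 0.3, 0.2, 0.7, 0.4])

# AUROC = probability a random marker outranks a random non-marker;
# in practice this is computed per cell type and then averaged.
auroc = roc_auc_score(is_marker, model_scores)
```

An AUROC of 0.5 corresponds to random ranking, 1.0 to a ranking that places every marker above every non-marker.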
Cancer Drug Response Prediction
Using the DeepCDR framework for cancer drug response prediction, scCello achieved competitive performance, demonstrating that the learned cell representations can effectively transfer to specialized tasks.
Robustness to Batch Effects
The scCello model exhibited robustness against batch effects arising from different experimental conditions. It maintained high batch integration scores across various ID and OOD datasets, emphasizing its utility in handling heterogeneous scRNA-seq data.
Implications and Future Directions
The integration of cell ontology knowledge into the pre-training process of TFMs, as demonstrated by scCello, offers a promising direction for improving the biological relevance of learned representations. Practically, this approach enhances the model's ability to generalize to novel cell types and different biological contexts, making it a valuable tool in genomics and personalized medicine. Theoretically, it suggests that incorporating domain-specific knowledge into foundation models can significantly enhance both their performance and their range of applications.
Future developments may explore scaling the model size for increased expressiveness, investigating efficient fine-tuning methods for continual learning of updated ontologies, and further refining the method to distinguish essential genes from marker genes more effectively.
Conclusion
The scCello model presents a significant advancement in the field of transcriptome foundation models by effectively integrating cell ontology into the pre-training process. Through comprehensive evaluation, the model has demonstrated superior performance in various biologically important tasks, showcasing its potential to facilitate scientific discoveries in genomics and personalized medicine. The approach of embedding domain-specific knowledge into foundation models opens new avenues for enhancing the biological applicability of machine learning models in life sciences.