- The paper introduces scCello, a transcriptome foundation model that leverages cell ontology during pre-training to enhance gene co-expression learning.
- It employs masked gene prediction, supervised contrastive loss, and Personalized PageRank for relational alignment, resulting in superior zero-shot clustering and marker gene prediction.
- The model shows robust performance in classifying novel cell types and predicting cancer drug response, highlighting its potential in genomics and personalized medicine.
Cell-ontology guided transcriptome foundation model
The paper "Cell-ontology guided transcriptome foundation model" introduces scCello, a transcriptome foundation model (TFM) that leverages cell ontology during pre-training to improve the biological relevance of learned representations and broaden the model's range of applications. Current TFMs treat cells as independent samples and ignore the taxonomic relationships captured in cell ontology graphs; the authors address this limitation with a novel approach that improves the learning of gene co-expression patterns.
Introduction
Modern single-cell RNA sequencing (scRNA-seq) techniques have enabled the generation of large-scale gene expression data at single-cell resolution. These data present an opportunity to develop TFMs that can decipher the transcriptomic language underlying various cell functions. Despite the richness of scRNA-seq data, existing TFMs do not integrate the taxonomic relationships between cell types, which are crucial for understanding the biological context of gene interactions.
Methodology
scCello Architecture
The scCello model introduces three primary components to incorporate cell ontology knowledge into its pre-training process:
- Masked Gene Prediction (MGP):
- The model uses a masked token prediction loss, akin to BERT's masked language modeling objective, to predict masked genes within a cell's gene expression profile. This objective captures gene co-expression patterns.
- Intra-Cellular Ontology Coherence:
- A supervised contrastive loss encourages the representations of cells of the same cell type to cluster together. To avoid class collapse, this coherence is further regularized by an affine transformation constraint on the cell-type representation space.
- Inter-Cellular Relational Alignment:
- Leveraging Personalized PageRank (PPR) scores to quantify structural similarities between cell types, this component ensures that the learned cell representations reflect the ontology relationships. Cells with closely related cell types are represented closer together in the latent space.
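The relational alignment component relies on Personalized PageRank scores over the ontology graph. A minimal sketch of how such structural similarities can be computed, using a hypothetical toy ontology (the node names, edges, and teleport parameter below are illustrative, not the paper's actual setup):

```python
import networkx as nx

# Hypothetical toy cell ontology: two T-cell subtypes are structurally
# closer to each other than either is to a B cell.
ontology = nx.DiGraph()
ontology.add_edges_from([
    ("cd4_t_cell", "t_cell"),
    ("cd8_t_cell", "t_cell"),
    ("t_cell", "lymphocyte"),
    ("b_cell", "lymphocyte"),
    ("lymphocyte", "cell"),
])

def ppr_similarity(graph, source, alpha=0.85):
    """Personalized PageRank scores seeded at `source`.

    Random walks restart at `source` with probability (1 - alpha), so
    higher scores indicate cell types structurally closer to `source`.
    """
    return nx.pagerank(graph.to_undirected(), alpha=alpha,
                       personalization={source: 1.0})

scores = ppr_similarity(ontology, "cd4_t_cell")
# The sibling subtype (shared parent "t_cell") scores higher than the
# more distant B cell, so representations of CD4 and CD8 T cells would
# be pulled closer together in the latent space.
print(scores["cd8_t_cell"] > scores["b_cell"])
```

In scCello these graph-derived similarities supervise pairwise distances between cell representations, so ontology structure is reflected in the latent space without hand-crafted distance thresholds.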
Pre-training and Data Preparation
The scCello model was pre-trained on 22 million cells from the CellxGene database, with cell-type labels aligned to the Open Biological and Biomedical Ontology (OBO) Foundry's cell ontology graph. The authors propose a unified pre-training objective that integrates the three components mentioned above, ensuring a comprehensive and biologically informed learning process.
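The intra-cellular coherence term in the unified objective is a supervised contrastive loss. A hedged NumPy sketch of such a loss, assuming L2-normalized embeddings and a temperature-scaled softmax over in-batch similarities (this is a generic formulation, not the paper's exact implementation):

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of cell embeddings.

    Cells sharing a cell-type label act as positives and are pulled
    together; all other cells in the batch act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    mask_self = np.eye(n, dtype=bool)
    # Exclude self-similarity, then take a row-wise log-softmax.
    sim_masked = np.where(mask_self, -np.inf, sim)
    log_prob = sim_masked - np.log(np.exp(sim_masked).sum(axis=1, keepdims=True))
    losses = []
    for i in range(n):
        positives = (labels == labels[i]) & ~mask_self[i]
        if positives.any():
            losses.append(-log_prob[i, positives].mean())
    return float(np.mean(losses))
```

Batches in which same-type cells are already close in embedding space yield a lower loss, which is exactly the aggregation behavior the intra-cellular coherence term rewards.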
Experiments
Zero-shot Cell Type Clustering
The scCello model was evaluated on in-distribution (ID) datasets and out-of-distribution (OOD) datasets, covering scenarios such as unseen cell types, tissues, and donors. The model demonstrated superior performance in zero-shot cell type clustering tasks, significantly outperforming other TFMs like Geneformer, scGPT, scTab, and UCE across various metrics such as normalized mutual information (NMI), adjusted rand index (ARI), and average silhouette width (ASW).
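The zero-shot protocol above (cluster frozen embeddings, then compare cluster assignments with ground-truth cell-type labels) can be sketched with standard scikit-learn metrics; the synthetic two-cluster embeddings below stand in for the model's actual outputs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)

rng = np.random.default_rng(0)
# Stand-in for frozen TFM embeddings: two well-separated cell types.
embeddings = np.vstack([rng.normal(0.0, 0.1, (50, 8)),
                        rng.normal(3.0, 0.1, (50, 8))])
true_types = np.array([0] * 50 + [1] * 50)

# Zero-shot evaluation: no fine-tuning, just cluster the embeddings.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

nmi = normalized_mutual_info_score(true_types, pred)   # label agreement
ari = adjusted_rand_score(true_types, pred)            # pair agreement
asw = silhouette_score(embeddings, true_types)         # cluster separation
print(f"NMI={nmi:.2f}  ARI={ari:.2f}  ASW={asw:.2f}")
```

NMI and ARI measure agreement between predicted clusters and true labels, while ASW measures how well the true cell types separate in the embedding space itself.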
Fine-tuning for Cell Type Classification
In fine-tuning experiments, scCello maintained its advantage. When fine-tuned on subsets of the pre-training data, it outperformed competing TFMs in both classification and clustering metrics on the ID dataset, illustrating strong transferability and generalization.
Novel Cell Type Classification
A key strength of scCello is its ability to classify novel cell types in a zero-shot fashion. By utilizing the cell ontology graph and pre-trained representations, scCello achieves high accuracy in annotating novel cell types, outperforming other TFMs by a significant margin in tasks involving the classification of numerous unseen cell types.
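One simple way to realize zero-shot annotation of unseen types is nearest-prototype matching: assign each cell to the ontology cell-type node whose representative embedding is most similar. The function, prototype dictionary, and names below are hypothetical illustrations, not the paper's interface:

```python
import numpy as np

def classify_by_prototype(cell_embedding, type_prototypes):
    """Assign a cell to the cell-type prototype with the highest
    cosine similarity to its embedding (a zero-shot sketch)."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(type_prototypes, key=lambda t: cosine(cell_embedding, type_prototypes[t]))

# Prototypes could be, e.g., mean embeddings of reference cells per
# ontology node, including novel types never seen during pre-training.
prototypes = {
    "plasma_cell": np.array([0.9, 0.1, 0.0]),
    "nk_cell": np.array([0.0, 0.2, 0.9]),
}
label = classify_by_prototype(np.array([0.8, 0.2, 0.1]), prototypes)
```

Because the prototype set is defined by the ontology rather than by a fixed classification head, new cell types can be added at inference time without retraining.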
Marker Gene Prediction
In binary classification tasks for predicting cell-type-specific marker genes, scCello again excelled, demonstrating a higher area under the receiver operating characteristic curve (AUROC) compared to other TFMs. This highlights the model's capability to capture and generalize biologically meaningful gene interactions.
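The evaluation reduces to a standard binary-classification AUROC: for each cell type, genes are labeled marker or non-marker and ranked by a model-derived score. A small sketch with illustrative labels and scores (the scoring function itself is the model's, not shown here):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative per-gene data for one cell type: 1 = known marker gene.
is_marker = np.array([1, 1, 0, 0, 1, 0])
# Hypothetical model scores; higher should mean "more marker-like".
model_scores = np.array([0.9, 0.8, 0.3, 0.2, 0.7, 0.4])

# AUROC = probability a random marker outranks a random non-marker;
# in practice this is computed per cell type and then averaged.
auroc = roc_auc_score(is_marker, model_scores)
```

An AUROC of 0.5 corresponds to random ranking, 1.0 to a ranking that places every marker above every non-marker.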
Cancer Drug Response Prediction
Using the DeepCDR framework for cancer drug response prediction, scCello achieved competitive performance, demonstrating that the learned cell representations can effectively transfer to specialized tasks.
Robustness to Batch Effects
The scCello model exhibited robustness against batch effects arising from different experimental conditions. It maintained high batch integration scores across various ID and OOD datasets, emphasizing its utility in handling heterogeneous scRNA-seq data.
Implications and Future Directions
The integration of cell ontology knowledge into the pre-training process of TFMs, as demonstrated by scCello, offers a promising direction for improving the biological relevance of learned representations. Practically, this approach enhances the model's ability to generalize to novel cell types and different biological contexts, making it a valuable tool in genomics and personalized medicine. Theoretically, it suggests that incorporating domain-specific knowledge into foundation models can significantly enhance both their performance and their range of applications.
Future developments may explore scaling the model size for increased expressiveness, investigating efficient fine-tuning methods for continual learning of updated ontologies, and further refining the method to distinguish essential genes from marker genes more effectively.
Conclusion
The scCello model presents a significant advancement in the field of transcriptome foundation models by effectively integrating cell ontology into the pre-training process. Through comprehensive evaluation, the model has demonstrated superior performance in various biologically important tasks, showcasing its potential to facilitate scientific discoveries in genomics and personalized medicine. The approach of embedding domain-specific knowledge into foundation models opens new avenues for enhancing the biological applicability of machine learning models in life sciences.