
OKR-CELL: Open-world Single-Cell Foundation Model

Updated 16 January 2026
  • The paper introduces OKR-CELL, a robust foundation model that integrates large-scale biomedical language with scRNA-seq profiles for accurate cell-type annotation and clustering.
  • It employs a novel retrieval-augmented generation pipeline and a Cross-modal Robust Alignment loss to overcome noise and heterogeneity in single-cell data.
  • Empirical evaluations show significant improvements, with enhanced zero-shot and few-shot performance (e.g., ARI up to 0.415 and accuracy gains up to 12 percentage points) over existing models.

The Open-world Language Knowledge-Aided Robust Single-Cell Foundation Model (OKR-CELL) is a foundation model for single-cell analysis, integrating large-scale open-world biological knowledge and robust cross-modal pre-training to achieve state-of-the-art annotation, clustering, and retrieval of single-cell RNA-seq data in the presence of noise and data heterogeneity. OKR-CELL introduces a cross-modal cell–language framework employing retrieval-augmented generation (RAG) to fuse external biomedical literature with structured cell metadata and a Cross-modal Robust Alignment (CRA) objective incorporating reliability assessment, momentum contrast, and curriculum learning. The model demonstrates significant advances over prior approaches in zero-shot and few-shot cell-type annotation, cell clustering, batch-effect correction, and knowledge-grounded cross-modal retrieval (Wang et al., 9 Jan 2026).

1. Cross-Modal Data Integration and Textual Enrichment

OKR-CELL utilizes an expansive cross-modal cell–language training paradigm to address the limitations of single-modality pre-trained LLMs (PLMs), which lack deep cell identity understanding and robust generalization. The data integration framework unifies scRNA-seq profiles and in-depth text descriptions sourced from both curation-derived metadata and open-world biomedical knowledge.

  • Metadata schema: Each cell is represented by both its normalized gene expression vector (top 1,200 highly variable genes) and natural language descriptions comprising at least nine metadata fields (e.g., cell type, tissue, disease, donor attributes, experimental protocol) drawn from SCxGEN-32M (curated from 1,350 studies, 32M cell–text pairs) (Wang et al., 9 Jan 2026).
  • Retrieval-augmented generation (RAG): Free-form cell descriptions are augmented using a RAG pipeline. First, ∼2 million PubMed abstracts are embedded (BioBERT) and indexed. For a given cell metadata query, the system retrieves relevant literature snippets, which are then synthesized, along with the original metadata, by an LLM (Deepseek-V3) into context-rich, 550–600 word cell narratives.
  • Reliability screening: Generated texts are embedded with Clinical-Longformer and their cosine similarity with the original metadata description is computed; only texts surpassing a threshold α are retained, ensuring semantic consistency and factual fidelity of the augmented descriptions.
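The reliability screen in the last bullet can be sketched as a simple cosine-similarity filter. The toy vectors below stand in for Clinical-Longformer embeddings, and the function name and threshold value are illustrative assumptions, not the paper's code:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def screen_generated_texts(generated_embs, metadata_embs, alpha=0.8):
    """Keep only generated descriptions whose embedding stays within
    cosine similarity alpha of the original metadata embedding."""
    keep = []
    for i, (g, m) in enumerate(zip(generated_embs, metadata_embs)):
        if cosine(g, m) >= alpha:
            keep.append(i)
    return keep

# Toy 3-dim "embeddings" standing in for Clinical-Longformer output.
gen = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
meta = [np.array([0.9, 0.1, 0.0]), np.array([1.0, 0.0, 0.0])]
print(screen_generated_texts(gen, meta, alpha=0.8))  # → [0]
```

Only the first pair survives: its generated text points in nearly the same direction as its metadata, while the second pair is orthogonal and is discarded.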

This pipeline allows OKR-CELL to inject up-to-date, open-world knowledge directly into the training process, enabling broad domain adaptation and continual refresh as new studies emerge (Wang et al., 9 Jan 2026). The approach draws on insights from earlier language–cell alignment strategies such as LangCell’s ontology-enriched metadata (Zhao et al., 2024) and graph-based knowledge augmentation as implemented in ReCellTy (Han et al., 24 Apr 2025).

2. Model Architecture

OKR-CELL is composed of modular encoders and alignment pipelines designed for scalable, robust cross-modal representation.

  • Cell encoder: The backbone is based on scGPT, utilizing transformer blocks (6 layers, 4 attention heads, 128-dimensional hidden states). Cells are encoded as sequences of highly variable gene tokens, each with expression bin and gene ID embeddings. A ⟨cls⟩ token pools cell-level features (Wang et al., 9 Jan 2026).
  • Text encoder: Clinical-Longformer encodes full cell descriptions (max length 4,096 tokens, 768-dim embedding), designed for long biomedical texts.
  • Cross-modal projector: A linear transformation maps the cell encoder's [CLS] output to the text embedding space.
  • Momentum memory bank: Both cell and text features are tracked via momentum encoders, feeding coupled circular buffers for large-scale, stable negative sampling during contrastive learning.
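A minimal sketch of the momentum memory bank idea described above, assuming a plain circular buffer plus an exponential-moving-average parameter update; the class and method names are hypothetical, since the summary does not publish implementation details:

```python
import numpy as np

class MomentumMemoryBank:
    """Circular buffer of features for large-scale negative sampling.
    New batches overwrite the oldest entries, keeping the pool fixed-size."""
    def __init__(self, dim, size):
        self.bank = np.zeros((size, dim))
        self.ptr = 0
        self.size = size

    def enqueue(self, feats):
        """Insert a batch of feature vectors at the write pointer, wrapping around."""
        for f in feats:
            self.bank[self.ptr] = f
            self.ptr = (self.ptr + 1) % self.size

def momentum_update(theta_m, theta, tau=0.999):
    """EMA update of the momentum encoder: theta_m <- tau*theta_m + (1-tau)*theta."""
    return tau * theta_m + (1.0 - tau) * theta
```

In the coupled setting, one such buffer would be kept per modality (cell and text), both fed by their respective momentum encoders so the negative pool drifts slowly and stays stable across steps.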

This architecture reflects, and extends, design features from LangCell’s dual encoder (cell and PubMedBERT text) with transformer cross-attention (Zhao et al., 2024) and ReCellTy’s explicit separation of query modules by data type (Han et al., 24 Apr 2025).

3. Cross-Modal Robust Alignment Objective

Robust cross-modal learning in OKR-CELL is driven by the Cross-modal Robust Alignment (CRA) loss, designed to address label noise, data misalignment, and unreliable supervision typical of large-scale single-cell corpora.

Three principal mechanisms constitute the CRA framework:

  1. Coupled Momentum Memory Bank (CMMB): Implements large, stable negative pools over cell and text embeddings via dual momentum encoders. At each step, N batch features are enqueued; momentum-encoder parameters are updated as θ^m ← τ θ^m + (1 − τ) θ for temporal smoothing (Wang et al., 9 Jan 2026).
  2. Positive-pair reliability assessment: Each cell–text pair is assigned a dynamic weight based on (a) symmetry score—a softmax-normalized measure over both cell-to-text and text-to-cell similarities—and (b) temporal stability—the variance of the reliability score across recent epochs. The composite weight W^pos_{i,i} modulates the loss contribution of each positive pair.
  3. Progressive sample weighting (curriculum learning): A power-law schedule γ(e) controls the emphasis on “hard” negatives as training progresses, using negative-pair reweighting weights w^PSW_{i,j} constructed from contrastive loss magnitude and the epoch schedule.

The full CRA loss is L_CRA = L_pos + λ · w^PSW_{i,j} · L_neg, where L_pos and L_neg are the reliability-weighted contrastive components and λ is a hyperparameter. The final objective combines this with gene expression prediction losses: L = L_GEP + L_GEPC + L_CRA (Wang et al., 9 Jan 2026).
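Under stated assumptions (an InfoNCE-style positive term and a simple linear penalty on reweighted negatives, neither of which the summary fully specifies), the CRA combination might be sketched as:

```python
import numpy as np

def gamma_schedule(epoch, max_epoch, p=2.0):
    """Power-law curriculum schedule: emphasis on hard negatives grows with epoch."""
    return (epoch / max_epoch) ** p

def cra_loss(sim, w_pos, w_psw, lam=1.0, temp=0.1):
    """Illustrative CRA combination on a cell-by-text similarity matrix `sim`.
    Diagonal entries are positive pairs, weighted by reliability scores w_pos;
    off-diagonal negatives are reweighted by curriculum weights w_psw.
    The exact normalization and hyperparameters are assumptions."""
    n = sim.shape[0]
    logits = sim / temp
    exp = np.exp(logits)
    # Reliability-weighted positive term: weighted InfoNCE over rows.
    pos = -np.mean(w_pos * np.log(np.diag(exp) / exp.sum(axis=1)))
    # Curriculum-weighted negative term: penalize reweighted off-diagonal similarity.
    mask = 1.0 - np.eye(n)
    neg = np.mean(w_psw * mask * logits)
    return pos + lam * neg
```

The schedule would feed into the construction of `w_psw` each epoch, so early training treats negatives uniformly while later epochs up-weight the pairs the model still confuses.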

Ablation studies demonstrate the necessity of both LLM-augmented text and full CRA loss for optimal performance. Intermediate settings lacking either yield up to 6–7% lower accuracy and F1 on representative datasets.

4. Empirical Evaluation and Benchmarking

OKR-CELL’s efficacy is validated across a broad spectrum of supervised and unsupervised single-cell analytic tasks, compared to baseline PLMs (Geneformer, scBERT, scFoundation, scGPT, LangCell, scCLIP-GPT).

  • Cell clustering: On datasets such as Blood, Kidney, and hPancreas, OKR-CELL achieves adjusted Rand index (ARI) up to 0.415 and adjusted mutual information (AMI) up to 0.732.
  • Batch-effect correction: Using metrics such as AvgBIO-B and AvgBatch, OKR-CELL leads across all tested settings (e.g., hPancreas AvgBatch = 0.790).
  • Cell-type annotation: In-domain and few-shot annotation on Eye, Kidney, Small Intestine, Spleen, hPBMC, and Zheng68k yields 2–12 percentage point improvements in accuracy and macro-F1 over alternatives. In 9-shot Zheng68k, OKR-CELL achieves 0.787 accuracy versus scCLIP-GPT (0.554) and LangCell (0.190).
  • Zero-shot transfer: For completely novel tissue and species (Eye, Prostate, Great Apes), OKR-CELL yields zero-shot accuracy up to 0.511, vastly surpassing LangCell (0.145) (Wang et al., 9 Jan 2026).
  • Cross-modal retrieval: On SCxGEN-CT5K, bidirectional Recall@1/5/10 for cell→text is 3.32%/19.7%/42.9%; competitive models reach only 0.67% for R@1.
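The Recall@k metric used for the retrieval results above can be computed from a cell→text similarity matrix as follows (toy 3×3 matrix, not the SCxGEN-CT5K evaluation code):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries (rows) whose matching item (same index)
    appears among the top-k columns ranked by similarity."""
    ranks = np.argsort(-sim, axis=1)  # descending similarity per row
    hits = [i in ranks[i, :k] for i in range(sim.shape[0])]
    return float(np.mean(hits))

sim = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.1, 0.8],
                [0.2, 0.7, 0.4]])
print(recall_at_k(sim, 1))  # → 0.333...  only query 0 ranks its match first
```

Recall@5 and Recall@10 in the benchmark are the same computation with larger k, run in both the cell→text and text→cell directions.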

An overview of select results:

| Task | Metric | OKR-CELL | Closest Baseline |
|---|---|---|---|
| Cell clustering (Kidney) | ARI | 0.415 | |
| Batch effect (hPancreas) | AvgBatch | 0.790 | scGPT (lower) |
| Few-shot annotation (Zheng68k) | Acc (9-shot) | 0.787 | 0.554 (scCLIP-GPT) |
| Zero-shot annotation (Prostate) | Acc | 0.511 | 0.145 (LangCell) |
| Cell→Text retrieval | R@1 | 3.32% | 0.67% (scCLIP-GPT) |

These results demonstrate substantial improvements in open-world, generalizable annotation and semantic alignment, particularly for novel cell types and in highly noisy, batch-heterogeneous regimes (Wang et al., 9 Jan 2026).

5. Robustness to Noise and Data Distribution Shifts

Robustness to technical and biological noise is a defining characteristic of OKR-CELL.

  • Gene dropout: When up to 50% of genes are masked, OKR-CELL’s accuracy drops by ≤5.1 percentage points (pp), while competitive models degrade more (e.g., scCLIP-GPT by 5.6 pp). It maintains >97% accuracy up to 40% gene dropout on hPBMC (Wang et al., 9 Jan 2026).
  • Noisy cross-modal links: With random permutation of 30% of gene or metadata fields, OKR-CELL maintains or improves F1 (e.g., +5.5 pp on colon, unlike −6 pp for scCLIP-GPT).
  • Zero-shot under perturbation: Zero-shot F1 for abnormal Eye tissue increases from 63.5% to 77.1% when trained with simulated noisy modalities.
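The gene-dropout perturbation in the first bullet can be simulated with a toy masking routine (the function name and matrix are illustrative, not the authors' evaluation code):

```python
import numpy as np

def apply_gene_dropout(expr, dropout_frac, rng):
    """Zero out a random fraction of gene columns across all cells,
    mimicking the gene-dropout robustness test (up to 50% masked)."""
    n_genes = expr.shape[1]
    n_drop = int(round(dropout_frac * n_genes))
    dropped = rng.choice(n_genes, size=n_drop, replace=False)
    masked = expr.copy()
    masked[:, dropped] = 0.0
    return masked

rng = np.random.default_rng(0)
expr = np.ones((4, 10))  # toy matrix: 4 cells x 10 genes
masked = apply_gene_dropout(expr, 0.4, rng)
print(int((masked == 0).sum(axis=1)[0]))  # → 4  (40% of 10 genes zeroed per cell)
```

Feeding such masked matrices through a trained model and comparing annotation accuracy against the unmasked baseline reproduces the shape of the dropout experiment reported above.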

These results indicate that the structured reliability modeling in CRA, and the use of large memory banks and curriculum, are critical to consistency and performance under realistic, imperfect data.

6. Relation to Prior Foundation Models and Knowledge-Augmented Pipelines

OKR-CELL represents a convergence and extension of prior approaches to integrating domain knowledge and robust annotation:

  • LangCell (Zhao et al., 2024): Established cross-modal pre-training between cell expression transformers and cell-identity enriched natural language, leveraging multi-objective contrastive and matching losses. OKR-CELL generalizes this paradigm with open-world RAG, multi-layer reliability assessment, and curriculum weighting, showing higher zero-shot and few-shot transfer.
  • ReCellTy (Han et al., 24 Apr 2025): Developed a graph-augmented, retrieval-augmented LLM workflow for annotation, utilizing structured marker-feature knowledge graphs and modular LLM agents. OKR-CELL advances this with cross-modal deep encoders, dynamic negative mining, and end-to-end robust learning, while incorporating retrieval-augmented cell descriptions as input.
  • Technical trajectories: Joint embedding spaces, coupled positive/negative sample management, and explicit alignment objectives are recognized as essential for knowledge-aided open-world generalization.

A plausible implication is that the integration of multi-omic KGs, soft-prompt meta-adaptation, and user-in-the-loop continual updates—as suggested by ReCellTy—would further enhance the robustness, coverage, and explainability of foundation models for single-cell analysis.

7. Limitations and Prospects

While OKR-CELL establishes new benchmarks, current limitations include:

  • Focus on scRNA-seq plus text—extension to ATAC-seq, proteomics, and joint modality cell-level supervision is pending.
  • The RAG–LLM pipeline may still hallucinate facts; more faithful retrieval and human-in-the-loop validation are necessary for clinical-grade applications.
  • Compute costs remain high—parameter-efficient fine-tuning and distillation are anticipated future solutions.
  • Ongoing work aims to enable fully interactive, chat-based “CellWhisperer 2.0” LLMs capable of real-time reasoning and annotation over complex single-cell data streams (Wang et al., 9 Jan 2026, Han et al., 24 Apr 2025).

Together, these developments position OKR-CELL and its design lineage as central frameworks for knowledge-guided, open-world analysis across the single-cell omics ecosystem.
