CelloAI: LLMs for Cell Annotation & HPC Code

Updated 27 August 2025
  • CelloAI is a system leveraging advanced LLMs, RAG, and RL to deliver structured, explainable cell type annotation from scRNA-seq data and expert-level HPC code understanding.
  • It combines supervised fine-tuning with reinforcement learning to achieve global consistency, interpretable batch-level predictions, and robust code dependency analysis.
  • Designed for on-premises deployment, CelloAI ensures data privacy, secure code modification, and scalable performance in both computational biology and high-energy physics applications.

CelloAI is a series of systems and methodologies leveraging LLMs, retrieval-augmented generation (RAG), and reasoning-driven architectures to address two principal domains: (1) structured, explainable cell type annotation from single-cell RNA sequencing (scRNA-seq) data, and (2) expert-level code understanding and generation in high-performance computing (HPC), notably in High Energy Physics (HEP). CelloAI architectures are defined by their use of advanced LLMs, reinforcement learning for structured tasks, and rigorous contextualization through retrieval and dependency modeling, achieving state-of-the-art performance and interpretability across disparate scientific domains.

1. Foundations and Objectives

CelloAI was designed to address two fundamental bottlenecks: (i) the labor-intensive, context-dependent annotation of cell types in scRNA-seq analysis, and (ii) the complexity of legacy, sparsely documented HEP codebases within HPC ecosystems (Fang et al., 3 Jun 2025, Atif et al., 22 Aug 2025).

For scRNA-seq analysis, CelloAI (also referred to as Cell-o1) reformulates standard annotation pipelines, which historically labeled each cell independently, into batch-level structured reasoning tasks. The goal is to jointly assign unique, globally consistent cell type annotations across a cell batch, mimicking the context-driven strategy of human experts.
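
For concreteness, a batch-level task pairs donor/tissue context and per-cell marker genes with a pool of candidate labels that must each be used exactly once. The sketch below is illustrative; the field names and exact CellPuzzles format are assumptions, while the marker genes shown are canonical (CD3D/IL7R for T cells, MS4A1/CD79A for B cells, LYZ/CD14 for monocytes).

```python
# Hypothetical batch-level annotation task: one prompt covers a whole batch of
# cells, and the answer must assign each candidate label to exactly one cell.
batch_task = {
    "context": "Donor: 54-year-old female; tissue: peripheral blood (PBMC).",
    "cells": {
        "cell_1": ["CD3D", "CD3E", "IL7R", "CCR7"],   # top marker genes per cell
        "cell_2": ["MS4A1", "CD79A", "CD79B"],
        "cell_3": ["LYZ", "S100A8", "CD14"],
    },
    "candidate_labels": ["CD4+ T cell", "B cell", "CD14+ monocyte"],
}

# A valid structured answer uses each candidate label exactly once (global consistency).
expected_answer = {
    "cell_1": "CD4+ T cell",
    "cell_2": "B cell",
    "cell_3": "CD14+ monocyte",
}
```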

In HEP HPC codebases, CelloAI seeks to bridge the gap in code documentation, translation, and modification, ensuring data privacy and transparency by running locally and leveraging explicit code dependency information to enable safer code changes. This dual-domain design demonstrates CelloAI’s flexibility and impact.

2. Methodologies for Single-Cell Data Annotation

CelloAI’s approach to batch-level cell type annotation is structured as a two-phase model training pipeline:

Supervised Fine-Tuning (SFT)

  • The model is initialized by training on a distilled corpus of expert reasoning traces, which are generated by baseline models (e.g., OpenAI’s o1) via rejection sampling. Only those traces matching target label assignments and prompt formatting are admitted (acceptance rate ∼38.5%), ensuring high-quality learning signals.
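A minimal sketch of this rejection-sampling filter, assuming a hypothetical parse_answer helper that returns a {cell_id: label} dict for well-formatted traces and None otherwise:

```python
def accept_trace(trace_text, gold_labels, parse_answer):
    """Admit a distilled reasoning trace only if it is well formatted and its
    final label assignment matches the gold batch annotation exactly."""
    predicted = parse_answer(trace_text)   # None if the output is malformed
    if predicted is None:
        return False
    return predicted == gold_labels        # every cell must match its target label

def build_sft_corpus(candidate_traces, parse_answer):
    """candidate_traces: iterable of (trace_text, gold_labels) pairs sampled from
    a baseline model; only accepted traces (~38.5% here) enter the SFT corpus."""
    return [(trace, gold) for trace, gold in candidate_traces
            if accept_trace(trace, gold, parse_answer)]
```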

Reinforcement Learning (RL) with Batch-Level Rewards

  • Following SFT, Cell-o1 is refined using RL to maximize batch-level annotation accuracy and global constraint satisfaction. The reward function grants a reward of 1 only if the batch prediction is entirely correct and formatted properly, with −1 for any invalid or misformatted output.
  • The RL objective is:

J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left\{ \min\!\left( p_i(\theta)\, A_i,\ \operatorname{clip}\!\left(p_i(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_i \right) - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\theta_{\text{ref}}}\right) \right\} \right]

where p_i(θ) is the policy ratio, A_i is the normalized advantage, ε is the clipping threshold that stabilizes updates, and β determines the strength of the KL penalty (Fang et al., 3 Jun 2025).

This curriculum ensures (a) adherence to structured output, (b) efficient credit assignment in sparse-reward RL scenarios, and (c) emergence of globally consistent, interpretable solutions.
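
A simplified sketch of the reward and normalized-advantage terms appearing in the objective above (the KL penalty and the actual policy-gradient update are omitted; this illustrates the GRPO-style formulation rather than reproducing the training code):

```python
import numpy as np

def batch_reward(predicted, gold):
    """All-or-nothing batch-level reward: +1 only if the parsed prediction labels
    every cell in the batch correctly; -1 for wrong, invalid, or misformatted output."""
    if predicted is None:                  # output could not be parsed
        return -1.0
    return 1.0 if predicted == gold else -1.0

def group_advantages(rewards):
    """Normalize the rewards of the G responses sampled for one prompt (the A_i)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_term(ratio, advantage, eps=0.2):
    """Clipped surrogate term: min(p_i * A_i, clip(p_i, 1-eps, 1+eps) * A_i)."""
    return min(ratio * advantage, float(np.clip(ratio, 1 - eps, 1 + eps)) * advantage)
```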

3. Technical Strategies in HPC Software Assistance

In the context of HEP codebases, CelloAI emphasizes advanced retrieval, context assembly, and LLM prompting protocols:

Retrieval-Augmented Generation (RAG)

  • CelloAI manages separate vector databases for code and textual documentation (manuals, papers, tutorials). Queries are embedded and matched in parallel using code-specific and text-specific models. A pattern-matching mechanism reranks candidates using exact symbol matches.
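
A minimal sketch of the parallel retrieval and exact-symbol flagging step, assuming generic vector-store and embedding interfaces (the search/embed call signatures and hit fields are illustrative, not a specific library API):

```python
import re

def retrieve_candidates(query, code_store, text_store, code_embed, text_embed, top_n=5):
    """Query the code and documentation collections in parallel with
    domain-specific embedding functions, then flag exact symbol matches."""
    code_hits = code_store.search(code_embed(query), top_k=top_n)
    text_hits = text_store.search(text_embed(query), top_k=top_n)

    # Identifier-like tokens in the query (function, class, or namespace names).
    symbols = set(re.findall(r"[A-Za-z_][A-Za-z0-9_:]+", query))
    for hit in code_hits + text_hits:
        hit["exact_match"] = any(sym in hit["content"] for sym in symbols)
    return code_hits, text_hits
```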

Syntax-Aware Code Chunking

  • Using Tree-sitter based parsing, CelloAI divides code into syntax-complete units (functions, classes, namespaces), avoiding the ambiguities introduced by fixed-window segmentation.
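
A minimal sketch of syntax-complete chunking, assuming the py-tree-sitter bindings and a C++ grammar wheel (constructor signatures differ between tree-sitter versions, so treat this as illustrative rather than CelloAI's exact implementation):

```python
from tree_sitter import Language, Parser
import tree_sitter_cpp  # assumed grammar package; any Tree-sitter C++ grammar works

CPP = Language(tree_sitter_cpp.language())
parser = Parser(CPP)

# Node types treated as self-contained, syntax-complete chunks.
CHUNK_TYPES = {"function_definition", "class_specifier", "namespace_definition"}

def chunk_source(source: bytes):
    """Split C++ source into syntax-complete units instead of fixed-size windows."""
    chunks = []

    def walk(node):
        if node.type in CHUNK_TYPES:
            chunks.append(source[node.start_byte:node.end_byte].decode("utf-8", "replace"))
        else:
            for child in node.children:
                walk(child)

    walk(parser.parse(source).root_node)
    return chunks
```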

Callgraph Knowledge Integration

  • Doxygen-generated callgraphs provide explicit dependency summaries appended to prompts. This two-hop lineage improves prompt quality for code generation/refactoring by contextualizing each function’s dependencies, reducing the risk of inappropriate suggestions.
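
A sketch of how the dependency summary might be assembled, assuming the caller-to-callee edges have already been extracted from the Doxygen-generated callgraphs into a plain dict (the extraction step and data layout are assumptions):

```python
def callgraph_context(func, callgraph, depth=2):
    """Collect callees of `func` up to `depth` hops (the two-hop lineage)
    from a dict mapping each function name to its direct callees."""
    seen, frontier = set(), {func}
    for _ in range(depth):
        frontier = {c for f in frontier for c in callgraph.get(f, [])} - seen
        seen |= frontier
    return sorted(seen)

def build_prompt(task, func, func_source, callgraph):
    """Prepend a dependency note to the task description and the function's source."""
    deps = callgraph_context(func, callgraph)
    note = "Known dependencies of %s (from the call graph): %s." % (
        func, ", ".join(deps) if deps else "none")
    return "\n\n".join([task, note, func_source])

# Example with a toy callgraph {function: [direct callees]}:
graph = {"simulateHit": ["getCell", "applyCalibration"], "getCell": ["lookupGeometry"]}
callgraph_context("simulateHit", graph)  # ['applyCalibration', 'getCell', 'lookupGeometry']
```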

Retrieval and Prompt Assembly (Pseudocode)

  1. Calculate query embeddings for both code and text
  2. Retrieve top-N documents from each collection
  3. Apply pattern matching for exact symbol matches
  4. Merge and rank for a balanced context prompt
  5. If enabled, append callgraph dependencies to the prompt

The scoring can be expressed as:

S_{\text{total}} = \alpha \cdot S_{\text{embedding}} + \beta \cdot S_{\text{pattern}}

with α and β empirically tuned (Atif et al., 22 Aug 2025).
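
Continuing the retrieval sketch above, the merge-and-rank step can combine the embedding and pattern signals as in this formula (the weights and the binary treatment of the pattern term are illustrative assumptions):

```python
def fused_score(embedding_score, exact_match, alpha=0.7, beta=0.3):
    """S_total = alpha * S_embedding + beta * S_pattern, with the pattern term
    modeled here as a binary exact-symbol-match indicator."""
    return alpha * embedding_score + beta * (1.0 if exact_match else 0.0)

def merge_and_rank(code_hits, text_hits, top_k=8):
    """Pool code and documentation candidates and rank them by fused score
    (each hit is assumed to carry a similarity `score` and an `exact_match` flag)."""
    pooled = code_hits + text_hits
    pooled.sort(key=lambda h: fused_score(h["score"], h["exact_match"]), reverse=True)
    return pooled[:top_k]
```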

4. Performance Evaluation and Emergent Behaviors

Cell Annotation

On the CellPuzzles benchmark, Cell-o1 (CelloAI) outperforms previous approaches (e.g., OpenAI o1 baseline) by over 73% in batch-level accuracy (Fang et al., 3 Jun 2025). This metric accounts for global annotation consistency, i.e., whether all cells in a batch are labeled correctly and uniquely in a single pass. The model consistently demonstrates high format validity and interpretable, structured outputs.
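
A minimal sketch of this batch-level metric, assuming one {cell_id: label} dict per batch:

```python
def batch_level_accuracy(predicted_batches, gold_batches):
    """Credit a batch only when every cell in it is labeled correctly;
    report the fraction of fully correct batches."""
    correct = sum(pred == gold for pred, gold in zip(predicted_batches, gold_batches))
    return correct / len(gold_batches)
```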

Emergent behaviors include:

  • Citational referencing of canonical gene markers and tissue metadata
  • Self-reflection and curriculum-style reasoning (revisiting ambiguous predictions and prioritizing less ambiguous cases first)
  • Maintenance of answer uniqueness and output readability

HPC Software

In documentation and code generation evaluations on ATLAS, CMS, and DUNE applications (including FastCaloSim, Patatrack, P2R, and WireCell), CelloAI demonstrates:

  • Enhanced kernel recall rates and code porting coverage using the retrieval-augmented, syntax-aware prompting pipeline
  • Success in Doxygen-style comment generation, filling documentation gaps in legacy code
  • Accurate context assembly for dependency-aware code modifications, particularly beneficial for porting GPU kernels and refactoring

5. Interpretability, Accessibility, and Security

A hallmark of CelloAI is its focus on system interpretability and operational safety:

  • Interpretability: Emergent reasoning traces in annotation tasks provide exact logic, gene marker references, and context, making outputs auditable by human experts.
  • Accessibility: Local deployment bypasses cloud-based dependencies, ensuring operation even in data-sensitive environments; large-context operations are viable without incurring external costs.
  • Security and Data Privacy: By running entirely on-premises, CelloAI circumvents the data privacy issues ubiquitous in cloud LLM hosting—a crucial consideration for scientific research and clinical datasets.

6. Limitations and Future Prospects

Identified challenges include:

  • Reproducibility of LLM-aided code generation remains sensitive to decoding settings.
  • “Moderate” and “hard” code kernels, especially in memory mapping and low-level directives, pose significant difficulty.
  • In lengthy code contexts, adherence to user instructions may degrade, and generated code may not always be compile-ready.

Future directions include:

  • Inline, context-sensitive comment generation during code edits
  • Integration of static analyzers and co-generated unit test stubs for enhanced safety guardrails
  • Construction of shared repositories (e.g., HEP kernel pairs) to support continued fine-tuning and domain adaptation
  • End-to-end benchmarking frameworks that account for compilation, execution, and accuracy across targets
  • Expansion to incorporate retrieval-augmented annotation from external biomedical ontologies and knowledge bases (Fang et al., 3 Jun 2025, Atif et al., 22 Aug 2025)

7. Related Systems and Outlook

Several contemporary systems, including InstructCell (Fang et al., 14 Jan 2025), inform and extend the methodologies used in CelloAI:

  • The adoption of multi-modal architectures combining natural language instructions with numerical gene expression data
  • Instruction tuning over diverse prompt templates, increasing robustness across user expertise and communication styles
  • Conditional pseudo-cell generation and drug sensitivity prediction, which expand the usability of AI co-pilots in biomedical research

A plausible implication is that future iterations of CelloAI will increasingly merge advances in multi-modal LLMs, instruction-following, and structured reasoning—enabling seamless application to both computational biology and scientific software development.


Summary Table: CelloAI System Dimensions

Domain         | Core Task                              | Key Techniques
scRNA-seq      | Batch-level cell type annotation       | SFT, RL, structured reasoning
HEP Software   | Code documentation and generation      | RAG, syntax-aware chunking, callgraph integration
Multi-modal AI | Instruction-based biological analysis  | Q-Former, LLMs, CVAE, robustness tuning

CelloAI thus constitutes a cross-disciplinary application of LLMs and structured retrieval techniques, offering interpretable and reliable solutions to high-complexity annotation and code generation challenges in modern science.
