CelloAI: LLMs for Cell Annotation & HPC Code

Updated 27 August 2025
  • CelloAI is a system leveraging advanced LLMs, RAG, and RL to deliver structured, explainable cell type annotation from scRNA-seq data and expert-level HPC code understanding.
  • It combines supervised fine-tuning with reinforcement learning to achieve global consistency, interpretable batch-level predictions, and robust code dependency analysis.
  • Designed for on-premises deployment, CelloAI ensures data privacy, secure code modification, and scalable performance in both computational biology and high-energy physics applications.

CelloAI is a series of systems and methodologies leveraging LLMs, retrieval-augmented generation (RAG), and reasoning-driven architectures to address two principal domains: (1) structured, explainable cell type annotation from single-cell RNA sequencing (scRNA-seq) data, and (2) expert-level code understanding and generation in high-performance computing (HPC), notably in High Energy Physics (HEP). CelloAI architectures are defined by their use of advanced LLMs, reinforcement learning for structured tasks, and rigorous contextualization through retrieval and dependency modeling, achieving state-of-the-art performance and interpretability across disparate scientific domains.

1. Foundations and Objectives

CelloAI was designed to address two fundamental bottlenecks: (i) the labor-intensive, context-dependent annotation of cell types in scRNA-seq analysis, and (ii) the complexity of legacy, sparsely documented HEP codebases within HPC ecosystems (Fang et al., 3 Jun 2025, Atif et al., 22 Aug 2025).

For scRNA-seq analysis, CelloAI (also referred to as Cell-o1) reformulates standard annotation pipelines, which historically labeled each cell independently, into batch-level structured reasoning tasks. The goal is to jointly assign unique, globally consistent cell type annotations across a cell batch, mimicking the context-driven strategy of human experts.
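
For concreteness, a batch-level task pairs donor/tissue context and per-cell marker genes with a pool of candidate labels that must each be used exactly once. The sketch below is illustrative; the field names and exact CellPuzzles format are assumptions, while the marker genes shown are canonical (CD3D/IL7R for T cells, MS4A1/CD79A for B cells, LYZ/CD14 for monocytes).

```python
# Hypothetical batch-level annotation task: one prompt covers a whole batch of
# cells, and the answer must assign each candidate label to exactly one cell.
batch_task = {
    "context": "Donor: 54-year-old female; tissue: peripheral blood (PBMC).",
    "cells": {
        "cell_1": ["CD3D", "CD3E", "IL7R", "CCR7"],   # top marker genes per cell
        "cell_2": ["MS4A1", "CD79A", "CD79B"],
        "cell_3": ["LYZ", "S100A8", "CD14"],
    },
    "candidate_labels": ["CD4+ T cell", "B cell", "CD14+ monocyte"],
}

# A valid structured answer uses each candidate label exactly once (global consistency).
expected_answer = {
    "cell_1": "CD4+ T cell",
    "cell_2": "B cell",
    "cell_3": "CD14+ monocyte",
}
```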

In HEP HPC codebases, CelloAI seeks to bridge the gap in code documentation, translation, and modification, ensuring data privacy and transparency by running locally and leveraging explicit code dependency information to enable safer code changes. This dual-domain design demonstrates CelloAI’s flexibility and impact.

2. Methodologies for Single-Cell Data Annotation

CelloAI’s approach to batch-level cell type annotation is structured as a two-phase model training pipeline:

Supervised Fine-Tuning (SFT)

  • The model is initialized by training on a distilled corpus of expert reasoning traces, which are generated by baseline models (e.g., OpenAI’s o1) via rejection sampling. Only those traces matching target label assignments and prompt formatting are admitted (acceptance rate ∼38.5%), ensuring high-quality learning signals.
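A minimal sketch of this rejection-sampling filter, assuming a hypothetical parse_answer helper that returns a {cell_id: label} dict for well-formatted traces and None otherwise:

```python
def accept_trace(trace_text, gold_labels, parse_answer):
    """Admit a distilled reasoning trace only if it is well formatted and its
    final label assignment matches the gold batch annotation exactly."""
    predicted = parse_answer(trace_text)   # None if the output is malformed
    if predicted is None:
        return False
    return predicted == gold_labels        # every cell must match its target label

def build_sft_corpus(candidate_traces, parse_answer):
    """candidate_traces: iterable of (trace_text, gold_labels) pairs sampled from
    a baseline model; only accepted traces (~38.5% here) enter the SFT corpus."""
    return [(trace, gold) for trace, gold in candidate_traces
            if accept_trace(trace, gold, parse_answer)]
```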

Reinforcement Learning (RL) with Batch-Level Rewards

  • Following SFT, Cell-o1 is refined using RL to maximize batch-level annotation accuracy and global constraint satisfaction. The reward function grants a reward of 1 only if the batch prediction is entirely correct and formatted properly, with −1 for any invalid or misformatted output.
  • The RL objective is:

J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left\{ \min\!\left( p_i(\theta)\, A_i,\ \operatorname{clip}\!\left(p_i(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_i \right) - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\theta_{\text{ref}}}\right) \right\} \right]

where p_i(θ) is the policy ratio, A_i is the normalized advantage, ε is the clipping threshold that stabilizes updates, and β determines the strength of the KL penalty (Fang et al., 3 Jun 2025).

This curriculum ensures (a) adherence to structured output, (b) efficient credit assignment in sparse-reward RL scenarios, and (c) emergence of globally consistent, interpretable solutions.
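
A simplified sketch of the reward and normalized-advantage terms appearing in the objective above (the KL penalty and the actual policy-gradient update are omitted; this illustrates the GRPO-style formulation rather than reproducing the training code):

```python
import numpy as np

def batch_reward(predicted, gold):
    """All-or-nothing batch-level reward: +1 only if the parsed prediction labels
    every cell in the batch correctly; -1 for wrong, invalid, or misformatted output."""
    if predicted is None:                  # output could not be parsed
        return -1.0
    return 1.0 if predicted == gold else -1.0

def group_advantages(rewards):
    """Normalize the rewards of the G responses sampled for one prompt (the A_i)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_term(ratio, advantage, eps=0.2):
    """Clipped surrogate term: min(p_i * A_i, clip(p_i, 1-eps, 1+eps) * A_i)."""
    return min(ratio * advantage, float(np.clip(ratio, 1 - eps, 1 + eps)) * advantage)
```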

3. Technical Strategies in HPC Software Assistance

In the context of HEP codebases, CelloAI emphasizes advanced retrieval, context assembly, and LLM prompting protocols:

Retrieval-Augmented Generation (RAG)

  • CelloAI manages separate vector databases for code and textual documentation (manuals, papers, tutorials). Queries are embedded and matched in parallel using code-specific and text-specific models. A pattern-matching mechanism reranks candidates using exact symbol matches.
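
A minimal sketch of the parallel retrieval and exact-symbol flagging step, assuming generic vector-store and embedding interfaces (the search/embed call signatures and hit fields are illustrative, not a specific library API):

```python
import re

def retrieve_candidates(query, code_store, text_store, code_embed, text_embed, top_n=5):
    """Query the code and documentation collections in parallel with
    domain-specific embedding functions, then flag exact symbol matches."""
    code_hits = code_store.search(code_embed(query), top_k=top_n)
    text_hits = text_store.search(text_embed(query), top_k=top_n)

    # Identifier-like tokens in the query (function, class, or namespace names).
    symbols = set(re.findall(r"[A-Za-z_][A-Za-z0-9_:]+", query))
    for hit in code_hits + text_hits:
        hit["exact_match"] = any(sym in hit["content"] for sym in symbols)
    return code_hits, text_hits
```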

Syntax-Aware Code Chunking

  • Using Tree-sitter based parsing, CelloAI divides code into syntax-complete units (functions, classes, namespaces), avoiding the ambiguities introduced by fixed-window segmentation.
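
A minimal sketch of syntax-complete chunking, assuming the py-tree-sitter bindings and a C++ grammar wheel (constructor signatures differ between tree-sitter versions, so treat this as illustrative rather than CelloAI's exact implementation):

```python
from tree_sitter import Language, Parser
import tree_sitter_cpp  # assumed grammar package; any Tree-sitter C++ grammar works

CPP = Language(tree_sitter_cpp.language())
parser = Parser(CPP)

# Node types treated as self-contained, syntax-complete chunks.
CHUNK_TYPES = {"function_definition", "class_specifier", "namespace_definition"}

def chunk_source(source: bytes):
    """Split C++ source into syntax-complete units instead of fixed-size windows."""
    chunks = []

    def walk(node):
        if node.type in CHUNK_TYPES:
            chunks.append(source[node.start_byte:node.end_byte].decode("utf-8", "replace"))
        else:
            for child in node.children:
                walk(child)

    walk(parser.parse(source).root_node)
    return chunks
```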

Callgraph Knowledge Integration

  • Doxygen-generated callgraphs provide explicit dependency summaries appended to prompts. This two-hop lineage improves prompt quality for code generation/refactoring by contextualizing each function’s dependencies, reducing the risk of inappropriate suggestions.
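
A sketch of how the dependency summary might be assembled, assuming the caller-to-callee edges have already been extracted from the Doxygen-generated callgraphs into a plain dict (the extraction step and data layout are assumptions):

```python
def callgraph_context(func, callgraph, depth=2):
    """Collect callees of `func` up to `depth` hops (the two-hop lineage)
    from a dict mapping each function name to its direct callees."""
    seen, frontier = set(), {func}
    for _ in range(depth):
        frontier = {c for f in frontier for c in callgraph.get(f, [])} - seen
        seen |= frontier
    return sorted(seen)

def build_prompt(task, func, func_source, callgraph):
    """Prepend a dependency note to the task description and the function's source."""
    deps = callgraph_context(func, callgraph)
    note = "Known dependencies of %s (from the call graph): %s." % (
        func, ", ".join(deps) if deps else "none")
    return "\n\n".join([task, note, func_source])

# Example with a toy callgraph {function: [direct callees]}:
graph = {"simulateHit": ["getCell", "applyCalibration"], "getCell": ["lookupGeometry"]}
callgraph_context("simulateHit", graph)  # ['applyCalibration', 'getCell', 'lookupGeometry']
```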

Retrieval and Prompt Assembly (Pseudocode)

  1. Calculate query embeddings for both code and text
  2. Retrieve top-N documents from each collection
  3. Apply pattern matching for exact symbol matches
  4. Merge and rank for a balanced context prompt
  5. If enabled, append callgraph dependencies to the prompt

The scoring can be expressed as:

S_{\text{total}} = \alpha \cdot S_{\text{embedding}} + \beta \cdot S_{\text{pattern}}

with α and β empirically tuned (Atif et al., 22 Aug 2025).
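
Continuing the retrieval sketch above, the merge-and-rank step can combine the embedding and pattern signals as in this formula (the weights and the binary treatment of the pattern term are illustrative assumptions):

```python
def fused_score(embedding_score, exact_match, alpha=0.7, beta=0.3):
    """S_total = alpha * S_embedding + beta * S_pattern, with the pattern term
    modeled here as a binary exact-symbol-match indicator."""
    return alpha * embedding_score + beta * (1.0 if exact_match else 0.0)

def merge_and_rank(code_hits, text_hits, top_k=8):
    """Pool code and documentation candidates and rank them by fused score
    (each hit is assumed to carry a similarity `score` and an `exact_match` flag)."""
    pooled = code_hits + text_hits
    pooled.sort(key=lambda h: fused_score(h["score"], h["exact_match"]), reverse=True)
    return pooled[:top_k]
```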

4. Performance Evaluation and Emergent Behaviors

Cell Annotation

On the CellPuzzles benchmark, Cell-o1 (CelloAI) outperforms previous approaches (e.g., OpenAI o1 baseline) by over 73% in batch-level accuracy (Fang et al., 3 Jun 2025). This metric accounts for global annotation consistency, i.e., whether all cells in a batch are labeled correctly and uniquely in a single pass. The model consistently demonstrates high format validity and interpretable, structured outputs.
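
A minimal sketch of this batch-level metric, assuming one {cell_id: label} dict per batch:

```python
def batch_level_accuracy(predicted_batches, gold_batches):
    """Credit a batch only when every cell in it is labeled correctly;
    report the fraction of fully correct batches."""
    correct = sum(pred == gold for pred, gold in zip(predicted_batches, gold_batches))
    return correct / len(gold_batches)
```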

Emergent behaviors include:

  • Citational referencing of canonical gene markers and tissue metadata
  • Self-reflection and curriculum-style reasoning (revisiting ambiguous predictions and prioritizing less ambiguous cases first)
  • Maintenance of answer uniqueness and output readability

HPC Software

In documentation and code generation evaluations on ATLAS, CMS, and DUNE applications (including FastCaloSim, Patatrack, P2R, and WireCell), CelloAI demonstrates:

  • Enhanced kernel recall rates and code porting coverage using the retrieval-augmented, syntax-aware prompting pipeline
  • Success in Doxygen-style comment generation, filling documentation gaps in legacy code
  • Accurate context assembly for dependency-aware code modifications, particularly beneficial for porting GPU kernels and refactoring

5. Interpretability, Accessibility, and Security

A hallmark of CelloAI is its focus on system interpretability and operational safety:

  • Interpretability: Emergent reasoning traces in annotation tasks provide exact logic, gene marker references, and context, making outputs auditable by human experts.
  • Accessibility: Local deployment bypasses cloud-based dependencies, ensuring operation even in data-sensitive environments; large-context operations are viable without incurring external costs.
  • Security and Data Privacy: By running entirely on-premises, CelloAI circumvents the data privacy issues ubiquitous in cloud LLM hosting—a crucial consideration for scientific research and clinical datasets.

6. Limitations and Future Prospects

Identified challenges include:

  • Reproducibility of LLM-aided code generation remains sensitive to decoding settings.
  • “Moderate” and “hard” code kernels, especially in memory mapping and low-level directives, pose significant difficulty.
  • In lengthy code contexts, adherence to user instructions may degrade, and generated code may not always be compile-ready.

Future directions include:

  • Inline, context-sensitive comment generation during code edits
  • Integration of static analyzers and co-generated unit test stubs for enhanced safety guardrails
  • Construction of shared repositories (e.g., HEP kernel pairs) to support continued fine-tuning and domain adaptation
  • End-to-end benchmarking frameworks that account for compilation, execution, and accuracy across targets
  • Expansion to incorporate retrieval-augmented annotation from external biomedical ontologies and knowledge bases (Fang et al., 3 Jun 2025, Atif et al., 22 Aug 2025)

7. Related Systems and Outlook

Several contemporary systems, including InstructCell (Fang et al., 14 Jan 2025), inform and extend the methodologies used in CelloAI:

  • The adoption of multi-modal architectures combining natural language instructions with numerical gene expression data
  • Instruction tuning over diverse prompt templates, increasing robustness across user expertise and communication styles
  • Conditional pseudo-cell generation and drug sensitivity prediction, which expand the usability of AI co-pilots in biomedical research

A plausible implication is that future iterations of CelloAI will increasingly merge advances in multi-modal LLMs, instruction-following, and structured reasoning—enabling seamless application to both computational biology and scientific software development.


Summary Table: CelloAI System Dimensions

Domain         | Core Task                              | Key Techniques
scRNA-seq      | Batch-level cell type annotation       | SFT, RL, structured reasoning
HEP Software   | Code documentation and generation      | RAG, syntax-aware chunking, callgraph integration
Multi-modal AI | Instruction-based biological analysis  | Q-Former, LLMs, CVAE, robustness tuning

CelloAI thus constitutes a cross-disciplinary application of LLMs and structured retrieval techniques, offering interpretable and reliable solutions to high-complexity annotation and code generation challenges in modern science.
