CellPuzzles Task: Batch Annotation

Updated 30 June 2025
  • CellPuzzles Task is a benchmark for batch-level cell type annotation in scRNA-seq, integrating donor metadata and gene expression to ensure global label consistency.
  • The framework challenges models with combinatorial reasoning by enforcing a one-to-one cell-to-label mapping that mirrors expert annotation workflows.
  • It enables rigorous evaluation using batch-level accuracy and transparent output, as demonstrated by Cell-o1’s superior performance over standard LLMs.

The CellPuzzles Task defines a rigorous benchmark for cell type annotation in single-cell RNA sequencing (scRNA-seq) analysis, emphasizing batch-level reasoning and the integration of biological context. Unlike traditional annotation methods that label cells independently, CellPuzzles requires the assignment of unique cell types across a batch, with a one-to-one correspondence between cells and candidate types, thereby capturing expert workflows and imposing global consistency constraints.

1. Definition and Motivation

CellPuzzles reformulates cell type annotation as a combinatorial reasoning task. For each batch, comprising N cells characterized by their top-expressed genes and associated donor-level metadata, N candidate cell types are provided in shuffled order. The objective is to map each cell to a distinct candidate type, such that all available types are used exactly once per batch. This setup reflects expert annotation practices, where the joint distribution and contextual relationships of cells within biological samples are considered, rather than applying cell-by-cell or cluster-independent predictions. The framework thus compels models to leverage shared context, label exclusivity, and nuanced cell distinctions.
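
The structure of a batch can be made concrete with a minimal sketch; the field names, marker genes, donor metadata, and helper function below are illustrative assumptions, not the benchmark's actual data schema:

```python
# Illustrative sketch of a CellPuzzles-style batch (hypothetical schema).
batch = {
    "metadata": {"tissue": "lung", "disease": "COVID-19", "donor_age": 54},
    "cells": [
        {"id": "cell_1", "top_genes": ["CD3D", "CCR7", "SELL", "IL7R"]},    # naive T-like profile
        {"id": "cell_2", "top_genes": ["CD3D", "GZMB", "NKG7", "CCL5"]},    # cytotoxic/memory T-like profile
        {"id": "cell_3", "top_genes": ["MS4A1", "CD79A", "CD79B", "CD74"]}, # B-cell-like profile
    ],
    # Exactly N candidate types for N cells, presented in shuffled order.
    "candidate_types": ["Memory CD8+ T cell", "B cell", "Naive CD4+ T cell"],
}

def is_valid_assignment(assignment: dict, batch: dict) -> bool:
    """True iff every cell receives a label and every candidate type
    is used exactly once (the one-to-one constraint)."""
    cell_ids = {c["id"] for c in batch["cells"]}
    return (set(assignment) == cell_ids
            and sorted(assignment.values()) == sorted(batch["candidate_types"]))

# A valid one-to-one assignment for the batch above.
print(is_valid_assignment(
    {"cell_1": "Naive CD4+ T cell", "cell_2": "Memory CD8+ T cell", "cell_3": "B cell"},
    batch))  # True
```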

2. Challenges and Problem Structure

CellPuzzles introduces fundamental reasoning challenges absent from standard annotation protocols:

  • Batch-level Contextualization: Cell identity inference must be informed by batch metadata (e.g., tissue source, disease state, donor characteristics) and comparative analysis of the batch's constituent cells, emulating how experts consider both individual and collective features.
  • Label Uniqueness Constraint: Each candidate label must be assigned to exactly one cell per batch, precluding duplicate assignments and demanding global optimization rather than independent classification decisions.
  • Combinatorial Ambiguity: Many annotation instances involve subtle gene expression differences (e.g., naïve vs. memory T cells), requiring elimination-based reasoning across the candidate set.
  • Interpretability Requirement: The benchmark framework incentivizes not only accurate predictions, but also transparent, step-by-step reasoning akin to expert justifications based on marker genes and contextual logic.
  • Scalability and Generalization: CellPuzzles spans diverse tissues, diseases, and donor backgrounds; thus, annotation approaches must generalize beyond memorized patterns.

3. Model Approaches and Learning Mechanisms

CellPuzzles serves as a platform for evaluating and developing models capable of batch-level joint reasoning:

  • Standard LLMs: Off-the-shelf LLMs (e.g., OpenAI's o1) perform poorly on this task, attaining only 19.0% batch-level accuracy, largely due to their default cell-independent annotation and lack of constraint awareness.
  • Cell-o1 Model: To address these deficits, Cell-o1—a 7B-parameter LLM—was introduced, specifically fine-tuned and optimized for CellPuzzles.
    • Supervised Fine-Tuning (SFT): The model is first trained on high-quality reasoning traces distilled from strong LLMs, filtered for format and correctness, to enable learning of structured, interpretable solutions.
  • Reinforcement Learning with Batch-level Rewards: Training then proceeds with Group Relative Policy Optimization (GRPO), using a batch-level reward function that gives +1 for an exact batch match, -1 for invalid outputs, and 0 for partial correctness (see the sketch below).
    • Prompt Engineering and Output Standardization: Structured prompt and output templates enforce both reasoning trace production and answer format validity.

The RL stage is initialized from the supervised checkpoints (a cold-start SFT), as ablation studies show that applying RL directly to the base model is unstable and less effective.
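
A minimal sketch of a batch-level reward consistent with the scheme above is given below; the exact reward and output-parsing logic used to train Cell-o1 may differ in detail.

```python
def batch_reward(predicted, gold, candidate_types):
    """Batch-level reward in the spirit of the GRPO stage described above (illustrative).

    +1.0 : well-formed output, each candidate type used exactly once,
           and every cell mapped to its correct type
     0.0 : valid output that is only partially correct
    -1.0 : malformed output, or a violation of the uniqueness constraint
    """
    if predicted is None:
        return -1.0  # output could not be parsed into the required format
    if set(predicted) != set(gold) or sorted(predicted.values()) != sorted(candidate_types):
        return -1.0  # missing/extra cells, or duplicate/unused labels
    return 1.0 if predicted == gold else 0.0
```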

4. Evaluation Metrics and Empirical Results

CellPuzzles evaluation employs multiple stringent metrics:

  • Batch-level Accuracy: Fraction of batches in which the entire mapping is exactly correct; serves as the primary metric due to the combinatorial nature of the problem.
  • Cell-level Accuracy: Proportion of individual cell assignments that are correct, averaged across all batches.
  • Format Validity: Percentage of model outputs adhering to the required answer format.
  • Answer Uniqueness: Ensures all labels are used exactly once per batch; reported as the proportion of unique labels assigned.
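
A minimal sketch of how these metrics could be computed is shown below; the exact definitions and averaging conventions used in the paper (particularly for answer uniqueness) may differ.

```python
def evaluate(predictions, references, candidate_lists):
    """Illustrative computation of the four metrics above.

    predictions[i]     : dict cell_id -> predicted type, or None if unparsable
    references[i]      : dict cell_id -> gold type
    candidate_lists[i] : candidate types shown for batch i
    """
    n = len(references)
    batch_hits, valid, unique, cell_accs = 0, 0, 0, []
    for pred, gold, cands in zip(predictions, references, candidate_lists):
        if pred is None:                          # malformed output
            cell_accs.append(0.0)
            continue
        valid += 1                                # counts toward format validity
        if sorted(pred.values()) == sorted(cands):
            unique += 1                           # every label used exactly once
        correct = sum(pred.get(c) == t for c, t in gold.items())
        cell_accs.append(correct / len(gold))
        batch_hits += int(correct == len(gold))
    return {
        "batch_level_acc": batch_hits / n,
        "cell_level_acc": sum(cell_accs) / n,
        "format_validity": valid / n,
        "answer_uniqueness": unique / n,
    }
```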

Cell-o1 achieves 32.88% batch-level accuracy (substantially outperforming OpenAI o1's 19.00%) and 68.49% cell-level accuracy. On batches from unseen diseases or tissues, Cell-o1 maintains strong generalization, with batch-level performance of 38.96% versus the next best model's 28.19%.

| Metric           | Cell-o1 | OpenAI o1 | o3-mini | Open-source LLMs |
|------------------|---------|-----------|---------|------------------|
| Batch-level Acc. | 0.3288  | 0.1900    | 0.1352  | <0.07            |
| Cell-level Acc.  | 0.6849  | 0.6479    | 0.5650  | <0.44            |

5. Emergent Reasoning and Expert-like Behavior

A hallmark of Cell-o1 is its emergent ability to mimic expert annotation strategies:

  • Global Strategy: Rather than solving cell labels independently, Cell-o1 reasons jointly, allocating "easy" labels first based on distinctive markers and iteratively refining assignments for more ambiguous cells, referencing batch context and remaining candidates.
  • Constraint Tracking: The model explicitly uses process-of-elimination, ensuring compliance with uniqueness constraints and revisiting prior decisions when later information arises.
  • Transparent Justification: Outputs include detailed stepwise reasoning, highlighting marker expression, supporting metadata, and logical argumentation for each assignment.
  • Consistency and Self-correction: Cell-o1 exhibits self-reflective annotation, correcting interim mappings in light of new evidence—behavior consistent with expert practice.

These behaviors are confirmed through both automated analysis and expert raters, who observe superior interpretability and reasoning fidelity relative to baseline models.

6. Applications, Resource Availability, and Broader Implications

The CellPuzzles benchmark and Cell-o1 have direct application in the automated annotation of scRNA-seq data where label exclusivity and interpretability are desired, especially in heterogeneous or poorly characterized biological contexts. The approach facilitates:

  • Robust annotation across varied tissues and disease states.
  • Transparent, expert-traceable outputs suitable for clinical or large-scale tissue studies.
  • Benchmarking and comparison of future batch-level reasoning models.

All code, data, and annotation resources are openly available at https://github.com/ncbi-nlp/cell-o1, including full benchmark datasets, model weights, and reproducibility scripts.

7. Summary Table

| Aspect          | CellPuzzles / Cell-o1 Approach |
|-----------------|--------------------------------|
| Objective       | Batch-level, unique cell type assignment with interpretability and global context |
| Core Challenge  | Contextual reasoning, label uniqueness, combinatorial ambiguity |
| Model Framework | SFT on reasoning traces + RL with batch-level rewards (GRPO) |
| Performance     | 32.9% batch-level accuracy (Cell-o1), outperforming all baselines |
| Generalization  | Robust across tissues, diseases, and donor metadata |
| Resources       | Full code/data: github.com/ncbi-nlp/cell-o1 |

CellPuzzles establishes a paradigm for batch-level, context-aware annotation in single-cell omics, and Cell-o1 emerges as an empirically validated, interpretable model that operationalizes expert-like reasoning for solving single-cell annotation puzzles with batch-wide constraints.