SC-Arena: A Unified Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation
- The paper addresses the evaluation gap by introducing a Virtual Cell abstraction and five natural language tasks that capture molecular, semantic, and causal reasoning.
- It employs a knowledge-augmented evaluation framework that integrates resources such as Cell Ontology, UniProt, and NCBI to assess biologically faithful outputs.
- Empirical results reveal that current LLMs often generate fluent but biologically inaccurate responses, emphasizing the need for domain-specific model calibration.
Motivation and Framework Design
SC-Arena addresses a fundamental gap in the evaluation of LLMs in single-cell biology: the lack of holistic, biologically grounded, and interpretable benchmarks. Existing frameworks are fragmented across task boundaries, rely on brittle string-matching or generic NLP metrics, and emphasize narrow evaluation types (primarily classification or multiple-choice formats). This results in metrics that fail to capture the complexity of biological reasoning and the nuanced requirements for in silico modeling of biological entities.
To overcome these deficits, SC-Arena formalizes a Virtual Cell abstraction, conceptualizing a cell as an object with both static (attributes) and dynamic (methods) properties. This design unifies the evaluation paradigm and enables the definition of five representative natural language tasks—cell type annotation, captioning, cell generation, perturbation prediction, and scientific QA—that jointly probe molecular, semantic, and causal reasoning capabilities. By reframing benchmarks into open-ended QA formats and integrating knowledge-augmented evaluation grounded in external ontologies, marker databases, and literature, SC-Arena delivers an interpretable, evidence-based framework that distinguishes biologically plausible outputs from fluent but superficial responses.
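The object analogy behind the Virtual Cell abstraction can be sketched in Python. This is a minimal illustration of the "static attributes, dynamic methods" framing; the class, field, and method names are hypothetical, not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class VirtualCell:
    """Illustrative Virtual Cell: static attributes capture molecular
    state; methods model dynamic behavior. Names are hypothetical."""
    cell_type: str                       # ontology-grounded label (static)
    expression: dict[str, float] = field(default_factory=dict)  # gene -> level

    def describe(self) -> str:
        """Captioning-style method: summarize the profile in language."""
        top = sorted(self.expression, key=self.expression.get, reverse=True)[:3]
        return f"{self.cell_type} with high expression of {', '.join(top)}"

    def perturb(self, gene: str, fold_change: float) -> "VirtualCell":
        """Perturbation-style method: return the cell after modulating a gene."""
        new_expr = dict(self.expression)
        new_expr[gene] = new_expr.get(gene, 0.0) * fold_change
        return VirtualCell(self.cell_type, new_expr)
```

Under this framing, annotation and captioning query the static side of the object, while perturbation prediction exercises its dynamic methods.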
Task Suite and Evaluation Methodology
The five core tasks in SC-Arena cover distinct reasoning modalities:
- Cell Type Annotation: Expression profiles are mapped to ontology-grounded cell type labels, testing hierarchical and semantic accuracy.
- Cell Captioning: Natural language descriptions are generated from expression profiles, emphasizing interpretability and biological relevance.
- Cell Generation: Given cell type semantics, plausible expression profiles must be synthesized, probing the ability to instantiate molecular signatures consistent with domain markers.
- Perturbation Prediction: Models predict gene expression changes in response to environmental perturbations, requiring mechanistic inference over molecular interactions.
- Scientific QA: Open-ended mechanistic questions sourced from literature are answered, demanding evidence-grounded reasoning and fact extraction.
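The reframing of all five tasks into open-ended natural language QA can be sketched with prompt templates. The wording and slot names below are illustrative assumptions, not the benchmark's actual prompts:

```python
# Hypothetical prompt templates for the five SC-Arena tasks,
# each posed as an open-ended natural language question.
TASK_PROMPTS = {
    "annotation": (
        "Given the expression profile {profile}, what is the most "
        "specific Cell Ontology term for this cell, and why?"
    ),
    "captioning": (
        "Describe the biological identity and state of a cell with "
        "expression profile {profile} in plain language."
    ),
    "generation": (
        "List the marker genes you would expect to be highly expressed "
        "in a {cell_type}, with approximate relative levels."
    ),
    "perturbation": (
        "How would the expression profile {profile} change after "
        "{perturbation}? Name the most affected genes and the direction."
    ),
    "scientific_qa": "{question}",  # mechanistic question from literature
}

def build_prompt(task: str, **slots: str) -> str:
    """Fill a task template with instance-specific slots."""
    return TASK_PROMPTS[task].format(**slots)
```

Casting every task in this shared open-ended format is what lets a single judging pipeline score all five reasoning modalities.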
The knowledge-augmented evaluation leverages curated resources such as Cell Ontology, UniProt, NCBI, CellMarker, and PubMed articles. Each model prediction is assessed by an LLM judge that conditions scoring on both external knowledge and structured rubrics, awarding high scores only to outputs grounded in experimentally validated facts and biologically plausible reasoning. Notably, evaluation measures semantic, ontological, and mechanistic alignment, transcending the limitations of traditional lexical overlap metrics.
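The judging step described above can be sketched as a scoring function that conditions an LLM judge on both retrieved knowledge and a structured rubric. The rubric criteria, prompt layout, and `llm` callable are assumptions for illustration, not the paper's implementation:

```python
import json

# Hypothetical rubric; the benchmark's actual criteria may differ.
RUBRIC = """Score the prediction from 0-10 on each criterion:
- factual_grounding: claims supported by the retrieved knowledge
- ontology_alignment: labels consistent with Cell Ontology terms
- mechanistic_plausibility: reasoning consistent with known biology
Return JSON with those three keys and integer values."""

def judge_prediction(llm, prediction: str, reference: str,
                     knowledge: list[str]) -> dict:
    """Score a model prediction with an LLM judge conditioned on
    external knowledge and a rubric. `llm` is any callable
    str -> str that returns the judge's JSON reply."""
    evidence = "\n".join(f"- {fact}" for fact in knowledge)
    prompt = (
        f"{RUBRIC}\n\nRetrieved knowledge:\n{evidence}\n\n"
        f"Reference answer: {reference}\nModel prediction: {prediction}"
    )
    scores = json.loads(llm(prompt))
    scores["total"] = sum(scores.values())
    return scores
```

Because the judge sees curated evidence (e.g. CellMarker entries) alongside the prediction, a fluent answer that contradicts known marker genes scores poorly on grounding even if it reads well.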
Empirical Findings
Benchmarking across state-of-the-art general-purpose and domain-specialized LLMs reveals significant task asymmetry and clear limitations:
- The best-performing general models (Kimi-K2, DeepSeek-R1) achieve total scores below the reliability threshold for a virtual cell, indicating persistent gaps in mechanistic reasoning.
- Model scaling and family iteration (e.g., Qwen3 vs. Qwen2.5) provide consistent but uneven gains, with marked improvement in generation and captioning tasks but minimal advances in perturbation prediction and cell type annotation.
- Domain-specific models (C2S-Pythia for cell type annotation) outperform general-purpose LLMs on ontology-grounded tasks, while general models retain advantages in open-ended generation tasks.
- SC-Arena's knowledge-augmented scores correlate strongly with domain-specific validity metrics: Spearman correlation with ontology distance in annotation (r=0.6212, p<0.001), marker gene alignment in generation, DEG cosine similarity in perturbation prediction, and expert preference in QA and captioning.
These results expose a "fluent but not faithful" gap, where models generate coherent, linguistically plausible outputs but lack deep biological logic and accuracy.
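The agreement check between judge scores and domain validity metrics (e.g. ontology distance in annotation) can be reproduced in miniature with a hand-rolled Spearman rank correlation. The data below are toy values for illustration; the paper's reported correlation comes from the full benchmark:

```python
def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation (no tie handling; illustration only)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Toy example: judge scores vs. ontology distance for 5 annotations.
# Distance is negated so that better answers (fewer hops) rank higher.
judge_scores = [9.1, 7.4, 8.2, 3.0, 5.5]
ontology_dist = [0, 1, 2, 5, 3]   # hops in the Cell Ontology graph
rho = spearman(judge_scores, [-d for d in ontology_dist])
```

A high rho here would indicate, as in the paper's validation, that the knowledge-augmented judge ranks answers consistently with an independent structural metric.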
Theoretical and Practical Implications
SC-Arena sets a new standard for biological LLM evaluation. The framework's integration of explicit domain knowledge into scoring enables biologically faithful, interpretable, and discriminative judgments, critical for both model development and error analysis in scientific applications. The empirical results underscore the necessity of domain specialization and structured resource integration, particularly in tasks requiring causal and mechanistic reasoning.
Beyond immediate benchmarking, SC-Arena's modular Virtual Cell abstraction and knowledge-augmented evaluation are extensible to emergent modalities (e.g., spatial transcriptomics, temporal analysis, multi-omics integration). The paradigm supports iterative benchmark refinement and adaptation to evolving biological technologies and reasoning demands. Future developments envisioned in the paper include ensemble judge calibration, dynamic database integration, and expansion to more complex biological scenarios, enabling alignment of evaluation criteria with advancing scientific understanding.
Conclusion
SC-Arena provides a unified, interpretable framework for evaluating LLMs in single-cell biology, operationalizing the Virtual Cell paradigm through a suite of biologically relevant tasks and knowledge-augmented metrics. The benchmark demonstrates that current frontier LLMs are limited in biological mechanistic reasoning, despite proficiency in text generation. The framework substantially advances the state of model evaluation in bioinformatics, offering not only a diagnostic tool for LLM competence but also a foundation for the principled construction and assessment of biology-aligned foundation models (2602.23199).