SC-Arena: A Unified Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation
- The paper addresses the evaluation gap by introducing a Virtual Cell abstraction and five natural language tasks that capture molecular, semantic, and causal reasoning.
- It employs a knowledge-augmented evaluation framework that integrates resources such as Cell Ontology, UniProt, and NCBI to assess biologically faithful outputs.
- Empirical results reveal that current LLMs often generate fluent but biologically inaccurate responses, emphasizing the need for domain-specific model calibration.
Motivation and Framework Design
SC-Arena addresses a fundamental gap in the evaluation of LLMs in single-cell biology: the lack of holistic, biologically grounded, and interpretable benchmarks. Existing frameworks are fragmented across task boundaries, rely on brittle string-matching or generic NLP metrics, and emphasize narrow evaluation types (primarily classification or multiple-choice formats). This results in metrics that fail to capture the complexity of biological reasoning and the nuanced requirements for in silico modeling of biological entities.
To overcome these deficits, SC-Arena formalizes a Virtual Cell abstraction, conceptualizing a cell as an object with both static (attributes) and dynamic (methods) properties. This design unifies the evaluation paradigm and enables the definition of five representative natural language tasks—cell type annotation, captioning, cell generation, perturbation prediction, and scientific QA—that jointly probe molecular, semantic, and causal reasoning capabilities. By reframing benchmarks into open-ended QA formats and integrating knowledge-augmented evaluation grounded in external ontologies, marker databases, and literature, SC-Arena delivers an interpretable, evidence-based framework that distinguishes biologically plausible outputs from fluent but superficial responses.
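The object analogy behind the Virtual Cell abstraction can be sketched in Python. This is a minimal illustration of the "static attributes, dynamic methods" framing; the class, field, and method names are hypothetical, not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class VirtualCell:
    """Illustrative Virtual Cell: static attributes capture molecular
    state; methods model dynamic behavior. Names are hypothetical."""
    cell_type: str                       # ontology-grounded label (static)
    expression: dict[str, float] = field(default_factory=dict)  # gene -> level

    def describe(self) -> str:
        """Captioning-style method: summarize the profile in language."""
        top = sorted(self.expression, key=self.expression.get, reverse=True)[:3]
        return f"{self.cell_type} with high expression of {', '.join(top)}"

    def perturb(self, gene: str, fold_change: float) -> "VirtualCell":
        """Perturbation-style method: return the cell after modulating a gene."""
        new_expr = dict(self.expression)
        new_expr[gene] = new_expr.get(gene, 0.0) * fold_change
        return VirtualCell(self.cell_type, new_expr)
```

Under this framing, annotation and captioning query the static side of the object, while perturbation prediction exercises its dynamic methods.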
Task Suite and Evaluation Methodology
The five core tasks in SC-Arena cover distinct reasoning modalities:
- Cell Type Annotation: Expression profiles are mapped to ontology-grounded cell type labels, testing hierarchical and semantic accuracy.
- Cell Captioning: Natural language descriptions are generated from expression profiles, emphasizing interpretability and biological relevance.
- Cell Generation: Given cell type semantics, plausible expression profiles must be synthesized, probing the ability to instantiate molecular signatures consistent with domain markers.
- Perturbation Prediction: Models predict gene expression changes in response to environmental perturbations, requiring mechanistic inference over molecular interactions.
- Scientific QA: Open-ended mechanistic questions sourced from literature are answered, demanding evidence-grounded reasoning and fact extraction.
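The reframing of all five tasks into open-ended natural language QA can be sketched with prompt templates. The wording and slot names below are illustrative assumptions, not the benchmark's actual prompts:

```python
# Hypothetical prompt templates for the five SC-Arena tasks,
# each posed as an open-ended natural language question.
TASK_PROMPTS = {
    "annotation": (
        "Given the expression profile {profile}, what is the most "
        "specific Cell Ontology term for this cell, and why?"
    ),
    "captioning": (
        "Describe the biological identity and state of a cell with "
        "expression profile {profile} in plain language."
    ),
    "generation": (
        "List the marker genes you would expect to be highly expressed "
        "in a {cell_type}, with approximate relative levels."
    ),
    "perturbation": (
        "How would the expression profile {profile} change after "
        "{perturbation}? Name the most affected genes and the direction."
    ),
    "scientific_qa": "{question}",  # mechanistic question from literature
}

def build_prompt(task: str, **slots: str) -> str:
    """Fill a task template with instance-specific slots."""
    return TASK_PROMPTS[task].format(**slots)
```

Casting every task in this shared open-ended format is what lets a single judging pipeline score all five reasoning modalities.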
The knowledge-augmented evaluation leverages curated resources such as Cell Ontology, UniProt, NCBI, CellMarker, and PubMed articles. Each model prediction is assessed by an LLM judge that conditions scoring on both external knowledge and structured rubrics, awarding high scores only to outputs grounded in experimentally validated facts and biologically plausible reasoning. Notably, evaluation measures semantic, ontological, and mechanistic alignment, transcending the limitations of traditional lexical overlap metrics.
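The judging step described above can be sketched as a scoring function that conditions an LLM judge on both retrieved knowledge and a structured rubric. The rubric criteria, prompt layout, and `llm` callable are assumptions for illustration, not the paper's implementation:

```python
import json

# Hypothetical rubric; the benchmark's actual criteria may differ.
RUBRIC = """Score the prediction from 0-10 on each criterion:
- factual_grounding: claims supported by the retrieved knowledge
- ontology_alignment: labels consistent with Cell Ontology terms
- mechanistic_plausibility: reasoning consistent with known biology
Return JSON with those three keys and integer values."""

def judge_prediction(llm, prediction: str, reference: str,
                     knowledge: list[str]) -> dict:
    """Score a model prediction with an LLM judge conditioned on
    external knowledge and a rubric. `llm` is any callable
    str -> str that returns the judge's JSON reply."""
    evidence = "\n".join(f"- {fact}" for fact in knowledge)
    prompt = (
        f"{RUBRIC}\n\nRetrieved knowledge:\n{evidence}\n\n"
        f"Reference answer: {reference}\nModel prediction: {prediction}"
    )
    scores = json.loads(llm(prompt))
    scores["total"] = sum(scores.values())
    return scores
```

Because the judge sees curated evidence (e.g. CellMarker entries) alongside the prediction, a fluent answer that contradicts known marker genes scores poorly on grounding even if it reads well.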
Empirical Findings
Benchmarking across state-of-the-art general-purpose and domain-specialized LLMs reveals significant task asymmetry and clear limitations:
- The best-performing general models (Kimi-K2, DeepSeek-R1) achieve total scores below the reliability threshold for a virtual cell, indicating persistent gaps in mechanistic reasoning.
- Model scaling and family iteration (e.g., Qwen3 vs. Qwen2.5) provide consistent but uneven gains, with marked improvement in generation and captioning tasks but minimal advances in perturbation prediction and cell type annotation.
- Domain-specific models (C2S-Pythia for cell type annotation) outperform general-purpose LLMs on ontology-grounded tasks, while general models retain advantages in open-ended generation tasks.
- SC-Arena's knowledge-augmented scores correlate strongly with domain-specific validity metrics: Spearman correlation with ontology distance in annotation (r=0.6212, p<0.001), marker gene alignment in generation, DEG cosine similarity in perturbation prediction, and expert preference in QA and captioning.
These results expose a "fluent but not faithful" gap, where models generate coherent, linguistically plausible outputs but lack deep biological logic and accuracy.
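The agreement check between judge scores and domain validity metrics (e.g. ontology distance in annotation) can be reproduced in miniature with a hand-rolled Spearman rank correlation. The data below are toy values for illustration; the paper's reported correlation comes from the full benchmark:

```python
def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation (no tie handling; illustration only)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Toy example: judge scores vs. ontology distance for 5 annotations.
# Distance is negated so that better answers (fewer hops) rank higher.
judge_scores = [9.1, 7.4, 8.2, 3.0, 5.5]
ontology_dist = [0, 1, 2, 5, 3]   # hops in the Cell Ontology graph
rho = spearman(judge_scores, [-d for d in ontology_dist])
```

A high rho here would indicate, as in the paper's validation, that the knowledge-augmented judge ranks answers consistently with an independent structural metric.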
Theoretical and Practical Implications
SC-Arena sets a new standard for biological LLM evaluation. The framework's integration of explicit domain knowledge into scoring enables biologically faithful, interpretable, and discriminative judgments, critical for both model development and error analysis in scientific applications. The empirical results underscore the necessity of domain specialization and structured resource integration, particularly in tasks requiring causal and mechanistic reasoning.
Beyond immediate benchmarking, SC-Arena's modular Virtual Cell abstraction and knowledge-augmented evaluation are extensible to emergent modalities (e.g., spatial transcriptomics, temporal analysis, multi-omics integration). The paradigm supports iterative benchmark refinement and adaptation to evolving biological technologies and reasoning demands. Future developments envisioned in the paper include ensemble judge calibration, dynamic database integration, and expansion to more complex biological scenarios, enabling alignment of evaluation criteria with advancing scientific understanding.
Conclusion
SC-Arena provides a unified, interpretable framework for evaluating LLMs in single-cell biology, operationalizing the Virtual Cell paradigm through a suite of biologically relevant tasks and knowledge-augmented metrics. The benchmark demonstrates that current frontier LLMs are limited in biological mechanistic reasoning, despite proficiency in text generation. The framework substantially advances the state of model evaluation in bioinformatics, offering not only a diagnostic tool for LLM competence but also a foundation for the principled construction and assessment of biology-aligned foundation models (2602.23199).