CellForge: Agentic Design of Virtual Cell Models (2508.02276v1)

Published 4 Aug 2025 in cs.LG, cs.AI, cs.CL, and q-bio.QM

Abstract: Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to quantitatively predict cellular responses to diverse perturbations. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system built on a multi-agent framework that transforms presented biological datasets and research objectives directly into optimized computational models for virtual cells. Specifically, given only raw single-cell multi-omics data and a task description as input, CellForge outputs both an optimized model architecture and executable code for training and inference of virtual cell models. The framework integrates three core modules: Task Analysis, for characterizing the presented dataset and retrieving relevant literature; Method Design, where specialized agents collaboratively develop optimized modeling strategies; and Experiment Execution, for automated code generation. The agents in the Method Design module are organized as experts with differing perspectives around a central moderator, and must iteratively exchange solutions until they reach a reasonable consensus. We demonstrate CellForge's capabilities on single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives yields better solutions than directly addressing a modeling challenge. Our code is publicly available at https://github.com/gersteinlab/CellForge.

Summary

  • The paper introduces an autonomous multi-agent framework that transforms single-cell multi-omics data into executable virtual cell models.
  • It employs a graph-based expert discussion to synthesize innovative model architectures, outperforming state-of-the-art baselines.
  • Experimental evaluations show up to 40% error reduction and robust generalization across gene, drug, and cytokine perturbations.

Agentic Design and Automated Implementation of Virtual Cell Models with CellForge

Introduction

CellForge introduces a fully autonomous, agentic system for the design and implementation of virtual cell models, specifically targeting the prediction of single-cell responses to diverse perturbations. The framework leverages a multi-agent architecture that transforms raw single-cell multi-omics data and natural language task descriptions into optimized, executable computational models. The system is structured into three core modules: Task Analysis, Method Design, and Experiment Execution, each orchestrated by specialized agents that collaborate through a shared memory protocol. This overview provides a detailed technical summary of the CellForge framework, its methodological innovations, empirical performance, and implications for the future of AI-driven scientific discovery.

Problem Formulation and System Overview

CellForge addresses the challenge of predicting cellular responses to perturbations—such as gene knockouts, drug treatments, and cytokine stimulations—across multiple single-cell modalities (scRNA-seq, scATAC-seq, CITE-seq). The task is formalized as learning a mapping from a control cell state and perturbation condition to the resulting perturbed state in a high-dimensional gene expression space. The system is designed to generalize to unseen perturbations and cell states, requiring robust inductive reasoning and dataset-specific adaptation.

Figure 1: (a) Virtual cell modeling as a perturbation mapping problem; (b) Training and prediction for unseen perturbations across modalities; (c) CellForge input/output interface; (d) Core intermediate outputs from analysis to code generation.
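
Formally (in our notation, consistent with but not taken verbatim from the paper), this amounts to learning

```latex
f_\theta : (\mathbf{x}_{\mathrm{ctrl}},\, p) \;\mapsto\; \hat{\mathbf{x}}_{\mathrm{pert}},
\qquad \mathbf{x}_{\mathrm{ctrl}},\, \hat{\mathbf{x}}_{\mathrm{pert}} \in \mathbb{R}^{G},
```

where G is the number of measured features (genes, chromatin peaks, or proteins, depending on modality) and p encodes the perturbation condition; generalizing to unseen perturbations means predicting accurately for values of p absent from training.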

The architecture of CellForge is organized into three sequential phases:

  1. Task Analysis: Automated dataset characterization, literature retrieval, and extraction of task-specific constraints.
  2. Method Design: Collaborative, graph-based expert discussion to synthesize novel model architectures and research plans.
  3. Experiment Execution: Automated code generation, training, validation, and iterative refinement.

Figure 2: The CellForge architecture and workflow, illustrating the sequential phases and shared memory communication.
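
Conceptually, the three phases hand their outputs to one another through the shared memory. The stub sketch below illustrates that flow; every class, function, and dataset name here is hypothetical, not drawn from the CellForge repository:

```python
# Schematic of the three sequential phases communicating through a shared
# memory blackboard. All names and return values are illustrative stubs.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class SharedMemory:
    """Blackboard that all agents read from and write to."""
    records: dict[str, Any] = field(default_factory=dict)

def task_analysis(memory: SharedMemory, dataset: str, task: str) -> None:
    # Phase 1: dataset profiling and literature retrieval (stubbed)
    memory.records["analysis"] = {"dataset": dataset, "task": task}

def method_design(memory: SharedMemory) -> None:
    # Phase 2: expert discussion distills a research plan (stubbed)
    memory.records["plan"] = {"architecture": "VAE + GNN + Transformer"}

def experiment_execution(memory: SharedMemory) -> str:
    # Phase 3: code generation, training, and validation (stubbed)
    return f"trained model per plan: {memory.records['plan']}"

memory = SharedMemory()
task_analysis(memory, "perturb_seq_example.h5ad", "predict CRISPRi knockouts")
method_design(memory)
print(experiment_execution(memory))
```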

Task Analysis Module

The Task Analysis module integrates dataset profiling, literature-driven retrieval, and agentic collaboration. The Data Parser standardizes metadata across modalities, while the retrieval system combines a static corpus with dynamic PubMed and GitHub search, employing an alternating BFS/DFS strategy with Sentence-BERT embeddings for relevance scoring. Three specialized agents—Dataset Analyst, Problem Investigator, and Baseline Assessor—process the retrieved information, producing a structured analysis report that informs downstream model design.

Figure 3: Example outputs from the three modules: analysis report, research plan, and code snippets.
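
As an illustration of the embedding-based relevance scoring, here is a minimal sketch; the model checkpoint, corpus handling, and the BFS/DFS alternation itself are our assumptions, not the paper's code:

```python
# Rank candidate documents against a query using Sentence-BERT
# cosine similarity, as in the retrieval module's relevance scoring.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical checkpoint

def rank_documents(query: str, docs: list[str], top_k: int = 5):
    q = model.encode(query, convert_to_tensor=True)
    d = model.encode(docs, convert_to_tensor=True)
    scores = util.cos_sim(q, d)[0]                    # cosine relevance per doc
    idx = scores.argsort(descending=True)[:top_k]     # best-first ordering
    return [(docs[int(i)], float(scores[i])) for i in idx]
```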

Method Design: Multi-Expert Graph-Based Discussion

The Method Design module employs a graph-based, multi-agent discussion framework. Domain experts (e.g., Data, Model Architecture, Deep Learning, Pathway, Training) are instantiated via role-specific prompts and engage in iterative rounds of proposal, critique, and refinement. Each expert maintains a confidence score, updated via a weighted combination of historical confidence, critic agent evaluation, and peer feedback. The discussion terminates upon consensus or after a fixed number of rounds, yielding a research plan with detailed model architecture, preprocessing, and training strategies.

Figure 4: Graph-based discussion workflow, showing iterative expert proposal refinement and confidence score updates.
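
The update rule can be sketched as follows; the functional form matches the description above, but the specific weights are placeholders, not values from the paper:

```python
# Confidence update as a weighted combination of historical confidence,
# critic agent evaluation, and peer feedback. Weights are assumptions.
def update_confidence(prev: float, critic: float, peer: float,
                      alpha: float = 0.5, beta: float = 0.3,
                      gamma: float = 0.2) -> float:
    assert abs(alpha + beta + gamma - 1.0) < 1e-9  # convex combination
    return alpha * prev + beta * critic + gamma * peer
```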

Figure 5: Example of confidence score evolution for a domain expert during multi-round discussion.

The architectural search is not limited to hyperparameter tuning but focuses on emergent, dataset-specific model design. The system frequently converges on hybrid architectures (e.g., VAE + GNN + Transformer) tailored to the biological and technical characteristics of each dataset.
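
To make the hybrid design concrete, here is a minimal PyTorch sketch; all layer sizes, the cell-cell adjacency input, and the perturbation vocabulary are illustrative assumptions, not code generated by CellForge:

```python
# A minimal VAE + GNN + Transformer hybrid for perturbation prediction.
import torch
import torch.nn as nn

class HybridPerturbationModel(nn.Module):
    def __init__(self, n_genes: int = 2000, latent: int = 128,
                 n_heads: int = 4, n_perts: int = 100):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_genes, 512), nn.ReLU())
        self.mu = nn.Linear(512, latent)       # VAE posterior mean
        self.logvar = nn.Linear(512, latent)   # VAE posterior log-variance
        self.gnn = nn.Linear(latent, latent)   # one message-passing round
        self.pert_emb = nn.Embedding(n_perts, latent)
        self.tf = nn.TransformerEncoderLayer(latent, n_heads, batch_first=True)
        self.dec = nn.Linear(latent, n_genes)

    def forward(self, x, pert_id, adj):
        # x: (B, n_genes) control profiles; pert_id: (B,) perturbation ids;
        # adj: (B, B) row-normalized cell-cell neighbor graph.
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        z = torch.relu(self.gnn(adj @ z))        # propagate over the cell graph
        tokens = torch.stack([z, self.pert_emb(pert_id)], dim=1)
        z = self.tf(tokens).mean(dim=1)          # fuse cell + perturbation tokens
        return self.dec(z), mu, logvar           # predicted perturbed profile
```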

Experiment Execution: Automated Code Generation and Validation

The Experiment Execution module translates the research plan into executable code, orchestrating training, validation, and iterative refinement. The code generator produces production-ready scripts, with self-debugging capabilities that handle syntax and runtime errors via event stream analysis. Training is managed with best-practice safeguards (early stopping, cross-validation, adaptive learning rates), and validation agents monitor performance metrics (MSE, PCC, R²), triggering hyperparameter tuning or retraining as needed.
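
These are standard regression metrics; a compact sketch of how a validation agent might compute them (our illustration using common library calls, not CellForge's validation code):

```python
# MSE, Pearson correlation (PCC), and R^2 over predicted vs. true profiles.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, r2_score

def validation_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    return {
        "mse": mean_squared_error(y_true, y_pred),
        "pcc": pearsonr(y_true.ravel(), y_pred.ravel())[0],
        "r2": r2_score(y_true, y_pred),
    }
```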

Empirical Performance and Evaluation

CellForge was evaluated on six benchmark datasets spanning gene knockouts, drug treatments, and cytokine stimulations across scRNA-seq, scATAC-seq, and CITE-seq modalities. The models designed by CellForge consistently outperformed state-of-the-art baselines, achieving up to 40% reduction in prediction error and 20% improvement in correlation metrics. Notably, on the challenging scATAC-seq dataset, CellForge achieved a ~16-fold gain in Pearson correlation on differentially expressed genes compared to linear regression.

Figure 6: UMAP visualizations of predicted and ground truth gene expression profiles under gene knockout, drug, and cytokine perturbations, demonstrating high fidelity in capturing cellular state distributions.

CellForge also demonstrated strong performance in recovering differentially expressed genes (DEGs), with recall rates exceeding 69% and ROC-AUC values above 0.65 on well-characterized genetic perturbation datasets. The system's ability to generalize to unseen perturbations and modalities was validated through stratified cross-validation and held-out perturbation scenarios.

Ablation and Component Analysis

Ablation studies revealed that both the agentic retrieval system and the graph-based expert discussion are critical for performance. The combination of these components yielded synergistic effects, with performance gains far exceeding their individual contributions. The system's robustness was further demonstrated by stable performance across all perturbation types and modalities.

LLM and Human Evaluation

CellForge's outputs were evaluated by both LLM-based judges and human experts across multiple scientific dimensions (validity, feasibility, innovation, experimental design, impact). The system consistently outperformed DeepResearch variants and single-LLM baselines, with strong alignment between agent confidence scores, LLM, and human expert evaluations.

Figure 7: Comparative evaluation of CellForge and DeepResearch variants by LLM judges across key scientific dimensions.

Architectural Adaptation and Model Diversity

Post-hoc analysis of the architectures designed by CellForge revealed emergent, biologically plausible model-task pairings. Transformers dominated cytokine data, GNNs were favored for regulatory network-rich datasets, and hybrid or novel variants emerged through agentic debate and literature integration.

Figure 8: Categorization and quantification of architectures designed by CellForge across six datasets.

Failure Modes and Resource Considerations

The most common failure modes were computation execution errors (41%), primarily due to tensor operation issues, and invalid type/operation errors (23%). The system incorporates self-debugging strategies, such as dynamic shape printing and error recovery, to mitigate these issues. Training infrastructure requirements are moderate (2x NVIDIA H20 GPUs, 16-core CPU, 150 GB RAM), with average per-experiment costs of $5–$20, significantly lower than manual expert labor.

Figure 9: Distribution of failure modes in CellForge, highlighting the prevalence of computation and type errors.
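
The dynamic shape printing strategy can be illustrated as follows; this is a hypothetical sketch of the idea, not the system's actual debugging code:

```python
# Catch a failing tensor operation and surface operand shapes so an
# agent (or developer) can repair the mismatched op.
import torch

def debug_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    try:
        return a @ b
    except RuntimeError as err:
        print(f"matmul failed: a{tuple(a.shape)} @ b{tuple(b.shape)}: {err}")
        raise
```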

Implications and Future Directions

CellForge demonstrates that autonomous, multi-agent systems can effectively integrate computational, biological, and statistical expertise to design and implement optimized models for complex scientific tasks. The framework's architecture-agnostic, literature-grounded, and collaborative reasoning approach enables adaptation to new modalities and tasks without manual intervention. The strong empirical results and robust evaluation suggest that agentic systems can serve as foundational tools for next-generation virtual cell modeling and, more broadly, for automated scientific discovery in data-rich domains.

Future developments may focus on extending CellForge to additional omics modalities (spatial transcriptomics, proteomics), enhancing novelty detection for de novo biological mechanisms, and integrating prospective wet-lab validation. The system's modular design and open-source availability facilitate community-driven improvements and cross-domain adaptation.

Conclusion

CellForge represents a significant advance in the automation of scientific model design, integrating agentic reasoning, domain knowledge retrieval, and collaborative architecture synthesis into a unified, end-to-end framework. Its consistent outperformance of state-of-the-art baselines, robust generalization across modalities, and alignment with human expert judgment underscore the potential of multi-agent AI systems to accelerate and democratize scientific discovery. The methodological innovations in agentic retrieval, graph-based expert discussion, and automated code generation provide a blueprint for future AI-driven research platforms in computational biology and beyond.
