
ScienceBoard: Benchmark for Scientific Workflows

Updated 23 June 2026
  • ScienceBoard is a unified benchmark for evaluating multimodal, autonomous agents across diverse scientific workflows.
  • It enables agents to interact via GUI, CLI, and augmented actions in live environments using real research tools.
  • The platform curates 169 expert-designed tasks, revealing state-of-the-art agent performance and highlighting challenges in grounding and planning.

ScienceBoard is a unified environment and benchmark for evaluating how well multimodal autonomous agents perform realistic, domain-diverse scientific workflows. It enables systematic measurement of agent proficiency across biochemistry, algebra, automated theorem proving, geoinformatics, astronomy, and scientific documentation, mediated through live professional software running in an interactive virtual machine. ScienceBoard supports a broad action space (GUI interactions, command-line scripting, API calls) and multimodal state observations, exposing agents to the complexity and dynamism of real-world scientific research. The resulting benchmark of 169 rigorously curated tasks is presented as the most comprehensive testbed to date for computer-using agentic LLMs in science, and it reveals both the capabilities and limitations of current approaches (Sun et al., 26 May 2025).

1. Motivation and Objectives

ScienceBoard addresses the gap between LLM-centric evaluation (e.g., question-answering, code synthesis) and the embodied workflows of scientific discovery. Specialized software such as UCSF ChimeraX, KAlgebra, Lean 4, GrassGIS, Celestia, and TeXstudio constitutes the digital substrate for research in diverse disciplines. Autonomous agents that can operate these tools, interpret multimodal feedback, and manipulate complex user interfaces stand to accelerate hypothesis testing, simulation, analysis, and documentation. ScienceBoard makes two contributions: (i) a realistic, extensible environment embedding these professional tools and (ii) a validated, high-difficulty benchmark that stresses planning, grounding, and domain-specific reasoning (Sun et al., 26 May 2025).

2. ScienceBoard Environment: Domains, Software, and Interaction Modalities

Agents are evaluated within an Ubuntu-based VM hosting six open-source research applications:

  • Biochemistry: UCSF ChimeraX for 3D molecular structure visualization, including AlphaFold models.
  • Algebra: KAlgebra for symbolic computation, parametric plotting, and equation solving.
  • Automated Theorem Proving (ATP): Lean 4 proof assistant for constructing formal proofs.
  • Geographic Information Systems (GIS): GrassGIS for spatial data processing, vector/raster analysis.
  • Astronomy: Celestia for real-time 3D astronomical visualization and simulation.
  • Scientific Documentation: TeXstudio as a LaTeX IDE for technical writing and reporting.

Agents interact via:

  • GUI Actions: mouse movement/click/drag, keyboard events, text typing, scroll.
  • CLI Actions: shell commands (bash, Python), in-application scripts, application-API calls.
  • Augmented Actions: meta-actions such as ANSWER, DONE, FAIL, WAIT.
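
The three action channels listed above can be modeled as a small tagged union. The following is a minimal sketch, not ScienceBoard's actual interface: the class names, fields, and the choice to treat DONE and FAIL as episode-ending are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class Meta(Enum):
    """Meta-actions named in the text."""
    ANSWER = "ANSWER"
    DONE = "DONE"
    FAIL = "FAIL"
    WAIT = "WAIT"


@dataclass(frozen=True)
class GuiAction:
    # Hypothetical fields: kind is e.g. "click", "drag", "type", "scroll".
    kind: str
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass(frozen=True)
class CliAction:
    # A shell command or in-application script, passed as text.
    command: str


@dataclass(frozen=True)
class AugmentedAction:
    meta: Meta
    payload: str = ""  # e.g. the answer text for ANSWER


Action = Union[GuiAction, CliAction, AugmentedAction]


def channel(action: Action) -> str:
    """Return which of the three channels an action belongs to."""
    if isinstance(action, GuiAction):
        return "gui"
    if isinstance(action, CliAction):
        return "cli"
    return "augmented"


def is_terminal(action: Action) -> bool:
    """Assumption: only DONE and FAIL end an episode."""
    return isinstance(action, AugmentedAction) and action.meta in {Meta.DONE, Meta.FAIL}
```

Keeping the channel explicit in the type makes it straightforward to restrict an agent to GUI-only or CLI-only actions when comparing settings.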

Observations provided to the agent include:

  • Full-resolution screenshots.
  • Accessibility tree (a11ytree) exposing semantic UI elements as structural text.
  • Combinations of screenshots with a11ytree for fused vision-language input.
  • Annotated mark sets linking bounding boxes to UI tree nodes.
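
To illustrate how a mark set can link bounding boxes to UI tree nodes, the sketch below numbers each node, emits a text listing for the model, and resolves a chosen mark back to a click point. The `UiNode` structure and the listing format are hypothetical and are not taken from ScienceBoard.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class UiNode:
    role: str                          # e.g. "button", "menu"
    name: str                          # accessible name from the a11y tree
    bbox: Tuple[int, int, int, int]    # (x, y, width, height) in pixels


def build_marks(nodes: List[UiNode]) -> Tuple[Dict[int, UiNode], str]:
    """Assign 1-based mark ids and build a text listing of the marks."""
    marks: Dict[int, UiNode] = {}
    lines: List[str] = []
    for idx, node in enumerate(nodes, start=1):
        marks[idx] = node
        lines.append(f"[{idx}] {node.role} '{node.name}' @ {node.bbox}")
    return marks, "\n".join(lines)


def mark_center(marks: Dict[int, UiNode], idx: int) -> Tuple[int, int]:
    """Resolve a mark id to the center of its bounding box."""
    x, y, w, h = marks[idx].bbox
    return x + w // 2, y + h // 2


nodes = [
    UiNode("menu", "File", (0, 0, 40, 20)),
    UiNode("button", "Render", (100, 40, 80, 20)),
]
marks, listing = build_marks(nodes)
click_point = mark_center(marks, 2)  # center of the "Render" button
```

With this kind of mapping, an agent can refer to a mark id instead of predicting raw pixel coordinates.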

A RESTful evaluation layer monitors intermediate and final system states for precise execution-based assessment.

3. Task Benchmark Construction and Characteristics

The 169-task benchmark is curated by domain experts via a pipeline:

  1. Tutorial Learning: Annotators assimilate tool handbooks and tutorials.
  2. Task Specification: Human-written, natural-language instructions covering initialization, configuration, simulation, question-answering, and domain operations.
  3. Formalization: An LLM (ChatGPT) standardizes task formats; low-diversity and ambiguous instructions are removed.
  4. Initial State Generation: Scripts load necessary contexts (molecules, map layers, proof skeletons).
  5. Automated Evaluation: State-based scripts return binary PASS/FAIL via predicates, comparisons, or tolerances.
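
The state-based evaluation in step 5 can be sketched as a list of checks over a captured application state. This is an illustrative sketch only: the state keys (`zoom`, `layer_visible`) and the helper names are hypothetical, and the real evaluation scripts may be structured differently.

```python
from typing import Any, Callable, List, Mapping

State = Mapping[str, Any]
Check = Callable[[State], bool]


def equals(key: str, expected: Any) -> Check:
    """Predicate/comparison check: the state value must match exactly."""
    return lambda state: state.get(key) == expected


def within(key: str, target: float, tol: float) -> Check:
    """Tolerance check: a numeric state value must lie within tol of target."""
    def check(state: State) -> bool:
        value = state.get(key)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        return abs(value - target) <= tol
    return check


def evaluate(state: State, checks: List[Check]) -> str:
    """Binary verdict: PASS only if every check holds."""
    return "PASS" if all(check(state) for check in checks) else "FAIL"


# Hypothetical final state read back from an application after the episode.
final_state = {"zoom": 2.0004, "layer_visible": True}
verdict = evaluate(final_state, [equals("layer_visible", True), within("zoom", 2.0, 1e-3)])
```

Evaluating the resulting state rather than the action trace means any sequence of GUI or CLI actions that reaches the target state is scored the same way.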

Task statistics:

  • Domain Distribution: Biochemistry, algebra, theorem proving, GIS, astronomy, documentation.
  • Interaction Types: GUI-only (38), CLI-only (33), hybrid GUI+CLI (98).
  • Difficulty: 91 easy; 48 medium; 28 hard; 2 open-ended.
  • Mean Steps per Task: 9.
  • Representative Tasks: Protein centroid visualization in ChimeraX; symbolic solution and plotting in KAlgebra; vector layer manipulation in GrassGIS; LaTeX document generation referencing simulation results.
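
As a quick consistency check on the statistics above, the following snippet confirms that both breakdowns sum to 169 and converts the counts to percentage shares. The counts are taken from the list; the `shares` helper is only for illustration.

```python
interaction = {"GUI-only": 38, "CLI-only": 33, "hybrid GUI+CLI": 98}
difficulty = {"easy": 91, "medium": 48, "hard": 28, "open-ended": 2}


def shares(counts):
    """Convert raw counts to percentage shares, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# Both breakdowns partition the same 169 tasks.
assert sum(interaction.values()) == sum(difficulty.values()) == 169

interaction_shares = shares(interaction)
difficulty_shares = shares(difficulty)
```

Hybrid GUI+CLI tasks come to roughly 58% of the benchmark, and easy tasks to roughly 54%.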

4. Evaluation Methodology and Empirical Findings

Agents, including GPT-4o, Claude 3.7, Gemini 2.0, Qwen2.5-VL, InternVL3, and open-source GUI actors (OS-Atlas-Pro, UGround, UI-TARS), are evaluated under diverse observation settings. The principal metric, success rate (SR), is computed as:

$$\mathrm{SR} = \frac{N_{\mathrm{succ}}}{N_{\mathrm{total}}} \times 100\%$$
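
The metric can be computed directly from per-task PASS/FAIL outcomes. A minimal sketch follows; the function names and the sample outcomes are hypothetical and do not reproduce any reported result.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate(outcomes: Iterable[bool]) -> float:
    """SR = N_succ / N_total * 100, over a non-empty set of task outcomes."""
    results = list(outcomes)
    if not results:
        raise ValueError("success rate is undefined for zero tasks")
    return 100.0 * sum(1 for ok in results if ok) / len(results)


def success_rate_by_group(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Per-group SR, e.g. grouped by domain or by observation setting."""
    grouped = defaultdict(list)
    for group, ok in records:
        grouped[group].append(ok)
    return {group: success_rate(oks) for group, oks in grouped.items()}


# Hypothetical run: 3 successes out of 20 tasks.
overall = success_rate([True] * 3 + [False] * 17)

# Hypothetical per-domain records as (domain, passed) pairs.
per_domain = success_rate_by_group([
    ("Algebra", True), ("Algebra", False),
    ("GIS", False), ("GIS", False),
])
```

The same grouping function covers both the domain-level and the observation-specific breakdowns, since only the grouping key changes.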

Domain-level SR and observation-specific SR are similarly reported. Key empirical results:

  • Overall agent performance: SR ranges from 0% to a maximum of 15.8% across agents; human reference ~60%.
  • Domain-wise best (hybrid screenshot+a11ytree, GPT-4o):
    • Biochemistry: 37.9%
    • Algebra: 22.6%
    • GIS: 2.9%
    • ATP: 7.7%
    • Astronomy: 3.0%
    • Documentation: 12.5%
  • Observation impact: Combined screenshot and a11ytree input significantly outperforms vision-only and text-only inputs.
  • Error patterns: High failure rates in GUI-only, vision-only settings (<5% SR) due to poor spatial grounding and dense visual complexity.

5. Agent Design Insights, Failure Modes, and Recommendations

A detailed error taxonomy reveals four dominant failure modes:

  • Grounding errors: Mis-clicking or selecting incorrect UI elements.
  • Mis-invoked functions: Incorrect shell commands or application menu entries.
  • Syntactic/parameter errors: Faulty CLI scripts, malformed text input.
  • Visual scene complexity: GIS maps, molecular viewers, and star fields strain spatial reasoning capabilities.

Design insights:

  • Observation fusion (screenshot+a11ytree) enables superior agent localization, with consistent gains across models.
  • Planning/grounding decoupling: Separating high-level planning from grounding (e.g., GPT-4o for planning, Qwen2.5 for grounding) raises SR from 0.8% to 17% in the screenshot setting.
  • Hybrid GUI-CLI: GUI+CLI tasks are more robustly handled by strong grounding models; loss of CLI access penalizes text-action agents disproportionately.
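
The planning/grounding split described above can be sketched as two interchangeable callables: one turns the task and screenshot into a natural-language step, and the other maps that step to pixel coordinates. The signatures and stub models below are assumptions for illustration and do not reflect how the paper's models are actually invoked.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# (task instruction, screenshot bytes) -> natural-language next step
Planner = Callable[[str, bytes], str]
# (step description, screenshot bytes) -> (x, y) pixel coordinates
Grounder = Callable[[str, bytes], Tuple[int, int]]


@dataclass(frozen=True)
class Click:
    x: int
    y: int


def decoupled_step(instruction: str, screenshot: bytes,
                   planner: Planner, grounder: Grounder) -> Click:
    """One agent step: plan in language, then ground to coordinates."""
    step = planner(instruction, screenshot)
    x, y = grounder(step, screenshot)
    return Click(x, y)


# Stub models standing in for a planning LLM and a grounding model.
def stub_planner(instruction: str, screenshot: bytes) -> str:
    return "click the 'File' menu"


def stub_grounder(step: str, screenshot: bytes) -> Tuple[int, int]:
    return (12, 8) if "File" in step else (0, 0)


action = decoupled_step("Open the sample project", b"", stub_planner, stub_grounder)
```

Because the two roles only share a text description, either model can be swapped without retraining the other.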

Recommendations for future systems include harmonizing domain-specific LLM knowledge with agentic control, modularizing agent architecture (multi-agent delegation for planning, GUI grounding, CLI scripting, domain reasoning), and extending to physical “lab-in-the-loop” automation.

6. Extensions and State-of-the-Art: CODA and Compositional Architectures

Emerging compositional models, such as CODA, use the ScienceBoard benchmark to evaluate modular “dual brain” approaches. CODA separates a generalist planner (“Cerebrum,” using Qwen2.5-VL-32B) from a specialist executor (“Cerebellum,” UI-TARS-1.5) and trains with a decoupled GRPO-based RL objective (Sun et al., 27 Aug 2025):

  • Stage 1 (Specialization): Per-domain planners are RL-trained using small seed sets and expert trajectories, with only the planner updated.
  • Stage 2 (Generalization): Aggregated trajectories across domains are used for SFT of a new generalist planner.
  • Empirical Results (ScienceBoard subset):
    • CODA (generalist) achieves 21% overall SR, exceeding both specialist and baseline models (Qwen2.5-VL-32B baseline: 7.6%). Per-domain improvements are consistently significant.
    • Execution precision: mean L1 error <5 pixels. Rollout depth: mean first-error step increases from 4.3 to 9.2.
    • Ablation studies: removing decoupled RL (end-to-end baseline) reduces performance by 18%; skipping Stage 2 (no generalization) reduces it by 7%.

A plausible implication is that modular, trainable, and decoupled approaches may be critical for further progress on challenging agentic science benchmarks.

7. Significance and Future Directions

ScienceBoard quantifies the formidable gap between current multimodal agentic LLMs and human proficiency in scientific workflows. Despite rapid progress, peak success rates remain in the 15–21% range and agents persistently fail on complex task sequences, so reliable digital “co-scientists” remain an unsolved challenge. The released environment, open benchmark, and leaderboard catalyze research into domain-aware, collaborative, and robust scientific agent architectures (Sun et al., 26 May 2025, Sun et al., 27 Aug 2025). Expanding the platform to physical lab automation and continual agent self-improvement remains an open direction for future advances in computational scientific discovery.
