
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows (2505.19897v2)

Published 26 May 2025 in cs.AI, cs.CL, cs.CV, and cs.HC

Abstract: LLMs have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.

Summary

  • The paper introduces ScienceBoard, a novel environment that benchmarks multimodal autonomous agents performing realistic scientific tasks on a dynamic Ubuntu VM with professional software.
  • The benchmark comprises 169 expert-curated tasks across six disciplines, exposing significant performance gaps between state-of-the-art agents and human researchers.
  • Experiments reveal that integrating visual and textual data improves performance, underscoring the need for modular agent designs and better scientific grounding.

The paper "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows" (2505.19897) introduces a novel environment and benchmark designed to evaluate the capabilities of multimodal autonomous agents in performing realistic scientific research tasks. The core idea is to move beyond static question-answering or coding tasks and assess agents' ability to interact with professional scientific software within a dynamic operating system environment.

The paper presents ScienceBoard, which consists of two main components:

  1. A realistic, multi-domain environment: This environment is built on an Ubuntu virtual machine (VM) pre-installed with a suite of professional scientific software. Agents can interact with it through both graphical user interfaces (GUI) and command-line interfaces (CLI), mirroring how a human researcher uses a computer. The environment is dynamic, visually rich, and supports complex workflows. To enable rigorous evaluation, the software inside the VM is adapted to expose its internal state via a lightweight server, so that external scripts can query application states and verify task completion (a minimal sketch of such a check follows this list).
  2. A challenging benchmark of 169 tasks: These tasks are curated by domain experts across six scientific disciplines: biochemistry (UCSF ChimeraX), algebra (KAlgebra), theorem proving (Lean 4), geographic information systems (GRASS GIS), astronomy (Celestia), and scientific documentation (TeXstudio). The tasks simulate realistic scientific workflows and require a range of agent capabilities, including visual and textual reasoning, tool manipulation, coding, mathematics, spatial understanding, and domain knowledge. Tasks are categorized by difficulty (easy, medium, hard, open problems) and interaction type (GUI, CLI, or mixed). A detailed annotation pipeline ensures task quality and diversity and provides configuration and evaluation scripts for each task. Evaluation is fine-grained, based on VM internal states and key I/O correctness, using templates for criteria such as exact matching, range checks, and domain-specific success markers (a sketch of such templates also appears after this list).
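To make the state-based checking in item 1 concrete, here is a minimal sketch of an external verification script. The HTTP endpoint, server address, and JSON fields are hypothetical illustrations; the actual ScienceBoard interfaces may differ.

```python
import requests

VM_STATE_SERVER = "http://192.168.122.10:8000"  # hypothetical address of the lightweight in-VM server


def get_app_state(app: str) -> dict:
    """Query the in-VM server for an application's internal state (hypothetical endpoint)."""
    resp = requests.get(f"{VM_STATE_SERVER}/state", params={"app": app}, timeout=10)
    resp.raise_for_status()
    return resp.json()


def check_chimerax_task(expected_model: str, expected_style: str) -> bool:
    """Example verifier: did the agent open the right structure and apply the right display style?"""
    state = get_app_state("chimerax")
    return (expected_model in state.get("open_models", [])
            and state.get("display_style") == expected_style)


if __name__ == "__main__":
    ok = check_chimerax_task(expected_model="1ABC", expected_style="cartoon")
    print("task passed" if ok else "task failed")
```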
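The evaluation templates mentioned in item 2 (exact matching, range checks, domain-specific markers) could be organized as small reusable checkers. The registry, task-config format, and field names below are assumptions for illustration, not the paper's actual evaluation code.

```python
from typing import Any, Callable, Dict

# Hypothetical registry of reusable evaluation templates.
TEMPLATES: Dict[str, Callable[..., bool]] = {}


def template(name: str):
    """Register an evaluation function under a template name."""
    def register(fn: Callable[..., bool]) -> Callable[..., bool]:
        TEMPLATES[name] = fn
        return fn
    return register


@template("exact_match")
def exact_match(actual: Any, expected: Any) -> bool:
    # e.g. a KAlgebra expression or a final file path must match exactly
    return actual == expected


@template("range_check")
def range_check(actual: float, low: float, high: float) -> bool:
    # e.g. a GIS measurement that only needs to fall within a tolerance band
    return low <= actual <= high


def evaluate(task_cfg: dict, observed_state: dict) -> bool:
    """Run the template named in a task's config against the observed VM state."""
    checker = TEMPLATES[task_cfg["template"]]
    return checker(observed_state[task_cfg["key"]], *task_cfg.get("args", []))


# Example (hypothetical) task config:
# {"template": "range_check", "key": "measured_area_km2", "args": [41.9, 42.1]}
```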

The researchers evaluated state-of-the-art LLMs and VLMs as agents on ScienceBoard under several observation settings: screenshot only, a11ytree (a structured textual representation of the interface's accessibility tree), combined screenshot + a11ytree, and Set-of-Mark (screenshots annotated with element markers). The evaluated models included proprietary models such as GPT-4o, Claude-3.7-Sonnet, Gemini-2.0-Flash, and o3-mini; open-source models such as Qwen2.5-VL-72B-Instruct and InternVL3-78B; and specialized GUI action models such as OS-Atlas-Pro-7B, UGround-V1-7B, and UI-TARS-72B-DPO.
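To illustrate how these observation settings differ in practice, the sketch below assembles a single agent step from a screenshot and/or an a11ytree dump using an OpenAI-style chat message format. The helper names and message layout are assumptions, not ScienceBoard's actual harness.

```python
import base64
from typing import Optional


def encode_screenshot(path: str) -> str:
    """Base64-encode a screenshot so it can be passed to a vision-language model."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()


def build_observation(screenshot_path: Optional[str], a11y_tree: Optional[str]) -> list:
    """Combine visual and structured textual observations into a single user message."""
    content = []
    if a11y_tree is not None:
        content.append({"type": "text",
                        "text": f"Accessibility tree of the current screen:\n{a11y_tree}"})
    if screenshot_path is not None:
        content.append({"type": "image_url",
                        "image_url": {"url": "data:image/png;base64,"
                                             + encode_screenshot(screenshot_path)}})
    return [{"role": "user", "content": content}]


# Screenshot-only, a11ytree-only, and combined settings are just different
# combinations of the two arguments above.
```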

The key findings from the evaluations highlight significant limitations of current agents:

  • Overall, even the best models (e.g., GPT-4o and Claude-3.7-Sonnet) achieved an average success rate of only around 15%, far below the human success rate of over 60%. Open-source models generally performed worse, with success rates under 12%.
  • Performance varied across domains. Agents performed relatively better on Algebra and Biochemistry tasks (often supporting mixed GUI/CLI interaction) compared to GIS and Astronomy tasks (heavily reliant on GUI interactions and complex spatial reasoning). This suggests that current agents struggle more with visual grounding and complex 3D spatial reasoning required in GUI-heavy scientific applications.
  • The combination of screenshots and a11ytree generally yielded the best performance, indicating the benefit of integrating both visual and structured textual information. Pure screenshot observation was challenging, and Set-of-Mark sometimes introduced noise in complex visual environments.

Analysis of failures revealed issues such as poor grounding (e.g., clicking the wrong element), inability to invoke software functions correctly (e.g., guessing function names instead of browsing menus), and malformed CLI commands. Experiments with a modular approach, using GPT-4o as a planner and specialized models such as Qwen2.5-VL-72B as grounders, showed significant performance improvements on complex GUI tasks, suggesting benefits in separating planning from action grounding. Comparing tasks that allow both GUI and CLI interaction against GUI-only variants also showed that some models rely heavily on the CLI when it is available and struggle when forced to operate through the GUI alone.
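A rough sketch of that planner/grounder split is shown below, with hypothetical callables standing in for the planner VLM (e.g., GPT-4o), the grounding model (e.g., Qwen2.5-VL-72B), and the VM controller; it is an illustration of the idea, not the paper's implementation.

```python
from typing import Callable, List


def run_modular_agent(
    task: str,
    call_planner: Callable[..., str],    # hypothetical wrapper: returns a textual high-level step
    call_grounder: Callable[..., dict],  # hypothetical wrapper: returns a concrete GUI action
    capture_screenshot: Callable[[], bytes],
    execute_action: Callable[[dict], None],
    task_done: Callable[[], bool],
    max_steps: int = 15,
) -> bool:
    """Planner proposes high-level steps; a grounding model maps them to concrete GUI actions."""
    history: List[str] = []
    for _ in range(max_steps):
        screenshot = capture_screenshot()
        # The planner reasons over the task, the current screen, and prior steps, emitting an
        # abstract instruction such as "open the Render menu and choose Surfaces".
        plan = call_planner(task=task, screenshot=screenshot, history=history)
        # The grounder localizes the referenced UI element and returns an executable
        # action, e.g. {"type": "click", "x": 512, "y": 88}.
        action = call_grounder(instruction=plan, screenshot=screenshot)
        execute_action(action)
        history.append(plan)
        if task_done():
            return True
    return False
```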

Based on these findings, the paper discusses future directions for building more capable AI co-scientists:

  • Harmonizing Domain Knowledge and Agentic Capability: Current agents lack sufficient scientific domain knowledge and need better ways to integrate it and retrieve external information.
  • Collaborative and Specialized Agents: A multi-agent system where different agents handle planning, action execution, and domain-specific reasoning could overcome individual agent limitations.
  • Extending to Physical Laboratory: The ultimate goal is to extend these digital agents' capabilities to control physical lab equipment and conduct experiments in the real world.

In summary, ScienceBoard provides a critical stepping stone for developing AI agents capable of autonomous scientific discovery by offering a realistic evaluation environment with diverse, human-curated tasks involving professional software. The results demonstrate that current agents are far from achieving human-level proficiency in such workflows, highlighting the need for significant advancements, particularly in visual grounding, domain knowledge integration, and potentially modular agent designs. The platform's infrastructure, based on VM states and fine-grained evaluation, offers a robust basis for future research and benchmarking in this area.