ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery (2410.05080v3)

Published 7 Oct 2024 in cs.CL, cs.AI, and cs.LG

Abstract: The advancements of LLMs have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about their true capabilities. In this work, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using ScienceAgentBench, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands CodeAct, and self-debug. Given three attempts for each task, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. In addition, we evaluate OpenAI o1-preview with direct prompting and self-debug, which can boost the performance to 42.2%, demonstrating the effectiveness of increasing inference-time compute but with more than 10 times the cost of other LLMs. Still, our results underscore the limitations of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.

Summary

  • The paper introduces a benchmark of 102 tasks from 44 peer-reviewed studies, validated by experts to assess language agents in data-driven scientific discovery.
  • The evaluation framework employs metrics such as valid execution rate, success rate (at most 32.4% independently and 34.3% with expert-provided knowledge), CodeBERTScore, and API cost to measure both performance and efficiency.
  • The study highlights current limitations in automating scientific workflows and proposes expanding the benchmark and grading metrics for future improvements.

Toward Rigorous Evaluation of Language Agents in Scientific Discovery

The paper "ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery" introduces a benchmark aimed at evaluating the capabilities of LLMs in automating specific tasks within the scientific discovery process. This benchmark, ScienceAgentBench, is constructed from 102 tasks extracted from 44 peer-reviewed publications, covering four disciplines: Bioinformatics, Computational Chemistry, Geographical Information Science, and Psychology/Cognitive Neuroscience. These tasks are validated by nine subject experts to ensure scientific authenticity.

Key Features and Methodology

  1. Task Extraction and Validation: The benchmark includes tasks directly extracted from published scientific workflows to ensure real-world relevance. Each task is paired with a validated dataset and gold-standard program, and the target output of every task is unified into a single self-contained Python program file to facilitate automatic evaluation.
  2. Evaluation Framework: The benchmark employs several evaluation metrics, including valid execution rate, success rate, CodeBERTScore, and API cost, to measure the quality and efficiency of the agent-generated programs; a minimal harness along these lines is sketched after this list. This approach emphasizes both performance and cost-effectiveness in practical applications.
  3. Data Contamination Mitigation: To address concerns about LLM pre-training data contamination, the benchmark modifies test datasets and labels so that models cannot fall back on memorized solutions; a toy example of such a perturbation is also sketched below.
  4. LLM and Framework Evaluations: The paper evaluates five open-weight and proprietary LLMs across three frameworks: direct prompting, OpenHands CodeAct, and self-debug. Claude-3.5-Sonnet achieved the highest success rate among them, yet the results still highlight the limitations of current models in fully automating scientific workflows.
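
To make the evaluation setup concrete, here is a minimal sketch of how a harness for such a benchmark could run each generated, self-contained program and aggregate valid execution rate and success rate. The directory layout, file names (pred_program.py, gold_output.csv), timeout, and byte-level output comparison are illustrative assumptions, not the benchmark's actual implementation, which uses task-specific checks and additionally reports CodeBERTScore and API cost.

```python
import subprocess
from pathlib import Path

# Hypothetical layout (not the benchmark's actual one): each task directory
# holds the agent-generated program plus a reference output produced by the
# annotated gold program.
TASKS_DIR = Path("tasks")
TIMEOUT_SECONDS = 600  # assumed cap on per-program runtime


def check_output(task_dir: Path) -> bool:
    """Toy success criterion: the program wrote the expected file and it
    matches the reference byte-for-byte. The real benchmark uses
    task-specific checks on figures, tables, and model outputs."""
    produced = task_dir / "pred_output.csv"
    reference = task_dir / "gold_output.csv"
    return produced.exists() and produced.read_bytes() == reference.read_bytes()


def evaluate(task_dirs):
    """Run each generated program and aggregate two of the reported metrics:
    valid execution rate and success rate."""
    executed = succeeded = 0
    for task_dir in task_dirs:
        try:
            result = subprocess.run(
                ["python", str(task_dir / "pred_program.py")],
                cwd=task_dir,
                capture_output=True,
                timeout=TIMEOUT_SECONDS,
            )
        except subprocess.TimeoutExpired:
            continue  # a hung program counts as a failed execution
        if result.returncode == 0:      # ran to completion without error
            executed += 1
            if check_output(task_dir):  # and reproduced the reference output
                succeeded += 1
    total = len(task_dirs)
    return {
        "valid_execution_rate": executed / total,
        "success_rate": succeeded / total,
    }


if __name__ == "__main__":
    dirs = sorted(d for d in TASKS_DIR.iterdir() if d.is_dir())
    print(evaluate(dirs))
```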
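
The data-contamination mitigation can be illustrated in the same spirit. The sketch below perturbs a hypothetical test CSV so that a program memorized from pre-training data no longer reproduces the reference output verbatim; the column names, label values, and concrete transformations are placeholders, and the paper's actual modifications are task-specific.

```python
import pandas as pd


def perturb_dataset(in_csv: str, out_csv: str, seed: int = 0) -> None:
    """Illustrative perturbation of a test table: shuffle row order, rename
    columns to equivalent but unseen identifiers, and remap label values.
    The column and label names here are hypothetical placeholders."""
    df = pd.read_csv(in_csv)

    # Break positional shortcuts a memorized program might rely on.
    df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)

    # Rename columns so hard-coded column references stop matching.
    df = df.rename(columns={"label": "outcome", "subject_id": "participant_id"})

    # Remap categorical label values consistently across the file.
    if "outcome" in df.columns:
        df["outcome"] = df["outcome"].replace({"positive": "pos", "negative": "neg"})

    df.to_csv(out_csv, index=False)
```

The intent is that a genuinely correct program still yields scientifically equivalent results on the perturbed data, while an answer memorized against the original file no longer matches the updated reference.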

Findings and Implications

The evaluation results demonstrate that even the best-performing language agents are currently unable to comprehensively automate scientific discovery tasks, achieving a maximum success rate of 34.3% when leveraging expert-provided knowledge; OpenAI o1-preview with direct prompting and self-debug boosts this to 42.2%, but at more than ten times the inference cost of the other LLMs. This underscores the challenges that language agents face in handling complex, data-driven scientific tasks requiring precise code generation and execution.

The paper argues for a more nuanced assessment of language agent capabilities, advocating a focus on individual task performance rather than end-to-end processes. This strategy provides a more detailed understanding of the specific areas where LLMs excel or need improvement.

Future Directions

The authors suggest expanding ScienceAgentBench to include additional disciplines and tasks, thereby enhancing its scope and robustness. They also propose developing new automatic grading metrics to further refine language agent evaluation.

In summary, ScienceAgentBench represents a step forward in the rigorous assessment of language agents within scientific domains. The benchmark aims to foster the development of more capable agents that can augment scientists' productivity by reliably automating complex elements of their workflows.
