Towards a Realistic Long-Term Benchmark for Open-Web Research Agents (2409.14913v2)

Published 23 Sep 2024 in cs.CL, cs.IR, and cs.LG

Abstract: We present initial results of a forthcoming benchmark for evaluating LLM agents on white-collar tasks of economic value. We evaluate agents on real-world "messy" open-web research tasks of the type that are routine in finance and consulting. In doing so, we lay the groundwork for an LLM agent evaluation suite where good performance directly corresponds to a large economic and societal impact. We built and tested several agent architectures with o1-preview, GPT-4o, Claude-3.5 Sonnet, Llama 3.1 (405b), and GPT-4o-mini. On average, LLM agents powered by Claude-3.5 Sonnet and o1-preview substantially outperformed agents using GPT-4o, with agents based on Llama 3.1 (405b) and GPT-4o-mini lagging noticeably behind. Across LLMs, a ReAct architecture with the ability to delegate subtasks to subagents performed best. In addition to quantitative evaluations, we qualitatively assessed the performance of the LLM agents by inspecting their traces and reflecting on their observations. Our evaluation represents the first in-depth assessment of agents' abilities to conduct challenging, economically valuable analyst-style research on the real open web.

Towards a Realistic Long-Term Benchmark for Open-Web Research Agents

The paper "Towards a Realistic Long-Term Benchmark for Open-Web Research Agents" by Peter Mühlbacher, Nikos I. Bosse, and Lawrence Phillips, offers a notable contribution to the evaluation of LLM agents performing white-collar tasks. The authors embark on crafting a comprehensive benchmark that emphasizes real-world, economically impactful open-web research tasks. This essay provides an insightful overview of their methodology, results, and the implications of their research.

The core focus of this work is the development of a benchmark that accurately reflects the capabilities and limitations of LLM agents on complex, real-world tasks typical of finance and consulting. Unlike existing benchmarks, which often assess more straightforward tasks, this benchmark targets multifaceted research scenarios such as compiling lists of AI labs, tracing the origins of geopolitical relationships, and estimating economic metrics from open-web data. These tasks are intrinsically messy, reflecting the character of real-world labor, and thus provide a more rigorous and practical assessment.

Methodology

The authors evaluate several agent architectures powered by different LLMs: o1-preview, GPT-4o, Claude-3.5 Sonnet, Llama 3.1 (405b), and GPT-4o-mini. They employ a ReAct architecture with subagent delegation capabilities and explore other architectures with varying degrees of explicit planning and delegation.
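The paper itself does not ship code, but the control flow of the best-performing configuration, a ReAct loop that can hand narrower subtasks to fresh subagents, can be sketched roughly as follows. The `Step` record, the `llm.next_step` wrapper, the tool registry, and the step budget are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Step:
    thought: str           # the model's free-form reasoning for this step
    action: str            # "search", "browse", "delegate", or "finish"
    argument: str          # query, URL, subtask description, or final answer
    observation: str = ""  # tool or subagent output fed back into context

def react_agent(task: str, llm, tools: dict, max_steps: int = 20) -> str:
    """ReAct loop with subtask delegation: the LLM alternates reasoning and
    tool calls until it emits 'finish'; a 'delegate' action spawns a fresh
    subagent with its own clean context on a narrower subtask."""
    trace: list[Step] = []
    for _ in range(max_steps):
        step = llm.next_step(task, trace)   # hypothetical LLM wrapper
        if step.action == "finish":
            return step.argument
        if step.action == "delegate":
            step.observation = react_agent(step.argument, llm, tools, max_steps)
        else:
            step.observation = tools[step.action](step.argument)
        trace.append(step)
    return "Step budget exhausted without a final answer."
```

One apparent advantage of this structure is that each subagent works in its own context, which keeps the top-level trace short on long, multi-part research tasks.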

The chosen tasks span multiple research domains:

  • Geopolitical forecasting
  • Financial forecasting
  • Epidemiological forecasting
  • Competitor analysis
  • Market sizing

Each task involves a bespoke scoring rubric to account for partial successes, evaluating agents on a detailed set of criteria tailored to the complexities of the task. For instance, for the task of compiling a list of Chinese AI labs using more than 10^24 FLOP, the rubric captures the ability to identify relevant organizations and to compute figures from complex datasets.
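As a rough illustration of how such partial-credit rubrics can be turned into a score, consider the weighted checklist below; the specific criteria, weights, and pass/fail judgements are hypothetical and are not taken from the paper's rubrics.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # what the agent's answer must contain
    weight: float      # relative importance within this task's rubric
    met: bool          # judged by a human grader or an LLM judge

def rubric_score(criteria: list[Criterion]) -> float:
    """Weighted partial-credit score in [0, 1]."""
    total = sum(c.weight for c in criteria)
    return sum(c.weight for c in criteria if c.met) / total if total else 0.0

# Hypothetical rubric for the Chinese-AI-labs task described above.
example = [
    Criterion("identifies the relevant organizations", 0.5, met=True),
    Criterion("estimates training compute per lab", 0.3, met=False),
    Criterion("cites primary sources for each figure", 0.2, met=True),
]
print(rubric_score(example))  # 0.7
```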

Results

The paper presents compelling numerical results, highlighting the relative performance of the various LLMs and agent architectures. Claude-3.5 Sonnet-based agents show notable consistency, particularly those using a ReAct architecture with subtask delegation. The most striking results include:

  • Claude-3.5 Sonnet's consistent performance across tasks
  • Claude-3.5 Sonnet ReAct agents with subtask delegation achieving the top scores
  • o1-preview's occasionally superior but less consistent performance

Comparatively, GPT-4o and Llama 3.1 (405b)-based agents underperform, exhibiting issues such as getting stuck in repetitive loops and failing to efficiently handle parallel tasks. Notably, no architecture across LLMs consistently resolved tasks involving complex planning or subtasks requiring substantial understanding and adaptation.
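The repetitive-loop behaviour noted above is one of the easier failures to flag automatically. A minimal heuristic over an agent's action history might look like the following; the tuple representation and the three-call window are assumptions made for the sketch, not details from the paper.

```python
def is_stuck(actions: list[tuple[str, str]], window: int = 3) -> bool:
    """Return True when the last `window` (action, argument) pairs are
    identical, i.e. the agent keeps repeating the same tool call without
    gaining new information."""
    return len(actions) >= window and len(set(actions[-window:])) == 1

# Example: the agent has issued the same web search three times in a row.
history = [("search", "Llama 3.1 405B training compute")] * 3
assert is_stuck(history)
```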

Failure Modes

The taxonomy of failures sheds light on weaknesses in both architectures and underlying models. Common failure modes include:

  • Suboptimal subtask delegation in ReAct architectures
  • Misalignment between generated subtasks and primary goals in planning architectures
  • Numeric computation errors, particularly in probabilistic estimates and statistical inference tasks
  • Over-reliance on surface-level web content without digging deeper into available datasets or sources

For instance, when tasked with finding original sources for the declaration of Sino-Russian "friendship without limits," agents often defaulted to superficial searches, missing deeper and more reliable government or organizational documents.

Discussion and Implications

The findings underscore the critical need for improvement in several areas:

  1. Task Planning and Execution: Enhanced methods for agents to dynamically adjust plans based on interim results and evolving data.
  2. Parallel Task Handling: More sophisticated subtask delegation and aggregation techniques to fully leverage the ReAct architecture's potential (a sketch follows this list).
  3. Numerical Robustness: Improved capabilities for handling complex numerical data and probabilistic computations accurately.
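One plausible way to approach point 2 is to fan independent subtasks out concurrently and then ask the model to synthesize the partial findings, rather than working through them sequentially. The sketch below assumes a generic `run_subagent` callable (for instance, the ReAct loop sketched earlier) and a hypothetical `llm.complete` interface; it is not the architecture evaluated in the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out_and_aggregate(subtasks: list[str], run_subagent, llm) -> str:
    """Run independent research subtasks in parallel threads, then merge
    the partial findings with a single aggregation call to the LLM."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(run_subagent, subtasks))
    findings = "\n\n".join(
        f"Subtask: {task}\nFinding: {result}"
        for task, result in zip(subtasks, results)
    )
    return llm.complete(  # hypothetical LLM client call
        "Aggregate these findings into a single research summary:\n\n" + findings
    )
```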

These results have far-reaching implications for developing AI systems capable of automating high-value tasks. A robust benchmark, continuously updated to reflect real-world tasks, can act as a bellwether for advancements, offering early indicators of transformative economic impacts. Furthermore, the research provides actionable insights into improving LLM agents’ architecture, emphasizing balanced growth across understanding, planning, and execution capabilities.

Future Work

The paper alludes to future developments, suggesting continued refinement of the benchmark and its task suite. This evolving benchmark will need to adapt to new challenges as LLM capabilities advance. Key areas for future research involve:

  • Enhancing the agents' capabilities in extracting and synthesizing information across diverse and multilingual web sources.
  • Developing better models for real-time adaptation to new data.
  • Expanding the task suite to cover emerging domains of significant economic importance.

In conclusion, Mühlbacher, Bosse, and Phillips present a robust framework for assessing LLM agents in realistic, high-stakes research tasks. Their benchmark is poised to be a vital tool in tracking and guiding the development of LLM technologies, ensuring they align closely with the nuanced demands of real-world applications.

Authors (3)
  1. Peter Mühlbacher (6 papers)
  2. Nikos I. Bosse (5 papers)
  3. Lawrence Phillips (11 papers)