Examination of Automatic Evaluations in Web Agents Through AgentRewardBench
The paper "AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories" presents a novel benchmark specifically designed to evaluate the effectiveness of LLMs as automatic judges for web agents. Web agents, which perform tasks on web browsers via natural language instructions, represent a versatile domain of AI applications. Accurate evaluation of these agents is crucial to understand their capabilities and shortcomings in task completion. Traditional rule-based evaluation methods, while currently popular, have limitations in terms of scalability and adaptability to new tasks, often underreporting the success rates of web agents. This work introduces AgentRewardBench as a response to these challenges, aiming to standardize the evaluation process across diverse benchmarks and propose a reliable alternative through LLM-based evaluations.
AgentRewardBench comprises 1302 web agent trajectories collected from five benchmarks, generated by agents built on four distinct LLMs. Expert annotators review each trajectory along three dimensions: task success, unintended side effects, and action repetition. By comparing LLM judgments against these human annotations, the benchmark serves as a rigorous test bed for judge quality. The paper evaluates 12 LLM judges and finds that no single model excels across all benchmarks, while judges given simpler input designs may align better with expert annotations.
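To make the judge-versus-expert comparison concrete, the sketch below shows one way a trajectory annotation could be represented and how a judge's success verdict might be scored against the expert label. The record layout, field names, and helper function are illustrative assumptions, not the actual AgentRewardBench schema or code.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryAnnotation:
    # Hypothetical record layout; not the actual AgentRewardBench schema.
    trajectory_id: str
    benchmark: str             # one of the five source benchmarks
    agent_llm: str             # which of the four LLMs drove the agent
    expert_success: bool       # expert label: was the task completed?
    expert_side_effects: bool  # expert label: unintended side effects observed?
    expert_repetition: bool    # expert label: did the agent repeat actions?

def judge_agreement(annotations, judge_verdicts):
    """Fraction of trajectories where the judge's success verdict matches
    the expert's success label. judge_verdicts maps trajectory_id -> bool."""
    matches = sum(
        judge_verdicts[a.trajectory_id] == a.expert_success
        for a in annotations
    )
    return matches / len(annotations)
```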
Key Insights and Numerical Findings:
- Traditional rule-based evaluations substantially underestimate web agent success. For instance, rule-based methods showed discrepancies of up to 30% against human annotations on benchmarks such as WebArena.
- LLM judges reached a reported precision of up to 70%, while rule-based methods recalled only 55.9% of the successes identified by expert annotators; together, these figures illustrate a notable gap in how success is measured (precision and recall are sketched after this list).
- Error analyses identified specific LLM judge weaknesses, such as grounding mismatches and being misled by the agent's own reasoning, pointing to concrete directions for improving automatic evaluation techniques.
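As referenced in the list above, the following is a minimal sketch of how precision and recall are computed from per-trajectory success verdicts. It illustrates only the standard definitions under the assumption of binary success labels; it is not the paper's evaluation code.

```python
def precision_recall(expert_labels, predicted_labels):
    """Precision and recall of a binary success verdict against expert labels.

    expert_labels, predicted_labels: parallel lists of booleans, one entry
    per trajectory (names and layout are illustrative assumptions).
    """
    tp = sum(p and e for p, e in zip(predicted_labels, expert_labels))
    fp = sum(p and not e for p, e in zip(predicted_labels, expert_labels))
    fn = sum(e and not p for p, e in zip(predicted_labels, expert_labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example: experts marked 3 of 4 trajectories successful; the judge agreed on 2
# of them and flagged no false successes, giving precision 1.0 and recall ~0.67.
# precision_recall([True, True, False, True], [True, False, False, True])
```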
Implications and Future Directions:
The findings from AgentRewardBench carry implications for both theory and practice. Theoretically, the work offers a structured way to compare LLM-based evaluations against traditional rule-based methodologies, contributing to the growing body of knowledge on LLM reliability and utility. Practically, the paper calls for judge designs that better detect nuanced issues in agent trajectories and more closely mirror expert evaluation standards.
For future research, emphasis should be placed on refining LLM judges to address the identified error categories, such as missed instruction details and misunderstood action intent. The evolving landscape of web-based AI tasks demands continuous updates to evaluation models so that they keep pace with advances in LLM capabilities. Adaptive methods that incorporate aspects of human decision-making could further improve the accuracy and robustness of evaluations.
In conclusion, AgentRewardBench establishes a critical benchmark for the automatic evaluation of web agents, paving the way for more adaptive and accurate LLM judge designs. Its comprehensive dataset and careful annotation protocol provide a foundation for future work on automatic trajectory evaluation, ultimately contributing to web agents that perform better and are more practically applicable. As AI continues to advance, such benchmarks will play an essential role in shaping the scope and depth of intelligent systems deployed for complex web-based tasks.