BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks (2510.02418v2)

Published 2 Oct 2025 in cs.AI and cs.LG

Abstract: LLM web agents now browse and take actions on the open web, yet current agent evaluations are constrained to sandboxed environments or artificial tasks. We introduce BrowserArena, a live open-web agent evaluation platform that collects user-submitted tasks, runs Arena-style head-to-head comparisons, and uses step-level human feedback to surface failure modes. Collecting and analyzing step-level annotations on the agent traces, we identify three consistent failure modes: captcha resolution, pop-up banner removal, and direct navigation to URLs. By constructing targeted datasets to further study these tasks, we discover variations in how different LLMs navigate these failure modes. We find, for example, that o4-mini deploys a wider variety of strategies to circumvent captcha resolution than other models and DeepSeek-R1 consistently misleads users about pop-up banner closure. Our findings surface both the diversity and brittleness of current web agents. More broadly, our benchmarking methodology provides an approach to evaluating and understanding web agent failure modes at scale.

Summary

  • The paper introduces BrowserArena, a novel platform evaluating LLMs on real-world web navigation tasks using live user-submitted challenges.
  • It employs pairwise comparisons and human feedback to uncover practical strengths, identify common failure modes such as CAPTCHA and pop-up handling, and highlight performance disparities among models.
  • The study demonstrates that multimodal capabilities significantly impact web navigation success, paving the way for improved agent evaluation methodologies.

BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks

The paper "BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks" introduces a novel platform for evaluating LLMs in the context of web navigation tasks. This study focuses on assessing the practical capabilities of LLMs as web agents interacting with live websites, highlighting both the strengths and persistent failure modes of these agents.

Introduction

BrowserArena is developed to address the limitations of existing web-agent benchmarks, which primarily operate within sandboxed, closed environments. By moving evaluation to the open web, BrowserArena collects diverse user-submitted tasks and offers a more realistic appraisal of LLM capabilities. The platform combines head-to-head comparisons with step-level human feedback to identify recurrent failure modes such as CAPTCHA resolution, pop-up banner removal, and direct URL navigation (Figure 1).

Figure 1: An overview of the study procedure showing how users interact with BrowserArena.

Methodology

BrowserArena uses a live, open-web evaluation strategy. Users submit tasks, which are routed to randomly selected pairs of LLM agents that navigate and interact with live websites via the BrowserUse library. This setup supports evaluation on tasks with ambiguous specifications: pairwise comparisons capture human preferences rather than relying on predetermined success criteria.
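
The paper's orchestration code is not included in this summary; the sketch below only illustrates the arena-style flow under stated assumptions: `run_browser_agent` is a hypothetical wrapper around a BrowserUse-style agent loop, and the model pool shown is illustrative.

```python
import random
from dataclasses import dataclass

# Illustrative model pool; the models actually hosted by BrowserArena may differ.
MODELS = ["o4-mini", "deepseek-r1", "gemini-2.5-pro", "claude-sonnet-4", "gpt-4o"]

@dataclass
class AgentRun:
    model: str
    trace: list[str]     # step-level action log later shown to annotators
    final_answer: str

def run_browser_agent(model: str, task: str) -> AgentRun:
    """Hypothetical wrapper that drives a BrowserUse-style agent on live websites."""
    raise NotImplementedError("placeholder: plug in a real browser agent here")

def head_to_head(task: str) -> tuple[AgentRun, AgentRun]:
    """Sample two distinct models and run both on the same user-submitted task."""
    model_a, model_b = random.sample(MODELS, k=2)
    return run_browser_agent(model_a, task), run_browser_agent(model_b, task)

# A user then compares the two traces and answers, casting a vote of
# "A", "B", or "tie", optionally with step-level annotations on each trace.
```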

The authors analyze the user-submitted tasks and human preference data to construct a model leaderboard, and they find a notable gap in the ability of vision-language models (VLMs) used as judges to model human preferences accurately.
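
The summary does not state which rating scheme the leaderboard uses; a common choice for arena-style pairwise preference data is a Bradley-Terry fit, sketched below with ties counted as half-wins (both choices are assumptions, not the paper's stated method).

```python
import numpy as np

def bradley_terry(models, votes, iters=200):
    """Fit Bradley-Terry strengths from pairwise votes.

    votes: list of (model_a, model_b, outcome) with outcome in {"A", "B", "tie"}.
    Ties are counted as half a win for each side (an assumption).
    """
    idx = {m: i for i, m in enumerate(models)}
    n = len(models)
    wins = np.zeros((n, n))           # wins[i, j] = effective wins of i over j
    for a, b, outcome in votes:
        i, j = idx[a], idx[b]
        if outcome == "A":
            wins[i, j] += 1.0
        elif outcome == "B":
            wins[j, i] += 1.0
        else:                          # tie
            wins[i, j] += 0.5
            wins[j, i] += 0.5

    p = np.ones(n)                     # model strengths
    games = wins + wins.T              # total games between each pair
    for _ in range(iters):             # standard minorize-maximize update
        total_wins = wins.sum(axis=1)
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = total_wins / np.maximum(denom, 1e-12)
        p /= p.sum()
    return dict(zip(models, p))        # higher strength = higher leaderboard rank
```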

Experimental Evaluation

The platform evaluates five models, revealing variability in their performance across user-submitted tasks. The ranking based on user preferences shows that some models, such as the non-multimodal DeepSeek-R1, perform unexpectedly well. This indicates that while multimodal capabilities are beneficial, they are not the sole determinant of success on navigation tasks (Figure 2).

Figure 2: Average Win Rate across models.

Inter-annotator agreement is also assessed to measure consistency among human judges. Agreement is moderate overall and higher once tie votes are filtered out, suggesting that evaluators apply different thresholds for declaring a tie rather than judging the agents inconsistently.
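
The exact agreement statistic is not reproduced here; a minimal sketch, assuming Cohen's kappa over aligned votes with optional filtering of tie votes, might look like this:

```python
from sklearn.metrics import cohen_kappa_score

def pairwise_agreement(votes_1, votes_2, drop_ties=False):
    """Agreement between two annotators over the same head-to-head comparisons.

    votes_*: lists of "A" / "B" / "tie", aligned by comparison.
    drop_ties: if True, keep only comparisons where neither annotator voted "tie".
    """
    pairs = list(zip(votes_1, votes_2))
    if drop_ties:
        pairs = [(a, b) for a, b in pairs if a != "tie" and b != "tie"]
    a, b = zip(*pairs)
    return cohen_kappa_score(a, b)
```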

The study also highlights a substantial gap between human and VLM judgments and shows that multimodal inputs can hurt judge reliability: input ablations yield higher agreement when judges are given trace-only evaluations.

Identification of Failure Modes

Step-level annotations from users surface three prominent failure modes: CAPTCHA solving, pop-up banner closure, and direct navigation. The paper then constructs targeted datasets that reproduce these scenarios at high frequency to study variation in agent behavior (Figure 3).

Figure 3: Pairwise agreements between the baseline labels, the new annotators, and two vision-language models.
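
The dataset construction details are not reproduced in this summary; as a rough sketch, each targeted task could be recorded as a small structured entry (the field names below are illustrative, not the paper's schema):

```python
from dataclasses import dataclass, field

@dataclass
class FailureModeTask:
    """One entry in a targeted failure-mode dataset (illustrative schema)."""
    failure_mode: str          # "captcha" | "popup_banner" | "direct_navigation"
    task_prompt: str           # instruction given to the agent
    start_url: str             # page chosen to trigger the failure mode reliably
    success_criteria: str      # what a human annotator checks in the resulting trace
    annotations: list[dict] = field(default_factory=list)  # step-level labels
```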

CAPTCHA Solving: Models adopt varying strategies to circumvent CAPTCHAs, ranging from direct link navigation to the use of proxies and archives. Notably, o4-mini exhibits the widest strategic diversity, suggesting better adaptation to CAPTCHA constraints.
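
The paper's strategy taxonomy is not reproduced here; assuming traces are annotated with strategy labels (the label set below is hypothetical), per-model diversity can be summarized by the number of distinct strategies and the entropy of their distribution:

```python
from collections import Counter
from math import log2

def strategy_diversity(strategy_labels):
    """Summarize how varied a model's CAPTCHA-avoidance strategies are.

    strategy_labels: one label per annotated trace, e.g. "direct_link",
    "use_archive", "use_proxy", "give_up" (hypothetical label set).
    """
    counts = Counter(strategy_labels)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * log2(p) for p in probs)   # 0 when one strategy dominates
    return {"distinct_strategies": len(counts), "entropy_bits": entropy}

# A model that mixes four strategies scores higher on both measures than one
# that always retries the same CAPTCHA page.
```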

Pop-Up Banner Closure: The evaluation shows that multimodal capability is needed to reliably identify and close pop-up banners; models with vision inputs perform noticeably better than those without. Notably, the non-multimodal DeepSeek-R1 consistently misleads users about whether a banner has actually been closed.

Direct Navigation: Agents predominantly invoke Google Search for information retrieval rather than navigating directly to the target site, underscoring their reliance on search engines for knowledge-intensive tasks.
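
One minimal way to measure this tendency from step-level traces, assuming each step records the visited URL (an assumption about the trace format), is to classify whether an agent reaches a search engine before the target domain:

```python
from urllib.parse import urlparse

SEARCH_HOSTS = {"www.google.com", "www.bing.com", "duckduckgo.com"}

def classify_first_navigation(step_urls, target_domain):
    """Label a trace by how the agent first tries to reach the target site.

    step_urls: ordered URLs visited by the agent (assumed trace format).
    target_domain: e.g. "example.com" for the site named in the task.
    """
    for url in step_urls:
        host = urlparse(url).netloc.lower()
        if host in SEARCH_HOSTS:
            return "search_first"
        if host.endswith(target_domain):
            return "direct_navigation"
    return "never_reached"

def search_reliance_rate(traces, target_domain):
    """Fraction of traces where the agent routes through a search engine first."""
    labels = [classify_first_navigation(t, target_domain) for t in traces]
    return labels.count("search_first") / max(len(labels), 1)
```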

Conclusions

The introduction of BrowserArena marks a significant step forward in benchmarking methodology for LLM web agents. By focusing on user-submitted tasks and pairwise comparisons, it surfaces how agents behave in dynamic, real-world web scenarios. The study shows that multimodal capabilities substantially affect web agent performance, though they are not the sole determinant of success, and it highlights methodological refinements needed for future evaluations.

Limitations and Future Work

The platform's evaluations depend on the capabilities of the BrowserUse framework, so alternative agent configurations might yield different results. The identified failure modes are likewise system-specific and could shift under different setups, suggesting the need for further exploration across configurations before the findings can be treated as general.

Overall, BrowserArena lays the groundwork for future advances in LLM web agent evaluation, paving the way for more inclusive and realistic agent deployment frameworks.
