REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites (2504.11543v2)

Published 15 Apr 2025 in cs.AI

Abstract: We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier LLMs achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation, marking a significant step forward in evaluating and advancing agent capabilities.

REAL: An Evaluation and Benchmark Framework for Autonomous Web Agents

The paper "REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites" introduces a comprehensive framework, REAL, aimed at addressing the challenges faced by LLMs in performing web-based tasks. While LLMs showcase potential capabilities in reasoning and planning, their real-world application, particularly in automating complex web interactions, is notably limited. The REAL framework serves as a benchmark to evaluate and enhance these capabilities by providing high-fidelity deterministic replicas of eleven commonly used web platforms spanning e-commerce, travel, and social networking domains, among others.

Core Contributions and Methodology

REAL offers a unique methodology that combines deterministic web simulations with a robust evaluation framework, providing a level playing field for testing the efficiency, reliability, and safety of autonomous web agents. The framework comprises several features that are pivotal to the study of LLMs in a web context:

  1. Deterministic Environments: The framework's high-fidelity simulations are crucial for consistent evaluation, mitigating the limitations posed by the ever-changing nature of real-world web platforms. These controlled environments make evaluations reproducible, letting researchers rigorously assess an agent's capacity to carry out complex, multi-turn interactions that cannot be reliably replicated on dynamic live sites.
  2. Extensive Task Evaluation: REAL includes 112 tasks inspired by everyday web use. The tasks are diverse, covering both information retrieval and state-changing activities, so an agent must not only retrieve accurate data but also enact changes such as completing transactions or scheduling appointments (a minimal sketch of how such tasks might be scored follows this list). The complexity and diversity of these tasks ensure a thorough evaluation of an LLM's web navigation capabilities.
  3. Flexible Testing Harness: By supporting both open-source and proprietary systems, REAL's evaluation harness accommodates varying agent architectures, including black-box commands issued within browser environments. This allows research teams to test their agents without significant modification, mirroring how such systems operate in practice.
  4. Benchmarking via Leaderboard: REAL provides a leaderboard, an accessible and centralized platform for comparison. This allows the global research community to measure systems against one another, fostering transparency and ongoing improvement.
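
To make the dual evaluation strategy concrete, the following is a minimal, hypothetical Python sketch of how an action task (verified by a programmatic check of the final website state) and an information-retrieval task (scored by a rubric-guided LLM judge) might be represented and scored. The names `Task`, `state_check`, `rubric`, and `judge_retrieval` are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of a REAL-style task evaluation. All names here are
# illustrative assumptions about how such a harness could be organized;
# they are not the paper's actual API.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Task:
    """One benchmark task: either state-changing (action) or information retrieval."""
    task_id: str
    goal: str                                             # natural-language instruction given to the agent
    state_check: Optional[Callable[[dict], bool]] = None  # programmatic check of the final site state
    rubric: Optional[str] = None                          # rubric for LLM-judged retrieval tasks

def judge_retrieval(answer: str, rubric: str) -> bool:
    """Placeholder for a rubric-guided LLM judgment: a real harness would prompt
    a judge model with the rubric and the agent's answer."""
    raise NotImplementedError("plug in an LLM judge here")

def evaluate(task: Task, final_state: dict, agent_answer: str = "") -> bool:
    """Score one run: a deterministic state check for action tasks,
    a rubric-guided judgment for information-retrieval tasks."""
    if task.state_check is not None:
        return task.state_check(final_state)
    if task.rubric is not None:
        return judge_retrieval(agent_answer, task.rubric)
    raise ValueError(f"Task {task.task_id} defines no evaluation criterion")

# Example: an action task on a simulated e-commerce site. The final_state dict
# stands in for whatever snapshot of the simulated site the harness exposes.
order_task = Task(
    task_id="shop-001",
    goal="Add the cheapest USB-C cable to the cart and check out.",
    state_check=lambda s: bool(s.get("order_placed")) and s.get("item") == "usb-c-cable",
)

print(evaluate(order_task, {"order_placed": True, "item": "usb-c-cable"}))  # -> True
```

In the actual framework, the state check would inspect the simulated site after the agent's run and the judge would be an LLM prompted with the rubric; the stub above only fixes the structure of the two evaluation paths.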

Empirical Evaluation Results

The paper reveals significant disparities in agent performance: the best result under the framework's rigorous conditions is a 41.07% success rate, achieved by Claude-3.7-Sonnet, while other systems such as GPT-4 score lower. These results underscore critical gaps in current agent capabilities, especially in managing error states and negotiating complex workflows.

Implications and Future Research Directions

The introduction of REAL offers significant implications both practically and theoretically:

  • Practical Implications: For practitioners, REAL emphasizes the scalability and the economic potential of automating web tasks. It highlights areas where improvements in LLM capabilities could drive significant advancements in automation, impacting various sectors like e-commerce, communications, and more.
  • Theoretical Implications: From a theoretical standpoint, the framework catalyzes the refinement of LLMs for real-world applications by providing detailed feedback on task performance and aiding in the identification of shortcomings that must be addressed.
  • Future Developments in AI: As REAL enables evaluation and training under safe, reproducible conditions, it is likely to accelerate advancements in reinforcement learning and the development of improved LLM scaffolds that leverage deep learning techniques for multimodal interactions, better planning, and decision-making.

In conclusion, the REAL framework is poised to act as a pivotal tool for the advancement of autonomous web agents, providing researchers and developers with the infrastructure to push boundaries in applying AI to complex, real-world tasks. The introduction of this benchmark is a step towards bridging the gap between theoretical LLM capabilities and their practical applications in everyday digital interactions.

Authors (18)
  1. Divyansh Garg (12 papers)
  2. Shaun VanWeelden (1 paper)
  3. Diego Caples (2 papers)
  4. Andis Draguns (8 papers)
  5. Nikil Ravi (2 papers)
  6. Pranav Putta (2 papers)
  7. Naman Garg (4 papers)
  8. Tomas Abraham (1 paper)
  9. Michael Lara (1 paper)
  10. Federico Lopez (2 papers)
  11. James Liu (7 papers)
  12. Atharva Gundawar (7 papers)
  13. Prannay Hebbar (1 paper)
  14. Youngchul Joo (1 paper)
  15. Charles London (6 papers)
  16. Christian Schroeder de Witt (49 papers)
  17. Sumeet Motwani (4 papers)
  18. Jindong Gu (101 papers)