EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments (2506.08136v1)

Published 9 Jun 2025 in cs.CL

Abstract: We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple LLMs to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.

Analyzing EconWebArena: A Benchmark for Autonomous Agents in Economic Tasks

The paper "EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments" presents a substantial framework for evaluating autonomous agents tasked with navigating complex economic environments on the web. This benchmark, termed EconWebArena, encompasses a comprehensive suite of 360 tasks sourced from 82 authoritative websites, representing diverse domains such as macroeconomics, labor, finance, trade, and public policy. These tasks are meticulously curated to challenge agents to retrieve and interpret data from structured and visual content, requiring interaction with dynamic interfaces and the extraction of precise, time-sensitive data through multi-step workflows.

Key Contributions and Findings

The benchmark evaluates a range of multimodal LLMs as web agents, analyzing their performance and failure cases. The paper reports substantial performance gaps across models, highlighting challenges in grounding, navigation, and multimodal understanding. Agents such as o4-mini, Claude Sonnet 4, and Gemini 2.5 Flash reached varying success rates, with o4-mini averaging 46.9% task success compared with 93.3% for human evaluators. This disparity underscores the gap between current model capabilities and expert-level proficiency in economic web tasks.

An extensive error analysis of the o4-mini agent identifies common failure types, including access issues (25%), data extraction errors (25%), interaction failures (12.5%), navigation failures (23.4%), and visual understanding failures (14.1%). This classification provides insights into specific hurdles that autonomous agents face when interacting with economic data on real-world web platforms.
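As a rough illustration of how such an error breakdown can be tallied, the snippet below aggregates per-failure labels into percentages. The labels and counts are hypothetical, chosen only so the percentages mirror those reported above; the paper's analysis is a manual categorization of o4-mini's failures, not this script.

```python
from collections import Counter

# Hypothetical per-failure labels; counts are chosen only so the resulting
# percentages match the categories reported in the paper's error analysis.
failure_labels = (
    ["access"] * 16 + ["data_extraction"] * 16 + ["interaction"] * 8
    + ["navigation"] * 15 + ["visual_understanding"] * 9
)

counts = Counter(failure_labels)
total = sum(counts.values())
for category, count in counts.most_common():
    print(f"{category:22s} {100 * count / total:5.1f}%")
```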

Through ablation studies, the paper assesses how different agent configurations affect performance. In particular, structured observations, visual grounding, and coordinate awareness prove vital for tasks that require interaction with complex web interfaces, as sketched below.
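To illustrate what structured observations with visual grounding and coordinate awareness can mean in practice, here is a minimal, hypothetical sketch of how an agent's per-step observation might be assembled. The class, function names, and the `page` handle are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """Hypothetical observation passed to the web agent at each step."""
    accessibility_tree: str                   # structured view of the page (DOM/AX tree)
    screenshot_png: bytes | None = None       # visual grounding, if enabled
    element_boxes: dict[str, tuple[int, int, int, int]] = field(default_factory=dict)
    # element_boxes maps element ids to (x, y, width, height) pixel boxes,
    # giving the model coordinate awareness for click and scroll actions.

def build_observation(page, use_visual_grounding: bool) -> Observation:
    """Assemble an observation; `page` stands in for a browser-automation handle."""
    obs = Observation(accessibility_tree=page.get_accessibility_tree())
    if use_visual_grounding:
        obs.screenshot_png = page.screenshot()
        obs.element_boxes = page.get_element_bounding_boxes()
    return obs
```

An ablation of the kind the paper describes would compare agent success with `use_visual_grounding` on and off, and with or without the coordinate information in `element_boxes`.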

Implications and Future Developments

The work posits EconWebArena as an essential testbed for advancing domain-specific web intelligence. The benchmark's realism and complexity make it a valuable asset for developing agents capable of high-fidelity economic data acquisition. The paper suggests future directions, such as improved multimodal reasoning, robust visual grounding, global planning, and hybrid methods integrating GUI-based exploration with structured API calls.
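One way to read the suggestion of hybrid GUI-plus-API methods is as a router that prefers a documented data API when one is known for the target source and falls back to browser interaction otherwise. The sketch below is a hypothetical illustration of that control flow, not a method from the paper; the `query` and `run` interfaces are assumptions.

```python
def answer_task(task, api_clients, browser_agent):
    """Hypothetical hybrid strategy: try a structured API first, then the GUI.

    `api_clients` maps a data-source name to a client exposing `query(question)`;
    `browser_agent` is any GUI-based web agent exposing `run(task)`.
    Both interfaces are assumptions for illustration.
    """
    client = api_clients.get(task["source"])
    if client is not None:
        try:
            return client.query(task["question"])   # structured, precise retrieval
        except Exception:
            pass                                    # fall back to GUI exploration
    return browser_agent.run(task)                  # multi-step web navigation
```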

As agents evolve, these insights could guide models that narrow the gap between current capabilities and human proficiency. Improvements in these areas would contribute to AI systems that can reliably perform autonomous economic reasoning and data extraction.

In conclusion, "EconWebArena" offers a rigorous evaluation framework that challenges autonomous agents to holistically integrate domain-specific reasoning with practical web-based interactions. The benchmark serves as a catalyst for future advancements in AI-driven economic research, providing a foundational step towards the development of agents that can efficiently navigate and interpret complex web environments.

Authors (2)
  1. Zefang Liu (13 papers)
  2. Yinzhu Quan (7 papers)