
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents (2410.13825v1)

Published 17 Oct 2024 in cs.AI and cs.CL

Abstract: Autonomy via agents using LLMs for personalized, standardized tasks boosts human efficiency. Automating web tasks (like booking hotels within a budget) is increasingly sought after. Fulfilling practical needs, the web agent also serves as an important proof-of-concept example for various agent grounding scenarios, with its success promising advancements in many future applications. Prior research often handcrafts web agent strategies (e.g., prompting templates, multi-agent systems, search methods, etc.) and the corresponding in-context examples, which may not generalize well across all real-world scenarios. On the other hand, there has been limited study on the misalignment between a web agent's observation/action representation and the pre-training data of the LLM it's based on. This discrepancy is especially notable when LLMs are primarily trained for language completion rather than tasks involving embodied navigation actions and symbolic web elements. Our study enhances an LLM-based web agent by simply refining its observation and action space to better align with the LLM's capabilities. This approach enables our base agent to significantly outperform previous methods on a wide variety of web tasks. Specifically, on WebArena, a benchmark featuring general-purpose web interaction tasks, our agent AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively, and boosts the success rate by 26.6 points (+161%) over similar plain web agents with its observation and action space alignment. We achieve this without using in-context examples, new agent roles, online feedback or search strategies. AgentOccam's simple design highlights LLMs' impressive zero-shot performance on web tasks, and underlines the critical role of carefully tuning observation and action spaces for LLM-based agents.

Overview of AgentOccam: A Baseline for LLM-Based Web Agents

This paper introduces AgentOccam, a simple baseline methodology for web agents built on LLMs. Recognizing the challenges LLMs face in executing web-based tasks, the paper emphasizes aligning an agent's observation and action spaces with the capabilities LLMs acquire during pre-training, and proposes practical strategies for refining those spaces to improve task performance.

Key Contributions and Methodology

The cornerstone of AgentOccam lies in its streamlined design, deliberately eschewing the deployment of complex agentic strategies such as in-context examples, new roles, or online search methodologies. Instead, AgentOccam focuses on refining observation and action spaces suitable for LLM utilization:

  1. Action Space Simplification: The action space is critically analyzed to remove non-contributory actions and those demanding robust embodiment understanding. Actions like hover and press are simplified or consolidated. Additionally, new supportive actions like note and stop are introduced to enable better memory management and decision points.
  2. Observation Space Refinement: The observation space is optimized by restructuring web content, eliminating redundancy, and emphasizing crucial elements through actionable rules in the web data hierarchy, enhancing the perception capabilities of LLMs.
  3. Planning Actions: The paper introduces planning actions, branch and prune, that allow the agent to autonomously generate, navigate, and discard execution plans, effectively managing task decomposition and workflow in dynamic web environments.
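The three refinements above can be sketched together in a short Python snippet. This is an illustrative assumption of how a reduced action set and a branch/prune plan tree might be organized, not the authors' implementation; all names (WEB_ACTIONS, PlanNode, Planner) are hypothetical.

```python
# Sketch of an AgentOccam-style reduced action space plus branch/prune
# planning. Structure and names are illustrative, not from the paper's code.
from dataclasses import dataclass, field

# Reduced web-action set: embodiment-heavy actions (hover, press, drag, ...)
# are dropped or folded into click/type; note and stop support memory
# management and explicit termination.
WEB_ACTIONS = {"click", "type", "scroll", "goto", "go_back", "note", "stop"}
PLANNING_ACTIONS = {"branch", "prune"}

@dataclass
class PlanNode:
    """One node in the agent's self-generated task-decomposition tree."""
    objective: str
    parent: "PlanNode | None" = None
    children: list = field(default_factory=list)
    abandoned: bool = False

class Planner:
    def __init__(self, root_objective: str):
        self.root = PlanNode(root_objective)
        self.current = self.root

    def branch(self, sub_objectives):
        """Decompose the current objective into sub-plans and descend."""
        self.current.children = [
            PlanNode(o, parent=self.current) for o in sub_objectives
        ]
        self.current = self.current.children[0]

    def prune(self):
        """Abandon the current sub-plan and move to the next open sibling,
        falling back to the parent objective if none remain."""
        self.current.abandoned = True
        parent = self.current.parent
        if parent is None:
            return  # cannot prune the root objective
        for child in parent.children:
            if not child.abandoned:
                self.current = child
                return
        self.current = parent

def is_valid(action: str) -> bool:
    """Reject any action outside the simplified space before execution."""
    return action in WEB_ACTIONS or action in PLANNING_ACTIONS
```

Under this sketch, the LLM's output is constrained to the small, language-like action vocabulary, while branch and prune let it restructure its own plan tree mid-task instead of relying on an external search or multi-agent controller.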

Empirical Results

AgentOccam demonstrates superior performance on the WebArena benchmark, reaching a 43.1% success rate and surpassing the previous state-of-the-art by 9.8 absolute points (+29.4%) and concurrent work (37.2%) by 5.9 points (+15.8%). These gains underscore the efficacy of aligning observation and action spaces with the LLM's pre-trained capabilities, achieved entirely zero-shot.

Theoretical Implications

The insights from AgentOccam underscore how critical it is to align an agent's machine-facing representations with real-world task requirements when working within LLM paradigms. By refining action and observation spaces alone, the paper shows that LLMs can adapt to web tasks and exhibit strong reasoning without extensive retraining or architectural overhauls.

Practical Implications

Practically, AgentOccam represents a pivotal step toward using LLMs in automating real-world web interactions, such as online shopping or database management, without requiring domain-specific adjustments. This aligns with the broader objective of harnessing LLMs for efficiency enhancements in repetitive and predictable web tasks.

Future Directions

Looking forward, integrating validated observation-action alignment strategies with further agentic improvements, such as role specialization or dynamic multi-agent interactions, could substantially enhance both task execution quality and scope. Moreover, examining how such methodologies scale across increasingly complex and varied web environments offers avenues for future research.

AgentOccam not only sets a new baseline for LLM-based web agents but also illuminates a pragmatic pathway for leveraging foundation-model capabilities in practical automated settings.

Authors (7)
  1. Ke Yang (152 papers)
  2. Yao Liu (116 papers)
  3. Sapana Chaudhary (8 papers)
  4. Rasool Fakoor (26 papers)
  5. Pratik Chaudhari (75 papers)
  6. George Karypis (110 papers)
  7. Huzefa Rangwala (57 papers)