Beyond Browsing: API-Based Web Agents (2410.16464v3)

Published 21 Oct 2024 in cs.CL and cs.MA

Abstract: Web browsers are a portal to the internet, where much of human activity is undertaken. Thus, there has been significant research work in AI agents that interact with the internet through web browsing. However, there is also another interface designed specifically for machine interaction with online content: application programming interfaces (APIs). In this paper we ask -- what if we were to take tasks traditionally tackled by Browsing Agents, and give AI agents access to APIs? To do so, we propose two varieties of agents: (1) an API-calling agent that attempts to perform online tasks through APIs only, similar to traditional coding agents, and (2) a Hybrid Agent that can interact with online data through both web browsing and APIs. In experiments on WebArena, a widely-used and realistic benchmark for web navigation tasks, we find that API-Based Agents outperform web Browsing Agents. Hybrid Agents out-perform both others nearly uniformly across tasks, resulting in a more than 24.0% absolute improvement over web browsing alone, achieving a success rate of 38.9%, the SOTA performance among task-agnostic agents. These results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.


Summary

The paper demonstrates that API-based web agents significantly improve task performance compared to browsing-only methods.
It introduces a Hybrid Agent that combines API calls with traditional browsing, improving the success rate by more than 24 percentage points over browsing-only agents.
The study highlights the importance of well-documented APIs and adaptive agent design for enhancing efficiency in real-world web environments.

Analyzing "Beyond Browsing: API-Based Web Agents"

The paper "Beyond Browsing: API-Based Web Agents" studies web agents that call application programming interfaces (APIs) instead of, or in addition to, driving a browser. Because APIs are designed for machine interaction, this shift lets agents perform web tasks more efficiently, particularly on sites with extensive API support.

Overview and Contributions

The researchers propose two types of agents: an API-based agent that uses APIs exclusively and a Hybrid Agent that combines API calls with traditional web browsing. On the WebArena benchmark, API-based agents outperform browsing-only agents. The Hybrid Agent outperforms both, improving on browsing-only agents by more than 24 percentage points (absolute) and reaching a success rate of 38.9%, the state of the art among task-agnostic agents.

Experimental Setup

The research evaluates agents on WebArena, a benchmark of realistic web tasks spanning sites such as Gitlab and Reddit. The paper analyzes API availability and quality for each site and rates the API support as good, medium, or poor, finding that comprehensive, well-documented APIs significantly improve agent performance.

Strong Numerical Results

Key findings indicate that API-based agents consistently outperform browsing agents, especially on platforms with robust API support such as Gitlab, which features 988 endpoints with comprehensive documentation. Conversely, platforms like Reddit, with minimal API support, demonstrate the necessity for the Hybrid approach, where traditional browsing complements limited API functionality.

Hybrid Agent: A Superior Approach

The Hybrid Agent switches dynamically between API calls and web browsing, which addresses the main limitation of API-only agents: tasks that the available APIs do not cover. For example, it performs well on complex tasks in Shopping Admin by using both modalities, while the API agent succeeds on structured data retrieval tasks on Gitlab. This flexibility accounts for the Hybrid Agent's stronger performance across varied web tasks.
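The switching behaviour described above can be illustrated with a minimal sketch. This is not the paper's implementation: the action types, the keyword-matching policy (a stand-in for the language model's decision), and the endpoint name `list_issues` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Union


@dataclass
class ApiCall:
    """Hypothetical action: invoke a documented API endpoint."""
    endpoint: str
    params: Dict[str, str] = field(default_factory=dict)


@dataclass
class BrowserAction:
    """Hypothetical action: a browser step such as navigate or click."""
    kind: str
    target: str


Action = Union[ApiCall, BrowserAction]


def choose_action(task: str, api_docs: Dict[str, str]) -> Action:
    """Toy policy standing in for the LLM.

    api_docs maps an endpoint name to a keyword describing what it covers.
    If a documented endpoint matches the task, prefer the API; otherwise
    fall back to browsing.
    """
    lowered = task.lower()
    for endpoint, keyword in api_docs.items():
        if keyword in lowered:
            return ApiCall(endpoint=endpoint)
    return BrowserAction(kind="navigate", target=task)


def run_step(
    task: str,
    api_docs: Dict[str, str],
    call_api: Callable[[ApiCall], str],
    browse: Callable[[BrowserAction], str],
) -> str:
    """Dispatch one step to whichever modality the policy selected."""
    action = choose_action(task, api_docs)
    if isinstance(action, ApiCall):
        return call_api(action)
    return browse(action)


if __name__ == "__main__":
    docs = {"list_issues": "issues"}  # assumed, illustrative API coverage
    for t in ["List open issues", "Upvote the top post"]:
        print(t, "->", choose_action(t, docs))
```

In a real agent the policy would be a model prompted with both the API documentation and the current page observation; the sketch only shows the dispatch structure, in which browsing acts as the fallback when API coverage is missing.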

Implications and Future Directions

The implications of this work are substantial for the development of web agents:

Practical Implications: The adoption of API-based interactions in web agents shows promise for enhancing efficiency and accuracy, particularly in environments with rich API landscapes. This holds potential for industrial applications requiring complex web interactions.
Theoretical Implications: The research demonstrates how hybrid models can harness the strengths of both APIs and browsing to handle a wider range of tasks. It suggests a move towards more adaptive and context-aware agent architectures.
Future Speculation: Future developments could involve automating API discovery and generation, expanding the applicability of API-based agents. Techniques such as Agent Workflow Memory could be explored to further enhance flexibility and efficiency.

Conclusion

"Beyond Browsing: API-Based Web Agents" makes a compelling case for API-centric approaches in web agents. By showing that API and Hybrid Agents handle complex tasks better than browsing-only agents, the research sets a meaningful direction for improving web agent capabilities. The findings point toward future work on automated API handling and adaptive agent systems.


