- The paper introduces AppWorld as a benchmark framework featuring 9 realistic apps and 457 APIs to evaluate interactive coding agents on complex digital tasks.
- It details a robust methodology including a stateful execution shell, procedural generation of 750 tasks, and state-based assertions to assess agent performance.
- Evaluation shows that even advanced agents such as GPT-4-based ReAct struggle with API understanding and state management, indicating significant room for improvement.
Overview of AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
Introduction
The paper introduces AppWorld, a framework designed to benchmark the capabilities of interactive coding agents on complex day-to-day digital tasks. AppWorld provides a fully controllable and reproducible environment comprising nine realistic applications, in which agents perform tasks through 457 APIs. The framework aims to remedy the shortcomings of existing benchmarks, which fail to capture the complexity and interaction requirements of real-world digital environments.
Components of AppWorld
AppWorld Engine
The AppWorld engine consists of several key components:
- Applications: Nine apps simulating real-world services (e.g., Gmail, Venmo, Amazon) were developed with the FastAPI library. Each API closely mimics its real-world counterpart, supporting various operations with thorough documentation (a minimal sketch of one such app follows this list).
- Execution Shell: The shell allows agents to write and execute code statefully, providing error traces and safe code execution, and supports both direct function calls and REST API interactions (see the stateful-shell sketch after this list).
- Database: The database underlying AppWorld, termed 'Base DB,' is populated with data representing 100 fictitious users. This data simulates the digital activities and relationships among these users, enabling the benchmarking of realistic tasks.
- Documentation: Detailed API documentation includes descriptions of API functions, parameters, and output schemas, ensuring that agents can understand and interact with the APIs effectively.
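To make the Applications component concrete, here is a minimal sketch of how one of the simulated apps could be structured with FastAPI. The Venmo-like endpoint, request model, and in-memory balance table below are illustrative assumptions, not AppWorld's actual code.

```python
# Illustrative sketch only: a Venmo-like payment endpoint in the style the
# paper describes (FastAPI apps mimicking real services). Names and schemas
# here are assumptions for demonstration.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="venmo")  # hypothetical stand-in for one of the nine apps

# In-memory stand-in for the app's table of account balances.
balances = {"alice": 120.0, "bob": 35.0}

class Payment(BaseModel):
    sender: str
    receiver: str
    amount: float
    description: str = ""

@app.post("/payments")
def send_money(payment: Payment) -> dict:
    """Transfer money between two users, mimicking a real payment API."""
    if payment.sender not in balances or payment.receiver not in balances:
        raise HTTPException(status_code=404, detail="Unknown user.")
    if balances[payment.sender] < payment.amount:
        raise HTTPException(status_code=422, detail="Insufficient balance.")
    balances[payment.sender] -= payment.amount
    balances[payment.receiver] += payment.amount
    return {"status": "success", "new_balance": balances[payment.sender]}
```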
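The stateful execution shell can be pictured as running each code snippet in a single persistent namespace, returning printed output or an error trace instead of crashing the session. This is a simplified sketch of that idea, not the benchmark's implementation.

```python
# Simplified sketch of a stateful execution shell: every snippet runs in the
# same namespace, so variables defined earlier stay available later, and
# exceptions come back as traces rather than terminating the session.
import io
import traceback
from contextlib import redirect_stdout

class StatefulShell:
    def __init__(self):
        self.namespace = {}  # persists across execute() calls

    def execute(self, code: str) -> str:
        buffer = io.StringIO()
        try:
            with redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            return buffer.getvalue() + traceback.format_exc()
        return buffer.getvalue()

shell = StatefulShell()
shell.execute("x = 2 + 3")                   # defines x in the shared state
print(shell.execute("print(x * 10)"))        # -> 50, state carried over
print(shell.execute("print(undefined)"))     # -> NameError trace, session survives
```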
Benchmark Design
Task Generators
The benchmark provides 750 complex tasks spanning everyday scenarios, each requiring agents to perform API-based operations through robust code generation. Task scenarios are created procedurally: their setups ensure tasks are well-defined, include distractors and hurdles, and form contrast sets for rigorous generalization evaluation (a sketch of such a generator follows).
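One way to picture procedural task generation is a scenario template instantiated with sampled parameters to yield many well-defined variants, including near-duplicate "contrast" variants. The template string and field names below are illustrative assumptions, not the paper's actual generators.

```python
# Illustrative sketch of procedural task generation: a scenario template is
# filled with sampled parameters to produce many concrete task variants; the
# recorded parameters can later be used to construct ground-truth end states.
import random

SCENARIO = ("Pay {payee} back {amount} dollars on Venmo for your share of "
            "the {occasion} bill.")

def generate_tasks(num_tasks: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    payees = ["your roommate", "your sister", "your coworker"]
    occasions = ["dinner", "electricity", "road-trip gas"]
    tasks = []
    for _ in range(num_tasks):
        params = {
            "payee": rng.choice(payees),
            "amount": rng.randrange(5, 100),
            "occasion": rng.choice(occasions),
        }
        tasks.append({
            "instruction": SCENARIO.format(**params),
            "parameters": params,  # kept for building the evaluation checks
        })
    return tasks

for task in generate_tasks(3):
    print(task["instruction"])
```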
Evaluation Suite
The key innovation in evaluation is its programmatic, robust design built on state-based assertions: an evaluation checks whether the final database state matches one of a set of valid goal states and verifies that the task was completed without unintended side effects (collateral damage). A sketch of this idea follows.
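The spirit of state-based evaluation can be illustrated as a diff over database snapshots: assert that the intended changes happened and that nothing else changed. The flat snapshot format and helper names here are assumptions made for illustration.

```python
# Illustrative sketch of state-based evaluation: compare snapshots of the
# database taken before and after the agent runs, require the intended
# changes, and require that no other entries were touched (no collateral damage).
def diff_states(before: dict, after: dict) -> dict:
    """Return the keys whose values differ between two snapshots."""
    keys = set(before) | set(after)
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}

def evaluate(before: dict, after: dict, expected: dict) -> bool:
    changes = diff_states(before, after)
    # Every expected key must end up with the expected final value...
    if any(after.get(key) != value for key, value in expected.items()):
        return False
    # ...and nothing outside the expected changes may differ.
    return set(changes) <= set(expected)

before = {"venmo.balance.alice": 120.0, "venmo.balance.bob": 35.0}
after = {"venmo.balance.alice": 100.0, "venmo.balance.bob": 55.0}
print(evaluate(before, after, {"venmo.balance.alice": 100.0,
                               "venmo.balance.bob": 55.0}))  # True
```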
Results and Analysis
Several state-of-the-art models were evaluated using AppWorld:
- Agent designs such as ReAct, Plan-and-Execute, and FullCodeRefl showed varying effectiveness, with GPT-4-based ReAct achieving the highest Task Goal Completion (TGC) score of approximately 48.8% on the normal test set.
- The benchmark revealed significant room for improvement, especially on the challenge test set, where TGC scores dropped considerably; the best score was 30.2%.
- Analysis indicated that issues such as inadequate API understanding and improper interaction with the environment frequently led to failures.
Observations and Future Directions
The benchmark results underline fundamental challenges for interactive coding agents:
- Interaction Requirements: Many tasks pose strong interaction requirements, where agents must adapt based on intermediate outputs (see the agent-loop sketch after this list).
- Robust Code Generation: Creating robust and adaptive code remains a major obstacle. Models often failed due to misunderstanding API calls or generating erroneous code.
- State Management: Agents struggled with maintaining the task state and adapting to dynamic changes within the digital environment.
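These interaction requirements are why ReAct-style agents, which alternate between generating a code action and observing its execution result, serve as the natural baseline. Below is a simplified sketch of such a loop; `llm_generate` and the `task_completed()` stop signal are placeholders assumed for illustration, not the paper's prompts or agent code.

```python
# Simplified sketch of a ReAct-style interaction loop: the agent repeatedly
# generates a code action, runs it in the stateful shell (see the sketch
# above), and appends the observation so later actions can adapt to it.
def llm_generate(prompt: str) -> str:
    """Placeholder for a call to any language model (assumption)."""
    raise NotImplementedError

def run_agent(task: str, shell, max_steps: int = 15) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        code = llm_generate(transcript + "\nNext code to execute:")
        if "task_completed()" in code:        # agent signals it is finished
            break
        observation = shell.execute(code)     # stateful execution with traces
        transcript += f"\nCode:\n{code}\nOutput:\n{observation}\n"
    return transcript
```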
Implications
AppWorld has significant implications for advancing interactive coding agents:
- Theoretical Contributions: The development of sophisticated benchmarks like AppWorld provides a clearer understanding of the limitations of current models in real-world applications, guiding future research directions.
- Practical Contributions: In practical terms, the framework can enable the development of more reliable autonomous agents capable of handling complex, real-world digital tasks with a higher degree of safety and accuracy.
Conclusion
AppWorld addresses a critical gap in the current landscape of AI benchmarking by providing a realistic, complex environment for evaluating interactive coding agents. The rigorous evaluation framework, along with the diverse and challenging task set, pushes the boundaries of what these agents can achieve. Future work will likely build on this foundation, leading to autonomous agents that can efficiently and safely manage a wide array of digital tasks in real-world applications.