SteP: Stacked LLM Policies for Web Actions (2310.03720v4)

Published 5 Oct 2023 in cs.LG

Abstract: Performing tasks on the web presents fundamental challenges to LLMs, including combinatorially large open-world tasks and variations across web interfaces. Simply specifying a large prompt to handle all possible behaviors and states is extremely complex, and results in behavior leaks between unrelated behaviors. Decomposition to distinct policies can address this challenge, but requires carefully handing off control between policies. We propose Stacked LLM Policies for Web Actions (SteP), an approach to dynamically compose policies to solve a diverse set of web tasks. SteP defines a Markov Decision Process where the state is a stack of policies representing the control state, i.e., the chain of policy calls. Unlike traditional methods that are restricted to static hierarchies, SteP enables dynamic control that adapts to the complexity of the task. We evaluate SteP against multiple baselines and web environments including WebArena, MiniWoB++, and a CRM. On WebArena, SteP improves (14.9% to 33.5%) over SOTA that use GPT-4 policies, while on MiniWoB++, SteP is competitive with prior works while using significantly less data. Our code and data are available at https://asappresearch.github.io/webagents-step.

Summary

The paper introduces the SteP framework, a dynamic, stack-based policy model for executing complex web tasks.
It reports a WebArena task success rate of up to 36.8%, against 23% for the best baseline, while using about 2.3× fewer tokens than flat baselines.
The modular, hierarchical design offers practical insights for scalable, adaptable AI agents in diverse real-world web environments.

Overview of "SteP: Stacked LLM Policies for Web Actions"

The paper presents SteP, a framework that dynamically composes LLM policies to solve complex web tasks. SteP is formulated as a Markov Decision Process (MDP) whose state is a stack of policies representing the control state, i.e., the chain of policy calls currently in progress. The result is an adaptive, hierarchical decision process that can cope with the complexity of web tasks and the variation across web interfaces.

Key Methodological Contributions

SteP departs from static policy hierarchies by introducing a dynamic, stack-based control mechanism in which a policy can invoke other policies, including itself recursively. This lets the depth of decomposition adapt to the task, so LLMs can handle long-horizon tasks and a wide range of web tasks without suffering from context drift or information overload. Concretely, SteP draws on a library of distinct policies, each specialized for a particular sub-task, which together cover navigating and acting on web interfaces.

The main innovation lies in the stack architecture of the decision-making model, which permits both hierarchical and peer policy interactions, thus combining the benefits of specialization and generalization. In addition to typical web interaction actions, the framework extends the action space with policy invocation and termination, facilitating modular task decomposition and management.
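
The stack-based control flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the policies below are hand-written Python functions standing in for LLM calls, the environment is a stub, and every name (`Action`, `run`, `find_order_status`, `search_order`, `fake_env`) is hypothetical. It only shows how an action space extended with policy invocation ("call") and termination ("return") can drive a stack of policies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Action:
    kind: str          # "web", "call", or "return"
    payload: str = ""  # web command, policy name to call, or return value
    arg: str = ""      # instruction handed to a called policy


# A policy maps (instruction, observation, value returned by a child) to an Action.
# In SteP this role is played by an LLM prompt; here it is a plain function.
Policy = Callable[[str, str, Optional[str]], Action]


def run(library: Dict[str, Policy], root: str, instruction: str,
        env_step: Callable[[str], str], max_steps: int = 20):
    """Run the policy on top of the stack until the stack is empty."""
    stack: List[Tuple[str, str]] = [(root, instruction)]
    observation, returned = "", None
    trace: List[str] = []
    for _ in range(max_steps):
        if not stack:
            break
        name, instr = stack[-1]            # only the top policy is active
        action = library[name](instr, observation, returned)
        returned = None
        if action.kind == "call":          # push: hand control to a sub-policy
            stack.append((action.payload, action.arg))
        elif action.kind == "return":      # pop: hand control back to the caller
            stack.pop()
            returned = action.payload
        else:                              # ordinary web action
            observation = env_step(action.payload)
        trace.append(f"{name}:{action.kind}:{action.payload}")
    return returned, trace


def find_order_status(instr, obs, returned):
    # Hypothetical high-level policy: delegate the lookup, then report it.
    if returned is None:
        return Action("call", "search_order", instr)
    return Action("return", f"status={returned}")


def search_order(instr, obs, returned):
    # Hypothetical low-level policy: issue one web action, then return what it saw.
    if obs == "":
        return Action("web", f"type [search] {instr}")
    return Action("return", obs)


def fake_env(command: str) -> str:
    # Stub environment: every web action yields the same observation.
    return "shipped"


library = {"find_order_status": find_order_status, "search_order": search_order}
result, trace = run(library, "find_order_status", "order 42", fake_env)
print(result)
print(trace)
```

In this toy run the root policy pushes `search_order`, which issues one web action and returns its observation; the root then returns the final answer and the stack empties. Because calls and returns are just actions, the depth of the stack is decided at run time rather than fixed in advance.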

Empirical Evaluations

The authors evaluate SteP against several baselines, including single-policy methods and state-of-the-art LLM-based systems, on WebArena, MiniWoB++, and a custom Airline CRM simulator. On WebArena, SteP reaches a task success rate of up to 36.8%, compared with 23% for the best baseline. SteP also uses approximately 2.3 times fewer tokens per scenario than flat baselines, which lowers computation cost and speeds up inference.
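
To make the reported numbers easier to compare, the following sketch separates percentage-point gains from relative gains, using only figures quoted on this page (36.8% vs. 23% in this overview, 14.9% vs. 33.5% in the abstract). The helper names are illustrative, and the token calculation assumes that "2.3 times fewer" means the flat baseline uses 2.3 times as many tokens as SteP.

```python
def point_gain(new_pct: float, base_pct: float) -> float:
    """Absolute improvement in percentage points."""
    return round(new_pct - base_pct, 1)


def relative_gain(new_pct: float, base_pct: float) -> float:
    """Improvement relative to the baseline, in percent."""
    return round(100 * (new_pct - base_pct) / base_pct, 1)


# Figures quoted in this overview: 36.8% for SteP, 23% for the best baseline.
summary_points = point_gain(36.8, 23.0)
summary_relative = relative_gain(36.8, 23.0)

# Figures quoted in the abstract: 14.9% for the baseline, 33.5% for SteP.
abstract_points = point_gain(33.5, 14.9)
abstract_relative = relative_gain(33.5, 14.9)

# Share of the flat baseline's tokens that SteP would use under the
# assumption stated above (baseline = 2.3 x SteP).
token_share = round(100 / 2.3, 1)

print(summary_points, summary_relative)
print(abstract_points, abstract_relative)
print(token_share)
```

With the overview's figures the gain is 13.8 percentage points, or 60.0% relative to the baseline; with the abstract's figures it is 18.6 points, or 124.8% relative. Under the stated assumption, SteP uses about 43.5% of the flat baseline's tokens.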

On MiniWoB++, SteP remains competitive with prior approaches while using significantly fewer training examples, which points to its data efficiency. The overview attributes these results to the framework's modularity and extensibility: each policy handles a narrower sub-task, which reduces the load on the model at each step and improves both accuracy and efficiency.

Theoretical and Practical Implications

On the theoretical side, SteP advances hierarchical control for AI systems, particularly for tasks that require multiple levels of abstraction and decision-making. Its dynamic interaction among policies is a design that future frameworks could adopt, both for web agents and for task-oriented AI systems more broadly.

Practically, SteP provides a blueprint for building adaptable agents that can handle the complexity of real-world web environments. This could help automate web interactions in areas such as e-commerce, customer service, and content management, simplifying routine tasks while keeping efficiency high.

Future Directions

For future work, the authors suggest exploring automatic policy discovery to improve adaptability and reduce the overhead of writing policies by hand. Integrating more sophisticated reasoning and learning mechanisms into the stack structure could also help it operate in more diverse or less predictable environments.

In summary, SteP contributes to LLM-assisted task automation by combining flexible, dynamic control with specialized policies, offering one way for agents to navigate complex, context-rich web environments.
