Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents (2412.13194v1)

Published 17 Dec 2024 in cs.LG, cs.AI, and cs.CV

Abstract: The vision of a broadly capable and goal-directed agent, such as an Internet-browsing agent in the digital world and a household humanoid in the physical world, has rapidly advanced, thanks to the generalization capability of foundation models. Such a generalist agent needs to have a large and diverse skill repertoire, such as finding directions between two travel locations and buying specific items from the Internet. If each skill needs to be specified manually through a fixed set of human-annotated instructions, the agent's skill repertoire will necessarily be limited due to the quantity and diversity of human-annotated instructions. In this work, we address this challenge by proposing Proposer-Agent-Evaluator, an effective learning system that enables foundation model agents to autonomously discover and practice skills in the wild. At the heart of PAE is a context-aware task proposer that autonomously proposes tasks for the agent to practice with context information of the environment such as user demos or even just the name of the website itself for Internet-browsing agents. Then, the agent policy attempts those tasks with thoughts and actual grounded operations in the real world with resulting trajectories evaluated by an autonomous VLM-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policies through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena. To the best of our knowledge, this work represents the first effective learning system to apply autonomous task proposal with RL for agents that generalizes real-world human-annotated benchmarks with SOTA performances. Our open-source checkpoints and code can be found in https://yanqval.github.io/PAE/

Summary

  • The paper introduces PAE, enabling agents to autonomously propose, practice, and evaluate skills, thereby overcoming the limitations of manual skill specification.
  • It integrates a context-aware task proposer, a reinforcement learning-based agent policy with a reasoning step, and an autonomous evaluator to optimize performance.
  • Experiments demonstrate a significant boost in zero-shot generalization, with a more than 30% relative improvement in success rate on vision-based web navigation tasks.

The paper introduces Proposer-Agent-Evaluator (PAE), a system designed for autonomous skill discovery in foundation model agents. The primary motivation behind PAE is to overcome the limitations of manually specifying skills for generalist agents, which can be both expensive and insufficient to cover the wide range of real-world tasks.

PAE addresses this challenge by enabling agents to autonomously propose, practice, and evaluate new skills. The system comprises three main components:

  1. Context-Aware Task Proposer: This component generates a diverse set of feasible tasks for the agent to practice. It uses contextual information from the environment to ensure that the proposed tasks are relevant and realistic. This context can range from user demonstrations to simply the name of the website the agent is interacting with. The task proposer is framed as a conditional autoregressive generator, utilizing foundation models like Claude-3-Sonnet and Qwen2VL-7B.
  2. Agent Policy: This is the core of the system, representing the agent that attempts the proposed tasks. It interacts with the environment using a set of predefined actions, such as clicking links or typing text in a web browser. The agent policy is initialized from pre-trained VLMs like LLaVa-1.6-Mistral-7B and LLaVa-1.6-Yi-34B and improved through an online reinforcement learning (RL) loop. A key feature of the agent policy is the incorporation of a reasoning step before outputting actions, allowing the agent to reflect on its skills and improve generalization to unseen tasks.
  3. Autonomous Evaluator: This component provides feedback on the agent's performance by evaluating the success of the attempted tasks. The evaluator outputs a sparse 0/1 reward based only on the final three screenshots and the agent's final answer. Similar to the task proposer, the evaluator is implemented using foundation models such as Claude-3-Sonnet and Qwen2VL-7B. (A sketch of how the three components fit together follows this list.)
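
The following is a minimal sketch of one data-collection round through these three components. The `TaskProposer`, `AgentPolicy`, `Evaluator`, and `WebEnv` interfaces (and method names such as `propose`, `act`, `score`, `last_screenshots`) are hypothetical stand-ins used for illustration, not the authors' actual code.

```python
# Minimal sketch of one PAE data-collection round; all object and method
# names are illustrative assumptions, not the paper's implementation.

def collect_round(proposer, policy, evaluator, env, context, n_tasks):
    """Propose tasks, let the agent practice them, and score each attempt."""
    experience = []
    for _ in range(n_tasks):
        # 1. Context-aware task proposer: condition on environment context
        #    (e.g. the website name or user demos) to get a feasible task.
        task = proposer.propose(context)

        # 2. Agent policy: attempt the task, emitting a reasoning step
        #    ("thought") before each grounded browser action.
        obs, done = env.reset(task), False
        trajectory = []
        while not done:
            thought, action = policy.act(task, obs)
            trajectory.append((obs, thought, action))
            obs, done = env.step(action)

        # 3. Autonomous evaluator: sparse 0/1 reward computed only from the
        #    final screenshots and the agent's final answer.
        reward = evaluator.score(task, env.last_screenshots(k=3), env.final_answer())
        experience.append((task, trajectory, reward))
    return experience
```

The reward collected in step 3 is the only learning signal; it drives the RL update described next.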

The learning process in PAE is formulated as a contextual Markov Decision Process (MDP), where the goal is to find a reward-maximizing policy. The key assumption is that the ground-truth task distribution and reward function are hidden during training. Instead, PAE relies on the task proposer and autonomous evaluator as proxies. The agent interacts with the environment, and the collected data is used to update the policy using an RL algorithm, specifically Filtered Behavior Cloning (Filtered BC).
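
As a concrete illustration of the Filtered BC step, the sketch below keeps only the trajectories the autonomous evaluator scored as successful and fine-tunes the policy to imitate its own thoughts and actions on them. The `policy.nll` helper and the optimizer interface are assumptions made for this sketch, not the paper's API.

```python
# Sketch of a Filtered Behavior Cloning update: behavior-clone only the
# trajectories the evaluator scored with reward 1.

def filtered_bc_update(policy, optimizer, experience):
    """Supervised fine-tuning on the agent's own successful attempts."""
    successes = [(task, traj) for task, traj, reward in experience if reward == 1]
    n_steps = 0
    for task, trajectory in successes:
        for obs, thought, action in trajectory:
            # Negative log-likelihood of the (thought, action) tokens the
            # agent produced, conditioned on the task and observation.
            loss = policy.nll(task, obs, thought, action)  # hypothetical helper
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            n_steps += 1
    return n_steps  # number of gradient steps taken this round
```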

The effectiveness of PAE is validated on vision-based web navigation tasks using WebVoyager and WebArena. Key findings from the experimental results include:

  • PAE significantly improves the zero-shot generalization capability of VLM Internet agents. Specifically, it leads to more than a 30% relative improvement in average success rate on WebVoyager and WebArena.
  • The LLaVa-1.6-7B-based PAE agent achieves performance comparable to LLaVa-1.6-34B fine-tuned on demonstration data, while using significantly less test-time compute.
  • PAE outperforms state-of-the-art open-source VLM agents, including Qwen2VL-72B, with an absolute performance gain of over 10% (from 22.6% to 33.0%) on WebVoyager.
  • The improvements from PAE stem from the asymmetric capabilities of VLMs as agents versus task proposers/evaluators, and the system can leverage weaker models for proposing and evaluation to improve a stronger agent model.
  • The inclusion of a reasoning step in the agent policy significantly improves its generalization capability to unseen tasks.
  • PAE demonstrates favorable scaling properties, with similar performance gains observed when using a larger and more capable base VLM (LLaVa-1.6-34B).
  • The skills learned by PAE generalize to unseen websites, indicating the acquisition of general web browsing capabilities.
  • A human study validates the effectiveness of the autonomous evaluator, demonstrating a high correlation with human judgments.
  • Error analysis reveals that PAE effectively addresses the major failure modes of the base models, such as visual hallucinations and low-level skill deficiencies.
  • Providing context information, such as user demos, to the task proposer improves task quality and agent performance.

Overall, PAE represents a step toward developing foundation model agents that can autonomously acquire and refine skills, demonstrating strong potential for building more capable and generalizable AI systems.