BAGEL: Bootstrapping Agents by Guiding Exploration with Language (2403.08140v2)

Published 12 Mar 2024 in cs.CL

Abstract: Following natural language instructions by executing actions in digital environments (e.g. web browsers and REST APIs) is a challenging task for language model (LM) agents. Unfortunately, LM agents often fail to generalize to new environments without human demonstrations. This work presents BAGEL, a method for bootstrapping LM agents without human supervision. BAGEL converts a seed set of randomly explored trajectories or synthetic instructions into demonstrations via round-trips between two noisy LM components: an LM labeler, which converts a trajectory into a synthetic instruction, and a zero-shot LM agent, which maps the synthetic instruction into a refined trajectory. By performing these round-trips iteratively, BAGEL quickly shifts the initial distribution of trajectories toward those that are well described by natural language. We use BAGEL demonstrations to adapt a zero-shot LM agent at test time via in-context learning over retrieved demonstrations, and find improvements of 2-13% absolute on ToolQA and MiniWob++, with up to a 13x reduction in execution failures.

Citations (7)

Summary

  • The paper introduces BAGEL, a method that bootstraps digital agents using iterative synthetic demonstrations to guide language-based exploration.
  • It employs a round-trip process between two language model components to refine trajectories, improving execution accuracy and cutting failures by up to 13×.
  • Experimental results on MiniWoB++ and ToolQA demonstrate performance gains of 2-13% over zero-shot baselines, underscoring BAGEL’s practical impact.

Bootstrapping Digital Agents with Synthetic Demonstrations via Language-Guided Exploration

Introduction

Recent advances have spurred research aimed at bridging the gap between language models (LMs) and a practical application: executing natural language instructions in digital environments. Despite this promising direction, a persistent challenge remains in generalizing LM agents to novel environments where explicit human demonstrations are scarce or unavailable. This paper presents BAGEL (Bootstrapping Agents by Guiding Exploration with Language), a framework for bootstrapping digital agents via synthetic demonstrations generated through iterative round-trips between two LM components. BAGEL yields significant improvements in execution accuracy and a marked reduction in execution failures in digital environments without resorting to human-generated datasets.

BAGEL Methodology

BAGEL starts from a seed set of trajectories obtained either from exploratory actions in an environment or from synthetic instructions. These trajectories are then iteratively refined through a round-trip process between an LM labeler, which assigns a synthetic instruction to each trajectory, and an LM-based agent, which attempts to execute a refined trajectory from that instruction. The core idea is to progressively steer the trajectory distribution toward trajectories that are well described by natural language.

Both components operate under noisy conditions, yet their imperfections counterbalance through iteration, gradually improving the relevance and executability of the derived trajectories. Notably, BAGEL does not require pre-trained agents or concrete information regarding potential instructions beforehand, making it a particularly adaptive method for generating instructional demonstrations.
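The round-trip described above can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: the labeler and agent here are toy stand-ins for the two noisy LM components, and all names (`round_trip_refine`, `toy_label`, `toy_act`) are hypothetical.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]  # a trajectory is a sequence of environment actions

def round_trip_refine(
    seed: Trajectory,
    label: Callable[[Trajectory], str],  # LM labeler: trajectory -> instruction
    act: Callable[[str], Trajectory],    # zero-shot LM agent: instruction -> trajectory
    n_iters: int = 3,
) -> Tuple[str, Trajectory]:
    """Iterate label/act round-trips so the trajectory drifts toward
    behavior that is well described by natural language."""
    traj = seed
    instruction = ""
    for _ in range(n_iters):
        instruction = label(traj)  # describe the current trajectory
        traj = act(instruction)    # re-execute from that description
    return instruction, traj

# Toy stand-ins for the two (noisy) LM components, for illustration only.
def toy_label(traj: Trajectory) -> str:
    return "do: " + " then ".join(traj)

def toy_act(instruction: str) -> Trajectory:
    return instruction.removeprefix("do: ").split(" then ")

instr, traj = round_trip_refine(["click", "type", "submit"], toy_label, toy_act, n_iters=2)
# With these toy components the pair is a fixed point, so the loop converges immediately;
# with real LMs, each pass nudges the trajectory toward instructions the agent can follow.
```

In the actual method, both components are prompted LMs and the loop is what gradually cancels out their individual noise.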

Generating Demonstrations

Demonstration generation begins with an exploration phase, which is either trajectory-first (acting in the environment first) or instruction-first (sampling synthetic instructions first). This phase produces the seed set on which iterative refinement operates. The iterative process relabels trajectories to form coherent instruction-trajectory pairs, which are then filtered and, if deemed satisfactory, added to a pool of synthetic demonstrations.
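The outer generation loop can be sketched as follows. This is a hypothetical sketch under simplifying assumptions: `label`, `act`, and the `accept` filter are illustrative stand-ins, not the paper's components, and the filter here is a trivial length check rather than the paper's actual filtering criteria.

```python
def generate_demonstrations(seeds, label, act, accept, max_round_trips=2):
    """Refine each seed trajectory via label/act round-trips, then keep
    only instruction-trajectory pairs that pass the filter."""
    pool = []
    for traj in seeds:
        instruction = None
        for _ in range(max_round_trips):
            instruction = label(traj)  # relabel the current trajectory
            traj = act(instruction)    # re-execute from the new label
        if accept(instruction, traj):  # e.g. drop degenerate runs
            pool.append((instruction, traj))
    return pool

# Toy components for illustration only.
label = lambda t: "do: " + " then ".join(t)
act = lambda s: s.removeprefix("do: ").split(" then ")
accept = lambda instr, t: len(t) >= 2  # keep multi-step demonstrations only

demos = generate_demonstrations([["click", "submit"], ["noop"]], label, act, accept)
# The single-step seed is filtered out; the multi-step pair enters the pool.
```

The resulting pool is then used at test time for in-context learning over retrieved demonstrations.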

Practical Implications and Theoretical Insights

The methodology enriches the field by showcasing that digital agents can learn and improve by leveraging synthetic demonstrations devoid of human supervision. The iterative refinement process encapsulates two vital aspects:

  • Optimization of trajectory distribution towards natural language describability.
  • Improvement in the agent's understanding of environmental dynamics.

Furthermore, BAGEL's approach aligns with insights from Hindsight Experience Replay (HER), extending its principles to the language domain by utilizing full trajectory relabeling and addressing the limitations of zero-shot components through iterative noise reduction.

Experimental Evaluation

The paper evaluates BAGEL on two domains, MiniWoB++ and ToolQA, reporting improvements of 2-13% over a zero-shot baseline and a reduction in execution failures of up to 13×. These outcomes indicate that BAGEL's synthetic demonstrations capture meaningful environment dynamics, supporting its utility as an unsupervised adaptation tool for digital agents. The experiments underscore BAGEL's adaptability and effectiveness, offering a novel approach to synthesizing demonstrations for language-conditioned policies in digital environments.

Conclusion and Potential Directions

The introduction of BAGEL marks a significant stride toward autonomous digital agents capable of learning and executing tasks in novel environments. By harnessing the synergy between two key LM components through iterative refinement, BAGEL opens a gateway to significantly enhancing agent performance without direct human intervention. The findings provide a fertile ground for further exploration, particularly in enhancing the diversity of synthetic demonstrations and narrowing the gap between generated tasks and test-time instructions. The ongoing development in this field holds promise for more refined, autonomous, and versatile digital agents capable of understanding and executing complex tasks across varied digital landscapes.

Acknowledgements

The research acknowledges support from numerous contributors and emphasizes the importance of responsible deployment considering potential risks associated with real-world applications. The commitment to advancing the field while mindful of ethical considerations sets a conscientious precedent for future research endeavors.