- The paper introduces BAGEL, a method that bootstraps digital agents using iterative synthetic demonstrations to guide language-based exploration.
- It employs a round-trip process between two language model components to refine trajectories, improving execution accuracy and cutting failures by up to 13×.
- Experimental results on MiniWoB++ and ToolQA demonstrate performance gains of 2-13% over zero-shot baselines, underscoring BAGEL’s practical impact.
Bootstrapping Digital Agents with Synthetic Demonstrations via Language-Guided Exploration
Introduction
Recent work has increasingly focused on bridging the gap between language models (LMs) and their practical use in executing natural language instructions in digital environments. A persistent challenge is generalizing LM agents to novel environments where human demonstrations are scarce or unavailable. This paper presents BAGEL (Bootstrapping Agents by Guiding Exploration with Language), a framework that bootstraps digital agents with synthetic demonstrations generated through iterative round-trips between two LM components. BAGEL improves execution accuracy and reduces execution failure rates in digital environments without relying on human-generated datasets.
BAGEL Methodology
BAGEL starts from a seed set of trajectories obtained either from exploratory actions in an environment or from synthetic instructions. These trajectories are then iteratively refined through a round-trip process involving an LM labeler, which assigns synthetic instructions to trajectories, and an LM-based agent, which attempts to generate refined trajectories that follow these instructions. The core idea is to progressively shift the trajectory distribution toward trajectories that are well described by natural language, using two noisy LM components.
Both components operate under noisy conditions, yet their imperfections counterbalance through iteration, gradually improving the relevance and executability of the derived trajectories. Notably, BAGEL does not require pre-trained agents or concrete information regarding potential instructions beforehand, making it a particularly adaptive method for generating instructional demonstrations.
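The round-trip described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `Trajectory` representation, the `round_trip` function, and the toy labeler and agent are all hypothetical stand-ins for the two LM components, chosen only to show how alternating between labeling and re-execution can drop actions that the instruction does not describe.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: a trajectory is a list of "kind:target" action strings.
Trajectory = List[str]
Labeler = Callable[[Trajectory], str]   # stand-in for the LM labeler
Agent = Callable[[str], Trajectory]     # stand-in for the LM-based agent


def round_trip(seed: Trajectory, labeler: Labeler, agent: Agent,
               rounds: int) -> List[Tuple[str, Trajectory]]:
    """Alternate labeling and re-execution, recording each (instruction, trajectory)."""
    history = []
    trajectory = seed
    for _ in range(rounds):
        instruction = labeler(trajectory)   # trajectory -> synthetic instruction
        trajectory = agent(instruction)     # instruction -> refined trajectory
        history.append((instruction, trajectory))
    return history


def toy_labeler(trajectory: Trajectory) -> str:
    # Assumption: this toy labeler only "notices" click actions.
    clicks = [a.split(":", 1)[1] for a in trajectory if a.startswith("click:")]
    return "Click " + " then ".join(clicks)


def toy_agent(instruction: str) -> Trajectory:
    # Assumption: this toy agent turns each named target into one click action.
    targets = instruction[len("Click "):].split(" then ")
    return ["click:" + t for t in targets if t]


seed = ["scroll:down", "click:menu", "hover:logo", "click:settings"]
history = round_trip(seed, toy_labeler, toy_agent, rounds=2)
```

In this toy setup the exploratory `scroll` and `hover` actions disappear after the first round, and the second round leaves the pair unchanged. Real LM components would be stochastic, so convergence of this kind is not guaranteed.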
Generating Demonstrations
BAGEL begins demonstration generation with an exploration phase, which is either trajectory-first or instruction-first. This phase produces the seed set on which iterative refinement operates. Each iteration relabels trajectories to form coherent instruction-trajectory pairs, which are then filtered; pairs that pass the filter are added to a pool of synthetic demonstrations.
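A rough sketch of the relabel, filter, and pool steps follows. The function name `build_demo_pool`, the `score` callable, the threshold value, and the stopping rule (accept the first pair that passes the filter) are assumptions made for illustration; the summary above does not specify how the filter is computed.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]              # hypothetical: "kind:target" action strings
Demo = Tuple[str, Trajectory]       # (instruction, trajectory) pair


def build_demo_pool(seeds: List[Trajectory],
                    label: Callable[[Trajectory], str],
                    act: Callable[[str], Trajectory],
                    score: Callable[[str, Trajectory], float],
                    threshold: float = 0.9,
                    max_rounds: int = 3) -> List[Demo]:
    """Relabel each seed, keep the pair if it passes the filter, else re-execute."""
    pool: List[Demo] = []
    for trajectory in seeds:
        for _ in range(max_rounds):
            instruction = label(trajectory)
            if score(instruction, trajectory) >= threshold:
                pool.append((instruction, trajectory))   # satisfactory pair
                break
            trajectory = act(instruction)                # try to refine
    return pool


def toy_label(trajectory: Trajectory) -> str:
    # Toy labeler: describes only the click actions.
    clicks = [a.split(":", 1)[1] for a in trajectory if a.startswith("click:")]
    return "Click " + " then ".join(clicks)


def toy_act(instruction: str) -> Trajectory:
    # Toy agent: one click per target named in the instruction.
    targets = instruction[len("Click "):].split(" then ")
    return ["click:" + t for t in targets if t]


def toy_score(instruction: str, trajectory: Trajectory) -> float:
    # Toy filter: fraction of actions whose target is mentioned in the instruction.
    if not trajectory:
        return 0.0
    words = instruction.split()
    hits = sum(1 for a in trajectory if a.split(":", 1)[1] in words)
    return hits / len(trajectory)


seeds = [["scroll:down", "click:menu"], ["click:ok"]]
pool = build_demo_pool(seeds, toy_label, toy_act, toy_score)
```

Here the first seed fails the filter on its first pass, because the instruction does not account for the scroll action, and it is accepted only after one re-execution. The second seed passes immediately.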
Practical Implications and Theoretical Insights
The method shows that digital agents can improve using synthetic demonstrations produced without human supervision. The iterative refinement process serves two purposes:
- Optimization of trajectory distribution towards natural language describability.
- Improvement in the agent's understanding of environmental dynamics.
Furthermore, BAGEL's approach aligns with insights from Hindsight Experience Replay (HER), extending its principles to the language domain by utilizing full trajectory relabeling and addressing the limitations of zero-shot components through iterative noise reduction.
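To make the connection to hindsight relabeling concrete, the sketch below relabels failed episodes with a description of what the trajectory actually did, so that every stored pair is consistent. The `Episode` type, the `hindsight_relabel` function, and the `toy_describe` labeler are hypothetical; in BAGEL the description would come from an LM labeler looking at the full trajectory, which this toy function only imitates.

```python
from typing import Callable, List, NamedTuple


class Episode(NamedTuple):
    instruction: str        # the goal the agent was given (or its relabeled version)
    trajectory: List[str]   # hypothetical: "kind:target" action strings
    success: bool           # whether the trajectory satisfies the instruction


def hindsight_relabel(episodes: List[Episode],
                      describe: Callable[[List[str]], str]) -> List[Episode]:
    """Keep successful episodes; give failed ones an instruction they do satisfy."""
    relabeled = []
    for ep in episodes:
        if ep.success:
            relabeled.append(ep)
        else:
            # The whole trajectory is described, not just its final state.
            relabeled.append(Episode(describe(ep.trajectory), ep.trajectory, True))
    return relabeled


def toy_describe(trajectory: List[str]) -> str:
    # Stand-in for an LM labeler: lists the actions verbatim.
    return "Do: " + ", ".join(trajectory)


episodes = [
    Episode("Open settings", ["click:menu", "click:settings"], True),
    Episode("Open settings", ["click:menu", "click:help"], False),
]
relabeled = hindsight_relabel(episodes, toy_describe)
```

Marking a relabeled episode as successful assumes the description is accurate. With a noisy labeler that assumption can fail, which is the limitation the iterative refinement is meant to address.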
Experimental Evaluation
The paper evaluates BAGEL on two domains, MiniWoB++ and ToolQA, reporting improvements of 2-13% over a zero-shot baseline and a reduction in execution failures of up to 13×. These results suggest that synthetic demonstrations help the agent capture environment dynamics without human supervision. They also indicate that the approach carries over across both domains as a way of synthesizing demonstrations for language-conditioned policies in digital environments.
Conclusion and Potential Directions
BAGEL is a step toward autonomous digital agents that can learn and execute tasks in novel environments. By iteratively refining trajectories with two LM components, it improves agent performance without direct human intervention. The findings point to further work, particularly on increasing the diversity of synthetic demonstrations and narrowing the gap between generated tasks and test-time instructions. Continued progress in this area could yield more capable and versatile digital agents that understand and execute complex tasks across varied digital environments.
Acknowledgements
The research acknowledges support from numerous contributors and emphasizes the importance of responsible deployment considering potential risks associated with real-world applications. The commitment to advancing the field while mindful of ethical considerations sets a conscientious precedent for future research endeavors.