- The paper introduces Agent Workflow Memory (AWM), a framework that abstracts recurring sub-task routines to guide LM agent actions and improve adaptation.
- The paper demonstrates that both online and offline workflow induction modes yield significant performance gains, including a 51.1% relative increase in success rate on the WebArena benchmark.
- The paper shows that abstracted workflows enhance cross-domain generalization by reducing task steps and compensating for domain shifts through continual memory augmentation.
Agent Workflow Memory: Inducing and Leveraging Reusable Task Workflows for LM Agents
The proliferation of LM-based agents for web and digital workflow navigation has exposed critical deficiencies in handling long-horizon, complex tasks spanning variable domains and shifting contexts. Agents traditionally rely on training with fixed example sets or in-context demonstrations, but such paradigms fail to equip them with robustness against distributional shift and environment changes—most notably, they lack mechanisms for extracting, abstracting, and continually leveraging recurrent task routines shared across tasks. Inspired by cognitive mechanisms of human expertise, "Agent Workflow Memory" (AWM) (arXiv:2409.07429) proposes a framework for continual induction and utilization of workflows—abstracted sub-task routines—embedded in agent memory to guide future action selection and enhance adaptation.
Workflow Induction: Representation and Mechanism
AWM operationalizes workflows as memory objects with two key components: (1) a natural-language (NL) description summarizing the high-level routine, and (2) a sequence of parameterized steps (each consisting of environment observation, agent reasoning, and executable action). The induction process leverages LM prompting to abstract common sub-trajectories from past agent trajectories, systematically replacing instance-specific context with generic variables to maximize workflow generality. This abstraction facilitates both intra-task reuse and cross-task generalization, distinguishing workflows from concrete training examples, which can bias agents towards overly specific behaviors.
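The two-part workflow representation can be sketched as a small data structure. This is a minimal illustration, not the paper's actual code: the class names, the `{placeholder}` variable syntax, and the `instantiate` helper are all assumptions introduced here for concreteness.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowStep:
    """One parameterized step: observation, reasoning, and an action
    whose instance-specific values are replaced by {placeholders}."""
    observation: str
    reasoning: str
    action: str

@dataclass
class Workflow:
    """A reusable routine: an NL description plus parameterized steps."""
    description: str
    steps: list[WorkflowStep] = field(default_factory=list)

    def instantiate(self, **bindings: str) -> list[str]:
        """Bind the generic variables to task-specific values."""
        return [step.action.format(**bindings) for step in self.steps]

# A generic "search by name" routine with {query} as the abstracted variable
wf = Workflow(
    description="Search a site for an item by name",
    steps=[
        WorkflowStep("search box visible", "locate the search field",
                     "type [search] [{query}]"),
        WorkflowStep("query entered", "submit the search",
                     "press [Enter]"),
    ],
)
print(wf.instantiate(query="red hat"))
```

The abstraction step—swapping concrete values like "red hat" for `{query}`—is what lets one induced routine serve many task instantiations.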
Workflow induction is contextualized in two modes:
- Offline: Workflows are abstracted from annotated training examples before inference, and the same induced set augments agent memory uniformly at test time.
- Online: Workflows are extracted dynamically from successful test trajectories judged by an LM evaluator, continuously integrated into memory in a streaming manner to enable adaptation to new task distributions.
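The online mode above amounts to a streaming loop: attempt a task with the current memory, and on LM-judged success, induce new workflows from the trajectory back into memory. A minimal sketch, where `solve`, `evaluate`, and `induce` are hypothetical stand-ins for the agent rollout, the LM evaluator, and the LM-based induction prompt:

```python
def online_awm(solve, evaluate, induce, tasks, memory):
    """Sketch of AWM's online mode: solve tasks in a stream; after each
    LM-judged success, induce workflows from the trajectory into memory,
    so later tasks benefit from earlier ones."""
    trajectories = []
    for task in tasks:
        traj = solve(task, memory)        # memory-augmented agent rollout
        if evaluate(task, traj):          # LM evaluator judges success
            memory.extend(induce(traj))   # abstract reusable sub-routines
        trajectories.append(traj)
    return trajectories

# Toy demonstration with stub components standing in for the real agent
memory: list[str] = []
online_awm(
    solve=lambda task, mem: f"trace({task}|mem={len(mem)})",
    evaluate=lambda task, traj: True,            # pretend every task succeeds
    induce=lambda traj: [f"workflow-from-{traj}"],
    tasks=["t1", "t2"],
    memory=memory,
)
print(memory)  # memory grows as successful trajectories accumulate
```

The key design point is that memory is mutated inside the loop, so workflow induction and utilization interleave rather than running as separate phases.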
Experimental Evaluation: WebArena and Mind2Web
AWM is empirically validated across WebArena and Mind2Web benchmarks, both renowned for task complexity and domain diversity. In WebArena, AWM operated exclusively in the online mode due to the absence of high-quality training data and demonstrated marked gains over both autonomous agents (e.g., BrowserGym) and agents with hand-written workflows (e.g., SteP). Specifically:
- WebArena: AWM achieved a 51.1% relative increase in total success rate over the top published autonomous agent, and outperformed human-supervised workflows by 7.9%. It also reduced solution trajectory length by 2 steps on average, confirming both task efficacy and efficiency.
Cross-template subset analysis indicated that workflows induced via AWM are not mere template memorization; they exhibit robust generalization across divergent task instantiations. Case studies illustrate the progressive construction of increasingly complex workflows, wherein prior acquired routines serve as subgoals for new composite workflows (e.g., "find a place by name" facilitating "get the zip code of a place").
- Mind2Web: Offline AWM workflows induced from training data secured the highest step and task success rates compared to leading baselines (Synapse, MindAct), primarily by enhancing element selection accuracy (+9.0 points over MindAct with GPT-4). The abstract, sub-routine-based workflow representation conferred substantial improvements over augmentation with full concrete examples, indicating superior reusability and less biasing in agent decision-making.
Notably, online AWM demonstrated robust cross-task, cross-website, and cross-domain generalization, increasing step success rates by 8.9 to 14.0 absolute points as train-test distribution gaps expanded. Unlike offline workflow induction, online AWM adaptation was observed to compensate for domain misalignment between training and test, thereby maximizing agent performance in unseen environments.
Ablations and Design Variants
AWM's efficacy was further scrutinized through ablation studies:
- LM-Based vs. Rule-Based Induction: LM-based workflow induction provided finer granularity, reducing unnecessary steps and improving efficiency compared to rule-based induction; task success rate differences were marginal on WebArena but more pronounced (+2.8 points) on Mind2Web.
- Representation Format: Verbalizing action trajectories into NL text versus code format yielded comparable performance, indicating that the choice of representation format is not critical to effective memory augmentation.
- Environment State in Workflows: NL descriptions proved more effective than filtered HTML in workflow steps, as the latter increased context length and introduced irrelevant elements, negatively impacting step success rate.
- Workflow Utilization via Action Space Expansion: Integrating workflows as callable high-level actions in the agent's action space (AWM-AS) conferred a minor additional gain in step success rate, but agents were hesitant to invoke the new actions under dynamic environment conditions, suggesting future research directions in enabling dynamic execution loops.
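The action-space-expansion variant can be illustrated with a small expansion function: if the agent emits a workflow name instead of a primitive action, the name is macro-expanded into the workflow's primitive steps. The function name, the space-separated call format, and the `{arg}` placeholder are illustrative assumptions, not the paper's actual interface:

```python
def expand_action(action: str, workflows: dict[str, list[str]]) -> list[str]:
    """If `action` invokes a known workflow, expand it into primitive
    steps with the argument substituted; otherwise pass it through."""
    name, _, arg = action.partition(" ")
    if name in workflows:
        return [step.replace("{arg}", arg) for step in workflows[name]]
    return [action]  # already a primitive action

# One induced workflow registered as a callable high-level action
workflows = {
    "find_place": ["type [search] [{arg}]", "press [Enter]", "click [result]"],
}
print(expand_action("find_place Carnegie Museum", workflows))
print(expand_action("click [42]", workflows))
```

This static macro-expansion view also hints at why agents hesitated: in a dynamic environment, the expanded steps may no longer match the live page state, which is what a dynamic execution loop would address.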
Practical and Theoretical Implications
AWM establishes a scalable strategy for autonomous agent adaptation, moving beyond monolithic example-based training towards incremental, continual refinement of memory. The empirical results highlight the importance of abstracting and leveraging reusable routines as a core competency for agents facing real-world task generalization and environment variability. Practically, AWM's support for both offline and online induction modes, combined with demonstrable cross-domain robustness, suggests its utility for deployments in environments where training data is scarce or unreliable.
From a theoretical perspective, AWM connects cognitive science mechanisms of expertise abstraction with LM-driven decision-making, providing an operational framework for memory-driven agentic behavior. The observed snowball effect in workflow accumulation and utilization underscores potential avenues for recursive skill composition, hierarchical planning, and dynamic memory management in future LLM-based agent architectures.
Future Directions
AWM's modularity invites extensions in:
- Optimal workflow induction strategies (e.g., hybrid rule-LM induction, clustering-based meta-workflow abstraction)
- Real-time workflow adaptation under non-stationary and adversarial environments
- Integration with multimodal perception and closed-loop feedback mechanisms
- Dynamic action space manipulation for highly flexible reasoning and execution
Furthermore, embedding AWM within advanced agent architectures could unlock more general and efficient approaches to open-ended task completion, continual learning, and domain adaptation.
Conclusion
Agent Workflow Memory offers a formal framework for inducing, representing, and leveraging reusable workflows in LM-based agents, yielding substantial empirical gains in task success and generalization across web navigation benchmarks. Its principled abstraction of experience into workflows alleviates context bias, enhances cross-domain adaptability, and accelerates efficient task-solving, constituting a foundational advance in dynamic agent memory and adaptation mechanisms.