Agent Foundation Models

Updated 4 August 2025
  • Agent Foundation Models are unified frameworks that integrate large language and vision-language models to enable language-guided planning, semantic reasoning, and hierarchical action execution.
  • They leverage a visual-to-language encoder, LLM-driven task decomposition, and a language-conditioned policy network to achieve efficient exploration and skill reuse in reinforcement learning.
  • Empirical results in robotic manipulation show marked improvements in sample efficiency, transfer learning, and one-shot imitation, underscoring their potential for complex tasks.

Agent Foundation Models are unified frameworks that embed large-scale pre-trained models, particularly large language models (LLMs) and vision-language models (VLMs), at the core of intelligent, goal-directed agent architectures. These models provide semantic reasoning, planning, and skill composition capabilities, enabling agents to address traditionally challenging reinforcement learning (RL) tasks such as exploration in sparse-reward environments, skill reuse, efficient data transfer, and imitation from observations. The concept is operationalized by tightly coupling language-based decomposition of tasks, perceptual grounding via vision-language similarity, and hierarchical policy execution, demonstrating substantial improvements over conventional RL architectures in sample efficiency, transfer, and zero-shot generalization.

1. Unified Framework Architecture

The foundational design integrates three principal modules within the agent:

  • Visual-to-Language Encoder: A VLM (e.g., CLIP) encodes visual observations $o_t$ and compares them with candidate language captions $l_i$ using an embedding similarity: $y = \phi_I(o_t) \cdot \phi_T(l_i)$. When $y > \gamma$ (typically $\gamma = 0.8$), the observation is considered to satisfy the linguistic description. This provides a semantic bridge from pixel space to natural language subgoal recognition.
  • LLM–Driven Planner: An LLM (e.g., FLAN-T5) takes as input a high-level instruction such as "Stack the red object on the blue object" and outputs a curriculum of natural language sub-goals (e.g., "Pick up the red object", "Place the red object on the blue object"). Prompting is performed with a few in-context examples, and the output subgoal sequences are both semantically rich and syntactically consistent enough for subsequent parsing.
  • Language-Conditioned Policy Network: The agent’s policy is parameterized as a Transformer that maps the current state $s_t$ and active sub-goal $g_i$ to an action $a_t$:

$$a_t = f_\theta(s_t, g_i)$$

This network is trained both with sparse environment rewards (external) and dense "internal" rewards, computed by checking the VLM similarity for subgoal completion.
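
As a concrete illustration, the VLM subgoal-satisfaction check that drives the internal reward could be sketched as below. This is a minimal sketch, assuming an off-the-shelf CLIP checkpoint (openai/clip-vit-base-patch32) and helper names of my own choosing; only the threshold $\gamma = 0.8$ comes from the text above.

```python
# Minimal sketch of the VLM subgoal-satisfaction check (internal reward signal).
# Assumptions: a public CLIP checkpoint stands in for the paper's VLM; gamma follows the text.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def subgoal_satisfied(observation: Image.Image, subgoal: str, gamma: float = 0.8) -> bool:
    """Return True when phi_I(o_t) . phi_T(g_i) exceeds the threshold gamma."""
    inputs = processor(text=[subgoal], images=observation, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)   # phi_I(o_t)
    txt = txt / txt.norm(dim=-1, keepdim=True)   # phi_T(g_i)
    return (img @ txt.T).item() > gamma

def internal_reward(observation, subgoal) -> float:
    """Dense internal reward: 1.0 when the active sub-goal is recognized as complete."""
    return 1.0 if subgoal_satisfied(observation, subgoal) else 0.0
```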

The agent operates under a “Collect–Infer” learning paradigm: distributed agents gather interaction episodes, successful trajectories are selected (either by external rewards or VLM-inferred subgoal achievement), and the policy is updated via behavioral cloning.
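
One Collect–Infer iteration might look like the following sketch. The environment and policy interfaces (policy.act, policy.bc_loss, a gymnasium-style step signature) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of one Collect-Infer iteration: gather episodes against the LLM curriculum,
# keep successful trajectories, then update the policy by behavioral cloning.
import torch

def collect_infer_iteration(env, policy, optimizer, llm_subgoals, vlm_check,
                            num_episodes: int = 100, max_steps: int = 200) -> int:
    successes = []
    # --- Collect: roll out the current policy, scheduling sub-goals via the VLM ---
    for _ in range(num_episodes):
        obs, _ = env.reset()
        trajectory, goal_idx, solved = [], 0, False
        for _ in range(max_steps):
            goal = llm_subgoals[goal_idx]
            action = policy.act(obs, goal)                # a_t = f_theta(s_t, g_i)
            trajectory.append((obs, goal, action))
            obs, ext_reward, done, _, _ = env.step(action)
            solved = solved or ext_reward > 0             # sparse external success signal
            if vlm_check(obs, goal):                      # VLM-inferred sub-goal completion
                if goal_idx == len(llm_subgoals) - 1:     # curriculum finished
                    solved = True
                    break
                goal_idx += 1
            if done:
                break
        if solved:
            successes.append(trajectory)

    # --- Infer: behavioral cloning on the filtered, successful trajectories ---
    for trajectory in successes:
        for obs, goal, action in trajectory:
            loss = policy.bc_loss(obs, goal, torch.as_tensor(action))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return len(successes)
```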

2. Language as the Core Reasoning Tool

Language serves as the explicit interface for planning, instruction following, skill composition, and progress monitoring:

  • Task decomposition and curriculum generation are handled by the LLM, autonomously producing ordered sub-goals that reflect human-interpretable semantics and facilitate transparent monitoring (a prompt sketch follows this list).
  • The LLM outputs are parsable into concrete objectives, and the downstream policy learns to “ground” natural language into modulable actions.
  • VLM feedback is exploited to provide subgoal-level “internal rewards” for both learning and exploration, calculated as $\phi_I(o_t) \cdot \phi_T(g_i) > \gamma$. This turns language-guided planning into a dense reward structure even in settings with sparse extrinsic feedback.
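
The decomposition step might be prompted as in the sketch below. The checkpoint (google/flan-t5-large), the prompt wording, and the parsing are assumptions for illustration, not the authors' exact setup.

```python
# Hypothetical few-shot prompt for the LLM planner that turns an instruction into
# an ordered sub-goal curriculum; prompt format and checkpoint are assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

PROMPT = """Decompose the instruction into ordered sub-goals, one per line.

Instruction: Stack the green object on the yellow object.
Sub-goals:
1. Pick up the green object
2. Place the green object on the yellow object

Instruction: Stack the red object on the blue object.
Sub-goals:
"""

inputs = tokenizer(PROMPT, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Parse the generated curriculum into a list of sub-goal strings for the policy.
subgoals = [line.split(".", 1)[-1].strip() for line in text.splitlines() if line.strip()]
print(subgoals)
```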

3. Addressing Classical RL Challenges

Agent Foundation Models address multiple pain points in RL:

| Challenge | Mechanism | Empirical Finding |
| --- | --- | --- |
| Efficient Exploration | LLM curriculum + VLM rewards guide stepwise progress in sparse tasks | Drastic reduction in required steps for success |
| Data Reuse and Transfer | Offline buffer relabeling by VLM allows transfer to new compositions | Sequential task learning with fewer steps |
| Skill Reuse and Scheduling | LLM-decomposed skills scheduled by VLM detection of completion | Composed skills solve new tasks |
| Learning from Observation | Agent maps video demonstration frames to subgoals, imitating unseen skills | One-shot imitation achieved |

Specific experiments in MuJoCo robotic stacking (e.g., creating and executing multistep stacking curricula in triple-object scenarios where random exploration requires $\sim 10^6$ steps for success) show that curriculum-based subgoal rewards massively accelerate exploration.
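
A minimal sketch of the reward shaping implied here, where the sparse task reward is augmented by an internal bonus whenever the active sub-goal from the LLM curriculum is recognized as complete. The helper names and the bonus value are assumptions.

```python
# Curriculum-based reward shaping: sparse task reward plus an internal bonus on
# VLM-detected sub-goal completion, advancing the curriculum index as progress is made.
def shaped_step_reward(ext_reward: float, obs, subgoals: list, goal_idx: int,
                       vlm_check, bonus: float = 1.0):
    """Return (total_reward, next_goal_idx) for one environment step."""
    if goal_idx < len(subgoals) and vlm_check(obs, subgoals[goal_idx]):
        return ext_reward + bonus, goal_idx + 1   # dense progress signal
    return ext_reward, goal_idx                   # otherwise only the sparse reward
```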

4. Experimental Methodology and Metrics

The framework is tested on simulated robotic manipulation with sparse rewards—for instance, stacking colored objects with success only granted if the final configuration is correct. The environment presents:

  • Observation: 128×128×3 RGB images from two cameras.
  • Action Space: Continuous (x, y) control for pick-and-place; state space includes end-effector and object 3D positions.
  • Sparse Rewards: Only successful task completion yields reward (+1); all other steps yield zero.
  • Metrics: Evaluation includes proportion of successful episodes (success rate), sample efficiency (steps to threshold performance), and data efficiency (reuse of old successful trajectories on new tasks).
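
The interface above might be expressed with gymnasium-style spaces roughly as follows; the camera keys, action bounds, and reward helper are illustrative assumptions rather than the authors' environment definition.

```python
# Sketch of the observation/action spaces and sparse reward described above.
import numpy as np
from gymnasium import spaces

observation_space = spaces.Dict({
    "front_camera": spaces.Box(low=0, high=255, shape=(128, 128, 3), dtype=np.uint8),
    "side_camera": spaces.Box(low=0, high=255, shape=(128, 128, 3), dtype=np.uint8),
})
# Continuous (x, y) control for pick-and-place.
action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

def sparse_reward(task_success: bool) -> float:
    """+1 only when the final configuration is correct; 0 otherwise."""
    return 1.0 if task_success else 0.0
```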

Key empirical results include strong sample-efficiency gains and demonstrable transfer through internal episode relabeling and curriculum-driven exploration acceleration.
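
The relabeling step could be sketched as below, under an assumed buffer layout of (observation, action) pairs per trajectory: old data is scanned for frames that already satisfy sub-goals of a new task and is re-used as supervision for the new composition.

```python
# Sketch of VLM-based relabeling of an offline buffer for transfer to new tasks.
def relabel_buffer(buffer, new_subgoals, vlm_check):
    """buffer: list of trajectories, each a list of (obs, action) pairs."""
    relabeled = []
    for trajectory in buffer:
        for goal in new_subgoals:
            # Keep the prefix that ends where the VLM first detects goal completion,
            # and relabel every step in that prefix with the new language goal.
            for t, (obs, action) in enumerate(trajectory):
                if vlm_check(obs, goal):
                    relabeled.extend((o, goal, a) for o, a in trajectory[: t + 1])
                    break
    return relabeled
```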

5. Skill Scheduling, Reuse, and Imitation

The agent decomposes composite tasks into ordered sub-skills, each one monitored for completion by VLM similarity thresholds:

$$\phi_T(g_n) \cdot \phi_I(o_t) > \gamma$$

Upon recognition of attainment, the agent automatically advances to the next sub-goal, effectively modularizing behavior. For imitation, observed expert videos are segmented—each frame is assigned to the closest subgoal using the VLM, allowing the agent to reconstruct the skill sequence and directly imitate multi-step expert demonstrations in a one-shot fashion.
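
The frame-to-sub-goal assignment for one-shot imitation might be sketched as follows, assuming embedding helpers that mirror $\phi_I$ and $\phi_T$ and return unit-norm vectors.

```python
# Sketch of segmenting an expert video: assign each frame to its most similar sub-goal
# and collapse consecutive duplicates to recover the ordered skill sequence to imitate.
import numpy as np

def segment_demonstration(frames, subgoals, embed_image, embed_text):
    goal_embs = np.stack([embed_text(g) for g in subgoals])      # phi_T(g_n)
    assignments = []
    for frame in frames:
        sims = goal_embs @ embed_image(frame)                    # phi_T(g_n) . phi_I(o_t)
        assignments.append(int(np.argmax(sims)))
    sequence = [subgoals[i] for j, i in enumerate(assignments)
                if j == 0 or i != assignments[j - 1]]
    return sequence
```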

6. Generalization and Implications

By leveraging strong world priors in LLMs/VLMs and structuring agents around language-driven skill decomposition and semantic visual grounding, Agent Foundation Models provide:

  • Enhanced exploration in sparse or unstructured environments.
  • Unified mechanisms for transfer, scheduling, and imitation, eliminating the need for multiple specialized algorithms.
  • Interpretable and composable skills that can be scheduled and reused for novel, previously unseen tasks.

This unified approach, validated on robotic manipulation, establishes a high baseline for sample efficiency, reuse of prior knowledge, and rapid adaptation, illustrating how pre-trained foundation models enable practical, general-purpose agent systems. It extends current RL paradigms by making language the backbone of agent reasoning, task decomposition, progress monitoring, and skill acquisition.