- The paper demonstrates that a two-stage approach, combining Embodied Prior Learning (EPL) and online reinforcement learning (RL), effectively equips compact VLMs with robust reasoning and control in interactive environments.
- EPL enriches small VLMs with structured perceptual and reasoning priors from curated trajectory, environment, and external data, laying a strong foundation for high-level planning and low-level control.
- Online RL with turn-level policy optimization and self-summarization enhances action precision and overall performance, outperforming larger models on challenging simulated benchmarks.
Motivation and Problem Setting
The paper addresses the challenge of equipping compact vision-language models (VLMs) with robust embodied reasoning and control capabilities, enabling them to function as agents in interactive environments. While large-scale VLMs can be prompted to perform complex embodied tasks, their computational cost and inference overhead hinder practical deployment. In contrast, smaller VLMs lack the necessary embodied priors and reasoning skills, resulting in a significant performance gap. The core problem is to bridge this gap by developing a scalable, data- and compute-efficient training framework that enables small VLMs to generalize across both high-level planning and low-level control tasks in embodied environments.
ERA Framework Overview
ERA (Embodied Reasoning Agent) is a two-stage training pipeline designed to endow compact VLMs with generalizable embodied intelligence. The first stage, Embodied Prior Learning (EPL), injects structured perception and reasoning priors via supervised finetuning on curated datasets. The second stage applies online reinforcement learning (RL) to further refine the agent's policy through interactive environment rollouts, leveraging process-level rewards and turn-level policy optimization.
Figure 1: (a) Overview of the ERA framework: EPL finetunes on diverse data sources to provide foundational knowledge, and online RL further improves the agent. (b) ERA (EPL+RL) boosts a 3B base model to surpass GPT-4o on hold-out evaluation sets.
Embodied Prior Learning: Data Curation and Supervised Finetuning
EPL is motivated by the observation that small VLMs, after generic pretraining, lack environment-specific grounding and step-level reasoning. The paper introduces a principled taxonomy of prior data sources: trajectory data aligned with the downstream agent interface, environment-anchored data that grounds perception in the target environment, and external data that supplies general perceptual and reasoning knowledge.
The EPL stage employs a sequential curriculum: environment-anchored and external priors are used first to establish foundational skills, followed by trajectory-based finetuning that aligns the model with downstream agent tasks.
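A minimal sketch of how this sequential curriculum could be organized around a standard supervised finetuning loop; the dataset names, the `sft_epoch` helper, and the epoch counts below are illustrative assumptions rather than the paper's released recipe.

```python
# Illustrative EPL curriculum: grounding priors first, then trajectory alignment.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EPLPhase:
    name: str
    datasets: List[str]   # supervised finetuning corpora mixed within this phase
    epochs: int = 1


def run_epl(model, sft_epoch: Callable, curriculum: List[EPLPhase]):
    """Run each phase in order; `sft_epoch(model, datasets)` is assumed to perform
    one pass of next-token supervised finetuning over the listed corpora."""
    for phase in curriculum:
        for _ in range(phase.epochs):
            sft_epoch(model, phase.datasets)
    return model


# Hypothetical phase ordering reflecting the curriculum described above.
curriculum = [
    EPLPhase("grounding", ["env_anchored_perception_qa", "external_reasoning_data"], epochs=2),
    EPLPhase("trajectory", ["agent_trajectory_sft"], epochs=1),
]
```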
Online Reinforcement Learning: Agent Design and Policy Optimization
The RL stage addresses three challenges: efficient context management (handled via self-summarization), sparse reward propagation in long-horizon tasks (handled via process-level rewards), and stable policy optimization for sequence-generating agents (handled via turn-level policy optimization).
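To make the turn-level idea concrete, here is a simplified rollout-and-update loop in which each environment turn receives its own reward and advantage. The `build_context` and `parse_action` helpers, the constant baseline, and the plain REINFORCE-style surrogate are placeholders for illustration; they stand in for whatever clipped or group-relative objective the paper actually uses, and the policy is assumed to expose `generate` and `log_prob` methods.

```python
# Simplified turn-level policy optimization loop (illustrative, not ERA's exact objective).

def rollout_episode(policy, env, build_context, parse_action, max_turns=30):
    """Collect one episode as a list of turns, each with its own process-level reward."""
    turns, obs = [], env.reset()
    for _ in range(max_turns):
        prompt = build_context(obs)                    # compact context for this turn
        response = policy.generate(prompt)             # reasoning + action text
        obs, reward, done, _ = env.step(parse_action(response))
        turns.append({"prompt": prompt, "response": response, "reward": reward})
        if done:
            break
    return turns


def turn_level_update(policy, optimizer, episodes, baseline=0.0):
    """Credit each turn individually instead of propagating one episode-level return."""
    loss = 0.0
    for turns in episodes:
        for turn in turns:
            advantage = turn["reward"] - baseline      # per-turn credit assignment
            logp = policy.log_prob(turn["prompt"], turn["response"])
            loss = loss - advantage * logp             # REINFORCE-style surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```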
Experimental Results
ERA is evaluated on EmbodiedBench, covering both high-level planning (EB-ALFRED) and low-level control (EB-Manipulation). The 3B-parameter ERA model is compared against prompting-based large models (e.g., GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro) and training-based baselines (e.g., RL4VLM, VAGEN, Reinforced Reasoner, Robot-R1).
Key results:
- ERA-3B achieves 65.2% average success on EB-ALFRED and 48.3% on EB-Manipulation, outperforming GPT-4o by 8.4 and 19.4 points, respectively.
- On unseen task subsets, ERA demonstrates strong generalization, with gains of up to 38 points over VAGEN on spatial reasoning tasks.
- EPL alone provides a strong foundation, but the addition of RL yields substantial further improvements, especially on out-of-distribution tasks.
Figure 4: Examples of ERA performing step-by-step reasoning and actions: (a) on EB-ALFRED, it identifies and reflects on earlier errors; (b) on EB-Manipulation, it accurately places the star into the correct slot of the shape sorter.
Ablation Studies and Analysis
Prior Data Ablations:
RL Design Ablations:
Context Management:
- Self-summarization enables efficient and effective context usage, outperforming longer history windows and reducing input length.
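A minimal sketch of this self-summarization scheme, assuming the policy model itself is prompted to compress the interaction history into a short running summary; the prompt wording and the `generate` interface are assumptions for illustration, not the paper's exact prompts.

```python
# Rolling self-summarization: replace the full dialogue history with a short running summary.

SUMMARY_PROMPT = (
    "Update the task summary. Keep completed subgoals, known object locations, "
    "and mistakes to avoid.\n"
    "Previous summary: {summary}\n"
    "Latest step (action and feedback): {last_step}\n"
    "Updated summary:"
)


def update_summary(model, summary: str, last_step: str) -> str:
    """Fold the latest turn into the running summary with a single generation call."""
    prompt = SUMMARY_PROMPT.format(summary=summary or "none", last_step=last_step)
    return model.generate(prompt, max_new_tokens=128)


def build_turn_prompt(task: str, observation: str, summary: str) -> str:
    """Each turn conditions only on the task, the short summary, and the current observation."""
    return (
        f"Task: {task}\n"
        f"Progress summary: {summary}\n"
        f"Current observation: {observation}\n"
        "Next reasoning and action:"
    )
```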
Error and Case Analysis
Error analysis reveals that EPL reduces all error types (perception, reasoning, planning), while RL is especially effective at reducing reasoning and planning errors. Case studies show that ERA can recover from earlier mistakes via reflection and adjust plans dynamically, a capability absent in EPL-only agents.

Figure 7: Comparison of error statistics on the two benchmarks.
Figure 8: Reflection error example in EB-Manipulation. ERA successfully identified the correct target object (the azure triangular prism), while the EPL-only agent mistakenly selected the yellow cube.
Figure 9: Successful reflection example in EB-Manipulation. Both agents were able to identify the silver star as the target object.
Figure 10: Planning error example in EB-ALFRED. ERA successfully identified the second spray bottle as SprayBottle_2, while the EPL-only agent repeatedly located the same SprayBottle.
Figure 11: Successful reflection example in EB-ALFRED. Both agents were able to identify the FloorLamp as the target object.
Implementation Considerations
- Resource Requirements: EPL and RL stages are optimized for efficiency, with the 3B model trained on 2 H200-140GB GPUs (EPL: 2–5 hours; RL: ~12 hours per task).
- Scalability: The framework supports large-scale parallel rollouts and is compatible with distributed training and memory optimizations (DeepSpeed, BF16, gradient checkpointing); a configuration sketch follows this list.
- Deployment: The agent's structured output format and context management facilitate integration into real-world or simulated robotic systems.
- Limitations: All evaluations are in simulation; real-world transfer remains an open challenge.
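Following up on the scalability bullet above, a hedged configuration sketch using Hugging Face `TrainingArguments` with DeepSpeed, BF16, and gradient checkpointing; the ZeRO stage, batch sizes, learning rate, and file names are assumptions rather than the paper's exact settings.

```python
# Assumed memory-efficient finetuning setup (values are illustrative, not the paper's config).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="era_epl_ckpt",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,     # keep the effective batch size reasonable on 2 GPUs
    bf16=True,                          # BF16 mixed precision
    gradient_checkpointing=True,        # recompute activations to save memory
    deepspeed="ds_zero2_config.json",   # hypothetical DeepSpeed ZeRO-2 config file
    learning_rate=1e-5,
    num_train_epochs=2,
    logging_steps=10,
    save_strategy="epoch",
)
```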
Implications and Future Directions
ERA demonstrates that compact VLMs, when equipped with structured embodied priors and refined via online RL, can match or surpass much larger models on challenging embodied tasks. The framework's modularity and data taxonomy provide a blueprint for scalable embodied agent development. The strong correlation between EPL and RL performance suggests that future work should focus on improving prior data quality and diversity, as well as exploring real-world deployment and sim-to-real transfer. The turn-level RL paradigm and self-summarization context management are likely to be broadly applicable to other agentic domains involving long-horizon, multimodal reasoning and control.
Conclusion
ERA establishes a practical and effective recipe for transforming small VLMs into generalizable embodied agents. By systematically integrating diverse prior knowledge and principled RL design, ERA achieves strong performance and generalization with modest computational resources. The framework's insights into data curation, context management, reward shaping, and policy optimization are expected to inform future research in embodied AI, agent alignment, and scalable multimodal learning.