Embodied Planner-R1: Adaptive Task Planning

Updated 3 July 2025
  • Embodied Planner-R1 is a family of embodied task planning approaches that leverages pure reinforcement learning and Interactive Policy Optimization (IPO) for closed-loop decision making.
  • It employs group rollouts and sparse, completion-driven rewards to robustly learn and execute tasks in partially observed environments.
  • Empirical results on benchmarks such as ALFWorld and ScienceWorld show a 15–28 percentage-point improvement over supervised fine-tuning and prior RL-based baselines, without human demonstration data.

Embodied Planner-R1 defines a family of embodied task planning approaches that leverage advanced reasoning capabilities—originally developed within LLMs—to enable interactive, adaptive planning in physically grounded or simulated agents. Embodied Planner-R1 systems bridge the gap between static, open-loop LLM-based planners and the closed-loop, feedback-driven demands of real-world or partially observed environments. The core framework uses pure reinforcement learning in interaction with the environment, group rollouts for experience collection, sparse outcome-driven rewards, and an Interactive Policy Optimization (IPO) algorithm tailored for efficient policy improvement from grouped trajectories.

1. Framework Architecture and Learning Principles

Embodied Planner-R1 formulates the embodied task planning problem as a Partially Observable Markov Decision Process (POMDP), explicitly modeling the agent’s interaction with the environment:

  • States ($S$), Actions ($A$), Observations ($O$): The agent receives an observation $o_t$, reasons over the history (including the original task and all previous actions/observations), and outputs the next action $a_t$.
  • Policy ($\pi_\theta$): The policy maps the accumulated trajectory $(q, o_0, a_0, o_1, \ldots, a_{t-1}, o_t)$ to the next action.
  • ReAct-style Trajectory: The agent cycles through explicit "Thought → Action → Observation" steps, $\tau_t = (q, o_0, a_0, o_1, \ldots, a_{t-1}, o_t)$, promoting grounded reasoning interleaved with action execution.

Interaction is fully closed-loop: the agent explores, reasons, and updates online based on the latest environmental state, rather than following a precomputed or static plan.
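
The closed-loop cycle can be made concrete with a short sketch. The `env.reset`/`env.step` and `policy.generate_step` interfaces below are illustrative assumptions rather than the paper's actual API; the point is the structure of the Thought → Action → Observation loop and the accumulation of the full history $\tau_t$.

```python
# Minimal sketch of one closed-loop ReAct-style episode (interfaces assumed).
def run_episode(env, policy, task_query, max_steps=40):
    """Roll out one trajectory tau = (q, o_0, a_0, o_1, ..., a_{t-1}, o_t)."""
    observation = env.reset(task_query)                  # o_0
    trajectory = [("task", task_query), ("obs", observation)]
    for _ in range(max_steps):
        # The policy conditions on the whole accumulated history and emits
        # an explicit thought phi_t followed by the next action a_t.
        thought, action = policy.generate_step(trajectory)
        trajectory += [("thought", thought), ("action", action)]
        observation, done, success = env.step(action)    # latest environment state
        trajectory.append(("obs", observation))
        if done:
            return trajectory, float(success)            # sparse terminal reward
    return trajectory, 0.0                               # truncated episode counts as failure
```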

Group Rollout Mechanism:

Multiple environment-agent instances run in parallel, each independently generating trajectories for up to a fixed maximum number of steps. This enables exposure to a diverse set of interaction paths per RL update and provides the statistical power necessary for group-based policy optimization.

Sparse, Completion-Driven Reward:

The environment signals only binary success at the end of an episode; there are no intermediate or heuristically shaped rewards. This design encourages the agent to learn the causal structure of the task rather than reward-shaping shortcuts, and to optimize directly for successful task completion.
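
The following sketch shows how group rollouts and the sparse terminal reward fit together, building on the hypothetical `run_episode` helper above. The thread pool and per-rollout environment factory are assumptions for illustration; the paper does not prescribe a specific execution strategy.

```python
from concurrent.futures import ThreadPoolExecutor

def collect_group(env_factory, policy, task_query, n_rollouts=8, max_steps=40):
    """Collect n independent (trajectory, reward) pairs for one task.
    Each reward is the binary completion signal returned at episode end."""
    def one_rollout(_):
        env = env_factory()                              # fresh environment instance
        return run_episode(env, policy, task_query, max_steps)

    with ThreadPoolExecutor(max_workers=n_rollouts) as pool:
        return list(pool.map(one_rollout, range(n_rollouts)))
```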

2. Policy Optimization: Interactive Policy Optimization (IPO)

IPO is tailored for learning from group rollouts with sparse rewards.

  • Advantage Normalization: For each group of $n$ rollouts of the same task, the advantage for trajectory $i$ at step $t$ is:

$$\hat{A}_{i,t} = \frac{r_i - \mu_r}{\sigma_r}$$

where $r_i$ is the completion reward for trajectory $i$, and $\mu_r$, $\sigma_r$ are the mean and standard deviation of rewards over the group.

  • Probability Ratio and Clipped Policy Objective: The policy is updated using a PPO-like surrogate loss, but the ratio and advantage are computed over the group, without reliance on a critic:

$$Pr_t(\theta) = \frac{\pi_{\theta}(\phi_t, a_t \mid \tau_{i,<t})}{\pi_{\theta_\text{old}}(\phi_t, a_t \mid \tau_{i,<t})}$$

$$\mathcal{J}_\text{IPO}(\theta) = \mathbb{E}_{\text{trajectories}} \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|\tau_i|} \sum_{t=1}^{|\tau_i|} \left\{ \min\Big[ Pr_t(\theta)\, \hat{A}_{i,t},\; \operatorname{clip}\big(Pr_t(\theta), 1-\varepsilon, 1+\varepsilon\big)\, \hat{A}_{i,t} \Big] - \beta\, D_\text{KL}\big[\pi_{\theta} \,\|\, \pi_{\text{ref}}\big] \right\} \right]$$

where $\phi_t$ is the thought emitted at step $t$ (cf. the ReAct-style cycle above), $\varepsilon$ is the clipping threshold, $\beta$ weights the KL penalty, and $\pi_\text{ref}$ is a fixed reference policy.

  • No Actor-Critic or Value Head: Unlike standard PPO or actor-critic methods, credit assignment is relative across the group of rollouts rather than estimated by a learned value function, which makes learning from sparse binary rewards more robust.

This approach encourages the policy to prefer trajectories that outperform the group average, while the clipping and KL penalty maintain diversity and discourage overfitting to spurious experience.
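
The update can be summarized in a short PyTorch-style sketch under the stated assumptions: per-step log-probabilities of the sampled thought/action tokens are available under the current, old, and reference policies, and each trajectory's group-normalized advantage is broadcast to all of its steps. Tensor layouts and the `eps`/`beta` values are illustrative, not the paper's hyperparameters.

```python
import torch

def ipo_loss(group_rewards, logp_new, logp_old, logp_ref, eps=0.2, beta=0.01):
    """group_rewards: length-n list of binary rewards r_i.
    logp_*: lists of n tensors, each of shape [|tau_i|] with per-step log-probs."""
    r = torch.as_tensor(group_rewards, dtype=torch.float32)
    advantages = (r - r.mean()) / (r.std() + 1e-8)       # group-relative credit, A_hat_i

    per_traj = []
    for a_hat, lp_new, lp_old, lp_ref in zip(advantages, logp_new, logp_old, logp_ref):
        ratio = torch.exp(lp_new - lp_old)               # Pr_t(theta)
        surrogate = torch.minimum(ratio * a_hat,
                                  torch.clamp(ratio, 1 - eps, 1 + eps) * a_hat)
        kl = lp_new - lp_ref                             # simple sampled KL estimate vs. pi_ref
        per_traj.append((surrogate - beta * kl).mean())  # (1/|tau_i|) * sum over steps
    # The objective is maximized, so return its negation as a loss.
    return -torch.stack(per_traj).mean()                 # (1/n) * sum over the group
```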

3. Empirical Evaluation and Metrics

Embodied Planner-R1 was validated on two prominent text-based embodied agent benchmarks:

  • ALFWorld: Home-environment tasks with object-centric multi-step goals.
  • ScienceWorld: Science-lab themed tasks requiring reasoning and interaction across multiple rooms and apparatus.

Metrics:

  • Task Completion Rate (Primary): Percentage of tasks completed successfully, as reported by the environment (binary success/failure).
  • Generalization Gap: Absolute accuracy drop from training/seen environments to held-out/unseen test environments.

Task completion rates:

| Method | ALFWorld | ScienceWorld |
|---|---|---|
| Qwen2.5 + SFT | ~80% | ~62% |
| Qwen2.5 + ETO | ~82% | ~56% |
| Embodied Planner-R1 (Ours) | 97.8% | 79.9% |

Embodied Planner-R1 outperformed both supervised fine-tuning (SFT) and previous RL-based agents such as ETO by 15–28 percentage points on these benchmarks.
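
For reference, a small sketch of how the two reported metrics can be computed from raw per-episode outcomes; the input lists are illustrative, and the generalization gap follows the "drop from seen to unseen" definition given above.

```python
def completion_rate(successes):
    """successes: list of 0/1 outcomes reported by the environment."""
    return 100.0 * sum(successes) / len(successes)

def generalization_gap(seen_successes, unseen_successes):
    """Accuracy drop from seen to held-out environments
    (negative if accuracy is higher on the unseen split)."""
    return completion_rate(seen_successes) - completion_rate(unseen_successes)

# Example: 95% on seen vs. 98% on unseen gives a gap of -3.0 percentage points.
print(generalization_gap([1] * 19 + [0], [1] * 49 + [0]))
```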

4. Generalization and Robustness

A defining feature of Embodied Planner-R1 is its strong generalization capability:

  • Generalization Gap (ALFWorld): –3.33% averaged over all tasks, with many tasks (e.g., "Pick Two") exhibiting positive transfer to unseen environments.
  • Robustness: The agent rapidly adapts to distributional shifts, avoiding overfitting to idiosyncratic artifacts in the training environments (evidenced by successful navigation to rare object locations not present in the training set).
  • Sample Efficiency: The group rollout and sparse reward regime enables meaningful learning without large demonstration corpora or curated supervision.

5. Technical Innovations

Embodied Planner-R1’s key technical contributions include:

  • Elimination of Human Supervision: Unlike imitation or SFT-based approaches, the policy is derived solely through environmental interaction and outcome-based reward.
  • Pure RL with Grouped Trajectories: Enables learning from direct agent-environment experience, even with extremely sparse supervision.
  • Explicit Thought-Action Reasoning: The policy reasons in an interleaved "Thought → Action → Observation" cycle, reminiscent of ReAct, but the reasoning chains are learned through environmental interaction and adapt to environmental complexity rather than being imitated from demonstrations.
  • Completion-Driven Optimization: Rewards reflect only true task success, greatly reducing the potential for "reward hacking" or shortcut learning endemic to dense or shaped reward paradigms.

6. Comparative Analysis with Prior Approaches

| Dimension | Embodied Planner-R1 | SFT/NAT/ETO Baselines |
|---|---|---|
| Supervision required | None (pure RL) | Large labeled datasets |
| Learning signal | Sparse, binary | Shaped, often dense |
| Adaptivity / online update | Yes (closed loop) | No (static or batched) |
| Grouped rollout/processing | Yes | No |
| Generalization gap (ALFWorld) | –3.33% | –7.34% (ETO), higher for SFT |
| High-complexity tasks | Maintains high success rate | Performance collapse (<60%, sometimes <10%) |

Most notable is the absence of any reliance on human-annotated episodes for training. The generalization gap on ALFWorld (–3.33%) is less than half that of ETO (–7.34%). Embodied Planner-R1 also performs markedly better on challenging tasks with complex dependencies or infrequent goal configurations.

7. Significance and Outlook

Embodied Planner-R1 establishes a new evidence-based paradigm for planning agents in partially observed, interactive environments:

  • Autonomous environmental exploration and implicit causal structure learning without hand-crafted supervision.
  • Efficient, effective group-based policy optimization supporting high success and fast adaptation.
  • Template for next-generation embodied agents in both simulated and real-world domains, with anticipated applicability to more richly multi-modal, physical, and visually grounded settings beyond text.

These advances set the stage for scalable, robust, and generalizable embodied planning across application domains, including domestic robotics, science automation, and interactive digital assistants, where human demonstration data is scarce or expensive but environmental feedback is plentiful and reliable.