EB-ALFRED: Embodied AI Benchmark

Updated 8 June 2026
  • EB-ALFRED is a benchmark for language-conditioned agents that perform complex, multi-step household tasks using egocentric vision and environmental feedback.
  • It leverages a structured task taxonomy, detailed expert demonstrations, and precise evaluation metrics like Task Success Rate to quantify performance.
  • The ERA framework employs a two-stage approach—embedding trajectory priors and online reinforcement learning—to enhance generalization in unseen tasks.

Embodied ALFRED (EB-ALFRED) is a benchmark for evaluating language-conditioned agents in long-horizon, compositional household tasks requiring instruction following and visuomotor reasoning in simulated 3D environments. Building on the ALFRED framework, EB-ALFRED emphasizes mapping free-form natural language directives and egocentric vision to sequences of low-level actions, including navigation and complex object manipulation. It catalyzes research in grounded language understanding, persistent memory, generalization, and robust planning for embodied AI.

1. Benchmark Motivations and Design Principles

EB-ALFRED was introduced to highlight and assess the gap between static vision-and-language tasks and the challenges of real-world, language-driven robotics. The core research question is whether an agent can execute detailed, free-form instructions, such as “Rinse off a mug and place it in the coffee maker”, using only egocentric visual input and environmental feedback, with no access to privileged or oracle state information. Agents must reason over sequences of state changes, some of them irreversible (e.g., slicing or heating an object), under partial observability, and with randomized object appearances and placements that discourage overfitting to scene layouts (Shridhar et al., 2019).

The simulation platform is AI2-THOR 2.0, comprising 120 fully interactive indoor scenes, equally split among kitchens, living rooms, bedrooms, and bathrooms, with realistic physics and object interaction capabilities.

2. Task Taxonomy, Data, and Instruction Collection

EB-ALFRED encodes seven high-level household task types:

  1. Pick & Place
  2. Stack & Place
  3. Pick Two & Place
  4. Clean & Place
  5. Heat & Place
  6. Cool & Place
  7. Examine

Each instance is parameterized by specific object classes, target receptacles, and scene indices; stacking tasks further specify a recipient base object. For each parameterization, three expert demonstrations are generated via a classical PDDL planner.
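
The parameterization described above can be written down as a small record type. The following is a minimal sketch, not the benchmark's actual data format: the names `TaskType`, `TaskSpec`, and all field names are hypothetical, and the rule that only stacking tasks carry a base object is an assumption read off the sentence above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskType(Enum):
    """The seven high-level task types listed in the text."""
    PICK_AND_PLACE = "Pick & Place"
    STACK_AND_PLACE = "Stack & Place"
    PICK_TWO_AND_PLACE = "Pick Two & Place"
    CLEAN_AND_PLACE = "Clean & Place"
    HEAT_AND_PLACE = "Heat & Place"
    COOL_AND_PLACE = "Cool & Place"
    EXAMINE = "Examine"


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical record for one task parameterization."""
    task_type: TaskType
    object_class: str
    receptacle: str
    scene_index: int
    base_object: Optional[str] = None  # assumed to apply to stacking tasks only

    def __post_init__(self) -> None:
        is_stack = self.task_type is TaskType.STACK_AND_PLACE
        if is_stack and self.base_object is None:
            raise ValueError("Stack & Place needs a recipient base object")
        if not is_stack and self.base_object is not None:
            raise ValueError("base_object is only meaningful for Stack & Place")


# Illustrative values only; object and receptacle names are made up here.
rinse_mug = TaskSpec(TaskType.CLEAN_AND_PLACE, "Mug", "CoffeeMachine", scene_index=1)
```

Validating in `__post_init__` keeps malformed parameterizations from reaching a planner or a data loader.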

Datasets are split into train, validation (seen/unseen), and test (seen/unseen) folds. In total, there are 25,743 language directives and 8,055 unique expert trajectories, yielding 428,322 (image, action) pairs. Each directive is authored through a dual-pipeline involving Mechanical Turk crowdworkers, with quality control by two or three independent validators. Goal statements average 15–20 tokens; step-by-step instructions span about 50 tokens, mapping to an average of 7.5 subgoals and 50 low-level environment actions.
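
As a quick consistency check on the figures above, the per-trajectory ratios can be computed directly. This sketch uses only the three totals quoted in the text; the constant and function names are illustrative.

```python
# Dataset totals as quoted in the text.
NUM_DIRECTIVES = 25_743
NUM_TRAJECTORIES = 8_055
NUM_IMAGE_ACTION_PAIRS = 428_322


def per_trajectory(total: int, trajectories: int = NUM_TRAJECTORIES) -> float:
    """Average count of `total` items per expert trajectory."""
    return total / trajectories


directives_per_traj = per_trajectory(NUM_DIRECTIVES)
pairs_per_traj = per_trajectory(NUM_IMAGE_ACTION_PAIRS)

print(f"{directives_per_traj:.2f} directives per trajectory")        # 3.20
print(f"{pairs_per_traj:.1f} (image, action) pairs per trajectory")  # 53.2
```

Roughly three directives per trajectory and about 53 (image, action) pairs per trajectory agree with the stated average of about 50 low-level actions per task.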

3. Action Space, State Representation, and Partial Observability

At each timestep, the agent selects among 13 discrete actions:

  • Navigation: MoveAhead, RotateLeft, RotateRight, LookUp, LookDown
  • Manipulation: Pickup, Put, Open, Close, ToggleOn, ToggleOff, Slice
  • Termination: Stop

Manipulation actions additionally require a pixel-wise interaction mask (224×224) specifying the object in the RGB observation to affect. The environment maintains full object pose, open/closed, on/off, and state (slice/clean/dirty/heated) flags, many of which are irreversible within an episode. Agents receive only egocentric RGB observations, inducing significant partial observability and non-Markovian challenges (Shridhar et al., 2019).
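
The action space and the mask requirement can be sketched as follows. The action names and the 224×224 mask size come from the text; the `Action` enum, the `AgentStep` class, and the nested-list mask representation are assumptions made for illustration and are not the simulator's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

MASK_SIZE = 224  # mask resolution quoted in the text


class Action(Enum):
    MOVE_AHEAD = "MoveAhead"
    ROTATE_LEFT = "RotateLeft"
    ROTATE_RIGHT = "RotateRight"
    LOOK_UP = "LookUp"
    LOOK_DOWN = "LookDown"
    PICKUP = "Pickup"
    PUT = "Put"
    OPEN = "Open"
    CLOSE = "Close"
    TOGGLE_ON = "ToggleOn"
    TOGGLE_OFF = "ToggleOff"
    SLICE = "Slice"
    STOP = "Stop"


MANIPULATION = frozenset({
    Action.PICKUP, Action.PUT, Action.OPEN, Action.CLOSE,
    Action.TOGGLE_ON, Action.TOGGLE_OFF, Action.SLICE,
})

Mask = List[List[bool]]  # MASK_SIZE rows of MASK_SIZE booleans


@dataclass
class AgentStep:
    """Hypothetical container for one agent decision."""
    action: Action
    mask: Optional[Mask] = None

    def validate(self) -> None:
        needs_mask = self.action in MANIPULATION
        if needs_mask and self.mask is None:
            raise ValueError(f"{self.action.value} requires an interaction mask")
        if not needs_mask and self.mask is not None:
            raise ValueError(f"{self.action.value} does not take a mask")
        if self.mask is not None:
            rows_ok = len(self.mask) == MASK_SIZE
            cols_ok = all(len(row) == MASK_SIZE for row in self.mask)
            if not (rows_ok and cols_ok):
                raise ValueError(f"mask must be {MASK_SIZE}x{MASK_SIZE}")


# A navigation step needs no mask; a manipulation step must carry one.
AgentStep(Action.MOVE_AHEAD).validate()
empty_mask = [[False] * MASK_SIZE for _ in range(MASK_SIZE)]
AgentStep(Action.PICKUP, mask=empty_mask).validate()
```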

4. Evaluation Metrics and Baselines

Performance is quantified over sets of $N$ episodes, where episode $i$ has $K_i$ atomic goal-conditions. Key metrics:

  • Task Success Rate:

$$\text{TaskSuccess} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\{\text{all } K_i \text{ goal-conditions satisfied in episode } i\}$$

  • Goal-Condition Success Rate:

$$\text{GoalCondSuccess} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{K_i}\sum_{j=1}^{K_i}\mathbb{1}\{\text{condition } j \text{ met in episode } i\}$$

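  • Path-Weighted Versions: Each metric is weighted by $\frac{L^*}{\max(L^*,\,\hat{L})}$, where $L^*$ is the length of the expert demonstration and $\hat{L}$ is the length of the agent's trajectory, which penalizes unnecessarily long executions.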
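
The metrics above can be computed in a few lines. This is a minimal sketch under two assumptions: the `Episode` record and its field names are hypothetical, and the path weight is applied per episode before averaging, which is one reading of the definition above.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Episode:
    conditions_met: List[bool]  # one flag per goal-condition (K_i entries)
    agent_steps: int            # length of the agent's trajectory
    expert_steps: int           # length of the expert demonstration


def task_success(ep: Episode) -> float:
    """1.0 only if every goal-condition is satisfied."""
    return 1.0 if all(ep.conditions_met) else 0.0


def goal_cond_success(ep: Episode) -> float:
    """Fraction of goal-conditions satisfied."""
    return sum(ep.conditions_met) / len(ep.conditions_met)


def path_weight(ep: Episode) -> float:
    """Expert length over max(expert length, agent length); at most 1."""
    return ep.expert_steps / max(ep.expert_steps, ep.agent_steps)


def evaluate(episodes: Sequence[Episode]) -> Dict[str, float]:
    n = len(episodes)
    return {
        "task_success": sum(task_success(e) for e in episodes) / n,
        "goal_cond_success": sum(goal_cond_success(e) for e in episodes) / n,
        "pw_task_success": sum(path_weight(e) * task_success(e) for e in episodes) / n,
        "pw_goal_cond_success": sum(path_weight(e) * goal_cond_success(e) for e in episodes) / n,
    }


# Two toy episodes: one solved at expert length, one half-solved in twice the steps.
episodes = [
    Episode([True, True], agent_steps=40, expert_steps=40),
    Episode([True, False, False, True], agent_steps=100, expert_steps=50),
]
scores = evaluate(episodes)
```

In this toy example the second episode satisfies half of its goal-conditions but takes twice as long as the expert, so its path-weighted goal-condition score drops from 0.5 to 0.25.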

A strong Seq2Seq baseline, with frozen ResNet-18 visual encoder, Bi-LSTM language encoder, soft attention mechanism, and auxiliary L2 regression heads for trajectory and sub-goal progress monitoring, is evaluated in both vanilla and progress-monitor augmented (PM) forms. Human performance serves as an upper bound for comparison:

| Model | Test Seen: Task | Test Seen: GoalCond | Test Unseen: Task | Test Unseen: GoalCond |
|---|---|---|---|---|
| Seq2Seq | 2.1% | 7.4% | 0.5% | 7.1% |
| Seq2Seq+PM | 4.0% | 9.4% | 0.4% | 7.0% |

H: ~91%

Performance drops sharply on unseen scenes, demonstrating limited generalization. Major sources of error include failure to track subgoal progress, repeated or spurious actions, and mask localization failures (Shridhar et al., 2019).

5. Algorithmic Challenges and ERA Advances

Key challenges for EB-ALFRED include:

  • Long Horizons and Compositionality: Tasks average about 50 low-level actions spread across multiple dependent subgoals.
  • Irreversible State Changes: Mistakes can be unrecoverable, requiring forward-only reasoning.
  • High Branching Factor: 13 actions, mask prediction, and compositional environments amplify search complexity.
  • Partial Observability: Necessitating the formation of persistent, structured memory.

To address these issues, ERA (Embodied Reasoning Agent) proposes a two-stage framework (Chen et al., 14 Oct 2025):

Stage 1: Embodied Prior Learning (EPL)

  • Trajectory-Augmented Priors: Each trajectory step is augmented with GPT-4o-generated structured traces (current observation, reflection on history, next-step plan).
  • Environment-Anchored Priors: Auxiliary tasks such as Masked Action Modeling and Action Sequence Reordering inject explanation and ordering supervision.
  • External Knowledge Priors: Large, out-of-domain corpora are used for additional high-level and spatial reasoning.
  • Supervised Curricular Finetuning: The agent is first trained on environment/external priors, then on demonstration trajectories.
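
The source does not specify how the Masked Action Modeling and Action Sequence Reordering tasks are constructed. The sketch below shows one plausible way to derive such training samples from a list of action names; the function names, the `[MASK]` placeholder, and the sample layout are all assumptions rather than ERA's actual data format.

```python
import random
from typing import Dict, List

MASK_TOKEN = "[MASK]"  # hypothetical placeholder string


def masked_action_sample(actions: List[str], rng: random.Random) -> Dict[str, object]:
    """Hide one action; the model would be asked to recover it."""
    idx = rng.randrange(len(actions))
    masked = list(actions)
    target = masked[idx]
    masked[idx] = MASK_TOKEN
    return {"input": masked, "masked_index": idx, "target": target}


def reordering_sample(actions: List[str], rng: random.Random) -> Dict[str, object]:
    """Shuffle the actions; the model would be asked to restore the order."""
    order = list(range(len(actions)))
    rng.shuffle(order)
    shuffled = [actions[i] for i in order]
    return {"input": shuffled, "target": list(actions)}


rng = random.Random(0)  # fixed seed so the samples are reproducible
demo = ["MoveAhead", "Pickup", "RotateLeft", "MoveAhead", "Put"]
mam = masked_action_sample(demo, rng)
asr = reordering_sample(demo, rng)
```

Both tasks reuse existing demonstrations, so they add supervision about action semantics and ordering without requiring new trajectories.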

Stage 2: Online Reinforcement Learning (RL)

  • Self-Summarization: At each step, the model compresses all prior context into a reflection token, maintaining an $\mathcal{O}(1)$ context window.
  • Dense Reward Shaping: Final success, subgoal, and behavior shaping rewards explicitly address the credit assignment and exploration bottlenecks.
  • Turn-Level Policy Optimization: RL credit assignment is performed at the full (reasoning+action) turn, not at the token level, enhancing stability.
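
The interaction between dense rewards and turn-level credit assignment can be illustrated with a small sketch. It is not ERA's training code: the reward weights are hypothetical, and a plain discounted return stands in for whatever advantage estimator the method uses. The point is only that one scalar is computed per turn and shared by every token in that turn.

```python
from typing import List

# Hypothetical reward weights; the source gives no numeric values.
W_SUCCESS, W_SUBGOAL, W_INVALID = 1.0, 0.5, -0.25


def turn_reward(task_done: bool, new_subgoals: int, invalid_action: bool) -> float:
    """Dense reward for one turn: success, subgoal, and behavior terms."""
    reward = W_SUBGOAL * new_subgoals
    if task_done:
        reward += W_SUCCESS
    if invalid_action:
        reward += W_INVALID
    return reward


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return-to-go for each turn, accumulated from the last turn backwards."""
    returns: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def broadcast_to_tokens(turn_values: List[float], tokens_per_turn: List[int]) -> List[List[float]]:
    """Give every token in a turn the same turn-level value."""
    if len(turn_values) != len(tokens_per_turn):
        raise ValueError("one value is needed per turn")
    return [[value] * n for value, n in zip(turn_values, tokens_per_turn)]


# Three turns: a subgoal reached, an invalid action, then task completion.
rewards = [
    turn_reward(False, 1, False),
    turn_reward(False, 0, True),
    turn_reward(True, 1, False),
]
returns = discounted_returns(rewards, gamma=0.5)
token_weights = broadcast_to_tokens(returns, tokens_per_turn=[3, 2, 4])
```

A token-level scheme would instead assign a separate value to each generated token, which makes the signal noisier when a turn contains a long reasoning trace followed by a single action.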

These strategies enable small (3B parameter) VLM agents to match or exceed very large prompting-based models on EB-ALFRED.

6. Empirical Results and Ablation Studies

ERA-3B, trained via EPL plus RL, achieves 65.2% average task success on EB-ALFRED, surpassing both GPT-4o (56.8%) and other 3B RL baselines such as RL4VLM (51.2%), even on previously unseen task types, and coming within 1.2 points of Claude-3.5-Sonnet (66.4%). Unseen-task performance is bolstered most by Trajectory-Augmented and Environment-Anchored priors (up to +24% compared to raw trajectory-only). Turn-level credit assignment provides the greatest stability.

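| Model | Avg | Base | Complex | Visual | Common | Spatial |
|---|---|---|---|---|---|---|
| GPT-4o | 56.8 | 64 | 68 | 46 | 54 | 52 |
| Claude-3.5-Sonnet | 66.4 | 72 | 76 | 60 | 66 | 58 |
| RL4VLM (3B) | 51.2 | 70 | 70 | 56 | 32 | 28 |
| ERA-3B (EPL only) | 56.0 | 68 | 66 | 52 | 44 | 50 |
| ERA-3B (EPL+RL) | 65.2 | 72 | 72 | 62 | 54 | 66 |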
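
The reported averages can be checked against the per-subset scores, and the same data shows where the RL stage helps most. The numbers below are copied from the table; the variable names are illustrative.

```python
SUBSETS = ("Base", "Complex", "Visual", "Common", "Spatial")

# model -> (reported average, per-subset scores in SUBSETS order)
RESULTS = {
    "GPT-4o": (56.8, (64, 68, 46, 54, 52)),
    "Claude-3.5-Sonnet": (66.4, (72, 76, 60, 66, 58)),
    "RL4VLM (3B)": (51.2, (70, 70, 56, 32, 28)),
    "ERA-3B (EPL only)": (56.0, (68, 66, 52, 44, 50)),
    "ERA-3B (EPL+RL)": (65.2, (72, 72, 62, 54, 66)),
}

# Each reported average should equal the unweighted mean of the five subsets.
for model, (reported_avg, scores) in RESULTS.items():
    computed = sum(scores) / len(scores)
    assert abs(computed - reported_avg) < 1e-9, model

# Per-subset gain from adding the RL stage on top of EPL.
epl_only = RESULTS["ERA-3B (EPL only)"][1]
epl_rl = RESULTS["ERA-3B (EPL+RL)"][1]
gains = {name: after - before for name, before, after in zip(SUBSETS, epl_only, epl_rl)}
print(gains)  # {'Base': 4, 'Complex': 6, 'Visual': 10, 'Common': 10, 'Spatial': 16}
```

All five averages are consistent with unweighted means, and the largest gain from the RL stage is on the Spatial subset (+16 points).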

Ablation studies indicate that self-summarization, which keeps a compact history, outperforms both longer context windows and no history at all, without prohibitive computational cost. Dense subgoal and behavior-shaping rewards further increase unseen-task success rates (Chen et al., 14 Oct 2025).

7. Generalization, Limitations, and Open Problems

While EB-ALFRED and ERA demonstrate major progress, several limitations and research frontiers remain:

  • Reality Gap: All training and evaluation occur in simulated environments. Real-world deployment introduces unmodeled noise, sensor imperfections, and “sim-to-real” transfer issues.
  • Data Generation Cost: Collecting high-quality environment-anchored priors is labor-intensive.
  • Multi-agent and Dynamic Environments: Current pipelines assume a single agent in a static environment; extensions to multi-agent or temporally varying tasks are needed.
  • Complex Memory and Reasoning: Future work is needed on world models, hierarchical planning, richer multi-modal sensory integration (e.g., audio, tactile), and more elaborate manipulation and language grounding schemes.

A plausible implication is that augmenting agents with robust structured priors and dense RL feedback at the reasoning-action granularity produces stable, generalizable performance, reducing the reliance on model scale alone and motivating new benchmarks and architectures for scalable embodied intelligence.

References
