EB-ALFRED Benchmark

Updated 1 July 2025
  • EB-ALFRED (ALFRED) is a benchmark that evaluates embodied AI agents on complex, language-guided household tasks requiring long action sequences, pixel-wise object manipulation, and visual understanding in simulated environments.
  • Baseline models on ALFRED achieve very low task success rates (<5%), highlighting significant challenges in long-horizon planning, compositional generalization, fine-grained language grounding, and sim2real transfer.
  • ALFRED serves as a key testbed for developing and assessing advanced methods in embodied AI, including sequence-to-sequence models, cross-modal alignment techniques like BAS, and LLM-based planning frameworks such as LoTa-Bench.

The EB-ALFRED Benchmark, commonly referenced simply as the ALFRED benchmark within the research literature, is a standardized suite for evaluating embodied agents on complex household instruction-following tasks. It is designed to close the gap between existing research datasets and the requirements for real-world, language-driven robotics by imposing unique challenges in task complexity, language grounding, action generation, and visual understanding.

1. Definition and Core Objectives

ALFRED (Action Learning From Realistic Environments and Directives) is a benchmark that evaluates agents for their ability to map natural language instructions and egocentric vision to sequences of low-level actions within simulated household environments. Its principal goal is to support the development and assessment of learning systems that can perform long-horizon, multi-step, and compositional tasks requiring real object manipulation and non-reversible changes of state (e.g., "rinse off a mug and place it in the coffee maker") (1912.01734).

ALFRED spans seven primary household task types—Pick & Place, Stack, Heat, Cool, Clean, Examine, Pick Two & Place—applicable to 84 unique object classes in 120 visually varied environments. Each task instance consists of an expert demonstration (image/action sequence), a high-level natural language goal, and step-by-step low-level instructions.

The underlying simulation environment is AI2-THOR 2.0, notable for enabling realistic, interactive physics and partial observability, bringing agent training closer to real-robot settings.
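
The interaction loop can be illustrated with a minimal sketch using the ai2thor Python package; the scene name and action below are illustrative and not tied to any particular ALFRED episode:

```python
# Step an agent in AI2-THOR, the simulator underlying ALFRED.
from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")   # an example kitchen scene
event = controller.step(action="MoveAhead")   # one discrete navigation primitive

# Each step returns an egocentric RGB frame plus metadata (object states,
# agent pose), i.e. the observation stream an ALFRED agent consumes.
frame = event.frame                           # H x W x 3 numpy array
print(frame.shape, event.metadata["lastActionSuccess"])
```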

2. Task Complexity: Sequence Length, Action Space, and Language

ALFRED is distinguished by its significant divergence from prior navigation and vision-language benchmarks in three primary aspects: sequence horizon, action space, and instruction richness.

a. Sequence Length and Compositionality

  • Demonstrations average approximately 50 low-level actions per episode, totaling over 428,000 image-action pairs.
  • Each episode typically comprises about 7.5 sub-goals, representing physical interactions and state changes distributed over a sequence (e.g., navigation, pickup, heating, placement).

b. Action Space

  • The action space consists of 13 discrete actions: 5 navigation primitives, 7 manipulation primitives (including object pickup, place, open/close, toggle, slice), and a stop action.
  • A defining characteristic is the requirement for pixelwise interaction masks for object manipulation; the agent must select specific instances within egocentric RGB frames, rather than acting on object classes or bounding boxes (see the sketch after this list).
  • The environment is partially observable, and many actions are irreversible, increasing planning difficulty.
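
How a predicted mask becomes an executable manipulation can be sketched as follows, assuming the ai2thor package: the agent's pixelwise mask is matched by intersection-over-union against the simulator's ground-truth instance masks, and the best-overlapping instance is passed to the manipulation action. ALFRED's wrapper follows the same matching convention, though the helper below is ours and the segmentation flag name varies across AI2-THOR versions:

```python
# Hedged sketch of ALFRED-style mask-based interaction.
import numpy as np
from ai2thor.controller import Controller

NAV_ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight", "LookUp", "LookDown"]
MANIP_ACTIONS = ["PickupObject", "PutObject", "OpenObject", "CloseObject",
                 "ToggleObjectOn", "ToggleObjectOff", "SliceObject"]
# ...plus a Stop action, giving the 13 discrete actions described above.

def resolve_mask(pred_mask: np.ndarray, instance_masks: dict) -> str:
    """Return the objectId whose ground-truth instance mask best overlaps
    the predicted interaction mask (intersection-over-union)."""
    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0
    return max(instance_masks, key=lambda oid: iou(pred_mask, instance_masks[oid]))

controller = Controller(scene="FloorPlan1", renderInstanceSegmentation=True)
event = controller.step(action="Pass")        # no-op, just to fetch a frame
pred_mask = np.zeros(event.frame.shape[:2], dtype=bool)
pred_mask[100:150, 120:180] = True            # stand-in for a model's mask
target = resolve_mask(pred_mask, event.instance_masks)
event = controller.step(action="PickupObject", objectId=target)
```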

c. Natural Language Instructions

  • Instructions are free-form, crowd-sourced text, encompassing both high-level goals and detailed stepwise directives with extensive variability and compositional structure.
  • Language frequently references object attributes, past events, and spatial configurations, and often requires reasoning over the agent's action history.

The benchmark's complexity in these domains outpaces navigation-only settings like R2R and synthetic action-scripting environments such as VirtualHome.

3. Expert Demonstrations and Dataset Structure

Expert demonstrations in ALFRED are generated using AI planning (with PDDL and the FF planner), producing ground-truth action sequences that fulfill the stated goal from a randomized initial configuration. Each demonstration contains:

  • A trajectory of egocentric images paired with corresponding low-level actions and pixelwise object masks for manipulation.
  • Fully replayable logs, facilitating precise imitation learning and benchmarking.
  • Triple language annotations per demonstration, totaling 25,743 unique language directives.
  • Deliberate variation in initial positions to promote dataset heterogeneity and mitigate statistical bias.
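
A minimal sketch of reading one demonstration follows; the field names track the traj_data.json layout of the public ALFRED release and should be verified against the dataset, and the path is a placeholder:

```python
# Hedged sketch: load one ALFRED demonstration and inspect its annotations
# and expert action sequence.
import json

with open("path/to/traj_data.json") as f:
    traj = json.load(f)

# Three crowd-sourced annotations per demonstration: one high-level goal
# ("task_desc") plus step-by-step directives ("high_descs").
for ann in traj["turk_annotations"]["anns"]:
    print("Goal:", ann["task_desc"])
    for step in ann["high_descs"]:
        print("  -", step)

# Planner-generated expert trajectory: replayable low-level actions.
for act in traj["plan"]["low_actions"]:
    print(act["api_action"]["action"])
```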

These properties make ALFRED suitable for robust learning and evaluation across a wide spectrum of embodied vision-and-language modeling approaches.

4. Baseline Model Performance and Central Challenges

The ALFRED benchmark paper evaluates a supervised sequence-to-sequence (Seq2Seq) agent, comprising a CNN visual encoder, a bi-directional LSTM for language, and an LSTM decoder that jointly predicts actions and manipulation masks. Auxiliary losses such as progress monitoring and sub-goal completion are also investigated.

Key results:

  • Full task success rates for the baseline remain below 5% across all validation and test splits (seen and unseen environments), while human users achieved >90% (1912.01734).
  • Partial (goal-condition) success rates do not exceed 10%.
  • The model demonstrates competence on limited sub-tasks (e.g., heating or cooling in familiar settings) but fails to generalize navigation and object manipulation to new environments.
  • Failures are attributed to long-horizon planning difficulties, poor compositional generalization, inadequate pixelwise grounding, and weak alignment between language and visual inputs.

Representative formulas for the baseline agent:

  • Attention mechanism:

z_t = (W_x h_{t-1})^\top x, \quad \alpha_t = \mathrm{Softmax}(z_t), \quad \hat{x}_t = \alpha_t^\top x

  • Action decoding:

u_t = [v_t; \hat{x}_t; a_{t-1}], \quad h_t = \mathrm{LSTM}(u_t, h_{t-1})

  • Action and mask prediction:

a_t = \arg\max (W_a [h_t; u_t]), \quad m_t = \sigma(\mathrm{deconv}[h_t; u_t])

  • Evaluation metric (path-weighted success):

p_s = s \times \frac{L^*}{\max(L^*, \hat{L})}

where s denotes task or goal-condition success, L^* is the expert action count, and \hat{L} is the length of the agent's action trajectory.
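
The decoder step and the metric can be transcribed nearly line-for-line from these formulas. The following PyTorch sketch mirrors the math; dimensions, module names, and the deconvolution stack are our own choices, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStep(nn.Module):
    """One timestep of the baseline decoder: attend over the encoded
    instruction, update the LSTM state, then predict an action and a
    pixelwise interaction mask."""
    def __init__(self, d_lang=128, d_vis=512, d_act=32, d_hid=256, n_actions=13):
        super().__init__()
        self.W_x = nn.Linear(d_hid, d_lang, bias=False)
        self.cell = nn.LSTMCell(d_vis + d_lang + d_act, d_hid)
        d_cat = d_hid + d_vis + d_lang + d_act
        self.W_a = nn.Linear(d_cat, n_actions)
        self.to_map = nn.Linear(d_cat, 64 * 7 * 7)      # seed a spatial map
        self.deconv = nn.Sequential(                    # upsample to a mask
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, 2, 1))

    def forward(self, x, v_t, a_prev, state):
        h_prev, c_prev = state                          # each of shape (d_hid,)
        z_t = x @ self.W_x(h_prev)                      # z_t = (W_x h_{t-1})^T x
        alpha = F.softmax(z_t, dim=0)                   # alpha_t = Softmax(z_t)
        x_hat = alpha @ x                               # weighted language summary
        u_t = torch.cat([v_t, x_hat, a_prev])           # u_t = [v_t; x_hat; a_{t-1}]
        h_t, c_t = self.cell(u_t[None], (h_prev[None], c_prev[None]))
        cat = torch.cat([h_t[0], u_t])                  # [h_t; u_t]
        a_t = self.W_a(cat).argmax()                    # action prediction
        fmap = self.to_map(cat).view(1, 64, 7, 7)
        m_t = torch.sigmoid(self.deconv(fmap))[0, 0]    # mask prediction
        return a_t, m_t, (h_t[0], c_t[0])

def path_weighted_success(success, expert_len, agent_len):
    """p_s = s * L^* / max(L^*, L-hat): full credit only at expert efficiency."""
    return float(success) * expert_len / max(expert_len, agent_len)
```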

5. Cross-Modal Alignment and the Boundary Adherence Score

Subsequent work has shown that a limiting factor in agent performance is cross-modal alignment: ensuring that the agent's actions are consistently mapped to the appropriate instruction segments. The Boundary Adherence Score (BAS) was introduced as an intrinsic metric to diagnose this alignment (2110.05665). BAS measures whether the selected action at each timestep corresponds to the correct instruction step per the ground truth:

B = \frac{1}{L_s} \sum_{i=1}^{L_s} \mathbb{1}[f(v_i) = f_M(v_i)]

where f(v_i) is the ground-truth instruction index for observation v_i, f_M(v_i) is the model's inferred index, and L_s is the number of observations in the episode.
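
Computing BAS from per-timestep instruction indices is a direct reading of the formula; a minimal illustration (variable names are ours):

```python
# BAS: fraction of timesteps where the model's inferred instruction index
# agrees with the ground-truth index.
def boundary_adherence_score(true_idx, model_idx):
    assert len(true_idx) == len(model_idx) and true_idx
    agree = sum(t == m for t, m in zip(true_idx, model_idx))
    return agree / len(true_idx)

# A 6-step episode spanning two instruction segments; the model switches
# to the second instruction one step too early.
print(boundary_adherence_score([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))  # ~0.83
```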

Findings indicate that standard architectures (e.g., MOCA, Seq2Seq) often yield BAS values barely above random (e.g., 0.44 vs. a random baseline of roughly 0.3 in unseen settings), revealing a lack of fine-grained text-action grounding. The introduction of a neural program counter and an auxiliary alignment loss (L_{pc}) improves BAS and, in some cases, task performance, confirming the significance of explicit cross-modal alignment mechanisms.

6. Recent Extensions and Large Model Planning in ALFRED-like Settings

The LoTa-Bench framework evaluates the effectiveness of LLMs as planners for embodied tasks, using ALFRED as a primary evaluation setting (2402.08178). LoTa-Bench operationalizes planners as follows:

  • The LLM, conditioned on task instruction and prior action history, selects the next executable skill via autoregressive prompting.
  • Evaluation is automatic: skills output by the planner are executed in simulation, and outcomes are graded against ALFRED's goal conditions.
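
A common way to operationalize such a planner, and the one this hedged sketch assumes, is to score each executable skill by the log-likelihood the LLM assigns to its text given the prompt, execute the best-scoring skill, and append it to the history. Model choice, skill strings, and prompt format below are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

SKILLS = ["find a mug", "pick up the mug", "rinse the mug in the sink",
          "put the mug in the coffee machine"]

@torch.no_grad()
def score_skill(prompt: str, skill: str) -> float:
    """Sum of log-probs of the skill's tokens conditioned on the prompt."""
    p_ids = tok(prompt, return_tensors="pt").input_ids
    s_ids = tok(" " + skill, return_tensors="pt").input_ids
    ids = torch.cat([p_ids, s_ids], dim=1)
    logp = model(ids).logits[0, :-1].log_softmax(-1)   # next-token log-probs
    targets = ids[0, 1:]
    start = p_ids.shape[1] - 1                         # first skill token
    return logp[start:].gather(1, targets[start:, None]).sum().item()

prompt = ("Task: rinse off a mug and place it in the coffee maker.\n"
          "Completed: find a mug, pick up the mug.\nNext:")
best = max(SKILLS, key=lambda s: score_skill(prompt, s))
print(best)   # the planner would now execute this skill in simulation
```

Scoring a fixed skill library keeps every emitted action executable by construction, at the cost of one forward pass per candidate skill.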

Important observations include:

  • Planning success scales with LLM model size (e.g., GPT-3 and GPT-4, with GPT-4 reaching up to 40.38% success on ALFRED valid-seen).
  • In-context example selection for prompting, natural language feedback with replanning, and domain fine-tuning all impact success rates.
  • Fine-tuning on ALFRED can lead to significant gains in in-domain performance, but often reduces generalization to new domains.
  • The use of external libraries (such as "Guidance") to constrain action outputs yields computational speedup.
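
The speedup stems from generating a single constrained continuation rather than scoring every candidate separately. The following generic sketch of constrained decoding uses Hugging Face's prefix_allowed_tokens_fn rather than the Guidance API itself; at each decoding step, only tokens that extend some skill string are permitted:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

SKILLS = ["pick up the mug", "open the microwave", "toggle the faucet on"]
skill_ids = [tok(" " + s, add_special_tokens=False).input_ids for s in SKILLS]

def make_prefix_fn(prompt_len):
    def allowed(batch_id, input_ids):
        done = input_ids[prompt_len:].tolist()          # tokens emitted so far
        nxt = {ids[len(done)] for ids in skill_ids
               if ids[:len(done)] == done and len(ids) > len(done)}
        return list(nxt) or [tok.eos_token_id]          # skill complete: stop
    return allowed

enc = tok("Next skill:", return_tensors="pt")
out = model.generate(**enc, max_new_tokens=8,
                     prefix_allowed_tokens_fn=make_prefix_fn(enc.input_ids.shape[1]),
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0, enc.input_ids.shape[1]:]))
```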

Through LoTa-Bench, the ALFRED benchmark remains a foundational testbed for research into LLM-driven embodied planning.

7. Realism and Photorealistic Extensions: ReALFRED

The ReALFRED benchmark adapts the ALFRED protocol to 3D-scanned, photo-realistic indoor environments (2407.18550). Key modifications include:

  • Multi-room, house-scale layouts from 150 real-world homes, with significantly larger spaces and more complex navigation than ALFRED's single rooms.
  • Object interactability is increased by manually segmenting background assets and replacing them with realistic, physically modeled objects.
  • Photorealism is quantified via Fréchet Inception Distance (FID) and Kernel Inception Distance (KID), showing a reduced visual domain gap relative to fully synthetic environments (as sketched after this list).
  • Methods evaluated on ALFRED suffer drastic drops in success rate: the ABP agent, for instance, achieves 3-4% on unseen test environments, exposing overfitting to synthetic environments and the optimism of results obtained there.
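
As a hedged illustration of the FID computation (assuming torchmetrics with its image extras installed; the tensors below are random stand-ins for simulator frames and real photographs):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
real = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)  # real scans
sim = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)   # sim frames
fid.update(real, real=True)
fid.update(sim, real=False)
print(fid.compute())   # lower score = smaller visual domain gap
```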

Human baselines retain high scores (~85% success), but agents trained solely in simulated ALFRED settings fail nearly completely in ReALFRED, especially for navigation-heavy tasks. ReALFRED thereby foregrounds the urgent challenge of sim2real transfer and the need to ground embodied reasoning in photorealistic, large-scale spaces.

Summary Table: ALFRED versus Prior and Successor Benchmarks

| Feature | ALFRED (EB-ALFRED) | ReALFRED | Navigation-Only Datasets (R2R) | VirtualHome |
|---|---|---|---|---|
| Scene Types | Synthetic, 1 room | Real scans, multi-room | Synthetic/real, limited rooms | Synthetic |
| Obj. Interaction | Pixelwise, mask-based | Mask-based, real objects | None | Class selection |
| Instruction | Free-form, compositional | Free-form, real language | Route-based | Templates/NL |
| Task Types | 7, state-changing | 7+, more combinations | Navigation | Action scripts |
| Visual Realism | Synthetic, game-engine | Photo-realistic, RGB-D | Variable | 3D-rendered |
| Agent Performance | <5% baseline, >90% human | <4% recent SOTA, ~85% human | >80% navigation-only | >90% simple tasks |

Significance and Impact

The EB-ALFRED benchmark and its extensions constitute a rigorously designed suite for developing and evaluating the next generation of grounded vision-language models for embodied agents. By combining complex, compositional instructions, pixelwise manipulation, partial visual observability, and diverse, realistic environments, ALFRED and its successors reveal the limitations of current inference, planning, and generalization strategies, emphasizing the necessity for hierarchical architectures, explicit cross-modal alignment, structured reasoning modules, and robust sim2real capabilities in the pursuit of deployable household robotics.