EmbodiedBench: Vision-Driven Agents Benchmark

Updated 15 March 2026
  • EmbodiedBench is a comprehensive benchmarking suite that evaluates vision-driven embodied agents through standardized, multi-modal tasks in simulation.
  • It integrates four simulated environments—EB-ALFRED, EB-Habitat, EB-Navigation, and EB-Manipulation—to assess high-level reasoning, navigation, and precise manipulation.
  • Evaluation protocols measure task success, planner efficiency, and visual feedback, providing actionable insights for advancing embodied AI research.

EmbodiedBench is a standardized, large-scale benchmarking suite designed for rigorous evaluation of vision-driven embodied agents based on multi-modal LLMs (MLLMs). It provides a unified framework for comparing proprietary and open-source MLLMs on their ability to perform high-level reasoning, complex planning, and fine-grained visuomotor control in diverse simulated environments. EmbodiedBench targets both long-horizon manipulation and navigation, featuring meticulously curated capability subsets and compositional tasks that span the full spectrum of embodied cognition challenges (Yang et al., 13 Feb 2025).

1. Benchmark Structure and Environments

EmbodiedBench consists of four distinct simulated task environments, each eliciting complementary agent capabilities:

  • EB-ALFRED: Built on AI2-THOR and ALFRED, focused on high-level household tasks (e.g., pick up, open, slice) with realistic object-rich scenes and dynamic action spaces (171–298 tokens). Task success requires semantic scene understanding and complex skill composition.
  • EB-Habitat: Based on Habitat 2.0 and the Language Rearrangement suite, centered on high-level rearrangement objectives, with multi-step navigation and object placement, and an action vocabulary emphasizing object-centric interactions.
  • EB-Navigation: AI2-THOR-based suite for low-level, atomic navigation control. Action space comprises fine-grained translations, rotations, and camera tilts. There is no GPS; agents must rely on egocentric RGB, potentially with minimal feedback signals.
  • EB-Manipulation: CoppeliaSim-based simulator with a Franka Panda robot arm executing 7-dimensional atomic actions (position, orientation, gripper state). The task structure emphasizes precise control, spatial relations, and object pose estimation.

In total, the test set contains 1,128 tasks (600 high-level; 528 low-level), and each environment is further divided into rigorously designed capability-oriented subsets.
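
The layout described above can be summarized in a small configuration sketch. The registry below is purely illustrative (the keys, field names, and example actions are assumptions, not EmbodiedBench's released API), but it captures the split between the two high-level and two low-level environments:

```python
# Hypothetical registry of the four EmbodiedBench environments described above.
# Keys, fields, and example actions are illustrative only.
ENVIRONMENTS = {
    "EB-ALFRED": {
        "simulator": "AI2-THOR / ALFRED",
        "control": "high-level",               # semantic household skills
        "example_actions": ["pick up", "open", "slice"],
    },
    "EB-Habitat": {
        "simulator": "Habitat 2.0",
        "control": "high-level",               # language rearrangement objectives
        "example_actions": ["navigate", "pick", "place"],
    },
    "EB-Navigation": {
        "simulator": "AI2-THOR",
        "control": "low-level",                # atomic translations, rotations, camera tilts
        "example_actions": ["move ahead", "rotate right", "look up"],
    },
    "EB-Manipulation": {
        "simulator": "CoppeliaSim (Franka Panda)",
        "control": "low-level",                # 7-D pose + gripper actions
        "action_dim": 7,
    },
}
```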

2. Capability Subsets and Task Design

EmbodiedBench tasks are partitioned into six core capability subsets per environment, built to probe fundamental dimensions of embodied cognition:

  • Base: Standard compositional tasks (e.g., “Put washed lettuce in the refrigerator.”)
  • Commonsense Reasoning: Indirect referencing and tasks requiring world knowledge (e.g., identifying usage from indirect object cues).
  • Complex Instruction Understanding: Long or distractor-rich scenarios with multiple steps or disambiguation requirements.
  • Spatial Awareness: Tasks demanding relational or geometric reasoning (e.g., “Place the cup to the left of the plate”).
  • Visual Perception: Object-attribute-centric tasks, focusing on color, size, or shape.
  • Long-Horizon Planning: Tasks that require many sequential actions (often exceeding 15 steps), including searching for targets that are not initially visible.

Subsets are balanced across environments (e.g., 50 instances per subset in EB-ALFRED and Habitat), providing statistically robust, standardized conditions for model comparison (Yang et al., 13 Feb 2025).
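
Because each subset is a fixed-size slice of the test set, diagnostic comparison reduces to aggregating success per subset. A minimal sketch, assuming results arrive as (subset, success) pairs; the subset labels and data format here are illustrative, not the benchmark's actual output schema:

```python
from collections import defaultdict

SUBSETS = ["base", "commonsense", "complex_instruction",
           "spatial", "visual_perception", "long_horizon"]

def subset_success_rates(results):
    """Aggregate success per capability subset.

    `results` is an iterable of (subset_name, success_bool) pairs,
    e.g. 50 instances per subset in EB-ALFRED and EB-Habitat.
    """
    totals, wins = defaultdict(int), defaultdict(int)
    for subset, success in results:
        totals[subset] += 1
        wins[subset] += int(success)
    return {s: wins[s] / totals[s] for s in SUBSETS if totals[s] > 0}
```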

3. Evaluation Protocols and Metrics

Primary evaluation employs the task success rate (SSR):

$$\mathrm{SSR} = \frac{1}{N}\sum_{i=1}^{N} r_i,$$

where $r_i = 1$ if the agent reaches the goal and $0$ otherwise. High-level tasks may also score subgoal completion via PDDL clause matching.

Secondary metrics include the mean number of planner steps ($\bar{p}$) and environment steps ($\bar{e}$), measuring efficiency and interaction complexity.

For navigation tasks, additional metrics—the success weighted by path length (SPL) and path length (PL)—can be calculated:

$$\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[d_i \leq \delta]\, \frac{d_i}{\max(d_i,\, p_i)},$$

where $d_i$ is the shortest-path length, $p_i$ is the agent's actual path length, and $\delta$ is a success threshold.
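
These metrics map directly onto per-episode records. The sketch below assumes a hypothetical episode record (field names are illustrative, not EmbodiedBench's data format) and uses the binary success flag as the indicator term in SPL:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    """Hypothetical per-episode record; fields are illustrative."""
    success: bool         # r_i: did the agent reach the goal?
    shortest_path: float  # d_i: shortest-path length to the goal
    path_length: float    # p_i: path length the agent actually traversed
    planner_steps: int    # planner (MLLM) calls in this episode
    env_steps: int        # environment steps executed

def task_success_rate(episodes: List[Episode]) -> float:
    """SSR = (1/N) * sum_i r_i."""
    return sum(e.success for e in episodes) / len(episodes)

def success_weighted_path_length(episodes: List[Episode]) -> float:
    """SPL: successful episodes weighted by d_i / max(d_i, p_i).

    The success flag stands in for the indicator term of the formula above.
    """
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.shortest_path, e.path_length)
    return total / len(episodes)

def mean_steps(episodes: List[Episode]):
    """Mean planner steps (p-bar) and environment steps (e-bar)."""
    n = len(episodes)
    return (sum(e.planner_steps for e in episodes) / n,
            sum(e.env_steps for e in episodes) / n)
```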

Each model is evaluated in a zero-shot, vision-driven agentic pipeline: the agent receives an egocentric observation, parses language goals, proposes a stepwise plan, and executes discrete or continuous primitives in the environment. Feedback signals, such as valid/invalid action indicators or low-level success flags, are provided to facilitate model adaptation and planning (Yang et al., 13 Feb 2025, Wang et al., 2 Dec 2025).
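
In code, that pipeline amounts to an observe-plan-act loop with feedback-driven replanning. The sketch below is schematic only: `env` and `mllm` are hypothetical interfaces (`reset`, `step`, `propose_plan`, and `task_succeeded` are assumed names), not EmbodiedBench's actual agent API.

```python
def run_episode(env, mllm, instruction: str, max_steps: int = 30) -> bool:
    """Schematic zero-shot, vision-driven episode loop (hypothetical interfaces)."""
    obs = env.reset()          # egocentric RGB observation
    feedback = None            # e.g. valid/invalid-action indicator or success flag
    for _ in range(max_steps):
        # Prompt the MLLM with the language goal, current image, and last feedback.
        plan = mllm.propose_plan(instruction=instruction, image=obs, feedback=feedback)
        for action in plan:    # discrete skill or continuous low-level primitive
            obs, feedback, done = env.step(action)
            if done:
                return env.task_succeeded()
            if feedback == "invalid_action":
                break          # discard the rest of the plan and replan
    return False
```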

4. Core Findings and Model Performance

Extensive evaluation of 24 leading proprietary and open-source MLLMs produces robust cross-model insights:

Representative task success rates (%):

| Model | EB-ALFRED | EB-Habitat | EB-Navigation | EB-Manipulation |
|---|---|---|---|---|
| GPT-4o | 56.3 | 59.0 | 57.7 | 28.9 |
| InternVL2_5-78B | 37.7 | 49.0 | 30.7 | 18.0 |

  • High-level tasks (ALFRED, Habitat): Proprietary MLLMs excel (μ ≈ 50–68%). Open-source models close the gap (~40–60%), especially with visually grounded prompts.
  • Low-level navigation/manipulation: All MLLMs struggle—navigation shows moderate SSR (μ ≈ 20–58%), while manipulation remains challenging (SSR often < 30%).
  • Visual input is crucial: Vision ablation drops navigation SSR from 57.7 to 17.4 and manipulation from 28.9 to 16.2.
  • Long-horizon and spatial reasoning are universal bottlenecks; SSR can degrade by 20–40% on long-horizon subsets.
  • Visual in-context learning can yield gains up to +16.7% on manipulation with two image exemplars (Yang et al., 13 Feb 2025, Wang et al., 2 Dec 2025).

5. Comparison with Related Benchmarks

EmbodiedBench is distinct from prior and contemporary embodied AI benchmarks in several critical aspects:

  • Breadth: It unifies high-level semantic planning and low-level control across both manipulation and navigation in a consistent protocol.
  • Granularity: Task subsets diagnose specific failure modes, from commonsense deficits to spatial/visual ambiguities.
  • Automation: The agent pipeline integrates model prompting, multi-modal perception, and feedback-driven replanning.
  • Public Availability: APIs, tasks, and evaluations are open-source, supporting reproducibility and downstream finetuning (Yang et al., 13 Feb 2025).

Related benchmarks include RoboBench (System 2 cognition capabilities, planning/affordance/failure diagnosis, multi-view manipulation) (Luo et al., 20 Oct 2025), EmboCoach-Bench (agentic code-generation and closed-loop policy engineering) (Lei et al., 29 Jan 2026), and Embodied4C (multi-embodiment navigation and open-form VQA) (Sohn et al., 19 Dec 2025). EmbodiedBench is distinguished by its focus on vision-first multi-modal embodied agents, unified environment/task suite, and capability-driven diagnostic design.

6. Challenges, Limitations, and Future Directions

Empirical evidence from EmbodiedBench uncovers persistent limitations for MLLMs:

  • Low-level Control: MLLMs have difficulty with precise kinematic prediction, continuous vector output, and fine coordinate grounding for manipulation.
  • Long-Horizon Planning: Multi-step task consistency and global objective tracking remain failure points.
  • Multi-View/Temporal Fusion: Models are weak at aggregating state information from sequential observations or bridging occluded contexts—with pronounced drops on spatial/long subsets.
  • Commonsense and Visual Reasoning: Indirect language, attribute-based references, and natural visual ambiguity remain major sources of error.

Recommended scientific directions include hierarchical planning, memory-augmented modules, embodied chain-of-thought training, visual in-context learning, and end-to-end fine-tuning with embodied trajectory data. The platform supports incorporation of richer evaluation metrics, adversarial scenario testing, and emerging paradigms such as diffusion policy learning or world-model-based control (Yang et al., 13 Feb 2025).

7. Benchmark Impact and Role in Embodied AI

EmbodiedBench has established itself as the canonical diagnostic suite for vision-driven embodied MLLMs by providing exhaustive coverage of high-level reasoning, spatial understanding, and complex visuomotor control. It standardizes task composition, supports detailed ablation analyses, and creates a direct empirical substrate for both model-architecture and training-recipe advancement. Its continuing influence can be observed in the reported advances for memory-control module design (Dorbala et al., 28 Jan 2026), reinforcement-augmented planning (Liu et al., 16 Oct 2025, Chen et al., 14 Oct 2025), and brain-inspired memory frameworks (Lei et al., 2 Aug 2025). As such, it will remain integral for benchmarking and advancing embodied AI systems capable of generalization, adaptive planning, and robust real-world utility.
