BEHAVIOR: Benchmark for Embodied AI

Updated 29 July 2025
  • The benchmark’s key contribution is its logic-symbolic task formalism, which enables precise evaluation of long-horizon, multi-object activities.
  • It leverages high-fidelity simulation environments with realistic physics, photorealistic rendering, and diverse sensory outputs to mimic human conditions.
  • Comprehensive metrics benchmark success and efficiency by comparing agent performance against extensive human VR demonstrations.

The BEHAVIOR Benchmark for Embodied AI defines a rigorous suite of tasks and methodologies designed to evaluate the spatial, semantic, and physical competence of embodied agents in realistic and ecologically valid simulation environments. Its development responds to the need to move beyond low-level navigation and manipulation benchmarks, promoting progress on long-horizon, multi-object, multi-skill activities while enabling quantitative comparison to human performance. The benchmark combines a logic-symbolic activity description formalism, advanced simulation platforms, and comprehensive evaluation metrics, and has been repeatedly extended into large-scale scene sets, task frameworks, and open-source platforms for multimodal embodied learning and foundation-model evaluation.

1. Activity Formalization and Logical Abstraction

The core innovation of BEHAVIOR is its formal, logical, and object-centric activity specification. Activities are declared via the BEHAVIOR Domain Definition Language (BDDL), which expresses initial and goal conditions as sets of logical predicates over scene objects and their properties (e.g., onTop, inside, cooked, stained) (Srivastava et al., 2021). An activity $\tau$ is specified as $\tau = \{S_{\tau,0},\, S_{\tau,g}\}$, where $S_{\tau,0}$ and $S_{\tau,g}$ are the sets of admissible initial and goal states, respectively. The predicates are grounded through simulated physical states such as object pose, temperature, and cleanliness.
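
To make the formalism concrete, here is a minimal Python sketch (not BDDL's actual syntax or the benchmark's code); the activity, object names, and predicates are illustrative:

```python
from dataclasses import dataclass

# A ground atom is a predicate name applied to scene objects,
# e.g. ("onTop", "plate_1", "table_1") or ("cooked", "chicken_1").

@dataclass(frozen=True)
class Activity:
    """An activity tau = (S_0, S_g): admissible initial and goal conditions."""
    initial: frozenset  # predicates sampled to hold at episode start
    goal: frozenset     # predicates that must hold for success

# Hypothetical "putting away groceries" instance; names are illustrative only.
put_away_groceries = Activity(
    initial=frozenset({
        ("onTop", "bag_1", "floor_1"),
        ("inside", "milk_1", "bag_1"),
    }),
    goal=frozenset({
        ("inside", "milk_1", "fridge_1"),
    }),
)
```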

This object-centric, logic-based abstraction supports:

  • Declarative modeling of diverse and multi-path activities (not just single trajectories or plans);
  • Simulator-agnostic adaptation, since logical predicates are not tied to any simulation backend;
  • Modular instantiation—a given activity definition can automatically sample diverse initializations and allowable goals.

The formalism also allows for complex quantification and logical operators ($\wedge$, $\vee$, $\forall$, $\exists$, $\mathrm{for}_n$), which are algorithmically "flattened" into conjunctions of atomic goal literals. This approach supports precise, partial-credit, and multi-branch evaluation.
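
As an illustration of the flattening step (a sketch of one plausible implementation, not the benchmark's actual algorithm), a universally quantified predicate expands into one ground literal per object of the quantified category, and a disjunction yields one candidate conjunction per branch:

```python
# Hypothetical scene inventory, used only for illustration.
OBJECTS = {"apple": ["apple_1", "apple_2"]}

def flatten_forall(pred, var_category, *fixed_args):
    """forall x in category: pred(x, ...) -> one ground literal per instance."""
    return [(pred, obj, *fixed_args) for obj in OBJECTS[var_category]]

def flatten_or(*branches):
    """A disjunction of sub-goals yields one candidate conjunction per branch."""
    return [list(branch) for branch in branches]

# Goal: (forall apple: inside(apple, fridge_1)) OR (forall apple: onTop(apple, counter_1))
candidate_conjunctions = flatten_or(
    flatten_forall("inside", "apple", "fridge_1"),
    flatten_forall("onTop", "apple", "counter_1"),
)
# Two candidate conjunctions of atomic literals: fully satisfying either one
# satisfies the goal, and partial satisfaction is measured per conjunction.
```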

2. Simulation Environment and Scene Realism

BEHAVIOR mandates simulation environments capable of physically and visually realistic instantiations:

  • Physics: Rigid-body (and, in later expansions, deformable and fluid) dynamics at high frequency (e.g., 1/300 s steps, pyBullet/PhysX 5), supporting nuanced object manipulation, stacking, and tool use (Srivastava et al., 2021, Li et al., 14 Mar 2024).
  • Perception: Sensory output includes RGB, depth, semantic/instance segmentation, LiDAR, scene flow, and surface normals.
  • Agents: Bimanual humanoid avatars (24 DOF, congruent with human embodiment) and mobile manipulators such as the Fetch robot (12–13 DOF), with action spaces spanning both low-level actions and temporally extended action primitives (e.g., grasp, placeOnTop).
  • Scene assets: Realistic 3D home scans (e.g., 15 homes in iGibson), extended in later versions to 50 varied scenes (houses, gardens, offices) and over 9,000 objects with rich property annotations in BEHAVIOR-1K (Li et al., 14 Mar 2024).
  • Rendering: Physically-based, photorealistic ray/path tracing (OmniGibson).

This physical and semantic diversity, coupled with infinite randomized activity instantiations, challenges agents to generalize across material, geometric, and contextual variations.
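
The interaction loop such environments expose can be sketched schematically as follows; the class, observation keys, and array shapes are illustrative assumptions standing in for the actual iGibson/OmniGibson interfaces, not their real API:

```python
import numpy as np

class DummyBehaviorStyleEnv:
    """Schematic stand-in for a BEHAVIOR-style simulator interface (illustrative only)."""

    def reset(self):
        # A real environment would sample a randomized activity instantiation here.
        return self._observe()

    def step(self, action: np.ndarray):
        # e.g. a 24-D command for the bimanual avatar (12-13-D for a Fetch-style robot).
        assert action.shape == (24,)
        obs = self._observe()
        done = False
        info = {}  # a real environment would report satisfied goal literals here
        return obs, done, info

    def _observe(self):
        h, w = 128, 128
        # The sensory modalities listed above, with placeholder contents.
        return {
            "rgb": np.zeros((h, w, 3), dtype=np.uint8),
            "depth": np.zeros((h, w), dtype=np.float32),
            "seg_instance": np.zeros((h, w), dtype=np.int32),
            "lidar": np.zeros((360,), dtype=np.float32),
            "scene_flow": np.zeros((h, w, 3), dtype=np.float32),
            "normals": np.zeros((h, w, 3), dtype=np.float32),
        }

env = DummyBehaviorStyleEnv()
obs = env.reset()
obs, done, info = env.step(np.zeros(24))
```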

3. Evaluation, Metrics, and Human Baseline Data

BEHAVIOR introduces a multi-part evaluation protocol:

  • Primary success: Computed as the fraction of goal literals satisfied, $Q = \max_{C_i} \frac{|\{\, l_j \in C_i \mid l_j \text{ is True}\,\}|}{|C_i|}$, where each $C_i$ is a conjunction of atomic goal literals obtained by flattening the logical specification (Srivastava et al., 2021); a sketch of this computation follows the list.
  • Efficiency metrics: Simulated time ($T_\mathrm{sim}$), kinematic disarrangement ($D_k$, total physical disturbance of the environment), logical disarrangement ($D_\ell$, change in symbolic state), navigation length ($L_\mathrm{body}$), and hand displacement ($L_\mathrm{left}$, $L_\mathrm{right}$).
  • Human normalization: All efficiency metrics can be normalized with respect to a dataset of 500 human VR demonstrations (totaling ~758.5 minutes, with full trajectory and sensor logs) (Srivastava et al., 2021). This enables meaningful "human-centric" performance analysis, highlighting gaps between artificial and naturalistic execution.
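
A minimal sketch of the success metric and of human normalization, assuming the flattened-goal representation from the Section 1 sketch (the inputs below are illustrative):

```python
def task_success(candidate_conjunctions, satisfied):
    """Q = max over candidate conjunctions C_i of the fraction of
    literals in C_i that are currently satisfied."""
    return max(
        sum(lit in satisfied for lit in c_i) / len(c_i)
        for c_i in candidate_conjunctions
    )

def human_normalized(agent_value, human_values):
    """Normalize an efficiency metric (e.g. T_sim, D_k, L_body) by the
    mean of the human VR demonstrations for the same activity."""
    return agent_value / (sum(human_values) / len(human_values))

# Illustrative flattened goal: either both apples in the fridge, or both on the counter.
conjunctions = [
    [("inside", "apple_1", "fridge_1"), ("inside", "apple_2", "fridge_1")],
    [("onTop", "apple_1", "counter_1"), ("onTop", "apple_2", "counter_1")],
]
currently_true = {("inside", "apple_1", "fridge_1")}
print(task_success(conjunctions, currently_true))  # 0.5 (partial credit)
```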

In BEHAVIOR-1K, these protocols are extended to support 1,000 activities, long-horizon sequential planning, and state tracking across a broader class of everyday human-relevant tasks (Li et al., 14 Mar 2024).

4. Simulator Independence and Portability

A distinctive feature is the capability to port BEHAVIOR activities across simulators using logic-level task definitions (Liu et al., 2022). BEHAVIOR tasks are represented using logical predicates and synsets, not simulator-dependent URDFs or APIs; this supports:

  • Cross-platform experiments (e.g., instantiating tasks in both iGibson 2.0/OmniGibson and Habitat 2.0 for speed or asset diversity);
  • Abstracting evaluation away from the quirks or limitations of individual simulation engines;
  • Encouraging general methods and representations for embodied agents, compatible with multiple digital worlds.

This property is critical for scaling comparative research, facilitating adaptation to new sim platforms, and preparing for transfer to real robotics.
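
A minimal sketch of what this portability can look like in code (the interface and class names below are assumptions for illustration, not an existing BEHAVIOR, iGibson, or Habitat API): the logic-level goal stays fixed while predicate checks are delegated to whichever simulator backend is plugged in.

```python
from abc import ABC, abstractmethod

class PredicateBackend(ABC):
    """Per-simulator adapter: decides how a logical predicate is checked."""

    @abstractmethod
    def check(self, predicate: str, *objects: str) -> bool:
        """Return whether predicate(objects...) holds in this simulator's state."""

class ToyBackend(PredicateBackend):
    """Stand-in for an iGibson/OmniGibson/Habitat adapter, backed by a set of facts."""

    def __init__(self, facts):
        self.facts = set(facts)

    def check(self, predicate, *objects):
        return (predicate, *objects) in self.facts

def goal_satisfied(goal_literals, backend: PredicateBackend) -> bool:
    """The same logic-level goal evaluates against any backend."""
    return all(backend.check(pred, *args) for pred, *args in goal_literals)

goal = [("inside", "milk_1", "fridge_1"), ("not_open", "fridge_1")]
sim_state = ToyBackend({("inside", "milk_1", "fridge_1"), ("not_open", "fridge_1")})
print(goal_satisfied(goal, sim_state))  # True
```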

5. Extensions and Derivatives

Several major extensions have been built upon the BEHAVIOR foundation:

  • BEHAVIOR-1K (Li et al., 14 Mar 2024): Scales the benchmark to 1,000 activities, grounds definitions in survey data of over 1,400 humans, and integrates OmniGibson for enhanced physical and visual realism (including fluids, cloth, and non-rigid objects). Initial sim2real experiments report that even advanced baselines (RL-VMC, RL-Prim.) underperform due to perception and actuation gaps between sim and real, particularly in visual domain transfer and grasp execution.
  • Mini-BEHAVIOR (Jin et al., 2023): Offers a computationally efficient, symbolic GridWorld version with procedural activity generation and logical evaluation. This enables rapid hypothesis-testing, open-ended plan variation, and logic-level benchmarking for symbolic and reinforcement learning research.
  • EmbodiedEval (Cheng et al., 21 Jan 2025): Focuses on benchmarking multimodal LLMs with interactive embodied tasks, supporting navigation, object and social interaction, and both spatial/attribute QA. Performance of state-of-the-art LLMs (e.g., GPT-4o) lags significantly behind humans (24.88% vs. 97% success), with failures concentrated in spatial grounding, efficient exploration, and plan generation.
  • MFE-ETP (Zhang et al., 6 Jul 2024): Introduces a systematic, multi-capability evaluation spanning object understanding, spatio-temporal perception, task understanding, and embodied reasoning, incorporating BEHAVIOR-100 tasks for compositional multi-modal reasoning assessment. Models show strong deficits in embodied reasoning (highest reported planning success ~19%).

These derivatives expand the scope from household tasks to multi-agent, multimodal, open-ended, and logic-symbolic benchmarking, and facilitate research on large vision-language models and their embodied performance gaps.

6. Impact and Challenges for Embodied AI Research

BEHAVIOR and its descendants have fundamentally shaped embodied AI research in several respects:

  • Long-horizon, multi-object evaluation: Rather than evaluating “micro-tasks,” BEHAVIOR tests skills such as sequential planning under uncertainty, combinatorial manipulation, and state-dependent scene understanding akin to intelligent household service.
  • Human-centricity: Survey-driven activity selection and VR demonstration datasets enable research directly relevant to human-assistive robotics.
  • Sim2Real challenges: Empirical results highlight sim2real gaps—particularly in perception (texture, lighting, domain shift) and grasp stability—that are not readily addressed by existing RL methods, motivating domain adaptation and robust perception research.
  • Generalization: Infinite instance and goal sampling, realistic object and layout variation, and logical description enforce generalization, not memorization.

However, even state-of-the-art RL and imitation learning approaches (e.g., Soft Actor-Critic, PPO, RL with action primitives) largely fail to solve BEHAVIOR's most complex activities, even when given privileged control or observation signals. Most methods struggle with:

  • Extended temporal horizon (hundreds of sequential actions required);
  • Multi-property state transitions (e.g., cleaning, cooking, assembly, and spatial placement);
  • Robustness to variations in initial state and object configuration.

A plausible implication is that cognitive-inspired architectures with memory, hierarchical abstraction, and richer learning from human data will remain necessary focal points for future research.

7. Public Availability, Resources, and Ecosystem

The BEHAVIOR benchmark, scene assets, activity definitions, and VR demonstration datasets are freely available at https://behavior.stanford.edu (Srivastava et al., 2021, Li et al., 14 Mar 2024). The platform supports replication, fair cross-method comparison, and community-driven extension. Integration into major simulators (iGibson, OmniGibson, Habitat 2.0) and benchmarking frameworks (LEGENT for MLLM evaluation (Cheng et al., 21 Jan 2025)) encourages reproducibility and comparative evaluation.

Table: Major BEHAVIOR Benchmark Milestones

Release           | Activities | Environment          | Features/Notes
BEHAVIOR v1       | 100        | iGibson 2.0          | Logic-symbolic BDDL, 15 home scans, VR demos
Habitat 2.0 port  | 45         | Habitat 2.0          | Simulator-independent, kinematic-only adaptation
BEHAVIOR-1K       | 1,000      | OmniGibson           | Human-surveyed goals, fluid/cloth/rigid objects, 50 scenes
Mini-BEHAVIOR     | subset     | GridWorld (MiniGrid) | Procedural generation, fast symbolic decision-making
EmbodiedEval      | 328        | LEGENT/unified sim   | Interactive, MLLM-centric, spatial/social QA

Summary

The BEHAVIOR Benchmark for Embodied AI operationalizes a rigorous, human-centric evaluation paradigm for embodied intelligence by uniting logical activity definition, physically plausible simulation, and comprehensive success and efficiency measurement. Its public resources, scale, task diversity, and extensibility have motivated methodological advances and underscored the complexity gap between artificial agents and human demonstrators—especially on long-horizon and physically realistic tasks. The framework's adaptation across different simulation platforms and research focuses (from general physical manipulation to multimodal LLM evaluation) marks BEHAVIOR as a pivotal resource for the long-term advancement of robust, generalizable, and human-serving embodied AI.