ENACT Benchmark: Embodied Cognition
- ENACT Benchmark is a world-modeling framework that evaluates embodied cognition by leveraging a POMDP formulation to assess sequence reasoning in vision-language models.
- It employs forward and inverse world modeling tasks that test agents’ ability to simulate prospective interactions and infer past actions with precise permutation matching.
- The benchmark integrates simulation-based data pipelines with dynamic key-frame sampling and rigorous evaluation metrics to highlight significant human–model performance gaps.
ENACT Benchmark designates a suite of methods and datasets united by the acronym “ENACT” but spanning divergent domains. Most notably, ENACT is a benchmark for embodied world modeling in vision-language models (VLMs), recently formalized in (Wang et al., 26 Nov 2025). It evaluates the capacity of artificial agents to perform sequence reasoning over egocentric interaction data, probing core prerequisites for embodied cognition. This article focuses on the embodied-cognition ENACT benchmark, detailing its motivation, mathematical formulation, task constructs, dataset pipeline, evaluation metrics, and major empirical insights.
1. Theoretical Underpinnings: ENACT as World Modeling POMDP
ENACT formalizes the evaluation of embodied cognition as a Partially Observable Markov Decision Process (POMDP). The core elements are:
- State space $\mathcal{S}$: Symbolic scene graphs derived from the simulator, encoding object nodes (e.g., “plate₉₄”) and relational predicates (e.g., OnTop(plate₉₄, table)).
- Action space $\mathcal{A}$: Scene-graph differences $a_t = \Delta(g_t, g_{t+1})$, capturing the minimal predicate transitions between consecutive states.
- Observation space $\Omega$: Egocentric RGB images $o_t$, rendered from the simulation viewpoint.
- Transition function $T$: Deterministic given full simulator knowledge; $T(s_{t+1} \mid s_t, a_t) = 1$ if and only if $s_{t+1}$ results from applying the predicate delta $a_t$ to $s_t$.
- Observation function $O$: Deterministically maps each state $s_t$ to its rendered image $o_t$.
- Reward: Not used; VQA-style questions supplant reinforcement learning.
For any key-frame trajectory $(o_0, a_0, o_1, \dots, a_{k-1}, o_k)$, ENACT defines a finite-horizon POMDP fragment of horizon $k$, where each $o_t$ is the rendering of state $s_t$, each $a_t$ is the predicate delta between $s_t$ and $s_{t+1}$, and the agent observes only the visible subset of each state transition. This structure inherently enforces partial observability, which is crucial for embodied cognition.
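The deterministic transition structure can be sketched in code. The representation below is an illustrative assumption, not the benchmark's actual data structures: a scene graph is modeled as a set of predicate tuples and an action as an added/removed predicate delta.

```python
from dataclasses import dataclass

# Assumption for illustration: a scene graph is a set of relational
# predicates such as ("OnTop", "plate_94", "table"). ENACT's simulator
# states are richer symbolic graphs.
SceneGraph = frozenset

@dataclass(frozen=True)
class Delta:
    """An action as a minimal predicate transition (added/removed predicates)."""
    added: frozenset
    removed: frozenset

def apply_delta(state: SceneGraph, action: Delta) -> SceneGraph:
    """Deterministic transition: the successor state is obtained by
    removing the deleted predicates and inserting the added ones."""
    assert action.removed <= state, "delta must be applicable to the state"
    return SceneGraph((state - action.removed) | action.added)

# A one-step key-frame fragment: the robot picks a plate up off the table.
s0 = SceneGraph({("OnTop", "plate_94", "table")})
a0 = Delta(added=frozenset({("Grasping", "robot", "plate_94")}),
           removed=frozenset({("OnTop", "plate_94", "table")}))
s1 = apply_delta(s0, a0)
```

Because the transition is a pure function of the state and the delta, $T(s_{t+1} \mid s_t, a_t)$ is a point mass, matching the deterministic formulation above.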
2. Sequence Reordering: Task Definitions and Formalization
ENACT introduces two distinct but related sequence reordering tasks, each reflecting a fundamental cognitive process in action-conditioned prediction:
- Forward World Modeling (prospective simulation): Given an initial observation $o_0$, the ordered action sequence $(a_0, \dots, a_{k-1})$, and a shuffled set of the subsequent observations $\{o_1, \dots, o_k\}$, the model must output a permutation $\pi$ such that $(o_{\pi(1)}, \dots, o_{\pi(k)})$ restores the true temporal order induced by the actions; formally, $o_{\pi(t)} = o_t$ for all $t \in \{1, \dots, k\}$.
- Inverse World Modeling (retrospective inference): Given the initial observation $o_0$, the ordered subsequent observations $(o_1, \dots, o_k)$, and a shuffled set of the actions $\{a_0, \dots, a_{k-1}\}$, the model outputs a permutation $\pi$ such that $a_{\pi(t)}$ is the action transforming $o_t$ into $o_{t+1}$; formally, $a_{\pi(t)} = a_t$ for all $t \in \{0, \dots, k-1\}$.
Solving these tasks necessitates recognition of affordances, action-effect reasoning, embodied awareness, and integration of temporally-partial egocentric observations, all without reliance on pixel-level image synthesis.
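A reordering instance of this kind can be generated by shuffling the true sequence and recording the permutation that restores temporal order. The sketch below is a minimal illustration; the function name `make_forward_item` and the item layout are assumptions, not the benchmark's API.

```python
import random

def make_forward_item(observations, actions, seed=0):
    """Forward world modeling item: given o_0 and the ordered actions,
    the subsequent observations o_1..o_k are presented shuffled; the
    ground-truth answer is the permutation restoring temporal order."""
    rng = random.Random(seed)
    future = observations[1:]                  # o_1 .. o_k in true order
    order = list(range(len(future)))
    rng.shuffle(order)                         # shuffled presentation order
    shuffled = [future[i] for i in order]
    # gt[t] = index into `shuffled` of the observation belonging at step t+1
    gt = [order.index(t) for t in range(len(future))]
    return {"o0": observations[0], "actions": list(actions),
            "choices": shuffled, "answer": gt}

obs = ["o0", "o1", "o2", "o3"]   # placeholder egocentric key frames
acts = ["a0", "a1", "a2"]        # scene-graph deltas rendered as text
item = make_forward_item(obs, acts, seed=42)
# Applying the answer permutation to the choices recovers o_1..o_3 in order.
restored = [item["choices"][j] for j in item["answer"]]
```

The inverse task is symmetric: the observations stay ordered and the actions are shuffled, with the same permutation-recording construction.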
3. Dataset Creation Pipeline: Simulation, Segmentation, and Question Synthesis
Dataset curation leverages the BEHAVIOR simulator in a fully automated pipeline consisting of:
3.1 Segmented Frame Extraction
From simulation replays, frames are selected when the predicate delta persists for at least 40 time steps (approximately 1.3 s). One-hot “signature” vectors computed for each candidate frame prevent near-duplicates via a cosine-similarity threshold, yielding unique, temporally persistent “key frames”.
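The selection rule can be sketched as follows. The 40-step persistence window comes from the text; the signature encoding and the `sim_threshold` value are illustrative placeholders for details not restated here.

```python
import numpy as np

def extract_key_frames(run_lengths, signatures, persistence=40,
                       sim_threshold=0.9):
    """Select frames whose predicate delta persists >= `persistence` time
    steps (~1.3 s) and whose one-hot signature is not a near-duplicate of
    an already-kept frame under cosine similarity.
    `run_lengths[i]` is the persistence (in steps) of frame i's delta;
    `signatures[i]` is its one-hot predicate signature vector.
    The 0.9 threshold is an assumed placeholder value."""
    kept_idx, kept_sigs = [], []
    for i, (run, sig) in enumerate(zip(run_lengths, signatures)):
        if run < persistence:
            continue                      # delta too transient
        v = np.asarray(sig, dtype=float)
        norm = np.linalg.norm(v)
        if norm > 0:
            v = v / norm                  # unit vector for cosine similarity
        if any(float(v @ u) >= sim_threshold for u in kept_sigs):
            continue                      # near-duplicate of a kept key frame
        kept_idx.append(i)
        kept_sigs.append(v)
    return kept_idx
```

Normalizing the signatures once lets the cosine similarity reduce to a dot product against each retained key frame.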
3.2 Key-Frame Trajectory Sampling (KFTS Algorithm)
A directed acyclic graph (DAG) is constructed over the key frames. Dynamic programming computes valid sampled key-frame index trajectories of target length $k$, balancing coverage and diversity.
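One standard way to realize such a scheme is to count length-$k$ paths in the DAG by dynamic programming and then sample a path proportionally to those counts. The sketch below follows that pattern under assumed interfaces; it is not the paper's exact KFTS algorithm.

```python
import random
from collections import defaultdict

def sample_trajectory(n_frames, edges, k, seed=0):
    """Sample a key-frame index trajectory with k transitions (k+1 nodes)
    from a DAG, uniformly over all valid length-k paths.
    `edges[i]` lists key frames reachable from frame i (later in time)."""
    # paths[l][i]: number of paths with l remaining transitions starting at i.
    paths = [defaultdict(int) for _ in range(k + 1)]
    for i in range(n_frames):
        paths[0][i] = 1
    for l in range(1, k + 1):
        for i in range(n_frames):
            paths[l][i] = sum(paths[l - 1][j] for j in edges.get(i, []))
    starts = [i for i in range(n_frames) if paths[k][i] > 0]
    if not starts:
        return None  # no valid trajectory of this horizon exists
    rng = random.Random(seed)
    # Weight the start (and each successor) by its downstream path count,
    # which makes the overall draw uniform over complete paths.
    node = rng.choices(starts, weights=[paths[k][i] for i in starts])[0]
    traj = [node]
    for l in range(k, 0, -1):
        succ = [j for j in edges.get(node, []) if paths[l - 1][j] > 0]
        node = rng.choices(succ, weights=[paths[l - 1][j] for j in succ])[0]
        traj.append(node)
    return traj
```

Because each node's weight equals the number of completions beneath it, every valid length-$k$ path is drawn with equal probability.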
3.3 VQA Question Pair Synthesis
For each sampled trajectory:
- Forward task: provide $o_0$, the ordered actions, and the shuffled future observations.
- Inverse task: provide $o_0$, the ordered observations, and the shuffled actions.
Actions are rendered in natural-language or symbolic templates. Applied to 29 activities across multiple horizon lengths, this process results in 8,972 QA pairs. An additional real-world subset (960 QAs) is produced from kitchen/workspace scenes.
4. Evaluation Metrics and Verification Protocols
Given the combinatorial nature of possible correct orderings, ENACT employs specialized metrics and verification procedures:
- Task Accuracy (TA): Fraction of QA items where the predicted permutation exactly matches or semantically covers the ground truth, as judged by an online verifier: $\mathrm{TA} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\mathrm{verify}(\hat{\pi}_i, \pi_i^\star)\right]$. For the forward task, each predicted state change must include all ground-truth visible predicates; for the inverse task, each predicted action must be a subset of the reference actions.
- Pairwise Accuracy (PA): Micro-averaged score over consecutive pairs. Each step of a predicted sequence is counted as correct if its atomic state difference covers the ground-truth visible predicates (forward) or its predicted action is a subset of the reference actions (inverse); PA is the fraction of correct steps pooled across all items.
This dual-level verification allows the assessment of both holistic sequence reordering and local temporal consistency.
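Interpreted as subset checks over symbolic predicates, the two metrics can be sketched as follows. The interfaces are simplified assumptions; the released evaluation scripts implement the full verification protocol.

```python
def task_accuracy(pred_steps, gt_steps, forward=True):
    """Task Accuracy for one QA item: every step must pass verification.
    Each step is a set of predicates (forward: a visible state change;
    inverse: an action description). Forward requires the prediction to
    cover all ground-truth visible predicates; inverse requires each
    predicted action to be a subset of the reference actions."""
    if len(pred_steps) != len(gt_steps):
        return 0.0
    ok = all((g <= p) if forward else (p <= g)
             for p, g in zip(pred_steps, gt_steps))
    return float(ok)

def pairwise_accuracy(items, forward=True):
    """Micro-averaged Pairwise Accuracy: pool the per-step checks over
    all (prediction, ground-truth) sequence pairs in `items`."""
    correct = total = 0
    for pred_steps, gt_steps in items:
        for p, g in zip(pred_steps, gt_steps):
            total += 1
            correct += (g <= p) if forward else (p <= g)
    return correct / max(total, 1)
```

TA is all-or-nothing per item, while PA gives partial credit for locally consistent steps, which is what separates holistic reordering from local temporal consistency.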
5. Experimental Results: Human–Model Gap, Task Asymmetries, and Biases
Key empirical findings highlight fundamental limitations in current VLMs’ embodied cognition:
- Human–Model Performance Gap:
Humans maintain high TA and PA across all horizon lengths (up to $10$). GPT-5 (proprietary) and InternVL3.5-241B (open) perform best at the shortest horizons, with both forward and inverse scores dropping steeply as the horizon grows; at the longest horizons, model metrics approach random chance while human performance remains high.
- Forward/Inverse Asymmetry:
VLMs consistently score higher on inverse world modeling than on forward world modeling (e.g., GPT-5's inverse accuracy exceeds its forward accuracy at matched horizons). This indicates stronger retrospective reasoning (language-driven action inference) than prospective visual simulation.
- Anthropocentric Biases:
- Camera Intrinsics: Widening the aperture angle from the human baseline (40°) to 60°, 80°, or a fisheye lens produces a statistically significant decrease in accuracy; narrowing it to 30° shows no difference.
- Camera Height: Raising the camera above the human-baseline height reduces forward accuracy; lowering it has a negligible effect.
- Robot Appearance: Varying the gripper's color has no significant effect, indicating invariance to robot appearance.
- Handedness: Both models and humans show higher precision and recall for RightGrasping compared to LeftGrasping. Cross-misattribution rates mirror human population right-handedness priors.
- Sim-to-Real Consistency:
When evaluated on real-world scenes, model trends are reproduced and absolute values remain broadly consistent. Varied rendering styles (realistic, stylized, path-traced) have no statistically significant impact, implicating shortcomings in world-modeling logic rather than low-level visual realism.
6. Technical Significance and Public Resources
ENACT exposes significant gaps in long-horizon, partially observable sequence reasoning for current VLMs, setting a high bar for embodied cognition. The design precludes shortcutting by low-level synthesis or bias exploitation and focuses on genuine action-effect, affordance, and memory integration.
The benchmark, code, and data are made fully public for reproducibility and extension:
- Website, leaderboards, and VQA demos: enact-embodied-cognition.github.io
- Code and pipeline: github.com/mll-lab-nu/ENACT
- Dataset: huggingface.co/datasets/MLL-Lab/ENACT
Supplementary material includes all evaluation scripts, prompts, verifiers, human annotation UIs, as well as detailed ablations and error analyses (Wang et al., 26 Nov 2025).
7. Context Within ENACT Nomenclature
While ENACT is sometimes used as an acronym in other domains (e.g., ENACT-Heart: ensemble-based CNN/ViT assessment for heart sounds (Han et al., 24 Feb 2025); entropy-based attention clustering in object detection (Savathrakis et al., 2024)), within embodied AI, ENACT uniquely denotes the embodied cognition world-modeling benchmark as detailed above. This specificity is critical to avoid conflation with unrelated ENACT methods in audio and vision transformer efficiency.