
STEVO-Bench: Standard Evaluation Protocol

Updated 3 July 2026
  • STEVO-Bench is a standardized evaluation protocol that defines controlled action and observation signals to assess video models' internal state evolution.
  • It uses synthetic occlusions and controlled prompts to measure state progress, physical plausibility, and temporal coherence in real-world processes like melting or pouring.
  • Empirical results indicate that current models often stall state evolution during occlusion, stressing the need for explicit, decoupled internal state representations.

STEVO-Bench is a standardized evaluation protocol designed to measure the ability of video world models to decouple internal state evolution from observation, with specific emphasis on naturalistic, occlusion-interrupted processes. By formalizing “observation control” as an explicit benchmark variable, STEVO-Bench enables systematic assessment of whether state-evolving events—such as melting, pouring, or motion—are faithfully modeled by generative video systems in the absence of visual access.

1. Formal Definition and Task Construction

STEVO-Bench formalizes its benchmark scenario as follows. Given a video sequence $x_{1:T}$ consisting of $T$ frames, a control signal $c = (c^{\mathrm{action}}, c^{\mathrm{obs}})$ is introduced. The action component $c^{\mathrm{action}}$ triggers a state-evolving event (e.g., “pour water”), and the observation component $c^{\mathrm{obs}}$ applies an observation interruption via occlusion (e.g., by inserting a piece of cardboard) or by commanding the camera to look away. The resultant masked sequence is defined as

$$x_{1:T}^{(c)} = \mathcal{O}(x_{1:T},\,c^{\mathrm{obs}})$$

where $\mathcal{O}(\cdot)$ executes the occlusion operation on the affected frames $\{k, \ldots, \ell\}$. The video model $f$ receives the pre-interruption frames $\{x_1, \ldots, x_k\}$ together with the control signal $c$ and must generate a complete trajectory $\hat{x}_{1:T}$.
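As a rough illustration of the occlusion operator, the sketch below blanks out a contiguous range of frames in a toy video. The `Frame` type, the `occlude` function, and the constant `fill` value are assumptions made for this example; the benchmark itself realizes occlusion through in-scene barriers or camera motion rather than pixel blanking.

```python
from typing import List, Sequence

# Hypothetical frame type: a 2D grid of grayscale pixel values.
Frame = List[List[int]]


def occlude(frames: Sequence[Frame], k: int, l: int, fill: int = 0) -> List[Frame]:
    """Return a copy of `frames` with frames k..l (1-indexed, inclusive) blanked."""
    if not 1 <= k <= l <= len(frames):
        raise ValueError("expected 1 <= k <= l <= T")
    masked = []
    for t, frame in enumerate(frames, start=1):
        if k <= t <= l:
            # Occluded frame: every pixel is replaced by the fill value.
            masked.append([[fill] * len(row) for row in frame])
        else:
            # Visible frame: copied unchanged so the input is never mutated.
            masked.append([row[:] for row in frame])
    return masked


# Toy 5-frame video of 2x2 frames whose pixel value equals the frame index.
video = [[[t, t], [t, t]] for t in range(1, 6)]
masked = occlude(video, k=2, l=4)
```

Because the original frames are left untouched, the interruption stays reversible: the uninterrupted sequence remains available as the reference.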

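To quantify fidelity, STEVO-Bench proposes submetrics focused on state progress, physical plausibility, and temporal coherence. The success criterion applies an indicator function to each submetric, and a trial counts as a joint success only when all three hold:

$$\mathrm{Success} = \mathbb{1}[\text{state progress}] \cdot \mathbb{1}[\text{physical plausibility}] \cdot \mathbb{1}[\text{temporal coherence}]$$

A combined task failure is reported as the complement:

$$\mathrm{Failure} = 1 - \mathrm{Success}$$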

Submetrics as well as joint success rates are reported for each model-task pair (Ma et al., 13 Mar 2026).
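One way to read the joint criterion is as a logical AND over three boolean verdicts per trial. The sketch below follows that reading and computes per-metric and joint rates; the `Verdict` class, its field names, and the `rates` helper are illustrative assumptions, not the benchmark's own code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Verdict:
    """Per-trial boolean verdicts on the three axes (names are illustrative)."""
    progress: bool
    plausible: bool
    coherent: bool

    @property
    def success(self) -> bool:
        # Joint success: the product of three indicators, i.e. a logical AND.
        return self.progress and self.plausible and self.coherent


def rates(verdicts: Iterable[Verdict]) -> Dict[str, float]:
    """Percentage of trials passing each axis and passing all three jointly."""
    trials = list(verdicts)
    if not trials:
        raise ValueError("need at least one control-passing trial")
    n = len(trials)
    return {
        "progress": 100.0 * sum(v.progress for v in trials) / n,
        "physics": 100.0 * sum(v.plausible for v in trials) / n,
        "coherence": 100.0 * sum(v.coherent for v in trials) / n,
        "success": 100.0 * sum(v.success for v in trials) / n,
    }


# Four made-up trials, used only to exercise the helper.
trials = [
    Verdict(True, True, True),
    Verdict(False, True, True),
    Verdict(False, True, False),
    Verdict(True, False, True),
]
summary = rates(trials)
```

Under this reading the joint rate can never exceed the lowest per-metric rate, which is why a weak state-progress score caps overall success.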

2. Benchmark Components and Control Mechanisms

STEVO-Bench encompasses 225 tasks spanning six taxonomic categories of real-world evolutions:

  • Continuous processes (e.g., melting ice)
  • Kinematics (e.g., object rolling)
  • Relational transitions (e.g., domino chains)
  • Causal interventions (e.g., lamp toggling)
  • State transformations (e.g., inflation)
  • Expected animate actions (e.g., walking)

Each task is defined by an initial frame, an action control prompt, and an observation control directive. Occlusion is realized either by simulated in-scene barriers (“place cardboard in front of the camera”, “turn off the lights”) or by programmed camera movement (“lookaway” via a sequence of trajectory deltas). For camera-based occlusion, a deterministic parameterization (e.g., a fixed per-frame spin for 30 frames) ensures the object is absent during the interrupted frames $\{k, \ldots, \ell\}$ and reappears once the interruption ends.
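To make the camera-based "lookaway" concrete, the sketch below spins a camera by a constant yaw delta per frame and checks when an object leaves and re-enters the horizontal field of view. The 12-degree delta, the 60-degree field of view, and both function names are hypothetical values chosen for illustration; the source does not state the actual per-frame angle.

```python
from typing import List


def yaw_trajectory(delta_deg: float, n_frames: int) -> List[float]:
    """Cumulative camera yaw after each frame for a constant per-frame spin."""
    return [delta_deg * (i + 1) for i in range(n_frames)]


def object_visible(yaw_deg: float, object_bearing_deg: float, fov_deg: float) -> bool:
    """True if the object's bearing lies strictly inside the horizontal FOV."""
    # Signed angular difference wrapped into [-180, 180).
    diff = (object_bearing_deg - yaw_deg + 180.0) % 360.0 - 180.0
    return abs(diff) < fov_deg / 2.0


# Hypothetical parameters: 12 degrees per frame for 30 frames is a full turn.
DELTA_DEG = 12.0
N_FRAMES = 30
FOV_DEG = 60.0

yaws = yaw_trajectory(DELTA_DEG, N_FRAMES)
# 1-indexed frames in which an object at bearing 0 is out of view.
hidden = [t for t, yaw in enumerate(yaws, start=1)
          if not object_visible(yaw, 0.0, FOV_DEG)]
```

With these assumed values the object is out of view for one contiguous block of frames and is back in view by the final frame, which is the property a deterministic parameterization is meant to guarantee.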

The ground-truth sequence $x_{1:T}$ is always acquired from uninterrupted video; interruptions are synthetic and fully reversible for evaluation.

3. Evaluation Protocol and Metrics

The protocol for each (model $f$, task) pair consists of:

  1. Prompting $f$ with the task's initial frame and the control signal $c$ to obtain the generated video $\hat{x}_{1:T}$.
  2. Verifying “control success”:
    • Observation: Main object fully occluded or out of frame for frames $\{k, \ldots, \ell\}$.
    • Action: Trigger event initiates before interruption.
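The control-success gate can be sketched as two boolean checks on per-frame annotations. The inputs here (a per-frame visibility flag for the main object and the frame at which the trigger event starts) are assumed to come from some upstream annotator; the function name and signature are hypothetical.

```python
from typing import Optional, Sequence


def control_success(visible: Sequence[bool], action_start: Optional[int],
                    k: int, l: int) -> bool:
    """Check both control conditions for one generated video.

    visible[t - 1] is True when the main object is visible in frame t.
    action_start is the 1-indexed frame where the trigger event begins,
    or None if the event never starts. Frames k..l form the interruption.
    """
    if not 1 <= k <= l <= len(visible):
        raise ValueError("expected 1 <= k <= l <= T")
    # Observation control: the object must be hidden in every frame k..l.
    observation_ok = not any(visible[k - 1:l])
    # Action control: the trigger event must start before the interruption.
    action_ok = action_start is not None and action_start < k
    return observation_ok and action_ok


# Toy annotations for a 6-frame video with an interruption over frames 3..5.
passing = control_success([True, True, False, False, False, True], 2, k=3, l=5)
leaky = control_success([True, True, False, True, False, True], 2, k=3, l=5)
late = control_success([True, True, False, False, False, True], 4, k=3, l=5)
```

Only trials for which this gate returns `True` would proceed to the three-axis assessment.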

Only control-passing trials are evaluated; failures are discarded. Automatic assessment relies on “specialist verifiers”, which are vision-language model (VLM) prompts with majority-vote aggregation, on three axes:

  • State Progress: Did a qualitative change occur during occlusion?
  • Physical Plausibility: Is any basic physics violated (e.g., object discontinuities, substance level inconsistency)?
  • Temporal Coherence: Did discontinuities, spurious cuts, or teleports arise over the occlusion boundary?
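Majority-vote aggregation over repeated verifier calls can be sketched as follows. The number of votes per axis (three here) and the axis keys are assumptions for the example, and the boolean votes stand in for parsed VLM answers.

```python
from typing import Dict, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return True when strictly more than half of the votes are True."""
    if len(votes) == 0 or len(votes) % 2 == 0:
        raise ValueError("use an odd, non-zero number of votes so ties cannot occur")
    return sum(votes) > len(votes) // 2


def aggregate(axis_votes: Dict[str, Sequence[bool]]) -> Dict[str, bool]:
    """Apply the majority vote independently on each evaluation axis."""
    return {axis: majority_vote(votes) for axis, votes in axis_votes.items()}


# Made-up votes from three verifier calls per axis.
votes = {
    "progress": [False, True, False],
    "physics": [True, True, False],
    "coherence": [True, True, True],
}
verdict = aggregate(votes)
```

Requiring an odd vote count is a design choice of this sketch; it avoids having to define a tie-breaking rule.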

Joint success, as well as per-metric rates, is reported for each method. Control success rates are high for most models, which confirms that failures in state evolution cannot be attributed to poor instruction compliance.

4. Empirical Findings and Model Comparison

The following table summarizes key results for leading image/text-to-video and camera-controlled models on the occlusion-interrupted benchmark:

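Image/text-to-video models:

| Model | Success (%) | Progress (%) | Physics (%) | Coherence (%) |
|---|---|---|---|---|
| Veo 3 | 8.7 | 17.4 | 82.6 | 66.5 |
| Sora 2 Pro | 8.1 | 13.1 | 85.5 | 69.7 |
| WAN 2.2 | 0.9 | 7.7 | 52.0 | 58.4 |
| HunyuanVideo 1.5 | 0.9 | 4.1 | 42.1 | 59.1 |
| CogVideoX 1.5 | 0.5 | 1.4 | 68.5 | 67.1 |

Camera-controlled models:

| Model | Success (%) | Progress (%) | Physics (%) | Coherence (%) |
|---|---|---|---|---|
| Genie 3 | 0.0 | 2.9 | 15.2 | 27.3 |
| HunyuanWorldPlay | 0.0 | 0.0 | 72.2 | 88.2 |
| Lingbot-World | 0.0 | 3.4 | 40.7 | 76.3 |
| GEN3C | 0.0 | 0.0 | 30.6 | 82.4 |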
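The reported numbers can be sanity-checked against a joint-AND reading of the success criterion, under which the joint rate cannot exceed any single per-metric rate. The short script below transcribes the table values and also finds the lowest-scoring axis per model; the dictionary layout and helper name are choices made for this sketch.

```python
from typing import Dict, Tuple

# (success, progress, physics, coherence) in percent, transcribed from the table.
RESULTS: Dict[str, Tuple[float, float, float, float]] = {
    "Veo 3": (8.7, 17.4, 82.6, 66.5),
    "Sora 2 Pro": (8.1, 13.1, 85.5, 69.7),
    "WAN 2.2": (0.9, 7.7, 52.0, 58.4),
    "HunyuanVideo 1.5": (0.9, 4.1, 42.1, 59.1),
    "CogVideoX 1.5": (0.5, 1.4, 68.5, 67.1),
    "Genie 3": (0.0, 2.9, 15.2, 27.3),
    "HunyuanWorldPlay": (0.0, 0.0, 72.2, 88.2),
    "Lingbot-World": (0.0, 3.4, 40.7, 76.3),
    "GEN3C": (0.0, 0.0, 30.6, 82.4),
}

AXES = ("progress", "physics", "coherence")


def bottleneck(row: Tuple[float, float, float, float]) -> str:
    """Name of the per-metric axis with the lowest rate for one model."""
    per_axis = dict(zip(AXES, row[1:]))
    return min(per_axis, key=per_axis.get)


# Joint success should never exceed the weakest individual axis.
consistent = all(row[0] <= min(row[1:]) for row in RESULTS.values())
bottlenecks = {model: bottleneck(row) for model, row in RESULTS.items()}
```

For every listed model the lowest axis is state progress, which matches the observation that stalled state evolution, rather than physics or coherence, limits joint success.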

Under occlusion-interrupted evaluation, joint success stays below 9% for every model, and the camera-controlled models score 0.0%. In comparison, when the same tasks are evaluated in the fully observed regime (no occlusion), the state progress metric and overall success are both higher. The progress gap, the difference in state progress between the fully observed and occlusion-interrupted regimes, is reported across models. The dominant failure modes are process stalling during occlusion and incoherence immediately after the occlusion ends. Camera-controlled models frequently generate static scenes during lookaway and, in rare dynamic cases, ignore the camera trajectory context. Memory-augmented models (e.g., VMem) successfully memorize pre-occlusion frames but do not advance the world state during observation interruption (Ma et al., 13 Mar 2026).

5. Analysis, Failure Modes, and Diagnostic Insights

STEVO-Bench reveals that present video world models exhibit a strong coupling between state evolution and visual observation. Principal diagnostic insights include:

  • Reliance on pixel-level continuity: Models frequently “pause” all state change when the field of view is blocked, in contrast to real-world physics, which proceeds regardless of observation.
  • Conflation of evolution and visibility: Physical processes (melting, pouring, inflation) are only advanced while they are visible to the model.
  • Coupled control failure: Camera-controlled models show a trade-off between scene dynamics and camera motion, commonly freezing one when the other is present.
  • Inefficacy of memory augmentation: Storing visible frames does not ensure correct latent world updating during unobserved intervals.
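The "pause" failure mode can be illustrated with a scalar state trace, such as a fill level per frame. The sketch below flags a stall when the state just after the interruption matches the state just before it. The trace values, the tolerance, and the convention that frame k-1 is the last visible frame and frame l+1 the first visible one afterwards are assumptions of this example; the benchmark itself judges progress with VLM verifiers, not scalar traces.

```python
from typing import Sequence


def stalled_during_occlusion(state: Sequence[float], k: int, l: int,
                             tol: float = 1e-6) -> bool:
    """True if the state did not change across the occluded frames k..l.

    state[t - 1] is a scalar process state (e.g. fill level) at frame t.
    """
    if not 2 <= k <= l < len(state):
        raise ValueError("need a visible frame before k and after l")
    before = state[k - 2]  # frame k - 1, last frame before the interruption
    after = state[l]       # frame l + 1, first frame after the interruption
    return abs(after - before) <= tol


# Hypothetical pouring traces over 8 frames with an interruption on frames 4..6.
faithful = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # keeps filling while hidden
frozen = [0.0, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2, 0.3]    # resumes where it paused

faithful_stalled = stalled_during_occlusion(faithful, k=4, l=6)
frozen_stalled = stalled_during_occlusion(frozen, k=4, l=6)
```

The frozen trace is temporally smooth at the boundary yet physically wrong, which is why coherence alone cannot stand in for state progress.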

These findings isolate a fundamental architectural and dataset limitation in current video world models: the lack of an explicit, observation-decoupled internal state representation.

6. Recommendations and Future Directions

STEVO-Bench proposes several avenues for closing the observed decoupling gap:

  • Data augmentation with synthetically occlusion-interrupted sequences, facilitating supervised learning of “behind-occluder” state prediction.
  • Architectural innovations involving explicit stateful representations evolved by transition operators or physical priors, instead of relying exclusively on attention mechanisms over visible pixels.
  • Attention refinements such as masked cross-attention that intentionally bypass occluded frames, thus mitigating process stalling.
  • Metric extensions that require models to regress or update underlying scalar quantities (e.g., poured volume, temperature) in addition to pixel-level predictions.
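As a toy version of the explicit-state recommendation, the sketch below keeps a scalar world state that a transition operator advances on every frame, while an observation is emitted only when the scene is visible. The `PourState` class, the constant pour rate, and the `rollout` helper are hypothetical and far simpler than any learned model; they only illustrate decoupling state updates from visibility.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class PourState:
    """Minimal explicit world state: liquid volume in the container."""
    volume_ml: float


def step(state: PourState, rate_ml_per_frame: float) -> PourState:
    """Transition operator: advance the process by one frame."""
    return PourState(state.volume_ml + rate_ml_per_frame)


def rollout(initial: PourState, visible: Sequence[bool],
            rate_ml_per_frame: float) -> List[Optional[float]]:
    """Advance the state every frame; observe it only when visible."""
    state = initial
    observations: List[Optional[float]] = []
    for is_visible in visible:
        # The state update does not depend on visibility.
        state = step(state, rate_ml_per_frame)
        # Rendering is the only step gated by the observation signal.
        observations.append(state.volume_ml if is_visible else None)
    return observations


# Frames 3..5 are occluded; the volume keeps rising behind the occluder.
visibility = [True, True, False, False, False, True]
observed = rollout(PourState(0.0), visibility, rate_ml_per_frame=10.0)
```

Here the first post-occlusion observation reflects all the hidden steps, in contrast to a model that only advances the process while it is rendering visible frames.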

A plausible implication is that successful decoupling of state evolution from observation will require paradigm shifts at both data and architectural levels, treating visibility as just one sensor modality among others, rather than the primary driver of temporal progression.

STEVO-Bench establishes a crucial and rigorous platform for the development and evaluation of future video world models, offering metrics, control protocols, and diagnostic tests essential for advancing beyond “observation-bound” generative modeling (Ma et al., 13 Mar 2026).

References (1)
