ViPOcc Evaluation Framework
- ViPOcc is a process-aware evaluation framework that assesses generative video reasoning by evaluating both intermediate processes and final outcomes.
- It employs a hierarchical rubric and a VLM-as-judge mechanism to ensure compliance with explicit process and outcome constraints.
- The framework covers 16 diverse tasks across six reasoning domains, offering scalable and interpretable evaluation metrics for video models.
ViPOcc (Video Process-Outcome Consistency) denotes a process-aware evaluation framework for Generative Video Reasoning (GVR), introduced in the context of the VIPER benchmark. The framework operationalizes rigorous analysis of both the intermediate reasoning process and the final outcome of generative video models, addressing a shortcoming of traditional evaluation protocols: susceptibility to "outcome-hacking," the phenomenon whereby correct outcomes are achieved through erroneous processes. The protocol introduces explicit formalizations, a hierarchical rubric, and a vision-language-model (VLM)-as-judge paradigm to enable comprehensive, extensible, and interpretable assessment of GVR capabilities across a broad spectrum of tasks and reasoning domains (Li et al., 31 Dec 2025).
1. Unified Benchmark Structure and Task Taxonomy
The ViPOcc framework is instantiated via VIPER (VIdeo Process-aware Evaluation for Reasoning), a benchmark comprising 16 tasks spanning six reasoning domains: Temporal, Structural, Symbolic, Spatial, Physics, and Planning. Each task specification includes an initial image $I$, a textual prompt $t$ articulating the explicit goal, implicit process constraints $c$, and a ground-truth reference (video, image, or text).
Domains and prototypical tasks:
- Temporal: Smooth object movement (e.g., moving a shape to a target), content-preserving zoom sequences.
- Structural: Single-move chess mates, valid maze traversal, tic-tac-toe win/block, sudoku completions.
- Symbolic: Stepwise mathematical reasoning, multi-choice knowledge questions, multimodal problem-solving.
- Spatial: Grid-based dice rolling, block rotation, image tile re-assembly.
- Physics: Classical experiment simulation and physics-based game environments.
- Planning: Robotic pick-and-place and continuous navigation pathing.
Each domain emphasizes a set of process constraints tailored to reasoning style, and all are subject to frame-wise and sequence-level evaluation for both outcome and process correctness.
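To make the task structure concrete, a single VIPER-style task record might look as follows. The field names, file paths, and values here are illustrative assumptions, not the benchmark's actual schema:

```python
# Hypothetical shape of one task specification: initial image I, prompt t,
# implicit process constraints c, and a ground-truth reference.
task = {
    "domain": "Structural",
    "name": "maze_traversal",
    "initial_image": "mazes/maze_017.png",                 # I
    "prompt": "Draw the red path from entrance to exit.",  # t (explicit goal)
    "constraints": [                                       # c (implicit, per frame)
        "red path must not cross maze walls",
        "static camera",
    ],
    "reference": "mazes/maze_017_solution.png",            # ground-truth reference
}

print(sorted(task))  # the four components named above, plus metadata
```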
2. Formal Definition of Process-Outcome Consistency (POC@r)
For a generated video $V$, ViPOcc defines a sampled subset $\hat V_r$ at sampling rate $r$ (uniform decimation, frame-based or proportional). For this subset, a VLM-based judge executes dual binary evaluations:
- Outcome Consistency (OC@r): $1$ if at least one sampled frame achieves the explicit target ($f \sim t$).
- Process Consistency (PC@r): $1$ if all sampled frames satisfy the implicit process constraints ($f \sim c$).
$\text{OC}@r = \mathbbm{1}\left[\exists f \in \hat V_r : f \sim t\right], \qquad \text{PC}@r = \mathbbm{1}\left[\forall f \in \hat V_r : f \sim c\right]$
A video is strictly correct if and only if both OC@r and PC@r are satisfied.
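Under these definitions, the two checks reduce to an `any`/`all` over per-frame judge verdicts. A minimal sketch with hand-set (hypothetical) verdicts, showing an outcome-hacked video:

```python
# Per-frame verdicts for one sampled video (toy data, not real judge output):
# target_hits[i]   — frame i matches the explicit target t (f ∼ t)
# constraint_ok[i] — frame i satisfies the implicit constraints c (f ∼ c)
target_hits   = [False, False, True, True]
constraint_ok = [True, True, True, False]

oc  = any(target_hits)    # OC@r: at least one frame reaches the target
pc  = all(constraint_ok)  # PC@r: every sampled frame stays within constraints
poc = oc and pc           # strict correctness requires both

print(oc, pc, poc)  # True False False — outcome reached, process violated
```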
3. Hierarchical Rubric and VLM-as-Judge Mechanism
ViPOcc employs a three-level hierarchical rubric for VLM-based assessment:
- System Prompt: Defines the procedural schema, disambiguates process versus outcome criteria, and sets output structure (including explicit reasoning and decision formats).
- Domain Introduction: Shares domain-specific priorities (e.g., strict legality in Structural Reasoning).
- Task Constraints: Converts implicit constraints into granular, bullet-point criteria for each sample (e.g., "red path must not cross maze walls," "static camera").
The VLM-as-judge protocol utilizes a multimodal GPT-5 model, receiving the system prompt, domain introduction, constraints, and sampled frames. The VLM produces both a reasoning trace and a binary JSON verdict over process and outcome consistency.
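The three rubric levels concatenate into a single judge prompt, and the judge's reply is parsed into a binary verdict. The sketch below uses made-up prompt text and a made-up reply; the actual VIPER wording and JSON schema may differ:

```python
import json

# Three rubric levels (illustrative wording, not the benchmark's prompts).
system_prompt = ("You are a strict judge. Evaluate process and outcome "
                 "separately and reply in JSON.")
domain_intro = "Structural reasoning: every move must be strictly legal."
task_constraints = ["red path must not cross maze walls", "static camera"]

# Assemble System Prompt -> Domain Introduction -> Task Constraints.
judge_prompt = "\n\n".join([
    system_prompt,
    domain_intro,
    "Constraints:\n" + "\n".join(f"- {c}" for c in task_constraints),
])

# Hypothetical judge reply: a reasoning trace plus a binary JSON verdict.
raw_reply = ('{"reasoning": "Path crosses a wall in frame 3.", '
             '"outcome_consistent": true, "process_consistent": false}')
verdict = json.loads(raw_reply)
strictly_correct = verdict["outcome_consistent"] and verdict["process_consistent"]
print(strictly_correct)  # False
```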
4. Experimental Evaluation Protocol and Reporting
The core experimental pipeline follows:
- One-shot video generation: Pass@1 evaluation by default.
- Sampling rate ($r$): default $r = 1.0$ (frame-wise); ablations at lower sampling rates.
- Test-time scaling (Pass@k): multiple independent generations ($k$-fold); a task is solved if any sample achieves [email protected].
- Metrics:
  - OC: percentage of videos with [email protected] = 1.
  - Hacking Rate: fraction achieving the outcome without process correctness (OC − POC).
  - [email protected]: proportion satisfying both consistency criteria.
Major reported results across proprietary models (aggregate [email protected]):

| Model        | [email protected] (%) |
|--------------|--------------|
| Veo 3.1      | 20.3         |
| Sora 2       | 23.3         |
| Wan 2.6      | 18.6         |
| Seedance 1.5 | 9.5          |
Open-source models exhibit [email protected] < 10%. The outcome-hacking gap (OC − POC) ranges from 20% to 46%. Test-time scaling improves results but does not close the reasoning gap (e.g., Sora 2 reaches 58.6% [email protected] overall at Pass@8), and increasing the sampling rate reduces POC (e.g., Veo 3.1 drops from 28.2% at a lower sampling rate to 15.4% at a higher one).
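These aggregates follow directly from per-video flags. A sketch over toy results (the flags and numbers below are invented for illustration, not benchmark data):

```python
# Toy per-video (OC, POC) flags at r = 1.0.
results = [(True, True), (True, False), (False, False),
           (True, False), (True, True)]

n = len(results)
oc_rate  = sum(oc for oc, _ in results) / n    # OC: outcome reached
poc_rate = sum(poc for _, poc in results) / n  # [email protected]: process AND outcome
hacking  = oc_rate - poc_rate                  # outcome-hacking gap

# Pass@k under test-time scaling: solved if any of k samples reaches [email protected].
def pass_at_k(per_sample_poc):
    return any(per_sample_poc)

print(f"OC={oc_rate:.0%} POC={poc_rate:.0%} hacking={hacking:.0%}")
# OC=80% POC=40% hacking=40%
print(pass_at_k([False, False, True]))  # True: one of k=3 samples succeeds
```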
Key failure typologies include constraint violation (e.g., illegal moves), stop failure (editing beyond completion), editing leakage (irregular changes), and illegible textual outputs in symbolic domains.
5. Algorithmic Flow
ViPOcc evaluation can be summarized by the following pseudocode:
```
V = 𝓜(I, t)                       # generate video: V = [f₁, …, f_{n_frames}]
N = floor(n_frames / r)           # number of sampled frames at rate r
sampled = [f_{1 + floor((i-1)*r)} for i in 1…N]        # uniform decimation
OC  = any(𝒥.check_target(f, t) for f in sampled)       # outcome consistency
PC  = all(𝒥.check_constraints(f, c) for f in sampled)  # process consistency
POC = OC and PC
return {OC, PC, POC}
```
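The same loop as runnable Python, with the video model 𝓜 and judge 𝒥 replaced by stub callables. This is a sketch under the stated indexing assumptions; real usage would wire in an actual generator and VLM judge:

```python
import math

def evaluate(generate, check_target, check_constraints, image, t, c, r=1.0):
    """POC@r evaluation loop (0-indexed port of the pseudocode above)."""
    frames = generate(image, t)                # V = [f_0, ..., f_{n-1}]
    n = len(frames)
    N = math.floor(n / r)                      # number of sampled frames
    sampled = [frames[math.floor((i - 1) * r)] for i in range(1, N + 1)]
    oc = any(check_target(f, t) for f in sampled)       # outcome consistency
    pc = all(check_constraints(f, c) for f in sampled)  # process consistency
    return {"OC": oc, "PC": pc, "POC": oc and pc}

# Toy stubs: "frames" are integers; value 3 hits the target, and any value
# above 3 would violate the constraint.
gen   = lambda img, t: [1, 2, 3, 3]
hit   = lambda f, t: f == t
legal = lambda f, c: f <= c

print(evaluate(gen, hit, legal, None, t=3, c=3, r=1.0))
# {'OC': True, 'PC': True, 'POC': True}
```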
6. Significance, Insights, and Future Directions
ViPOcc establishes the necessity of process-awareness for GVR: single-frame metrics are inadequate and can grossly overstate a model’s true reasoning proficiency. Explicitly incorporating process constraints in model prompts can yield a POC gain of 2–5% in some domains. While test-time scaling provides moderate gains, it cannot resolve intrinsic model limitations or the process-outcome gap.
Areas for methodological enhancement include:
- Integrating frame-level verification or structured latent planning to enforce process consistency.
- Addressing engineering challenges such as text rendering for symbolic tasks and enforced video stopping to mitigate stop failure.
A plausible implication is that extending the ViPOcc protocol to domains involving continuous control or real-world video could further expose failure modes not captured by static or discrete environments. The framework’s extensible rubric and VLM-based evaluation are likely to remain critical as generative models begin to tackle broader, more complex reasoning challenges.
7. Context, Limitations, and Broader Impact
ViPOcc addresses a critical gap in the evaluation of generative video systems by simultaneously and rigorously assessing both process fidelity and goal attainment. The reliance on VLM-as-judge enables scalable and semantically expressive assessment but introduces dependency on the accuracy and biases of the underlying VLM (here, multimodal GPT-5). Constraint specification granularity and sampling design significantly affect metric sensitivity. Further development of domain-general, transparent process-aware benchmarks will be necessary to monitor progress toward generalized visual reasoning, particularly as models evolve in reasoning complexity and deployment context (Li et al., 31 Dec 2025).