
RULER-Bench: Rule-Based Video Generation Benchmark

Updated 9 December 2025
  • RULER-Bench is a public benchmark that evaluates video generation models on rule-coherent reasoning through text-to-video and image-to-video tasks.
  • It comprises 622 curated task instances across six cognitive rule categories spanning nature, society, and virtuality, with fine-grained metrics including rule coherence.
  • The benchmark exposes significant reasoning gaps in current models and guides future improvements in multimodal alignment and neural-symbolic reasoning.

RULER-Bench is a large-scale, public benchmark specifically designed to evaluate the rule-based reasoning capabilities of state-of-the-art video generation models, with an explicit focus on cognitive rule inference and application in generated video sequences. In contrast to prior benchmarks that assess visual, perceptual, or aesthetic metrics, RULER-Bench provides fine-grained assessment of a model’s ability to infer and manifest implicit causal, logical, and social rules in both text-to-video (T2V) and image-to-video (I2V) paradigms. By constructing representative tasks sourced from diverse domains and annotating them with precise metrics, RULER-Bench exposes reasoning deficits in leading video architectures, thereby providing critical diagnostic leverage for the advancement of vision foundation intelligence (He et al., 2 Dec 2025).

1. Motivation and Distinctiveness

Contemporary video generation models have achieved notable advances in visual fidelity, temporal coherence, and limited instruction alignment. Existing benchmarks such as VBench, EvalCrafter, AIGC-Bench, and UI2V-Bench, however, restrict evaluation to perceptual or basic instruction-following criteria, failing to probe the inferential mechanisms by which a model predicts causal or abstract consequences in visually dynamic scenarios. There is thus a systematic gap: current evaluation suites do not measure whether a model can extract hidden rules from input modalities and generate videos that embody those rules, e.g., correctly visualizing a chemical reaction or a legal chess move.

RULER-Bench addresses this gap with two principal objectives:

  1. Formulating reasoning in video generation as rule-based prediction, requiring the model to infer implicit rules from the input and exhibit them visibly in the output sequence.
  2. Delivering a large-scale, publicly accessible benchmark that decomposes reasoning ability into interpretable categories of “cognitive rules,” thus enabling longitudinal tracking and targeted architectural improvements.

By foregrounding the notion of “rule coherence,” RULER-Bench serves as both a diagnostic tool and a strategic guidepost for the development of reasoning-aware, foundation-level vision models.

2. Benchmark Construction and Task Taxonomy

RULER-Bench comprises 622 curated instances distributed over 40 core tasks, each belonging to one of six cognitive rule categories. The two core task paradigms are:

  • Text-to-Video (T2V): The model receives a textual prompt (optionally accompanied by a rule explanation) and must synthesize both appearance and sequence dynamics that satisfy the inferred rule.
  • Image-to-Video (I2V): The model is provided a static image and a rule-bearing prompt, requiring correct parsing of the initial visual state and synthesis of a temporally coherent sequence manifesting the stated rule.

The six major categories, summarized in the table below, span three foundational domains (Nature, Society, Virtuality):

Domain       Category     Example Task (Prompt)
Nature       Vision       “Change the apple’s color from green to red.”
Nature       Science      “Add acid to base and record the color change.”
Society      Humanity     “Show a person dressing for a snowy winter.”
Society      Semantics    “Depict ‘spill the beans.’”
Virtuality   Hypothesis   “After lying, the character’s nose grows longer.”
Virtuality   Game         “Deliver a one-move checkmate in classic chess.”

Examples within each category target reasoning chains beyond surface-level pattern recognition, including inference of cultural conventions, physical causality, metaphorical interpretation, and legal-move logic in games. Annotated examples were produced via a hybrid pipeline: human experts supplied initial seed tasks, GPT-5 expanded them, and quality refinement followed, including deduplication, multimodal-LLM self-consistency checks, and privacy/ethics concerns addressed through synthetic generation (Sora2). Ground-truth rule explanations are kept high-precision, with human annotation in the game domain to mitigate erroneous synthetic explanations.
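For concreteness, a single benchmark instance can be thought of as a structured record pairing a prompt with its rule explanation and, for I2V, a conditioning image. The sketch below illustrates one plausible representation; the field names and example values are assumptions for illustration, not the benchmark's released schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RulerBenchTask:
    """Hypothetical record for one RULER-Bench instance; field names are assumed."""
    task_id: str                        # unique identifier
    paradigm: str                       # "T2V" or "I2V"
    domain: str                         # "Nature", "Society", or "Virtuality"
    category: str                       # e.g. "Vision", "Science", "Game", ...
    prompt: str                         # rule-bearing generation prompt
    rule_explanation: str               # ground-truth statement of the implicit rule
    input_image: Optional[str] = None   # conditioning image path (I2V only)

example = RulerBenchTask(
    task_id="game-chess-001",
    paradigm="I2V",
    domain="Virtuality",
    category="Game",
    prompt="Deliver a one-move checkmate in classic chess.",
    rule_explanation="The move shown must be legal and must leave the opposing "
                     "king with no legal reply.",
    input_image="boards/mate_in_one.png",
)
```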

3. Evaluation Protocol and Metrics

RULER-Bench introduces a standardized, checklist-based evaluation, applied to each generated video in the benchmark. Four core metrics are measured:

  1. Instruction Following (IF): Adherence of the output to the explicitly described action.
  2. Visual Consistency (VC): Stability of object identity and appearance attributes across temporal frames.
  3. Visual Fidelity (VF): Freedom from visual artifacts, blurring, or unnatural distortions.
  4. Rule Coherence (RC): Fidelity to the causal, logical, or game rule embedded in the prompt.

Each metric is assessed via discrete Good/Medium/Poor labels, mapped to scores of 1.0/0.5/0.0. For a given metric $i$ with checklist questions $Q_i$, the aggregated score is:

$$\mathrm{Metric}_i = \frac{\sum_{q\in Q_i}\mathrm{Score}(q)}{|Q_i|}, \qquad \mathrm{Overall} = \sum_{i=1}^{4} w_i\,\mathrm{Metric}_i, \qquad w_i = \tfrac{1}{4}$$
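A minimal sketch of this aggregation, assuming hypothetical checklist labels for a single generated video (the questions and labels below are illustrative, not the benchmark's actual checklist), is:

```python
# Map discrete checklist labels to scores, as in the formula above.
LABEL_SCORES = {"Good": 1.0, "Medium": 0.5, "Poor": 0.0}

# Hypothetical checklist labels for one generated video, grouped by metric.
checklists = {
    "IF": ["Good", "Good", "Medium"],          # Instruction Following
    "VC": ["Good", "Medium"],                  # Visual Consistency
    "VF": ["Good", "Good"],                    # Visual Fidelity
    "RC": ["Poor", "Medium", "Poor", "Good"],  # Rule Coherence
}

# Metric_i = mean of mapped scores over its checklist questions Q_i.
metrics = {
    name: sum(LABEL_SCORES[label] for label in labels) / len(labels)
    for name, labels in checklists.items()
}

# Overall = equal-weight (w_i = 1/4) average of the four metrics.
overall = sum(metrics.values()) / len(metrics)

print(metrics)   # {'IF': 0.833..., 'VC': 0.75, 'VF': 1.0, 'RC': 0.375}
print(overall)   # 0.739...
```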

Evaluation is performed automatically using OpenAI’s GPT-o3 with specialized prompting. The automated labels achieve 85.12% exact match against human annotation, with rank correlations (τ, ρ, r) exceeding 0.80 for all metrics (He et al., 2 Dec 2025).
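As an illustration of how such agreement statistics can be reproduced when comparing automated and human labels, a sketch using scipy is shown below; the score vectors are invented for demonstration and do not come from the benchmark.

```python
# Illustrative agreement check between automated and human metric scores.
# The score vectors are invented; they do not come from the benchmark.
from scipy import stats

auto_scores  = [1.0, 0.5, 0.0, 0.5, 1.0, 0.0, 1.0, 0.5]
human_scores = [1.0, 0.5, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0]

exact_match = sum(a == h for a, h in zip(auto_scores, human_scores)) / len(auto_scores)
tau, _ = stats.kendalltau(auto_scores, human_scores)   # Kendall's tau
rho, _ = stats.spearmanr(auto_scores, human_scores)    # Spearman's rho
r, _   = stats.pearsonr(auto_scores, human_scores)     # Pearson's r

print(f"exact match = {exact_match:.2%}; tau = {tau:.2f}, rho = {rho:.2f}, r = {r:.2f}")
```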

4. Experimental Results and Analysis

When evaluated on all 622 instances, state-of-the-art video generation models show pronounced deficiencies in rule-based reasoning. The leading closed-source model (Veo3.1) achieves an RC score of 48.87%, compared with perception metrics (VF and VC) in the 65–90% range. Open-source models perform significantly worse, with RC scores largely in the 18–23% range. Task-wise analysis reveals:

  • Highest RC: Humanity and Semantics tasks, with scores up to ∼68%—presumably reflecting social media data influence.
  • Lowest RC: Game tasks, uniformly below 20%; outputs for tasks such as delivering a legal checkmate in chess or executing valid Go captures are rarely correct.
  • Moderate RC: Science tasks (33–51%) suggest partial acquisition of physical rules but incomplete causal modeling.
  • I2V Deficit: I2V lags T2V by ≈17 points in RC, indicating ongoing difficulty in visual content parsing and transformation.

Common failure modes include literal misinterpretation of idioms, incomplete chemical demonstrations, illegal or incorrect game strategies, and temporal drift or identity transformation in the I2V setting. Prompt Enhancement, in which GPT-o3 expands prompts with explicit rule consequences, yields only modest RC gains, underscoring that the models lack inherent rule-reasoning faculties.

5. Benchmark Implications and Limitations

RULER-Bench demonstrates that, despite significant progress in perceptual metrics, leading video models have not internalized the mechanisms required for human-level rule-based deduction. The evident “rule coherence” deficit suggests current architectures prioritize local texture, motion, and spatial patterns at the cost of causal and symbolic integration.

Specific findings:

  • The markedly higher RC on social and semantic tasks relative to games indicates heavy reliance on training-data priors rather than generalizable reasoning.
  • I2V performance gaps highlight weaknesses in fusing static semantic parsing with temporal reasoning, implicating inadequate multimodal alignment.
  • Automated, reproducible assessment using MLLM-based checklists allows precise longitudinal tracking of reasoning “health” across model development cycles.

6. Prospective Directions

Future work prompted by RULER-Bench is expected to pursue:

  • Explicit symbolic and neural–symbolic hybrid reasoning modules within video models.
  • Enhanced multimodal alignment for robust I2V performance, potentially through curriculum or self-supervised rule extraction.
  • Task expansion to Video-to-Video (V2V), multi-step reasoning, GUI interaction, algorithmic puzzles, and more complex temporal planning.
  • Checklist-based monitoring for robust, scalable evaluation of reasoning during iterative model refinement.

This suggests that comprehensive rule-based evaluation, as instantiated in RULER-Bench, is now foundational to the path toward bona fide vision foundation intelligence. By exposing deficits invisible in prior perceptual benchmarks, RULER-Bench is positioned as a community resource for building reasoning-aware architectures and for closing the gap between visual synthesis and cognitive understanding (He et al., 2 Dec 2025).

References

  • He et al. (2 Dec 2025). RULER-Bench: Rule-Based Video Generation Benchmark.