CogReasonBench: Creative Reasoning Benchmark

Updated 25 May 2026
  • CogReasonBench is a benchmark that measures VLMs' ability to interpret abstract creative intent and plan production steps in professional video workflows.
  • It uses human-verified chain-of-thought annotations from real animation assets to assess intent fidelity, physical plausibility, and motion coherence.
  • The evaluation employs reward-based scoring and fine-tuning methods like SFT and RFT to address gaps in generic VLM reasoning.

CogReasonBench is a reasoning-oriented evaluation benchmark designed to quantify the creative-intent cognition capabilities of multimodal vision-language models (VLMs) within professional video generation workflows. Created in conjunction with CogOmniControl, it systematically measures a VLM’s ability to bridge abstract control signals (storyboards, clay renders, sparse prompts) and concrete, stepwise production reasoning before any video synthesis is performed (Yang et al., 19 May 2026).

1. Motivation and Purpose

Existing video diffusion models excel at photorealism and temporal coherence but exhibit performance collapse under abstract, sparse, or conflicting control conditions typical in professional content creation, such as hand-drawn storyboard sketches and clay renders. Two core deficiencies motivate CogReasonBench:

  • Cognitive Gap: Generic VLMs are unable to infer the user’s “creative intent” from minimal multimodal inputs, resulting in outputs that diverge from artistic targets.
  • Alignment Gap: Even when a VLM outputs a reasoning chain, there is no guarantee that downstream video synthesis adheres to that chain.

CogReasonBench directly targets the cognitive gap, isolating and evaluating the VLM’s reasoning fidelity in professional workflows, decoupled from the video generation stage.

2. Dataset Composition and Annotation

CogReasonBench is constructed exclusively from real-world animation-production assets, prioritizing authentic, human-verified creative rationale over synthetic data. The dataset includes:

| Source | Samples | Resolution | Modalities | Annotation |
|---|---|---|---|---|
| Storyboard → Final Video | 80 | 720p | Sketch video, text prompt | Human-verified CoT |
| Clay-render → Final Video | 70 | 720p | Clay animation, text prompt | Human-verified CoT |
| Pose/Depth/Subject → Final | 50 | 480p | Single-frame + text prompt | Human-verified CoT |
| Total | 200 | | | |

Annotation is performed via stepwise, “chain-of-thought” (CoT) rationales generated and quality-controlled by expert animators. Key preprocessing steps include temporal normalization (framerate unification), semantic alignment between input assets and production clips, and double-blind gold CoT drafting followed by second-pass validation.
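The source does not say how framerate unification is implemented. As a rough illustration only, the sketch below picks source-frame indices for a clip resampled to a common rate by flooring each output timestamp to a source frame. The function name `resample_indices` and the example frame rates are assumptions, not details from the benchmark.

```python
from typing import List


def resample_indices(n_frames: int, src_fps: float, dst_fps: float) -> List[int]:
    """Return source-frame indices for a clip resampled from src_fps to dst_fps.

    Hypothetical helper: each output frame maps to the source frame whose
    timestamp is at or just before the output timestamp (floor mapping).
    """
    if n_frames <= 0 or src_fps <= 0 or dst_fps <= 0:
        raise ValueError("n_frames, src_fps and dst_fps must be positive")
    duration = n_frames / src_fps                # clip length in seconds
    n_out = max(1, round(duration * dst_fps))    # number of output frames
    step = src_fps / dst_fps                     # source frames per output frame
    # Clamp to the last valid index to guard against rounding at the clip end.
    return [min(n_frames - 1, int(i * step)) for i in range(n_out)]


if __name__ == "__main__":
    # Downsample a 6-frame clip from 30 fps to 15 fps.
    print(resample_indices(6, 30, 15))
    # Upsample a 2-frame clip from 12 fps to 24 fps (frames are repeated).
    print(resample_indices(2, 12, 24))
```

Frame dropping and repetition is the simplest option; a production pipeline might prefer interpolation, which this sketch does not attempt.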

3. Task Design and Protocol

CogReasonBench operationalizes reasoning evaluation in a multimodal prompt–response format. Each evaluation task is defined as follows:

  • Intent Cognition: Given multimodal input $C = \{V_\mathrm{ctrl}, I_\mathrm{ref}, T_\mathrm{desc}\}$, the VLM must output a precise, concise explanation of the intended creative outcome (“Creative Intent: ...”).
  • Plan Generation: For the same input, the VLM constructs a sequential plan detailing production steps (e.g., lighting, special effects, animation cues).
  • Reasoning Consistency: The VLM's full chain-of-thought reasoning trace is scored along four axes: Creative-Intent Faithfulness, Physical Plausibility, Information Integrity, and Motion Description.

Inputs may include control videos (storyboard or clay), reference imagery, and text prompts. Outputs are explicitly structured for downstream use or scoring.
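To make the prompt–response format concrete, here is a minimal sketch of how one task's input and structured output could be represented. The class names, field names, and the "Step i:" layout are illustrative assumptions; only the three input components and the "Creative Intent: ..." prefix come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ControlInput:
    """Multimodal input C = {V_ctrl, I_ref, T_desc}; field names are assumed."""
    control_video: Optional[str]    # path to a storyboard or clay-render video (V_ctrl)
    reference_image: Optional[str]  # path to reference imagery (I_ref)
    text_prompt: str                # text description (T_desc)


@dataclass
class ReasoningResponse:
    """Structured VLM output: an intent statement plus an ordered plan."""
    creative_intent: str
    plan_steps: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Intent first, then the numbered production steps, one per line.
        lines = [f"Creative Intent: {self.creative_intent}"]
        lines += [f"Step {i}: {step}" for i, step in enumerate(self.plan_steps, 1)]
        return "\n".join(lines)


if __name__ == "__main__":
    sample = ControlInput(
        control_video="storyboard.mp4",  # hypothetical file name
        reference_image=None,
        text_prompt="A mysterious traveler finds shelter under an old lamp in the rain",
    )
    response = ReasoningResponse(
        creative_intent="A cloaked traveler seeks shelter beneath a lamp on a rainy night.",
        plan_steps=["Slow approach toward the lamppost", "Pause, then head turn", "Add rain and lamp bloom"],
    )
    print(sample.text_prompt)
    print(response.render())
```

Keeping the intent statement and the plan as separate fields lets a scorer judge Intent Cognition and Plan Generation independently of each other.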

4. Evaluation Metrics and Formulation

CogReasonBench leverages a “VLM-as-judge” paradigm, employing large VLMs (e.g., Gemini 3.1-Pro) as automated evaluators. Two primary reward mechanisms structure both evaluation and reinforcement fine-tuning:

  • Holistic Reasoning Reward ($R_{\mathrm{holistic}}$): Aggregates VLM-scored metrics for each output dimension:

$$R_{\mathrm{holistic}}(R, C) = \sum_{k \in K} w_k\,\mathrm{VLM}_k(R, C)$$

where $K$ is the set of dimensions and $w_k$ are weights (typically uniform).

  • Factual Accuracy Reward ($R_{\mathrm{acc}}$): Measures precision with respect to $N$ atomic true/false queries:

$$R_{\mathrm{acc}}(R, C) = \frac{1}{N} \sum_{i=1}^N \mathrm{VLM}(R, q_i)$$

with $\mathrm{VLM}(R, q_i) = 1$ iff $R$ correctly addresses query $q_i$.
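The two rewards reduce to a weighted sum and a mean of binary verdicts. The sketch below assumes the judge's per-dimension scores and per-query verdicts have already been collected; the function names are illustrative, and normalizing uniform weights to sum to 1 is an assumption, since the source states only that weights are typically uniform.

```python
from typing import Dict, List, Optional


def holistic_reward(scores: Dict[str, float],
                    weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted sum over dimensions k in K of w_k * VLM_k(R, C)."""
    if not scores:
        raise ValueError("at least one dimension score is required")
    if weights is None:
        # Assumption: uniform weights normalized so that they sum to 1.
        weights = {k: 1.0 / len(scores) for k in scores}
    return sum(weights[k] * scores[k] for k in scores)


def accuracy_reward(verdicts: List[bool]) -> float:
    """Fraction of the N atomic true/false queries the response addresses correctly."""
    if not verdicts:
        raise ValueError("at least one query verdict is required")
    return sum(1 for v in verdicts if v) / len(verdicts)


if __name__ == "__main__":
    # Made-up judge outputs for a single response, for illustration only.
    print(holistic_reward({"intent": 4.0, "physics": 2.0}))
    print(accuracy_reward([True, False, True, True]))
```

With unnormalized unit weights the holistic reward would instead be the plain sum of dimension scores, so it is worth checking which convention a given implementation uses.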

Dimension-wise scores are averaged over all samples; the aggregate mean yields the benchmark’s global summary statistic.
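One reading of this aggregation is sketched below: average each axis over samples, then take the mean of the per-axis averages as the summary. The axis keys and the function name `aggregate` are assumptions, and the source may compute the summary statistic differently (for example, from unrounded scores or per-sample averages).

```python
from statistics import mean
from typing import Dict, List

# Assumed short keys for the four scoring axes.
AXES = ("intent", "physics", "integrity", "motion")


def aggregate(per_sample: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each axis over all samples, then average the axis means."""
    if not per_sample:
        raise ValueError("at least one sample is required")
    summary = {axis: mean(s[axis] for s in per_sample) for axis in AXES}
    summary["avg"] = mean(summary[axis] for axis in AXES)
    return summary


if __name__ == "__main__":
    # Two made-up samples, for illustration only.
    samples = [
        {"intent": 2.0, "physics": 4.0, "integrity": 4.0, "motion": 4.0},
        {"intent": 4.0, "physics": 4.0, "integrity": 4.0, "motion": 4.0},
    ]
    print(aggregate(samples))
```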

5. Baseline Performance and Insights

Within CogReasonBench, several VLMs—both generic (Qwen3-VL family) and specialized (Cog VLM)—were evaluated. Results expose marked disparities in reasoning quality, especially for high-level creative interpretation:

| Model | Intent | Phys. | Integrity | Motion | Avg. |
|---|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 2.48 | 4.05 | 3.91 | 4.42 | 3.71 |
| Qwen3-VL-8B-Thinking | 2.67 | 3.82 | 3.83 | 4.73 | 3.75 |
| Cog VLM (SFT only) | 3.73 | 4.45 | 4.27 | 4.96 | 4.34 |
| Cog VLM (SFT→RFT) | 3.99 | 4.45 | 4.60 | 4.96 | 4.47 |

Generic VLMs register high scores on lower-level “Physics” and “Motion” metrics but underperform in “Intent” (2.48 and 2.67), indicating insufficient capture of abstract creative goals. Supervised fine-tuning (SFT) with professional data improves all four axes over both generic baselines, most of all high-level intent recognition (3.73). Reinforcement fine-tuning (RFT) further raises Intent (3.73 to 3.99) and Integrity (4.27 to 4.60) while Physics and Motion are unchanged (Yang et al., 19 May 2026).
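The per-axis gains can be checked directly from the reported numbers. The short script below copies the scores from the baseline table and computes the difference between any two rows; the helper name `delta` and the short axis keys are illustrative.

```python
AXES = ("intent", "phys", "integrity", "motion", "avg")

# Scores copied from the baseline table above.
SCORES = {
    "Qwen3-VL-8B-Instruct": (2.48, 4.05, 3.91, 4.42, 3.71),
    "Qwen3-VL-8B-Thinking": (2.67, 3.82, 3.83, 4.73, 3.75),
    "Cog VLM (SFT only)": (3.73, 4.45, 4.27, 4.96, 4.34),
    "Cog VLM (SFT→RFT)": (3.99, 4.45, 4.60, 4.96, 4.47),
}


def delta(baseline: str, candidate: str) -> dict:
    """Per-axis score difference (candidate minus baseline), rounded to 2 decimals."""
    pairs = zip(AXES, SCORES[baseline], SCORES[candidate])
    return {axis: round(c - b, 2) for axis, b, c in pairs}


if __name__ == "__main__":
    print(delta("Qwen3-VL-8B-Instruct", "Cog VLM (SFT only)"))
    print(delta("Cog VLM (SFT only)", "Cog VLM (SFT→RFT)"))
```

The second comparison shows non-zero gains only on Intent and Integrity (plus the reported average), which matches the observation that RFT does not regress Physics or Motion.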

6. Illustrative Cases and Generation Cues

Representative benchmark entries exemplify the structure and diagnostic fidelity of CogReasonBench. In one instance, a five-frame storyboard depicting a cloaked traveler approaching a lamppost, paired with the prompt “A mysterious traveler finds shelter under an old lamp in the rain,” yields a gold CoT specifying subject characterization, environmental lighting, precise motion plan (slow approach, pause, head turn), and special effects (rain, lamp bloom). These traces then generate precise cues (depth mapping, timing markers, particle anchoring) for downstream synthesis.

A second sample involving a clay-rendered, bouncing ball, with reference texture, yields CoT steps that precisely articulate subject physical properties (mass, elasticity), bounce kinematics, per-bounce timing, and impact visuals. Such grounded reasoning is directly convertible to generative constraints for the video model.

These illustrative entries demonstrate the difference between superficial multimodal understanding and deep reasoning over creative intent. They serve both as assessment material and as input scaffolding for generation models in harness-like architectures.

7. Significance and Broader Implications

CogReasonBench introduces a professional-workflow-centered methodology for evaluating and training VLMs to bridge abstract conceptualization and executable planning in controllable video generation. The marked improvement from domain-specialized SFT/RFT pipelines confirms that exposure to authentic creative processes, rather than synthetic or oversimplified proxies, is pivotal for robust multi-step reasoning. A plausible implication is that future advancements in controllable generation, particularly in domains with nuanced creative demands, will increasingly depend on benchmarks with human-verified reasoning under production-realistic constraints.

CogReasonBench’s annotated data, explicit evaluation axes, and open accessibility alongside CogOmniControl provide a scalable foundation for the next wave of reasoning-centric multimodal AI research (Yang et al., 19 May 2026).
