
CogOmniControl: Reasoning-Driven Video Generation

Updated 25 May 2026
  • CogOmniControl is a reasoning-driven framework for controllable video generation that explicitly models creative intent cognition to align abstract and conflicting inputs.
  • It integrates a specialized vision-language module (CogVLM) with a unified in-context diffusion transformer (CogOmniDiT) to overcome limitations such as identity drift and ghosting.
  • The system employs reinforcement fine-tuning and a closed-loop harness for best-of-N sample selection, achieving superior performance on professional benchmarks.

CogOmniControl is a reasoning-driven framework for controllable video generation that explicitly models creative intent cognition. It uses specialized vision-language reasoning to address the limitations of existing diffusion-based video generation pipelines in professional workflows, particularly under abstract, sparse, or conflicting control conditions. By combining a dedicated creative intent cognition module trained on authentic production data with a unified in-context diffusion backbone guided by that reasoning, CogOmniControl reports substantial improvements over prior approaches in intent alignment, content fidelity, and adaptability to complex, multimodal control signals (Yang et al., 19 May 2026).

1. Motivation and Problem Definition

Controllable video generation for professional workflows, such as anime production, presents several unique challenges. Inputs such as storyboard sketches or clay renders frequently lack pixel-level detail, rendering abstract or unexpressed intent difficult to infer. Professional pipelines further require the system to synthesize not just visual or textual cues, but an underlying creative intent encompassing aspects like cinematic pacing, camera movements, and special effects. When faced with multiple, potentially conflicting control signals—e.g., a storyboard dictates dynamic motion but a clay render constrains pose—generic diffusion models exhibit failures such as identity drift, ghosting, or loss of intent.

Existing adapter-based approaches (e.g., ControlNet, VACE) treat each condition as a low-level constraint, proving inadequate for abstract semantic reasoning and intent inference, leading to misalignment and degraded visual quality. Vision–language coupling methods (e.g., OmniWeaving, VINO) embed a VLM inside the diffusion transformer, but suffer from a cognitive gap (generic VLMs hallucinate or ignore domain-specific cues) and an alignment gap (diffusion layers interpret noisy VLM output as additional noise, reducing quality when forced to align). CogOmniControl addresses these deficiencies by factorizing the task into explicit intent cognition and closed-loop generation (Yang et al., 19 May 2026).

2. Framework Architecture

CogOmniControl operates as a two-stage system:

  • Creative Intent Cognition (CogVLM): A specialized vision-language model trained to produce a dense, reasoning-rich “creative intent” from multimodal, often abstract control inputs.
  • Controllable Video Generation (CogOmniDiT): A video diffusion transformer that performs in-context, unified conditioning on various input modalities, aligned with the reasoning output from CogVLM through reinforcement learning.

The generative objective is formalized as

$$P(V \mid C) = P(V \mid R, C) \cdot P(R \mid C)$$

with $C = \{V_\text{ctrl}, I_\text{ref}, T_\text{desc}\}$ as the condition set (control video, reference image, text prompt) and $R$ the dense reasoning output from CogVLM.

The overall data flow consists of:

  1. Inputs $V_\text{ctrl}$, $I_\text{ref}$, $T_\text{desc}$ are processed by CogVLM, which outputs the reasoning $R$ and the harness $H$.
  2. $R$ is passed to CogOmniDiT to guide video generation, corresponding to the term $P(V \mid R, C)$.
  3. Harness $H$ enables selection and execution of evaluators for closed-loop best-of-N sample selection (Yang et al., 19 May 2026).
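The data flow above can be sketched as a pair of typed interfaces. This is a minimal illustration only: `Conditions`, `cog_vlm`, `cog_omni_dit`, and `run_stages` are hypothetical names, and the bodies are string-producing stand-ins rather than real models. It shows only which outputs feed which stage.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# An evaluator maps a generated video (here just a string stand-in) to a score.
Evaluator = Callable[[str], float]


@dataclass
class Conditions:
    """Condition set C; fields hold placeholder identifiers, not real media."""
    v_ctrl: str  # control video (e.g. storyboard or clay render)
    i_ref: str   # reference image
    t_desc: str  # text prompt


def cog_vlm(c: Conditions) -> Tuple[str, List[Evaluator]]:
    """Stand-in for stage 1: returns reasoning R and harness H."""
    reasoning = f"intent for '{c.t_desc}' given {c.v_ctrl} and {c.i_ref}"
    # Dummy evaluator; a real harness would contain task-specific checks.
    harness: List[Evaluator] = [lambda video: 1.0 if "intent" in video else 0.0]
    return reasoning, harness


def cog_omni_dit(c: Conditions, reasoning: str, seed: int) -> str:
    """Stand-in for stage 2: 'generates' a video conditioned on C and R."""
    return f"video(seed={seed}, ctrl={c.v_ctrl}, guided_by={reasoning!r})"


def run_stages(c: Conditions, n_samples: int) -> Tuple[str, List[Evaluator], List[str]]:
    """Steps 1 and 2 of the data flow; step 3 (selection) consumes the result."""
    reasoning, harness = cog_vlm(c)
    candidates = [cog_omni_dit(c, reasoning, seed) for seed in range(n_samples)]
    return reasoning, harness, candidates
```

The harness is returned alongside the candidates so that selection can be run as a separate closed-loop step.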

3. Creative Intent Cognition Module (CogVLM)

CogVLM is based on a Qwen3-VL-8B Transformer backbone equipped with LoRA adapters. It ingests concatenated tokens from control video frames, reference images, and text, producing two outputs: a free-text chain-of-thought “creative intent” ($R$) and a “harness” ($H$) of selected evaluators. Training proceeds in two stages:

  • Supervised Fine-Tuning (SFT): Cross-entropy loss on the dense reasoning output, using a dataset of real anime storyboards and clay renders paired with scripts:

[Equation not recoverable from the source: SFT cross-entropy loss.]

  • Reinforcement Fine-Tuning (RFT): The reward combines a holistic score (intent, physical plausibility, information, dynamics) and an accuracy term (Judge metric), and is optimized with PPO:

[Equations not recoverable from the source: RFT reward and PPO objective.]

The final objective is stated as a further expression, which is also not recoverable from the source.
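For orientation, the two training signals named above can be illustrated with their generic textbook forms. The sketch below assumes a standard token-level cross-entropy and the standard PPO clipped surrogate. These are not confirmed to be the exact losses used for CogVLM, and the function names and the default `eps=0.2` are illustrative choices.

```python
import math
from typing import Sequence


def sft_cross_entropy(target_token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the reference reasoning tokens.

    target_token_probs[t] is the model probability assigned to the t-th
    ground-truth token, given the conditions and the previous tokens.
    """
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


def ppo_clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); advantage is derived from the reward.
    Clipping removes the incentive to move the ratio outside [1-eps, 1+eps].
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio of 1.5 is capped at 1.2, so the policy gains nothing from moving further in that direction within a single update.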

This training regimen yields a reasoning VLM with superior ability to extract and articulate domain-relevant creative intent from underspecified or complex inputs, outperforming generic VLMs in professional benchmarks (Yang et al., 19 May 2026).

4. Unified Controllable Generation and Reinforcement Alignment

CogOmniDiT is an in-context diffusion transformer that unifies multimodal control via explicit concatenation:

[Equation not recoverable from the source: concatenated input sequence of CogOmniDiT.]

The sequence concatenates the noisy latent, two encoded condition inputs, and the last-layer CogVLM feature (the individual symbols are lost in the source). Denoising is conditioned on the reasoning $R$. Reinforcement alignment employs visual and intent-following rewards:

[Equation not recoverable from the source: visual and intent-following rewards.]

The policy update (also PPO) seeks to maximize a reward combining edit distance to the reasoning script and FID for realism:

[Equation not recoverable from the source: combined edit-distance and FID reward.]

Training is computationally demanding (32×H20 GPUs, 256px), while inference operates at 720p (Yang et al., 19 May 2026).
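As a rough illustration of a reward that combines an edit distance with FID, the sketch below makes several assumptions that the source does not confirm. The edit distance is taken to be Levenshtein distance between a text description of the generated video and the reasoning script. FID is supplied as a precomputed number. Both terms are lower-is-better, so they enter with a negative sign. The weights `w_edit` and `w_fid` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def alignment_reward(generated_desc: str, reasoning_script: str, fid: float,
                     w_edit: float = 1.0, w_fid: float = 0.01) -> float:
    """Hypothetical reward: higher is better, so both penalties are negated."""
    max_len = max(len(generated_desc), len(reasoning_script), 1)
    normalized_edit = levenshtein(generated_desc, reasoning_script) / max_len
    return -(w_edit * normalized_edit + w_fid * fid)
```

Normalizing the edit distance by the longer string keeps that term in the range 0 to 1, which makes the relative weighting against FID easier to reason about.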

5. Closed-Loop Harness and Best-of-N Generation

A defining feature is the closed-loop “harness” integration. CogVLM outputs a harness $H$, a dynamic set of evaluators (e.g., Artifact Detector, Temporal Smoothness, Storyboard Annotation Following) relevant to the specific task. The generation process:

  • For $N$ samples: each candidate $V_i$ from CogOmniDiT is scored by each evaluator $e_k \in H$.
  • The final sample $V^*$ maximizes the expected harness score:

$$V^* = \arg\max_{V_i} \; \mathbb{E}_{e_k \in H}\left[ e_k(V_i) \right]$$

Best-of-N (BoN) selection using the harness consistently improves alignment and quality metrics, but multiplies inference time (Yang et al., 19 May 2026).
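The selection step can be sketched in a few lines. The example assumes that the expected harness score is an unweighted mean over the evaluators; the source does not state the weighting. Samples are represented as dictionaries of precomputed properties, and the evaluator names and property keys are illustrative only.

```python
from typing import Callable, Dict, Sequence, Tuple

Sample = Dict[str, float]
Evaluator = Callable[[Sample], float]


def harness_score(sample: Sample, harness: Dict[str, Evaluator]) -> float:
    """Unweighted mean of all evaluator scores for one candidate."""
    return sum(evaluate(sample) for evaluate in harness.values()) / len(harness)


def best_of_n(samples: Sequence[Sample],
              harness: Dict[str, Evaluator]) -> Tuple[int, float]:
    """Return (index, score) of the candidate with the highest harness score."""
    scores = [harness_score(s, harness) for s in samples]
    best_index = max(range(len(samples)), key=lambda i: scores[i])
    return best_index, scores[best_index]


# Toy harness: each evaluator turns a defect rate into a score in [0, 1].
toy_harness: Dict[str, Evaluator] = {
    "artifact_detector": lambda s: 1.0 - s["artifact_rate"],
    "temporal_smoothness": lambda s: 1.0 - s["flicker"],
}

toy_samples = [
    {"artifact_rate": 0.5, "flicker": 0.5},
    {"artifact_rate": 0.25, "flicker": 0.25},
    {"artifact_rate": 0.0, "flicker": 0.75},
]
```

Calling `best_of_n(toy_samples, toy_harness)` selects the second candidate. Every candidate is generated and scored by every evaluator, so the cost grows with both $N$ and the size of the harness.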

6. Empirical Benchmarks and Performance

Evaluation uses two custom professional benchmarks:

CogReasonBench: Targets CogVLM’s reasoning performance with ~3k anime storyboard→video pairs plus general reference→video. Metrics span Intent, Physics, Integrity, Motion (score 1–5, VLM-as-judge).

Model                     Avg Score (1–5)
Qwen3-VL-8B-Instruct      3.712
Qwen3-VL-8B-Thinking      3.752
CogVLM (SFT)              4.343
CogVLM (RFT)              4.473

CogControlBench: Tests end-to-end framework on 200 high-res clips, scoring Aesthetic & Image Quality (AQ, IQ, TF, MS, DD), Multimodal Intent & Content Following (MI, AF, SF, CF, DF), Identity Consistency (MN, IC), and Dynamic Plausibility (DP) (judged by Gemini 3.1-Pro, avg 0–1 scale).

Model                           Avg Score (0–1)
VACE-Wan2.1                     0.665
VINO                            0.686
CogOmniControl (single)         0.727
CogOmniControl + BoN            0.733
CogOmniControl + Harness BoN    0.742
Proprietary Seedance2.0         0.750
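The gaps between entries in the CogControlBench table can be read off with simple arithmetic. The snippet below copies the reported average scores and computes a few differences; the helper name `gap` is illustrative.

```python
# Average CogControlBench scores as reported in the table above (0-1 scale).
scores = {
    "VACE-Wan2.1": 0.665,
    "VINO": 0.686,
    "CogOmniControl (single)": 0.727,
    "CogOmniControl + BoN": 0.733,
    "CogOmniControl + Harness BoN": 0.742,
    "Proprietary Seedance2.0": 0.750,
}


def gap(a: str, b: str) -> float:
    """Score of model a minus score of model b, rounded to three decimals."""
    return round(scores[a] - scores[b], 3)


harness_vs_single = gap("CogOmniControl + Harness BoN", "CogOmniControl (single)")
harness_vs_plain_bon = gap("CogOmniControl + Harness BoN", "CogOmniControl + BoN")
harness_vs_vino = gap("CogOmniControl + Harness BoN", "VINO")
seedance_vs_harness = gap("Proprietary Seedance2.0", "CogOmniControl + Harness BoN")
```

On these figures, harness-based selection adds 0.015 over single-sample generation and 0.009 over plain BoN, and the remaining gap to the proprietary system is 0.008.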

Ablation shows removing RFT from CogVLM or CogOmniDiT degrades results by 2–4 points; full SFT+RFT yields best overall and individual benchmark scores (Yang et al., 19 May 2026).

7. Strengths, Limitations, and Future Directions

CogOmniControl advances controllable video generation by bridging the cognitive gap with domain-specialized intent reasoning, achieving robust multi-condition adherence via in-context diffusion, and leveraging a closed-loop harness for adaptive evaluator selection. Principal limitations are high training costs (due to large-scale RL fine-tuning) and increased inference latency from BoN sampling, which grows with the number of samples $N$.

Proposed future research directions include distilling lightweight imitations of CogVLM reasoning to reduce latency, enabling real-time interactive user corrections to the reasoning script, and expanding to additional control modalities such as audio, scene graphs, or narrative structure (Yang et al., 19 May 2026).

