CogOmniControl: Reasoning-Driven Video Generation
- CogOmniControl is a reasoning-driven framework for controllable video generation that explicitly models creative intent cognition to align abstract and conflicting inputs.
- It integrates a specialized vision-language module (CogVLM) with a unified in-context diffusion transformer (CogOmniDiT) to overcome limitations such as identity drift and ghosting.
- The system employs reinforcement fine-tuning and a closed-loop harness for best-of-N sample selection, scoring above VACE-Wan2.1 and VINO and close to the proprietary Seedance2.0 on the CogControlBench benchmark.
CogOmniControl is a reasoning-driven framework for controllable video generation that explicitly models creative intent cognition. It uses specialized vision-language reasoning to address the limitations of existing diffusion-based video generation pipelines in professional workflows, particularly under abstract, sparse, or conflicting control conditions. By combining a dedicated creative intent cognition module trained on authentic production data with a unified in-context diffusion backbone guided by this reasoning, CogOmniControl reports substantial improvements over prior approaches in intent alignment, content fidelity, and adaptability to complex, multimodal control signals (Yang et al., 19 May 2026).
1. Motivation and Problem Definition
Controllable video generation for professional workflows, such as anime production, presents several unique challenges. Inputs such as storyboard sketches or clay renders frequently lack pixel-level detail, rendering abstract or unexpressed intent difficult to infer. Professional pipelines further require the system to synthesize not just visual or textual cues, but an underlying creative intent encompassing aspects like cinematic pacing, camera movements, and special effects. When faced with multiple, potentially conflicting control signals—e.g., a storyboard dictates dynamic motion but a clay render constrains pose—generic diffusion models exhibit failures such as identity drift, ghosting, or loss of intent.
Existing adapter-based approaches (e.g., ControlNet, VACE) treat each condition as a low-level constraint, proving inadequate for abstract semantic reasoning and intent inference, leading to misalignment and degraded visual quality. Vision–language coupling methods (e.g., OmniWeaving, VINO) embed a VLM inside the diffusion transformer, but suffer from a cognitive gap (generic VLMs hallucinate or ignore domain-specific cues) and an alignment gap (diffusion layers interpret noisy VLM output as additional noise, reducing quality when forced to align). CogOmniControl addresses these deficiencies by factorizing the task into explicit intent cognition and closed-loop generation (Yang et al., 19 May 2026).
2. Framework Architecture
CogOmniControl operates as a two-stage system:
- Creative Intent Cognition (CogVLM): A specialized vision-language model (VLM) trained to produce a dense, reasoning-rich “creative intent” from multimodal, often abstract control inputs.
- Controllable Video Generation (CogOmniDiT): A video diffusion transformer that performs in-context, unified conditioning on various input modalities, aligned with the reasoning output from CogVLM through reinforcement learning.
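The generative objective is formalized as sampling the output video $x$ from a conditional distribution
$$x \sim p_\theta(x \mid \mathcal{C}, r)$$
with $\mathcal{C} = \{c_v, c_r, c_t\}$ as the condition set (control video, reference image, text prompt) and $r$ the dense reasoning output from CogVLM.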
The overall data flow consists of:
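- The inputs (control video $c_v$, reference image $c_r$, text prompt $c_t$) are processed by CogVLM, which outputs the creative-intent reasoning $r$ and the harness $\mathcal{H}$.
- The reasoning $r$ is passed to CogOmniDiT to guide generation of the video $x$.
- The harness $\mathcal{H}$ enables selection and execution of evaluators for closed-loop best-of-N sample selection (Yang et al., 19 May 2026).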
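The data flow above can be sketched as a small orchestration function. This is a minimal sketch rather than the authors' implementation: `Conditions`, `stub_cog_vlm`, `stub_cog_omni_dit`, and the evaluator names are hypothetical stand-ins, videos are represented as plain strings, and the constant scores are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass(frozen=True)
class Conditions:
    """Hypothetical container for the condition set."""
    control_video: str
    reference_image: str
    text_prompt: str

Evaluator = Callable[[str], float]   # maps a generated video to a score
Harness = Dict[str, Evaluator]       # evaluator name -> evaluator

def stub_cog_vlm(cond: Conditions) -> Tuple[str, Harness]:
    """Stand-in for CogVLM: returns a reasoning text and a set of evaluators."""
    reasoning = f"Intent for '{cond.text_prompt}': follow the motion in {cond.control_video}."
    harness: Harness = {
        "artifact_detector": lambda video: 0.9,     # placeholder score
        "temporal_smoothness": lambda video: 0.8,   # placeholder score
    }
    return reasoning, harness

def stub_cog_omni_dit(cond: Conditions, reasoning: str, seed: int) -> str:
    """Stand-in for CogOmniDiT: returns an identifier instead of real frames."""
    return f"video(seed={seed}, ref={cond.reference_image})"

def generate_and_score(cond, vlm, dit, seed: int = 0):
    # Stage 1: intent cognition produces the reasoning and the harness.
    reasoning, harness = vlm(cond)
    # Stage 2: generation is guided by the conditions plus the reasoning.
    video = dit(cond, reasoning, seed)
    # Closed loop: every evaluator in the harness scores the result.
    scores = {name: evaluator(video) for name, evaluator in harness.items()}
    return video, scores

cond = Conditions("storyboard.mp4", "character.png", "hero turns toward camera")
video, scores = generate_and_score(cond, stub_cog_vlm, stub_cog_omni_dit)
print(video, scores)
```

Keeping the two stages behind plain callables makes it easy to swap in real models later without changing the control flow.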
3. Creative Intent Cognition Module (CogVLM)
CogVLM is based on a Qwen3-VL-8B Transformer backbone equipped with LoRA adapters. It ingests concatenated tokens from control video frames, reference images, and text, producing two outputs: a free-text chain-of-thought “creative intent” ($r$) and a “harness” ($\mathcal{H}$) of selected evaluators. Training proceeds in two stages:
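- Supervised Fine-Tuning (SFT): Cross-entropy loss on the dense reasoning tokens, using a dataset of real anime storyboards and clay renders paired with scripts:
$$\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log \pi_\phi\left(r_t \mid r_{<t}, \mathcal{C}\right)$$
where $r_t$ is the $t$-th token of the target reasoning, $\mathcal{C}$ is the condition set, and $\pi_\phi$ is the CogVLM policy.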
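- Reinforcement Fine-Tuning (RFT): The reward $R(r)$ combines a holistic score (intent, physical plausibility, information, dynamics) with an accuracy term (Judge metric) and is optimized with PPO.
The resulting objective maximizes the expected reward of reasoning $r$ sampled from the policy $\pi_\phi$ given the condition set $\mathcal{C}$:
$$\max_{\phi}\ \mathbb{E}_{r \sim \pi_\phi(\cdot \mid \mathcal{C})}\left[R(r)\right]$$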
This training regimen yields a reasoning VLM with superior ability to extract and articulate domain-relevant creative intent from underspecified or complex inputs, outperforming generic VLMs in professional benchmarks (Yang et al., 19 May 2026).
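The supervised stage described above relies on token-level cross-entropy over the reasoning text. The following is a minimal sketch of that loss in plain Python, assuming per-step logits over a toy vocabulary; the function names are illustrative, and averaging over tokens is a common normalization choice, not something the source specifies.

```python
import math
from typing import List, Sequence

def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sft_cross_entropy(step_logits: Sequence[Sequence[float]],
                      target_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target tokens.

    step_logits[t] holds the model's logits at step t (already conditioned
    on the inputs and on the previous target tokens, i.e. teacher forcing).
    """
    if len(step_logits) != len(target_ids) or not target_ids:
        raise ValueError("need one non-empty logit row per target token")
    nll = 0.0
    for logits, target in zip(step_logits, target_ids):
        nll -= log_softmax(logits)[target]
    return nll / len(target_ids)

# A uniform model over 4 tokens has loss log(4); a confident, correct model is lower.
uniform_loss = sft_cross_entropy([[0.0] * 4] * 3, [0, 1, 2])
confident_loss = sft_cross_entropy([[10.0, 0.0, 0.0, 0.0]], [0])
print(uniform_loss, confident_loss)
```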
4. Unified Controllable Generation and Reinforcement Alignment
CogOmniDiT is an in-context diffusion transformer that unifies multimodal control via explicit concatenation of token sequences:
$$z_{\text{in}} = [\, z_t \,;\, z_v \,;\, z_r \,;\, h \,]$$
with $z_t$ the noisy latent, $z_v$ encoding the control video, $z_r$ encoding the reference image, and $h$ the last-layer CogVLM feature. Denoising is conditioned on the reasoning $r$. Reinforcement alignment employs a visual reward and an intent-following reward.
The policy update (also PPO) seeks to maximize a reward that favors low edit distance to the reasoning script and low FID for realism.
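The in-context concatenation described in this section can be illustrated with plain lists of token embeddings. This is a minimal sketch under stated assumptions: all streams are assumed to be already projected to a shared embedding width, the segment names are hypothetical, and the returned spans are only a convenience for slicing the noisy-latent positions back out after the transformer.

```python
from typing import Dict, List, Tuple

Token = List[float]        # one token embedding
TokenSeq = List[Token]     # a sequence of token embeddings

def concat_in_context(noisy_latent: TokenSeq,
                      control_tokens: TokenSeq,
                      reference_tokens: TokenSeq,
                      vlm_feature: TokenSeq) -> Tuple[TokenSeq, Dict[str, Tuple[int, int]]]:
    """Join all streams along the sequence axis and record each segment's span."""
    parts = {
        "noisy_latent": noisy_latent,
        "control": control_tokens,
        "reference": reference_tokens,
        "vlm_feature": vlm_feature,
    }
    dim = len(noisy_latent[0])
    sequence: TokenSeq = []
    spans: Dict[str, Tuple[int, int]] = {}
    for name, seq in parts.items():
        if any(len(tok) != dim for tok in seq):
            raise ValueError(f"segment '{name}' does not match embedding width {dim}")
        start = len(sequence)
        sequence.extend(seq)
        spans[name] = (start, len(sequence))   # half-open [start, end)
    return sequence, spans

zeros = lambda n, d=2: [[0.0] * d for _ in range(n)]
joint, spans = concat_in_context(zeros(4), zeros(4), zeros(1), zeros(2))
print(len(joint), spans)
```

Because every modality lives in one sequence, self-attention can relate the noisy latent to each condition directly, without a separate adapter branch per condition.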
Training is computationally demanding (32×H20 GPUs, 256px), while inference operates at 720p (Yang et al., 19 May 2026).
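Both CogVLM and CogOmniDiT are described as being tuned with PPO. The source does not state which PPO variant or hyperparameters are used, so the following is only a sketch of the standard clipped surrogate objective, with `clip_eps=0.2` chosen as a conventional default rather than a value from the paper.

```python
import math
from typing import Sequence

def ppo_clipped_surrogate(new_logps: Sequence[float],
                          old_logps: Sequence[float],
                          advantages: Sequence[float],
                          clip_eps: float = 0.2) -> float:
    """Mean clipped surrogate: E[min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)].

    new_logps / old_logps are log-probabilities of the sampled actions under
    the current and the behaviour policy; the value is to be maximized.
    """
    if not (len(new_logps) == len(old_logps) == len(advantages)) or not advantages:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)                       # importance ratio
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)                # pessimistic bound
    return total / len(advantages)

# With an unchanged policy (ratio = 1) the surrogate is just the mean advantage.
print(ppo_clipped_surrogate([0.0, 0.0], [0.0, 0.0], [1.0, 3.0]))
```

The clipping keeps a single update from moving the policy too far from the one that produced the samples, which matters when each sample is as expensive as a generated video.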
5. Closed-Loop Harness and Best-of-N Generation
A defining feature is the closed-loop “harness” integration. CogVLM outputs a harness $\mathcal{H} = \{e_1, \dots, e_K\}$, a dynamic set of evaluators (e.g., Artifact Detector, Temporal Smoothness, Storyboard Annotation Following) relevant to the specific task. The generation process:
- For $N$ samples: each candidate $x_i$ from CogOmniDiT is scored by each evaluator $e_k \in \mathcal{H}$, giving a score $s_k(x_i)$.
- The final sample $x^*$ maximizes the expected harness score:
$$x^* = \arg\max_{i \in \{1, \dots, N\}} \frac{1}{K} \sum_{k=1}^{K} s_k(x_i)$$
Best-of-N (BoN) selection using the harness consistently improves alignment and quality metrics, but multiplies inference time (Yang et al., 19 May 2026).
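The selection step can be sketched as a mean over evaluator scores followed by an argmax over candidates. This is an illustrative sketch: the evaluators below are toy lookup tables with made-up scores, and treating the "expected harness score" as an unweighted mean over evaluators is an assumption.

```python
from typing import Callable, Sequence, Tuple

Evaluator = Callable[[str], float]   # maps a candidate video to a score

def harness_score(sample: str, harness: Sequence[Evaluator]) -> float:
    """Unweighted mean of all evaluator scores for one candidate."""
    return sum(evaluator(sample) for evaluator in harness) / len(harness)

def best_of_n(samples: Sequence[str], harness: Sequence[Evaluator]) -> Tuple[str, float]:
    """Return the candidate with the highest mean harness score."""
    if not samples or not harness:
        raise ValueError("need at least one sample and one evaluator")
    scored = [(harness_score(s, harness), s) for s in samples]
    best_score, best_sample = max(scored, key=lambda pair: pair[0])
    return best_sample, best_score

# Toy evaluators backed by fixed, made-up scores per candidate.
artifact_scores = {"a": 0.5, "b": 1.0, "c": 0.25}
smoothness_scores = {"a": 1.0, "b": 0.75, "c": 1.0}
harness = [artifact_scores.__getitem__, smoothness_scores.__getitem__]

winner, score = best_of_n(["a", "b", "c"], harness)
print(winner, score)
```

Each extra candidate costs one full generation plus one pass through every evaluator, which is where the additional inference time comes from.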
6. Empirical Benchmarks and Performance
Evaluation uses two custom professional benchmarks:
CogReasonBench: Targets CogVLM’s reasoning performance with ~3k anime storyboard→video pairs plus general reference→video. Metrics span Intent, Physics, Integrity, Motion (score 1–5, VLM-as-judge).
| Model | Avg Score |
|---|---|
| Qwen3-VL-8B-Instruct | 3.712 |
| Qwen3-VL-8B-Thinking | 3.752 |
| CogVLM (SFT) | 4.343 |
| CogVLM (RFT) | 4.473 |
CogControlBench: Tests end-to-end framework on 200 high-res clips, scoring Aesthetic & Image Quality (AQ, IQ, TF, MS, DD), Multimodal Intent & Content Following (MI, AF, SF, CF, DF), Identity Consistency (MN, IC), and Dynamic Plausibility (DP) (judged by Gemini 3.1-Pro, avg 0–1 scale).
| Model | Avg Score |
|---|---|
| VACE-Wan2.1 | 0.665 |
| VINO | 0.686 |
| CogOmniControl (single) | 0.727 |
| CogOmniControl + BoN | 0.733 |
| CogOmniControl + Harness BoN | 0.742 |
| Proprietary Seedance2.0 | 0.750 |
Ablation shows removing RFT from CogVLM or CogOmniDiT degrades results by 2–4 points; full SFT+RFT yields best overall and individual benchmark scores (Yang et al., 19 May 2026).
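The size of the gaps in the two tables above is easier to read as absolute and relative differences. The short script below only re-computes those differences from the reported averages; the helper name `gain` is illustrative, and no new measurements are introduced.

```python
# Average scores copied from the two benchmark tables.
cog_reason_bench = {
    "Qwen3-VL-8B-Instruct": 3.712,
    "Qwen3-VL-8B-Thinking": 3.752,
    "CogVLM (SFT)": 4.343,
    "CogVLM (RFT)": 4.473,
}
cog_control_bench = {
    "VACE-Wan2.1": 0.665,
    "VINO": 0.686,
    "CogOmniControl (single)": 0.727,
    "CogOmniControl + BoN": 0.733,
    "CogOmniControl + Harness BoN": 0.742,
    "Proprietary Seedance2.0": 0.750,
}

def gain(scores, model, baseline):
    """Absolute difference and relative difference (in percent) versus a baseline."""
    absolute = scores[model] - scores[baseline]
    relative_pct = 100.0 * absolute / scores[baseline]
    return round(absolute, 3), round(relative_pct, 1)

rft_vs_instruct = gain(cog_reason_bench, "CogVLM (RFT)", "Qwen3-VL-8B-Instruct")
rft_vs_sft = gain(cog_reason_bench, "CogVLM (RFT)", "CogVLM (SFT)")
harness_vs_vace = gain(cog_control_bench, "CogOmniControl + Harness BoN", "VACE-Wan2.1")
harness_vs_single = gain(cog_control_bench, "CogOmniControl + Harness BoN", "CogOmniControl (single)")
seedance_vs_harness = gain(cog_control_bench, "Proprietary Seedance2.0", "CogOmniControl + Harness BoN")

for name, value in [("RFT vs Instruct", rft_vs_instruct), ("RFT vs SFT", rft_vs_sft),
                    ("Harness BoN vs VACE", harness_vs_vace),
                    ("Harness BoN vs single", harness_vs_single),
                    ("Seedance2.0 vs Harness BoN", seedance_vs_harness)]:
    print(f"{name}: {value[0]:+.3f} ({value[1]:+.1f}%)")
```

Note that the two benchmarks use different scales (1 to 5 versus 0 to 1), so absolute differences are not comparable across tables; the relative figures are the safer comparison.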
7. Strengths, Limitations, and Future Directions
CogOmniControl advances controllable video generation by bridging the cognitive gap with domain-specialized intent reasoning, achieving robust multi-condition adherence via in-context diffusion, and leveraging a closed-loop harness for adaptive evaluator selection. Principal limitations are high training costs (due to large-scale RL fine-tuning) and increased inference latency from BoN sampling, which scales with the number of samples $N$.
Proposed future research directions include distilling lightweight imitations of CogVLM reasoning to reduce latency, enabling real-time interactive user corrections to the reasoning script, and expanding to additional control modalities such as audio, scene graphs, or narrative structure (Yang et al., 19 May 2026).