Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Published 14 May 2026 in cs.CV | (2605.15141v1)

Abstract: Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1–2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by ~4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM


Authors (9)

Summary

The paper introduces causal consistency distillation (causal CD) as the initialization for a few-step AR student; the resulting pipeline reduces Stage 2 training cost by ~4x and first-frame latency by 50% relative to 4-step chunk-wise Causal Forcing.
It demonstrates superior quality and efficiency over chunk-wise AR diffusion methods with improved metrics on benchmarks like VBench and VisionReward.
The approach extends to action-conditioned world models, highlighting its versatility for interactive simulations and control environments.

Causal Forcing++: Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Motivation and Context

Progress in interactive video generation has been driven by AR diffusion models, which combine low-latency autoregressive rollout with the expressivity of diffusion-based synthesis. However, current AR diffusion distillation methods mostly operate in a chunk-wise regime (typically with 4 sampling steps), which leaves response granularity coarse and sampling latency too high for truly real-time, interactive deployment. Distilling bidirectional teachers into few-step AR students has helped, but the initialization of the student remains a bottleneck: existing schemes are either target-misaligned (ODE initialization with bidirectional teachers), incapable of aggressive few-step generation (direct use of multi-step AR models), or too costly to scale (causal ODE initialization with AR teachers).

Methodological Innovations

Causal Forcing++ introduces a principled and scalable pipeline by leveraging causal consistency distillation (causal CD) for AR student initialization in the aggressive frame-wise, 1–2-step regime. The core insight is the equivalence in learning target between causal ODE distillation and causal CD—both seek to approximate the AR-conditional flow map (the consistency function) of the teacher model. Crucially, causal CD achieves supervision by querying the teacher with a single ODE step between adjacent timesteps, performed online on real video data, thus eschewing the necessity to precompute and store full PF-ODE trajectories.
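One way to write this objective down is sketched here. The notation is an assumption made for illustration and is not taken from the paper: $x^i_t$ is frame $i$ at noise level $t$, $x^{<i}$ is the clean history, $\Phi$ is the AR teacher's velocity field, $f_\theta$ is the student, $\theta^-$ is a stop-gradient (for example EMA) copy of the student, $d$ is a distance, and the teacher ODE step is taken to be a single Euler step.

```latex
% Generic consistency-distillation form with AR conditioning (assumed notation).
\hat{x}^{i}_{t_n}
  = x^{i}_{t_{n+1}} + (t_n - t_{n+1})\,
    \Phi\!\left(x^{i}_{t_{n+1}},\, t_{n+1} \mid x^{<i}\right),
\qquad
\mathcal{L}_{\text{causal CD}}(\theta)
  = \mathbb{E}\!\left[
      d\!\left(
        f_\theta\!\left(x^{i}_{t_{n+1}},\, t_{n+1} \mid x^{<i}\right),\;
        f_{\theta^-}\!\left(\hat{x}^{i}_{t_n},\, t_n \mid x^{<i}\right)
      \right)
    \right]
```

Under this reading, $x^{i}_{t_{n+1}}$ is obtained by noising a real frame, so only one teacher evaluation per training sample is needed and no full trajectory has to be stored.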

Stages of Training

Stage 1 (Teacher Forcing AR Diffusion Training): The AR diffusion model is trained with teacher forcing.
Stage 2 (Efficient Few-Step Initialization via Causal CD): Instead of costly initialization from precomputed ODE trajectories, causal CD imposes a local consistency objective between adjacent timesteps, which makes the initialization cheaper and easier to optimize.
Stage 3 (Asymmetric DMD and Self-Rollout): The student is further refined under asymmetric distribution matching distillation (DMD), with bidirectional teachers and AR students.
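The Stage 2 idea can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in rather than the authors' implementation: frames are scalars, the "teacher" is a closed-form velocity field for a next frame that is deterministic given the history, and the linear noising schedule, the Euler step, and all function names are assumptions.

```python
def teacher_velocity(x_t, t, history):
    # Toy AR teacher (assumption): the clean next frame is 0.9 * last history frame,
    # and x_t = (1 - t) * clean + t * noise, so the velocity is (x_t - clean) / t.
    clean = 0.9 * history[-1]
    return (x_t - clean) / t


def teacher_ode_step(x_t, t_from, t_to, history):
    # A single Euler step of the teacher ODE between adjacent timesteps.
    return x_t + (t_to - t_from) * teacher_velocity(x_t, t_from, history)


def causal_cd_loss(student, target, x0, history, t_hi, t_lo, noise):
    # Noise a real frame online, move it one teacher step toward t_lo,
    # then ask student and target network to agree on the clean frame.
    x_hi = (1.0 - t_hi) * x0 + t_hi * noise
    x_lo = teacher_ode_step(x_hi, t_hi, t_lo, history)
    return (student(x_hi, t_hi, history) - target(x_lo, t_lo, history)) ** 2


def exact_flow_map(x_t, t, history):
    # The true flow map of the toy teacher: jump from (x_t, t) to t = 0.
    return x_t - t * teacher_velocity(x_t, t, history)


def identity_map(x_t, t, history):
    # A deliberately wrong "student" that does no denoising at all.
    return x_t


history = [1.0, 2.0]          # clean past frames (teacher forcing)
x0 = 0.9 * history[-1]        # the real next frame in this toy world
loss_exact = causal_cd_loss(exact_flow_map, exact_flow_map, x0, history, 0.8, 0.6, 0.5)
loss_identity = causal_cd_loss(identity_map, identity_map, x0, history, 0.8, 0.6, 0.5)
```

In this toy, the exact flow map gives a loss of zero up to floating-point error, while the identity map is penalized. No stored trajectory is involved, only one teacher step per sample.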

Empirical Evaluation

Extensive benchmarking on Wan2.1-1.3B establishes the superiority of Causal Forcing++ under frame-wise, 2-step configurations. The approach demonstrates:

Quantitative Gains:
- VBench Total: 84.14 (+0.1 over prior SOTA)
- VBench Quality: 84.89 (+0.3)
- VisionReward: 6.66 (+0.335)
Efficiency:
- 50% reduction in first-frame latency (from 0.60s to 0.27s)
- ~4x reduction in Stage 2 training cost (from ~11,600 to ~2,900 A800 GPU hours)
- Elimination of auxiliary trajectory storage requirements (~1,900 GiB saved)
Qualitative Outcome: Generated videos show higher visual fidelity, better dynamics, and stronger object consistency than prior chunk-wise and frame-wise AR distillation schemes.
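As a quick sanity check on the efficiency figures quoted above, the ratios can be recomputed directly from the listed before/after values. The helper name is illustrative only.

```python
def relative_reduction(before, after):
    # Fraction by which `after` is smaller than `before`.
    return (before - after) / before


# Values as quoted in the list above.
latency_reduction = relative_reduction(0.60, 0.27)   # first-frame latency, seconds
stage2_speedup = 11600 / 2900                        # A800 GPU hours, before / after

print(f"latency reduction: {latency_reduction:.0%}")
print(f"Stage 2 cost ratio: {stage2_speedup:.1f}x")
```

The GPU-hour figures give a ratio of 4, matching the "~4x" claim. The quoted latencies work out to a reduction of about 55%, slightly more than the rounded 50% headline figure.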

Ablation studies confirm that strong few-step initialization is essential: initializing from a multi-step AR diffusion model collapses in the aggressive few-step regime, whereas causal CD provides both robust quality and scalability.

Analysis of Initialization Alternatives

The paper rigorously analyzes causal score distillation (causal DMD) and demonstrates its inferiority for initialization, despite sharper early frames. The reverse-KL, mode-seeking nature of DMD exacerbates exposure bias during AR rollout, propagating historical errors and degrading sample quality in later frames. In contrast, forward-KL, mode-covering causal CD maintains greater robustness to accumulated history errors, underscoring its suitability as a few-step initialization mechanism.
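The generic difference between mode-seeking and mode-covering objectives can be seen in a small numerical toy, which is not an experiment from the paper. A single fixed-width Gaussian is fitted to a two-mode target on a 1-D grid, once by minimizing forward KL and once by minimizing reverse KL; the grid, the mixture weights, and the search over means are all assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def kl(p, q):
    # KL(p || q) for two discrete distributions on the same grid.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


xs = [-6.0 + 0.05 * i for i in range(241)]
# Two-mode target: a heavier mode at -2 and a lighter mode at +2.
p = normalize([0.6 * normal_pdf(x, -2.0, 0.5) + 0.4 * normal_pdf(x, 2.0, 0.5) for x in xs])
means = [-3.0 + 0.1 * k for k in range(61)]


def fit(objective):
    # Grid search over the mean of a single Gaussian with fixed std 0.5.
    return min(means, key=lambda m: objective(normalize([normal_pdf(x, m, 0.5) for x in xs])))


m_forward = fit(lambda q: kl(p, q))   # forward KL: mode-covering
m_reverse = fit(lambda q: kl(q, p))   # reverse KL: mode-seeking
```

Here the forward-KL fit lands near the overall mean of the target (about -0.4), spreading its mass between the modes, while the reverse-KL fit locks onto the heavier mode at -2. This only illustrates the generic KL behavior; the link to exposure bias in AR rollout is the paper's argument.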

Extension to Action-Conditioned World Models

Causal Forcing++ is shown to generalize to action-conditioned world modeling in a Genie3-style paradigm, distilling a bidirectional, camera-pose-conditioned generator into an interactive AR model. This extension integrates pose annotations and further supports interactive simulation and control regimes, demonstrating practical versatility.
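The interaction pattern this implies, one action consumed per generated frame with a two-step sampler, can be sketched as follows. This is a toy under stated assumptions and not the released code: frames and actions are scalars, `toy_flow_map` is a hypothetical stand-in for the distilled student that already knows the clean next frame, and the timesteps and re-noising rule follow a generic multistep consistency sampler.

```python
import random


def toy_flow_map(x_t, t, history, action):
    # Hypothetical student (assumption): ignores the noisy input and returns the
    # clean next frame, here "previous camera position plus action".
    return history[-1] + action


def sample_frame(history, action, timesteps=(1.0, 0.5), rng=None):
    # Two-step sampling: denoise from pure noise, re-noise to the next
    # timestep, and denoise again.
    rng = rng or random.Random(0)
    x_t = rng.gauss(0.0, 1.0)
    frame = toy_flow_map(x_t, timesteps[0], history, action)
    for t in timesteps[1:]:
        x_t = (1.0 - t) * frame + t * rng.gauss(0.0, 1.0)
        frame = toy_flow_map(x_t, t, history, action)
    return frame


def interactive_rollout(first_frame, actions):
    # Frame-wise autoregression: each new frame sees all past frames
    # and exactly one new action.
    history = [first_frame]
    for action in actions:
        history.append(sample_frame(history, action))
    return history


frames = interactive_rollout(0.0, [1.0, 1.0, -1.0])
```

Because each frame is produced as soon as its action arrives, the response granularity is a single frame rather than a multi-frame chunk.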

Implications, Limitations, and Future Directions

Causal Forcing++ marks a shift towards scalable, low-latency, interactive AR diffusion model distillation. By aligning theoretical targets with efficient training procedures and demonstrating empirical gains, it unlocks practical deployments for real-time video applications and world modeling. The methodology is inherently extensible to streaming, multi-modal, and action-conditioned generation, broadening the applicability of AR diffusion architectures.

Potential future avenues include fully real-time action-conditioned interaction under frame-wise, 1–2-step generation, integration with RL-based world modeling protocols (Wang et al., 9 Feb 2026), and further refinement of initialization strategies to mitigate exposure bias in ultra-long or open-ended generative rollouts.

Conclusion

Causal Forcing++ establishes an efficient, scalable, and theoretically principled regime for few-step AR diffusion distillation, achieving state-of-the-art performance and latency reduction in real-time, interactive video generation. The causal CD approach obviates the resource bottlenecks of prior initialization schemes, facilitates high-fidelity synthesis, and enables extensions to action-conditioned world models, setting a new standard for practical autoregressive diffusion workflows (2605.15141).
