Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion (2506.08009v1)

Published 9 Jun 2025 in cs.CV, cs.AI, and cs.LG

Abstract: We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models. Project website: http://self-forcing.github.io/

Summary

  • The paper presents Self Forcing, a training paradigm that conditions each frame's generation on the model's own previously generated outputs, addressing the training–inference mismatch.
  • It employs a holistic video-level loss and a rolling KV cache mechanism to improve efficiency and achieve real-time video synthesis at 17 FPS.
  • The method mitigates exposure bias in autoregressive models, enabling practical deployment in live streaming, gaming, and interactive applications.

Overview of Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

The paper "Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion" presents a novel approach to training autoregressive video diffusion models, addressing key issues inherent in prior methodologies. Specifically, it tackles the problem of exposure bias in autoregressive models—where models trained on perfect ground-truth data must generate sequences based on their own imperfect outputs during inference. The proposed solution, Self Forcing, employs a training paradigm that conditions frame generation on previously self-generated frames via autoregressive rollout and key-value (KV) caching, thereby aligning the training and inference process more closely.

Key Contributions

  1. Self Forcing Paradigm: Unlike traditional methods that denoise future frames conditioned on ground-truth context, Self Forcing conditions each frame's generation on the model's own outputs from previous steps. This aligns the training distribution with the inference distribution, mitigating exposure bias.
  2. Holistic Loss Evaluation: The model optimizes a holistic loss at the video level rather than traditional frame-wise objectives. This loss evaluates the quality of the entire generated sequence, facilitating better supervision of the autoregressive process.
  3. Efficiency Techniques: To balance computational cost and performance, a few-step diffusion model is employed alongside a stochastic gradient truncation strategy. This ensures the training remains computationally feasible while retaining the benefits of holistic loss evaluation.
  4. Rolling KV Cache Mechanism: A rolling KV cache enables efficient autoregressive video extrapolation, avoiding the repeated recomputation of attention over the full sequence that bidirectional video diffusion models require.
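A minimal sketch of the rolling-cache idea follows: keep only the most recent frames' keys and values in a fixed-capacity buffer so that attention cost and memory stay bounded during long rollouts. The class name, window size, and tensor shapes are illustrative assumptions, not the paper's actual interface.

```python
# Illustrative rolling KV cache: a fixed-capacity buffer of per-frame
# key/value tensors. Old entries are evicted as new frames are generated,
# so the attention context stays bounded regardless of rollout length.

from collections import deque
import torch

class RollingKVCache:
    def __init__(self, max_frames: int):
        self.keys = deque(maxlen=max_frames)    # oldest entries drop automatically
        self.values = deque(maxlen=max_frames)

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        """Store keys/values for one newly generated frame."""
        self.keys.append(k)
        self.values.append(v)

    def context(self):
        """Concatenate cached keys/values along the token axis for attention."""
        if not self.keys:
            return None, None
        return torch.cat(list(self.keys), dim=1), torch.cat(list(self.values), dim=1)

# Usage: generate many frames while attending only to the most recent few.
cache = RollingKVCache(max_frames=4)
for t in range(10):
    k_t = torch.randn(1, 16, 64)   # (batch, tokens_per_frame, head_dim), illustrative
    v_t = torch.randn(1, 16, 64)
    cache.append(k_t, v_t)
    K, V = cache.context()
    # K/V never exceed 4 frames x 16 tokens = 64 tokens, regardless of t
```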

Numerical Results

The experiments demonstrate the efficacy of Self Forcing for real-time streaming video generation. The model delivers video with sub-second latency on a single GPU while matching or even surpassing the quality of significantly slower, non-causal diffusion models. It achieves a throughput of 17 FPS, enabling real-time interactive applications where latency is critical.
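As a rough back-of-the-envelope check (an inference from the reported figure, not a number stated separately in the paper), a throughput of 17 FPS implies a per-frame generation time of about 1/17 s ≈ 59 ms, which is consistent with the sub-second latency claim for interactive use.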

Implications

The practical implications of Self Forcing are profound for areas requiring real-time video synthesis, such as live streaming, gaming, and interactive applications. Theoretically, it offers a significant advancement in bridging the train-test distribution gap—an issue faced across various domains utilizing autoregressive models.

Future Directions

This work opens several avenues for future research. As the field progresses, there is potential for exploring more sophisticated gradient truncation strategies or hybrid models combining the strengths of autoregressive and diffusion frameworks. Additionally, future work could focus on expanding the rolling KV cache approach to handle longer context lengths efficiently, paving the way for extended video generation with consistent quality.

Conclusion

Overall, Self Forcing represents a significant step forward in autoregressive video diffusion modeling. By conditioning generation on self-generated outputs during training, it directly addresses exposure bias, improving generation quality and enabling practical real-time AI video synthesis.
