Self-Forcing in Video Diffusion Training

Updated 4 July 2026

Self-Forcing is a training paradigm in video diffusion that replaces teacher-forcing with self-generated contexts to reduce exposure bias during autoregressive rollout.
It employs mechanisms like KV caching, few-step diffusion, and holistic video-level objectives to better match inference conditions with training.
Extensions such as Self-Forcing++, Mutual Forcing, and latent self-forcing in video-language models further enhance generation quality and operational efficiency.

Self-Forcing is a training and rollout paradigm for autoregressive video diffusion in which a model conditions each new frame or chunk on previously self-generated outputs rather than on teacher-forced ground truth. In its canonical formulation, Self Forcing addresses exposure bias by aligning training with inference-time rollout, couples that alignment to key-value (KV) caching and few-step diffusion, and evaluates generation with holistic video-level objectives rather than only frame-wise denoising losses. Subsequent work extended the idea to long-horizon correction beyond a short teacher’s horizon, teacher-free dual-mode audio-video generation, unified teacher-forcing/self-forcing distillation recipes, and latent reasoning in video-LLMs (Huang et al., 9 Jun 2025).

1. Definition, scope, and contrast with teacher-forcing

In autoregressive video diffusion, the core mismatch is that training commonly conditions on clean or externally noised ground-truth history, while inference conditions on the model’s own imperfect generations. Self Forcing replaces those data-side contexts with self-generated contexts during training. In the original formulation, the autoregressive factorization is

$p(x^{1:N})=\prod_{i=1}^{N} p(x^i\mid x^{<i}),$

and Self Forcing explicitly samples from the model rollout distribution

$\{x_\theta^{1:N}\}\sim p_\theta(x^{1:N})=\prod_{i=1}^{N} p_\theta(x^i\mid x^{<i}),$

so the training distribution is the one induced by the generator itself (Huang et al., 9 Jun 2025).

This differs from both teacher forcing and diffusion forcing. Teacher forcing denoises the next frame conditioned on clean ground-truth past frames. Diffusion forcing conditions on noisy past frames with independently sampled noise levels, but still uses data-side contexts rather than the model’s own rollout distribution. A central misconception is therefore to equate Self-Forcing with merely using noisier history; the defining change is oracle context versus self-generated context, not simply clean versus noisy context (Huang et al., 9 Jun 2025).
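The distinction between the three regimes comes down to where the conditioning history is drawn from. The following is a minimal Python sketch of that difference only; `toy_generator`, `build_context`, the noise model, and all numeric values are hypothetical stand-ins and are not taken from the paper.

```python
import numpy as np

def toy_generator(history, rng):
    """Hypothetical stand-in for one autoregressive step: predicts the next
    frame from the most recent context frame, plus a small model error."""
    last = history[-1] if history else np.zeros(4)
    return 0.9 * last + 0.1 + 0.05 * rng.standard_normal(4)

def build_context(mode, ground_truth, i, rng, generated=None):
    """Return the history used to predict frame i under each regime."""
    if mode == "teacher_forcing":
        # Clean ground-truth past frames (oracle context).
        return [f.copy() for f in ground_truth[:i]]
    if mode == "diffusion_forcing":
        # Ground-truth past frames, each noised at an independent level.
        ctx = []
        for f in ground_truth[:i]:
            sigma = rng.uniform(0.0, 1.0)
            ctx.append(np.sqrt(1.0 - sigma**2) * f + sigma * rng.standard_normal(f.shape))
        return ctx
    if mode == "self_forcing":
        # The model's own earlier outputs; no ground truth is consulted.
        return [f.copy() for f in generated[:i]]
    raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(0)
ground_truth = [np.full(4, float(k)) for k in range(5)]

# Self-generated rollout: each frame is conditioned on previous outputs.
generated = []
for _ in range(5):
    generated.append(toy_generator(generated, rng))

tf_ctx = build_context("teacher_forcing", ground_truth, 3, rng)
df_ctx = build_context("diffusion_forcing", ground_truth, 3, rng)
sf_ctx = build_context("self_forcing", ground_truth, 3, rng, generated)
```

In this sketch both `tf_ctx` and `df_ctx` are derived from `ground_truth`, whereas `sf_ctx` is derived only from `generated`, which mirrors the oracle-versus-self-generated contrast drawn above.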

Later work in causal distillation formalized the same distinction as a contrast between offline and on-policy causal training. In that framing, teacher-forcing is the stable, offline, forward-style objective using clean history, whereas self-forcing is the on-policy regime in which the model rolls out chunks sequentially with KV caching and is trained under its own inference-time context distribution. This recasts Self-Forcing as the autoregressive analogue of reverse-divergence refinement: it directly simulates autoregressive inference during training, but is correspondingly more sensitive to rollout instability and initialization quality (Zheng et al., 24 Jun 2026).

2. Canonical formulation in autoregressive video diffusion

The original Self Forcing method retains diffusion as the conditional generator inside each autoregressive step, but replaces teacher-forced context with self-generated context. For a frame $x^i$, the forward corruption is written as

$x_{t^i}^i = \Psi(x^i,\epsilon^i,t^i)=\alpha_{t^i}x^i+\sigma_{t^i}\epsilon^i,\qquad \epsilon^i\sim \mathcal N(0,I),$

and a denoiser $G_\theta$ predicts the added noise. Standard training minimizes the frame-wise denoising MSE

$\mathcal L_{\text{DM}}(\theta) = \mathbb E_{x^i,t^i,\epsilon^i} \left[ w_{t^i}\,\|\hat\epsilon_\theta^i-\epsilon^i\|_2^2 \right].$

Self Forcing departs from this objective regime by training on model rollouts rather than on isolated frame-wise denoising under data contexts (Huang et al., 9 Jun 2025).
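For reference, the forward corruption $\Psi$ and the frame-wise denoising loss that Self Forcing moves away from can be written in a few lines. This is a small numerical sketch, assuming arbitrary schedule values `alpha_t` and `sigma_t` for a single timestep and a unit weight; the "oracle" predictor is only there to show that the loss vanishes when the noise is recovered exactly.

```python
import numpy as np

def forward_corrupt(x, eps, alpha_t, sigma_t):
    """Psi(x, eps, t) = alpha_t * x + sigma_t * eps."""
    return alpha_t * x + sigma_t * eps

def denoising_mse(eps_hat, eps, weight=1.0):
    """Weighted squared L2 error between predicted and true noise."""
    return weight * float(np.sum((eps_hat - eps) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))       # one toy "frame"
eps = rng.standard_normal(x.shape)       # eps ~ N(0, I)
alpha_t, sigma_t = 0.8, 0.6              # assumed schedule values for one timestep
x_t = forward_corrupt(x, eps, alpha_t, sigma_t)

# An oracle that knows x can invert the corruption, so its loss is ~0.
eps_oracle = (x_t - alpha_t * x) / sigma_t
loss_oracle = denoising_mse(eps_oracle, eps)

# A trivial predictor that always outputs zero noise has a positive loss.
loss_zero = denoising_mse(np.zeros_like(eps), eps)
```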

To make such on-policy rollout feasible, the method uses a few-step diffusion backbone. It defines a subsequence of timesteps $\{t_0=0,t_1,\dots,t_T=1000\}$ and models each conditional generator as a composition of denoising maps

$p_\theta(x^i\mid x^{<i}) = f_{\theta,t_1}\circ f_{\theta,t_2}\circ \cdots \circ f_{\theta,t_T}(x^i_{t_T}), \qquad x^i_{t_T}\sim\mathcal N(0,I).$

Algorithmically, training initializes an empty output sequence and an empty KV cache, samples a random truncation step $s\sim\text{Uniform}\{1,\dots,T\}$, and for each frame denoises from $t_T$ down to $t_{s+1}$ without gradients, enables gradients at step $t_s$, produces $x_\theta^i$, appends it to the generated sequence, computes and caches its KV embeddings with gradients detached, and continues autoregressively on the self-generated history (Huang et al., 9 Jun 2025).

The memory-control mechanism is stochastic gradient truncation. Only the final denoising step of each frame receives gradient, the truncation index is sampled randomly, and gradients through the KV cache are detached. This keeps the graph short enough to be trainable while still supervising different diffusion steps across iterations (Huang et al., 9 Jun 2025).
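The control flow of this rollout with stochastic gradient truncation can be traced with a toy example. The sketch below is not the authors' implementation: it uses NumPy without automatic differentiation, `toy_denoise` and `renoise` are hypothetical placeholders for the denoiser and the forward corruption, the 4-step schedule is an assumption, and the `with_grad` flag only marks where a real training loop would enable gradients.

```python
import numpy as np

TIMESTEPS = [1000, 750, 500, 250]   # t_T ... t_1, an assumed 4-step schedule

def toy_denoise(x_t, t, kv_cache):
    """Hypothetical denoiser: blends x_t with the mean of the cached context."""
    ctx = np.mean(kv_cache, axis=0) if kv_cache else np.zeros_like(x_t)
    w = t / 1000.0
    return w * ctx + (1.0 - w) * x_t

def renoise(x0_hat, t, rng):
    """Re-corrupt a clean estimate to level t (simple linear schedule, assumed)."""
    a = 1.0 - t / 1000.0
    return a * x0_hat + (1.0 - a) * rng.standard_normal(x0_hat.shape)

def self_forcing_rollout(num_frames, shape, rng):
    T = len(TIMESTEPS)
    s = int(rng.integers(1, T + 1))           # truncation step, uniform on {1..T}
    outputs, kv_cache, grad_log = [], [], []
    for i in range(num_frames):
        x = rng.standard_normal(shape)        # start each frame from noise at t_T
        for j in range(T, s - 1, -1):         # j = T, T-1, ..., s
            t = TIMESTEPS[T - j]
            with_grad = (j == s)              # only the last executed step gets gradient
            grad_log.append((i, j, with_grad))
            x0_hat = toy_denoise(x, t, kv_cache)
            if not with_grad:
                # Move to the next, lower noise level t_{j-1}.
                x = renoise(x0_hat, TIMESTEPS[T - j + 1], rng)
        outputs.append(x0_hat)
        kv_cache.append(x0_hat.copy())        # cached context, treated as detached
    return outputs, grad_log, s

outputs, grad_log, s = self_forcing_rollout(3, (2, 2), np.random.default_rng(0))
grad_steps = [(i, j) for (i, j, flag) in grad_log if flag]
```

Each frame contributes exactly one gradient-carrying step in `grad_steps`, and every later frame is conditioned on entries of `kv_cache` that were produced by the rollout itself.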

A second defining element is the shift from frame-wise denoising loss to holistic video-level distribution matching. For a divergence $D$, the target is expressed as

$\min_\theta\; D\big(p_{\text{data}}(x^{1:N})\,\|\,p_\theta(x^{1:N})\big),$

with both distributions optionally noised to align $p_{\text{data}}$ and $p_\theta$. The paper instantiates this with three alternatives: Distribution Matching Distillation (DMD), Score Identity Distillation (SiD), and a relativistic GAN objective with regularization. The conceptual point is that Self Forcing evaluates the entire generated rollout under the actual rollout process, rather than optimizing only local denoising correctness under teacher-forced contexts (Huang et al., 9 Jun 2025).

3. Streaming generation, rolling KV cache, and reported performance

Self Forcing also introduces a rolling KV cache for long video extrapolation. Rather than recomputing the whole overlapping context, inference maintains a cache of size $L$, appends the KV state of each newly generated frame, and evicts the oldest entry when full. The reported amortized complexity is

$O(TL)$

for long video extrapolation, compared with roughly $O(TL^2)$ for bidirectional diffusion sliding windows and $O(L^2+TL)$ for prior causal diffusion methods that still recompute overlapping KV states (Huang et al., 9 Jun 2025).
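The append-and-evict behaviour of a rolling cache maps naturally onto a bounded double-ended queue. The following sketch is illustrative only: `RollingKVCache` is a hypothetical class, the string `kv_i` stands in for real key/value tensors, and the recomputation count is a naive model of a method that rebuilds every in-window entry at each step.

```python
from collections import deque

class RollingKVCache:
    """Fixed-capacity cache: appending to a full cache evicts the oldest entry."""

    def __init__(self, max_frames):
        self.entries = deque(maxlen=max_frames)

    def append(self, frame_index, kv):
        self.entries.append((frame_index, kv))

    def frame_indices(self):
        return [idx for idx, _ in self.entries]

capacity = 3
cache = RollingKVCache(max_frames=capacity)

kv_computations = 0            # with caching, each frame's KV is computed once
for i in range(6):
    kv = f"kv_{i}"             # placeholder for per-frame key/value tensors
    kv_computations += 1
    cache.append(i, kv)

# Naive alternative: recompute KV for every frame in the window at each step.
recompute_cost = sum(min(i + 1, capacity) for i in range(6))
```

After six frames the cache holds frames 3, 4, and 5, with 6 KV computations in total, against 15 in the naive recomputation count.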

The method includes a training trick for rolling-KV stability: when denoising the final chunk during training, the model cannot attend to the first chunk. This local-attention training is intended to match long-horizon rolling inference conditions and reduce flicker and artifacts (Huang et al., 9 Jun 2025).
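One way to picture this restriction is as a chunk-level attention mask. The sketch below is an assumption about how such a mask could be laid out, not the paper's code: it starts from an ordinary causal mask over chunks and removes the single entry that would let the final chunk attend to the first one.

```python
import numpy as np

def chunk_attention_mask(num_chunks, block_first_for_last=True):
    """mask[q, k] is True where query chunk q may attend to key chunk k."""
    # Causal: each chunk sees itself and all earlier chunks.
    mask = np.tril(np.ones((num_chunks, num_chunks), dtype=bool))
    if block_first_for_last and num_chunks > 1:
        # Training trick: the final chunk cannot attend to the first chunk.
        mask[num_chunks - 1, 0] = False
    return mask

mask = chunk_attention_mask(4)
```

With four chunks, the last row of `mask` allows attention to chunks 1 to 3 only, which imitates a rolling cache that has already evicted the oldest chunk.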

The implementation reported in the original study uses Wan2.1-T2V-1.3B, a flow-matching video model generating 5-second videos at 16 FPS and $832\times 480$ resolution. The few-step model uses 4 denoising steps with timesteps $[1000, 750, 500, 250]$, and both frame-wise and chunk-wise autoregressive variants are evaluated. Training uses 64 H100 GPUs: DMD training converges in about 1.5 hours, while SiD and GAN take about 2–3 hours (Huang et al., 9 Jun 2025).

The reported performance claims are framed around real-time streaming generation with sub-second latency on a single GPU. Chunk-wise Self Forcing is reported at 17.0 FPS, 0.69 s latency, VBench total score 84.31, quality score 85.07, and semantic score 81.28. Frame-wise Self Forcing is reported at 8.9 FPS, 0.45 s latency, VBench total score 84.26, quality score 85.25, and semantic score 80.30. In the same comparison table, Wan2.1 is reported at 0.78 FPS and 103 s latency, SkyReels-V2 at 0.49 FPS and 112 s latency, MAGI-1 at 0.19 FPS and 282 s latency, and CausVid at 17.0 FPS and 0.69 s latency, with Self Forcing improving CausVid’s total score by 3.11 VBench points (Huang et al., 9 Jun 2025).
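The relative speedups implied by these figures follow from simple division. The snippet below only restates the numbers quoted above and derives ratios from them; the dictionary layout and variable names are illustrative.

```python
# Throughput (FPS) and latency (seconds) as quoted in the comparison above.
reported = {
    "Self Forcing (chunk-wise)": {"fps": 17.0, "latency_s": 0.69},
    "Self Forcing (frame-wise)": {"fps": 8.9, "latency_s": 0.45},
    "Wan2.1": {"fps": 0.78, "latency_s": 103.0},
    "SkyReels-V2": {"fps": 0.49, "latency_s": 112.0},
    "MAGI-1": {"fps": 0.19, "latency_s": 282.0},
}

ref_name = "Self Forcing (chunk-wise)"
ref = reported[ref_name]

# Ratios of chunk-wise Self Forcing against each other entry.
speedups = {
    name: {
        "throughput_x": ref["fps"] / row["fps"],
        "latency_x": row["latency_s"] / ref["latency_s"],
    }
    for name, row in reported.items()
    if name != ref_name
}

# CausVid total score implied by the quoted 3.11-point improvement.
implied_causvid_total = 84.31 - 3.11
```

On these figures, chunk-wise Self Forcing is roughly 22 times faster than Wan2.1 in throughput and roughly 149 times lower in latency, and the implied CausVid total score is about 81.20.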

The reported limitations are also specific. Quality drops when generating much longer videos than the training context length, and the memory-saving truncation that only backpropagates through the final denoising step may restrict learning of very long-range dependencies. These two points became central motivations for later work on longer-horizon correction and better memory mechanisms (Huang et al., 9 Jun 2025).

4. Long-horizon scaling and the KV-cache bottleneck

A systems-level reformulation of Self-Forcing emerged when long-horizon rollout was treated not only as a modeling problem but as a cache-management problem. In the Wan2.1-based Self-Forcing stack studied in 2026, self-forcing video generation extends a short-horizon model to longer rollouts by repeatedly feeding generated content back as context, which causes the KV cache to grow with rollout length. The study evaluates 33 quantization and cache-policy variants over 610 prompt-level observations and 63 benchmark-level summaries on MovieGen and StoryEval, jointly measuring peak VRAM, runtime, realized compression ratio, VBench imaging quality, BF16-referenced fidelity, and terminal drift (Ranganath et al., 29 Mar 2026).

The scaling intuition is summarized as

$M_{\text{KV}} \propto N,$

and more concretely

$M_{\text{KV}} \propto N\cdot N_\ell\cdot N_h\cdot d_h,$

where $N$ is rollout length, $N_\ell$ the number of layers, $N_h$ the number of heads, and $d_h$ the head dimension. The study also makes the practical point that even if the cache is quantized by a factor $c$, measured peak VRAM can still differ sharply from the idealized compressed footprint because attention reads and refresh stages may reconstruct or retain large BF16 buffers. Its summary statement is explicit: a nominal KV compression ratio does not by itself guarantee a matching reduction in peak VRAM.

This is presented as a systems artifact rather than a failure of compression theory (Ranganath et al., 29 Mar 2026).
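A back-of-the-envelope calculation makes the gap between nominal compression and peak memory concrete. This is a simplified accounting sketch under stated assumptions: the factor of 2 covers keys and values, `tokens_per_frame` and all configuration values are hypothetical and are not taken from the study, and the "transient peak" models only the case where a compressed cache and a fully reconstructed BF16 copy coexist briefly.

```python
def kv_cache_bytes(frames, tokens_per_frame, layers, heads, head_dim, bits_per_value):
    """Idealized KV footprint: grows linearly with the number of cached frames."""
    values = 2 * frames * tokens_per_frame * layers * heads * head_dim  # K and V
    return values * bits_per_value // 8

# Hypothetical model configuration, for illustration only.
config = dict(tokens_per_frame=1024, layers=24, heads=16, head_dim=64)

bf16_bytes = kv_cache_bytes(frames=120, bits_per_value=16, **config)
int4_bytes = kv_cache_bytes(frames=120, bits_per_value=4, **config)

ideal_ratio = bf16_bytes / int4_bytes          # nominal compression ratio

# If attention reads reconstruct a dense BF16 copy while the INT4 cache is
# still resident, the transient peak exceeds the uncompressed footprint.
transient_peak = int4_bytes + bf16_bytes

doubled = kv_cache_bytes(frames=240, bits_per_value=16, **config)
```

Here `ideal_ratio` is 4.0, yet `transient_peak` is larger than `bf16_bytes`, which is the kind of discrepancy the study attributes to the integration rather than to the quantizer.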

The strongest practical operating region is identified as a FlowCache-inspired soft-prune INT4 adaptation, specifically FLOWCACHE_SOFT_PRUNE_INT4. The study also highlights PRQ_INT4 and QUAROT_KV_INT4 as fidelity leaders that are operationally expensive. The key reported tradeoffs are as follows.

Method	Reported strengths	Reported costs
FLOWCACHE_SOFT_PRUNE_INT4	MovieGen: 5.49× compression, 11.71 GB peak VRAM, 75.0 s runtime, 0.739 imaging, 0.738 drift-last; StoryEval: 5.42× compression, 11.76 GB peak VRAM, 75.2 s runtime, 0.680 imaging, 0.679 drift-last	modest runtime overhead
PRQ_INT4	MovieGen: SSIM 0.824, LPIPS 0.082; StoryEval: SSIM 0.724, strong drift-last quality	runtime around 160 s; peak VRAM around 20.69 GB
QUAROT_KV_INT4	preserves quality better than plain RTN	runtime around 236–240 s; peak VRAM around 19.98 GB

A major negative finding is that nominal KV compression does not guarantee lower peak VRAM. The current integration may reconstruct dense BF16 buffers during attention reads, retain BF16 regions intentionally for recent context, or briefly hold both compressed and reconstructed states during refresh or update phases. The study cites QUAROT_KV_INT4, RTN_INT4_RECENT2, and RTN_INT4_REFRESH as examples of methods that can compress the KV cache mathematically yet still peak near or above BF16 memory in practice (Ranganath et al., 29 Mar 2026).

This systems perspective materially changes the meaning of deployable Self-Forcing. It suggests that longer rollouts require not only better generation quality but also attention paths that can consume compressed cache natively and refresh mechanisms that avoid transient BF16 spikes. The paper therefore contributes not only a comparison of 33 methods but also a benchmark harness and empirical map of which KV-cache ideas are practical today and which remain research directions (Ranganath et al., 29 Mar 2026).

5. Extensions, successors, and unifications

Self-Forcing rapidly became a template for successor frameworks. One direction, Self-Forcing++, targets the regime beyond the short teacher’s horizon. Its central mechanism is to generate a long video with the autoregressive student itself, sample contiguous windows from that long rollout, add noise back to the student’s own denoised latent through backward noise initialization, and distill teacher corrections on those sampled segments. The method keeps temporal consistency while scaling video length by up to 20× beyond the teacher’s capability, reports generation up to 255 seconds, and reaches 1023 frames out of the base model’s 1024 latent-frame span, described as 99.9% of the maximum supported span (Cui et al., 2 Oct 2025).
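The data-preparation side of this mechanism, sampling a contiguous window from a long self-generated rollout and noising it again, can be sketched as follows. The teacher-correction step is deliberately omitted, and the function names, window length, latent dimensionality, and schedule values are all hypothetical choices for illustration.

```python
import numpy as np

def sample_window(long_rollout, window, rng):
    """Pick a contiguous segment of `window` frames from a longer rollout."""
    start = int(rng.integers(0, len(long_rollout) - window + 1))
    return start, long_rollout[start:start + window]

def backward_noise_init(clean_latents, alpha_t, sigma_t, rng):
    """Add noise back to the student's own denoised latents."""
    eps = rng.standard_normal(clean_latents.shape)
    return alpha_t * clean_latents + sigma_t * eps

rng = np.random.default_rng(0)

# Stand-in for a long student rollout: 200 latent frames of dimension 16.
long_rollout = rng.standard_normal((200, 16))

start, segment = sample_window(long_rollout, window=21, rng=rng)
noised_segment = backward_noise_init(segment, alpha_t=0.7, sigma_t=0.3, rng=rng)
# A teacher model would then score or correct `noised_segment`; not shown here.
```

Because the window can start anywhere in the rollout, segments far beyond the teacher's own horizon are still presented to it as short clips.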

A second direction, Mutual Forcing, transfers the self-generated-history idea to fast autoregressive audio-video character generation. It is proposed as a teacher-free, dual-mode self-evolution framework in which a single weight-shared model operates in Few-step and Multi-step modes. Few mode predicts an interval-averaged velocity for large jumps, while Multi mode uses the same parameters as an instantaneous velocity predictor in a probability flow ODE. The Multi-step mode improves the Few-step mode via self-distillation, and the Few-step mode generates historical context during training to improve training-inference consistency. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, and allows direct learning from real paired data. Experiments report that strong baselines often use around 50 sampling steps or even 100 NFEs, whereas Mutual Forcing uses only 4 or 8 steps; the speed table reports 30 FPS at low resolution on 1 GPU, versus 0.6 FPS for Universe-1 and 1.3 FPS for Ovi (Zhou et al., 28 Apr 2026).
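The difference between an instantaneous velocity and an interval-averaged velocity can be seen on a one-dimensional toy ODE, $dx/dt=-x$, whose exact solution is known. This is an illustration of the two notions only, not of Mutual Forcing itself: in the sketch the average velocity is computed analytically, whereas in the method it would have to be learned, and the time direction and step counts are arbitrary choices.

```python
import math

def instantaneous_velocity(x, t):
    """Velocity field of the toy ODE dx/dt = -x."""
    return -x

def average_velocity(x, t, r):
    """Exact mean velocity over [t, r] for the toy ODE, starting from x at t."""
    return x * (math.exp(-(r - t)) - 1.0) / (r - t)

def multi_step(x, t0, t1, steps):
    """Many small Euler steps using the instantaneous velocity."""
    dt = (t1 - t0) / steps
    for k in range(steps):
        x = x + dt * instantaneous_velocity(x, t0 + k * dt)
    return x

def few_step(x, t0, t1, steps):
    """A few large jumps using the interval-averaged velocity."""
    dt = (t1 - t0) / steps
    for k in range(steps):
        t = t0 + k * dt
        x = x + dt * average_velocity(x, t, t + dt)
    return x

exact = math.exp(-1.0)                       # x(1) for x(0) = 1
x_multi = multi_step(1.0, 0.0, 1.0, steps=50)
x_few = few_step(1.0, 0.0, 1.0, steps=4)
```

With an accurate average velocity, four large jumps land on the exact solution up to rounding, while fifty Euler steps still carry a small discretization error; this is the sense in which a few-step mode can be cheaper without being less accurate.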

A third line of work, Causal-rCM, treats self-forcing as the reverse-divergence, on-policy refinement stage within a three-stage recipe: teacher-forcing conversion of a bidirectional diffusion model into a causal model, teacher-forcing consistency distillation into a few-step student, and self-forcing DMD refinement on self-generated rollouts. The paper’s explicit claim is that teacher-forcing CM is currently the best complement to self-forcing DMD as an initialization strategy. It also reports the first implementation of teacher-forcing-based continuous-time CMs for autoregressive video diffusion, enabled by a custom-mask FlashAttention-2 JVP kernel, with over 10× faster convergence compared to discrete-time CMs. In the reported Wan2.1 setup, the distilled 2-step causal Wan2.1-1.3B model achieves a VBench-T2V score of 84.63 with only 1 or 2 sampling steps (Zheng et al., 24 Jun 2026).

Taken together, these frameworks preserve the central Self-Forcing insight, training under self-generated history, while differing on whether the system remains teacher-based, becomes teacher-free, or combines teacher-forcing and self-forcing sequentially. A plausible implication is that “Self-Forcing” now denotes both a specific 2025 method and a broader family of on-policy autoregressive training strategies.

6. Latent self-forcing in video-LLMs

The term also acquired a distinct meaning in multimodal LLMs. In VideoLatent, latent self-forcing refers to a training paradigm in which the model self-generates continuous latent reasoning states and is forced to keep those latent states aligned with the video and question context through contrastive objectives, using only standard video-question-answer triplets. The method explicitly distinguishes itself from pseudo-label self-training, chain-of-thought supervision, and latent reasoning methods that require auxiliary supervision such as CoT traces, helper images, bounding boxes, or pretrained visual foundation models (Hu et al., 22 Jun 2026).

Architecturally, VideoLatent adds a latent mode and a latent injection module to an MLLM backbone. After a special token <|video_latent_start|>, the model can generate latent thoughts up to a maximum number of steps or until <|video_latent_end|>. The latent injection module combines three cross-attention paths to keep latent reasoning grounded in video and question context (Hu et al., 22 Jun 2026).

The overall training objective includes terms for latent-video alignment, latent-question alignment, inter-latent diversity, and intra-latent diversity. The paper reports training on 81k video-question-answer triplets, together with specific settings for the memory bank size, the temperature, and two further hyperparameters (Hu et al., 22 Jun 2026).
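To give a feel for what a contrastive alignment term does, the sketch below implements a generic InfoNCE-style loss between latent states and video features. This is an assumption about the general family of objective, not the loss defined in the paper: the function name, the temperature of 0.1, the batch size, and the feature dimension are all hypothetical.

```python
import numpy as np

def info_nce(latents, targets, temperature=0.1):
    """Generic InfoNCE: row i of `latents` should match row i of `targets`."""
    z = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    v = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = z @ v.T / temperature                       # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
video_feats = rng.standard_normal((8, 32))               # toy per-sample video features

# Latents that track their own video closely versus latents paired with the wrong video.
aligned_latents = video_feats + 0.01 * rng.standard_normal((8, 32))
mismatched_latents = np.roll(video_feats, 1, axis=0)

loss_aligned = info_nce(aligned_latents, video_feats)
loss_mismatched = info_nce(mismatched_latents, video_feats)
```

The aligned latents receive a much lower loss than the mismatched ones, which is the pressure that keeps self-generated latent states tied to their own video rather than drifting freely.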

The empirical effect is reported through both benchmark comparisons and ablations. On a 16-frame setting, latent self-forcing improves over SFT on MVBench from 65.8 to 68.3, TempCompass from 72.4 to 73.7, Video-MME from 57.5 to 59.3, LongVideoBench from 53.0 to 57.3, VCR-Bench from 46.3 to 49.6, and VideoMathQA from 28.8 to 30.5, with a slight drop on Video-TT from 43.9 to 43.4. Relative to Video-R1, the paper reports about 6× lower training overhead and about 68× lower inference overhead. This broadens the term “self-forcing” from autoregressive generation to implicit latent reasoning, while preserving the central idea of training on self-generated internal trajectories that are then regularized for fidelity and diversity (Hu et al., 22 Jun 2026).

7. Distinct non-generative usage in forcing theory

A conceptually separate use of related language appears in set theory and philosophical logic. In the forcing version of Yablo’s paradox, forcing is used to separate names from objects: a ground model $V$ contains a name for an infinite family of sentences, a generic filter $G$ interprets that name, and the extension $V[G]$ contains the realized object. The construction uses a Prikry-style forcing notion with measurable cardinal $\kappa$, normal ultrafilter $\mathcal U$ on $\kappa$, and conditions consisting of a finite increasing stem, a finite Yablo-like sentence fragment, and an ultrafilter set (Garti, 2021).

The crucial structural move is that each condition contains only a finite approximation of the eventual sequence, while the union of these approximations along the generic filter produces an infinite family in the extension. The paper states that this forcing notion is $\kappa^+$-cc, satisfies the Prikry property, and that its direct-extension order is $\kappa$-closed, hence all cardinals are preserved in the generic extension. The resulting paradox is therefore generated not by direct self-reference inside a single sentence, but by the passage from finite approximations in the ground model to a paradoxical totality in the extension (Garti, 2021).

This usage is not part of the machine-learning lineage of Self-Forcing. Its relevance is terminological and philosophical: it uses forcing to show how a paradoxical object can be manufactured through name/object separation and generic realization, yielding what the paper describes as a form of self-forcing through the ground-model/extension divide. The connection to the generative-model sense is therefore analogical rather than technical (Garti, 2021).