Diffusion-Forcing: Causal Sequence Denoising

Updated 4 July 2026

Diffusion-Forcing is a sequence-generation paradigm where each token receives an independent noise level for causal denoising.
It unifies next-token prediction with full-sequence diffusion, enabling stable long-horizon rollouts in applications like video prediction and planning.
The method employs a 2D denoising schedule and causal latent dynamics to condition on uncertain pasts, enhancing robustness and control.

Diffusion-Forcing is a sequence-generation paradigm in which a diffusion model is trained to denoise a set of tokens with independent per-token noise levels, rather than applying one global diffusion level to an entire sequence. In its original formulation, the method was introduced as a way to unify next-token prediction and full-sequence diffusion: it preserves causal, variable-length generation while retaining diffusion’s controllability, long-horizon guidance, and robustness on continuous high-dimensional sequences such as video (Chen et al., 2024). Subsequent work has treated the same core idea, causal denoising under arbitrary, token-dependent uncertainty, as a reusable design pattern for online motion reconstruction, multi-agent interaction modeling, autonomous-driving planning, data assimilation, multimodal robot learning, streaming motion generation, discrete diffusion LLMs, and autoregressive video/world-model distillation.

1. Conceptual definition

The central move in Diffusion-Forcing is to assign an independent noise level to each token in a sequence and train a causal model to denoise whatever subset of tokens happens to be noised (Chen et al., 2024). In the notation used by the original paper, clean tokens are written as $x_t^0 \equiv x_t$, a token noised to diffusion level $k$ is $x_t^k$, and the full training object is a sequence of the form

$(x_1^{k_1}, \dots, x_T^{k_T}),$

where each $k_t$ can differ.

This formulation departs from standard full-sequence diffusion, in which all tokens share the same diffusion level at a given denoising step, and from ordinary next-token prediction, which is trained only on the immediate next token given the exact past (Chen et al., 2024). The original paper explicitly frames the difference as a change in what “masking” means: a token at noise level $0$ is fully visible, while a token at noise level $K$ is essentially pure noise and therefore fully masked. Diffusion-Forcing thus treats sequence modeling as denoising under partial, tokenwise uncertainty rather than prediction from a pristine prefix.

That reinterpretation is why the method is presented as a unification of Bayesian filtering along time and diffusion along uncertainty/noise (Chen et al., 2024). At one extreme, if the past is clean and only the current token is uncertain, the behavior approaches next-token prediction. At the other extreme, if every token is equally noisy, the behavior approaches full-sequence diffusion. Between those limits lie schedules that neither classical autoregression nor ordinary diffusion supports naturally, including variable-horizon rollout, subsequence completion, and causal long-horizon guidance.
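The two extremes and the space between them can be written down as vectors of per-token noise levels. The sketch below is only an illustration: the value of `K`, the function names, and the linear ramp in `mixed_levels` are assumptions made here, not details from the original paper.

```python
K = 1000  # assumed maximum diffusion level; 0 means clean, K means pure noise


def next_token_levels(num_tokens: int, current: int) -> list[int]:
    """Clean past; the current token and everything after it are fully masked."""
    return [0 if i < current else K for i in range(num_tokens)]


def full_sequence_levels(num_tokens: int, level: int) -> list[int]:
    """Every token shares one level, as in ordinary full-sequence diffusion."""
    return [level] * num_tokens


def mixed_levels(num_tokens: int, clean_prefix: int) -> list[int]:
    """An in-between pattern: clean prefix, then uncertainty that grows with distance."""
    future = num_tokens - clean_prefix
    ramp = [round(K * (i + 1) / future) for i in range(future)]
    return [0] * clean_prefix + ramp


print(next_token_levels(5, 3))       # [0, 0, 0, 1000, 1000]
print(full_sequence_levels(5, 400))  # [400, 400, 400, 400, 400]
print(mixed_levels(5, 2))            # [0, 0, 333, 667, 1000]
```

A single model trained on independently sampled levels is expected to handle all three patterns, which is what distinguishes the approach from methods that only ever see one of them.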

2. Training formulation and causal denoising

The original training setup samples per-token noise levels independently from $[K]^T$, noises each $x_t$ to the corresponding level $k_t$, and trains a causal latent model to predict the denoising target (Chen et al., 2024). The causal form, called Causal Diffusion Forcing (CDF), introduces a latent state $z_t$ that is updated recurrently from the previous latent $z_{t-1}$ and the current noisy token $x_t^{k_t}$. The latent dynamics are causal: $z_{t-1}$ summarizes all past noisy tokens, and the model predicts the denoising target for the current token conditioned on $z_{t-1}$ and the current noise level $k_t$.
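A minimal sketch of the data side of one training step, assuming scalar tokens, a cosine-style variance-preserving forward process, and an epsilon-prediction target. All three choices, along with the names `alpha_bar` and `noise_sequence`, are assumptions for illustration; the text above only fixes that each token gets its own independently sampled level.

```python
import math
import random

K = 100  # assumed number of diffusion levels


def alpha_bar(k: int) -> float:
    """Assumed signal fraction: 1.0 at k = 0 (clean), about 0.0 at k = K (noise)."""
    return math.cos(0.5 * math.pi * k / K) ** 2


def noise_sequence(clean_tokens, rng):
    """Noise every token to its own independently sampled level."""
    levels, noisy, targets = [], [], []
    for x in clean_tokens:
        k = rng.randint(0, K)        # independent level per token
        eps = rng.gauss(0.0, 1.0)
        a = alpha_bar(k)
        noisy.append(math.sqrt(a) * x + math.sqrt(1.0 - a) * eps)
        levels.append(k)
        targets.append(eps)          # epsilon target (an assumption)
    return levels, noisy, targets


def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


rng = random.Random(0)
clean = [0.1, 0.4, -0.3, 0.8]
levels, noisy, targets = noise_sequence(clean, rng)

# A real model would map (previous latent, noisy token, level) to a prediction.
# Predicting zero everywhere is only a placeholder so the loss can be evaluated.
loss = mse([0.0] * len(targets), targets)
```

In an actual implementation the placeholder prediction would be replaced by the causal model's output, computed token by token from the recurrent latent.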

The original paper characterizes this causal update as a hybrid between a Bayes filter and a diffusion model (Chen et al., 2024). If $k_t = 0$, the model is seeing the clean token and performs a posterior-like update. If $k_t = K$, the token is just noise, so the latent update reduces to a prior-like transition. This interpretation is important because it explains why the same trained system can generate one token, several tokens, or an arbitrarily long sequence: it is trained to handle arbitrary amounts of past context and arbitrary current-token uncertainty.
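The posterior-versus-prior behaviour can be illustrated with a scalar linear-Gaussian filter in which the token's noise level sets the observation variance. This is an analogy constructed here, not the learned update from the paper; `obs_variance`, `latent_update`, and the process variance are hypothetical.

```python
import math

K = 10  # assumed maximum noise level


def obs_variance(k: int) -> float:
    """Observation variance grows with the noise level: 0 at k = 0, infinite at k = K."""
    if k == K:
        return math.inf
    frac = k / K
    return frac / (1.0 - frac)


def latent_update(mean, var, token, k, process_var=0.5):
    """One Kalman-style step: predict, then correct by how much the token is trusted."""
    pred_mean, pred_var = mean, var + process_var    # prior-like transition
    gain = pred_var / (pred_var + obs_variance(k))   # 1 for clean, 0 for pure noise
    new_mean = pred_mean + gain * (token - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var


print(latent_update(0.0, 1.0, 2.0, k=0))  # (2.0, 0.0): clean token, full posterior update
print(latent_update(0.0, 1.0, 2.0, k=K))  # (0.0, 1.5): pure noise, prior transition only
```

Intermediate levels give a gain strictly between 0 and 1, so a partially noised token moves the latent only part of the way toward the observed value.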

At sampling time, Diffusion-Forcing uses a 2D denoising schedule

$\mathcal{K} \in [K]^{M \times T},$

where $\mathcal{K}_{m,t}$ specifies the noise level of token $t$ at denoising row $m$ (Chen et al., 2024). This schedule matrix is the operational counterpart of independent per-token noise at training time. It permits some tokens to be nearly denoised while others remain highly uncertain, and it permits uncertainty to vary over both denoising depth and temporal position.
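One way to build such a schedule matrix is to let every token run the same level sequence from the maximum down to 0, with each later token starting a fixed number of rows after its predecessor. The `delay` parameterisation below is an assumption chosen for illustration; it happens to recover full-sequence diffusion at `delay = 0` and strict autoregression when `delay` equals the maximum level.

```python
def schedule_matrix(num_tokens: int, max_level: int, delay: int) -> list[list[int]]:
    """Rows are denoising passes, columns are tokens, entries are noise levels."""
    rows = max_level + (num_tokens - 1) * delay + 1
    return [
        [min(max_level, max(0, max_level - (m - t * delay))) for t in range(num_tokens)]
        for m in range(rows)
    ]


for row in schedule_matrix(num_tokens=3, max_level=2, delay=1):
    print(row)
# [2, 2, 2]
# [1, 2, 2]
# [0, 1, 2]
# [0, 0, 1]
# [0, 0, 0]
```

With `delay=1` the near future is always one level cleaner than the token after it, which gives a staggered pattern in which uncertainty grows with temporal distance.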

A direct consequence is that the model can condition on uncertain pasts rather than only exact histories (Chen et al., 2024). That point distinguishes Diffusion-Forcing from teacher forcing in the usual autoregressive sense. In ordinary next-token training, the model only learns from clean prefixes. In Diffusion-Forcing, the model is explicitly forced to recover tokens when earlier context itself may be partially noisy.

3. Sampling schedules, rollout, and subsequence likelihoods

Sampling in Diffusion-Forcing starts from white noise,

$x_1, \dots, x_T \sim \mathcal{N}(0, \sigma_K^2 I),$

and then iterates row-by-row, denoising left-to-right according to the 2D schedule matrix (Chen et al., 2024). This sampling design is what enables several behaviors highlighted in the original paper: stabilized autoregressive generation, “zig-zag” uncertainty schedules in which near-future tokens are denoised more than far-future tokens, and long-horizon guidance for planning.
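The row-by-row, left-to-right loop can be sketched as follows. The `toy_step` function is a hypothetical stand-in for the learned denoiser, chosen so that the final result is easy to check by hand; the loop structure is the part that mirrors the description above.

```python
import random


def sample(schedule, denoise_step, init_latent, rng):
    """Run a 2D schedule: rows are denoising passes, columns are tokens."""
    num_tokens = len(schedule[0])
    tokens = [rng.gauss(0.0, 1.0) for _ in range(num_tokens)]  # white-noise start
    levels = list(schedule[0])
    for row in schedule[1:]:
        latent = init_latent
        for t in range(num_tokens):  # left to right, so the latent stays causal
            tokens[t], latent = denoise_step(latent, tokens[t], levels[t], row[t])
        levels = list(row)
    return tokens


def toy_step(latent, token, k_from, k_to):
    """Hypothetical denoiser: the clean value of a token is the latent plus one."""
    clean_guess = latent + 1.0
    if k_from > 0:
        token = clean_guess + (token - clean_guess) * (k_to / k_from)
    return token, token  # the toy latent is simply the latest token value


pyramid = [[2, 2, 2], [1, 2, 2], [0, 1, 2], [0, 0, 1], [0, 0, 0]]
result = sample(pyramid, toy_step, init_latent=0.0, rng=random.Random(0))
print(result)  # [1.0, 2.0, 3.0]
```

Because each token is only finalised after its predecessor has reached level 0, the toy output does not depend on the random initialisation.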

The stabilized autoregressive mode is especially central. Rather than feeding a perfectly clean generated token back into the model, Diffusion-Forcing can feed it back as a slightly noisy token, which reduces compounding error on continuous sequences such as video (Chen et al., 2024). The same mechanism underlies the paper’s claim that the method can roll out sequences beyond the training horizon, whereas autoregressive baselines diverge.
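A rough sketch of the feedback step: each generated token is re-noised to a small level before it updates the latent, and the model is told that level. The forward process, the value of `k_feedback`, and the toy `generate` and `update_latent` callables are all assumptions; only the idea of feeding back a slightly noisy token comes from the text.

```python
import math
import random

K = 100  # assumed number of diffusion levels


def alpha_bar(k):
    """Assumed signal fraction at level k (1.0 when clean)."""
    return math.cos(0.5 * math.pi * k / K) ** 2


def renoise(token, k, rng):
    """Re-noise a generated token to level k; k = 0 returns it unchanged."""
    a = alpha_bar(k)
    return math.sqrt(a) * token + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)


def rollout(generate, update_latent, latent, steps, k_feedback, rng):
    outputs = []
    for _ in range(steps):
        token = generate(latent)
        outputs.append(token)
        fed_back = renoise(token, k_feedback, rng)             # slightly noisy copy
        latent = update_latent(latent, fed_back, k_feedback)   # level is passed along
    return outputs


generate = lambda latent: 0.9 * latent + 1.0      # toy generator
update_latent = lambda latent, token, k: token    # toy latent update
frames = rollout(generate, update_latent, 0.0, steps=5, k_feedback=5, rng=random.Random(0))
```

Setting `k_feedback=0` recovers ordinary autoregressive feedback, in which the model treats its own output as exact.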

Guidance is another major feature. The original paper states that one can use classifier guidance or more general energy-based objectives such as reward-to-go, and it introduces Monte Carlo Tree Guidance (MCTG), in which multiple future rollouts are sampled, their reward gradients are averaged, and current decisions are guided by expected future reward rather than a single sampled trajectory (Chen et al., 2024). Because future tokens can remain diffused while earlier tokens are partially denoised, gradients from long-horizon objectives can propagate backward through the causal latent dynamics.
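The averaging step in MCTG can be illustrated with a toy problem in which the future is the current action plus Gaussian noise and the reward is a negative squared distance to a goal. This setup, and the names `reward_grad` and `mctg_gradient`, are assumptions made for the sketch; it is not the paper's implementation.

```python
import random


def reward_grad(action, future_noise, goal):
    """Gradient of -(future - goal)^2 with respect to the action, for one rollout."""
    future = action + future_noise
    return -2.0 * (future - goal)


def mctg_gradient(action, goal, num_rollouts, rng):
    """Average the reward gradient over several sampled futures."""
    grads = [reward_grad(action, rng.gauss(0.0, 1.0), goal) for _ in range(num_rollouts)]
    return sum(grads) / len(grads)


rng = random.Random(0)
single = mctg_gradient(action=0.0, goal=3.0, num_rollouts=1, rng=rng)
averaged = mctg_gradient(action=0.0, goal=3.0, num_rollouts=2000, rng=rng)
# The expected gradient is -2 * (0 - 3) = 6; the averaged estimate sits close to it,
# while a single rollout can land far away.
```

The guided update would then move the current decision along `averaged` rather than along the gradient from one sampled trajectory.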

The main theoretical result is that Diffusion-Forcing optimizes a variational lower bound on the expected log-likelihoods of subsequences or noise-corrupted sequences, not only on whole fixed-length sequences (Chen et al., 2024). The paper further emphasizes a special case in which each $k_t$ is either $0$ or $K$: then any prefix, subsequence, or partially masked set of tokens can be treated as input, and the model learns the corresponding conditional distribution over the rest. This is the theoretical basis for claims that the method models all possible noise sequences simultaneously and that it supports arbitrary subsequence conditioning rather than a single fixed generation protocol.
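The binary special case can be made concrete by enumerating every pattern in which each level is either 0 or the maximum. The helper below is illustrative only; the value of `K` and the function name are assumptions.

```python
from itertools import product

K = 1000  # assumed maximum diffusion level


def conditioning_patterns(num_tokens: int):
    """Yield every binary noise pattern with its observed and masked positions."""
    for levels in product((0, K), repeat=num_tokens):
        observed = [t for t, k in enumerate(levels) if k == 0]
        masked = [t for t, k in enumerate(levels) if k == K]
        yield levels, observed, masked


patterns = list(conditioning_patterns(3))
print(len(patterns))  # 8

# (0, 0, K) is prefix conditioning; (0, K, 0) is infilling a middle token.
prefix = next(p for p in patterns if p[0] == (0, 0, K))
print(prefix)  # ((0, 0, 1000), [0, 1], [2])
```

For a sequence of length T there are 2^T such patterns, each corresponding to one conditional distribution covered by the training objective in this special case.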

4. Empirical behavior in the original formulation

The initial evaluation of Diffusion-Forcing spans video prediction, planning/offline reinforcement learning, robotics/imitation learning, and time-series forecasting (Chen et al., 2024). On Minecraft and DMLab video prediction, the method is reported to be uniquely stable: it can roll out far beyond the training horizon, remains temporally consistent, and supports rollouts of around 1000 frames, with appendix examples of even longer rollouts, while the model was trained on much shorter sequences.

On D4RL maze environments, Diffusion-Forcing outperforms MPPI, CQL, IQL, Diffuser, and an ablated version without MCTG, and the paper emphasizes that causal consistency matters (Chen et al., 2024). In that comparison, directly executing Diffuser’s generated actions performs poorly because the generated states and actions are not self-consistent, whereas Diffusion-Forcing’s raw action generations are.

On a real robot fruit-swapping task, the method uses latent memory to remember the initial object configuration, succeeds where a memoryless diffusion policy fails, and is reported to achieve about 80% success, dropping only modestly under visual corruption (Chen et al., 2024). The robotics result is presented as evidence that causal latent memory and conditioning on uncertain context are useful even when the sensory stream is noisy or incomplete.

On multiple GluonTS benchmark datasets, Diffusion-Forcing is competitive with strong forecasting baselines including DeepAR, Transformer-MAF, TimeGrad, and ScoreGrad (Chen et al., 2024). Within the original paper, this broader forecasting result is significant because it indicates that the objective is not restricted to video or planning, but functions as a general-purpose sequence model.

5. Domain-specific adaptations and extensions

A large subsequent literature has retained the defining ingredient—independent or heterogeneous noise levels across time or tokens—while altering the architecture, conditioning scheme, or inference schedule to match domain-specific constraints.

Paper	Domain	Diffusion-Forcing adaptation
EgoForce (Hwang et al., 13 May 2026)	Egocentric full-body motion	Temporally asymmetric, frame-wise diffusion schedule
MAGNet (Maluleke et al., 19 Dec 2025)	Multi-agent motion	Token-level independent diffusion times with inter-agent coupling
DFP (Zhang et al., 9 Jun 2026)	Autonomous driving	Independent noise on history, current, and future chunks
ForcingDAS (Jia et al., 14 May 2026)	Data assimilation	Per-frame diffusion levels with filtering-to-smoothing schedules
MDF (Huang et al., 6 Nov 2025)	Forceful manipulation	2D time-modality noise level matrix
FloodDiffusion (Cai et al., 3 Dec 2025)	Streaming text-to-motion	Lower-triangular scheduler with bidirectional attention
D2F (Wang et al., 8 Aug 2025)	Diffusion LLM inference	Block-wise causal discrete diffusion forcing
Causal-rCM (Zheng et al., 24 Jun 2026)	Streaming video and world models	Teacher-forcing and self-forcing causal distillation

In online egocentric motion reconstruction, EgoForce adapts Diffusion-Forcing to strict causal streaming from sparse and noisy head-mounted observations (Hwang et al., 13 May 2026). Its key modification is a temporally asymmetric, frame-wise diffusion schedule in which each frame in a window has its own diffusion timestep, together with a visibility mask and “noise-robust causal observation injection.” The model maintains a sliding temporal buffer, warm-starts the latent states from the previous step, appends a new future frame initialized from Gaussian noise, and performs only a small, fixed amount of refinement rather than re-denoising the whole sequence from scratch (Hwang et al., 13 May 2026). The paper reports lower MPJPE and MPJRE-F, higher semantic similarity, lower FID, and better Peak Jerk and Area Under Jerk, and it attributes long-horizon stability to progressive refinement that continuously reuses and improves past predictions.
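The sliding-buffer procedure can be sketched roughly as follows. Everything specific here is an assumption: the buffer length of four, the level step `delta`, the choice to emit the oldest frame once it is fully denoised, and the `toy_refine` stand-in for the denoiser. Observation injection and the visibility mask are omitted.

```python
from collections import deque
import random


def stream_step(buffer, refine, max_level, delta, rng):
    """Advance the stream by one frame with a small, fixed amount of refinement."""
    for frame in buffer:  # warm start: reuse and refine the previous estimates
        new_level = max(0, frame["level"] - delta)
        frame["value"] = refine(frame["value"], frame["level"], new_level)
        frame["level"] = new_level
    emitted = buffer.popleft()  # oldest frame slides out of the window
    buffer.append({"value": rng.gauss(0.0, 1.0), "level": max_level})  # new noisy frame
    return emitted


rng = random.Random(0)
max_level, delta = 100, 25
buffer = deque(
    {"value": rng.gauss(0.0, 1.0), "level": level} for level in (25, 50, 75, 100)
)
toy_refine = lambda value, k_from, k_to: 0.5 * value  # hypothetical denoiser stand-in

frame = stream_step(buffer, toy_refine, max_level, delta, rng)
print(frame["level"])                 # 0
print([f["level"] for f in buffer])   # [25, 50, 75, 100]
```

The staggered levels return to the same pattern after every step, so each frame is refined several times before it is emitted, and no frame is ever re-denoised from scratch.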

In multi-agent motion generation, MAGNet converts interaction motion into a tokenized, autoregressively denoised sequence in which a token corresponds to one agent at one temporal segment (Maluleke et al., 19 Dec 2025). It combines token-level independent diffusion times with a VQ-VAE pose representation, explicit pairwise transforms, and a coupling/consistency loss that enforces interpersonal kinematic consistency. The same framework supports partner inpainting, partner prediction, joint future prediction, agentic rollout, and ultra-long generation, and the paper reports up to 56 FPS for partner prediction and 54–56 FPS for other tasks (Maluleke et al., 19 Dec 2025).

In autonomous driving, Diffusion Forcing Planner (DFP) decomposes the ego trajectory into history, current, and future chunks and assigns independent diffusion noise levels to these chunks (Zhang et al., 9 Jun 2026). The model jointly denoises history and future, and inference uses classifier-free guidance with annealed history so that history becomes a progressive and controllable influence rather than a static conditioning vector. On nuPlan, the paper reports improvements over Diffusion Planner on Val14 in both non-reactive and reactive closed-loop evaluation, and it highlights a Comfort score of 96.97 in a high-speed scenario (Zhang et al., 9 Jun 2026). The central claim is that history should stabilize planning without inducing simple copying of past motion.

In data assimilation, ForcingDAS uses a separate diffusion noise level for each frame in a trajectory and interprets Diffusion-Forcing as a trajectory prior rather than a frame-to-frame transition model (Jia et al., 14 May 2026). A scheduling matrix of per-frame noise levels selects filtering, fixed-lag smoothing, or full-sequence smoothing at inference time without retraining. The paper further introduces Causality-Aware Training (CAT), which interpolates between i.i.d. noise-level sampling and sorted monotone patterns, and adds measurement guidance weighted by per-frame model uncertainty (Jia et al., 14 May 2026). Across 2D Navier–Stokes vorticity, SEVIR precipitation nowcasting, and ERA5 global atmospheric state estimation, the paper reports that a single model is competitive with or outperforms specialized learned and classical baselines, with the largest gains on real-world weather benchmarks.
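One simple way to interpolate between i.i.d. and sorted noise patterns is to sort the sampled levels with some probability. This mixing rule, the ascending sort direction, and the name `sorted_prob` are assumptions made for illustration; the source does not spell out how CAT performs the interpolation.

```python
import random


def sample_levels(num_frames, max_level, sorted_prob, rng):
    """Draw i.i.d. per-frame levels, then sort them with probability sorted_prob."""
    levels = [rng.randint(0, max_level) for _ in range(num_frames)]
    if rng.random() < sorted_prob:
        levels.sort()  # earlier frames cleaner, later frames noisier (assumed direction)
    return levels


iid_levels = sample_levels(8, 100, sorted_prob=0.0, rng=random.Random(1))
monotone_levels = sample_levels(8, 100, sorted_prob=1.0, rng=random.Random(1))
```

Values of `sorted_prob` between 0 and 1 expose the model to a mixture of both regimes during training, which is the general intent of reducing the gap between i.i.d. training patterns and monotone inference schedules.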

In multimodal robot learning, Multimodal Diffusion Forcing (MDF) generalizes time-varying noise into a 2D Time–Modality Noise Level Matrix (Huang et al., 6 Nov 2025). The model learns the joint distribution of point clouds, force, actions, proprioception, rewards, and privileged states, so the same trained system can act as a policy, planner, inverse model, state estimator, or anomaly detector. On contact-rich tasks such as Nut Thread, Gear Mesh, Peg Insert, Oil Cap Installation, and Oil Cap Removal, the paper reports that MDF-Policy reaches 100% on Nut Thread versus DP3 at 96%, 86% on Gear Mesh versus DP3 at 80%, and 80% on Peg Insert, while also showing improved robustness under noisy point clouds (Huang et al., 6 Nov 2025).
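A rough sketch of how one time-by-modality noise matrix could select different roles for the same model. The modality names follow the list above, but the binary levels, the `known_steps` argument, and the specific role configurations are assumptions for illustration.

```python
K = 1000  # assumed maximum diffusion level
MODALITIES = (
    "point_cloud", "force", "proprioception", "action", "reward", "privileged_state",
)


def noise_matrix(horizon, generate, known_steps=0):
    """Map each modality to per-step levels: 0 = given as input, K = to be generated."""
    unknown = set(generate) - set(MODALITIES)
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    generated_row = [0] * known_steps + [K] * (horizon - known_steps)
    return {
        name: list(generated_row) if name in generate else [0] * horizon
        for name in MODALITIES
    }


policy = noise_matrix(4, {"action"}, known_steps=1)
state_estimator = noise_matrix(4, {"privileged_state"})

print(policy["action"])                     # [0, 1000, 1000, 1000]
print(policy["force"])                      # [0, 0, 0, 0]
print(state_estimator["privileged_state"])  # [1000, 1000, 1000, 1000]
```

Switching roles changes only the matrix passed at inference time, not the trained weights.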

In streaming human motion generation, FloodDiffusion argues that a straightforward implementation of vanilla diffusion forcing as proposed for video models fails to model real motion distributions and must be tailored in three ways: bidirectional attention instead of causal attention, a lower-triangular time scheduler instead of a random one, and continuous time-varying text conditioning (Cai et al., 3 Dec 2025). The deterministic lower-triangular scheduler creates a local active window and yields streaming locality. On HumanML3D, the paper reports standard text-to-motion metrics, including Diversity, and it shows that removing bidirectional attention or the lower-triangular scheduler causes FID to jump to 3.377 and 3.883 respectively (Cai et al., 3 Dec 2025).

In discrete text generation, Discrete Diffusion Forcing (D2F) extends the idea to masked discrete diffusion LLMs by partitioning the output into blocks, imposing monotonically increasing masking across blocks, and using block-wise causal attention so that exact KV caching becomes possible (Wang et al., 8 Aug 2025). The student is distilled from a bidirectional dLLM by an asymmetric objective in which the teacher sees the whole noisy sequence while the student sees only the causal prefix up through each block. Combined with pipelined parallel decoding, the paper reports 119.9 tokens/s for D2F-Dream-Base-7B on GSM8K, compared with 48.0 for LLaMA3-Instruct-8B and 52.7 for Qwen2.5-Base-7B, and it states that acceleration over vanilla dLLMs can exceed 50× while maintaining comparable output quality (Wang et al., 8 Aug 2025).
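Block-wise causal attention can be expressed as a boolean mask in which a position may attend to every position in its own block and in earlier blocks. The sketch below shows that generic construction; the function name and the fixed block size are assumptions, and it is not taken from the D2F code.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


for row in block_causal_mask(seq_len=4, block_size=2):
    print(["x" if allowed else "." for allowed in row])
# ['x', 'x', '.', '.']
# ['x', 'x', '.', '.']
# ['x', 'x', 'x', 'x']
# ['x', 'x', 'x', 'x']
```

Because no position ever attends to a later block, the keys and values of a finished block do not change when later blocks are decoded, which is what makes exact KV caching possible.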

In autoregressive video diffusion and interactive world models, Causal-rCM places Diffusion-Forcing in a broader family of teacher-forcing, diffusion-forcing, and self-forcing causal regimes (Zheng et al., 24 Jun 2026). Its contribution is a unified TF + SF distillation recipe in which TF-CM provides a forward-divergence, offline initialization and SF-DMD provides reverse-divergence, on-policy refinement. The paper introduces the first implementation of teacher-forcing continuous-time consistency models for autoregressive video diffusion, enabled by a custom-mask FlashAttention-2 JVP kernel, reports about 10× faster convergence than discrete-time consistency models, and states that a distilled 2-step causal Wan2.1-1.3B model achieves a VBench-T2V score of 84.63 with only 1 or 2 sampling steps (Zheng et al., 24 Jun 2026).

6. Recurring design tensions, limitations, and open questions

A recurrent point in the literature is that Diffusion-Forcing is not a single fixed recipe, but a family of causal denoising schemes whose practical behavior depends strongly on how uncertainty is distributed across time, modalities, or blocks. Several follow-up papers explicitly identify train–test mismatch as the central technical hazard. ForcingDAS observes that standard Diffusion-Forcing training samples per-frame noise i.i.d. while inference schedules are causally monotone, and introduces Causality-Aware Training to reduce that mismatch (Jia et al., 14 May 2026). Causal-rCM makes a related argument in the autoregressive video setting: teacher-forcing and diffusion-forcing remain offline causal regimes, but they do not fully match the model’s own rollout distribution, which motivates self-forcing refinement (Zheng et al., 24 Jun 2026).

Another recurring issue is whether causal attention is intrinsic to diffusion-forcing. The original CDF formulation is explicitly causal (Chen et al., 2024), EgoForce is designed under strict causal constraints (Hwang et al., 13 May 2026), and D2F uses block-wise causal attention for cache reuse (Wang et al., 8 Aug 2025). FloodDiffusion, however, reports that streaming human motion requires training with bidirectional attention instead of causal attention, and that causal attention discards useful context and hurts quality (Cai et al., 3 Dec 2025). The disagreement is not over independent tokenwise noise itself, but over which attention structure best matches a given inference regime.

History handling is another source of methodological tension. DFP argues that injecting ego history as a static conditioning signal can induce the planner to copy historical patterns instead of adapting to the current traffic context, and therefore makes history part of the diffusion state with annealed classifier-free guidance (Zhang et al., 9 Jun 2026). EgoForce addresses a related problem by anchoring only trusted observed dimensions while leaving the rest stochastic, so incoming egocentric measurements act as soft evidence rather than hard ground truth (Hwang et al., 13 May 2026). These variants preserve the core diffusion-forcing intuition while rejecting a literal “clean history, predict future” pipeline.

The literature also records several domain-specific limitations. MDF notes that jointly learning many multimodal distributions at once makes optimization harder and suggests more targeted training strategies and heterogeneous training as important future directions (Huang et al., 6 Nov 2025). D2F indicates that quality can degrade if block size and decoding thresholds are chosen too aggressively and that the training recipe is specialized to block-wise generation and LoRA-based adaptation (Wang et al., 8 Aug 2025). FloodDiffusion states that the current system lacks instruction fine-tuning and semantic memory for commands such as “repeat your last action,” and that long stylized streaming data is scarce (Cai et al., 3 Dec 2025). DFP reports an inference-speed cost for full classifier-free guidance, with about 7.7 FPS versus 14.8 FPS for the unguided branch (Zhang et al., 9 Jun 2026).

Taken together, these works define Diffusion-Forcing less as a single algorithm than as a general sequence-modeling principle: assign heterogeneous uncertainty across a structured sequence, train the model to denoise under that uncertainty, and choose an inference schedule that matches the target deployment regime. The principle is stable across domains; the exact causal mask, schedule geometry, guidance rule, and representation are not.