Chunkwise Diffusion Forcing

Updated 4 July 2026
  • The paper introduces a chunkwise diffusion forcing framework that partitions sequence generation into small, context-conditioned chunks to achieve low latency and continuous output.
  • It employs tailored attention mechanisms and memory-aware self-forcing training strategies to mitigate exposure bias, error accumulation, and drift.
  • Empirical evaluations across video, motion, and language tasks demonstrate enhanced performance metrics, validating the effectiveness of the chunk-based design.

Chunkwise Diffusion Forcing denotes a family of streaming generative schemes in which a diffusion or flow-matching model does not synthesize an entire sequence in a single denoising pass, but instead advances through the sequence in temporally organized units (typically chunks, windows, or active bands) while conditioning each new unit on previously denoised context. Across recent work, the term covers several closely related constructions: chunked autoregressive video diffusion for interactive humanoid generation, active-window diffusion forcing for streaming motion, temporally asymmetric sliding-window denoising for causal reconstruction, and semantic chunk factorization in diffusion LLMs. The common structural theme is to combine bounded-latency rollout with conditioning on already generated or already denoised history, while mitigating the exposure bias, drift, and train–test mismatch that arise when long sequences are generated incrementally (Wang et al., 15 Jan 2026).

1. Definition and conceptual scope

In FlowAct-R1, chunkwise diffusion forcing is explicitly the mechanism that turns a normally short-clip diffusion video model into a streaming, low-latency, arbitrarily long interactive generator. The method generates video in small fixed-length chunks rather than all at once, and trains the model so that each chunk is conditioned on already generated history and can continue the video smoothly. This establishes the canonical meaning of the term in the interactive video setting: a chunked autoregressive diffusion process whose training and inference are aligned around sequential continuation (Wang et al., 15 Jan 2026).

Related work shows that the same general idea admits several task-specific realizations. FloodDiffusion adapts diffusion forcing to streaming human motion by assigning different diffusion times to different frames in a long latent sequence so that only a small active window must be denoised at each step; it emphasizes low first-token latency and explicit use of past generated motion under time-varying control (Cai et al., 3 Dec 2025). EgoForce uses a fixed-length sliding temporal window with progressive refinement, where previously denoised latents are carried over to the next window under strict causal constraints; the paper explicitly states that it is inspired by Diffusion Forcing but does not present itself as a general-purpose chunkwise diffusion-forcing framework (Hwang et al., 13 May 2026). DCDM transfers the blockwise factorization idea to discrete diffusion language modeling by replacing fixed positional blocks with learned content-defined semantic chunks, thereby reframing chunkwise diffusion forcing as autoregression over semantic groups rather than contiguous spans (Zhu et al., 15 May 2026).

This suggests that chunkwise diffusion forcing is better understood as a design pattern than as a single algorithm. The pattern has three recurring elements: partition the sequence into units that can be denoised in parallel, preserve causal or chunk-causal dependence across those units, and construct training procedures that reflect the imperfect context encountered during rollout. A plausible implication is that the term now spans both literal temporal chunking and more abstract chunk-based factorization.

2. Streaming motivation: latency, continuity, and arbitrary duration

The immediate motivation in FlowAct-R1 is the conjunction of real-time interaction and long-duration generation. Real-time interaction requires low latency: the model cannot spend many seconds denoising a long clip before returning pixels. Long-duration generation, however, tends to accumulate errors when chunks are produced autoregressively, which is especially damaging in humanoid video because identity consistency, lip-sync, and body motion must remain stable over long interactions. Chunkwise diffusion forcing is introduced precisely to address both requirements at once: low-latency synthesis through chunk-by-chunk streaming, and reduced drift and repetition through training that aligns chunk continuation with inference-time rollout (Wang et al., 15 Jan 2026).

FloodDiffusion sharpens the latency argument by contrasting diffusion forcing with chunk-by-chunk or autoregressive models with a diffusion head. Its objective is streaming motion generation under time-varying text prompts, where the system must react immediately to the newest prompt on already buffered frames. The paper argues that vanilla diffusion forcing, as used in video settings, does not work well for human motion because motion is a 1D temporal structure rather than spatial video, the control signal can change at arbitrary times, and the model must update buffered frames in response to newly arriving control. In this formulation, low first-token latency and explicit use of past generated motion are not incidental benefits but primary design requirements (Cai et al., 3 Dec 2025).

EgoForce makes the same tradeoff explicit in online egocentric motion reconstruction. Existing generative methods may handle noisy and sparse measurements but typically assume a fixed-length observation window and are thus unsuitable for real-time applications; faster autoregressive prediction sacrifices robustness. EgoForce instead maintains a persistent latent window, reuses past predictions as warm-starts, and performs only a fixed $\Delta k$ refinement as new observations arrive. The chunkwise element here is the overlapping sliding window rather than a non-overlapping chunk sequence, but the operational purpose remains the same: bounded online delay with long-horizon coherence under streaming input (Hwang et al., 13 May 2026).

Causal Forcing++ exposes a further limitation of earlier chunk-wise autoregressive diffusion distillation in video: coarse response granularity and non-negligible latency in the chunk-wise 4-step regime. Its shift to frame-wise autoregression with only 1–2 sampling steps can be read as a limit case of chunkwise diffusion forcing in which the chunk size shrinks toward a single frame. This suggests that the chunkwise formulation is partly a controllable compromise between latency, granularity, and rollout stability (Zhao et al., 14 May 2026).

3. Core mechanics: chunking, memory, and active denoising regions

In FlowAct-R1, generation is organized around a fixed-size streaming buffer containing a reference latent from the input image, a long-term memory queue of previously denoised latents, a short-term memory latent from the immediately preceding chunk, and a denoising stream containing the current chunk or chunks being refined. The paper states that the system outputs 0.5 seconds of video per 0.5 seconds of wall-clock time, corresponding to one chunk, and that inference uses a denoising stream organized as 3 chunks × 3 latents per chunk. Operationally, the target sequence is split into chunks; the current chunk is denoised while conditioned on the reference and previously completed chunks; the denoised chunk is appended to memory; and the process repeats indefinitely for arbitrary-duration generation (Wang et al., 15 Jan 2026).
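The chunk loop described above can be sketched in a few lines. This is an illustrative sketch, not FlowAct-R1's implementation: the names `StreamingBuffer`, `stream`, and `toy_denoise`, the memory queue length of 4, and the choice to denoise one chunk at a time (rather than a 3-chunk denoising stream) are all assumptions made for brevity.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List

Latent = List[float]   # stand-in for a latent tensor
Chunk = List[Latent]   # a chunk is a short list of latents


@dataclass
class StreamingBuffer:
    reference: Latent  # latent of the input image
    # Bounded queue of earlier denoised chunks (maxlen=4 is hypothetical).
    long_term: Deque[Chunk] = field(default_factory=lambda: deque(maxlen=4))
    # Most recently completed chunk.
    short_term: Chunk = field(default_factory=list)


def stream(buffer: StreamingBuffer,
           denoise_chunk: Callable[[Chunk, Latent, List[Chunk], Chunk], Chunk],
           init_noise: Callable[[], Chunk],
           num_chunks: int) -> List[Chunk]:
    """Generate chunks one by one, each conditioned on reference and memory."""
    outputs = []
    for _ in range(num_chunks):
        noisy = init_noise()
        clean = denoise_chunk(noisy, buffer.reference,
                              list(buffer.long_term), buffer.short_term)
        # The previous short-term chunk graduates to long-term memory.
        if buffer.short_term:
            buffer.long_term.append(buffer.short_term)
        buffer.short_term = clean
        outputs.append(clean)
    return outputs


def toy_denoise(noisy: Chunk, reference: Latent,
                long_term: List[Chunk], short_term: Chunk) -> Chunk:
    # Hypothetical stand-in for the diffusion model: pull every noisy latent
    # halfway toward the reference latent and ignore the memory inputs.
    return [[0.5 * (n + r) for n, r in zip(lat, reference)] for lat in noisy]
```

Because the long-term queue is bounded, the conditioning cost per chunk stays constant however long the stream runs, which is what allows arbitrary-duration generation at fixed latency.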

FloodDiffusion realizes the same principle through a lower-triangular schedule over sequence positions rather than explicit fixed chunks. It defines

$$\alpha_t^k = \mathrm{clamp}\!\left(t-\frac{k}{n_s},\,0,\,1\right), \qquad \beta_t^k = 1-\alpha_t^k, \qquad \sigma_t=0,$$

with

$$m(t)=\lceil (t-1)n_s\rceil,\qquad n(t)=\lceil t n_s\rceil.$$

Frames $k<m(t)$ are fully denoised, frames $k\ge n(t)$ remain pure noise, and only the interval $[m(t),n(t))$ is active. The resulting locality statement is

$$u_t(\mathbf{X}_t,\mathbf{c}^{0:K})= \begin{bmatrix} \mathbf{0}^{0:m(t)} \\ u_t^{m(t):n(t)}(\mathbf{X}_t^{0:n(t)},\mathbf{c}^{0:n(t)}) \\ \mathbf{0}^{n(t):K} \end{bmatrix}.$$

This is still chunkwise in the operational sense: only a bounded band is denoised at each step, completed frames are fixed, future frames remain noisy, and the process advances progressively through the sequence (Cai et al., 3 Dec 2025).
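To make the lower-triangular schedule concrete, the sketch below evaluates the clamp schedule and the active band for a given global time. The function names and the clipping of the band to the valid frame range are assumptions added for illustration; only the formulas for the schedule and the band endpoints come from the text above.

```python
import math
from typing import List, Tuple


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def alpha(t: float, k: int, n_s: int) -> float:
    """Signal weight of frame k at global time t: clamp(t - k/n_s, 0, 1)."""
    return clamp(t - k / n_s, 0.0, 1.0)


def active_window(t: float, n_s: int, num_frames: int) -> Tuple[int, int]:
    """Return (m, n) with m = ceil((t-1) n_s) and n = ceil(t n_s).

    Both ends are clipped to [0, num_frames]; the clipping is an
    assumption so the band is always a valid index range.
    """
    m = math.ceil((t - 1) * n_s)
    n = math.ceil(t * n_s)
    return clamp(m, 0, num_frames), clamp(n, 0, num_frames)


def frame_states(t: float, n_s: int, num_frames: int) -> List[str]:
    """Label each frame as fully denoised, actively denoised, or pure noise."""
    m, n = active_window(t, n_s, num_frames)
    return ["clean" if k < m else "noise" if k >= n else "active"
            for k in range(num_frames)]
```

With `n_s = 4` and `t = 1.5`, the band covers frames 2 through 5: earlier frames have signal weight 1 and later frames have signal weight 0, so only four frames need to be processed at that step.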

EgoForce uses a fixed-length chunk of size $h+1+f$,

$$\mathbf{X}_t^0 = \{\mathbf{x}_{t-h}^0, \dots, \mathbf{x}_{t+f}^0\},$$

with a rolling update: the oldest frame is discarded, shared frames are shifted forward, a new terminal frame is initialized from Gaussian noise, and only incremental denoising is performed. Under the future-horizon schedule, past frames are clean or near-clean, the current frame is clean, and future frames carry increasing uncertainty with temporal distance. This is not simply next-step autoregression; it is rolling diffusion with structured uncertainty over a fixed overlapping buffer (Hwang et al., 13 May 2026).
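The rolling update can be sketched as a pure function on a list of frames. This is a minimal illustration, not EgoForce's code: `roll_window`, `online_step`, and `toy_refine` are hypothetical names, and the refinement step is a trivial stand-in for the incremental denoiser.

```python
import random
from typing import Callable, List

Frame = List[float]


def roll_window(window: List[Frame], dim: int, rng: random.Random) -> List[Frame]:
    """Drop the oldest frame, shift the rest, append a Gaussian-noise frame."""
    new_frame = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return window[1:] + [new_frame]


def online_step(window: List[Frame],
                refine: Callable[[List[Frame]], List[Frame]],
                dim: int, rng: random.Random) -> List[Frame]:
    """One streaming step: roll the buffer, then apply a bounded refinement."""
    return refine(roll_window(window, dim, rng))


def toy_refine(window: List[Frame]) -> List[Frame]:
    # Hypothetical stand-in for the incremental denoiser: pull the newest
    # (noisiest) frame halfway toward its predecessor and keep the rest.
    prev, last = window[-2], window[-1]
    blended = [0.5 * (p + x) for p, x in zip(prev, last)]
    return window[:-1] + [blended]
```

The window length never changes, so the per-step cost is bounded by the buffer size and the fixed amount of refinement, independent of how long the stream has been running.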

DCDM abstracts the unit of chunking away from time and position. Hidden states are routed into semantic chunks by Chunking Attention, and the resulting hard assignments define the chunk sets. The sequence then factorizes autoregressively over chunks, while inference-time attention obeys a chunk-causal mask. Here the “chunk” is a learned semantic group rather than a temporal segment, but the formal structure (parallel denoising within a group and causal conditioning across groups) remains the same (Zhu et al., 15 May 2026).

4. Training alignment and error-accumulation control

A central difficulty in chunkwise diffusion forcing is train–test mismatch. In the standard form of chunkwise diffusion forcing described for FlowAct-R1, the model is trained with ground-truth chunks as conditioning history, whereas inference conditions on model-generated history. The paper introduces a self-forcing-style training strategy, inspired by Self-Forcing++, in which an intermediate trained model injects noise into ground-truth video latents, denoises them, and produces generated-GT-latents. During training, the memory components are sampled from a mixture: with some probability the conditioning memory is pure GT-latents, and otherwise it is generated-GT-latents. This exposes the model to realistic rollout errors during training and is presented as the main mechanism for alleviating error accumulation (Wang et al., 15 Jan 2026).
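The mixture over memory sources can be sketched as a simple sampling routine. The function name `sample_memory`, the parameter `p_gt`, and the decision to draw independently per memory slot are assumptions; the source states only that ground truth is used with some probability and generated-GT otherwise.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_memory(gt_latents: Sequence[T],
                  generated_gt_latents: Sequence[T],
                  p_gt: float,
                  rng: random.Random) -> List[T]:
    """Build the conditioning memory for one training example.

    Each slot keeps the ground-truth latent with probability p_gt and is
    otherwise replaced by the corresponding generated-GT latent, i.e. the
    ground truth after noising and denoising by an intermediate model.
    """
    if len(gt_latents) != len(generated_gt_latents):
        raise ValueError("memory sources must have the same length")
    return [gt if rng.random() < p_gt else gen
            for gt, gen in zip(gt_latents, generated_gt_latents)]
```

Lowering `p_gt` exposes the model to more of its own imperfect outputs during training, which moves the training distribution closer to what it sees at rollout.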

FlowAct-R1 also uses memory-aware fake-causal attention. The denoising stream can attend to the reference, memory, and itself, while the reference and memory are prevented from attending to the denoising stream. The asymmetry stabilizes the conditioning sources and ensures that already denoised information remains an uncorrupted anchor across deeper DiT layers. In addition, the paper identifies short-term memory as especially sensitive and periodically performs noise injection and denoising repair on short-term memory frames, using copies of the reference and long-term memory as stable guidance. Together, self-forcing and memory refinement are the paper’s explicit remedies for drift, repetition, and artifact propagation in long rollouts (Wang et al., 15 Jan 2026).
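The asymmetric attention rule can be written as a boolean mask over a token sequence laid out as reference, memory, then denoising stream. This is a sketch under stated assumptions: the token ordering, the name `fake_causal_mask`, and the choice to let reference and memory tokens attend to each other freely are not specified in the source.

```python
from typing import List


def fake_causal_mask(n_ref: int, n_mem: int, n_den: int) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    Token order (assumed): reference, memory, denoising stream.
    Denoising queries see everything; reference and memory queries see
    only reference and memory keys, never the denoising stream.
    """
    n_ctx = n_ref + n_mem
    total = n_ctx + n_den
    mask = []
    for q in range(total):
        q_is_denoising = q >= n_ctx
        row = []
        for k in range(total):
            k_is_denoising = k >= n_ctx
            row.append(q_is_denoising or not k_is_denoising)
        mask.append(row)
    return mask
```

Since no context row ever reads from a denoising column, the hidden states of reference and memory tokens are independent of the noisy stream at every layer, which is the stabilizing property the paragraph above describes.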

FloodDiffusion argues that preserving the output distribution under streaming constraints requires tailoring diffusion forcing in three specific ways: bi-directional attention instead of causal attention, a lower-triangular time scheduler instead of a random one, and continuous time-varying text conditioning instead of prompt refresh. The paper reports that naive video-style choices collapse quality dramatically: removing bi-directional attention increases HumanML3D FID, and so does replacing the lower-triangular scheduler with a random scheduler. The theoretical claim is that exact saturation regions created by the triangular schedule yield a streaming locality theorem and allow exact factorization of the active computation; bidirectional attention is then the correct mechanism inside the active interval because that interval is not truly causal internally (Cai et al., 3 Dec 2025).

EgoForce addresses alignment and robustness through heterogeneous frame-wise corruption and causal imputation. Training samples per-frame diffusion timesteps and injects the observed body components into the corrupted latents. A noisy-control robust variant further perturbs the egocentric control signal and switches between noisy observations and the model’s own denoising predictions according to a threshold. This is a task-specific mechanism, but its role is analogous: make the model robust to the imperfect, partially denoised, partially observed state encountered during online rollout (Hwang et al., 13 May 2026).

In DCDM, the corresponding alignment issue is not temporal drift but structural mismatch between fixed positional blocks and semantic dependencies. Its answer is end-to-end learned chunking. The soft routing path carries gradient from the diffusion objective through the chunking layer, while a Gumbel-Softmax straight-through auxiliary loss maintains balanced hard chunk assignments. A plausible implication is that this learned partition reduces the mismatch between the model’s denoising groups and the dependency structure actually present in the sequence (Zhu et al., 15 May 2026).

5. Mathematical formulations and attention structure

The mathematical form of chunkwise diffusion forcing differs by domain, but several recurrent motifs appear. FlowAct-R1 does not provide a standalone recurrence equation for chunkwise diffusion forcing, but it does define an operational chunked autoregressive diffusion process in which denoising is recursively conditioned on reference and memory. Its most explicit formal contribution is the fake-causal attention rule: denoising tokens attend to reference, memory, and themselves, whereas reference and memory are shielded from denoising-stream updates. The training curriculum comprises autoregressive adaptation, joint audio-motion training, and distillation to 3 NFEs, with a weighted loss that preserves native image-to-video capability and coherent initialization of the first chunk (Wang et al., 15 Jan 2026).

FloodDiffusion is more explicit. For each data sample, the paper defines a corruption path and gives the corresponding conditional score and velocity. With $\sigma_t=0$, training reduces to flow matching / velocity regression. Only the active window $[m(t),n(t))$ is trained and denoised at each step, which is the formal basis for efficient streaming (Cai et al., 3 Dec 2025).
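A short derivation shows why only the active window carries a training signal. It assumes a linear interpolation path between data $\mathbf{x}^k$ and noise $\boldsymbol{\epsilon}^k$, which is consistent with $\beta_t^k = 1-\alpha_t^k$ and $\sigma_t = 0$ but is an assumption made here rather than a formula taken from the paper.

```latex
\begin{align*}
\mathbf{X}_t^k &= \alpha_t^k\,\mathbf{x}^k + \beta_t^k\,\boldsymbol{\epsilon}^k,
  \qquad \beta_t^k = 1-\alpha_t^k
  && \text{(assumed path)} \\
\frac{d\mathbf{X}_t^k}{dt} &= \frac{d\alpha_t^k}{dt}\,\bigl(\mathbf{x}^k-\boldsymbol{\epsilon}^k\bigr) \\
\frac{d\alpha_t^k}{dt} &=
  \begin{cases}
    1, & 0 < t-\dfrac{k}{n_s} < 1,\\[4pt]
    0, & \text{otherwise (the clamp is saturated)}
  \end{cases} \\
0 < t-\frac{k}{n_s} < 1
  &\iff (t-1)\,n_s < k < t\,n_s
  \;\Longrightarrow\; m(t) \le k < n(t).
\end{align*}
```

Under this assumption the target velocity is exactly zero for every frame outside $[m(t),n(t))$, which matches the block structure of the locality statement for $u_t$.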

EgoForce formulates a causal target distribution, trains a denoiser against it, and expresses online refinement as an incremental update of the persistent latent window. This formalism emphasizes persistent latent state, incremental reverse diffusion, and strict causal conditioning rather than chunk-causal masking per se (Hwang et al., 13 May 2026).

In DCDM, chunking itself is parameterized. Each cluster has a learnable subspace matrix that defines a token–cluster alignment; hard chunk IDs follow from this alignment, and the model is trained with a diffusion objective.

Within a chunk, denoising is bidirectional; across chunks, it is causal. This preserves the hybrid AR–diffusion structure of block diffusion while allowing the grouping to be learned from content (Zhu et al., 15 May 2026).
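The "bidirectional within, causal across" rule can be illustrated with a mask built from per-token chunk IDs. This sketch assumes that chunk IDs are integers whose order is the generation order; the function name and the example assignment are hypothetical and do not reproduce the paper's mask definition.

```python
from typing import List, Sequence


def chunk_causal_mask(chunk_ids: Sequence[int]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Token i sees token j if j belongs to the same chunk (bidirectional
    within a chunk) or to an earlier chunk (causal across chunks).
    """
    n = len(chunk_ids)
    return [[chunk_ids[j] <= chunk_ids[i] for j in range(n)] for i in range(n)]


# Hypothetical assignment: chunks need not be contiguous spans of positions.
example_ids = [0, 1, 0, 2, 1]
example_mask = chunk_causal_mask(example_ids)
```

With positional blocks the chunk IDs would be non-decreasing along the sequence; a learned assignment such as `example_ids` lets tokens 0 and 2 share a chunk even though token 1 sits between them.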

6. Systems integration, distillation, and empirical findings

In practical systems, chunkwise diffusion forcing is rarely sufficient on its own; it is combined with distillation and systems optimization to achieve usable latency. FlowAct-R1 integrates chunkwise diffusion forcing into a Seedance MMDiT backbone with a streaming buffer comprising reference latent, long-term memory queue, short-term memory latent, and denoising stream. The framework combines diffusion distillation to 3 NFEs, removal of CFG overhead by distilling multiple guidance scales into one model, step distillation followed by few-step score distillation (DMD), chunk-aware DMD that simulates progressive rollout behavior, FP8 quantization on selected attention and linear layers, frame-level hybrid parallelism, kernel fusion, and asynchronous DiT denoising and VAE decoding. The reported outcome is 25 fps at 480p with TTFF around 1.5 seconds and arbitrary-length streaming generation (Wang et al., 15 Jan 2026).

The paper attributes FlowAct-R1’s superior behavioral naturalness to MLLM-guided action planning and chunkwise diffusion forcing, which together mitigate motion repetition. In a user study against KlingAvatar 2.0, LiveAvatar, and OmniHuman-1.5, it is reported to outperform them in motion naturalness, lip-sync accuracy, frame structure stability, and motion richness, while also supporting long-duration streaming, real-time responsiveness, and better perceptual realism. The experimental section does not isolate chunkwise diffusion forcing with a dedicated quantitative table, so the evidence is attributional rather than a standalone ablation (Wang et al., 15 Jan 2026).

FloodDiffusion provides a cleaner quantitative case for streaming diffusion forcing. On HumanML3D it reports R@1, R@2, R@3, FID, MM-Dist, and Diversity. On BABEL it reports Peak Jerk and Area Under the Jerk, both better than PRIMAL and MotionStreamer, and its FID also improves on MotionStreamer's. The ablations are especially relevant to chunkwise diffusion forcing because they show that omitting the tailored active-window design principles breaks performance severely (Cai et al., 3 Dec 2025).

DCDM supplies evidence from language modeling rather than temporal media. On nine downstream benchmark entries, the paper compares average scores for MDLM, BDLM, DCDM, and DCDM-MoE at both 0.5B and 1.5B, and separately compares the DCDM dense average with AdaBlock. DCDM's advantage appears early in training and remains stable, and the paper identifies a best-performing configuration on the evaluated subset at 0.5B. These findings support the claim that chunkwise factorization can benefit optimization when chunk boundaries better reflect sequence structure (Zhu et al., 15 May 2026).

Causal Forcing++ is informative about the limits of chunk-wise interactive video generation. Relative to the prior chunk-wise 4-step Causal Forcing baseline, its frame-wise 2-step setting reports throughput 14.1 FPS versus 10.4 FPS, latency 0.27 s versus 0.60 s, VBench Total 84.14 versus 84.04, VBench Quality 84.89 versus 84.59, and VisionReward 6.661 versus 6.326, while reducing first-frame latency by 50%. Stage 2 cost drops from about 11,600 A800 GPU hours and about 1,900 GiB storage for causal ODE distillation to about 2,900 A800 GPU hours and 0 extra storage with causal consistency distillation. These results indicate that chunk-wise diffusion forcing established a workable regime for real-time video, but aggressive frame-wise rollout required a better initialization strategy rather than only better self-rollout (Zhao et al., 14 May 2026).
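The relative sizes of these changes are easier to compare as ratios. The arithmetic below uses only the figures quoted above; the labels are descriptive and not taken from the paper.

```latex
\begin{align*}
\text{throughput ratio} &= \frac{14.1}{10.4} \approx 1.36 \\
\text{latency ratio} &= \frac{0.27}{0.60} = 0.45
  \quad (\text{about } 55\% \text{ lower}) \\
\text{Stage 2 compute ratio} &= \frac{2900}{11600} = 0.25
  \quad (\text{about } 4\times \text{ fewer GPU hours}) \\
\Delta\,\text{VBench Total} &= 84.14 - 84.04 = 0.10 \\
\Delta\,\text{VBench Quality} &= 84.89 - 84.59 = 0.30 \\
\Delta\,\text{VisionReward} &= 6.661 - 6.326 = 0.335
\end{align*}
```

The latency pair implies a reduction of roughly 55%, while the quoted 50% figure refers to first-frame latency, which may be measured differently. The quality metrics move only slightly, so the main gains are in responsiveness and training cost.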

7. Relation to adjacent methods and open interpretive issues

Chunkwise diffusion forcing is often conflated with several neighboring ideas, but the recent literature distinguishes them carefully. It is not equivalent to ordinary autoregression: EgoForce, for example, is not frame-by-frame next-step prediction but rolling diffusion with multiple future frames jointly denoised under temporally asymmetric uncertainty (Hwang et al., 13 May 2026). Nor is it synonymous with any fixed blocking scheme: DCDM explicitly argues that positional blocks are a poor inductive bias for language and replaces them with learned semantic chunks, while still preserving autoregression over denoising groups (Zhu et al., 15 May 2026).

The literature also shows that there is no single universally valid attention pattern. FlowAct-R1 uses fake-causal attention in which memory and reference remain stable anchors while the denoising stream attends to them (Wang et al., 15 Jan 2026). FloodDiffusion, by contrast, argues that for streaming motion the active window should use bi-directional attention rather than causal attention, because the relevant context at a given time is an interval of frames, not merely a strict past prefix (Cai et al., 3 Dec 2025). This is not a contradiction so much as a domain-dependent consequence of how the active denoising region is defined.

Another recurring misconception is that chunking itself solves drift. The evidence instead points to chunking as the scaffold on which anti-drift mechanisms must be built. FlowAct-R1 adds self-forcing and memory repair; EgoForce adds stabilization noise reinjection and noise-robust imputation; Causal Forcing++ identifies student initialization as the bottleneck when moving from chunk-wise 4-step to frame-wise 1–2 step generation (Wang et al., 15 Jan 2026; Hwang et al., 13 May 2026; Zhao et al., 14 May 2026). A plausible implication is that the decisive design axis is not simply chunk size, but the combination of chunking with training alignment, local consistency, and structured conditioning.

From a broader perspective, the family now spans at least four regimes: fixed temporal chunks for interactive video, active-band streaming for motion under changing prompts, overlapping causal windows for online reconstruction, and semantic chunks for discrete diffusion LLMs. The unifying principle is parallel denoising within a bounded unit and causal or chunk-causal dependence across units. The main unresolved issue, as suggested by the diversity of designs, is whether there exists a domain-agnostic theory of optimal chunk formation and conditioning, or whether chunkwise diffusion forcing is intrinsically task-specific in its scheduler, attention mask, and training alignment choices.
