
FIFO-Diffusion: Text-Conditional Infinite Video Generation

Updated 19 February 2026
  • FIFO-Diffusion is an algorithm for text-to-video synthesis that generates arbitrarily long videos with constant memory usage.
  • It employs a FIFO queue with diagonal denoising, latent partitioning, and lookahead to overcome training-inference mismatches and preserve temporal coherence.
  • Empirical evaluations demonstrate improved motion plausibility and reduced artifacts over extended video sequences using pretrained models.

FIFO-Diffusion is an inference algorithm for text-conditional video generation using pretrained diffusion models, designed to achieve arbitrarily long or infinite video synthesis with constant memory usage and no retraining. It operates by maintaining a first-in-first-out (FIFO) queue of video frame latents at distinct noise levels and denoising them diagonally, enabling frames to be generated sequentially while reusing the fixed-length denoiser trained on short clips. This approach addresses the challenge of extending diffusion-based video generation to long sequences without temporal discontinuities or prohibitive memory requirements, introducing latent partitioning and lookahead denoising to mitigate discrepancies between training and inference (Kim et al., 2024).

1. Problem Setup and Motivation

Standard text-to-video diffusion models are trained on short fixed-length clips (typically $f = 16$ frames) and apply denoising to all frames in a clip at the same noise level. Extending these models to generate sequences of length $N \gg f$ using naive approaches (such as autoregressively generating overlapping chunks) requires $\mathcal{O}(N)$ memory and incurs discontinuities at chunk boundaries. The objective of FIFO-Diffusion is to design an efficient, training-free inference algorithm that reuses the original $f$-frame denoiser, maintains $\mathcal{O}(f)$ memory complexity regardless of $N$, and produces temporally coherent, arbitrarily long videos conditioned on a text prompt (Kim et al., 2024).

2. Algorithmic Structure: Diagonal Denoising with FIFO Queue

FIFO-Diffusion introduces a diagonal denoising strategy in which a queue $Q$ of $f$ frame latents is maintained, each corresponding to a different noise level. At each iteration:

  • The latents in $Q = [z_{\tau_1}^1, z_{\tau_2}^2, \dots, z_{\tau_f}^f]$ (with timesteps $\tau_1 < \tau_2 < \dots < \tau_f = T$) are jointly denoised in a single model forward pass:

$Q' = \Phi(Q, [\tau_1, \ldots, \tau_f], c; \varepsilon_\theta)$

  • The first latent ($z_{\tau_0}^1$, now fully denoised) is decoded as the output video frame.
  • A fresh noise latent is sampled and enqueued at the FIFO tail.

This "sliding diagonal" over the diffusion f×Sf \times S grid ensures that the denoiser exploits future context (forward referencing) without exceeding the original model's context size. Memory usage is constrained to O(f)\mathcal{O}(f) latents, and each frame requires a constant number of model calls. The practical pseudo-code is:

Initialize Q = [z_{τ_1}^1, ..., z_{τ_f}^f]   # f latents at increasing noise levels
for i in 1..N:
    Q = Φ(Q, [τ_1, ..., τ_f], c; ε_θ)        # one diagonal denoising pass
    z_out = Q.dequeue()                      # head latent is fully denoised
    v_i = Dec(z_out)                         # decode output frame i
    Q.enqueue(z ~ N(0, I))                   # fresh noise at the tail
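
To make the loop concrete, the following is a minimal runnable sketch in Python/PyTorch. The names denoise_window and decode_latent, the latent shape, and the timestep schedule are illustrative assumptions standing in for a pretrained $f$-frame denoiser and a VAE decoder; this is not the reference implementation. In practice the queue is also warmed up by partially denoising an initial clip rather than filled with pure noise.

from collections import deque
import torch

f, N = 16, 128                               # denoiser window size, frames to emit
C, H, W = 4, 40, 64                          # latent shape (illustrative)
taus = torch.linspace(1, 999, f).long()      # tau_1 < ... < tau_f = T (illustrative)

def denoise_window(latents, timesteps, prompt):
    # Stand-in for one forward pass of a pretrained f-frame video denoiser;
    # a real wrapper would predict noise and take one scheduler step per latent.
    return latents - 0.01 * torch.randn_like(latents)

def decode_latent(z):
    # Stand-in for the VAE decoder mapping one latent to an RGB frame.
    return z

queue = deque(torch.randn(C, H, W) for _ in range(f))   # FIFO of f latents
frames = []
for i in range(N):
    window = torch.stack(list(queue))                    # [f, C, H, W]
    window = denoise_window(window, taus, "a scenic timelapse")
    queue = deque(window.unbind(0))
    frames.append(decode_latent(queue.popleft()))        # head is fully denoised
    queue.append(torch.randn(C, H, W))                   # fresh noise at the tail

Memory stays bounded by the queue length $f$ no matter how many frames are emitted.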

3. Addressing the Training-Inference Mismatch

The original diffusion model is trained only on clips where all $f$ frames share the same noise level. During FIFO-Diffusion inference, the $f$ frames in the diagonal queue have distinct, and often widely separated, noise levels, leading to a training-inference distributional gap.

Latent Partitioning: To mitigate this, the noise-level range $[\tau_1, \tau_f]$ is partitioned into $n$ contiguous sub-ranges, each covering $f$ levels. The queue expands to $nf$ latents, segmented into $n$ blocks $Q_0, \dots, Q_{n-1}$. Each block is processed independently and in parallel, shrinking the noise-level gap within each block and reducing the additional error (theoretically of order $\mathcal{O}(|\sigma_{\tau_f} - \sigma_{\tau_1}|)$) (Kim et al., 2024).
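
A minimal sketch of latent partitioning, reusing the hypothetical denoise_window wrapper from the sketch above; batching and scheduling details are simplified assumptions:

import torch

def partitioned_diagonal_step(queue, taus, prompt, denoise_window, n):
    # queue: [n*f, C, H, W] latents at strictly increasing noise levels.
    # taus:  [n*f] timesteps; chunking keeps each model call inside a
    #        narrow noise-level sub-range, closer to the training regime.
    blocks = queue.chunk(n)
    tau_blocks = taus.chunk(n)
    out = [denoise_window(b, t, prompt)           # independent calls; they can
           for b, t in zip(blocks, tau_blocks)]   # run in parallel or batched
    return torch.cat(out)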

Lookahead Denoising: To ensure all latents in the queue benefit from forward reference (not just the noisiest), frames are grouped so that each latent appears in a block tail in at least one pass. This involves holding a queue of $2nf'$ latents (with $f' = \lfloor f/2 \rfloor$) and updating the last $f'$ elements of each block per pass. This approach doubles compute cost but remains fully parallelizable.
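
A plausible sketch of lookahead denoising consistent with the description above, again using the hypothetical denoise_window wrapper; the exact window stride, edge handling, and call count are simplifying assumptions, not the paper's reference implementation:

import torch

def lookahead_diagonal_step(queue, taus, prompt, denoise_window, f):
    # queue: [L, C, H, W] latents, L = 2*n*f', at increasing noise levels.
    # Windows of f latents slide with stride f' = f // 2; from each window
    # only the tail half is committed, so every updated latent sits in a
    # block tail at least once, roughly doubling the model calls.
    fp = f // 2
    L = queue.shape[0]
    updated = []
    for k, lo in enumerate(range(0, L - f + 1, fp)):
        win = denoise_window(queue[lo:lo + f], taus[lo:lo + f], prompt)
        if k == 0:
            updated.append(win[:fp])   # head segment falls in no window tail;
                                       # keep its plain diagonal update
        updated.append(win[fp:])       # commit only the tail half
    return torch.cat(updated)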

4. Computational and Memory Analysis

FIFO-Diffusion's complexity is:

  • Memory: $\mathcal{O}(f)$, independent of the target video length $N$.
  • Time per frame: $\mathcal{O}(1)$ model calls for basic diagonal denoising ($n$ calls with $n$-way partitioning, $2n$ calls with lookahead; all are parallelizable).

A comparison with alternative training-free methods is summarized below:

Method                             Memory (512 frames)   Latency per frame (1 GPU)
FreeNoise                          44 GB                 6.09 s
Gen-L-Video                        11 GB                 22.07 s
FIFO-Diffusion (n=4)               11.2 GB               6.20 s
FIFO-Diffusion (n=4, lookahead)    11.2 GB               12.37 s

Memory usage remains bounded regardless of $N$, unlike methods requiring all latents in memory (Kim et al., 2024).

5. Empirical Evaluation

Experiments on four pretrained backbones (VideoCrafter1, VideoCrafter2, zeroscope, Open-Sora Plan) demonstrate that FIFO-Diffusion produces 10,000-frame videos without perceptible quality degradation or chunk artifacts. Frame-to-frame coherence and prompt transitions are preserved. Windowed Fréchet Video Distance (FVD) remains stable, unlike FreeNoise, which degrades over long samples. User studies (70 raters, 30 prompts) show FIFO-Diffusion is preferred in over 70% of cases for overall quality, motion plausibility, motion magnitude, text fidelity, and aesthetics.

Ablation studies quantify the effectiveness of latent partitioning and lookahead:

  • No partitioning or lookahead yields $1.09\times$ the base model's $\varepsilon$-prediction MSE.
  • Partitioning with $n=4$ reduces this to $1.02\times$; adding lookahead brings it to $0.98\times$, surpassing the base model.
  • Temporal smoothness improves as blockiness and flicker artifacts are reduced or eliminated (Kim et al., 2024).

6. Limitations and Prospects

The diagonal denoising approach at inference cannot perfectly match the noise-level distribution seen during training, even after partitioning and lookahead. A complete resolution would require incorporating diagonal denoising into the training regime (e.g., sampling $f$ different noise levels per batch). Future research directions include:

  • Training models to directly handle multi-noise-level inputs, eliminating the doubled compute cost of lookahead (see the sketch after this list).
  • Exploring non-linear or content-aware noise-level schedules.
  • Extending the paradigm to adaptive $f$ or spatio-temporal latent partitioning (Kim et al., 2024).
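
As a purely hypothetical illustration of the first direction (not part of the published method), a training step could sample a distinct timestep per frame so the denoiser sees diagonal noise patterns during training; model and scheduler below are assumed placeholders, not a specific library API:

import torch
import torch.nn.functional as F

def diagonal_training_step(model, scheduler, x0, prompt_emb, f):
    # x0: clean video latents [B, f, C, H, W]. Each frame gets its own
    # timestep, mimicking the diagonal pattern seen at FIFO inference;
    # standard training would draw a single timestep per clip instead.
    B = x0.shape[0]
    t = torch.randint(0, scheduler.num_train_timesteps, (B, f))
    noise = torch.randn_like(x0)
    # Fold frames into the batch so a per-sample add_noise (as in common
    # scheduler APIs) applies a separate noise level to every frame.
    xt = scheduler.add_noise(x0.flatten(0, 1), noise.flatten(0, 1),
                             t.flatten()).view_as(x0)
    pred = model(xt, t, prompt_emb)          # epsilon prediction per frame
    return F.mse_loss(pred, noise)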

7. Relation to Prior Work

FIFO-Diffusion is distinct from prior literature on diffusion approximations in queueing theory, e.g., the FIFO diffusion models for many-server queues with abandonment, in which diffusion processes approximate queue length dynamics under first-in-first-out discipline (He et al., 2011). The application of FIFO queuing in the context of infinite video generation, however, is an inference-time algorithmic innovation for text-to-video diffusion, unrelated to the stochastic process theory underlying classical queueing models.

