FIFO-Diffusion: Text-Conditional Infinite Video
- FIFO-Diffusion is an algorithm for text-to-video synthesis that generates arbitrarily long videos with constant memory usage.
- It employs a FIFO queue with diagonal denoising, latent partitioning, and lookahead to overcome training-inference mismatches and preserve temporal coherence.
- Empirical evaluations demonstrate improved motion plausibility and reduced artifacts over extended video sequences using pretrained models.
FIFO-Diffusion is an inference algorithm for text-conditional video generation using pretrained diffusion models, designed to achieve arbitrarily long or infinite video synthesis with constant memory usage and no retraining. It operates by maintaining a first-in-first-out (FIFO) queue of video frame latents at distinct noise levels and denoising them diagonally, enabling frames to be generated sequentially while reusing the fixed-length denoiser trained on short clips. This approach addresses the challenge of extending diffusion-based video generation to long sequences without temporal discontinuities or prohibitive memory requirements, introducing latent partitioning and lookahead denoising to mitigate discrepancies between training and inference (Kim et al., 2024).
1. Problem Setup and Motivation
Standard text-to-video diffusion models are trained on short fixed-length clips (typically f frames) and apply denoising to all frames in a clip at the same noise level. Extending these models to generate sequences of length N >> f using naive approaches (such as autoregressively generating overlapping chunks) requires O(N) memory and incurs discontinuities at chunk boundaries. The objective of FIFO-Diffusion is to design an efficient, training-free inference algorithm that reuses the original f-frame denoiser, maintains O(f) memory complexity regardless of N, and produces temporally coherent, arbitrarily long videos conditioned on a text prompt (Kim et al., 2024).
2. Algorithmic Structure: Diagonal Denoising with FIFO Queue
FIFO-Diffusion introduces a diagonal denoising strategy in which a queue Q of f frame latents is maintained, each corresponding to a different noise level. At each iteration:
- The f latents in Q (with timesteps τ_1 < τ_2 < … < τ_f) are jointly denoised in a single model forward pass.
- The first latent (at timestep τ_1, now fully denoised) is dequeued and decoded as the output video frame.
- A fresh noise latent is sampled and enqueued at the FIFO tail.
This "sliding diagonal" over the diffusion grid ensures that the denoiser exploits future context (forward referencing) without exceeding the original model's context size. Memory usage is constrained to f latents, and each frame requires a constant number of model calls. The practical pseudo-code is:
```
Initialize Q = [z_{τ_1}^1, …, z_{τ_f}^f]
for i in 1…N:
    Q     = Φ(Q, [τ_1, …, τ_f], c; ε_θ)
    z_out = Q.dequeue()          # fully denoised head latent
    v_i   = Dec(z_out)           # decode output frame
    Q.enqueue(z ~ N(0, I))       # fresh noise at the tail
```
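The loop above can be sketched as a minimal, self-contained toy in Python. The stub `denoise_step` stands in for the pretrained denoiser ε_θ (here it just shrinks each latent and advances its timestep); all names, shapes, and constants are illustrative, not the paper's implementation.

```python
from collections import deque

import numpy as np

rng = np.random.default_rng(0)
F = 4                   # context length f of the pretrained denoiser
N = 8                   # number of output frames to generate
LATENT_SHAPE = (2, 2)   # toy latent size

def denoise_step(latents, timesteps):
    """Stub for one joint pass Phi: each latent moves one noise level
    closer to clean (timestep 0). A real model would run the pretrained
    video denoiser on all f latents at once here."""
    return [z * 0.9 for z in latents], [max(t - 1, 0) for t in timesteps]

# Initialize the queue with f latents at increasing noise levels tau_1..tau_f.
queue = deque((rng.standard_normal(LATENT_SHAPE), t + 1) for t in range(F))

frames = []
for _ in range(N):
    latents, timesteps = zip(*queue)
    latents, timesteps = denoise_step(list(latents), list(timesteps))
    queue = deque(zip(latents, timesteps))
    z_out, t_out = queue.popleft()      # head latent is now fully denoised
    assert t_out == 0
    frames.append(z_out)                # Dec(z_out) in the real pipeline
    queue.append((rng.standard_normal(LATENT_SHAPE), F))  # fresh noise at tail

print(len(frames), len(queue))
```

Note the invariant: the queue always holds exactly f latents, so memory stays constant no matter how many frames are produced.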
3. Addressing the Training-Inference Mismatch
The original diffusion model is trained only on clips where all frames share the same noise level. During FIFO-Diffusion inference, the frames in the diagonal queue have distinct—and often widely separated—noise levels, leading to a training-inference distributional gap.
Latent Partitioning: To mitigate this, the noise-level range is partitioned into n contiguous sub-ranges, each covering f of the diagonal's noise levels. The queue expands to nf latents, segmented into blocks Q_1, …, Q_n. Each block is processed independently and in parallel, shrinking the noise-level gap within each block and reducing the additional error, which the paper's analysis shows decreases as n grows (Kim et al., 2024).
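The bookkeeping behind latent partitioning can be illustrated in a few lines. This sketch only tracks timesteps (not latents) and uses illustrative values of n and f; the point is that each of the n blocks sees a noise-level spread of f rather than nf.

```python
n, f = 4, 16                           # partitions and model context (illustrative)
timesteps = list(range(1, n * f + 1))  # the nf diagonal noise levels

# Latent partitioning: split the nf-long diagonal into n contiguous blocks
# of f timesteps each; each block is denoised in its own (parallelizable)
# forward pass, so the spread inside any one pass is f - 1, not nf - 1.
blocks = [timesteps[k * f:(k + 1) * f] for k in range(n)]

for block in blocks:
    assert len(block) == f
    assert max(block) - min(block) == f - 1  # narrow spread within a block

print(len(blocks))
```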
Lookahead Denoising: To ensure all latents in the queue benefit from forward reference (not just the noisiest), frames are grouped so that each latent falls in the tail half of a block in at least one denoising pass. This involves holding a queue of 2nf' latents (with f' = f/2) and updating only the last f' elements of each block per pass. This approach doubles compute cost but remains fully parallelizable.
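A schematic of the lookahead bookkeeping, again tracking only queue positions: slide the f-wide model window in strides of f' and commit updates only to each window's tail half, so every committed latent "looks ahead" at noisier neighbors. The handling of the queue's padded head in the actual method is omitted, and the constants are illustrative.

```python
n, f = 2, 8
fp = f // 2                  # f' = f / 2
q_len = 2 * n * fp           # queue holds 2nf' latents

# Overlapping f-wide windows at stride f'; only the last f' positions
# of each window receive the committed update.
windows = [range(s, s + f) for s in range(0, q_len - f + 1, fp)]
updated = set()
for w in windows:
    updated.update(list(w)[fp:])   # commit the tail half only

# Every position beyond the first window's head half gets a tail-side
# (lookahead) update in some pass.
print(sorted(updated))
```

The overlap is what doubles the number of forward passes relative to plain partitioning, matching the 2x latency figures in the table below.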
4. Computational and Memory Analysis
FIFO-Diffusion's complexity is:
- Memory: O(nf) latents (O(f) for the basic algorithm), independent of the target video length N.
- Time per frame: one model call for basic diagonal denoising (n calls for the n-partition variant, 2n calls with lookahead; all are parallelizable).
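This accounting can be written down directly. The helper below is purely illustrative (not from the paper's code); note that the target video length does not appear in either figure, which is the point of the constant-memory claim.

```python
def fifo_cost(f, n=1, lookahead=False):
    """Per-frame model calls and resident latents for FIFO-Diffusion
    variants (illustrative accounting). f is the denoiser context
    length, n the number of latent partitions."""
    calls = 2 * n if lookahead else n            # parallelizable passes/frame
    latents = 2 * n * (f // 2) if lookahead else n * f
    return calls, latents

print(fifo_cost(f=16))                        # plain diagonal: (1, 16)
print(fifo_cost(f=16, n=4))                   # n-partition:    (4, 64)
print(fifo_cost(f=16, n=4, lookahead=True))   # lookahead:      (8, 64)
```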
A comparison with alternative training-free methods is summarized below:
| Method | Memory for 512 Frames | Latency per Frame (1 GPU) |
|---|---|---|
| FreeNoise | 44 GB | 6.09 s |
| Gen-L-Video | 11 GB | 22.07 s |
| FIFO-Diffusion (n=4) | 11.2 GB | 6.20 s |
| FIFO-Diffusion (n=4, lookahead) | 11.2 GB | 12.37 s |
Memory usage remains bounded regardless of N, unlike methods that must hold all N latents in memory (Kim et al., 2024).
5. Empirical Evaluation
Experiments on four pretrained backbones (VideoCrafter1, VideoCrafter2, zeroscope, Open-Sora Plan) demonstrate that FIFO-Diffusion produces 10,000-frame videos without perceptible quality degradation or chunk artifacts. Frame-to-frame coherence and prompt transitions are preserved. Windowed Fréchet Video Distance (FVD) remains stable, unlike FreeNoise, which degrades over long samples. User studies (70 raters, 30 prompts) show FIFO-Diffusion is preferred in over 70% of cases for overall quality, motion plausibility, motion magnitude, text fidelity, and aesthetics.
Ablation studies quantify the effectiveness of latent partitioning and lookahead:
- Diagonal denoising without partitioning or lookahead incurs a higher ε-prediction MSE than the base model.
- Latent partitioning substantially reduces this error; adding lookahead denoising lowers it further, below the base model's level.
- Temporal smoothness improves as blockiness and flicker artifacts are reduced or eliminated (Kim et al., 2024).
6. Limitations and Prospects
The diagonal denoising approach in inference cannot perfectly match the noise-level distribution seen during training, even after partitioning and lookahead. An ideal resolution would require incorporating diagonal denoising into the training regime (e.g., sampling a different noise level for each frame during training). Future research directions include:
- Training models to directly handle multi-noise-level inputs, eliminating the doubled compute cost of lookahead.
- Exploring non-linear or content-aware noise-level schedules.
- Extending the paradigm to adaptive or spatio-temporal latent partitioning (Kim et al., 2024).
7. Relation to Prior Work
FIFO-Diffusion is distinct from prior literature on diffusion approximations in queueing theory, e.g., the FIFO diffusion models for many-server queues with abandonment, in which diffusion processes approximate queue length dynamics under first-in-first-out discipline (He et al., 2011). The application of FIFO queuing in the context of infinite video generation, however, is an inference-time algorithmic innovation for text-to-video diffusion, unrelated to the stochastic process theory underlying classical queueing models.
References:
- "FIFO-Diffusion: Generating Infinite Videos from Text without Training" (Kim et al., 2024)
- Many-server queueing diffusion models (He et al., 2011)