FIFO-Diffusion: Text-Conditional Infinite Video
- FIFO-Diffusion is an algorithm for text-to-video synthesis that generates arbitrarily long videos with constant memory usage.
- It employs a FIFO queue with diagonal denoising, latent partitioning, and lookahead to overcome training-inference mismatches and preserve temporal coherence.
- Empirical evaluations demonstrate improved motion plausibility and reduced artifacts over extended video sequences using pretrained models.
FIFO-Diffusion is an inference algorithm for text-conditional video generation using pretrained diffusion models, designed to achieve arbitrarily long or infinite video synthesis with constant memory usage and no retraining. It operates by maintaining a first-in-first-out (FIFO) queue of video frame latents at distinct noise levels and denoising them diagonally, enabling frames to be generated sequentially while reusing the fixed-length denoiser trained on short clips. This approach addresses the challenge of extending diffusion-based video generation to long sequences without temporal discontinuities or prohibitive memory requirements, introducing latent partitioning and lookahead denoising to mitigate discrepancies between training and inference (Kim et al., 2024).
1. Problem Setup and Motivation
Standard text-to-video diffusion models are trained on short fixed-length clips (typically f frames) and apply denoising to all frames in a clip at the same noise level. Extending these models to generate sequences of length N >> f using naive approaches (such as autoregressively generating overlapping chunks) requires O(N) memory and incurs discontinuities at chunk boundaries. The objective of FIFO-Diffusion is to design an efficient, training-free inference algorithm that reuses the original f-frame denoiser, maintains O(f) memory complexity regardless of N, and produces temporally coherent, arbitrarily long videos conditioned on a text prompt (Kim et al., 2024).
2. Algorithmic Structure: Diagonal Denoising with FIFO Queue
FIFO-Diffusion introduces a diagonal denoising strategy in which a queue Q of f frame latents is maintained, each corresponding to a different noise level. At each iteration:
- The f latents in Q (with timesteps τ_1 < τ_2 < … < τ_f) are jointly denoised in a single model forward pass.
- The first latent (at timestep τ_1, now fully denoised) is dequeued and decoded as the output video frame.
- A fresh noise latent is sampled and enqueued at the FIFO tail.
This "sliding diagonal" over the diffusion grid ensures that the denoiser exploits future context (forward referencing) without exceeding the original model's context size. Memory usage is constrained to f latents, and each frame requires a constant number of model calls. The practical pseudo-code is:
```
Initialize Q = [z_{τ_1}^1, …, z_{τ_f}^f]
for i in 1…N:
    Q     = Φ(Q, [τ_1, …, τ_f], c; ε_θ)
    z_out = Q.dequeue()          # fully denoised head latent
    v_i   = Dec(z_out)           # decode output frame
    Q.enqueue(z ~ N(0, I))       # fresh noise at the tail
```
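The loop above can be sketched as a minimal, self-contained toy in Python. The stub `denoise_step` stands in for the pretrained denoiser ε_θ (here it just shrinks each latent and advances its timestep); all names, shapes, and constants are illustrative, not the paper's implementation.

```python
from collections import deque

import numpy as np

rng = np.random.default_rng(0)
F = 4                   # context length f of the pretrained denoiser
N = 8                   # number of output frames to generate
LATENT_SHAPE = (2, 2)   # toy latent size

def denoise_step(latents, timesteps):
    """Stub for one joint pass Phi: each latent moves one noise level
    closer to clean (timestep 0). A real model would run the pretrained
    video denoiser on all f latents at once here."""
    return [z * 0.9 for z in latents], [max(t - 1, 0) for t in timesteps]

# Initialize the queue with f latents at increasing noise levels tau_1..tau_f.
queue = deque((rng.standard_normal(LATENT_SHAPE), t + 1) for t in range(F))

frames = []
for _ in range(N):
    latents, timesteps = zip(*queue)
    latents, timesteps = denoise_step(list(latents), list(timesteps))
    queue = deque(zip(latents, timesteps))
    z_out, t_out = queue.popleft()      # head latent is now fully denoised
    assert t_out == 0
    frames.append(z_out)                # Dec(z_out) in the real pipeline
    queue.append((rng.standard_normal(LATENT_SHAPE), F))  # fresh noise at tail

print(len(frames), len(queue))
```

Note the invariant: the queue always holds exactly f latents, so memory stays constant no matter how many frames are produced.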
3. Addressing the Training-Inference Mismatch
The original diffusion model is trained only on clips where all frames share the same noise level. During FIFO-Diffusion inference, the frames in the diagonal queue have distinct—and often widely separated—noise levels, leading to a training-inference distributional gap.
Latent Partitioning: To mitigate this, the noise-level range is partitioned into n contiguous sub-ranges, each covering f of the diagonal's noise levels. The queue expands to nf latents, segmented into blocks Q_1, …, Q_n. Each block is processed independently and in parallel, shrinking the noise-level gap within each block and reducing the additional error, which the paper's analysis shows decreases as n grows (Kim et al., 2024).
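The bookkeeping behind latent partitioning can be illustrated in a few lines. This sketch only tracks timesteps (not latents) and uses illustrative values of n and f; the point is that each of the n blocks sees a noise-level spread of f rather than nf.

```python
n, f = 4, 16                           # partitions and model context (illustrative)
timesteps = list(range(1, n * f + 1))  # the nf diagonal noise levels

# Latent partitioning: split the nf-long diagonal into n contiguous blocks
# of f timesteps each; each block is denoised in its own (parallelizable)
# forward pass, so the spread inside any one pass is f - 1, not nf - 1.
blocks = [timesteps[k * f:(k + 1) * f] for k in range(n)]

for block in blocks:
    assert len(block) == f
    assert max(block) - min(block) == f - 1  # narrow spread within a block

print(len(blocks))
```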
Lookahead Denoising: To ensure all latents in the queue benefit from forward reference (not just the noisiest), frames are grouped so that each latent falls in the tail half of a block in at least one denoising pass. This involves holding a queue of 2nf' latents (with f' = f/2) and updating only the last f' elements of each block per pass. This approach doubles compute cost but remains fully parallelizable.
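A schematic of the lookahead bookkeeping, again tracking only queue positions: slide the f-wide model window in strides of f' and commit updates only to each window's tail half, so every committed latent "looks ahead" at noisier neighbors. The handling of the queue's padded head in the actual method is omitted, and the constants are illustrative.

```python
n, f = 2, 8
fp = f // 2                  # f' = f / 2
q_len = 2 * n * fp           # queue holds 2nf' latents

# Overlapping f-wide windows at stride f'; only the last f' positions
# of each window receive the committed update.
windows = [range(s, s + f) for s in range(0, q_len - f + 1, fp)]
updated = set()
for w in windows:
    updated.update(list(w)[fp:])   # commit the tail half only

# Every position beyond the first window's head half gets a tail-side
# (lookahead) update in some pass.
print(sorted(updated))
```

The overlap is what doubles the number of forward passes relative to plain partitioning, matching the 2x latency figures in the table below.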
4. Computational and Memory Analysis
FIFO-Diffusion's complexity is:
- Memory: O(nf) latents (O(f) for the basic algorithm), independent of the target video length N.
- Time per frame: one model call for basic diagonal denoising (n calls for the n-partition variant, 2n calls with lookahead; all are parallelizable).
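This accounting can be written down directly. The helper below is purely illustrative (not from the paper's code); note that the target video length does not appear in either figure, which is the point of the constant-memory claim.

```python
def fifo_cost(f, n=1, lookahead=False):
    """Per-frame model calls and resident latents for FIFO-Diffusion
    variants (illustrative accounting). f is the denoiser context
    length, n the number of latent partitions."""
    calls = 2 * n if lookahead else n            # parallelizable passes/frame
    latents = 2 * n * (f // 2) if lookahead else n * f
    return calls, latents

print(fifo_cost(f=16))                        # plain diagonal: (1, 16)
print(fifo_cost(f=16, n=4))                   # n-partition:    (4, 64)
print(fifo_cost(f=16, n=4, lookahead=True))   # lookahead:      (8, 64)
```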
A comparison with alternative training-free methods is summarized below:
| Method | Memory for 512 Frames | Latency per Frame (1 GPU) |
|---|---|---|
| FreeNoise | 44 GB | 6.09 s |
| Gen-L-Video | 11 GB | 22.07 s |
| FIFO-Diffusion (n=4) | 11.2 GB | 6.20 s |
| FIFO-Diffusion (n=4, lookahead) | 11.2 GB | 12.37 s |
Memory usage remains bounded regardless of N, unlike methods that must hold all N latents in memory (Kim et al., 2024).
5. Empirical Evaluation
Experiments on four pretrained backbones (VideoCrafter1, VideoCrafter2, zeroscope, Open-Sora Plan) demonstrate that FIFO-Diffusion produces 10,000-frame videos without perceptible quality degradation or chunk artifacts. Frame-to-frame coherence and prompt transitions are preserved. Windowed Fréchet Video Distance (FVD) remains stable, unlike FreeNoise, which degrades over long samples. User studies (70 raters, 30 prompts) show FIFO-Diffusion is preferred in over 70% of cases for overall quality, motion plausibility, motion magnitude, text fidelity, and aesthetics.
Ablation studies quantify the effectiveness of latent partitioning and lookahead:
- Diagonal denoising without partitioning or lookahead incurs a higher ε-prediction MSE than the base model.
- Latent partitioning substantially reduces this error; adding lookahead denoising lowers it further, below the base model's level.
- Temporal smoothness improves as blockiness and flicker artifacts are reduced or eliminated (Kim et al., 2024).
6. Limitations and Prospects
The diagonal denoising approach in inference cannot perfectly match the noise-level distribution seen during training, even after partitioning and lookahead. An ideal resolution would require incorporating diagonal denoising into the training regime (e.g., sampling a different noise level for each frame during training). Future research directions include:
- Training models to directly handle multi-noise-level inputs, eliminating the doubled compute cost of lookahead.
- Exploring non-linear or content-aware noise-level schedules.
- Extending the paradigm to adaptive or spatio-temporal latent partitioning (Kim et al., 2024).
7. Relation to Prior Work
FIFO-Diffusion is distinct from prior literature on diffusion approximations in queueing theory, e.g., the FIFO diffusion models for many-server queues with abandonment, in which diffusion processes approximate queue length dynamics under first-in-first-out discipline (He et al., 2011). The application of FIFO queuing in the context of infinite video generation, however, is an inference-time algorithmic innovation for text-to-video diffusion, unrelated to the stochastic process theory underlying classical queueing models.
References:
- "FIFO-Diffusion: Generating Infinite Videos from Text without Training" (Kim et al., 2024)
- Many-server queueing diffusion models (He et al., 2011)