Infinite-length video generation with perfect fidelity

Establish whether and how infinite-length video sequences can be generated with perfect fidelity within the WorldWarp autoregressive pipeline, which couples the Spatio-Temporal Diffusion (ST-Diff) model with an online 3D Gaussian Splatting cache. The core difficulty is preventing the accumulation and propagation of artifacts and geometric inconsistencies when each generated chunk serves as historical context for subsequent chunks, particularly for sequences exceeding 1000 frames.

Background

WorldWarp generates long-range novel view sequences using an autoregressive pipeline: each new chunk of frames is conditioned on forward-warped hints derived from an online 3D Gaussian Splatting cache and refined by the ST-Diff model with a spatially and temporally varying noise schedule. This design enforces geometric consistency while enabling inpainting of occluded regions.
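The loop described above can be sketched in a few lines. This is a minimal, self-contained toy, not the authors' implementation: the function names (`spatio_temporal_noise`, `generate_sequence`), the scalar stand-ins for frames and the cache, and the specific blend of temporal and spatial noise weights are all illustrative assumptions.

```python
import random

def spatio_temporal_noise(frame_idx, chunk_len, coverage):
    """Toy asynchronous noise schedule: later frames within a chunk and
    poorly covered (occluded or novel) regions receive stronger noise,
    giving the diffusion model freedom exactly where the warped hint is weakest."""
    temporal = frame_idx / max(chunk_len - 1, 1)  # 0.0 at chunk start, 1.0 at end
    spatial = 1.0 - coverage                      # 1.0 where the cache explains nothing
    return min(1.0, 0.5 * temporal + 0.5 * spatial)

def generate_sequence(num_chunks, chunk_len):
    """Autoregressive loop: warp a hint from the cache, denoise each frame
    with its own noise strength, then fold the new chunk back into the cache."""
    cache = []   # stands in for the online 3D Gaussian Splatting cache
    video = []
    for _ in range(num_chunks):
        chunk = []
        for t in range(chunk_len):
            coverage = 1.0 if cache else 0.0    # toy: full coverage once history exists
            hint = cache[-1] if cache else 0.0  # "forward-warped" hint (scalar stand-in)
            sigma = spatio_temporal_noise(t, chunk_len, coverage)
            # "ST-Diff refinement" reduced to blending the hint with fresh content
            frame = (1.0 - sigma) * hint + sigma * random.random()
            chunk.append(frame)
        cache.extend(chunk)  # the generated chunk becomes history for the next one
        video.extend(chunk)
    return video
```

Note how the cache update in the last loop iteration is exactly the step that makes the pipeline autoregressive: any imperfection in `chunk` is re-ingested as conditioning for every later chunk.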

Despite strong performance on long sequences, the authors note that using previously generated chunks as history introduces the risk of error accumulation over very long horizons. Minor artifacts or geometric drift can propagate across chunks, and for extremely long sequences (e.g., beyond 1000 frames), this can degrade visual quality or geometric stability. Addressing this issue is essential for achieving truly infinite-length, high-fidelity generation.
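A toy recurrence (an illustrative assumption, not a model from the paper) shows why long horizons are disproportionately hard: if each chunk inherits its predecessor's error, possibly amplified, and contributes a small error of its own, drift grows at least linearly and, with any amplification, geometrically in the number of chunks.

```python
def accumulated_error(num_chunks, per_chunk_error=0.01, amplification=1.0):
    """Error after n chunks when each chunk inherits its predecessor's error
    (scaled by `amplification`) and adds `per_chunk_error` of its own.
    Both parameters are hypothetical illustration values."""
    err = 0.0
    for _ in range(num_chunks):
        err = err * amplification + per_chunk_error
    return err
```

With `amplification == 1.0` the drift is linear, `n * per_chunk_error`; with `amplification > 1.0` the closed form is `per_chunk_error * (a**n - 1) / (a - 1)`, so even a 5% per-chunk amplification turns a negligible per-chunk error into a dominant one after a few hundred chunks.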

References

Although our model is trained in an asynchronous diffusion manner, where we apply varying noise strengths to different frames and spatial regions to mimic inference conditions, generating infinite-length video sequences with perfect fidelity remains an unresolved challenge.

WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion (2512.19678 - Kong et al., 22 Dec 2025) in Supplementary, Section "Limitations", Error Accumulation in Long-horizon Generation