MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

This presentation explores MilliVid, a breakthrough approach to generating long, temporally consistent videos. The method introduces a hierarchical tokenization framework and coarse-to-fine diffusion sampling that dramatically improves how video models maintain global scene consistency and object permanence over hundreds of frames. By strategically allocating transformer context across multiple semantic resolution levels, MilliVid overcomes the drift-forgetting trade-off that plagues traditional autoregressive and diffusion-based video generators, achieving superior long-range coherence without sacrificing visual quality.
Script
Video generation models can create stunning short clips, but they rapidly lose coherence as sequences grow longer. After just dozens of frames, objects drift, scenes forget their own geometry, and the model hallucinates contradictory details because the transformer's context window cannot practically hold hundreds of frames worth of tokens.
MilliVid solves this with a hierarchical tokenization strategy. The model encodes each frame into multiple discrete latent levels, from coarse semantic layouts to fine textures, then allocates the transformer's fixed context budget strategically: coarse tokens cover hundreds of frames for global consistency, while fine tokens capture recent detail.
The authors pair this with a coarse-to-fine diffusion sampling procedure. At each rollout step, the model generates large video segments at coarse scale first, then incrementally refines them at finer scales, ensuring that spatial and temporal decisions are always grounded in the same shared global structure rather than decoupled super-resolution passes.
The researchers tested this on Loopcraft, a custom Minecraft dataset with 1024 frame videos featuring frequent spatial re-visitation. MilliVid dramatically outperforms prior methods: long-horizon peak signal-to-noise ratio of 16.69 versus 11.98 for FramePack, and 62.4 keypoint matches versus just 7 for autoregressive rollout, while maintaining equivalent perceptual quality.
Ablations reveal that simply increasing context or using mean-pooled cascaded architectures fails. The key is learned semantic compression at coarse levels, which allows the model to recall global structure over hundreds of effective frames without collapsing or hallucinating.
MilliVid fundamentally shifts how we think about long-horizon video modeling, proving that abstraction-aware token compression unlocks scalable temporal reach for world simulation, robotics planning, and any task requiring reliable scene memory. To dive deeper into this research and create your own video explainers, visit EmergentMind.com.