
Scalable Training of Large-Scale Text-to-Video Foundation Models

Develop scalable training methodologies for large-scale text-to-video foundation models that effectively handle the complexities introduced by modeling motion, in order to synthesize realistic, temporally coherent videos under stringent memory, compute, and data-scale constraints.


Background

The paper emphasizes that moving from images to videos adds a temporal dimension that substantially increases modeling complexity and resource demands. Beyond capturing realistic motion, training text-to-video models must contend with higher memory and compute requirements and larger datasets to learn the more complex distribution of videos.

The authors motivate their architectural choice, a Space-Time U-Net that generates the full clip in a single pass, as a response to these challenges, contrasting it with cascaded temporal super-resolution approaches that struggle to maintain global temporal consistency.
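To make the contrast concrete, the following is a minimal NumPy sketch (not the paper's implementation) of the key idea behind a Space-Time U-Net: the entire clip is downsampled jointly in time and space, so the network reasons over a compact representation of the whole video at once, rather than generating sparse keyframes and filling in frames with a separate temporal super-resolution stage. The function name and pooling factors here are illustrative assumptions.

```python
import numpy as np

def spacetime_downsample(x, t=2, s=2):
    """Average-pool a clip by factor t in time and s in space.

    x has shape (frames, channels, height, width). This mimics, in
    spirit only, how a Space-Time U-Net compresses the full clip in
    both dimensions before processing it as a whole.
    """
    f, c, h, w = x.shape
    return x.reshape(f // t, t, c, h // s, s, w // s, s).mean(axis=(1, 4, 6))

# A toy 16-frame RGB clip at 128x128 resolution.
clip = np.zeros((16, 3, 128, 128))

coarse = spacetime_downsample(clip)    # -> shape (8, 3, 64, 64)
coarser = spacetime_downsample(coarse) # -> shape (4, 3, 32, 32)
print(coarse.shape, coarser.shape)
```

A cascaded pipeline, by contrast, would first generate only a few keyframes at full resolution and then interpolate the missing frames in a separate stage, which is where global temporal consistency becomes hard to enforce.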

References

However, training large-scale text-to-video (T2V) foundation models remains an open challenge due to the added complexities that motion introduces.

Lumiere: A Space-Time Diffusion Model for Video Generation (arXiv:2401.12945, Bar-Tal et al., 23 Jan 2024), Section 1, Introduction