Scalable Training of Large-Scale Text-to-Video Foundation Models
Develop scalable training methodologies for large-scale text-to-video foundation models that handle the added complexities of modeling motion, so that the models synthesize realistic, temporally coherent videos under stringent memory, compute, and data-scale constraints.
References
However, training large-scale text-to-video (T2V) foundation models remains an open challenge due to the added complexities that motion introduces.
— Lumiere: A Space-Time Diffusion Model for Video Generation
(2401.12945 - Bar-Tal et al., 23 Jan 2024) in Section 1, Introduction