
WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens (2401.09985v1)

Published 18 Jan 2024 in cs.CV

Abstract: World models play a crucial role in understanding and predicting the dynamics of the world, which is essential for video generation. However, existing world models are confined to specific scenarios such as gaming or driving, limiting their ability to capture the complexity of general world dynamic environments. Therefore, we introduce WorldDreamer, a pioneering world model to foster a comprehensive comprehension of general world physics and motions, which significantly enhances the capabilities of video generation. Drawing inspiration from the success of LLMs, WorldDreamer frames world modeling as an unsupervised visual sequence modeling challenge. This is achieved by mapping visual inputs to discrete tokens and predicting the masked ones. During this process, we incorporate multi-modal prompts to facilitate interaction within the world model. Our experiments show that WorldDreamer excels in generating videos across different scenarios, including natural scenes and driving environments. WorldDreamer showcases versatility in executing tasks such as text-to-video conversion, image-to-video synthesis, and video editing. These results underscore WorldDreamer's effectiveness in capturing dynamic elements within diverse general world environments.

Introduction to WorldDreamer

This paper introduces WorldDreamer, a general world model for generating dynamic video content. Unlike existing world models that are restricted to specific domains such as gaming or autonomous driving, WorldDreamer targets a wide array of real-world scenarios. Its core innovation, inspired by the recent success of LLMs, is to map visual inputs to discrete tokens and train the model to predict the ones that are masked. The sections below summarize the architecture, training methodology, and capabilities of WorldDreamer, illustrating its potential to reshape how video generation is approached.
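
A minimal sketch of this masked-token objective follows. It assumes a generic predictor network that maps token ids to per-position logits; the shapes, codebook size, and the extra [MASK] id are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def masked_token_loss(predictor, tokens, mask_ratio=0.5, mask_id=8192):
    # tokens: (B, N) discrete visual token ids from an image/video tokenizer.
    # predictor: any model mapping (B, N) token ids -> (B, N, vocab) logits.
    B, N = tokens.shape
    # Randomly choose which positions to hide for this training step.
    mask = torch.rand(B, N, device=tokens.device) < mask_ratio
    corrupted = tokens.masked_fill(mask, mask_id)   # replace hidden ids with [MASK]
    logits = predictor(corrupted)                   # (B, N, vocab)
    # BERT-style objective: cross-entropy only on the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```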

Conceptual Architecture

At the heart of WorldDreamer's design is the Spatial Temporal Patchwise Transformer (STPT), which concentrates attention on localized patches across the spatial-temporal dimensions, allowing for a more nuanced representation of motion and physics in videos. The model uses VQGAN to encode images into discrete tokens and adopts a Transformer architecture familiar from LLMs. This design enables efficient learning and yields a substantial speed advantage over existing diffusion-based models, with roughly a threefold speedup reported for video generation.
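
The exact STPT block is specified in the paper; the sketch below only illustrates the general idea of restricting self-attention to local spatial-temporal windows over the token grid. The embedding dimension, head count, and window size are hypothetical.

```python
import torch
import torch.nn as nn

class LocalSpatioTemporalAttention(nn.Module):
    # Window attention over a (time, height, width) grid of token embeddings.
    def __init__(self, dim=256, heads=8, window=(2, 4, 4)):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, C); T, H, W must be divisible by the window sizes.
        B, T, H, W, C = x.shape
        wt, wh, ww = self.window
        # Partition the grid into non-overlapping spatio-temporal windows.
        x = x.reshape(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        # Self-attention only among tokens that share a window.
        out, _ = self.attn(x, x, x)
        # Undo the window partitioning.
        out = out.reshape(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        out = out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
        return out
```

Restricting attention to local windows keeps the cost of each block far below that of full spatio-temporal attention, which is one way to make the long token sequences produced by videos tractable.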

Diverse Applications and Promising Results

WorldDreamer's versatility allows it to perform a range of video generation tasks, including text-to-video conversion, image-to-video synthesis, and video editing. It handles both realistic natural-scene videos and the intricacies of autonomous driving data. Results from extensive experiments confirm WorldDreamer's ability to generate cohesive and dynamic videos, underscoring the model's adaptability and its broad understanding of diverse world environments.
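
One way to read this versatility is that each task corresponds to a different pattern of known versus masked visual tokens over the video's token grid. The helper below is a hypothetical illustration of that reading, not an interface from the paper.

```python
import torch

def build_task_mask(T, H, W, task, edit_region=None):
    # Returns a boolean grid where True marks tokens the model must predict.
    mask = torch.ones(T, H, W, dtype=torch.bool)       # text-to-video: all masked
    if task == "image-to-video":
        mask[0] = False                                 # first-frame tokens observed
    elif task == "video-editing" and edit_region is not None:
        mask = torch.zeros(T, H, W, dtype=torch.bool)
        t0, t1, h0, h1, w0, w1 = edit_region
        mask[t0:t1, h0:h1, w0:w1] = True                # regenerate only the edit
    return mask
```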

Advanced Training Strategies and Implementation

A notable aspect of WorldDreamer is its training approach, which incorporates dynamic masking strategies for visual tokens and enables a parallel sampling process during video generation. This design is instrumental in reducing the time required for generation, setting WorldDreamer apart from existing methods. The model is trained on carefully assembled data, including a deduplicated subset of the LAION-2B image dataset, high-quality video datasets, and autonomous driving data from the nuScenes dataset. Training uses the AdamW optimizer with learning-rate scheduling, and Classifier-Free Guidance (CFG) is used to steer sampling toward high-fidelity, prompt-consistent content.
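
The combination of dynamic masking and parallel sampling is reminiscent of MaskGIT-style iterative decoding, in which many tokens are committed per forward pass and the least confident positions are re-masked. The sketch below illustrates that style of decoding together with the standard CFG logit mix; the predictor interface, schedule, and guidance scale are assumptions for illustration, not the paper's code.

```python
import math
import torch

@torch.no_grad()
def parallel_decode(predictor, prompt, num_tokens, steps=10, cfg_scale=3.0,
                    mask_id=8192, device="cpu"):
    # Start from an all-masked sequence and fill it in over a few parallel steps.
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        # Classifier-free guidance: mix conditional and unconditional logits.
        cond = predictor(tokens, prompt)        # (1, N, vocab) logits
        uncond = predictor(tokens, None)
        logits = uncond + cfg_scale * (cond - uncond)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        still_masked = tokens.eq(mask_id)
        conf = conf.masked_fill(~still_masked, float("inf"))   # keep fixed tokens
        # Cosine schedule for how many positions stay masked after this step.
        keep_masked = int(num_tokens * math.cos(math.pi / 2 * (step + 1) / steps))
        if keep_masked > 0:
            # Re-mask the lowest-confidence positions; accept the rest in parallel.
            thresh = conf.topk(keep_masked, largest=False).values[..., -1]
            accept = conf > thresh
        else:
            accept = torch.ones_like(still_masked)
        tokens = torch.where(accept & still_masked, pred, tokens)
    return tokens
```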

Final Thoughts

WorldDreamer embodies a significant leap in the domain of video generation, providing a unique and efficient means to generate videos by capitalizing on the predictive modeling of masked visual tokens. Its adoption of LLM optimization techniques, speed of execution, and extensive training on diverse datasets make it a powerful tool for creating realistic and intricate videos. Moreover, WorldDreamer's potential applications are vast, ranging from entertainment to the development of advanced driver-assistance systems, paving the way for more dynamic and authentic video content creation.

References (55)
  1. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021.
  2. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
  3. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
  4. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023.
  5. Language models are few-shot learners. NeurIPS, 2020.
  6. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
  7. Brandon Castellano. Pyscenedetect. Github repository, 2020.
  8. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
  9. Maskgit: Masked generative image transformer. In CVPR, 2022.
  10. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.
  11. Generative pretraining from pixels. In ICML, 2020.
  12. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  13. Cogview: Mastering text-to-image generation via transformers. NeurIPS, 2021.
  14. Cogview2: Faster and better text-to-image generation via hierarchical transformers. NeurIPS, 2022.
  15. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
  16. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
  17. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662, 2023.
  18. Recurrent world models facilitate policy evolution. NeurIPS, 2018.
  19. Deep hierarchical planning from pixels. NeurIPS, 2022.
  20. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019.
  21. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020.
  22. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
  23. Diffit: Diffusion vision transformers for image generation. arXiv preprint arXiv:2312.02139, 2023.
  24. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022.
  25. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  26. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  27. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
  28. simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093, 2023.
  29. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.
  30. Adriver-i: A general world model for autonomous driving. arXiv preprint arXiv:2311.13549, 2023.
  31. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.
  32. Obelisc: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527, 2023.
  33. Learning to model the world with language. arXiv preprint arXiv:2308.01399, 2023.
  34. amused: An open muse reproduction. arXiv preprint arXiv:2401.01808, 2024.
  35. Improving language understanding by generative pre-training. OpenAI, 2018.
  36. Language models are unsupervised multitask learners. OpenAI, 2019.
  37. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020.
  38. Zero-shot text-to-image generation. In ICML, 2021.
  39. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022.
  40. Masked world models for visual control. In CoRL, 2023.
  41. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
  42. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  43. Neural discrete representation learning. NeurIPS, 2017.
  44. Attention is all you need. NeurIPS, 2017.
  45. Phenaki: Variable length video generation from open domain textual description. ICLR, 2023.
  46. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
  47. Drivedreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023.
  48. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. arXiv preprint arXiv:2311.17918, 2023.
  49. On the de-duplication of laion-2b. arXiv preprint arXiv:2303.12733, 2023.
  50. Daydreamer: World models for physical robot learning. In CoRL, 2023.
  51. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
  52. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
  53. Magvit: Masked generative video transformer. In CVPR, 2023.
  54. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.
  55. Make pixels dance: High-dynamic video generation. arXiv preprint arXiv:2311.10982, 2023.
Authors (6)
  1. Xiaofeng Wang (310 papers)
  2. Zheng Zhu (200 papers)
  3. Guan Huang (75 papers)
  4. Boyuan Wang (15 papers)
  5. Xinze Chen (10 papers)
  6. Jiwen Lu (192 papers)
Citations (21)