GameGen-X: Interactive Open-world Game Video Generation (2411.00769v3)

Published 1 Nov 2024 in cs.CV and cs.AI

Abstract: We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos. This model facilitates high-quality, open-domain generation by simulating an extensive array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, predicting and altering future content based on the current clip, thus allowing for gameplay simulation. To realize this vision, we first collected and built an Open-World Video Game Dataset from scratch. It is the first and largest dataset for open-world game video generation and control, which comprises over one million diverse gameplay video clips sampled from over 150 games with informative captions from GPT-4o. GameGen-X undergoes a two-stage training process, consisting of foundation model pre-training and instruction tuning. First, the model is pre-trained via text-to-video generation and video continuation, endowing it with the capability for long-sequence, high-quality open-domain game video generation. Further, to achieve interactive controllability, we designed InstructNet to incorporate game-related multi-modal control signal experts. This allows the model to adjust latent representations based on user inputs, unifying character interaction and scene content control for the first time in video generation. During instruction tuning, only the InstructNet is updated while the pre-trained foundation model is frozen, enabling the integration of interactive controllability without loss of diversity or quality in the generated video content.


Summary

  • The paper presents a novel diffusion transformer model, GameGen-X, that enables interactive open-world game video generation with high-quality, diverse outputs.
  • It employs a two-stage training process, pairing text-to-video and video-continuation pre-training with interactive instruction tuning, fueled by over one million captioned gameplay video clips.
  • The integration of a Masked Spatial-Temporal Diffusion Transformer and InstructNet ensures spatial-temporal coherence and dynamic user control, yielding superior FID and FVD scores.

Overview of GameGen-X: Interactive Open-world Game Video Generation

The paper presents GameGen-X, a novel diffusion transformer model explicitly crafted for generating and interactively controlling videos set in open-world gaming environments. The model marks a significant advancement in leveraging generative models for game content creation, achieving high-quality and diverse video generation while integrating interactive capabilities. The paper provides a comprehensive breakdown of the model's framework, detailing a two-stage training strategy that combines foundation model pre-training with interactive instruction tuning, thereby enhancing both the generation quality and the interactive control of video content.

The model's development required the construction of OGameData, the first and largest dataset tailored for open-world game video generation, containing over one million gameplay video clips drawn from more than 150 distinct games and annotated with GPT-4o-generated captions. This dataset served as the cornerstone for training GameGen-X, ensuring that it captures the complexity and diversity of virtual game environments.

Key Methodological Insights

1. Two-stage Training Process:

GameGen-X utilizes a two-stage training approach, comprising text-to-video generation and video continuation for the foundation model, paired with instruction tuning for interactive control. This modular training strategy ensures that the generative model maintains high quality and diversity in video output while facilitating precise interactive control.
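To make the second stage concrete, the following is a minimal, self-contained PyTorch sketch of an instruction-tuning step in which the foundation model is frozen and only the control branch is updated. The module shapes, noise step, and fusion layer are toy stand-ins rather than the authors' actual architecture or API; only the freeze-and-tune pattern mirrors the description above.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: the real components are a large masked spatial-temporal
# diffusion transformer and a multi-modal InstructNet branch.
foundation = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64)
)
instruct_net = torch.nn.Linear(64 + 16, 64)  # fuses backbone features with a 16-dim control signal

# Instruction tuning: freeze the pre-trained foundation, update only InstructNet.
for p in foundation.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(instruct_net.parameters(), lr=1e-4)

latents = torch.randn(8, 64)      # toy video latents
control = torch.randn(8, 16)      # toy keyboard/instruction embedding
noise = torch.randn_like(latents)
noisy = latents + 0.1 * noise     # stand-in for the forward diffusion step

features = foundation(noisy)                                  # frozen backbone features
pred = instruct_net(torch.cat([features, control], dim=-1))   # control-conditioned prediction
loss = F.mse_loss(pred, noise)                                # standard denoising objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```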

2. Masked Spatial-Temporal Diffusion Transformer:

At the heart of the model is the Masked Spatial-Temporal Diffusion Transformer (MSDiT), which blends spatial, temporal, and cross-attention mechanisms. This allows the model to maintain spatial coherence while capturing intricate temporal dependencies across video frames, a critical requirement for generating coherent and high-fidelity video sequences.
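As a rough illustration of factorized spatial-temporal attention combined with text cross-attention (a common DiT-style pattern), the block below is a toy sketch under assumed dimensions; it is not the paper's exact MSDiT design and it omits the masking mechanism.

```python
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Toy factorized attention block: spatial attention within each frame,
    temporal attention across frames, then cross-attention to text tokens."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, text):  # x: (B, T, S, D) video tokens, text: (B, L, D)
        B, T, S, D = x.shape
        # Spatial attention: tokens within the same frame attend to each other.
        xs = x.reshape(B * T, S, D)
        h = self.norm1(xs)
        xs = xs + self.spatial(h, h, h)[0]
        x = xs.reshape(B, T, S, D)
        # Temporal attention: each spatial location attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
        h = self.norm2(xt)
        xt = xt + self.temporal(h, h, h)[0]
        x = xt.reshape(B, S, T, D).permute(0, 2, 1, 3)
        # Cross-attention: every video token attends to the text condition.
        xc = x.reshape(B, T * S, D)
        xc = xc + self.cross(self.norm3(xc), text, text)[0]
        return xc.reshape(B, T, S, D)

out = SpatialTemporalBlock()(torch.randn(2, 4, 16, 64), torch.randn(2, 8, 64))
```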

3. InstructNet Implementation:

The integration of InstructNet into the framework allows GameGen-X to achieve interactively controllable video generation. InstructNet modifies the generation process based on user inputs without altering the pre-trained foundational capabilities of the model. This component aligns latent video representations with multimodal user intentions, allowing real-time adaptation of video content based on structured text instructions and keyboard inputs.
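One plausible way to inject control signals on top of a frozen backbone is an adapter-style modulation layer whose contribution starts at zero; the sketch below is an assumption in that spirit (the names ControlInjection and ctrl_dim and the zero-initialized gate are hypothetical), not the confirmed InstructNet design.

```python
import torch
import torch.nn as nn

class ControlInjection(nn.Module):
    """Illustrative control-injection layer: fuses a keyboard-action embedding with an
    instruction-text embedding, then modulates the frozen backbone's latent features.
    The zero-initialized gate makes the layer an identity before tuning, so attaching
    it does not disturb the pre-trained model (an adapter-style assumption, not a
    confirmed InstructNet detail)."""
    def __init__(self, dim=64, ctrl_dim=32):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * ctrl_dim, dim), nn.SiLU(), nn.Linear(dim, 2 * dim))
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed: output equals input

    def forward(self, latent, key_emb, text_emb):  # latent: (B, N, D), embeddings: (B, ctrl_dim)
        scale, shift = self.fuse(torch.cat([key_emb, text_emb], dim=-1)).chunk(2, dim=-1)
        modulated = latent * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return latent + self.gate * (modulated - latent)

layer = ControlInjection()
out = layer(torch.randn(2, 128, 64), torch.randn(2, 32), torch.randn(2, 32))
```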

Experimental Results and Performance

The implementation of GameGen-X yields superior performance across several metrics, including Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), and text-video alignment (TVA), demonstrating marked improvements in text-to-video alignment and in overall visual and temporal quality compared to existing state-of-the-art models. Its interactive control capabilities, assessed through metrics such as the success rates for character actions and environment events (SR-C and SR-E), indicate a robust ability to adapt video output in response to user control, surpassing other models in controllability.
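For reference, FID and its video counterpart FVD compare the Gaussian statistics of features extracted from real versus generated samples (Inception features for FID, I3D features for FVD). A standard NumPy/SciPy implementation of the underlying Fréchet distance is sketched below, with random arrays standing in for the actual feature embeddings.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feat_real, feat_gen):
    """Fréchet distance between two feature sets of shape (N, D):
    ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * sqrt(C1 @ C2))."""
    mu1, mu2 = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    c1 = np.cov(feat_real, rowvar=False)
    c2 = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2 * covmean))

# Toy usage: random features stand in for Inception (FID) or I3D (FVD) embeddings.
print(frechet_distance(np.random.randn(256, 64), np.random.randn(256, 64)))
```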

Implications and Future Directions

Theoretical and Practical Implications:

The work signifies a step forward in understanding and applying generative models within the gaming industry. By enhancing interactive control and generation quality, GameGen-X paves the way for more efficient content creation processes in game design, potentially reducing the resource intensity traditionally associated with developing open-world game environments.

Potential Developments:

Future developments could focus on optimizing the model for real-time generation capabilities, addressing current constraints related to computational demands. Additionally, expanding the model’s applicability through 3D modeling or integrating it more tightly with existing game engines would further enhance its practical relevance.

Overall, GameGen-X offers a compelling vision for the integration of generative models in gaming, coupling the creative possibilities of automatic content generation with the dynamic demands of interactive gameplay. Its structured framework and encouraging results underscore a promising trajectory for generative-based game development and simulation tools, potentially extending beyond gaming into broader interactive media applications.
