GameGen-X: Interactive Open-world Game Video Generation (2411.00769v3)
Abstract: We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos. The model facilitates high-quality, open-domain generation by simulating an extensive array of game engine features, such as novel characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, predicting and altering future content based on the current clip, thus enabling gameplay simulation. To realize this vision, we first collected and built an Open-World Video Game Dataset from scratch. It is the first and largest dataset for open-world game video generation and control, containing over one million diverse gameplay video clips sampled from more than 150 games, each paired with informative captions generated by GPT-4o. GameGen-X undergoes a two-stage training process consisting of foundation model pre-training and instruction tuning. First, the model is pre-trained on text-to-video generation and video continuation, endowing it with the capability for long-sequence, high-quality, open-domain game video generation. Then, to achieve interactive controllability, we designed InstructNet to incorporate game-related multi-modal control-signal experts. This allows the model to adjust latent representations based on user inputs, unifying character interaction and scene content control for the first time in video generation. During instruction tuning, only InstructNet is updated while the pre-trained foundation model remains frozen, enabling the integration of interactive controllability without loss of diversity or quality in the generated video content.
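The abstract's key training detail is that interactive control is bolted on after pre-training: the diffusion-transformer foundation model is frozen, and only InstructNet, which fuses multi-modal control signals into the latent representations, receives gradient updates. The sketch below illustrates that frozen-backbone instruction-tuning pattern in PyTorch. All module names, shapes, the control-signal encoding, and the MSE objective are illustrative assumptions, not the paper's actual architecture or code.

```python
# Minimal sketch of the instruction-tuning stage described in the abstract:
# freeze the pre-trained foundation model, update only InstructNet.
# Dimensions, layer counts, and the loss are hypothetical placeholders.
import torch
import torch.nn as nn


class FoundationModel(nn.Module):
    """Stand-in for the pre-trained video diffusion transformer (kept frozen)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.backbone(latents)


class InstructNet(nn.Module):
    """Stand-in control branch: injects control signals into the latents."""

    def __init__(self, dim: int = 512, ctrl_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(ctrl_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, latents: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # Add projected control signals to the latents, then re-mix them.
        return self.fuse(latents + self.proj(control))


foundation = FoundationModel()
instruct = InstructNet()

# Freeze the foundation model; only InstructNet's parameters are optimized.
for p in foundation.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(instruct.parameters(), lr=1e-4)

# One illustrative step on dummy data: batch of 2 clips, 16 latent tokens each.
latents = torch.randn(2, 16, 512)   # noisy video latents
control = torch.randn(2, 16, 64)    # encoded user control signals (hypothetical)
target = torch.randn(2, 16, 512)    # e.g. the denoising target

pred = foundation(instruct(latents, control))
loss = nn.functional.mse_loss(pred, target)
loss.backward()   # gradients flow only into InstructNet
optimizer.step()
```

Because the backbone's weights never change, this design preserves the generation quality and diversity learned during pre-training while the small control branch learns to steer it, which matches the abstract's stated motivation for freezing the foundation model.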