- The paper presents GameNGen, a novel system that repurposes diffusion models for real-time interactive game simulation.
- It demonstrates that an augmented Stable Diffusion v1.4 model reaches over 20 FPS on a single TPU with a PSNR of approximately 29.4, comparable to lossy JPEG compression.
- The research lays a foundation for neural game engines, potentially reducing development costs and enabling dynamic, interactive virtual environments.
Diffusion Models Are Real-Time Game Engines
The paper "Diffusion Models Are Real-Time Game Engines" presents GameNGen, an innovative application of neural models for real-time game simulation. The paper demonstrates that diffusion models, specifically an augmented variant of Stable Diffusion v1.4, can simulate the classic game DOOM at a rate of over 20 frames per second on a single TPU. This work shows promising results in using neural models to handle complex, interactive virtual environments, a domain traditionally dominated by manually crafted software systems.
Summary of Core Contributions
GameNGen Architecture: The GameNGen system is built upon a pre-trained Stable Diffusion v1.4 model, which has been adapted for interactive world simulation. The model operates in two training phases: an agent is first trained to play the game using reinforcement learning (RL), and then the generative diffusion model is trained on the accumulated data from the agent’s gameplay. The model is conditioned on sequences of past frames and actions, enabling autoregressive generation of game frames.
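The conditioning-and-rollout idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `denoise` is a hypothetical stand-in for the conditioned diffusion denoiser, and the trivial prediction it returns exists only so the loop runs end to end.

```python
import numpy as np

def denoise(noisy_latent, past_frames, past_actions):
    # Placeholder for the conditioned diffusion denoiser: a real model
    # would run several denoising steps on `noisy_latent`, conditioned
    # on encoded past frames and embedded past actions.
    return np.mean(past_frames, axis=0)

def rollout(init_frames, actions, context_len=4):
    """Autoregressive generation: each new frame is predicted from the
    last `context_len` frames plus the action history, then appended
    to the context for the next step."""
    frames = list(init_frames)
    rng = np.random.default_rng(0)
    for t, action in enumerate(actions):
        past = np.stack(frames[-context_len:])
        noisy = rng.standard_normal(past.shape[1:])
        frames.append(denoise(noisy, past, actions[:t + 1]))
    return frames
```

The key structural point is that generated frames feed back into the context, which is what makes drift correction (discussed below) necessary.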
Performance Metrics and Model Efficacy:
- Frame Rate: GameNGen runs in real time at 20 FPS on a single TPU, demonstrating the computational efficiency of the approach.
- Next Frame Prediction: It achieves a Peak Signal-to-Noise Ratio (PSNR) of 29.4, which is comparable to common lossy JPEG compression levels.
- Human Evaluation: Human raters could barely distinguish between short clips of the real game and the simulation, indicating high visual fidelity.
- Noise Augmentation: The model incorporates a noise augmentation technique to mitigate autoregressive drift, crucial for maintaining visual quality over long trajectories.
Simulation Quality and Evaluation
GameNGen is evaluated using various metrics to ensure high simulation quality:
- Image Quality: When evaluated in a teacher-forcing setup for single-frame prediction, the model achieves a PSNR of 29.43 and an LPIPS of 0.249, metrics indicative of high visual fidelity.
- Video Quality: Evaluated in an autoregressive context, the model attains a Fréchet Video Distance (FVD) of 114.02 for 16-frame sequences and 186.23 for 32-frame sequences.
- Human Evaluation: When tasked with distinguishing between simulated and real game clips, human evaluators did so only slightly better than random chance, underscoring the model's ability to produce visually convincing outputs.
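For concreteness, PSNR — the single-frame metric cited above — is a simple function of the mean squared error between the reference and predicted images. A minimal implementation (assuming 8-bit images with a peak value of 255):

```python
import numpy as np

def psnr(reference, prediction, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images.
    Higher is better; identical images give infinity."""
    diff = reference.astype(np.float64) - prediction.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A PSNR around 29–30 dB corresponds to the mild, mostly imperceptible error levels typical of lossy JPEG compression, which is why the paper uses that comparison.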
Methodological Details
Data Collection: The agent, trained via PPO (Proximal Policy Optimization) using a simple CNN architecture, generates the training dataset by playing the game in a variety of scenarios. The collected trajectories include diverse gameplay situations, ensuring the training data is rich and varied.
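The data-collection loop itself is straightforward: roll the trained agent through the game and log what it sees and does. The sketch below uses hypothetical `policy` and `env_step` callables standing in for the PPO agent and the game engine; it is an illustration of the pipeline shape, not the paper's code.

```python
def collect_trajectories(policy, env_step, initial_obs, n_steps):
    """Roll out a policy in the environment and record (observation,
    action) pairs as training data for the generative model. `policy`
    and `env_step` are hypothetical stand-ins for the PPO agent and
    the game engine, respectively."""
    dataset = []
    obs = initial_obs
    for _ in range(n_steps):
        action = policy(obs)
        dataset.append((obs, action))
        obs = env_step(obs, action)
    return dataset
```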
Training Procedure:
- The generative model is re-purposed from Stable Diffusion v1.4, removing text conditioning and introducing embeddings for past actions.
- The latent decoder of the original diffusion model is fine-tuned to reduce artifacts in fine details, such as the on-screen HUD.
- DDIM sampling and Classifier-Free Guidance are used during inference to balance quality and computational efficiency.
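Classifier-free guidance, mentioned in the last step, combines two noise predictions at each sampling step — one with the conditioning and one without — and extrapolates toward the conditioned one. The standard formula is easy to state (the function below is a generic sketch of that formula, not code from the paper):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: push the noise prediction away from
    the unconditional estimate toward the conditioned one.
    scale = 0 ignores conditioning; scale = 1 uses it directly;
    scale > 1 over-emphasizes it."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Larger guidance scales make outputs follow the conditioning more strongly at some cost in diversity, which is the quality/fidelity trade-off the paper tunes at inference time.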
Mitigation of Autoregressive Drift: By adding Gaussian noise to context frames during training, the model learns to correct inaccuracies over time, which is critical for long-term stability in autoregressive scenarios.
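The augmentation step can be sketched as follows. This is a simplified illustration of the idea described above, assuming a uniformly sampled noise level that is also returned so it can be fed to the model as an extra conditioning signal:

```python
import numpy as np

def noise_augment(context_frames, max_level, rng):
    """Corrupt the conditioning frames with Gaussian noise of a
    randomly sampled magnitude, so the model learns to denoise its
    own imperfect context at inference time. The sampled level is
    returned to be passed to the model as conditioning."""
    level = rng.uniform(0.0, max_level)
    noisy = context_frames + level * rng.standard_normal(context_frames.shape)
    return noisy, level
```

Because the model sees corrupted context during training, it learns to treat its own slightly-off generations at inference time the same way, correcting rather than compounding small errors.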
Ablations and Analysis
The authors conduct comprehensive ablations to analyze the contributions of various components:
- Context Length: Increasing the number of history frames improves the model’s performance, though gains diminish beyond a certain point.
- Noise Augmentation: Demonstrably enhances autoregressive stability, which is critical for maintaining visual quality over long generated sequences.
- Agent Play: Comparing training data generated by the RL agent with data from a random policy highlights the agent's importance in producing a robust and diverse dataset.
Implications and Future Directions
Practical Implications:
- The success of GameNGen implies potential reductions in game development costs by automating game environment creation.
- Enhanced interactivity: Such models can enable novel modes of user interaction within virtual environments, offering an adaptability that static, rule-based engines cannot match.
Theoretical Implications:
- This research points towards a new paradigm in game engine design where neural models supplant manually written code.
- The method paves the way for further exploration into using generative models for interactive applications beyond gaming, such as simulation and training environments.
Future Work:
The paper outlines several future avenues:
- Extending GameNGen to other games or interactive applications.
- Addressing the model’s memory constraints by experimenting with architectural modifications to support longer conditioning contexts.
- Further optimizing model performance for higher frame rates and deployment on consumer hardware.
Conclusion
"Diffusion Models Are Real-Time Game Engines" marks a significant step in applying neural models to a traditionally hand-crafted domain. By demonstrating the feasibility and potential of GameNGen, this research lays the groundwork for an automated, neural-network-driven future in game engine design, potentially transforming both the development and user experience of interactive virtual environments.