
From Virtual Games to Real-World Play (2506.18901v1)

Published 23 Jun 2025 in cs.CV

Abstract: We introduce RealPlay, a neural network-based real-world game engine that enables interactive video generation from user control signals. Unlike prior works focused on game-style visuals, RealPlay aims to produce photorealistic, temporally consistent video sequences that resemble real-world footage. It operates in an interactive loop: users observe a generated scene, issue a control command, and receive a short video chunk in response. To enable such realistic and responsive generation, we address key challenges including iterative chunk-wise prediction for low-latency feedback, temporal consistency across iterations, and accurate control response. RealPlay is trained on a combination of labeled game data and unlabeled real-world videos, without requiring real-world action annotations. Notably, we observe two forms of generalization: (1) control transfer: RealPlay effectively maps control signals from virtual to real-world scenarios; and (2) entity transfer: although training labels originate solely from a car racing game, RealPlay generalizes to control diverse real-world entities, including bicycles and pedestrians, beyond vehicles. Project page can be found at: https://wenqsun.github.io/RealPlay/

Summary

  • The paper introduces RealPlay, a neural diffusion-based system that achieves a 90% real-world control success rate via a two-stage training pipeline.
  • It employs mixed supervision by combining labeled game data and unlabeled real-world videos, eliminating the need for extensive real-world action annotations.
  • Empirical results show high visual fidelity and effective generalization, with successful control transfer across diverse entities like bicycles and pedestrians.

RealPlay: Neural Interactive Video Generation for Real-World Game Engines

The paper "From Virtual Games to Real-World Play" (2506.18901) introduces RealPlay, a neural network-based system for interactive, photorealistic video generation conditioned on user control signals. RealPlay is positioned as a data-driven alternative to traditional graphics-based game engines, with the explicit goal of bridging the gap between virtual game environments and real-world visual dynamics. The system leverages advances in diffusion-based video generation and demonstrates both control transfer and entity transfer from virtual to real-world domains.

Methodological Contributions

RealPlay is built upon a two-stage training pipeline:

  1. Chunk-wise Video Generation Adaptation: The authors adapt a pre-trained image-to-video diffusion model (CogVideoX-5B) to support chunk-wise, iterative video generation. This adaptation is essential for interactive applications, as it enables low-latency feedback by generating short video segments in response to user commands. Key modifications include:
    • Conditioning on previously generated video chunks rather than a single frame.
    • Custom attention masking to maintain temporal coherence.
    • Temporal resolution adjustment to balance latency and visual quality.
    • Noise augmentation during training (Diffusion Forcing) to mitigate the distribution gap between training (ground-truth conditioning) and inference (model-generated conditioning).
  2. Mixed Supervision for Control Transfer: RealPlay is fine-tuned on a combination of labeled game data (from Forza Horizon 5) and unlabeled real-world video data (vehicles, bicycles, pedestrians). The only architectural modification is the introduction of an adaptive LayerNorm control module, which injects action signals (e.g., "move forward", "turn left", "turn right") into the model. For real-world data, the action input is set to zero, enabling the model to learn from visual transitions alone.
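The adaptive LayerNorm control module described above can be sketched as follows. This is a minimal, framework-agnostic illustration in NumPy, not the paper's implementation: the weight matrices `W_scale` and `W_shift` are hypothetical parameters that map an action embedding to per-channel scale and shift terms.

```python
import numpy as np

def adaptive_layernorm(x, action_emb, W_scale, W_shift, eps=1e-5):
    """Normalize features, then modulate them with an action-conditioned
    scale and shift (adaptive LayerNorm style conditioning).

    x:          (..., d) feature tensor
    action_emb: (d_act,) embedding of the control signal; all zeros for
                unlabeled real-world clips
    W_scale, W_shift: hypothetical (d_act, d) projection matrices
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    gamma = action_emb @ W_scale  # per-channel scale from the action
    beta = action_emb @ W_shift   # per-channel shift from the action
    return x_norm * (1.0 + gamma) + beta
```

Note that with a zero action embedding (the setting used for real-world data), `gamma` and `beta` vanish and the module reduces to a plain LayerNorm, so unlabeled samples pass through unmodulated.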

The training paradigm is inspired by classifier-free guidance, where the model is exposed to both conditional (action-labeled) and unconditional (action-agnostic) samples, facilitating robust control transfer even in the absence of real-world action annotations.
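The mixed-supervision data pipeline can be sketched as below. This is an illustrative stand-in, not the authors' code: the action vocabulary follows the three commands named in the paper, and `NULL_ACTION` plays the role of the zeroed action input for unlabeled real-world clips, mirroring the unconditional branch of classifier-free guidance.

```python
import random

# Action vocabulary from the paper; index 0 is reserved as the null action.
ACTIONS = {"move forward": 1, "turn left": 2, "turn right": 3}
NULL_ACTION = 0  # assigned to unlabeled real-world clips

def make_training_pairs(game_clips, real_clips):
    """Mix labeled game clips with unlabeled real-world clips.

    game_clips: list of (clip, action_name) tuples with game-derived labels
    real_clips: list of clips with no action annotation
    Returns shuffled (clip, action_id) pairs; real-world clips receive the
    null action, so the model learns their visual transitions unconditionally.
    """
    pairs = [(clip, ACTIONS[name]) for clip, name in game_clips]
    pairs += [(clip, NULL_ACTION) for clip in real_clips]
    random.shuffle(pairs)
    return pairs
```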

Empirical Results

The evaluation is comprehensive, covering visual quality, control effectiveness, and human preference (Elo score). The main findings are as follows:

  • Control Success Rate: RealPlay achieves a control success rate of 90% on real-world entities, a substantial improvement over both single-forward-pass models (26.7–33.9%) and chunk-wise models trained on labeled or pseudo-labeled real-world data (36.2–58.9%).
  • Visual Quality: RealPlay maintains high visual fidelity, with metrics on par with or exceeding state-of-the-art video diffusion models. The chunk-wise generation framework avoids the static or low-dynamic outputs observed in some baselines.
  • Entity Transfer: Despite being supervised only on car-based game actions, RealPlay generalizes control to bicycles and pedestrians in real-world videos. Control rates vary by entity, with higher rates for entities exhibiting larger motion amplitudes (e.g., pedestrians).
  • Cross-Entity Training: Including diverse real-world entities during training improves control transfer for each individual entity, indicating that shared motion dynamics facilitate generalization.
  • Ablation on Control Injection: Adaptive LayerNorm outperforms self-attention and cross-attention strategies for action signal fusion, yielding the highest control success rate.
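The human-preference Elo scores mentioned above are computed from pairwise comparisons. As a reference, the standard Elo update for a single comparison looks like this (the paper does not specify its exact parameters; `k=32` is a conventional choice used here for illustration):

```python
def elo_update(r_a, r_b, winner_a, k=32):
    """One standard Elo update from a pairwise human preference.

    r_a, r_b: current ratings of models A and B
    winner_a: True if the rater preferred model A's video
    Returns the updated (r_a, r_b) pair.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if winner_a else 0.0
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```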

Implementation Considerations

  • Computational Requirements: Training and inference are performed on 8×A100 GPUs, reflecting the high computational demands of large-scale video diffusion models.
  • Latency and Chunk Size: There is a trade-off between control latency and visual quality. Reducing the number of frames per chunk improves responsiveness but can degrade temporal coherence and image quality, as the pre-trained model is optimized for longer sequences.
  • Annotation Efficiency: The mixed supervision paradigm eliminates the need for real-world action annotations, relying instead on scalable game data and unlabeled real-world videos. This approach is practical for domains where real-world control labeling is infeasible.
  • Error Accumulation: Iterative chunk-wise generation introduces error accumulation over long horizons, particularly in real-world settings where the visual distribution is more complex than in games.
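The error-accumulation issue follows directly from the interactive loop's structure: each chunk is conditioned on previously generated frames rather than ground truth. A minimal sketch of that loop, with `model(context, action)` as a hypothetical stand-in for the fine-tuned diffusion generator:

```python
def interactive_rollout(model, first_frame, commands):
    """Chunk-wise interactive generation loop.

    Each iteration conditions on the previously *generated* chunk, not on
    ground-truth frames, so small prediction errors compound over turns.
    model: callable (context_frames, action) -> list of generated frames
    """
    context = [first_frame]
    video = []
    for action in commands:
        chunk = model(context, action)  # short clip answering the command
        video.extend(chunk)
        context = chunk                 # next step sees generated frames only
    return video
```

Because the conditioning distribution at inference time is model-generated, the Diffusion Forcing noise augmentation described earlier is what keeps this loop from drifting too quickly.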

Implications and Future Directions

RealPlay demonstrates that neural video generation models, when properly adapted and trained, can serve as interactive, photorealistic game engines capable of responding to user control in real time. The system's ability to transfer control from virtual to real-world entities without explicit real-world action labels is a notable result, suggesting that high-fidelity simulation and control can be achieved through data-driven learning rather than handcrafted rules.

Practical implications include:

  • Enabling new forms of interactive media and simulation where real-world visuals and dynamics are required.
  • Reducing reliance on expensive graphics engines and manual annotation for control tasks in real-world video domains.
  • Providing a foundation for embodied AI agents to interact with realistic environments for planning and decision-making.

Theoretical implications include:

  • Evidence that large-scale diffusion models can generalize control policies across domains and entities, given appropriate training regimes.
  • Insights into the role of motion amplitude and cross-entity data in facilitating control transfer.

Future research directions:

  • Improving real-time performance through model distillation or accelerated denoising (e.g., shortcut diffusion).
  • Extending the action space beyond simple navigation commands to support richer interactions.
  • Addressing long-horizon consistency and error accumulation, potentially via memory-augmented architectures or hierarchical planning.
  • Exploring the integration of physics-based priors to further enhance realism and controllability.

Conclusion

RealPlay establishes a new paradigm for neural, data-driven game engines that operate on real-world video distributions. By leveraging chunk-wise diffusion models and mixed supervision, it achieves high-fidelity, controllable video generation with strong generalization across domains and entities. The approach offers a scalable path toward interactive, photorealistic simulation and has significant implications for both AI research and practical applications in simulation, robotics, and media.
