Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition (2506.17201v1)

Published 20 Jun 2025 in cs.CV

Abstract: Recent advances in diffusion-based and controllable video generation have enabled high-quality and temporally coherent video synthesis, laying the groundwork for immersive interactive gaming experiences. However, current methods face limitations in dynamics, generality, long-term consistency, and efficiency, which limit the ability to create various gameplay videos. To address these gaps, we introduce Hunyuan-GameCraft, a novel framework for high-dynamic interactive video generation in game environments. To achieve fine-grained action control, we unify standard keyboard and mouse inputs into a shared camera representation space, facilitating smooth interpolation between various camera and movement operations. Then we propose a hybrid history-conditioned training strategy that extends video sequences autoregressively while preserving game scene information. Additionally, to enhance inference efficiency and playability, we achieve model distillation to reduce computational overhead while maintaining consistency across long temporal sequences, making it suitable for real-time deployment in complex interactive environments. The model is trained on a large-scale dataset comprising over one million gameplay recordings across over 100 AAA games, ensuring broad coverage and diversity, then fine-tuned on a carefully annotated synthetic dataset to enhance precision and control. The curated game scene data significantly improves the visual fidelity, realism and action controllability. Extensive experiments demonstrate that Hunyuan-GameCraft significantly outperforms existing models, advancing the realism and playability of interactive game video generation.

Summary

  • The paper presents Hunyuan-GameCraft, a diffusion-based framework for high-dynamic, interactive game video generation using a unified action representation and hybrid history conditioning.
  • The system achieves state-of-the-art results in visual quality, control, and consistency, reaching a real-time interactive speed of 6.6 FPS via model distillation.
  • Hunyuan-GameCraft generalizes to real-world video generation tasks and provides a scalable foundation for interactive media applications and future research.

Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition

Hunyuan-GameCraft presents a comprehensive framework for high-dynamic, interactive game video generation, addressing key challenges in controllability, temporal consistency, and efficiency that have limited prior approaches. The system is built upon a diffusion-based text-to-video foundation (HunyuanVideo), augmented with a unified action representation, a hybrid history-conditioned training strategy, and model distillation for accelerated inference. The following analysis details the technical contributions, empirical results, and broader implications of this work.

Unified Action Representation

A central innovation is the unification of discrete keyboard and mouse inputs into a continuous camera representation space. This design enables smooth interpolation between movement and camera operations, supporting fine-grained control over translation and rotation (excluding roll) with explicit velocity parameters. The action encoder leverages lightweight convolutional and pooling layers, followed by token addition for efficient fusion with video latents. This approach achieves effective control injection with minimal computational overhead, outperforming more complex alternatives such as token or channel-wise concatenation in both accuracy and efficiency.
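As a concrete illustration of this pathway, the following sketch maps a key press to a continuous camera-velocity vector and encodes the per-frame action signal into tokens that are added onto the video latents. Module names, dimensions, and the key-to-velocity table are assumptions for illustration, not the paper's released implementation.

```python
# Hypothetical sketch of the unified action pathway; names and dimensions are assumptions.
import torch
import torch.nn as nn

class ActionEncoder(nn.Module):
    """Encodes a per-frame camera signal (translation + yaw/pitch velocities)
    into tokens that are added onto the video latent tokens."""
    def __init__(self, action_dim: int = 5, latent_dim: int = 1024):
        super().__init__()
        # Lightweight 1D conv + pooling over the temporal axis, then projection.
        self.net = nn.Sequential(
            nn.Conv1d(action_dim, 256, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.AvgPool1d(kernel_size=2),
            nn.Conv1d(256, latent_dim, kernel_size=3, padding=1),
        )

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        # actions: (batch, frames, action_dim) continuous camera-space signal
        x = self.net(actions.transpose(1, 2))   # (batch, latent_dim, frames // 2)
        return x.transpose(1, 2)                # (batch, tokens, latent_dim)

def key_to_camera(key: str, speed: float = 1.0) -> torch.Tensor:
    """Map a discrete key press to a continuous camera velocity
    [vx, vy, vz, yaw_rate, pitch_rate]; the mapping is illustrative only."""
    table = {
        "W": [0, 0, speed, 0, 0],   # move forward
        "S": [0, 0, -speed, 0, 0],  # move backward
        "A": [-speed, 0, 0, 0, 0],  # strafe left
        "D": [speed, 0, 0, 0, 0],   # strafe right
    }
    return torch.tensor(table.get(key, [0, 0, 0, 0, 0]), dtype=torch.float32)
```

Because the action tokens are simply added to the latent tokens, control injection adds only a small encoder on top of the base video model, which is consistent with the reported efficiency advantage over concatenation-based fusion.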

Hybrid History-conditioned Training

To address the challenge of long-term temporal consistency in autoregressive video generation, Hunyuan-GameCraft introduces a hybrid history-conditioned training paradigm. This strategy mixes three conditioning modes during training: single-frame, single-clip, and multi-clip history. A variable mask indicator distinguishes between historical and predicted frames, enabling the model to balance responsiveness to new action inputs with the preservation of scene continuity. Empirical ablations demonstrate that this hybrid approach achieves superior trade-offs between interaction accuracy and long-term visual fidelity compared to single-mode conditioning.
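The sketch below illustrates the mask-indicator idea: a conditioning mode is sampled per training example and a binary mask marks which latent frames are provided as history versus generated. Mode probabilities, clip length, and function names are assumptions made for illustration.

```python
# Illustrative hybrid history-conditioning mask; probabilities and clip length are assumed.
import random
import torch

def sample_history_mask(num_frames: int, clip_len: int = 8) -> torch.Tensor:
    """Return a per-frame mask: 1 = given as history, 0 = to be denoised."""
    mode = random.choices(
        ["single_frame", "single_clip", "multi_clip"],
        weights=[0.3, 0.4, 0.3],
    )[0]
    mask = torch.zeros(num_frames)
    if mode == "single_frame":
        mask[0] = 1.0                        # only the first frame is history
    elif mode == "single_clip":
        mask[:clip_len] = 1.0                # one preceding clip is history
    else:  # multi_clip
        n_clips = random.randint(2, max(2, num_frames // clip_len - 1))
        mask[: n_clips * clip_len] = 1.0     # several preceding clips are history
    return mask  # concatenated with the latents as the variable mask indicator

# At inference time the model is rolled out autoregressively: each newly generated
# clip is appended to the history (mask = 1) before the next action is applied.
```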

Model Distillation and Acceleration

For real-time deployment, the framework incorporates model distillation using the Phased Consistency Model (PCM). This reduces the number of diffusion steps required for inference, achieving up to a 20× speedup and enabling rendering rates of 6.6 FPS at 720p. Classifier-free guidance distillation further streamlines the process, allowing the student model to directly produce guided outputs. The resulting system supports interactive applications with sub-5s latency per action, a significant improvement over prior diffusion-based methods.
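A minimal sketch of what few-step inference with such a distilled student can look like is shown below; the phase boundaries, step count, and noise schedule are placeholders, not the released inference code, and classifier-free guidance is assumed to be folded into the student as described above.

```python
# Hedged sketch of few-step sampling with a consistency-style distilled student.
import torch

@torch.no_grad()
def few_step_sample(student, latents, action_tokens,
                    boundaries=(999, 749, 499, 249, 0)):
    """Run the student over a handful of phase boundaries instead of the full
    diffusion schedule; no separate unconditional pass is needed because CFG
    is distilled into the student."""
    x = latents  # start from noise in latent space
    for t_hi, t_lo in zip(boundaries[:-1], boundaries[1:]):
        t = torch.full((x.shape[0],), t_hi, device=x.device, dtype=torch.long)
        x0_pred = student(x, t, action_tokens)  # predict clean latents in one call
        x = renoise(x0_pred, t_lo) if t_lo > 0 else x0_pred
    return x

def renoise(x0, t):
    # Placeholder forward-diffusion step; the real noise schedule is model-specific.
    alpha = 1.0 - t / 1000.0
    return (alpha ** 0.5) * x0 + ((1.0 - alpha) ** 0.5) * torch.randn_like(x0)
```

Collapsing the sampler to a handful of student calls is what turns a multi-second diffusion rollout into the reported 6.6 FPS, 720p interactive rate.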

Dataset Construction and Training

The model is trained on a large-scale, curated dataset comprising over one million gameplay recordings from more than 100 AAA titles, supplemented with high-precision synthetic sequences for geometric priors. The data pipeline includes scene and action-aware partitioning, quality filtering, 6-DoF camera trajectory annotation, and hierarchical captioning. A distribution balancing strategy mitigates forward-motion bias, enhancing generalization across diverse camera trajectories.
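The distribution-balancing step can be pictured as inverse-frequency resampling over motion buckets, sketched below; the bucket labels and weighting scheme are assumptions rather than the paper's exact pipeline.

```python
# Hedged sketch of distribution balancing over camera-motion buckets.
from collections import Counter
import random

def balance_clips(clips):
    """clips: list of dicts with a 'motion' label such as
    'forward', 'backward', 'strafe', 'rotate', or 'static'."""
    counts = Counter(c["motion"] for c in clips)
    # Inverse-frequency weight per clip, so under-represented motions
    # (anything other than forward) are sampled more often.
    weights = [1.0 / counts[c["motion"]] for c in clips]
    return random.choices(clips, weights=weights, k=len(clips))
```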

Empirical Evaluation

Quantitative and qualitative comparisons against state-of-the-art baselines (Matrix-Game, CameraCtrl, MotionCtrl, WanX-Cam) demonstrate the following:

  • Visual Quality: Achieves the lowest FVD (1554.2) and competitive image quality and aesthetic scores.
  • Dynamic Performance: Substantially higher dynamic average (67.2) than all baselines, indicating more realistic and varied motion.
  • Control Accuracy: Reduces relative pose error (translation/rotation) by 55% compared to Matrix-Game; a sketch of this metric follows the list.
  • Temporal Consistency: Maintains high temporal consistency (0.95), supporting long video extension without quality collapse.
  • Inference Speed: With PCM, achieves real-time interaction (6.6 FPS), a marked improvement over previous models.
  • User Study: Receives the highest user preference scores across all evaluated dimensions.
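
For reference, relative pose error compares the estimated camera trajectory of the generated video against the intended trajectory. The sketch below shows a generic per-frame formulation (mean translation distance and geodesic rotation angle); the paper's exact normalization is not reproduced here.

```python
# Generic relative pose error between estimated and ground-truth camera trajectories.
import numpy as np

def relative_pose_error(R_est, t_est, R_gt, t_gt):
    """R_*: (N, 3, 3) rotation matrices; t_*: (N, 3) translations, per frame."""
    trans_err = np.linalg.norm(t_est - t_gt, axis=-1).mean()
    # Geodesic rotation error: angle of R_est^T @ R_gt per frame, in degrees.
    R_rel = np.einsum("nij,njk->nik", np.transpose(R_est, (0, 2, 1)), R_gt)
    cos = np.clip((np.trace(R_rel, axis1=1, axis2=2) - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.degrees(np.arccos(cos)).mean()
    return trans_err, rot_err
```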

Ablation studies confirm the necessity of hybrid history conditioning and the balanced use of synthetic and real gameplay data for optimal performance.

Generalization and Limitations

Although optimized for game environments, Hunyuan-GameCraft demonstrates strong generalization to real-world video generation tasks, attributable to its foundation on a large-scale, pre-trained video model. However, the current action space is primarily suited for open-world exploration and lacks support for more complex, game-specific actions (e.g., shooting, object manipulation). Future work is proposed to expand the action repertoire and further enhance physical interactivity.

Implications and Future Directions

Hunyuan-GameCraft establishes a robust foundation for interactive video generation in both research and applied settings. Its unified action representation and hybrid conditioning paradigm offer a scalable solution for controllable, temporally coherent video synthesis. The demonstrated efficiency gains via model distillation make it viable for real-time applications, including game prototyping, virtual environment simulation, and interactive content creation.

Theoretically, the hybrid history-conditioned approach provides a generalizable framework for balancing responsiveness and consistency in autoregressive generative models. Practically, the system's modularity and efficiency position it as a candidate for integration into next-generation game engines and interactive media platforms.

Future research may explore the extension of the action space to encompass a broader range of interactions, the incorporation of physics-based constraints, and the adaptation of the framework to other domains requiring high-fidelity, controllable video generation. The large-scale, annotated dataset and the hybrid training methodology also offer valuable resources and insights for the broader generative modeling community.
