
HunyuanVideo 1.5 Technical Report (2511.18870v2)

Published 24 Nov 2025 in cs.CV

Abstract: We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.

Summary

  • The paper presents an 8.3B Diffusion Transformer architecture that balances high-fidelity video output with efficient inference on consumer GPUs.
  • It employs a unified DiT backbone with sparse attention and a bilingual glyph-aware text encoder to achieve superior motion coherence and semantic consistency.
  • The cascaded super-resolution and progressive training with RLHF collectively enable scalable 1080p video synthesis, democratizing advanced video generation research.

HunyuanVideo 1.5: An Open-Source, Efficient, Multimodal Video Generation Framework

Overview and Contributions

HunyuanVideo 1.5 introduces a significant advancement in open-source video generation, presenting an 8.3B-parameter Diffusion Transformer (DiT) architecture with state-of-the-art visual fidelity and motion coherence, all within an efficient parameter budget suitable for consumer-grade GPUs. The model addresses critical gaps in the field, namely the trade-off between output quality and inference/resource efficiency observed in prior open and proprietary models. Its contributions are multifaceted: a carefully curated data pipeline emphasizes diversity and quality, a unified DiT backbone incorporates sparse attention (SSTA), a bilingual glyph-aware text encoder enables precise prompt alignment, and progressive training/post-training strategies systematically refine generative performance. The cascaded video super-resolution network further upscales outputs to 1080p without introducing notable distortions.

Data Preparation and Captioning Pipeline

The data pipeline is constructed with rigorous multi-stage filtering, ultimately distilling over 10 million hours of raw video into 800 million high-quality, diverse clips. A standout feature is the sophisticated captioning infrastructure: three specialized models generate rich, semantically structured descriptions for images, videos, and image-to-video transformations. To address the classic richness-hallucination trade-off in captioning, the post-training process integrates RL via OPA-DPO to maximize descriptive detail while minimizing factual errors. Camera movement recognition augments captions to facilitate controllable video synthesis.

Figure 1: The caption model post-training pipeline, integrating RL to optimize for caption informativeness and factuality.
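The multi-stage filtering described above can be sketched as a cheap-to-expensive cascade over per-clip quality signals. The stage names and thresholds below are illustrative assumptions, not the paper's actual criteria:

```python
# Illustrative sketch of a multi-stage clip filter; the stages and
# thresholds are hypothetical, not the paper's actual pipeline.
from dataclasses import dataclass

@dataclass
class Clip:
    duration_s: float    # clip length in seconds
    aesthetic: float     # aesthetic score in [0, 1]
    motion: float        # motion-magnitude score in [0, 1]
    ocr_coverage: float  # fraction of frame area covered by text

def passes_filters(clip: Clip) -> bool:
    """Apply cheap filters first so expensive stages see fewer clips."""
    if not (2.0 <= clip.duration_s <= 15.0):  # drop too-short/too-long clips
        return False
    if clip.ocr_coverage > 0.3:               # drop text-heavy clips
        return False
    if clip.motion < 0.05:                    # drop near-static clips
        return False
    return clip.aesthetic >= 0.5              # keep visually pleasing clips

clips = [
    Clip(6.0, 0.8, 0.40, 0.05),  # passes all stages
    Clip(1.0, 0.9, 0.50, 0.00),  # rejected: too short
    Clip(8.0, 0.9, 0.01, 0.00),  # rejected: nearly static
]
kept = [c for c in clips if passes_filters(c)]
```

Ordering the cascade from cheapest to most expensive check keeps the cost of filtering billions of candidate clips manageable.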

Model Architecture: Unified DiT and Sparse Attention

The framework's backbone is a unified DiT architecture supporting joint training across text-to-video, image-to-video, and text-to-image tasks. The model leverages a causal 3D VAE encoder with aggressive spatial and temporal compression, substantially reducing token count and computational burden. A dual-encoder mechanism combines Qwen2.5-VL and Glyph-ByT5, achieving robust bilingual and glyphic understanding critical for multilingual prompt precision.
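The impact of the causal 3D VAE's compression on sequence length can be illustrated with a small token-count calculation. The temporal/spatial strides and patch size below are assumptions for illustration, not the paper's reported configuration:

```python
def latent_token_count(frames: int, height: int, width: int,
                       t_stride: int = 4, s_stride: int = 8,
                       patch: int = 2) -> int:
    """Tokens after causal 3D VAE compression plus DiT patchification.
    The stride/patch values are illustrative assumptions."""
    lat_t = 1 + (frames - 1) // t_stride  # causal: first frame encoded alone
    lat_h = height // s_stride
    lat_w = width // s_stride
    return lat_t * (lat_h // patch) * (lat_w // patch)

# e.g. a 121-frame 720p clip under the assumed compression factors
tokens = latent_token_count(121, 720, 1280)
```

Even under these modest assumed factors, the latent sequence is orders of magnitude shorter than the raw pixel count, which is what makes full attention over video tractable at all before sparsification.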

Sparse attention is realized via SSTA, which dynamically prunes spatiotemporal tokens based on content relevance and local-global adaptivity. This reduces overhead, achieving a 1.87× inference speedup over FlashAttention-3 for 10-second 720p video synthesis.

Figure 2: Architecture of the Unified Diffusion Transformer supporting progressive, multimodal video synthesis with compressed latent representations and dual-channel text encoding.
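The selective-plus-sliding idea behind SSTA can be sketched as a per-tile attention mask that always keeps a local window and additionally keeps the highest-scoring tiles. The pooled-similarity scoring rule here is an assumption for illustration, not the paper's exact criterion:

```python
import torch

def sliding_tile_mask(n_tiles: int, window: int,
                      q_pool: torch.Tensor, k_pool: torch.Tensor,
                      keep: int) -> torch.Tensor:
    """Build a boolean per-tile attention mask combining a sliding local
    window with content-based selection of the top-`keep` tiles.
    A simplified sketch of selective + sliding tile attention."""
    # Content relevance: similarity between pooled query and key tiles.
    scores = q_pool @ k_pool.T  # (n_tiles, n_tiles)
    mask = torch.zeros(n_tiles, n_tiles, dtype=torch.bool)
    # Sliding window: each tile always attends to its neighbours.
    for i in range(n_tiles):
        lo, hi = max(0, i - window), min(n_tiles, i + window + 1)
        mask[i, lo:hi] = True
    # Selective part: additionally keep the top-scoring tiles per query tile.
    topk = scores.topk(keep, dim=-1).indices
    mask.scatter_(1, topk, True)
    return mask

q_pool = torch.randn(8, 16)  # one pooled feature per query tile
k_pool = torch.randn(8, 16)
mask = sliding_tile_mask(8, window=1, q_pool=q_pool, k_pool=k_pool, keep=2)
```

Attention is then computed only over the `True` entries, which is where the savings over dense attention come from on long video token sequences.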

Cascaded Video Super-Resolution

HunyuanVideo 1.5 incorporates a cascaded video super-resolution stage that operates in latent space, using channel concatenation to align the low-resolution latent with the high-resolution target. The super-resolution block upscales videos synthesized at up to 720p to 1080p, correcting distortions and enhancing sharpness while preserving temporal stability.

Figure 3: The pipeline of the cascaded video super-resolution model, converting low-resolution latent outputs into coherent, high-definition video.
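The channel-concatenation conditioning can be sketched as follows; the latent channel counts and the upsampling mode are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def sr_input(lowres_latent: torch.Tensor,
             noisy_hires: torch.Tensor) -> torch.Tensor:
    """Condition a super-resolution denoiser by channel-concatenating the
    upsampled low-resolution latent with the noisy high-resolution latent.
    A minimal sketch; the actual conditioning details may differ."""
    # Upsample the low-res latent to the high-res spatial grid.
    up = F.interpolate(lowres_latent, size=noisy_hires.shape[-2:],
                       mode="bilinear", align_corners=False)
    # Channel concatenation spatially aligns the two latents for the denoiser.
    return torch.cat([noisy_hires, up], dim=1)

low = torch.randn(1, 16, 45, 80)     # low-res latent (channels assumed)
noisy = torch.randn(1, 16, 68, 120)  # noisy high-res latent (assumed shape)
x = sr_input(low, noisy)
```

Because the concatenated latent carries the low-resolution content at every spatial position, the denoiser only has to add high-frequency detail rather than re-synthesize the video.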

Training Paradigm: Progressive Pre-training and Post-training

The pre-training regimen progressively scales spatial resolution, temporal duration, and task complexity. Mixed-task training balances semantic and visual objectives across T2I, T2V, and I2V, with careful ratio tuning and bucketed data sampling. Flow matching and adaptive shift scheduling maintain stability across varying token lengths. The Muon optimizer is employed for accelerated convergence, outperforming AdamW in both training loss and generative quality benchmarks.
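The flow-matching objective can be sketched in its generic rectified-flow form; the interpolation convention and the toy model below are illustrative, not the paper's exact recipe:

```python
import torch

def flow_matching_loss(model, x0: torch.Tensor, x1: torch.Tensor,
                       t: torch.Tensor) -> torch.Tensor:
    """Rectified-flow-style objective: the network predicts the velocity
    (x1 - x0) along the straight path x_t = (1 - t) * x0 + t * x1.
    A generic flow-matching sketch, not the paper's exact formulation."""
    t = t.view(-1, 1, 1, 1)          # broadcast over (C, H, W)
    xt = (1 - t) * x0 + t * x1       # sample on the interpolation path
    target = x1 - x0                 # constant velocity along the path
    pred = model(xt, t.flatten())
    return torch.mean((pred - target) ** 2)

# Toy usage: a "model" that always predicts zero velocity.
model = lambda x, t: torch.zeros_like(x)
x0 = torch.zeros(2, 4, 8, 8)  # one endpoint of the path (toy latents)
x1 = torch.ones(2, 4, 8, 8)   # the other endpoint (toy noise)
loss = flow_matching_loss(model, x0, x1, torch.tensor([0.3, 0.7]))
```

The adaptive shift scheduling mentioned above would reweight how `t` is sampled as a function of token count, which this sketch does not model.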

Post-training adopts a multi-stage approach: continued training, SFT, and RLHF alignment. RLHF strategies differ for T2V and I2V; offline DPO is complemented by online RL, with reward models trained to improve motion fidelity, semantic consistency, and aesthetics. RL-based post-training yields systematic improvements across qualitative and quantitative metrics.

Figure 4: Visualization of generative refinement across continued training, SFT, and RLHF stages.
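The offline DPO component can be sketched in its standard form on (winner, loser) pairs; the paper's OPA-DPO and video reward modeling add machinery not shown here:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization: maximise the margin between the
    policy's and the reference model's log-ratios on preferred (winner)
    vs. dispreferred (loser) samples. Generic DPO sketch."""
    ratio_w = logp_w - ref_logp_w  # policy improvement on the winner
    ratio_l = logp_l - ref_logp_l  # policy improvement on the loser
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()

# Sanity check: when the policy equals the reference, the loss is log 2.
z = torch.zeros(4)
loss = dpo_loss(z, z, z, z)
```

The reference model anchors the policy, so preference optimization sharpens the output distribution without drifting far from the pre-trained model.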

Quantitative Evaluation and Results

HunyuanVideo 1.5 is benchmarked against both open and closed state-of-the-art models using multi-faceted rating and GSB (Good/Same/Bad) paradigms. Five key dimensions guide assessment: semantic consistency, aesthetic quality, visual fidelity, structural stability, and motion effects. HunyuanVideo 1.5 demonstrates:

  • Superior motion coherence and structural stability in both text-to-video and image-to-video tasks
  • Competitive or leading performance in visual and aesthetic quality versus systems with >27B parameters or proprietary architectures
  • Exceptional bilingual instruction-following and accurate text rendering, with strong scores for image-video consistency
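A GSB comparison reduces each pairwise human judgment to Good/Same/Bad counts. One common aggregation, used here as an illustrative assumption rather than the paper's exact formula, is the normalized win margin:

```python
def gsb_score(good: int, same: int, bad: int) -> float:
    """GSB (Good/Same/Bad) preference score: (G - B) / (G + S + B).
    Positive values favour the evaluated model, negative values the
    baseline; the paper's exact aggregation may differ."""
    total = good + same + bad
    return (good - bad) / total

# e.g. 120 wins, 60 ties, 20 losses over 200 pairwise comparisons
score = gsb_score(good=120, same=60, bad=20)
```

Keeping "Same" votes in the denominator penalizes models that merely tie the baseline, which makes the score harder to inflate than a raw win rate.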

Inference Efficiency and Resource Requirements

The SSTA sparse attention mechanism yields pronounced computational savings. Inference speed analyses, both with and without engineering optimizations, confirm that the model's compact design and latent compression yield a >2× reduction in latency for long-sequence synthesis. Peak memory for 720p, 121-frame video generation stays within 13.6 GB on an RTX 4090-class GPU. With pipeline offloading and tiling, consumer-grade deployment is practical, further lowering the barrier to creative and research use.
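The pipeline offloading mentioned above can be sketched as keeping only the active stage on the GPU at any time; this is a minimal illustration, and real frameworks implement more sophisticated variants:

```python
import torch

class SequentialOffload:
    """Run a list of pipeline stages while keeping only the active stage
    on the accelerator, trading transfer time for peak memory.
    A minimal sketch of pipeline offloading, not a production scheme."""
    def __init__(self, stages, device: str = "cuda"):
        self.stages = stages
        self.device = device

    @torch.no_grad()
    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        for stage in self.stages:
            stage.to(self.device)        # load this stage's weights
            x = stage(x.to(self.device))
            stage.to("cpu")              # free memory for the next stage
        return x

# Toy usage on CPU; pass device="cuda" on a GPU machine.
stages = [torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)]
pipe = SequentialOffload(stages, device="cpu")
out = pipe(torch.randn(2, 8))
```

Combined with spatial tiling of the VAE decode, this keeps peak memory bounded by the largest single stage rather than the whole pipeline.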

Visual Quality

Qualitative results show marked advancements in detail, sharpness, and motion stability, attributable to the super-resolution network and post-training refinement of the output distribution.

Figure 5: Visual results of the cascaded video super-resolution model demonstrating sharpness and temporal coherence across diverse inputs.

Practical and Theoretical Implications

HunyuanVideo 1.5 demonstrates that open-source models can achieve quality and efficiency competitive with proprietary systems, eroding the resource-accessibility divide. Its architecture, data, and training strategies set precedent for future compact generative approaches, especially those requiring multimodal and multilingual alignment. The release of code/weights opens avenues for broad adoption, extension to controllable video synthesis, and integration within downstream creative and industrial applications. Sparse attention and RL-based instruction tuning will likely become staples in multimodal generative systems.

Conclusion

HunyuanVideo 1.5 establishes a new open-source benchmark for efficient, high-fidelity video generation. Its unified architecture, sophisticated data and captioning pipeline, cutting-edge sparse attention optimization, and robust training strategies together provide a scalable and practical platform for multimodal video creation and research. The release democratizes access to high-quality video synthesis, and its framework and innovations will inform future developments in generative AI.
