HunyuanVideo 1.5 – Open-Source Video Synthesis Model
- HunyuanVideo 1.5 is an open-source video generation model leveraging an 8.3B Diffusion Transformer to synthesize temporally consistent and high-fidelity videos from text and images.
- It uses a unified two-stage pipeline with a 3D causal VAE and a cascaded video super-resolution network to upscale content from 480p–720p to 1080p while ensuring efficiency.
- Novel selective and sliding tile attention mechanisms paired with glyph-aware dual-channel text encoding enhance motion coherence, bilingual performance, and benchmark competitiveness.
HunyuanVideo 1.5 is an open-source video generation model from Tencent, advancing the state of the art in visual quality, motion coherence, and efficiency in the domain of text-to-video and image-to-video synthesis. With 8.3 billion parameters, HunyuanVideo 1.5 introduces a highly optimized architecture, novel attention mechanisms, glyph-aware text encoding, progressive multi-stage training, and a cascaded video super-resolution network, establishing new performance benchmarks among open-source video generation systems. The model and its code base are available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5 (Wu et al., 24 Nov 2025).
1. System Architecture and Pipeline
HunyuanVideo 1.5 utilizes a unified, two-stage pipeline that enables high-fidelity, temporally consistent video synthesis while maintaining computational efficiency. The pipeline consists of:
- An 8.3B parameter Diffusion Transformer (DiT) core that operates on 3D causal VAE latents.
- An initial video synthesis step generating 5–10 second video latents at 480p–720p resolution.
- A latent-space Video Super-Resolution (VSR) network that further upsamples content to 1080p.
This architecture supports both text-to-video (T2V) and image-to-video (I2V) generation within a single framework. Peak memory usage is held at 13.6 GB for 720p × 121 frames, making inference feasible on consumer GPUs.
Pipeline Structure:

```
Text/Image Prompt
        │
   VAE Encoder
        │
8.3B DiT (with SSTA)
        │
Latent Video (480p–720p)
        │
   VSR Network
        │
   1080p Video
```
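The two-stage flow above can be sketched as a toy pipeline. All function bodies here are stand-ins, and the 4× temporal / 8× spatial compression factors and 16 latent channels are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_vae(frames):
    """Stand-in for the 3D causal VAE encoder. The 4x temporal and 8x spatial
    compression factors and 16 latent channels are assumptions."""
    t, h, w, c = frames.shape
    return rng.standard_normal(((t - 1) // 4 + 1, h // 8, w // 8, 16))

def dit_denoise(latents, prompt):
    """Stand-in for the 8.3B DiT denoiser (identity placeholder here)."""
    return latents

def vsr_upsample(latents, scale=2):
    """Stand-in for the latent-space VSR: spatial upsampling of latents."""
    return latents.repeat(scale, axis=1).repeat(scale, axis=2)

# Scaled-down toy input: 121 frames (matching the reported 720p x 121 frame
# count) with a tiny spatial size to keep the sketch light.
frames = rng.standard_normal((121, 72, 128, 3))
lat = dit_denoise(encode_vae(frames), "a red fox running through snow")
hi_res = vsr_upsample(lat)
print(lat.shape, hi_res.shape)  # (31, 9, 16, 16) (31, 18, 32, 16)
```

The point of the sketch is the shape arithmetic: the VAE shrinks the sequence before the expensive DiT runs, and the VSR restores spatial resolution in latent space before decoding.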
2. Data Acquisition and Filtering
HunyuanVideo 1.5's success critically depends on large-scale, multi-stage data curation:
- Video Data: Over 10 million hours of raw video form the basis for pre-training; after segmentation and filtering (using PySceneDetect and transition classifiers), ≈800 million high-quality clips remain, with durations constrained to 2–10 seconds per clip.
- Image Data: 5 billion images are pre-filtered from an initial 10 billion pool for staged T2I bootstrapping.
- Filtering Pipeline Steps:
- De-duplication, removal of padding and stitch artifacts, and exclusion of low-motion samples.
- Visual quality assessment (sharpness, detail, dynamic range, noise).
- Aesthetic filtering based on DOVER scores.
Captioning:
Structured, multi-component captions are generated: for video—narrative, shot type, angle, lighting, style, color, and atmosphere; for I2V—foreground/background transitions; for T2V/I2V—natural-language tokens for recognized camera motion.
Post-training, RL-based fine-tuning (OPA-DPO) balances descriptive richness and hallucination in captions, with camera motion recognized and encoded as conditional tokens (Wu et al., 24 Nov 2025).
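A structured caption of the kind described above might be stored as a keyed record. The field names follow the components listed in the text, but the exact schema and the camera-motion token format are assumptions:

```python
# Illustrative structured caption record; the schema and the token syntax
# for camera motion are assumptions of this sketch.
caption = {
    "narrative": "A chef plates a dessert in a dimly lit kitchen.",
    "shot_type": "close-up",
    "angle": "eye-level",
    "lighting": "low-key",
    "style": "cinematic",
    "color": "warm amber tones",
    "atmosphere": "intimate",
    "camera_motion": "<slow_push_in>",  # recognized motion as a conditional token
}

def to_prompt(c):
    """Flatten the structured fields into a single conditioning string."""
    return " | ".join(f"{k}: {v}" for k, v in c.items())

prompt = to_prompt(caption)
print(prompt)
```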
3. Model Design: DiT and Attention Innovations
HunyuanVideo 1.5 adopts a DiT backbone with advanced attention mechanisms to simultaneously handle large context lengths, spatiotemporal correlations, and diverse generative conditioning.
- Core Hyperparameters:
- 54 dual-stream DiT blocks, model dimension 2048, FFN dimension 8192.
- 16 attention heads with head dimension 128.
- Selective and Sliding Tile Attention (SSTA):
- Selective attention: top-k block selection based on importance scores derived from Q–K similarity and K–K redundancy.
- Sliding Tile Attention: Local 3D windowed masks reinforce sparsity, controlling receptive field size.
- The resulting mask enables flexible, data-adaptive sparse attention, yielding up to 1.87× speedup on 10s 720p sequences versus FlashAttention-3.
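A toy version of the combined mask construction can illustrate the mechanism: score key blocks per query block by Q–K similarity penalized by K–K redundancy, keep the top-k, and union that with a local sliding window. The scoring formula here is a simplification, not the paper's exact definition:

```python
import numpy as np

def ssta_mask(Q, K, block=4, topk=2, window=1):
    """Toy SSTA block mask. Block-mean pooling, the redundancy penalty, and
    the 1D window (instead of a 3D tile) are simplifying assumptions."""
    n, d = Q.shape
    nb = n // block
    Qb = Q.reshape(nb, block, d).mean(1)     # block-mean queries
    Kb = K.reshape(nb, block, d).mean(1)     # block-mean keys
    sim = Qb @ Kb.T                          # Q-K similarity per block pair
    red = np.abs(Kb @ Kb.T).mean(0)          # K-K redundancy per key block
    scores = sim - red                       # importance scores
    mask = np.zeros((nb, nb), dtype=bool)
    for i in range(nb):
        mask[i, np.argsort(scores[i])[-topk:]] = True   # selective top-k
        lo, hi = max(0, i - window), min(nb, i + window + 1)
        mask[i, lo:hi] = True                           # sliding local tile
    return mask

rng = np.random.default_rng(0)
Q, K = rng.standard_normal((16, 8)), rng.standard_normal((16, 8))
m = ssta_mask(Q, K)
print(m.shape, int(m.sum()))  # (4, 4) block mask, far sparser than dense attention
```

Because attention cost scales with the number of active blocks, a mask like this trades a bounded amount of global context (the top-k blocks) against a guaranteed local receptive field (the window), which is the source of the reported speedup.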
- 3D Causal VAE:
The VAE reduces the input spatial/temporal size efficiently, supporting the DiT regime and enabling latent upsampling for final output.
4. Text Encoding and Conditioning
A specialized dual-channel text encoding scheme optimizes cross-lingual generation fidelity:
- Qwen2.5-VL Multimodal Encoder: Captures global semantics and high-level action or scene information.
- Glyph-ByT5 Encoder: Extracts fine-grained glyph features for both Chinese and English, improving token-level discrimination.
- Conditioning is performed by concatenating the two embedding streams, augmented with learnable tokens indicating task type (T2V, I2V, or T2I). An additional cross-channel alignment loss during multi-task pre-training aligns the VL and Glyph semantic spaces.
This dual approach yields ∼17% absolute improvements in instruction following, with strong bilingual prompting and zero-shot generalization (Wu et al., 24 Nov 2025).
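The concatenation-plus-task-token conditioning can be sketched with random matrices standing in for learned projections and the task-token table. The shared width of 8 and the projection step are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_conditioning(vl_emb, glyph_emb, task="T2V"):
    """Toy dual-channel conditioning: project both streams to a shared width,
    concatenate along the sequence axis, and prepend a task token. Random
    matrices stand in for learned weights."""
    d = 8
    W_vl = rng.standard_normal((vl_emb.shape[-1], d))     # assumed projection
    W_gl = rng.standard_normal((glyph_emb.shape[-1], d))  # assumed projection
    task_tokens = {"T2V": rng.standard_normal((1, d)),
                   "I2V": rng.standard_normal((1, d)),
                   "T2I": rng.standard_normal((1, d))}
    seq = np.concatenate([vl_emb @ W_vl, glyph_emb @ W_gl], axis=0)
    return np.concatenate([task_tokens[task], seq], axis=0)

vl = rng.standard_normal((10, 12))    # Qwen2.5-VL-style global semantics
glyph = rng.standard_normal((20, 6))  # Glyph-ByT5-style character features
cond = build_conditioning(vl, glyph, task="T2V")
print(cond.shape)  # (31, 8): 1 task token + 10 + 20 projected tokens
```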
5. Progressive Training and Post-Training Regimen
The training strategy is highly staged, blending image and video tasks for maximal transfer and stability:
- Pre-training: Eight progressive stages with dynamic curriculum mixing image and video data at increasing resolutions and durations. Initial T2I stages warm up DiT weights, preventing catastrophic forgetting and enhancing structural stability.
- Continued Training (CT): Separate CT runs for T2V and I2V on 1M premium video clips each.
- Supervised Fine-Tuning (SFT): Final stabilization using rigorously filtered, high-aesthetic video clips to maximize output realism.
- RLHF (for I2V): Online RL (MixGRPO solver) with reward models over textual and visual alignment, image fidelity, and motion realism. For T2V, offline Direct Preference Optimization (DPO) on ranked data, then online RL.
- Losses:
- Denoising flow-matching: ℒ_FM = 𝔼_{t, x₀, x₁} ‖v_θ(x_t, t) − (x₁ − x₀)‖², with x_t = (1 − t) x₀ + t x₁.
- For VSR: pixel-wise, perceptual (VGG-based), and flow-matching objectives.
The Muon optimizer is employed for faster convergence than AdamW; weight decay is set to 0.01 (Wu et al., 24 Nov 2025).
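The flow-matching objective can be written out in a few lines. The linear interpolation path (rectified-flow style) is an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x0, x1, v_pred_fn):
    """Flow-matching sketch: sample t, interpolate x_t = (1-t) x0 + t x1,
    and regress the model velocity toward the constant target x1 - x0.
    The linear path is an assumed choice."""
    t = rng.uniform(size=(x0.shape[0], 1))
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    return np.mean((v_pred_fn(xt, t) - target) ** 2)

x0 = rng.standard_normal((4, 16))   # noise sample
x1 = rng.standard_normal((4, 16))   # data latent
oracle = lambda xt, t: x1 - x0      # a perfect model drives the loss to zero
print(flow_matching_loss(x0, x1, oracle))  # 0.0
```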
6. Video Super-Resolution and Output
To achieve 1080p output from lower-resolution latents, HunyuanVideo 1.5 applies a cascaded VSR network:
- Low-resolution latents (480–720p) plus noise are processed by the DiT-based VSR, which aligns low-to-high resolution in latent space via a dedicated spatial upsampler.
- The VAE decoder then reconstructs full-resolution frames.
Losses combine pixel-level differences and VGG feature-space perceptual terms, maintaining spatial and temporal consistency.
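The combined VSR objective can be sketched as a weighted sum of a pixel term and a feature-space term. The random projection standing in for VGG features and the loss weights are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((48, 8))                 # fixed random "feature" projection
feat = lambda x: x.reshape(x.shape[0], -1) @ P   # stands in for VGG features

def vsr_loss(pred, target, feat_fn, w_pix=1.0, w_perc=0.1):
    """Toy VSR objective: pixel-wise L1 plus a feature-space perceptual term
    (a VGG network supplies the features in practice). The weights are
    illustrative assumptions."""
    pix = np.mean(np.abs(pred - target))
    perc = np.mean((feat_fn(pred) - feat_fn(target)) ** 2)
    return w_pix * pix + w_perc * perc

pred = rng.standard_normal((2, 4, 4, 3))   # 2 toy frames, 4x4 RGB
print(vsr_loss(pred, pred.copy(), feat))   # 0.0 for a perfect reconstruction
```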
7. Comparative Evaluation and Open Source Release
Quantitative Benchmarks:
HunyuanVideo 1.5 achieves superior or competitive performance across major open and closed-source systems in both T2V and I2V tasks.
| Dimension | HY1.5 (T2V, 720p) | Best Open Baseline | Best Closed Baseline |
|---|---|---|---|
| Instruction Following | 61.57 | 50.03 (Kling2.1) | 73.77 (Veo3) |
| Aesthetic Quality | 63.30 | 68.22 (Seedance) | 67.98 (Veo3) |
| Visual Quality | 57.35 | 60.20 (Seedance) | 58.64 (Veo3) |
| Structural Stability | 79.75 | 73.75 (Wan2.2) | 75.62 (Veo3) |
| Motion Effects | 57.67 | 58.59 (Kling2.1) | 60.81 (Veo3) |
Speed and Memory:
- 720p × 121 frames: ~2.0s/step (dense); ~1.6s/step (sparse) on 8×H800 GPUs.
- 50-step generation: 28.33s (dense), 26.41s (sparse), with 13.6 GB GPU usage for 720p sequences.
Open-Source Release:
Code and weights are available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5, including PyTorch APIs for text-to-video, image-to-video, super-resolution, and attention sparsity toggling (Wu et al., 24 Nov 2025).
8. Context, Significance, and Implications
HunyuanVideo 1.5 decisively advances the open-source video generation field, reducing the gap to closed-source models in both motion realism and text-video alignment. Key contributions include efficient SSTA for scalable attention over long video sequences, glyph-aware dual-channel encoding for robust bilingual comprehension, and a transparent, highly scalable training regimen anchored by progressive curation and curriculum.
The model’s design suggests promising directions for further research: the SSTA mechanism, joint pre-training with structured, multilingual captions, and the dual-channel language encoder establish new templates for scalable, general-purpose video generation systems. HunyuanVideo 1.5 also lowers the technical and resource barrier for academic and industrial video synthesis applications, providing a reproducible, performant foundation for the next generation of open-source generative models (Wu et al., 24 Nov 2025, Kong et al., 3 Dec 2024).