HunyuanVideo 1.5 – Open-Source Video Synthesis Model

Updated 26 November 2025
  • HunyuanVideo 1.5 is an open-source video generation model leveraging an 8.3B Diffusion Transformer to synthesize temporally consistent and high-fidelity videos from text and images.
  • It uses a unified two-stage pipeline with a 3D causal VAE and a cascaded video super-resolution network to upscale content from 480p–720p to 1080p while ensuring efficiency.
  • Novel selective and sliding tile attention mechanisms paired with glyph-aware dual-channel text encoding enhance motion coherence, bilingual performance, and benchmark competitiveness.

HunyuanVideo 1.5 is an open-source video generation model from Tencent, advancing the state of the art in visual quality, motion coherence, and efficiency in the domain of text-to-video and image-to-video synthesis. With 8.3 billion parameters, HunyuanVideo 1.5 introduces a highly optimized architecture, novel attention mechanisms, glyph-aware text encoding, progressive multi-stage training, and a cascaded video super-resolution network, establishing new performance benchmarks among open-source video generation systems. The model and its code base are available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5 (Wu et al., 24 Nov 2025).

1. System Architecture and Pipeline

HunyuanVideo 1.5 utilizes a unified, two-stage pipeline that enables high-fidelity, temporally consistent video synthesis while maintaining computational efficiency. The pipeline consists of:

  • An 8.3B parameter Diffusion Transformer (DiT) core that operates on 3D causal VAE latents.
  • An initial video synthesis step generating 5–10 second video latents at 480p–720p resolution.
  • A latent-space Video Super-Resolution (VSR) network that further upsamples content to 1080p.

The architecture supports both text-to-video (T2V) and image-to-video (I2V) generation within a single framework. Peak memory usage is held at 13.6 GB for 720p × 121-frame generation, keeping inference feasible on consumer GPUs.

Pipeline Structure:

Text/Image Prompt
      │
  VAE Encoder
      │
  8.3B DiT (with SSTA)
      │
Latent Video (480–720p)
      │
  VSR Network
      │
  1080p Video
The 3D causal VAE provides ×16 spatial and ×4 temporal compression, facilitating efficient and scalable latent modeling (Wu et al., 24 Nov 2025).
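To make the compression concrete, the short sketch below computes the latent grid the DiT operates over for the 720p × 121-frame setting quoted above. It is a back-of-the-envelope calculation, not released code: the causal convention of keeping the first frame uncompressed is an assumption (consistent with the 121-frame clip length, since 121 = 4×30 + 1), and any additional patchification inside the DiT is ignored.

```python
# Back-of-the-envelope latent grid for the 3D causal VAE (x16 spatial, x4 temporal).
# Assumption: causal temporal compression keeps the first frame and groups the
# remaining frames in fours, i.e. T' = 1 + (T - 1) / 4.

def latent_grid(frames: int, height: int, width: int,
                spatial: int = 16, temporal: int = 4) -> tuple:
    """Return the (T', H', W') grid of latent positions fed to the DiT."""
    t = 1 + (frames - 1) // temporal
    return t, height // spatial, width // spatial

t, h, w = latent_grid(121, 720, 1280)     # the 720p x 121-frame configuration
print(t, h, w)                            # 31 45 80
print("latent positions:", t * h * w)     # 111600 positions before any patchification
```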

2. Data Acquisition and Filtering

HunyuanVideo 1.5's success critically depends on large-scale, multi-stage data curation:

  • Video Data: Over 10 million hours of raw video form the basis for pre-training; after segmentation and filtering (using PySceneDetect and transition classifiers), roughly 800 million high-quality clips remain, each fixed at 2–10 seconds (see the segmentation sketch after this list).
  • Image Data: 5 billion images are pre-filtered from an initial 10 billion pool for staged T2I bootstrapping.
  • Filtering Pipeline Steps:
  1. De-duplication, removal of padding and stitch artifacts, and exclusion of low-motion samples.
  2. Visual quality assessment (sharpness, detail, dynamic range, noise).
  3. Aesthetic filtering based on DOVER scores.
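The segmentation and duration-filtering step referenced above can be approximated with PySceneDetect's content detector. The sketch below is illustrative only: the detector threshold and the downstream quality/aesthetic filters (sharpness, DOVER, etc.) are placeholders, not the authors' settings.

```python
# Minimal clip segmentation + duration filtering sketch using PySceneDetect.
from scenedetect import ContentDetector, detect

MIN_LEN, MAX_LEN = 2.0, 10.0  # seconds, matching the 2-10 s clip duration window

def segment_and_filter(video_path: str) -> list:
    """Cut a raw video into scenes and keep only clips in the 2-10 s window."""
    scenes = detect(video_path, ContentDetector(threshold=27.0))  # threshold is a placeholder
    clips = []
    for start, end in scenes:
        duration = end.get_seconds() - start.get_seconds()
        if MIN_LEN <= duration <= MAX_LEN:
            clips.append((start.get_seconds(), end.get_seconds()))
    return clips

print(segment_and_filter("raw_video.mp4"))  # list of (start_s, end_s) candidate clips
```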

Captioning:

Structured, multi-component captions are generated:

  • Video: narrative, shot type, camera angle, lighting, style, color, and atmosphere.
  • I2V: foreground/background transitions.
  • T2V/I2V: natural-language tokens for recognized camera motion.

In post-training, RL-based fine-tuning (OPA-DPO) balances descriptive richness against hallucination in captions, with recognized camera motion encoded as conditioning tokens (Wu et al., 24 Nov 2025).
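For context, OPA-DPO builds on the standard Direct Preference Optimization objective over preferred/rejected caption pairs; its on-policy alignment refinements are not detailed here. The base DPO loss is

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$$

where $y_w$ and $y_l$ are the preferred and rejected captions (e.g., faithful vs. hallucinated) and $\pi_{\mathrm{ref}}$ is the frozen reference captioner.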

3. Model Design: DiT and Attention Innovations

HunyuanVideo 1.5 adopts a DiT backbone with advanced attention mechanisms to simultaneously handle large context lengths, spatiotemporal correlations, and diverse generative conditioning.

  • Core Hyperparameters:
    • 54 dual-stream DiT blocks, model dimension 2048, FFN dimension 8192.
    • 16 attention heads with head dimension 128.
  • Selective and Sliding Tile Attention (SSTA):
    • Selective attention: Top-k block selection based on importance scores $\mathrm{Score}_i = \lambda\,\mathrm{Score}_s - \beta\,\mathrm{Score}_r$, where $\mathrm{Score}_s$ is derived from Q–K similarity and $\mathrm{Score}_r$ from K–K redundancy (see the sketch below).
    • Sliding Tile Attention: Local 3D windowed masks reinforce sparsity, controlling receptive field size.
    • The resulting mask enables flexible, data-adaptive sparse attention, yielding up to 1.87× speedup on 10s 720p sequences versus FlashAttention-3.
  • 3D Causal VAE: compresses the input spatially and temporally (×16 / ×4), keeping the DiT's token count tractable and enabling latent-space upsampling for the final output.
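The selective half of SSTA can be illustrated with a short, self-contained sketch: queries and keys are pooled into blocks, a per-block score $\lambda\,\mathrm{Score}_s - \beta\,\mathrm{Score}_r$ is computed, and only the top-k key/value blocks are retained. The block size, pooling choice, $\lambda$, $\beta$, and $k$ below are illustrative assumptions rather than the released configuration, and the sliding-tile local mask is omitted.

```python
# Sketch of SSTA's selective (top-k) block scoring.
import torch

def select_blocks(q, k, block=64, top_k=8, lam=1.0, beta=0.5):
    """q, k: (seq_len, dim). Returns, for each query block, the kept key-block indices."""
    qb = q.unflatten(0, (-1, block)).mean(1)      # mean-pooled query blocks: (n_blocks, dim)
    kb = k.unflatten(0, (-1, block)).mean(1)      # mean-pooled key blocks:   (n_blocks, dim)
    score_s = qb @ kb.T                           # Q-K similarity between blocks
    score_r = (kb @ kb.T).mean(0, keepdim=True)   # K-K redundancy of each key block
    score = lam * score_s - beta * score_r        # Score_i = lambda*Score_s - beta*Score_r
    return score.topk(top_k, dim=-1).indices      # (n_query_blocks, top_k)

q, k = torch.randn(4096, 128), torch.randn(4096, 128)
print(select_blocks(q, k).shape)  # torch.Size([64, 8]): each query block keeps 8 key blocks
```

Attention is then computed only over the selected blocks (plus the sliding-tile local window), which is the source of the reported speedup over dense FlashAttention-3.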

4. Text Encoding and Conditioning

A specialized dual-channel text encoding scheme optimizes cross-lingual generation fidelity:

  • Qwen2.5-VL Multimodal Encoder: Captures global semantics and high-level action or scene information.
  • Glyph-ByT5 Encoder: Extracts fine-grained glyph features for both Chinese and English, improving token-level discrimination.
  • Conditioning is performed by concatenated embedding streams, augmented with learnable tokens to indicate task type (T2V, I2V, or T2I). An additional cross-channel alignment loss during multi-task pre-training aligns VL/Glyph semantic spaces:

$$\mathcal{L}_{\mathrm{align}} = \left\| f_{\mathrm{VL}}(t) - f_{\mathrm{glyph}}(t) \right\|_2^2$$

This dual approach yields ∼17% absolute improvements in instruction following, with strong bilingual prompting and zero-shot generalization (Wu et al., 24 Nov 2025).
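A minimal sketch of this cross-channel alignment loss is given below. The projection layers (needed here because the two encoders are assumed to have different hidden sizes) and all dimensions are illustrative placeholders; the paper's formula compares $f_{VL}(t)$ and $f_{glyph}(t)$ directly.

```python
# Sketch of the cross-channel alignment loss pulling Qwen2.5-VL and Glyph-ByT5
# text features toward a shared space. Dimensions and projections are placeholders.
import torch
import torch.nn as nn

class CrossChannelAlign(nn.Module):
    def __init__(self, vl_dim=4096, glyph_dim=1536, shared_dim=2048):
        super().__init__()
        self.proj_vl = nn.Linear(vl_dim, shared_dim)         # maps f_VL(t) into shared space
        self.proj_glyph = nn.Linear(glyph_dim, shared_dim)   # maps f_glyph(t) into shared space

    def forward(self, f_vl, f_glyph):
        # L_align = || f_VL(t) - f_glyph(t) ||_2^2, averaged over the batch
        return (self.proj_vl(f_vl) - self.proj_glyph(f_glyph)).pow(2).sum(-1).mean()

align = CrossChannelAlign()
print(align(torch.randn(2, 4096), torch.randn(2, 1536)).item())
```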

5. Progressive Training and Post-Training Regimen

The training strategy is highly staged, blending image and video tasks for maximal transfer and stability:

  • Pre-training: Eight progressive stages with dynamic curriculum mixing image and video data at increasing resolutions and durations. Initial T2I stages warm up DiT weights, preventing catastrophic forgetting and enhancing structural stability.
  • Continued Training (CT): Separate CT runs for T2V and I2V on 1M premium video clips each.
  • Supervised Fine-Tuning (SFT): Final stabilization using rigorously filtered, high-aesthetic video clips to maximize output realism.
  • RLHF: For I2V, online RL (MixGRPO solver) with reward models covering textual and visual alignment, image fidelity, and motion realism; for T2V, offline Direct Preference Optimization (DPO) on ranked data, followed by online RL.
  • Losses (see the sketch below):
    • Denoising flow-matching: $\mathcal{L}_{denoise} = \mathbb{E}_{x_0,\epsilon,t}\left[\,\| \epsilon - \epsilon_\theta(z_t, t, \mathrm{cond}) \|^2\,\right]$
    • VSR: pixel-wise, perceptual (VGG-based), and flow-matching objectives.
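The denoising objective can be sketched as follows. The toy linear noise schedule and the stand-in model are assumptions for illustration only; in the actual pipeline the objective is applied to the 8.3B DiT on 3D VAE latents with text/image conditioning.

```python
# Sketch of the epsilon-prediction denoising loss quoted above.
import torch

def denoise_loss(model, z0, cond, num_steps=1000):
    """z0: clean latents (B, T, C, H, W); cond: text/image conditioning."""
    b = z0.shape[0]
    t = torch.randint(0, num_steps, (b,), device=z0.device)        # sampled timestep
    eps = torch.randn_like(z0)                                     # target noise
    alpha = (1.0 - t.float() / num_steps).view(b, 1, 1, 1, 1)      # toy schedule (placeholder)
    zt = alpha.sqrt() * z0 + (1.0 - alpha).sqrt() * eps            # noised latent z_t
    eps_pred = model(zt, t, cond)                                  # epsilon_theta(z_t, t, cond)
    return (eps - eps_pred).pow(2).mean()                          # || eps - eps_theta ||^2

toy_model = lambda zt, t, cond: torch.zeros_like(zt)               # stand-in for the DiT
print(denoise_loss(toy_model, torch.randn(2, 31, 16, 45, 80), cond=None))
```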

The optimizer Muon is employed for faster convergence compared to AdamW (weight decay = 0.01) (Wu et al., 24 Nov 2025).

6. Video Super-Resolution and Output

To achieve 1080p output from lower-resolution latents, HunyuanVideo 1.5 applies a cascaded VSR network:

  • Low-resolution latents (480–720p) plus noise are processed by the DiT-based VSR, which aligns low-to-high resolution in latent space via a dedicated spatial upsampler.
  • The VAE decoder then reconstructs full-resolution frames.

Losses combine pixel-level differences and VGG feature-space perceptual terms, maintaining spatial and temporal consistency.
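A minimal sketch of this pixel-plus-perceptual combination is shown below, using torchvision's VGG19 features as the perceptual backbone. The feature-layer cut-off and loss weights are placeholders, and the ImageNet normalization and flow-matching term are omitted for brevity.

```python
# Sketch of the VSR training loss: pixel-level L1 plus VGG feature-space L1.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

vgg_feat = vgg19(weights=VGG19_Weights.DEFAULT).features[:16].eval()  # frozen VGG features
for p in vgg_feat.parameters():
    p.requires_grad_(False)

def vsr_loss(sr, hr, w_pix=1.0, w_perc=0.1):
    """sr, hr: decoded frames (B*T, 3, H, W) in [0, 1]; weights are placeholders."""
    pixel = F.l1_loss(sr, hr)                              # pixel-wise difference
    perceptual = F.l1_loss(vgg_feat(sr), vgg_feat(hr))     # perceptual (VGG) difference
    return w_pix * pixel + w_perc * perceptual

print(vsr_loss(torch.rand(2, 3, 128, 128), torch.rand(2, 3, 128, 128)))
```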

7. Comparative Evaluation and Open Source Release

Quantitative Benchmarks:

HunyuanVideo 1.5 achieves superior or competitive performance across major open and closed-source systems in both T2V and I2V tasks.

| Dimension | HY1.5 (T2V, 720p) | Best Open Baseline | Best Closed Baseline |
| --- | --- | --- | --- |
| Instruction Following | 61.57 | 50.03 (Kling2.1) | 73.77 (Veo3) |
| Aesthetic Quality | 63.30 | 68.22 (Seedance) | 67.98 (Veo3) |
| Visual Quality | 57.35 | 60.20 (Seedance) | 58.64 (Veo3) |
| Structural Stability | 79.75 | 73.75 (Wan2.2) | 75.62 (Veo3) |
| Motion Effects | 57.67 | 58.59 (Kling2.1) | 60.81 (Veo3) |

Speed and Memory:

  • 720p × 121 frames: ~2.0s/step (dense); ~1.6s/step (sparse) on 8×H800 GPUs.
  • 50-step generation: 28.33s (dense), 26.41s (sparse), with 13.6 GB GPU usage for 720p sequences.

Open-Source Release:

Code and weights are available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5, including PyTorch APIs for text-to-video, image-to-video, super-resolution, and attention sparsity toggling (Wu et al., 24 Nov 2025).

8. Context, Significance, and Implications

HunyuanVideo 1.5 decisively advances the open-source video generation field, reducing the gap to closed-source models in both motion realism and text-video alignment. Key contributions include efficient SSTA for scalable attention over long video sequences, glyph-aware dual-channel encoding for robust bilingual comprehension, and a transparent, highly scalable training regimen anchored by progressive curation and curriculum.

The model’s design suggests promising directions for further research: the SSTA mechanism, joint pre-training with structured, multilingual captions, and the dual-channel language encoder establish new templates for scalable, general-purpose video generation systems. HunyuanVideo 1.5 also lowers the technical and resource barrier for academic and industrial video synthesis applications, providing a reproducible, performant foundation for the next generation of open-source generative models (Wu et al., 24 Nov 2025, Kong et al., 3 Dec 2024).
