Seedance 1.5 Pro: Joint AV Generation
- Seedance 1.5 Pro is a joint audio-video generation model that employs a dual-branch Diffusion Transformer and cross-modal module for precise, synchronized outputs.
- It utilizes a multi-stage training pipeline with massive audiovisual data, supervised fine-tuning, and reinforcement learning to enhance narrative coherence and lip-sync accuracy.
- The model achieves state-of-the-art performance with reduced inference latency and is deployed in production via the Volcano Engine API for real-time applications.
Seedance 1.5 Pro is a foundational model for native, joint audio-video generation, utilizing a dual-branch Diffusion Transformer architecture and a cross-modal joint module. It is designed to deliver high-quality, temporally-synchronized audiovisual sequences, establishing state-of-the-art results on audio-visual generation benchmarks. Key features include multilingual and dialect-specific lip-sync, dynamic cinematic camera control, and enhanced narrative coherence. Seedance 1.5 Pro is accessible through a production-ready API on Volcano Engine (Chen et al., 15 Dec 2025).
1. Model Architecture: Dual-Branch Diffusion Transformer and Cross-Modal Joint Module
Seedance 1.5 Pro features a dual-branch Diffusion Transformer, consisting of parallel video and audio branches integrated by a cross-modal joint module.
- Video Branch: Accepts sequences of visual tokens (e.g., patch embeddings per frame), processed through a stack of Transformer blocks adapted for the diffusion reverse process. At diffusion timestep $t$, the branch predicts the denoised video embeddings $\hat{x}^v_0$.
- Audio Branch: Consumes mel-spectrogram tokens (80-dimensional features per audio frame), with a similarly parameterized but independently weighted Transformer stack; outputs a denoised audio embedding $\hat{x}^a_0$ at each timestep $t$.
- Cross-Modal Joint Module: Interleaved at designated layers, implementing bidirectional cross-attention. Video queries attend to audio keys/values and vice versa, with modality-type embeddings ($e_v$, $e_a$) injected into both token and timestep embeddings to signal origin.
The architectural fusion is structured as:
| Component | Input Tokens | Core Mechanism |
|---|---|---|
| Video Branch | Visual patches per frame | Diffusion + Transformer layers |
| Audio Branch | Mel-spectrogram tokens | Diffusion + Transformer layers |
| Joint Module | Tokens from both modalities | Bidirectional cross-attention |
This explicit cross-modal design underpins the model's high-fidelity audio-video synchronization, enabling emergent capabilities such as accurate lip sync and dynamic audio-driven scene motion.
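The bidirectional cross-attention described above can be sketched minimally in NumPy. This is a single-head illustration with the learned query/key/value projections, layer norms, and modality-type embeddings omitted; all function names are illustrative, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, d):
    # Single-head cross-attention: queries from one modality,
    # keys/values from the other (projections omitted for brevity).
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv_tokens

def joint_module(video_tokens, audio_tokens):
    # Bidirectional: each modality attends to the other, with a
    # residual connection so unimodal information is preserved.
    d = video_tokens.shape[-1]
    video_out = video_tokens + cross_attention(video_tokens, audio_tokens, d)
    audio_out = audio_tokens + cross_attention(audio_tokens, video_tokens, d)
    return video_out, audio_out

video = np.random.randn(16, 64)   # 16 visual patch tokens, dim 64
audio = np.random.randn(40, 64)   # 40 mel-spectrogram tokens, dim 64
v2, a2 = joint_module(video, audio)
```

Each modality keeps its own token count and dimensionality; only the attention context is exchanged, which is what lets the joint module be interleaved at arbitrary layers of the two branches.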
2. Diffusion Model Formulation and Loss Objectives
The model leverages a denoising diffusion probabilistic model (DDPM) backbone. While standard DDPM equations are invoked, explicit formulas are not provided beyond those common in the literature.
- Forward process: Gaussian noise is added via
  $$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$
- Reverse denoising: Approximated by the learned model
  $$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$
- Primary loss: Mean-squared error (MSE) on the predicted noise
  $$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon,\, t}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big]$$
- Variational lower bound (optional):
  $$\mathcal{L}_{\text{vlb}} = \mathbb{E}_q\Big[\textstyle\sum_{t>1} D_{\text{KL}}\big(q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t)\big)\Big]$$
- Cross-modal alignment loss: Auxiliary MSE on paired audio-video embedding alignment, e.g. for projection heads $f_v$, $f_a$:
  $$\mathcal{L}_{\text{align}} = \mathbb{E}\big[\|f_v(x^v_t) - f_a(x^a_t)\|^2\big]$$
This multimodal formulation is critical to enforce synchronization and mutual information flow between the audio and visual streams (Chen et al., 15 Dec 2025).
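The forward process and noise-prediction loss above admit a compact sketch. This follows the standard DDPM closed form under a linear beta schedule; the schedule endpoints and the toy "predictor" are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule over T diffusion steps (standard DDPM choice).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    # Closed-form forward process:
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

def mse_noise_loss(eps_pred, eps):
    # Primary training objective: MSE between predicted and true noise.
    return float(np.mean((eps_pred - eps) ** 2))

x0 = rng.standard_normal(64)       # a toy clean embedding
eps = rng.standard_normal(64)      # the noise the model must recover
x_t = q_sample(x0, t=500, eps=eps)
loss = mse_noise_loss(np.zeros_like(eps), eps)  # trivial zero "predictor"
```

In the joint model, the same objective is applied per branch, with the cross-modal alignment term added on top to couple the two denoising trajectories.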
3. Data Pipeline, Tokenization, and Training Regime
Seedance 1.5 Pro is trained on a multi-stage pipeline designed to maximize both data diversity and alignment:
- Stage 1: Curation of ~100 million minutes of in-the-wild audio-video clips.
- Stage 2: Automated filtering for audiovisual sync and segment quality.
- Stage 3: Generation of detailed captions—comprehensively describing both visual and audio content—using an advanced captioning system.
- Stage 4: Curriculum learning schedule, starting from short, simple sequences (16s), progressing to more complex, longer clips.
Tokenization and preprocessing steps:
| Modality | Tokenization Method | Preprocessing |
|---|---|---|
| Video | 16×16 frame patch embeddings | Linear projection |
| Audio | 80-band mel-spectrogram tokens | Fixed-length segmentation |
| Text | Byte-Pair Encoding (BPE) | Subword tokenization |
This pipeline ensures high coverage of real-world data and robust prompt-conditioned generative capabilities.
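The tokenization rows in the table above can be sketched as plain array reshapes. The patch size (16), mel band count (80), and segment length are taken from or assumed consistent with the table; the helper names are hypothetical.

```python
import numpy as np

def patchify(frame, patch=16):
    # Split an HxWxC frame into non-overlapping 16x16 patch tokens,
    # flattened so a linear projection can map them to model width.
    h, w, c = frame.shape
    rows, cols = h // patch, w // patch
    return (frame[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, c)
            .transpose(0, 2, 1, 3, 4)
            .reshape(rows * cols, patch * patch * c))

def segment_mel(mel, seg_len=8):
    # Group 80-band mel frames into fixed-length token segments
    # (seg_len is an assumed value for illustration).
    n = (mel.shape[0] // seg_len) * seg_len
    return mel[:n].reshape(-1, seg_len * mel.shape[1])

frame = np.zeros((256, 256, 3))   # one RGB video frame
mel = np.zeros((100, 80))         # 100 frames of 80-band mel features
vt = patchify(frame)              # 16x16 grid -> 256 tokens of dim 768
at = segment_mel(mel)             # 12 segments of dim 640
```

Text tokenization (BPE) is standard subword segmentation and is not reproduced here.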
4. Post-Training Optimization: Supervised Fine-Tuning and RLHF
After initial pretraining, Seedance 1.5 Pro undergoes post-training optimization via supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).
- SFT: Performed on proprietary, professional-grade audio-visual and cinematic datasets, driving alignment on prompt-to-output correspondence and overall quality.
- Training continues from pretrained weights, introducing text-prompt reconstruction loss on both modalities.
- Paper reports: batch size 128 over 100 K steps, yielding a +0.4 Likert increase in prompt adherence and +0.3 in motion vividness (SeedVideoBench 1.5).
- RLHF: Employs a small multi-layer Transformer reward model, evaluating:
- Motion quality
- Audio fidelity
- AV synchronization
- Aesthetic appeal
Policy optimization uses Proximal Policy Optimization (PPO) with a clip ratio of $\epsilon = 0.2$:
$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
Targeted infrastructure optimizations yield a 3× reduction in RLHF wallclock time versus earlier pipelines.
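The clipped PPO surrogate used for policy optimization can be sketched as follows. This is the standard clipped objective (written as a loss to minimize), not the paper's actual training code; the toy batch is illustrative.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_ratio=0.2):
    # Clipped surrogate objective, negated so it is minimized.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    return float(-np.mean(np.minimum(unclipped, clipped)))

# Toy batch: the clip prevents a large policy jump (first sample)
# from dominating the reward-weighted update.
logp_old = np.array([-1.0, -1.0, -1.0])
logp_new = np.array([-0.2, -1.0, -2.0])
adv = np.array([1.0, 1.0, 1.0])
loss = ppo_clip_loss(logp_new, logp_old, adv)
```

With positive advantages, any ratio above $1+\epsilon$ is truncated, so the reward model's scores (motion quality, audio fidelity, AV sync, aesthetics) can only pull the policy a bounded distance per update.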
5. Inference Acceleration and Production Readiness
Seedance 1.5 Pro incorporates a dedicated inference acceleration framework:
Multi-Stage Distillation (Hyper-SD-style): A teacher model generates with a full 1,000-step DDPM sampler; a student model is distilled to 30–50 steps with consistency constraints.
Additional optimizations:
- Mixed-precision (FP16) inference
- INT8 quantization for key matrix multiplications
- Pipeline parallelism across GPUs and TPUs
Results: Over 10× reduction in the Number of Function Evaluations (NFE), with minimal loss in human-rated quality metrics. Example latency is 1.2 s for a 512×512, 32-frame video on a single A100 (30-step student), compared to 12 s for the teacher model.
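The step-count arithmetic behind the NFE reduction can be made concrete with a strided timestep schedule. This is a generic DDIM-style subsampling sketch, not the paper's distillation recipe; the helper name is hypothetical.

```python
def student_timesteps(teacher_steps=1000, student_steps=30):
    # Subsample the teacher's 1,000-step schedule down to a 30-step
    # student schedule by striding from the noisiest timestep toward 0.
    stride = teacher_steps // student_steps
    return list(range(teacher_steps - 1, -1, -stride))[:student_steps]

ts = student_timesteps()
# 1000 -> 30 steps is roughly a 33x reduction in function evaluations,
# consistent with the reported >10x NFE reduction.
```

In practice the distilled student is trained to match the teacher's trajectory at these sparse timesteps under consistency constraints, rather than simply skipping steps at sampling time.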
6. Quantitative Performance, Capabilities, and Deployment
Performance:
| Metric | Seedance 1.5 Pro | Veo 3.1 | Kling 2.6 |
|---|---|---|---|
| AV sync: speech↔lip (±40 ms) | 95% | 88% | 90% |
| Text-to-Video (Likert 1–5) | 4.2 | 3.7 | 3.9 |
| Image-to-Video (Likert 1–5) | 4.1 | 3.6 | – |
| Inference Latency (512×512×32) | 1.2s | – | – |
Notable Capabilities:
- Native lip-sync for Mandarin, Cantonese, Sichuanese, Shanghainese, Taiwanese, with tonal accuracy.
- Cinematic camera control: automated dolly zooms, tracking and orbital shots, continuous takes; color grading during generation.
- Narrative coherence: strong long-range scene understanding and multi-shot sequencing.
Deployment:
- Model size: ~4.5 billion parameters.
- Recommended hardware: 8× A100 40 GB or 4× H100 80 GB for real-time operation.
- Accessible via Volcano Engine API, with native endpoints for text-to-video-audio (T2VA) and image-to-video-audio (I2VA) generation modes.
Sample Code (Python):
```python
from volcengine import Client

cli = Client(key=…, secret=…)
resp = cli.generate(
    model_id="Doubao-Seedance-1.5-pro",
    prompt="A rainy street scene in Paris, with soft piano music",
    mode="T2VA",
    length_s=8,
)
video, audio = resp["video"], resp["audio"]
```
Seedance 1.5 Pro represents a natively joint audio-video generation engine, delivering production-grade synchronous outputs with scalable, high-throughput deployment (Chen et al., 15 Dec 2025).