Seedance 1.5 Pro: Joint AV Generation
- Seedance 1.5 Pro is a joint audio-video generation model that employs a dual-branch Diffusion Transformer and cross-modal module for precise, synchronized outputs.
- It utilizes a multi-stage training pipeline with massive audiovisual data, supervised fine-tuning, and reinforcement learning to enhance narrative coherence and lip-sync accuracy.
- The model achieves state-of-the-art performance with reduced inference latency and is deployed in production via the Volcano Engine API for real-time applications.
Seedance 1.5 Pro is a foundational model for native, joint audio-video generation, utilizing a dual-branch Diffusion Transformer architecture and a cross-modal joint module. It is designed to deliver high-quality, temporally-synchronized audiovisual sequences, establishing state-of-the-art results on audio-visual generation benchmarks. Key features include multilingual and dialect-specific lip-sync, dynamic cinematic camera control, and enhanced narrative coherence. Seedance 1.5 Pro is accessible through a production-ready API on Volcano Engine (Chen et al., 15 Dec 2025).
1. Model Architecture: Dual-Branch Diffusion Transformer and Cross-Modal Joint Module
Seedance 1.5 Pro features a dual-branch Diffusion Transformer, consisting of parallel video and audio branches integrated by a cross-modal joint module.
- Video Branch: Accepts sequences of visual tokens (e.g., patch embeddings per frame), processed through a stack of Transformer blocks adapted for the diffusion reverse process. At diffusion timestep $t$, the branch predicts the denoised video embeddings $\hat{x}^v_0$.
- Audio Branch: Consumes mel-spectrogram tokens (80-dimensional features per audio frame), with a similarly parameterized but independently weighted Transformer stack; outputs a denoised audio embedding $\hat{x}^a_0$ at each timestep $t$.
- Cross-Modal Joint Module: Interleaved at designated layers, implementing bidirectional cross-attention. Video queries attend to audio keys/values and vice versa, with modality-type embeddings ($e_v$, $e_a$) injected into both token and timestep embeddings to signal origin.
The architectural fusion is structured as:
| Component | Input Tokens | Core Mechanism |
|---|---|---|
| Video Branch | Visual patches per frame | Diffusion + Transformer layers |
| Audio Branch | Mel-spectrogram tokens | Diffusion + Transformer layers |
| Joint Module | Tokens from both modalities | Bidirectional cross-attention |
This explicit cross-modal design underpins the model's high-fidelity audio-video synchronization, enabling emergent capabilities such as accurate lip sync and dynamic audio-driven scene motion.
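The bidirectional cross-attention described above can be sketched minimally in NumPy. This is a single-head illustration with the learned query/key/value projections, layer norms, and modality-type embeddings omitted; all function names are illustrative, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, d):
    # Single-head cross-attention: queries from one modality,
    # keys/values from the other (projections omitted for brevity).
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv_tokens

def joint_module(video_tokens, audio_tokens):
    # Bidirectional: each modality attends to the other, with a
    # residual connection so unimodal information is preserved.
    d = video_tokens.shape[-1]
    video_out = video_tokens + cross_attention(video_tokens, audio_tokens, d)
    audio_out = audio_tokens + cross_attention(audio_tokens, video_tokens, d)
    return video_out, audio_out

video = np.random.randn(16, 64)   # 16 visual patch tokens, dim 64
audio = np.random.randn(40, 64)   # 40 mel-spectrogram tokens, dim 64
v2, a2 = joint_module(video, audio)
```

Each modality keeps its own token count and dimensionality; only the attention context is exchanged, which is what lets the joint module be interleaved at arbitrary layers of the two branches.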
2. Diffusion Model Formulation and Loss Objectives
The model leverages a denoising diffusion probabilistic model (DDPM) backbone. While standard DDPM equations are invoked, explicit formulas are not provided beyond those common in the literature.
- Forward process: Gaussian noise is added via
  $$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$
- Reverse denoising: Approximated by the learned model
  $$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$
- Primary loss: Mean-squared error (MSE) on the predicted noise
  $$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon,\, t}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big]$$
- Variational lower bound (optional):
  $$\mathcal{L}_{\text{vlb}} = \mathbb{E}_q\Big[\textstyle\sum_{t>1} D_{\text{KL}}\big(q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t)\big)\Big]$$
- Cross-modal alignment loss: Auxiliary MSE on paired audio-video embedding alignment, e.g. for projection heads $f_v$, $f_a$:
  $$\mathcal{L}_{\text{align}} = \mathbb{E}\big[\|f_v(x^v_t) - f_a(x^a_t)\|^2\big]$$
This multimodal formulation is critical to enforce synchronization and mutual information flow between the audio and visual streams (Chen et al., 15 Dec 2025).
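The forward process and noise-prediction loss above admit a compact sketch. This follows the standard DDPM closed form under a linear beta schedule; the schedule endpoints and the toy "predictor" are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule over T diffusion steps (standard DDPM choice).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    # Closed-form forward process:
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

def mse_noise_loss(eps_pred, eps):
    # Primary training objective: MSE between predicted and true noise.
    return float(np.mean((eps_pred - eps) ** 2))

x0 = rng.standard_normal(64)       # a toy clean embedding
eps = rng.standard_normal(64)      # the noise the model must recover
x_t = q_sample(x0, t=500, eps=eps)
loss = mse_noise_loss(np.zeros_like(eps), eps)  # trivial zero "predictor"
```

In the joint model, the same objective is applied per branch, with the cross-modal alignment term added on top to couple the two denoising trajectories.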
3. Data Pipeline, Tokenization, and Training Regime
Seedance 1.5 Pro is trained on a multi-stage pipeline designed to maximize both data diversity and alignment:
- Stage 1: Curation of ~100 million minutes of in-the-wild audio-video clips.
- Stage 2: Automated filtering for audiovisual sync and segment quality.
- Stage 3: Generation of detailed captions—comprehensively describing both visual and audio content—using an advanced captioning system.
- Stage 4: Curriculum learning schedule, starting from short, simple sequences (16s), progressing to more complex, longer clips.
Tokenization and preprocessing steps:
| Modality | Tokenization Method | Preprocessing |
|---|---|---|
| Video | 16×16 frame patch embeddings | Linear projection |
| Audio | 80-band mel-spectrogram tokens | Fixed-length segmentation |
| Text | Byte-Pair Encoding (BPE) | Subword tokenization |
This pipeline ensures high coverage of real-world data and robust prompt-conditioned generative capabilities.
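The tokenization rows in the table above can be sketched as plain array reshapes. The patch size (16), mel band count (80), and segment length are taken from or assumed consistent with the table; the helper names are hypothetical.

```python
import numpy as np

def patchify(frame, patch=16):
    # Split an HxWxC frame into non-overlapping 16x16 patch tokens,
    # flattened so a linear projection can map them to model width.
    h, w, c = frame.shape
    rows, cols = h // patch, w // patch
    return (frame[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, c)
            .transpose(0, 2, 1, 3, 4)
            .reshape(rows * cols, patch * patch * c))

def segment_mel(mel, seg_len=8):
    # Group 80-band mel frames into fixed-length token segments
    # (seg_len is an assumed value for illustration).
    n = (mel.shape[0] // seg_len) * seg_len
    return mel[:n].reshape(-1, seg_len * mel.shape[1])

frame = np.zeros((256, 256, 3))   # one RGB video frame
mel = np.zeros((100, 80))         # 100 frames of 80-band mel features
vt = patchify(frame)              # 16x16 grid -> 256 tokens of dim 768
at = segment_mel(mel)             # 12 segments of dim 640
```

Text tokenization (BPE) is standard subword segmentation and is not reproduced here.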
4. Post-Training Optimization: Supervised Fine-Tuning and RLHF
After initial pretraining, Seedance 1.5 Pro undergoes post-training optimization via supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).
- SFT: Performed on proprietary, professional-grade audio-visual and cinematic datasets, driving alignment on prompt-to-output correspondence and overall quality.
- Training continues from pretrained weights, introducing text-prompt reconstruction loss on both modalities.
- Paper reports: batch size 128 over 100 K steps, yielding a +0.4 Likert increase in prompt adherence and +0.3 in motion vividness (SeedVideoBench 1.5).
- RLHF: Employs a small multi-layer Transformer reward model, evaluating:
- Motion quality
- Audio fidelity
- AV synchronization
- Aesthetic appeal
Policy optimization uses Proximal Policy Optimization (PPO) with a clip ratio of $\epsilon = 0.2$:
$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
Targeted infrastructure optimizations yield a 3× reduction in RLHF wallclock time versus earlier pipelines.
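The clipped PPO surrogate used for policy optimization can be sketched as follows. This is the standard clipped objective (written as a loss to minimize), not the paper's actual training code; the toy batch is illustrative.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_ratio=0.2):
    # Clipped surrogate objective, negated so it is minimized.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    return float(-np.mean(np.minimum(unclipped, clipped)))

# Toy batch: the clip prevents a large policy jump (first sample)
# from dominating the reward-weighted update.
logp_old = np.array([-1.0, -1.0, -1.0])
logp_new = np.array([-0.2, -1.0, -2.0])
adv = np.array([1.0, 1.0, 1.0])
loss = ppo_clip_loss(logp_new, logp_old, adv)
```

With positive advantages, any ratio above $1+\epsilon$ is truncated, so the reward model's scores (motion quality, audio fidelity, AV sync, aesthetics) can only pull the policy a bounded distance per update.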
5. Inference Acceleration and Production Readiness
Seedance 1.5 Pro incorporates a dedicated inference acceleration framework:
Multi-Stage Distillation (Hyper-SD-style): A teacher model generates with a full 1,000-step DDPM sampler; a student model is distilled to 30–50 steps with consistency constraints.
Additional optimizations:
- Mixed-precision (FP16) inference
- INT8 quantization for key matrix multiplications
- Pipeline parallelism across GPUs and TPUs
Results: Over 10× reduction in the Number of Function Evaluations (NFE), with minimal loss in human-rated quality metrics. Example latency is 1.2 s for a 512×512, 32-frame video on a single A100 (30-step student), compared to 12 s for the teacher model.
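The step-count arithmetic behind the NFE reduction can be made concrete with a strided timestep schedule. This is a generic DDIM-style subsampling sketch, not the paper's distillation recipe; the helper name is hypothetical.

```python
def student_timesteps(teacher_steps=1000, student_steps=30):
    # Subsample the teacher's 1,000-step schedule down to a 30-step
    # student schedule by striding from the noisiest timestep toward 0.
    stride = teacher_steps // student_steps
    return list(range(teacher_steps - 1, -1, -stride))[:student_steps]

ts = student_timesteps()
# 1000 -> 30 steps is roughly a 33x reduction in function evaluations,
# consistent with the reported >10x NFE reduction.
```

In practice the distilled student is trained to match the teacher's trajectory at these sparse timesteps under consistency constraints, rather than simply skipping steps at sampling time.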
6. Quantitative Performance, Capabilities, and Deployment
Performance:
| Metric | Seedance 1.5 Pro | Veo 3.1 | Kling 2.6 |
|---|---|---|---|
| AV sync: speech↔lip (±40 ms) | 95% | 88% | 90% |
| Text-to-Video (Likert 1–5) | 4.2 | 3.7 | 3.9 |
| Image-to-Video (Likert 1–5) | 4.1 | 3.6 | – |
| Inference Latency (512×512×32) | 1.2s | – | – |
Notable Capabilities:
- Native lip-sync for Mandarin, Cantonese, Sichuanese, Shanghainese, Taiwanese, with tonal accuracy.
- Cinematic camera control: automated dolly zooms, tracking and orbital shots, continuous takes; color grading during generation.
- Narrative coherence: strong long-range scene understanding and multi-shot sequencing.
Deployment:
- Model size: ~4.5 billion parameters.
- Recommended hardware: 8× A100 40 GB or 4× H100 80 GB for real-time operation.
- Accessible via Volcano Engine API, with native endpoints for text-to-video-audio (T2VA) and image-to-video-audio (I2VA) generation modes.
Sample Code (Python):
```python
from volcengine import Client

cli = Client(key=…, secret=…)
resp = cli.generate(
    model_id="Doubao-Seedance-1.5-pro",
    prompt="A rainy street scene in Paris, with soft piano music",
    mode="T2VA",
    length_s=8,
)
video, audio = resp["video"], resp["audio"]
```
Seedance 1.5 Pro represents a natively joint audio-video generation engine, delivering production-grade synchronous outputs with scalable, high-throughput deployment (Chen et al., 15 Dec 2025).