
Seedance 1.5 Pro: Joint AV Generation

Updated 17 December 2025
  • Seedance 1.5 Pro is a joint audio-video generation model that employs a dual-branch Diffusion Transformer and cross-modal module for precise, synchronized outputs.
  • It utilizes a multi-stage training pipeline with massive audiovisual data, supervised fine-tuning, and reinforcement learning to enhance narrative coherence and lip-sync accuracy.
  • The model achieves state-of-the-art performance with reduced inference latency and is deployed as a production-grade service via the Volcano Engine API for real-time applications.

Seedance 1.5 Pro is a foundational model for native, joint audio-video generation, built on a dual-branch Diffusion Transformer architecture with a cross-modal joint module. It is designed to deliver high-quality, temporally synchronized audiovisual sequences, establishing state-of-the-art results on audio-visual generation benchmarks. Key features include multilingual and dialect-specific lip-sync, dynamic cinematic camera control, and enhanced narrative coherence. Seedance 1.5 Pro is accessible through a production-ready API on Volcano Engine (Chen et al., 15 Dec 2025).

1. Model Architecture: Dual-Branch Diffusion Transformer and Cross-Modal Joint Module

Seedance 1.5 Pro features a dual-branch Diffusion Transformer, consisting of parallel video and audio branches integrated by a cross-modal joint module.

  • Video Branch: Accepts sequences of visual tokens (e.g., patch embeddings per frame), processed through a stack of Transformer blocks adapted for the diffusion reverse process. At diffusion timestep t, the branch predicts denoised embeddings x_{v_t}.
  • Audio Branch: Consumes mel-spectrogram tokens (80-dimensional features per audio frame), with a similarly parameterized but independently weighted Transformer stack; outputs denoised audio embedding a_t at each t.
  • Cross-Modal Joint Module: Interleaved at designated layers, implementing bidirectional cross-attention. Video queries attend to audio keys/values and vice versa, with modality-type embeddings (e^{vid}, e^{aud}) injected into both token and timestep embeddings to signal origin.

The architectural fusion is structured as:

Component     Input Tokens                 Core Mechanism
------------  ---------------------------  ------------------------------
Video Branch  Visual patches per frame     Diffusion + Transformer layers
Audio Branch  Mel-spectrogram tokens       Diffusion + Transformer layers
Joint Module  Tokens from both modalities  Bidirectional cross-attention

This explicit cross-modal design underpins the model's high-fidelity audio-video synchronization, enabling emergent capabilities such as accurate lip sync and dynamic audio-driven scene motion.
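The bidirectional cross-attention at the heart of the joint module can be sketched in a few lines of NumPy. The token counts, model dimension, residual wiring, and the scale of the modality-type embeddings below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality
    and keys/values from the other."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

rng = np.random.default_rng(0)
d_model = 64
video_tokens = rng.normal(size=(32, d_model))  # e.g. patch embeddings
audio_tokens = rng.normal(size=(20, d_model))  # e.g. mel-frame embeddings

# Modality-type embeddings added to signal token origin (scale assumed)
e_vid = rng.normal(size=(d_model,)) * 0.02
e_aud = rng.normal(size=(d_model,)) * 0.02
v = video_tokens + e_vid
a = audio_tokens + e_aud

# Bidirectional cross-attention: each branch attends to the other,
# here wired as a residual update (an assumption about the fusion)
video_out = v + cross_attention(v, a, a)  # video queries -> audio keys/values
audio_out = a + cross_attention(a, v, v)  # audio queries -> video keys/values
```

Each modality keeps its own token sequence length; only the feature dimension must match across branches for the attention products to be defined.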

2. Diffusion Model Formulation and Loss Objectives

The model leverages a denoising diffusion probabilistic model (DDPM) backbone. The paper invokes the standard DDPM formulation; the equations below follow the forms common in the literature.

  • Forward process: Gaussian noise is added via

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)

  • Reverse denoising: Approximated by the learned model

p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)

  • Primary loss: Mean-squared error (MSE) on the predicted noise

\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0,I),\, t} \left\| \epsilon - \epsilon_\theta\left( \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\, t \right) \right\|_2^2

  • Variational lower bound (optional):

\mathcal{L}_{\text{VLB}} = \sum_{t=1}^T D_{\mathrm{KL}}\big( q(x_{t-1}|x_t,x_0) \,\|\, p_\theta(x_{t-1}|x_t) \big)

  • Cross-modal alignment loss, coupling intermediate features of the two branches:

\mathcal{L}_{\text{align}} = \mathbb{E}_{t} \left\| f_{\text{vid}}(x_t, t) - f_{\text{aud}}(a_t, t) \right\|_2^2

This multimodal formulation is critical to enforce synchronization and mutual information flow between the audio and visual streams (Chen et al., 15 Dec 2025).
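The simple MSE objective above can be checked numerically. The linear beta schedule and toy dimensions below are assumptions for illustration; the paper does not specify its noise schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule (a common choice, assumed here)
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alpha_bar = np.cumprod(1.0 - betas)  # \bar{alpha}_t

def simple_loss(x0, eps_pred_fn, t):
    """Monte-Carlo estimate of L_simple for one sample at timestep t:
    noise x0 forward to x_t, then score the model's noise prediction."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_pred_fn(x_t, t)) ** 2)

# Sanity check: an oracle that inverts the forward process exactly
# recovers eps, so its loss is zero up to floating-point error.
x0 = rng.normal(size=(16,))
oracle = lambda x_t, t: (x_t - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])
print(simple_loss(x0, oracle, t=500))  # ≈ 0 up to floating-point error
```

In the joint model, the same objective is applied per branch (video tokens and audio tokens each get their own noise-prediction loss), with the alignment term added on top.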

3. Data Pipeline, Tokenization, and Training Regime

Seedance 1.5 Pro is trained on a multi-stage pipeline designed to maximize both data diversity and alignment:

  1. Stage 1: Curation of ~100 million minutes of in-the-wild audio-video clips.
  2. Stage 2: Automated filtering for audiovisual sync and segment quality.
  3. Stage 3: Generation of detailed captions—comprehensively describing both visual and audio content—using an advanced captioning system.
  4. Stage 4: Curriculum learning schedule, starting from short, simple sequences (≤16 s), progressing to more complex, longer clips.

Tokenization and preprocessing steps:

Modality  Tokenization Method             Preprocessing
--------  ------------------------------  -------------------------
Video     16×16 frame patch embeddings    Linear projection
Audio     80-band mel-spectrogram tokens  Fixed-length segmentation
Text      Byte-Pair Encoding (BPE)        Subword tokenization

This pipeline ensures high coverage of real-world data and robust prompt-conditioned generative capabilities.
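As an illustration of the video row in the table above, a minimal NumPy patchify that cuts a frame into 16×16 patch tokens ready for a linear projection (the exact preprocessing details are assumptions):

```python
import numpy as np

def patchify(frame, patch=16):
    """Split an H×W×C frame into non-overlapping patch tokens,
    each flattened so a learned linear projection can map it to
    the model dimension."""
    h, w, c = frame.shape
    assert h % patch == 0 and w % patch == 0
    tokens = (frame
              .reshape(h // patch, patch, w // patch, patch, c)
              .transpose(0, 2, 1, 3, 4)   # group by (patch_row, patch_col)
              .reshape(-1, patch * patch * c))
    return tokens

frame = np.zeros((256, 256, 3))
tokens = patchify(frame)
print(tokens.shape)  # (256, 768): a 16×16 grid of patches, each 16*16*3 values
```

The audio branch is analogous: fixed-length slices of an 80-band mel spectrogram play the role of patches.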

4. Post-Training Optimization: Supervised Fine-Tuning and RLHF

After initial pretraining, Seedance 1.5 Pro undergoes post-training optimization via supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

  • SFT: Performed on proprietary, professional-grade audio-visual and cinematic datasets, driving alignment on prompt-to-output correspondence and overall quality.
    • Training continues from pretrained weights, introducing text-prompt reconstruction loss on both modalities.
    • Paper reports: learning rate 1×10⁻⁵, batch size 128, ~100K steps. Results in +0.4 Likert increase in prompt adherence and +0.3 in motion vividness (SeedVideoBench 1.5).
  • RLHF: Employs a small multi-layer Transformer reward model, evaluating:

    1. Motion quality
    2. Audio fidelity
    3. AV synchronization
    4. Aesthetic appeal

Policy optimization uses Proximal Policy Optimization (PPO) with a clip ratio of 0.2:

\mathcal{L}_{\text{PPO}} = -\mathbb{E}\left[ \min\big(r_t(\theta)\,A_t,\; \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big) \right], \quad r_t(\theta)=\frac{\pi_\theta}{\pi_{\theta_{\text{old}}}}

Targeted infrastructure optimizations yield a 3× reduction in RLHF wallclock time versus earlier pipelines.
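The clipped PPO surrogate with ε = 0.2 can be sketched directly from the objective above. This is a minimal NumPy version; real RLHF training adds value baselines, KL penalties, and minibatching:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate (negated for minimization),
    with r_t = pi_theta / pi_theta_old computed in log space."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

# Toy check: with a positive advantage, a large policy shift is clipped,
# so the objective stops improving beyond the 1+eps boundary.
adv = np.array([1.0])
print(ppo_clip_loss(np.log([1.5]), np.log([1.0]), adv))  # ≈ -1.2 (clipped at 1+eps)
print(ppo_clip_loss(np.log([1.1]), np.log([1.0]), adv))  # ≈ -1.1 (unclipped)
```

The clipping is what keeps reward-model optimization from pushing the generator too far from the SFT policy in a single update.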

5. Inference Acceleration and Production Readiness

Seedance 1.5 Pro incorporates a dedicated inference acceleration framework:

  • Multi-Stage Distillation (HyperSD-style): A teacher model generates with a full 1,000-step DDPM sampler; a student model is distilled to 30–50 steps with consistency constraints.

  • Additional optimizations:

    • Mixed-precision (FP16) inference
    • INT8 quantization for key matrix multiplications
    • Pipeline parallelism across GPUs and TPUs

Results: over 10× reduction in the number of function evaluations (NFE), with <1% loss in human-rated quality metrics. Example latency is ~1.2 s for a 512×512, 32-frame video on a single A100 (30-step student), compared to ~12 s for the teacher model.
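As a sketch of the INT8 quantization listed above, here is a symmetric per-tensor quantized matmul in NumPy. Real deployments typically use per-channel scales and calibrated activation ranges; this is only an illustration of the idea:

```python
import numpy as np

def int8_matmul(a, b):
    """Symmetric per-tensor INT8 quantization of both operands,
    int32 accumulation, then dequantization of the product."""
    def quant(x):
        scale = np.abs(x).max() / 127.0
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale
    qa, sa = quant(a)
    qb, sb = quant(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)  # exact integer accumulate
    return acc * (sa * sb)

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))
b = rng.normal(size=(16, 4))
exact = a @ b
approx = int8_matmul(a, b)
rel_err = np.abs(approx - exact).max() / np.abs(exact).max()
print(rel_err)  # small relative error (a few percent at most)
```

The appeal in production is that the inner matmul runs on integer tensor cores at far higher throughput than FP32, at the cost of this bounded quantization error.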

6. Quantitative Performance, Capabilities, and Deployment

Performance:

Metric                          Seedance 1.5 Pro  Veo 3.1  Kling 2.6
------------------------------  ----------------  -------  ---------
AV sync: speech↔lip (±40 ms)    95%               88%      90%
Text-to-Video (Likert 1–5)      4.2               3.7      3.9
Image-to-Video (Likert 1–5)     4.1               3.6      —
Inference Latency (512×512×32)  1.2 s             —        —

Notable Capabilities:

  • Native lip-sync for Mandarin, Cantonese, Sichuanese, Shanghainese, Taiwanese, with tonal accuracy.
  • Cinematic camera control: automated dolly zooms, tracking and orbital shots, continuous takes; color grading during generation.
  • Narrative coherence: strong long-range scene understanding and multi-shot sequencing.

Deployment:

  • Model size: ~4.5 billion parameters.
  • Recommended hardware: 8×A100 40GB or 4×H100 80GB for real-time operation.
  • Accessible via Volcano Engine API, with native endpoints for text-to-video-audio (T2VA) and image-to-video-audio (I2VA) generation modes.

Sample Code (Python):

from volcengine import Client

# Replace the placeholders with your Volcano Engine credentials
cli = Client(key="<ACCESS_KEY>", secret="<SECRET_KEY>")
resp = cli.generate(
    model_id="Doubao-Seedance-1.5-pro",
    prompt="A rainy street scene in Paris, with soft piano music",
    mode="T2VA",   # text-to-video-audio; use "I2VA" for image-conditioned generation
    length_s=8,    # clip length in seconds
)
video, audio = resp["video"], resp["audio"]

Seedance 1.5 Pro represents a natively joint audio-video generation engine, delivering production-grade synchronous outputs with scalable, high-throughput deployment (Chen et al., 15 Dec 2025).
