- The paper introduces Seedance 1.0, a foundational model that balances prompt adherence, motion coherence, and visual quality in video generation.
- Its architecture is a Diffusion Transformer with decoupled spatial and temporal layers and interleaved multimodal positional encoding for efficient processing.
- Post-training with SFT and RLHF, combined with multi-stage inference acceleration, enables rapid 1080p video generation with strong benchmark performance.
Here is a detailed summary of the paper "Seedance 1.0: Exploring the Boundaries of Video Generation Models" (2506.09113).
The paper introduces Seedance 1.0, a high-performance and inference-efficient foundational video generation model developed by ByteDance Seed. It aims to address key challenges in current video generation models, specifically balancing prompt following, motion plausibility, and visual quality simultaneously. Seedance 1.0 supports native bilingual (Chinese/English) generation and unifies text-to-video (T2V) and image-to-video (I2V) tasks within a single model.
The technical foundation of Seedance 1.0 is built upon four core pillars:
- Multi-Source Data Curation with Comprehensive Video Captioning: The model is trained on a large-scale, high-quality video dataset curated from diverse sources and covering a wide range of categories, styles, and scenarios. A multi-stage data processing pipeline is used, including diversity-oriented sourcing, shot-aware temporal segmentation, visual overlay rectification, quality and safety filtering, semantic deduplication, and distribution rebalancing (a toy sketch of such a filtering chain appears after this list). A precise video captioning system, trained on manually annotated data on top of a video understanding model such as Tarsier2 (Yuan et al., 14 Jan 2025), provides dense captions describing both dynamic (actions, camera) and static (characters, scenes) features. A separate Prompt Engineering (PE) module, based on a fine-tuned LLM (Qwen2.5-14B (Qwen et al., 19 Dec 2024)) trained via SFT and RL (DPO (2312.6114)), translates user prompts into this dense caption format for the Diffusion Transformer (DiT).
- Efficient Architecture Design: The model is built on a Diffusion Transformer (DiT). To handle spatial and temporal modeling efficiently and support multi-task learning, it employs decoupled spatial and temporal layers: spatial layers perform intra-frame attention, while temporal layers use carefully designed window attention for efficient inter-frame computation. An interleaved multimodal positional encoding (3D RoPE (Challagundla et al., 8 Apr 2024, Guo et al., 13 Mar 2025) for visual tokens, 1D RoPE for textual tokens) integrates visual and textual information. Following the MMDiT design (similar to Stable Diffusion 3 (Wang et al., 18 Sep 2024)), spatial layers use multi-modality self-attention with separate weights for visual and textual tokens, while temporal layers use visual-only self-attention (a minimal sketch of such a block appears after this list). The architecture natively supports multi-shot generation by organizing shots temporally with individual captions. A unified task formulation, similar to ControlNet (Yamaguchi et al., 2023), allows joint training for T2I, T2V, and I2V by conditioning on noisy inputs concatenated with clean or zero-padded reference frames and binary masks (see the conditioning sketch after this list).
- Enhanced Post-Training Optimization: Seedance 1.0 undergoes comprehensive post-training to align with human preferences. After pre-training and a Continue Training (CT) phase that enhances I2V performance and multitask capability with high-quality data and specialized captions, supervised fine-tuning (SFT) is performed on a curated set of high-quality video-text pairs with manually verified captions, improving visual aesthetics and motion coherence; separate models are trained on data subsets and then merged. Finally, video-tailored Reinforcement Learning from Human Feedback (RLHF) is applied, using human preference data collected through a multi-dimensional annotation scheme. A reward system with three specialized reward models is used: the Foundational RM (Vision-LLM based) assesses image-text alignment and structural stability, the Motion RM targets motion quality and artifact mitigation, and the Aesthetic RM (image-space, similar to Seedream (Gong et al., 10 Mar 2025), applied to keyframes) evaluates visual appeal. The base model is trained to directly maximize the composite reward (see the composite-reward sketch after this list), which the authors claim is more efficient than approaches such as DPO/PPO/GRPO (Xue et al., 12 May 2025, Zhang et al., 19 Dec 2024, Liu et al., 8 May 2025, Liu et al., 23 Jan 2025). RLHF is also applied to the Diffusion Refiner, which upscales low-resolution (480p) base-model outputs to high resolution (720p/1080p) by conditioning on the low-resolution video and maximizing rewards from the same RMs.
- Inference Acceleration: Significant effort went into ultra-fast generation. DiT inference is accelerated with multi-stage diffusion distillation, including Trajectory Segmented Consistency Distillation (TSCD) (Shao et al., 10 Mar 2025) for step reduction (4x speedup), Score Distillation (Shao et al., 10 Mar 2025) for improved stability at low NFEs, and an extended Adversarial Post-Training (APT) (Lin et al., 14 Jan 2025) approach that uses human preference data to mitigate acceleration artifacts and improve visual quality. The Variational Autoencoder (VAE) decoder is slimmed into a "Thin VAE" by narrowing channel widths in later layers and retraining it against a fixed encoder, achieving a 2x speedup without quality loss (see the Thin VAE sketch after this list). The inference infrastructure adds high-performance kernel fusion (e.g., for attention and GEMM) for throughput gains; fine-grained mixed-precision quantization and adaptive sparsity (extending AdaSpa (Xia et al., 28 Feb 2025)) tailored to DiT operations; an adaptive hybrid parallel strategy for memory efficiency with long sequences (context parallelism with FP8 communication, reducing overhead compared to Ulysses (Jacobs et al., 2023)); an automated Async Offloading strategy for deployment on memory-limited devices; hybrid parallelism for the VAE decoder to reduce memory consumption; and pipeline optimizations including continuous batching and prefix caching. Together, these optimizations allow Seedance 1.0 to generate a 5-second 1080p video in 41.4 seconds on an NVIDIA L20 GPU.
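The sketches below are illustrative only: they use PyTorch/NumPy, and every module name, tensor layout, threshold, and helper function is an assumption made for clarity rather than something specified in the paper.

First, a toy version of the curation pipeline's filtering chain (quality/safety gating followed by semantic deduplication); `quality_score`, `is_safe`, and `embed` stand in for the paper's unspecified scoring and embedding models, and the thresholds are placeholders:

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def curate(clips, quality_score, is_safe, embed, dedup_threshold=0.95):
    """Filter shot-segmented clips: safety/quality gates, then semantic dedup."""
    kept, seen = [], []
    for clip in clips:
        if not is_safe(clip) or quality_score(clip) < 0.5:
            continue                      # quality and safety filtering
        e = embed(clip)                   # semantic embedding of the clip
        if any(cosine(e, p) > dedup_threshold for p in seen):
            continue                      # semantic deduplication
        seen.append(e)
        kept.append(clip)
    return kept                           # distribution rebalancing would follow
```

Next, a minimal sketch of one decoupled spatial-temporal DiT block: the spatial sub-layer runs joint visual-text attention within each frame with per-modality projections (MMDiT-style), and the temporal sub-layer runs visual-only attention along the time axis as a stand-in for the paper's windowed temporal attention. RoPE, timestep modulation, MLPs, and the text-stream update are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialMMAttention(nn.Module):
    """Intra-frame attention: visual and text tokens attend jointly but use
    separate projection weights per modality (MMDiT-style)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.qkv_vis, self.qkv_txt = nn.Linear(dim, 3 * dim), nn.Linear(dim, 3 * dim)
        self.out_vis, self.out_txt = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, vis, txt):
        # vis: [B*T, N, C] tokens of one frame; txt: [B*T, L, C] caption tokens
        B, N, C = vis.shape
        L = txt.shape[1]
        q, k, v = torch.cat([self.qkv_vis(vis), self.qkv_txt(txt)], dim=1).chunk(3, dim=-1)
        q, k, v = (x.reshape(B, N + L, self.heads, -1).transpose(1, 2) for x in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)          # joint attention
        out = out.transpose(1, 2).reshape(B, N + L, C)
        return self.out_vis(out[:, :N]), self.out_txt(out[:, N:])

class TemporalAttention(nn.Module):
    """Visual-only attention over time at each spatial position."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis):
        # vis: [B, T, N, C] -> attend over T independently at each of N positions
        B, T, N, C = vis.shape
        x = vis.permute(0, 2, 1, 3).reshape(B * N, T, C)
        x, _ = self.attn(x, x, x)
        return x.reshape(B, N, T, C).permute(0, 2, 1, 3)

class DecoupledSTBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = SpatialMMAttention(dim, heads)
        self.temporal = TemporalAttention(dim, heads)
        self.norm_s, self.norm_t = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # vis: [B, T, N, C] visual tokens (N = H*W per frame); txt: [B, L, C]
        B, T, N, C = vis.shape
        txt_rep = txt.unsqueeze(1).expand(-1, T, -1, -1).reshape(B * T, -1, C)
        v, _ = self.spatial(self.norm_s(vis).reshape(B * T, N, C), txt_rep)
        vis = vis + v.reshape(B, T, N, C)            # spatial (intra-frame) residual
        vis = vis + self.temporal(self.norm_t(vis))  # temporal residual
        return vis, txt                              # text-stream update omitted
```

The unified T2V/I2V conditioning can be pictured as a small helper that channel-concatenates the noisy latent with a (possibly zero-padded) reference frame and a binary mask; the latent layout here is assumed:

```python
import torch

def build_model_input(noisy_latent, ref_latent=None):
    """noisy_latent: [B, T, C, h, w]; ref_latent: [B, 1, C, h, w] clean
    first-frame latent for I2V, or None for T2V (zero-padded condition)."""
    cond = torch.zeros_like(noisy_latent)
    mask = torch.zeros_like(noisy_latent[:, :, :1])       # [B, T, 1, h, w]
    if ref_latent is not None:           # I2V: condition on the clean first frame
        cond[:, :1] = ref_latent
        mask[:, :1] = 1.0
    return torch.cat([noisy_latent, cond, mask], dim=2)   # [B, T, 2C+1, h, w]
```

For the RLHF stage, a hedged sketch of the composite-reward objective: the three reward models are combined by a weighted sum that the generator is trained to maximize. The equal weights, the reward-model call signatures, and the assumption of a differentiable rollout are all illustrative:

```python
import torch

def composite_reward(videos, prompts, rm_foundational, rm_motion, rm_aesthetic,
                     weights=(1.0, 1.0, 1.0)):
    r_align = rm_foundational(videos, prompts)   # text-video alignment, structure
    r_motion = rm_motion(videos)                 # motion quality / artifact penalty
    r_aes = rm_aesthetic(videos[:, 0])           # aesthetic score on a keyframe
    w1, w2, w3 = weights
    return w1 * r_align + w2 * r_motion + w3 * r_aes

def reward_maximization_step(generator, optimizer, prompts, reward_models):
    videos = generator(prompts)                  # assumed differentiable rollout
    loss = -composite_reward(videos, prompts, *reward_models).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return -loss.item()                          # mean composite reward achieved
```

Finally, the Thin VAE idea reduces decoder cost while keeping the latent space fixed; below is a sketch of retraining a channel-narrowed decoder against the frozen encoder with a plain reconstruction loss (the paper's actual loss and schedule are not specified here):

```python
import torch
import torch.nn.functional as F

def train_thin_decoder(frozen_encoder, thin_decoder, dataloader, lr=1e-4):
    """Retrain a channel-narrowed decoder against the frozen encoder's latents."""
    frozen_encoder.eval().requires_grad_(False)
    opt = torch.optim.AdamW(thin_decoder.parameters(), lr=lr)
    for video in dataloader:                     # [B, T, 3, H, W] pixel clips
        with torch.no_grad():
            latents = frozen_encoder(video)      # latent space stays fixed
        recon = thin_decoder(latents)            # narrowed decoder, same output shape
        loss = F.l1_loss(recon, video)           # placeholder reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return thin_decoder
```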
Seedance 1.0 demonstrates strong performance, topping the Artificial Analysis leaderboards for both text-to-video and image-to-video tasks. Internal evaluations using the SeedVideoBench-1.0 benchmark and expert human evaluation (Absolute Score, GSB metric) show its superiority in prompt following, motion quality, visual fidelity, and I2V preservation compared to state-of-the-art models like Kling 2.1, Veo 3, Wan 2.1, and Sora. The model particularly excels in precise instruction adherence in complex scenarios and exhibits robust capabilities in multi-shot narrative generation with subject consistency and stylistic coherence across shots, as well as multi-style alignment (generating diverse cinematic and artistic styles). Seedance 1.0 is planned for integration into ByteDance platforms like Doubao and Jimeng.