- The paper introduces Seedance 1.0, a foundational model that balances prompt adherence, motion coherence, and visual quality in video generation.
- Its architecture is a Diffusion Transformer with decoupled spatial and temporal layers and interleaved multimodal positional encoding for efficient processing.
- Post-training with SFT and RLHF, combined with multi-stage inference acceleration, enables rapid 1080p video generation with strong benchmark performance.
Here is a detailed summary of the paper "Seedance 1.0: Exploring the Boundaries of Video Generation Models" (2506.09113).
The paper introduces Seedance 1.0, a high-performance and inference-efficient foundational video generation model developed by ByteDance Seed. It aims to address key challenges in current video generation models, specifically balancing prompt following, motion plausibility, and visual quality simultaneously. Seedance 1.0 supports native bilingual (Chinese/English) generation and unifies text-to-video (T2V) and image-to-video (I2V) tasks within a single model.
The technical foundation of Seedance 1.0 is built upon four core pillars:
- Multi-Source Data Curation with Comprehensive Video Captioning: The model is trained on a large-scale, high-quality video dataset curated from diverse sources and covering a wide range of categories, styles, and scenarios. A multi-stage data processing pipeline is used, including diversity-oriented sourcing, shot-aware temporal segmentation, visual overlay rectification, quality and safety filtering, semantic deduplication, and distribution rebalancing (a toy sketch of such a filtering chain appears after this list). A precise video captioning system, trained on manually annotated data on top of a video understanding model such as Tarsier2 (Yuan et al., 14 Jan 2025), provides dense captions describing both dynamic (actions, camera) and static (characters, scenes) features. A separate Prompt Engineering (PE) module, based on a fine-tuned LLM (Qwen2.5-14B (Qwen et al., 19 Dec 2024)) trained via SFT and RL (DPO (2312.6114)), translates user prompts into this dense caption format for the Diffusion Transformer (DiT).
- Efficient Architecture Design: The model is built on a Diffusion Transformer (DiT). To handle spatial and temporal modeling efficiently and support multi-task learning, it employs decoupled spatial and temporal layers: spatial layers perform intra-frame attention, while temporal layers use carefully designed window attention for efficient inter-frame computation. An interleaved multimodal positional encoding (3D RoPE (Challagundla et al., 8 Apr 2024, Guo et al., 13 Mar 2025) for visual tokens, 1D RoPE for textual tokens) integrates visual and textual information. Following the MMDiT design (similar to Stable Diffusion 3 (Wang et al., 18 Sep 2024)), spatial layers use multi-modality self-attention with separate weights for visual and textual tokens, while temporal layers use visual-only self-attention (a minimal sketch of such a block appears after this list). The architecture natively supports multi-shot generation by organizing shots temporally with individual captions. A unified task formulation, similar to ControlNet (Yamaguchi et al., 2023), allows joint training for T2I, T2V, and I2V by conditioning on noisy inputs concatenated with clean or zero-padded reference frames and binary masks (see the conditioning sketch after this list).
- Enhanced Post-Training Optimization: Seedance 1.0 undergoes comprehensive post-training to align with human preferences. After pre-training and a Continue Training (CT) phase that enhances I2V performance and multitask capability with high-quality data and specialized captions, supervised fine-tuning (SFT) is performed on a curated set of high-quality video-text pairs with manually verified captions, improving visual aesthetics and motion coherence; separate models are trained on data subsets and then merged. Finally, video-tailored Reinforcement Learning from Human Feedback (RLHF) is applied, using human preference data collected through a multi-dimensional annotation scheme. A reward system with three specialized reward models is used: the Foundational RM (Vision-LLM based) assesses image-text alignment and structural stability, the Motion RM targets motion quality and artifact mitigation, and the Aesthetic RM (image-space, similar to Seedream (Gong et al., 10 Mar 2025), applied to keyframes) evaluates visual appeal. The base model is trained to directly maximize the composite reward (see the composite-reward sketch after this list), which the authors claim is more efficient than approaches such as DPO/PPO/GRPO (Xue et al., 12 May 2025, Zhang et al., 19 Dec 2024, Liu et al., 8 May 2025, Liu et al., 23 Jan 2025). RLHF is also applied to the Diffusion Refiner, which upscales low-resolution (480p) base-model outputs to high resolution (720p/1080p) by conditioning on the low-resolution video and maximizing rewards from the same RMs.
- Inference Acceleration: Significant effort went into ultra-fast generation. DiT inference is accelerated with multi-stage diffusion distillation, including Trajectory Segmented Consistency Distillation (TSCD) (Shao et al., 10 Mar 2025) for step reduction (4x speedup), Score Distillation (Shao et al., 10 Mar 2025) for improved stability at low NFEs, and an extended Adversarial Post-Training (APT) (Lin et al., 14 Jan 2025) approach that uses human preference data to mitigate acceleration artifacts and improve visual quality. The Variational Autoencoder (VAE) decoder is slimmed into a "Thin VAE" by narrowing channel widths in later layers and retraining it against a fixed encoder, achieving a 2x speedup without quality loss (see the Thin VAE sketch after this list). The inference infrastructure adds high-performance kernel fusion (e.g., for attention and GEMM) for throughput gains; fine-grained mixed-precision quantization and adaptive sparsity (extending AdaSpa (Xia et al., 28 Feb 2025)) tailored to DiT operations; an adaptive hybrid parallel strategy for memory efficiency with long sequences (context parallelism with FP8 communication, reducing overhead compared to Ulysses (Jacobs et al., 2023)); an automated Async Offloading strategy for deployment on memory-limited devices; hybrid parallelism for the VAE decoder to reduce memory consumption; and pipeline optimizations including continuous batching and prefix caching. Together, these optimizations allow Seedance 1.0 to generate a 5-second 1080p video in 41.4 seconds on an NVIDIA L20 GPU.
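The sketches below are illustrative only: they use PyTorch/NumPy, and every module name, tensor layout, threshold, and helper function is an assumption made for clarity rather than something specified in the paper.

First, a toy version of the curation pipeline's filtering chain (quality/safety gating followed by semantic deduplication); `quality_score`, `is_safe`, and `embed` stand in for the paper's unspecified scoring and embedding models, and the thresholds are placeholders:

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def curate(clips, quality_score, is_safe, embed, dedup_threshold=0.95):
    """Filter shot-segmented clips: safety/quality gates, then semantic dedup."""
    kept, seen = [], []
    for clip in clips:
        if not is_safe(clip) or quality_score(clip) < 0.5:
            continue                      # quality and safety filtering
        e = embed(clip)                   # semantic embedding of the clip
        if any(cosine(e, p) > dedup_threshold for p in seen):
            continue                      # semantic deduplication
        seen.append(e)
        kept.append(clip)
    return kept                           # distribution rebalancing would follow
```

Next, a minimal sketch of one decoupled spatial-temporal DiT block: the spatial sub-layer runs joint visual-text attention within each frame with per-modality projections (MMDiT-style), and the temporal sub-layer runs visual-only attention along the time axis as a stand-in for the paper's windowed temporal attention. RoPE, timestep modulation, MLPs, and the text-stream update are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialMMAttention(nn.Module):
    """Intra-frame attention: visual and text tokens attend jointly but use
    separate projection weights per modality (MMDiT-style)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.qkv_vis, self.qkv_txt = nn.Linear(dim, 3 * dim), nn.Linear(dim, 3 * dim)
        self.out_vis, self.out_txt = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, vis, txt):
        # vis: [B*T, N, C] tokens of one frame; txt: [B*T, L, C] caption tokens
        B, N, C = vis.shape
        L = txt.shape[1]
        q, k, v = torch.cat([self.qkv_vis(vis), self.qkv_txt(txt)], dim=1).chunk(3, dim=-1)
        q, k, v = (x.reshape(B, N + L, self.heads, -1).transpose(1, 2) for x in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)          # joint attention
        out = out.transpose(1, 2).reshape(B, N + L, C)
        return self.out_vis(out[:, :N]), self.out_txt(out[:, N:])

class TemporalAttention(nn.Module):
    """Visual-only attention over time at each spatial position."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis):
        # vis: [B, T, N, C] -> attend over T independently at each of N positions
        B, T, N, C = vis.shape
        x = vis.permute(0, 2, 1, 3).reshape(B * N, T, C)
        x, _ = self.attn(x, x, x)
        return x.reshape(B, N, T, C).permute(0, 2, 1, 3)

class DecoupledSTBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = SpatialMMAttention(dim, heads)
        self.temporal = TemporalAttention(dim, heads)
        self.norm_s, self.norm_t = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # vis: [B, T, N, C] visual tokens (N = H*W per frame); txt: [B, L, C]
        B, T, N, C = vis.shape
        txt_rep = txt.unsqueeze(1).expand(-1, T, -1, -1).reshape(B * T, -1, C)
        v, _ = self.spatial(self.norm_s(vis).reshape(B * T, N, C), txt_rep)
        vis = vis + v.reshape(B, T, N, C)            # spatial (intra-frame) residual
        vis = vis + self.temporal(self.norm_t(vis))  # temporal residual
        return vis, txt                              # text-stream update omitted
```

The unified T2V/I2V conditioning can be pictured as a small helper that channel-concatenates the noisy latent with a (possibly zero-padded) reference frame and a binary mask; the latent layout here is assumed:

```python
import torch

def build_model_input(noisy_latent, ref_latent=None):
    """noisy_latent: [B, T, C, h, w]; ref_latent: [B, 1, C, h, w] clean
    first-frame latent for I2V, or None for T2V (zero-padded condition)."""
    cond = torch.zeros_like(noisy_latent)
    mask = torch.zeros_like(noisy_latent[:, :, :1])       # [B, T, 1, h, w]
    if ref_latent is not None:           # I2V: condition on the clean first frame
        cond[:, :1] = ref_latent
        mask[:, :1] = 1.0
    return torch.cat([noisy_latent, cond, mask], dim=2)   # [B, T, 2C+1, h, w]
```

For the RLHF stage, a hedged sketch of the composite-reward objective: the three reward models are combined by a weighted sum that the generator is trained to maximize. The equal weights, the reward-model call signatures, and the assumption of a differentiable rollout are all illustrative:

```python
import torch

def composite_reward(videos, prompts, rm_foundational, rm_motion, rm_aesthetic,
                     weights=(1.0, 1.0, 1.0)):
    r_align = rm_foundational(videos, prompts)   # text-video alignment, structure
    r_motion = rm_motion(videos)                 # motion quality / artifact penalty
    r_aes = rm_aesthetic(videos[:, 0])           # aesthetic score on a keyframe
    w1, w2, w3 = weights
    return w1 * r_align + w2 * r_motion + w3 * r_aes

def reward_maximization_step(generator, optimizer, prompts, reward_models):
    videos = generator(prompts)                  # assumed differentiable rollout
    loss = -composite_reward(videos, prompts, *reward_models).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return -loss.item()                          # mean composite reward achieved
```

Finally, the Thin VAE idea reduces decoder cost while keeping the latent space fixed; below is a sketch of retraining a channel-narrowed decoder against the frozen encoder with a plain reconstruction loss (the paper's actual loss and schedule are not specified here):

```python
import torch
import torch.nn.functional as F

def train_thin_decoder(frozen_encoder, thin_decoder, dataloader, lr=1e-4):
    """Retrain a channel-narrowed decoder against the frozen encoder's latents."""
    frozen_encoder.eval().requires_grad_(False)
    opt = torch.optim.AdamW(thin_decoder.parameters(), lr=lr)
    for video in dataloader:                     # [B, T, 3, H, W] pixel clips
        with torch.no_grad():
            latents = frozen_encoder(video)      # latent space stays fixed
        recon = thin_decoder(latents)            # narrowed decoder, same output shape
        loss = F.l1_loss(recon, video)           # placeholder reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return thin_decoder
```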
Seedance 1.0 demonstrates strong performance, topping the Artificial Analysis leaderboards for both text-to-video and image-to-video tasks. Internal evaluations using the SeedVideoBench-1.0 benchmark and expert human evaluation (Absolute Score, GSB metric) show its superiority in prompt following, motion quality, visual fidelity, and I2V preservation compared to state-of-the-art models like Kling 2.1, Veo 3, Wan 2.1, and Sora. The model particularly excels in precise instruction adherence in complex scenarios and exhibits robust capabilities in multi-shot narrative generation with subject consistency and stylistic coherence across shots, as well as multi-style alignment (generating diverse cinematic and artistic styles). Seedance 1.0 is planned for integration into ByteDance platforms like Doubao and Jimeng.