- The paper introduces Step-Video-T2V, a 30B parameter text-to-video model that employs a deep compression VAE and 3D full attention to generate videos up to 204 frames from bilingual prompts.
- It utilizes a video-based Direct Preference Optimization approach and a distillation strategy to reduce artifacts and inference steps while maintaining high visual quality.
- Evaluations on the Step-Video-T2V-Eval benchmark show state-of-the-art performance among open-source models, particularly in motion dynamics, strong results relative to commercial engines, and state-of-the-art video reconstruction quality for its Video-VAE.
The paper introduces Step-Video-T2V, a 30B-parameter text-to-video foundation model capable of generating videos of up to 204 frames from both Chinese and English prompts. The model employs a deep compression Video Variational Autoencoder (Video-VAE) that achieves 16×16 spatial and 8× temporal compression ratios. The denoising backbone is a DiT (Diffusion Transformer) with 3D full attention, trained with Flow Matching to convert input noise into latent frames. The model further applies a video-based Direct Preference Optimization (DPO) approach to reduce artifacts and improve visual quality. The paper also introduces Step-Video-T2V-Eval, a benchmark for evaluating text-to-video generation quality, and discusses the limitations of current diffusion-based models, suggesting future research directions for video foundation models.
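To make the compression ratios concrete, here is a minimal sketch of the latent-grid arithmetic they imply. The 544×992 example resolution and the 16-channel latent dimension are illustrative assumptions, not figures stated in this summary; only the 204-frame length and the 16×16 / 8× ratios come from the paper.

```python
# Minimal sketch: latent-grid arithmetic implied by the reported compression
# ratios (16x16 spatial, 8x temporal). The example clip size and the latent
# channel count are illustrative assumptions, not values from the paper.
import math

def latent_shape(frames: int, height: int, width: int,
                 t_ratio: int = 8, s_ratio: int = 16, latent_channels: int = 16):
    """Return (T', C, H', W') of the compressed video latent."""
    return (math.ceil(frames / t_ratio), latent_channels,
            height // s_ratio, width // s_ratio)

# A 204-frame clip (the maximum length cited in the paper) at an assumed 544x992:
print(latent_shape(204, 544, 992))   # -> (26, 16, 34, 62)
```

The DiT then attends over this much smaller latent grid, which is what makes 3D full attention over long clips computationally tractable.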
The report defines two levels of video foundation models: translational (Level-1) and predictable (Level-2). Step-Video-T2V is categorized as Level-1, which focuses on cross-modal translation, generating videos from text, visual, or multimodal inputs. The paper argues that current diffusion-based models struggle with complex action sequences and adherence to physical laws because they lack explicit causal relationship modeling.
Key components of Step-Video-T2V include:
- Video-VAE: A deep compression VAE whose 16×16 spatial and 8× temporal compression ratios shorten the latent sequences and substantially reduce training and inference cost.
- Bilingual Text Encoders: Two encoders, Hunyuan-CLIP and Step-LLM, process English and Chinese prompts. Hunyuan-CLIP aligns text with the visual space but has a limited input length, while Step-LLM handles longer sequences using a redesigned Alibi positional embedding.
- DiT with 3D Full Attention: The DiT comprises 48 layers, each with 48 attention heads, and uses 3D full attention to model spatial and temporal information jointly. It incorporates cross-attention layers for text-prompt integration, AdaLN (Adaptive Layer Normalization) with optimized computation, RoPE-3D (3D Rotary Position Embedding) for handling video data, and QK-Norm (Query-Key Normalization) to stabilize the self-attention mechanism. The model is trained with Flow Matching, minimizing the Mean Squared Error (MSE) between predicted and ground-truth velocities (a training-step sketch follows this list).
- Video-DPO: A video-based DPO approach that leverages human feedback to reduce artifacts and enhance the visual quality of generated videos (a code sketch of the loss follows this list). It uses the preference loss:
$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(y, x_w, x_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \left( \log \frac{\pi_\theta(x_w \mid y)}{\pi_{\text{ref}}(x_w \mid y)} - \log \frac{\pi_\theta(x_l \mid y)}{\pi_{\text{ref}}(x_l \mid y)} \right) \right) \right]$
where:
- $\pi_\theta$ is the current policy,
- $\pi_{\text{ref}}$ is the reference policy,
- $x_w$ is the preferred sample,
- $x_l$ is the non-preferred sample,
- $y$ is the condition,
- $\mathcal{D}$ is the data distribution,
- $\beta$ is a hyperparameter, and
- $\sigma$ is the sigmoid function.
- Distillation: The paper also proposes a distillation strategy to reduce the number of function evaluations (NFE) required during inference. By self-distilling a 2-rectified-flow model with the rectified flow objective, the model reduces NFE to as few as 8 steps with minimal performance degradation. A U-shaped distribution is used for timestep sampling, and a linearly diminishing Classifier-Free Guidance (CFG) schedule is applied during inference (a small schedule sketch follows this list).
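For the DiT bullet above, the following is a minimal Flow Matching training-step sketch. It assumes the common linear-interpolation formulation (interpolating between Gaussian noise and data latents, with the velocity target being their difference); the `model` callable and its `(x_t, t, text_emb)` signature are hypothetical placeholders, and the paper's exact parameterization may differ.

```python
# Minimal Flow Matching training-step sketch (not the paper's exact code).
# Assumes x_t = (1 - t) * noise + t * latents, with velocity target (latents - noise).
import torch
import torch.nn.functional as F

def flow_matching_loss(model, latents, text_emb):
    noise = torch.randn_like(latents)                        # x_0 ~ N(0, I)
    t = torch.rand(latents.shape[0], device=latents.device)  # one timestep per sample
    t_ = t.view(-1, *([1] * (latents.dim() - 1)))            # broadcast over latent dims
    x_t = (1.0 - t_) * noise + t_ * latents                  # interpolated point
    target_velocity = latents - noise                        # ground-truth velocity
    pred_velocity = model(x_t, t, text_emb)                  # DiT predicts velocity
    return F.mse_loss(pred_velocity, target_velocity)        # MSE loss from the summary
```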
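For the Video-DPO bullet, here is a minimal sketch that mirrors the written preference loss. It assumes the inputs are batched log-probabilities of the preferred and non-preferred samples under the current and reference policies; in practice, video diffusion models substitute per-step denoising losses for exact log-likelihoods, so this is only a transcription of the formula, not the paper's implementation.

```python
# Minimal sketch of the DPO preference loss written above.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, logp_ref_w, logp_ref_l, beta: float = 0.1):
    ratio_w = logp_w - logp_ref_w   # log pi_theta(x_w|y) - log pi_ref(x_w|y)
    ratio_l = logp_l - logp_ref_l   # log pi_theta(x_l|y) - log pi_ref(x_l|y)
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
```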
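Finally, for the distillation bullet's linearly diminishing CFG schedule, a small sketch: the start and end guidance scales (9.0 → 1.0) are illustrative assumptions, since the summary only states that the scale decays linearly over the inference steps.

```python
# Minimal sketch of a linearly diminishing Classifier-Free Guidance schedule.
# The start/end scales are illustrative assumptions.
def cfg_scale(step: int, num_steps: int, start: float = 9.0, end: float = 1.0) -> float:
    frac = step / max(num_steps - 1, 1)
    return start + (end - start) * frac

# With the 8-NFE distilled model:
print([round(cfg_scale(s, 8), 2) for s in range(8)])
# [9.0, 7.86, 6.71, 5.57, 4.43, 3.29, 2.14, 1.0]
```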
The training process involves a cascaded approach: text-to-image pre-training, text-to-video/image pre-training, supervised fine-tuning (SFT), and DPO training. Key findings include the importance of text-to-image pre-training for acquiring visual knowledge, the significance of low-resolution text-to-video pre-training for learning motion dynamics, the impact of high-quality videos with accurate captions in SFT, and the effectiveness of video-based DPO for enhancing visual quality.
The system architecture for Step-Video-T2V includes an offline stage using Step Emulator (SEMU) for resource allocation and parallelism strategy estimation. The training is deployed across GPU clusters, with data transmission facilitated by StepRPC, a high-performance Remote Procedure Call (RPC) framework. StepTelemetry is used for monitoring and analysis, collecting data statistics from inference clusters and performance metrics from training clusters. The infrastructure comprises thousands of NVIDIA H800 GPUs interconnected by a RoCEv2 fabric.
The paper details the training framework optimizations, such as Step Emulator for estimating resource consumption, distributed training strategies, and tailored computation and communication modules like VAE computation acceleration and custom RoPE-3D kernels. A hybrid-grained load balancing strategy is introduced to address load imbalance issues when processing mixed-resolution videos and images within the same global iteration.
A large-scale video dataset comprising 2B video-text pairs and 3.8B image-text pairs was constructed using a comprehensive data pipeline consisting of video segmentation, video quality assessment, video motion assessment, video captioning, video concept balancing, and video-text alignment.
Experiments compare Step-Video-T2V with open-source models like HunyuanVideo and commercial engines using the Step-Video-T2V-Eval benchmark, which consists of 128 diverse prompts across 11 categories. The evaluation metrics include a win/tie/loss comparison and scores for instruction following, motion smoothness, physical plausibility, and aesthetic appeal. Results indicate that Step-Video-T2V achieves state-of-the-art performance compared to open-source models and strong performance relative to commercial engines, particularly in generating videos with high motion dynamics.
The paper also compares Video-VAE with open-source baselines, demonstrating state-of-the-art reconstruction quality despite a higher compression ratio. Ablation studies confirm the effectiveness of the proposed Video-DPO algorithm in generating videos more aligned with user preferences.
The discussion section highlights the importance of instruction following, adherence to the laws of physics, and high-quality labeled data for post-training. The paper concludes by outlining future research directions, including the development of advanced model paradigms and the construction of a comprehensive video knowledge base with structured labels.