SkyReels-V2: Infinite-length Film Generative Model (2504.13074v3)

Published 17 Apr 2025 in cs.CV

Abstract: Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal LLM (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.

Summary

  • The paper introduces SkyReels-V2, a framework that synergizes MLLMs, multi-stage pretraining, reinforcement learning, and diffusion forcing to generate long-form films.
  • It details a comprehensive data processing pipeline with advanced shot segmentation, progressive filtering, and human-in-the-loop validation to ensure high-quality training data.
  • The model achieves state-of-the-art performance in video generation tasks, excelling in instruction adherence, motion quality, and cinematic consistency in both T2V and I2V applications.

SkyReels-V2 is presented as an Infinite-length Film Generative Model designed to overcome limitations in existing video generation models, particularly regarding prompt adherence, visual quality, motion dynamics, and video duration, which often fall short for realistic long-form and professional film-style content. The core contribution is a framework that synergizes Multi-modal LLMs (MLLMs), Multi-stage Pretraining, Reinforcement Learning (RL), and a Diffusion Forcing Framework.

The practical implementation of SkyReels-V2 involves several key components and stages:

  1. Data Processing: A comprehensive pipeline (Figure 3) is crucial, utilizing diverse data sources (general web videos, self-collected films/TV, artistic assets) reaching O(100M) scale. The pipeline involves:
    • Pre-processing: Shot segmentation splits videos into single-shot clips. These clips are then annotated using a hierarchical captioning system.
    • Filtering: A progressive filtering strategy moves from loose to strict criteria across training stages to control data quality. Categories of issues addressed include basic quality (low resolution/FPS, static, shake, unstable motion, transitions), video type (surveillance, game, animation, static), and post-processing artifacts (subtitles, logos, editing, borders, split screens, effects, mosaics) (Table 1). Specific element filters (Black/Static Screen, Aesthetic, Deduplication, OCR, Mosaic, Special effect) and quality filters (VQA, IQA, VTSS) are used.
    • Data Croppers: Techniques like Black Border Crop and a specific algorithm (Algorithm 1, Appendix A.2) for Subtitle and Logo Cropping (Figure 5) are applied to salvage data with overlays, maximizing data utility.
    • Data Balancing: In post-training, detailed concept balancing is applied using subject categories from the captioner (Figure 6, Table 2) to ensure model generalization.
    • Human-In-The-Loop Validation: Manual checks at every stage (sources, segmentation, pre/post-training) using strict criteria and sampling rates (e.g., 0.01% sample for pre-training, 0.1% for post-training with tighter limits) are performed to maintain high data quality.
  2. Video Captioner (SkyCaptioner-V1): This model addresses the limitation of general MLLMs in interpreting cinematic grammar. It generates detailed structural captions (Figure 7) including subject types, appearance, action, expression, position, and shot metadata (type, angle, position, camera motion, environment, lighting); an illustrative example of this structured layout appears after this list.
    • Sub-expert Captioners: Specialized models are trained for Shot (type, angle, position, evaluated at 82.2%, 78.7%, 93.1% accuracy respectively), Expression (emotion, intensity, features, temporal, evaluated at 88% emotion, 95% intensity, 85% features, 93% temporal accuracy), and Camera Motion (hierarchical classification, 6DoF parameterization, active learning + synthetic data, evaluated at 89% single-type, 78-83% complex motion accuracy).
    • SkyCaptioner-V1 Training: Knowledge from Qwen2.5-VL-32B and these sub-expert models is distilled into SkyCaptioner-V1 (based on Qwen2.5-VL-7B-Instruct) using a balanced dataset of ~2 million videos. Evaluated on a 1000-sample test set, SkyCaptioner-V1 significantly outperforms baselines, particularly in shot-related fields (Table 3), achieving 76.3% average accuracy.
    • Caption Fusion: A fusion pipeline using Qwen2.5-32B-Instruct combines the structural fields into natural language captions, with different prompts for T2V (dense descriptions) and I2V (subject + action/expression + camera motion) (Appendix A.3).
  3. Multistage Pretraining: The model architecture is based on Wan2.1 (2503.05346), using a Diffusion Transformer (DiT) trained with a Flow Matching objective (2205.07717, 2403.08567).
    • Objective: Minimize the difference between the predicted velocity field $\mathbf{u}_\theta(\mathbf{x}_t, \mathbf{c}, t)$ and the ground-truth velocity $\mathbf{v}_t = \mathbf{x}_1 - \mathbf{x}_0$ (Equation 3); a minimal sketch of this objective is given after this list.
    • Data Handling: Dual-axis bucketing ($B_T \times B_{AR}$) handles spatiotemporal heterogeneity, with adaptive batch sizing. FPS normalization uses residue-aware downsampling ($\mathrm{original\_fps} \bmod f$) and learnable frequency embeddings in the DiT.
    • Stages: Three stages progressively increase resolution: Stage 1 (256p, joint image-video, lenient filtering, basic generation), Stage 2 (360p, joint image-video, moderate filtering, improved clarity), Stage 3 (540p, video only, strict filtering, source filtering, enhanced quality and cinematic properties). AdamW optimizer is used with varying learning rates across stages.
  4. Post Training: This involves four stages for performance enhancement.
    • Initial High-quality SFT (540p): Fine-tunes the pretrained model using balanced data to set a good initialization. FPS embeddings are removed as only 24 FPS data is used.
    • Reinforcement Learning (RL): Focuses on improving motion quality, especially for large/deformable motions and physical plausibility (Figure 8).
      • Preference Data: A semi-automatic pipeline combines human-annotated and automatic data. Human annotation uses professional evaluators to rate paired videos based on a detailed motion quality criteria taxonomy (Table 5, Appendix A.4). Automatic generation creates distorted samples (V2V, I2V, T2V variants, Figure 8) from real videos by manipulating noise, sampling rates, and using different models/timesteps.
      • Reward Model: A motion quality reward model (based on Qwen2.5-VL-7B-Instruct) is trained on 30k sample pairs using a Bradley-Terry-with-ties (BTT) loss; a sketch of this pairwise loss is given after this list.
      • DPO Training: Direct Preference Optimization for Flow (Flow-DPO (2503.03579)) is applied (Loss Equation) using triplets (chosen, rejected, prompt) generated by sampling from the model and ranking with the reward model. The process is staged, refreshing the reference model as training progresses.
    • Diffusion Forcing Training: Transforms the full-sequence diffusion model into a diffusion forcing model capable of generating variable-length videos by assigning independent noise levels per token.
      • Training: Uses the Frame-oriented Probability Propagation (FoPP) timestep scheduler, based on dynamic programming and non-decreasing timestep constraints (steps 1-5, equation for $d_{i,j}$), to stabilize training compared to previous methods; a minimal sketch of the non-decreasing per-frame schedule is given after this list.
      • Inference: Employs the Adaptive Difference (AD) timestep scheduler (equation for $t_i$), which supports both asynchronous autoregressive and synchronous generation based on an adaptive difference variable $s$. A causal attention mechanism, leveraging cleaner historical samples as conditions, and K/V caching optimize inference efficiency.
    • Final High-quality SFT (720p): Further fine-tunes the model at 720p resolution using manually filtered, higher-quality concept-balanced datasets to enhance overall visual quality.
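
The structural caption format referenced in item 2 can be made concrete with a small example. The snippet below is purely illustrative: the field names and values are hypothetical stand-ins for the kind of layout SkyCaptioner-V1 produces and the fusion step consumes, not the paper's actual schema.

```python
# Hypothetical structural caption; field names and values are illustrative only,
# not SkyCaptioner-V1's actual output schema.
structural_caption = {
    "subjects": [{
        "type": "human",
        "appearance": "woman in a red trench coat",
        "action": "walks toward the camera",
        "expression": {"emotion": "determined", "intensity": "medium"},
        "position": "center-left of frame",
    }],
    "shot": {"type": "medium shot", "angle": "eye-level", "position": "frontal"},
    "camera_motion": "slow dolly-in",
    "environment": "rain-soaked city street at night",
    "lighting": "neon signs, high contrast",
}

# A T2V-style fusion prompt would flatten all fields into one dense description,
# whereas an I2V-style prompt keeps only subject + action/expression + camera motion.
```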
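
For the flow-matching objective in item 3, the following is a minimal sketch of a velocity-prediction training step under linear interpolation between a noise sample $\mathbf{x}_0$ and clean latents $\mathbf{x}_1$; `model` is a stand-in for the DiT, and the code is not the paper's implementation.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Minimal flow-matching step: regress the velocity v = x1 - x0 on the linear path.

    x1:   clean video latents, shape (B, ...); cond: conditioning (e.g. text embeddings).
    `model` is a stand-in for the DiT; this is a sketch, not the paper's implementation.
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                    # noise endpoint of the path
    t = torch.rand(b, device=x1.device)          # uniform timesteps in [0, 1]
    t_b = t.view(b, *([1] * (x1.dim() - 1)))     # broadcast t over latent dimensions
    xt = (1.0 - t_b) * x0 + t_b * x1             # point on the linear interpolation path
    v_target = x1 - x0                           # ground-truth velocity field
    v_pred = model(xt, cond, t)                  # u_theta(x_t, c, t)
    return torch.mean((v_pred - v_target) ** 2)  # MSE between predicted and true velocity
```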
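
For the reward model in item 4, a Bradley-Terry-with-ties loss can be sketched as below. The paper does not spell out the tie parameterization; this sketch assumes the Rao-Kupper form with a tie parameter theta >= 1.

```python
import torch
import torch.nn.functional as F

def btt_loss(r_a, r_b, label, theta=1.5):
    """Bradley-Terry-with-ties negative log-likelihood (Rao-Kupper form, assumed).

    r_a, r_b: reward-model scores for the two videos in a pair, shape (B,)
    label:    0 -> A preferred, 1 -> B preferred, 2 -> tie
    theta:    tie parameter (>= 1); larger values place more probability on ties
    """
    m = torch.maximum(r_a, r_b)                    # shift scores for numerical stability
    pa, pb = torch.exp(r_a - m), torch.exp(r_b - m)
    p_a_wins = pa / (pa + theta * pb)
    p_b_wins = pb / (pb + theta * pa)
    p_tie = (theta**2 - 1.0) * pa * pb / ((pa + theta * pb) * (pb + theta * pa))
    log_probs = torch.log(torch.stack([p_a_wins, p_b_wins, p_tie], dim=-1) + 1e-12)
    return F.nll_loss(log_probs, label)            # cross-entropy over {A wins, B wins, tie}
```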
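
For the diffusion forcing stage in item 4, the sketch below only illustrates the core constraint behind FoPP: independent per-frame timesteps that are non-decreasing across frames, so earlier (history) frames stay cleaner. The actual FoPP dynamic-programming propagation and the AD inference schedule for $t_i$ are not reproduced here.

```python
import torch

def sample_nondecreasing_timesteps(num_frames, num_steps=1000):
    """Sample one diffusion timestep per frame with t_1 <= t_2 <= ... <= t_F.

    Earlier frames receive smaller timesteps (less noise), so they can serve as cleaner
    history for later frames. This shows only the non-decreasing constraint; the paper's
    FoPP scheduler additionally balances (frame, timestep) coverage via dynamic programming.
    """
    t = torch.randint(0, num_steps, (num_frames,))
    t, _ = torch.sort(t)   # enforce the non-decreasing noise constraint across frames
    return t

# At inference, an asynchronous schedule keeps frame i a fixed number of denoising steps
# ahead of frame i+1 (controlled by the adaptive difference variable s), while a zero
# offset recovers synchronous full-sequence denoising.
```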

Infrastructure and Optimization:

  • Training: Memory optimization uses efficient operator fusion, BF16 gradient checkpointing (GC), and selective activation offloading. Training stability is addressed with a self-healing framework. Parallelization uses FSDP for model/optimizer states and Sequence Parallel (2305.15711) for activation memory pressure, especially at 720p. Pre-computation of VAE/text encoder results is also used.
  • Inference: Optimizations target reducing latency on hardware such as the RTX 4090 (24GB). VRAM optimization uses FP8 quantization and parameter-level offloading. Quantization applies FP8 dynamic quantization with GEMM acceleration for linear layers (1.10× speedup) and sageAttn2-8bit (2502.15115) for attention (1.30× speedup). Multi-GPU parallelization (Content, CFG, VAE parallel) reduces latency (1.8× on 8 GPUs). Distillation using DMD techniques (2403.18691, 2403.18703) accelerates generation (e.g., 4-step generation).
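
As a rough sanity check on these numbers, if the component speedups composed multiplicatively (an idealized assumption, since linear layers, attention, and multi-GPU parallelism overlap only partially), the combined effect would be on the order of 2.5×:

```python
# Idealized combination of the reported speedups, assuming they compose multiplicatively.
linear_fp8 = 1.10   # FP8 dynamic quantization + GEMM acceleration for linear layers
attn_8bit  = 1.30   # sageAttn2-8bit attention
multi_gpu  = 1.8    # content/CFG/VAE parallelism on 8 GPUs
print(f"combined ~ {linear_fp8 * attn_8bit * multi_gpu:.2f}x")  # ~ 2.57x
```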

Performance Evaluation:

  • SkyReels-Bench (Human Eval): A benchmark with 1020 prompts evaluates Instruction Adherence (motion, subject, spatial, shot, expression, camera motion, hallucination), Motion Quality (dynamism, fluidity, plausibility), Consistency (subject, scene, first-frame fidelity for I2V), and Visual Quality (clarity, color, integrity) (Table 4, Appendix A.1). SkyReels-V2 achieves state-of-the-art among open-source models for T2V (Table 6), particularly excelling in Instruction Adherence. SkyReels-V2-I2V and SkyReels-V2-DF demonstrate SOTA open-source performance for I2V (Table 8).
  • VBench1.0 (Automated Eval): SkyReels-V2 performs best among open-source models on the long-prompt version (Table 7), achieving the highest Total and Quality scores and a slightly lower Semantic score than Wan2.1, which the authors attribute to VBench's limited evaluation of shot semantics.

Applications:

  • Story Generation: The diffusion forcing framework enables generating videos of theoretically infinite length using a sliding-window approach that conditions on previous frames and prompts (Figure 9, Figure 10). A stabilization technique that adds slight noise to generated frames before reusing them as conditions helps prevent error accumulation in long rollouts, and sequential prompts allow orchestrating narratives (Figure 10); a sliding-window sketch is given after this list.
  • Image-to-Video Synthesis (I2V): Two methods are explored: Fine-tuning the T2V model by injecting the first frame latent and mask channels (SkyReels-V2-I2V) and using the Diffusion Forcing T2V model with the first frame as a clean condition (SkyReels-V2-DF). Both achieve competitive I2V results (Table 8). SkyReels-V2-I2V is open-sourced.
  • Camera Director: Fine-tuning the I2V model on a balanced dataset of camera motions improves control over cinematographic effects like fluidity and diversity.
  • Elements-to-Video (E2V): Building on prior work (SkyReels-A2 (2503.16125), Figure 11), the framework supports composing arbitrary visual elements while maintaining fidelity to reference images. Future work aims for a unified framework supporting audio and pose input for broader applications like short films, music videos, and e-commerce content.
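
The sliding-window story generation described above can be sketched as follows. Everything here is hypothetical: `df_model.generate` stands in for the diffusion-forcing sampler, and the window, overlap, and noise values are illustrative rather than the released configuration.

```python
import torch

def rollout_long_video(df_model, prompts, window=97, overlap=17, noise_std=0.02):
    """Sliding-window rollout sketch; `df_model.generate` is a hypothetical interface
    returning `num_frames` new frames conditioned on optional history frames."""
    frames, condition = [], None
    for prompt in prompts:                       # one prompt per narrative segment
        chunk = df_model.generate(prompt, condition_frames=condition, num_frames=window)
        frames.append(chunk)
        # Re-noise the tail slightly before using it as history for the next window,
        # which helps curb error accumulation over very long rollouts.
        tail = chunk[-overlap:]
        condition = tail + noise_std * torch.randn_like(tail)
    return torch.cat(frames, dim=0)
```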

In summary, SkyReels-V2 provides a practical, open-source framework for high-quality, long-form video generation with improved prompt adherence and motion dynamics, enabled by specialized data processing, captioning, multi-stage training including RL, and a diffusion forcing architecture. While infinite length is theoretically possible, error accumulation remains a practical challenge for very long generations. The project includes open-sourced models and code.
