Open-Sora-Plan: Open-Source Video Generation Suite
- Open-Sora-Plan is an open-source video generation suite that integrates WF-VAE, joint Skiparse Denoiser, and condition controllers for high-resolution, long-duration video synthesis.
- It employs a Wavelet-Flow Variational Autoencoder with cascaded Haar wavelet transforms and causal 3D convolutions to ensure structural consistency and artifact-free streaming inference.
- The suite features a multi-dimensional data curation pipeline and adaptive training strategies, achieving superior quantitative and qualitative performance compared to state-of-the-art methods.
Open-Sora-Plan is an open-source, large-scale video generation suite that provides an integrated architecture and workflow for high-resolution, long-duration video synthesis from multiple conditioning sources. It encompasses novel model components, optimized training/inference strategies, and multi-dimensional data curation, aiming for both high efficiency and strong quantitative/qualitative outcomes. All code, model weights, and documentation are publicly released for academic and applied use (Lin et al., 2024).
1. Wavelet-Flow Variational Autoencoder (WF-VAE)
The WF-VAE forms the foundational latent-space representation for Open-Sora-Plan. It consists of encoder and decoder modules built on 3D convolutional backbones. Compression is realized through cascaded multi-level Haar wavelet transforms, which decompose the input video into eight subbands per scale (low/high-frequency combinations along the temporal, height, and width axes).
A main energy path ("Flow") injects low-frequency wavelet features into the decoder, supporting symmetry and structural consistency across scales.
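For intuition, a single level of the 3D Haar decomposition can be sketched as below; this is an illustrative PyTorch implementation of the standard transform, not the released WF-VAE code.

```python
# Minimal sketch of a single-level 3D Haar wavelet decomposition: averaging and
# differencing along time, height, and width yields eight subbands per scale.
import torch

def haar_1d(x, dim):
    # Split into even/odd samples along `dim`, then form low/high-pass bands.
    even = x.index_select(dim, torch.arange(0, x.size(dim), 2))
    odd = x.index_select(dim, torch.arange(1, x.size(dim), 2))
    return (even + odd) / 2 ** 0.5, (even - odd) / 2 ** 0.5

def haar_3d(video):
    # video: (B, C, T, H, W) with even T, H, W.
    bands = [video]
    for dim in (2, 3, 4):  # time, height, width
        bands = [b for band in bands for b in haar_1d(band, dim)]
    return bands  # LLL, LLH, LHL, LHH, HLL, HLH, HHL, HHH

video = torch.randn(1, 3, 8, 64, 64)
subbands = haar_3d(video)
print(len(subbands), subbands[0].shape)  # 8 torch.Size([1, 3, 4, 32, 32])
```

Cascading the transform on the low-frequency band yields the multi-level decomposition whose low-frequency features feed the main energy path described above.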
The VAE objective follows standard ELBO-based training augmented with:
- $\mathcal{L}_1$ and $\mathcal{L}_2$ terms for reconstruction.
- adversarial loss with a dynamically reweighted coefficient.
- a wavelet-flow consistency loss, enforcing feature preservation in the decomposed bands.
- KL regularization.
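Schematically, and with coefficient names that are illustrative rather than the paper's exact notation, the combined objective can be written as

$$
\mathcal{L}_{\text{WF-VAE}} \;=\; \mathcal{L}_{\text{recon}} \;+\; \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}} \;+\; \lambda_{\text{WL}}\,\mathcal{L}_{\text{WL}} \;+\; \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}},
$$

where $\mathcal{L}_{\text{WL}}$ is the wavelet-flow consistency term and the adversarial weight $\lambda_{\text{adv}}$ is dynamically adjusted as noted above.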
The VAE performs compression in the wavelet-transformed latent space, followed by patch embedding. Block-wise inference utilizes causal 3D convolutions and cache-based handling of tile boundaries (Causal Cache).
This approach eliminates boundary artifacts during streaming inference.
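A minimal sketch of the cache-based idea for one causal temporal convolution is shown below; the chunk size, first-chunk padding convention, and layer structure are illustrative assumptions, not the released Causal Cache implementation.

```python
# Minimal sketch of cache-based ("Causal Cache"-style) block-wise inference for
# one causal temporal convolution; chunk size, first-chunk padding, and layer
# structure are illustrative assumptions.
import torch
import torch.nn as nn

class CausalConv3d(nn.Module):
    def __init__(self, ch, kt=3, ks=3):
        super().__init__()
        self.kt = kt
        # Pad only spatially; temporal context comes from the cache.
        self.conv = nn.Conv3d(ch, ch, (kt, ks, ks), padding=(0, ks // 2, ks // 2))
        self.cache = None  # last (kt - 1) input frames of the previous chunk

    def forward(self, x):  # x: (B, C, T, H, W)
        if self.cache is None:
            # First chunk: left-pad by repeating its first frame.
            pad = x[:, :, :1].repeat(1, 1, self.kt - 1, 1, 1)
        else:
            pad = self.cache
        x_cat = torch.cat([pad, x], dim=2)
        self.cache = x_cat[:, :, -(self.kt - 1):].detach()  # carry tail forward
        return self.conv(x_cat)

# Streaming over temporal chunks reproduces a single full-length pass exactly,
# so chunk boundaries introduce no artifacts.
layer = CausalConv3d(ch=8)
video = torch.randn(1, 8, 16, 32, 32)
chunked = torch.cat([layer(c) for c in video.split(4, dim=2)], dim=2)
print(chunked.shape)  # torch.Size([1, 8, 16, 32, 32])
```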
2. Joint Image-Video Skiparse Denoiser
The Skiparse Denoiser operates on VAE-compressed latent space, providing efficient and scalable denoising via a sparse attention mechanism:
- Patch embedding of video latents via a 3D convolution.
- Text encoding via mT5-XXL, with the embedded prompt projected for cross-attention.
- AdaLN-Zero modulation per diffusion timestep via scale/shift/gate injection.
- 3D RoPE (Rotary Positional Embedding) applied by partitioning feature channels across the temporal and spatial axes.
Skiparse Attention alternates between "single skip" and "group skip" sparse patterns with sparse ratio $k$, reducing attention cost roughly by a factor of $k$ while preserving global token interaction; each pattern is characterized by its average attention distance.
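The following sketch illustrates one plausible way to form the two token subsets for a sparse ratio $k$; the exact grouping used by Open-Sora-Plan may differ in detail.

```python
# Hedged sketch of how "single skip" and "group skip" token subsets could be
# formed for a sparse ratio k; the exact grouping used by Open-Sora-Plan may
# differ in detail.
import torch

def single_skip_indices(n_tokens: int, k: int):
    """Every k-th token goes to the same sub-sequence -> k interleaved subsets."""
    return [torch.arange(start, n_tokens, k) for start in range(k)]

def group_skip_indices(n_tokens: int, k: int):
    """Blocks of k adjacent tokens stay together; every k-th block joins the
    same sub-sequence (assumes n_tokens is divisible by k)."""
    blocks = torch.arange(n_tokens).reshape(-1, k)
    return [blocks[start::k].reshape(-1) for start in range(k)]

# Full attention over n tokens costs O(n^2); attending within each of the k
# sub-sequences (length ~ n/k) costs O(k * (n/k)^2) = O(n^2 / k).
n, k = 32, 4
print(single_skip_indices(n, k)[0])  # tensor([ 0,  4,  8, 12, 16, 20, 24, 28])
print(group_skip_indices(n, k)[0])   # tensor([ 0,  1,  2,  3, 16, 17, 18, 19])
```

Alternating the two patterns across layers lets information propagate between otherwise disjoint sub-sequences.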
The denoising loss uses $v$-prediction with min-SNR reweighting (threshold $\gamma$).
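For reference, a common form of this objective (a sketch assuming the standard Min-SNR-$\gamma$ weighting adapted to $v$-prediction; the exact weighting used in the release may differ) is

$$
\mathcal{L} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\frac{\min\!\big(\mathrm{SNR}(t),\,\gamma\big)}{\mathrm{SNR}(t)+1}\,\big\lVert v_\theta(x_t, t, c) - v_t \big\rVert_2^2\right],
\qquad v_t = \alpha_t\,\epsilon - \sigma_t\,x_0,\quad \mathrm{SNR}(t) = \alpha_t^2/\sigma_t^2 .
$$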
Training proceeds in stages: initial T2I pretrain (100k steps full 3D; 100k Skiparse), followed by joint image/video pretrain (200k steps), and fine-tuning (30k steps) on filtered video data.
3. Condition Controllers
Open-Sora-Plan includes a suite of controllers for diverse conditioning modalities:
- Text Controller: Encodes prompts (via mT5-XXL, projected for cross-attn keys/values).
- Image Condition Controller: Implements temporal inpainting for image-to-video, transition, and continuation tasks. The noisy latent, mask, and masked video latent are concatenated along the channel dimension before being fed to the denoiser.
- Structure Condition Controller: Guides generation by adding encoded structure representations (edge, depth, or sketch) to the latent tokens in each transformer block, $x \leftarrow x + \mathrm{Proj}(c)$, where $c$ is the encoded condition and $\mathrm{Proj}$ matches its dimension to the hidden size (a sketch of both conditioning paths follows below).
Structure guidance is trained on Panda70M for 20k steps using a light 3D convolutional encoder and transformer blocks.
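A minimal sketch of the two conditioning paths above; tensor shapes, the condition encoder, and the projection layer are illustrative assumptions rather than the released modules.

```python
# Illustrative sketch of image-condition channel concatenation and per-block
# structure-condition injection; shapes and modules are assumptions.
import torch
import torch.nn as nn

B, C, T, H, W = 1, 8, 4, 16, 16
noisy_latent = torch.randn(B, C, T, H, W)
mask = torch.zeros(B, 1, T, H, W)            # 1 = frames/regions to generate
masked_video = torch.randn(B, C, T, H, W)    # VAE latent of the given frames

# Image condition controller: concatenate along the channel dimension, so the
# denoiser input has C + 1 + C channels.
denoiser_input = torch.cat([noisy_latent, mask, masked_video], dim=1)

# Structure condition controller: encode the control signal (edge/depth/sketch)
# and add its projection to the latent tokens in each transformer block.
hidden = 64
tokens = torch.randn(B, T * H * W, hidden)    # latent tokens
structure = torch.randn(B, T * H * W, 32)     # encoded structure features
proj = nn.Linear(32, hidden)                  # match the hidden dimension
for _ in range(2):                            # per transformer block
    tokens = tokens + proj(structure)
    # ... the block's attention / MLP would run here ...
```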
4. Assistant Strategies for Training and Inference
Several operational strategies optimize scalability, efficiency, and robustness:
- Min-Max Token Strategy: Unifies batches of different resolutions/durations by selecting a downsample factor across aspect-ratio buckets, yielding constant token count per batch.
- Adaptive Gradient Clipping: Per-GPU gradient norms are tracked against an EMA; GPUs with abnormal norms are zeroed out, normal ones are rescaled, and an all-reduce aggregates the gradients afterward. EMA tracking prevents isolated loss spikes from destabilizing training (see the sketch after this list).
- Prompt Refiner: LoRA-adapted LLaMA-3.1-8B is fine-tuned to expand prompts for improved semantic richness and VBench scores. During inference, user prompts are rewritten for optimal conditioning.
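Returning to the adaptive gradient clipping above, a rough sketch of the scheme is below; the abnormality test, rescale factor, and EMA decay are illustrative assumptions, and an initialized torch.distributed process group is assumed.

```python
# Hedged sketch: ranks with abnormal gradient norms (relative to an EMA of
# recent norms) contribute zero, the rest are rescaled, then gradients are
# averaged with all-reduce. Threshold factor and EMA decay are assumptions.
import torch
import torch.distributed as dist

ema_norm = None  # EMA of recent "healthy" gradient norms on this rank

def adaptive_clip_and_allreduce(params, world_size, thresh=3.0, decay=0.99):
    global ema_norm
    grads = [p.grad for p in params if p.grad is not None]
    local_norm = torch.norm(torch.stack([g.norm() for g in grads]))

    if ema_norm is not None and local_norm > thresh * ema_norm:
        scale = 0.0  # abnormal spike: drop this rank's contribution
    else:
        # Normal: optionally rescale toward the EMA, then update the EMA.
        scale = 1.0 if ema_norm is None else min(1.0, float(ema_norm / local_norm))
        ema_norm = (local_norm.detach() if ema_norm is None
                    else decay * ema_norm + (1 - decay) * local_norm.detach())

    for g in grads:
        g.mul_(scale)
        dist.all_reduce(g, op=dist.ReduceOp.SUM)  # aggregate across GPUs
        g.div_(world_size)
```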
5. Multi-Dimensional Data Curation Pipeline
The training corpus comprises image and video datasets such as SAM (11.1M), Anytext (1.8M), Panda70M (21.2M), VIDAL (2.8M), and internal sources. Filtering proceeds in multiple steps:
- Slicing videos into clips of up to 16 s.
- Jump-cut detection via LPIPS (removes 3%).
- Motion filtering (mean LPIPS within [0.001, 0.3]; retains 89%).
- OCR-based cropping (EasyOCR; up to 20% of frame edges removed).
- Aesthetic filtering (Laion aesthetic predictor score ≥ 4.75; 49% remain).
- Technical-quality filtering (DOVER score ≥ 0; 44% remain).
- Final motion double-check (42% remain).
- Captioning via LLaVA, InternVL2, and Qwen2-VL for images; Qwen2-VL and ShareGPT4Video for videos.
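As a rough illustration of the cascade above, the per-clip decision logic might look like the sketch below; every scorer here is a hypothetical placeholder standing in for the named tools (LPIPS, EasyOCR, the Laion aesthetic predictor, DOVER), and the final double-check reuses the motion bounds only for illustration.

```python
# Hypothetical per-clip filter mirroring the curation cascade; the scorer
# callables are placeholders, not real APIs of the underlying tools.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ClipFilters:
    has_jump_cut: Callable      # LPIPS-based cut detector
    mean_lpips: Callable        # frame-to-frame motion score
    crop_text_edges: Callable   # EasyOCR-guided edge cropping
    aesthetic_score: Callable   # Laion aesthetic predictor
    dover_score: Callable       # DOVER technical-quality score

    def keep(self, clip) -> bool:
        if self.has_jump_cut(clip):
            return False
        if not (0.001 <= self.mean_lpips(clip) <= 0.3):
            return False
        clip = self.crop_text_edges(clip)
        if self.aesthetic_score(clip) < 4.75:
            return False
        if self.dover_score(clip) < 0:
            return False
        # Final motion double-check (reusing the same bounds for illustration).
        return 0.001 <= self.mean_lpips(clip) <= 0.3
```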
6. Quantitative and Qualitative Evaluation
WF-VAE exhibits competitive or superior performance relative to CV-VAE, OD-VAE, Allegro, and CogVideoX. Reconstruction metrics (33 frames at 256² resolution):
- PSNR, LPIPS, SSIM, rFVD all indicate competitive or best-in-class results.
- WF-VAE-S processes 11.11 videos/s at 512²×33 frames using 4.7 GB of memory.
- Causal Cache preserves lossless tiling (no reconstruction degradation).
Text-to-video: Open-Sora v1.2 (1.2B) and Open-Sora-Plan v1.3 (2.7B) score strongly on VBench (Aesthetic = 59.00, Action = 81.8, Scene = 71.00; with the prompt refiner: Object = 84.7, Scene = 52.9, MTScore = 2.95).
Image-to-video and structure-video workflows are benchmarked via qualitative displays (I2V, transitions, arbitrary structure insertion).
7. Implementation and Usage
All Open-Sora-Plan resources and pretrained weights are distributed via https://github.com/PKU-YuanGroup/Open-Sora-Plan. The Python interface supports:
```python
from opensora import VideoGenerator

gen = VideoGenerator(model="opensora-plan-v1.3")
out = gen.generate(text="…", seed=42, steps=50, cfg=7.5)
```
The interface supports text-to-video, image-to-video, inpainting, and structure-guided modes, as well as multilingual prompting through the integrated refiner. Mixed-precision inference (fp16) and patch-based tiling are natively supported (Lin et al., 2024).
All components, pipelines, and evaluation metrics are detailed explicitly in the primary publication to facilitate direct reproduction and further research.