Open-Sora-Plan: Open-Source Video Generation Suite
- Open-Sora-Plan is an open-source video generation suite that integrates WF-VAE, joint Skiparse Denoiser, and condition controllers for high-resolution, long-duration video synthesis.
- It employs a Wavelet-Flow Variational Autoencoder with cascaded Haar wavelet transforms and causal 3D convolutions to ensure structural consistency and artifact-free streaming inference.
- The suite features a multi-dimensional data curation pipeline and adaptive training strategies, achieving superior quantitative and qualitative performance compared to state-of-the-art methods.
Open-Sora-Plan is an open-source, large-scale video generation suite that provides an integrated architecture and workflow for high-resolution, long-duration video synthesis from multiple conditioning sources. It encompasses novel model components, optimized training/inference strategies, and multi-dimensional data curation, aiming for both high efficiency and strong quantitative/qualitative outcomes. All code, model weights, and documentation are publicly released for academic and applied use (Lin et al., 2024).
1. Wavelet-Flow Variational Autoencoder (WF-VAE)
The WF-VAE forms the foundational latent-space representation for Open-Sora-Plan. It consists of encoder and decoder modules built on 3D convolutional backbones. Compression is realized through cascaded multi-level Haar wavelet transforms, which decompose the input video into eight subbands per scale (low/high-frequency combinations along the temporal, height, and width axes).
A main energy path ("Flow") injects low-frequency wavelet features into the decoder, supporting symmetry and structural consistency across scales.
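For intuition, a single level of the 3D Haar decomposition can be sketched as below; this is an illustrative PyTorch implementation of the standard transform, not the released WF-VAE code.

```python
# Minimal sketch of a single-level 3D Haar wavelet decomposition: averaging and
# differencing along time, height, and width yields eight subbands per scale.
import torch

def haar_1d(x, dim):
    # Split into even/odd samples along `dim`, then form low/high-pass bands.
    even = x.index_select(dim, torch.arange(0, x.size(dim), 2))
    odd = x.index_select(dim, torch.arange(1, x.size(dim), 2))
    return (even + odd) / 2 ** 0.5, (even - odd) / 2 ** 0.5

def haar_3d(video):
    # video: (B, C, T, H, W) with even T, H, W.
    bands = [video]
    for dim in (2, 3, 4):  # time, height, width
        bands = [b for band in bands for b in haar_1d(band, dim)]
    return bands  # LLL, LLH, LHL, LHH, HLL, HLH, HHL, HHH

video = torch.randn(1, 3, 8, 64, 64)
subbands = haar_3d(video)
print(len(subbands), subbands[0].shape)  # 8 torch.Size([1, 3, 4, 32, 32])
```

Cascading the transform on the low-frequency band yields the multi-level decomposition whose low-frequency features feed the main energy path described above.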
The VAE objective follows standard ELBO-based training augmented with:
- $\mathcal{L}_1$ and $\mathcal{L}_2$ terms for reconstruction.
- adversarial loss with a dynamically reweighted coefficient.
- a wavelet-flow consistency loss, enforcing feature preservation in the decomposed bands.
- KL regularization.
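Schematically, and with coefficient names that are illustrative rather than the paper's exact notation, the combined objective can be written as

$$
\mathcal{L}_{\text{WF-VAE}} \;=\; \mathcal{L}_{\text{recon}} \;+\; \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}} \;+\; \lambda_{\text{WL}}\,\mathcal{L}_{\text{WL}} \;+\; \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}},
$$

where $\mathcal{L}_{\text{WL}}$ is the wavelet-flow consistency term and the adversarial weight $\lambda_{\text{adv}}$ is dynamically adjusted as noted above.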
The VAE performs compression in the wavelet-transformed latent space, followed by patch embedding. Block-wise inference utilizes causal 3D convolutions and cache-based handling of tile boundaries (Causal Cache).
This approach eliminates boundary artifacts during streaming inference.
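A minimal sketch of the cache-based idea for one causal temporal convolution is shown below; the chunk size, first-chunk padding convention, and layer structure are illustrative assumptions, not the released Causal Cache implementation.

```python
# Minimal sketch of cache-based ("Causal Cache"-style) block-wise inference for
# one causal temporal convolution; chunk size, first-chunk padding, and layer
# structure are illustrative assumptions.
import torch
import torch.nn as nn

class CausalConv3d(nn.Module):
    def __init__(self, ch, kt=3, ks=3):
        super().__init__()
        self.kt = kt
        # Pad only spatially; temporal context comes from the cache.
        self.conv = nn.Conv3d(ch, ch, (kt, ks, ks), padding=(0, ks // 2, ks // 2))
        self.cache = None  # last (kt - 1) input frames of the previous chunk

    def forward(self, x):  # x: (B, C, T, H, W)
        if self.cache is None:
            # First chunk: left-pad by repeating its first frame.
            pad = x[:, :, :1].repeat(1, 1, self.kt - 1, 1, 1)
        else:
            pad = self.cache
        x_cat = torch.cat([pad, x], dim=2)
        self.cache = x_cat[:, :, -(self.kt - 1):].detach()  # carry tail forward
        return self.conv(x_cat)

# Streaming over temporal chunks reproduces a single full-length pass exactly,
# so chunk boundaries introduce no artifacts.
layer = CausalConv3d(ch=8)
video = torch.randn(1, 8, 16, 32, 32)
chunked = torch.cat([layer(c) for c in video.split(4, dim=2)], dim=2)
print(chunked.shape)  # torch.Size([1, 8, 16, 32, 32])
```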
2. Joint Image-Video Skiparse Denoiser
The Skiparse Denoiser operates on VAE-compressed latent space, providing efficient and scalable denoising via a sparse attention mechanism:
- Patch embedding of video latents via a 3D convolution.
- Text encoding via mT5-XXL, with the embedded prompt projected for cross-attention.
- AdaLN-Zero modulation per diffusion timestep via scale/shift/gate injection.
- 3D RoPE (Rotary Positional Embedding) applied by partitioning feature channels across the temporal and spatial axes.
Skiparse Attention alternates between "single skip" and "group skip" sparse patterns with sparse ratio $k$, reducing attention cost roughly by a factor of $k$ while preserving global token interaction; each pattern is characterized by its average attention distance.
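The following sketch illustrates one plausible way to form the two token subsets for a sparse ratio $k$; the exact grouping used by Open-Sora-Plan may differ in detail.

```python
# Hedged sketch of how "single skip" and "group skip" token subsets could be
# formed for a sparse ratio k; the exact grouping used by Open-Sora-Plan may
# differ in detail.
import torch

def single_skip_indices(n_tokens: int, k: int):
    """Every k-th token goes to the same sub-sequence -> k interleaved subsets."""
    return [torch.arange(start, n_tokens, k) for start in range(k)]

def group_skip_indices(n_tokens: int, k: int):
    """Blocks of k adjacent tokens stay together; every k-th block joins the
    same sub-sequence (assumes n_tokens is divisible by k)."""
    blocks = torch.arange(n_tokens).reshape(-1, k)
    return [blocks[start::k].reshape(-1) for start in range(k)]

# Full attention over n tokens costs O(n^2); attending within each of the k
# sub-sequences (length ~ n/k) costs O(k * (n/k)^2) = O(n^2 / k).
n, k = 32, 4
print(single_skip_indices(n, k)[0])  # tensor([ 0,  4,  8, 12, 16, 20, 24, 28])
print(group_skip_indices(n, k)[0])   # tensor([ 0,  1,  2,  3, 16, 17, 18, 19])
```

Alternating the two patterns across layers lets information propagate between otherwise disjoint sub-sequences.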
The denoising loss uses $v$-prediction with min-SNR reweighting (threshold $\gamma$).
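For reference, a common form of this objective (a sketch assuming the standard Min-SNR-$\gamma$ weighting adapted to $v$-prediction; the exact weighting used in the release may differ) is

$$
\mathcal{L} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\frac{\min\!\big(\mathrm{SNR}(t),\,\gamma\big)}{\mathrm{SNR}(t)+1}\,\big\lVert v_\theta(x_t, t, c) - v_t \big\rVert_2^2\right],
\qquad v_t = \alpha_t\,\epsilon - \sigma_t\,x_0,\quad \mathrm{SNR}(t) = \alpha_t^2/\sigma_t^2 .
$$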
Training proceeds in stages: initial T2I pretrain (100k steps full 3D; 100k Skiparse), followed by joint image/video pretrain (200k steps), and fine-tuning (30k steps) on filtered video data.
3. Condition Controllers
Open-Sora-Plan includes a suite of controllers for diverse conditioning modalities:
- Text Controller: Encodes prompts (via mT5-XXL, projected for cross-attn keys/values).
- Image Condition Controller: Implements temporal inpainting for image-to-video, transition, and continuation tasks. The noisy latent, mask, and masked video latent are concatenated along the channel dimension before being fed to the denoiser.
- Structure Condition Controller: Guides generation by adding encoded structure representations (edge, depth, or sketch) to the latent tokens in each transformer block, $x \leftarrow x + \mathrm{Proj}(c)$, where $c$ is the encoded condition and $\mathrm{Proj}$ matches its dimension to the hidden size (a sketch of both conditioning paths follows below).
Structure guidance is trained on Panda70M for 20k steps using a light 3D convolutional encoder and transformer blocks.
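A minimal sketch of the two conditioning paths above; tensor shapes, the condition encoder, and the projection layer are illustrative assumptions rather than the released modules.

```python
# Illustrative sketch of image-condition channel concatenation and per-block
# structure-condition injection; shapes and modules are assumptions.
import torch
import torch.nn as nn

B, C, T, H, W = 1, 8, 4, 16, 16
noisy_latent = torch.randn(B, C, T, H, W)
mask = torch.zeros(B, 1, T, H, W)            # 1 = frames/regions to generate
masked_video = torch.randn(B, C, T, H, W)    # VAE latent of the given frames

# Image condition controller: concatenate along the channel dimension, so the
# denoiser input has C + 1 + C channels.
denoiser_input = torch.cat([noisy_latent, mask, masked_video], dim=1)

# Structure condition controller: encode the control signal (edge/depth/sketch)
# and add its projection to the latent tokens in each transformer block.
hidden = 64
tokens = torch.randn(B, T * H * W, hidden)    # latent tokens
structure = torch.randn(B, T * H * W, 32)     # encoded structure features
proj = nn.Linear(32, hidden)                  # match the hidden dimension
for _ in range(2):                            # per transformer block
    tokens = tokens + proj(structure)
    # ... the block's attention / MLP would run here ...
```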
4. Assistant Strategies for Training and Inference
Several operational strategies optimize scalability, efficiency, and robustness:
- Min-Max Token Strategy: Unifies batches of different resolutions/durations by selecting a downsample factor across aspect-ratio buckets, yielding constant token count per batch.
- Adaptive Gradient Clipping: Per-GPU gradient norms are tracked against an EMA; GPUs with abnormal norms are zeroed out, normal ones are rescaled, and an all-reduce aggregates the gradients afterward. EMA tracking prevents isolated loss spikes from destabilizing training (see the sketch after this list).
- Prompt Refiner: LoRA-adapted LLaMA-3.1-8B is fine-tuned to expand prompts for improved semantic richness and VBench scores. During inference, user prompts are rewritten for optimal conditioning.
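Returning to the adaptive gradient clipping above, a rough sketch of the scheme is below; the abnormality test, rescale factor, and EMA decay are illustrative assumptions, and an initialized torch.distributed process group is assumed.

```python
# Hedged sketch: ranks with abnormal gradient norms (relative to an EMA of
# recent norms) contribute zero, the rest are rescaled, then gradients are
# averaged with all-reduce. Threshold factor and EMA decay are assumptions.
import torch
import torch.distributed as dist

ema_norm = None  # EMA of recent "healthy" gradient norms on this rank

def adaptive_clip_and_allreduce(params, world_size, thresh=3.0, decay=0.99):
    global ema_norm
    grads = [p.grad for p in params if p.grad is not None]
    local_norm = torch.norm(torch.stack([g.norm() for g in grads]))

    if ema_norm is not None and local_norm > thresh * ema_norm:
        scale = 0.0  # abnormal spike: drop this rank's contribution
    else:
        # Normal: optionally rescale toward the EMA, then update the EMA.
        scale = 1.0 if ema_norm is None else min(1.0, float(ema_norm / local_norm))
        ema_norm = (local_norm.detach() if ema_norm is None
                    else decay * ema_norm + (1 - decay) * local_norm.detach())

    for g in grads:
        g.mul_(scale)
        dist.all_reduce(g, op=dist.ReduceOp.SUM)  # aggregate across GPUs
        g.div_(world_size)
```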
5. Multi-Dimensional Data Curation Pipeline
The training corpus comprises image and video datasets such as SAM (11.1M), Anytext (1.8M), Panda70M (21.2M), VIDAL (2.8M), and internal sources. Filtering proceeds in multiple steps:
- Slicing videos into clips of up to 16 s.
- Jump-cut detection via LPIPS (removes 3%).
- Motion filtering (mean LPIPS within [0.001, 0.3]; retains 89%).
- OCR-based cropping (EasyOCR; up to 20% of frame edges removed).
- Aesthetic filtering (Laion aesthetic predictor score ≥ 4.75; 49% remain).
- Technical-quality filtering (DOVER score ≥ 0; 44% remain).
- Final motion double-check (42% remain).
- Captioning via LLaVA, InternVL2, and Qwen2-VL for images; Qwen2-VL and ShareGPT4Video for videos.
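As a rough illustration of the cascade above, the per-clip decision logic might look like the sketch below; every scorer here is a hypothetical placeholder standing in for the named tools (LPIPS, EasyOCR, the Laion aesthetic predictor, DOVER), and the final double-check reuses the motion bounds only for illustration.

```python
# Hypothetical per-clip filter mirroring the curation cascade; the scorer
# callables are placeholders, not real APIs of the underlying tools.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ClipFilters:
    has_jump_cut: Callable      # LPIPS-based cut detector
    mean_lpips: Callable        # frame-to-frame motion score
    crop_text_edges: Callable   # EasyOCR-guided edge cropping
    aesthetic_score: Callable   # Laion aesthetic predictor
    dover_score: Callable       # DOVER technical-quality score

    def keep(self, clip) -> bool:
        if self.has_jump_cut(clip):
            return False
        if not (0.001 <= self.mean_lpips(clip) <= 0.3):
            return False
        clip = self.crop_text_edges(clip)
        if self.aesthetic_score(clip) < 4.75:
            return False
        if self.dover_score(clip) < 0:
            return False
        # Final motion double-check (reusing the same bounds for illustration).
        return 0.001 <= self.mean_lpips(clip) <= 0.3
```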
6. Quantitative and Qualitative Evaluation
WF-VAE exhibits competitive or superior performance relative to CV-VAE, OD-VAE, Allegro, and CogVideoX. Reconstruction metrics (33 frames at 256² resolution):
- PSNR, LPIPS, SSIM, rFVD all indicate competitive or best-in-class results.
- WF-VAE-S processes 11.11 videos/s at 512²×33 frames using 4.7 GB of memory.
- Causal Cache preserves lossless tiling (no reconstruction degradation).
Text-to-video: Open-Sora v1.2 (1.2B) and Open-Sora-Plan v1.3 (2.7B) score strongly on VBench (Aesthetic = 59.00, Action = 81.8, Scene = 71.00; with the prompt refiner: Object = 84.7, Scene = 52.9, MTScore = 2.95).
Image-to-video and structure-video workflows are benchmarked via qualitative displays (I2V, transitions, arbitrary structure insertion).
7. Implementation and Usage
All Open-Sora-Plan resources and pretrained weights are distributed via https://github.com/PKU-YuanGroup/Open-Sora-Plan. The Python interface supports:
```python
from opensora import VideoGenerator

gen = VideoGenerator(model="opensora-plan-v1.3")
out = gen.generate(text="…", seed=42, steps=50, cfg=7.5)
```
The interface supports text-to-video, image-to-video, inpainting, and structure-guided modes, as well as multilingual prompting through the integrated refiner. Mixed-precision inference (fp16) and patch-based tiling are natively supported (Lin et al., 2024).
All components, pipelines, and evaluation metrics are detailed explicitly in the primary publication to facilitate direct reproduction and further research.