Open-Sora: Open-Source Video Synthesis
- Open-Sora is a community-driven open-source suite enabling state-of-the-art text-to-video and image-to-video synthesis with a transparent, modular design.
- It utilizes a diffusion transformer and dual-stage 3D autoencoder to achieve high-fidelity video generation and efficient spatial-temporal compression.
- The platform democratizes research by releasing code, datasets, and training methodologies, setting new baselines in generative video synthesis.
Open-Sora refers to a suite of fully open-source, state-of-the-art text-to-video (T2V), image-to-video (I2V), and related generative video models grounded in large-scale deep diffusion architectures with transparent training, evaluation, and code release. The Open-Sora platform emerged in 2024–2025 as a community-driven response to proprietary SORA-style video generators, with the explicit objective to democratize access to scalable video synthesis, establish rigorous research baselines, and accelerate methodological advances across creative content, HCI, and AI vision research domains (Zheng et al., 2024, Peng et al., 12 Mar 2025, Lin et al., 2024, Zeng et al., 2024).
1. Design Philosophy and Open-Source Objectives
Open-Sora was established to address the significant disparity between closed, large-scale generative video models (e.g., OpenAI Sora) and the accessibility needs of the global research and developer community. Its core principles are:
- Full transparency: Datasets, code, training scripts, pre-trained weights, and detailed data curation recipes are all publicly released.
- Generality: The architecture supports text-to-image, text-to-video, image-to-video synthesis, and allows arbitrary aspect ratios, resolutions up to 720p+ (trained and fine-tuned on up to 1080p), and video durations up to 16 seconds, extensible in newer versions (Zheng et al., 2024, Peng et al., 12 Mar 2025).
- Modularity: The codebase is organized for extensible research—major components (autoencoder, diffusion transformer, conditioning heads) can be interchanged or adapted.
By leveraging open pretrained models (e.g., PixArt-Σ for images), state-of-the-art spatial-temporal compression, and scalable GPU training schedules, Open-Sora provides an open laboratory for reproducible research, benchmarking, and innovation (Zheng et al., 2024, Lin et al., 2024, Zeng et al., 2024).
2. Model Architecture and Training Pipeline
2.1. Spatial-Temporal Diffusion Transformer (STDiT)
The core generator in Open-Sora is a DiT-based (Diffusion Transformer) architecture, with spatial and temporal self-attention decoupled for both efficiency and fidelity (Zheng et al., 2024, Peng et al., 12 Mar 2025, Zeng et al., 2024). At each layer, the model alternates:
- Spatial self-attention (within-frame): Operates on the $S = H' \times W'$ tokens of each frame independently.
- Temporal self-attention (across-frames): For each fixed spatial location $s$, attends across all $T$ frames.
Additional architectural components:
- Rotary positional embeddings (RoPE) for robust temporal modeling.
- QK-normalization to stabilize attention, with layerwise normalization of query/key matrices.
- Classifier-free guidance and cross-attention layers for prompt conditioning.
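The alternating spatial/temporal factorization above can be sketched as a pair of reshapes around a standard self-attention. This is a minimal, illustrative sketch (single head, no projections, no RoPE or QK-norm), not the actual Open-Sora module:

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(x):
    # Minimal single-head self-attention (no learned projections) over axis 1.
    scores = softmax(x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1]))
    return scores @ x

def stdit_block(x):
    # x: (B, T frames, S spatial tokens, C channels)
    B, T, S, C = x.shape
    # Spatial attention: tokens within each frame attend to each other.
    xs = self_attn(x.reshape(B * T, S, C)).reshape(B, T, S, C)
    # Temporal attention: each spatial location attends across all T frames.
    xt = xs.transpose(0, 2, 1, 3).reshape(B * S, T, C)
    xt = self_attn(xt).reshape(B, S, T, C).transpose(0, 2, 1, 3)
    return x + xt  # residual connection

x = np.random.randn(2, 8, 16, 64)   # 2 clips, 8 frames, 16 tokens, dim 64
print(stdit_block(x).shape)         # (2, 8, 16, 64)
```

The key efficiency gain is that each attention operates over $S$ or $T$ tokens at a time rather than the full $S \cdot T$ sequence.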
2.2. Highly Compressive 3D Autoencoder
To enable high-resolution, long-duration synthesis without prohibitive compute cost, Open-Sora uses a two-stage stacked VAE:
- Stage 1: Pretrained 2D VAE (e.g., from SDXL), compressing each frame spatially by 8×8.
- Stage 2: Trainable 3D VAE, compressing temporally by 4×, for an aggregate 4×8×8 compression of the video tensor (in the 1.2 release) (Zheng et al., 2024).
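The compute savings from stacked compression can be made concrete with back-of-envelope arithmetic. This sketch assumes the 4× temporal / 8×8 spatial factors described above; the latent channel count is illustrative:

```python
# Latent-grid shape after the two-stage VAE (factors: 4x temporal, 8x8 spatial).
def latent_shape(frames, height, width, t_down=4, s_down=8, latent_ch=4):
    # latent_ch is an illustrative channel count; real VAEs vary.
    return (frames // t_down, latent_ch, height // s_down, width // s_down)

# A 16 s, 24 fps, 720p clip:
T, H, W = 16 * 24, 720, 1280
shape = latent_shape(T, H, W)
print(shape)                                   # (96, 4, 90, 160)

pixels = T * H * W * 3                         # RGB pixel count
latents = 1
for d in shape:
    latents *= d
print(pixels / latents)                        # 192.0x fewer elements
```

The denoiser thus operates on roughly two orders of magnitude fewer elements than raw pixels, which is what makes long, high-resolution clips tractable.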
Open-Sora Plan (a parallel development) introduces a multi-level 3D Haar Wavelet-Flow VAE, decomposing video tensors into frequency subbands for efficient coding and fast causal block-wise inference, increasing throughput while maintaining PSNR ≈ 32 dB and LPIPS ≈ 0.051 (Lin et al., 2024).
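The building block of such a wavelet VAE is the Haar split: low-frequency averages and high-frequency differences of adjacent samples. A minimal one-level split along the temporal axis (a toy illustration, not the Wavelet-Flow VAE itself):

```python
import numpy as np

def haar_1d(x, axis=0):
    # One-level Haar transform: pairwise averages (low band) and
    # differences (high band), orthonormally scaled.
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

frames = np.arange(8, dtype=float)   # toy "video" of 8 scalar frames
low, high = haar_1d(frames)
# Perfect reconstruction: even = (low + high)/sqrt(2), odd = (low - high)/sqrt(2)
print(low.shape, high.shape)         # (4,) (4,)
```

Recursing the split on the low band yields the multi-level frequency subbands the text describes; smooth video concentrates energy in the low band, which is what makes the representation efficient to code.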
2.3. Conditioning and Controller Modules
In addition to text-based prompt conditioning (via T5-XXL, mT5-XXL, or CLIP), Open-Sora Plan implements:
- Image/Mask Controllers: Allows image-to-video, transition, and continuation via temporal inpainting with binary and structured mask concatenation.
- Structure Controllers: Accept canny edges, depth maps, and sketches as auxiliary input, mapped through small 3D convolutional encoders and projected into denoiser blocks (Lin et al., 2024).
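Temporal-inpainting conditioning of this kind is commonly implemented by zeroing the unknown frames and concatenating a binary mask channel. A hedged sketch (function and channel layout are illustrative, not Open-Sora Plan's actual interface):

```python
import numpy as np

def build_i2v_condition(latent, known_frames):
    # latent: (T, C, H, W); known_frames: frame indices the model must preserve
    # (e.g., [0] for image-to-video, [0, T-1] for transition).
    T, C, H, W = latent.shape
    mask = np.zeros((T, 1, H, W), dtype=latent.dtype)
    masked = np.zeros_like(latent)
    for t in known_frames:
        mask[t] = 1.0
        masked[t] = latent[t]
    # Concatenate along channels; the noisy latent is concatenated
    # separately in the denoiser input.
    return np.concatenate([masked, mask], axis=1)

z = np.random.randn(16, 4, 32, 32)
cond = build_i2v_condition(z, known_frames=[0])
print(cond.shape)  # (16, 5, 32, 32)
```

The mask channel tells the denoiser which frames are given versus to be synthesized, which is what unifies I2V, transition, and continuation under one mechanism.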
3. Training Strategies, Data Curation, and System Optimization
3.1. Data Curation
Data pipelines filter tens of millions of video clips through multi-stage selection:
- Remove short clips (<2 s), low bits-per-pixel encodes, anomalous aspect ratios, and low frame-rate outliers.
- Apply five-way filtering: CLIP-based aesthetics, motion intensity (VMAF), blur/clarity (Laplacian variance), OCR text coverage, and camera jitter (Peng et al., 12 Mar 2025).
- Multi-stage curriculum: initial training on lower-res/shorter clips (e.g., 256px T2V, 70M clips), progressing to high-res (768px) with more selective, high-quality subsets.
Captioning leverages LLaVA-Video, Qwen 2.5 Max, and motion-intensity scores for improved semantic control.
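The multi-stage selection above amounts to composing per-clip predicates over precomputed metadata. A toy sketch of the pattern; all field names and thresholds here are hypothetical, not Open-Sora's published values:

```python
# Each clip carries precomputed scores from the filtering models.
CLIPS = [
    {"id": "a", "dur": 1.5, "aesthetic": 5.2, "blur_var": 120, "ocr_cov": 0.01},
    {"id": "b", "dur": 8.0, "aesthetic": 5.6, "blur_var": 310, "ocr_cov": 0.02},
    {"id": "c", "dur": 6.0, "aesthetic": 3.9, "blur_var": 280, "ocr_cov": 0.40},
]

FILTERS = [
    lambda c: c["dur"] >= 2.0,        # drop clips shorter than 2 s
    lambda c: c["aesthetic"] >= 4.5,  # CLIP-based aesthetic score
    lambda c: c["blur_var"] >= 150,   # Laplacian-variance sharpness
    lambda c: c["ocr_cov"] <= 0.05,   # limit on-screen text coverage
]

kept = [c["id"] for c in CLIPS if all(f(c) for f in FILTERS)]
print(kept)  # ['b']
```

Tightening the thresholds between curriculum stages is what yields the progressively smaller, higher-quality subsets used for high-resolution training.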
3.2. Training/Inference Schedules
- Three-stage training (Open-Sora 2.0): progressive adaptation and upscaling from 256px T2V to 768px T/I2V with bucketed batch construction for constant token budget (Peng et al., 12 Mar 2025).
- Parallelism: Data parallel (ZeRO2), context and tensor parallel, activation checkpointing, and advanced CPU offloading (ColossalAI, PyTorch 2.0 compile, Triton kernels).
- Cost: Large-scale models (11B parameters) trained to near-global SOTA at ≈$200k USD, with explicit breakdown by GPU-days and optimization for efficiency.
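Bucketed batch construction for a constant token budget can be sketched as follows: group clips by (resolution, duration) and size each bucket's batch so every step processes roughly the same number of latent tokens. Numbers and downsampling factors here are illustrative:

```python
TOKEN_BUDGET = 32_768  # illustrative per-step token budget

def tokens_per_clip(frames, height, width, t_down=4, s_down=8, patch=2):
    # Latent tokens after VAE downsampling and 2x2 patchification.
    return (frames // t_down) * (height // (s_down * patch)) * (width // (s_down * patch))

def bucket_batch_size(frames, height, width):
    return max(1, TOKEN_BUDGET // tokens_per_clip(frames, height, width))

for shape in [(32, 256, 256), (64, 512, 512), (128, 768, 768)]:
    print(shape, "-> batch size", bucket_batch_size(*shape))
```

Small/short clips get large batches and large/long clips small ones, keeping per-step memory and compute roughly constant across mixed-resolution training.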
3.3. Inference Optimization and Device Deployment
On-device Sora applies three architectural accelerations for mobile inference, all training-free:
- Linear Proportional Leap (LPL): Runs only $n \ll K$ denoising steps, then linearly extrapolates the latent trajectory to the final step, cutting diffusion iterations.
- Temporal Dimension Token Merging (TDTM): Merges consecutive temporal token pairs, reducing attention cost from $O((ST)^2)$ to $O((ST/2)^2)$.
- Concurrent Inference with Dynamic Loading (CI-DL): Pipelines model-block loading and computation, reusing as much as fits in RAM, with quantization of T5 and custom memory scheduling for CoreML deployment (Kim et al., 31 Mar 2025, Kim et al., 5 Feb 2025).
VBench metrics show only a 2–4% quality trade-off (FVD, subject consistency) for 2–4× speedup on commodity smartphones (Kim et al., 31 Mar 2025).
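The LPL idea can be sketched in a few lines: run a short prefix of the sampler, then continue the last step's direction for the remaining steps. This is a toy illustration with a stand-in denoiser, not the paper's solver:

```python
import numpy as np

def denoise_step(z, t):
    # Toy stand-in for one solver step of the real diffusion model.
    return z * 0.95

def sample_with_lpl(z, K=50, n=10):
    # Run only n << K real denoising steps...
    for t in range(n):
        z_prev, z = z, denoise_step(z, t)
    # ...then leap: extrapolate the (empirically near-linear) trajectory
    # straight through the remaining K - n steps.
    return z + (z - z_prev) * (K - n)

z0 = np.random.randn(4, 8, 8)
print(sample_with_lpl(z0).shape)  # (4, 8, 8)
```

The method trades a small fidelity loss (the 2–4% VBench drop cited above) for skipping the bulk of the denoising iterations.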
4. Quantitative Results and Benchmarking
4.1. Evaluation Metrics
- VBench: Multi-dimensional, including Subject Consistency, Flickering, Aesthetics, Imaging Quality, Action, Object Classification, Scene, Spatial, Multi-object, and GPT4o Score (Zheng et al., 2024, Lin et al., 2024, Peng et al., 12 Mar 2025).
- FVD: Fréchet Video Distance, lower is better.
- LPIPS, PSNR, SSIM: For VAE reconstruction quality.
- Human preference scores: Blind A/B, multi-aspect criteria.
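FVD is the Fréchet distance between Gaussian statistics of features from a pretrained video network (typically I3D). A minimal sketch using random stand-in features and, for simplicity, diagonal covariances (the real metric uses full covariance matrices):

```python
import numpy as np

def frechet_diag(mu1, var1, mu2, var2):
    # Frechet distance between two Gaussians with diagonal covariances:
    # |mu1 - mu2|^2 + sum(var1 + var2 - 2*sqrt(var1*var2))
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.sum(var1 + var2 - 2 * np.sqrt(var1 * var2)))

real = np.random.randn(1000, 16)        # stand-ins for I3D features
fake = np.random.randn(1000, 16) + 0.5  # shifted distribution
d = frechet_diag(real.mean(0), real.var(0), fake.mean(0), fake.var(0))
print(d > 0)  # True; identical statistics give exactly 0
```

Because the distance is taken in feature space, it penalizes both per-frame quality and temporal dynamics, which per-frame metrics like PSNR miss.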
4.2. Comparative Performance Table
| Model/Version | VBench (Total) | FVD | PSNR (dB) | SSIM | Cost ($k) | Max Res, Duration |
|---|---|---|---|---|---|---|
| OpenAI Sora | 88.2 | — | — | — | N/A | 1080p, up to 1 min |
| Open-Sora 2.0 | 87.5 | — | 30.5 | 0.86 | 200 | 768p, 8s+ |
| Open-Sora 1.2 | 83.8 | — | 30.6 | 0.88 | — | 720p, 16s |
| OpenSoraPlan 1.3* | 68.4†–71.0 | 186 | 32.3 | 0.05‡ | — | 640p, 8s |

† GPT4o MTScore; ‡ LPIPS, lower is better.
- Open-Sora 2.0's performance gap to closed Sora is <1% on VBench, while being 10×–20× cheaper to train (Peng et al., 12 Mar 2025).
- Human preference (blind): Open-Sora 2.0 win rate over Runway Gen-3 Alpha is 56–60%; over HunyuanVideo, 63–65% (Peng et al., 12 Mar 2025).
5. Failure Modes, Limitations, and Safety Concerns
- Physical realism & temporal consistency: Open-Sora shares known Sora-family challenges: object permanence failures, left/right confusion, scene-cut incoherence, rigid object deformations.
- Bias and fairness: Without balanced training distributions, severe gender and occupation bias is observed. Tallying generated genders over repeated runs of each occupation prompt reveals extreme disparity ratios (e.g., "Muscular": 10M/0F; "Nurse": 0M/10F; "CEO": 8M/2F) (Nadeem et al., 2024).
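The disparity tallies cited above reduce to a simple majority-to-minority ratio per prompt. A toy computation over the cited counts (the `disparity` helper and epsilon handling are illustrative, not the paper's exact metric):

```python
# (male count, female count) over 10 generations per prompt, from the
# examples cited above.
counts = {"Muscular": (10, 0), "Nurse": (0, 10), "CEO": (8, 2)}

def disparity(m, f, eps=1e-6):
    # Ratio of majority to minority gender count; eps avoids division by zero
    # when one gender never appears.
    hi, lo = max(m, f), min(m, f)
    return hi / (lo + eps)

for prompt, (m, f) in counts.items():
    print(prompt, disparity(m, f))
```

A balanced model would score near 1 on every prompt; unbounded ratios (as for "Muscular" and "Nurse") indicate one gender is never generated at all.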
6. Ecosystem, Community Impact, and Future Directions
7. References
Open-Sora establishes an extensible, rigorously evaluated platform for large-scale generative video research, targeting both high-fidelity synthesis and transparent, reproducible community science.