
SANA-Video: Efficient Video Synthesis

Updated 30 September 2025
  • SANA-Video is an efficient video generation framework that uses advanced diffusion transformer architectures, linear attention, and constant-memory key-value caching to produce high-resolution, minute-long videos.
  • It employs block-wise autoregressive strategies and a Spatial–Temporal Mix-FFN to capture motion dynamics and maintain temporal continuity with fixed memory requirements.
  • The model achieves competitive semantic alignment and video quality while reducing latency by up to 16× compared to other contemporary video synthesis systems.

SANA-Video is an efficient video generation framework that leverages advanced diffusion transformer architectures to produce high-resolution, minute-long videos with strong semantic alignment, low latency, and reduced computational cost. Designed for text-to-video and image-to-video synthesis, SANA-Video achieves competitive quality relative to leading contemporaneous models, but with a marked focus on architectural efficiency via linear attention and constant-memory key-value caching. This technology enables practical deployment on commodity hardware and facilitates real-time applications in creative industries, simulation environments, and robotics.

1. Architectural Foundations: Linear Diffusion Transformer

The backbone of SANA-Video is the Linear Diffusion Transformer ("Linear DiT"), which differs from standard self-attention mechanisms by employing linear attention. The complexity is reduced from O(N^2) to O(N), enabling scalable handling of the large token count inherent in video synthesis. The linear attention module utilizes a ReLU-based kernel, with the key innovation being the application of Rotary Position Embeddings (RoPE) after kernel activation in the numerator, and the omission of positional encoding in the denominator to avoid instability. Formally, the attention output for the i-th token is:

O_i = \frac{\text{RoPE}(\phi(Q_i)) \cdot \sum_{j=1}^N \text{RoPE}(\phi(K_j))^\top V_j}{\phi(Q_i) \cdot \sum_{j=1}^N \phi(K_j)^\top}

where \phi(\cdot) denotes the activation kernel (ReLU).
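As a concrete illustration, the following is a minimal PyTorch-style sketch of this attention form; the tensor shapes, the `apply_rope` helper, and the batching layout are illustrative assumptions rather than the released implementation:

```python
import torch
import torch.nn.functional as F

def apply_rope(x, cos, sin):
    # Hypothetical RoPE helper: rotates feature pairs by per-position angles.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def linear_attention(Q, K, V, cos, sin, eps=1e-6):
    """Sketch of ReLU-kernel linear attention with RoPE applied after the kernel.

    Q, K, V: (batch, tokens, dim); cos, sin: (tokens, dim // 2).
    RoPE is applied to the kernel features in the numerator only,
    matching the formula above.
    """
    phi_q, phi_k = F.relu(Q), F.relu(K)             # kernel activation phi(.)
    q_rot = apply_rope(phi_q, cos, sin)             # RoPE only in the numerator
    k_rot = apply_rope(phi_k, cos, sin)
    kv = torch.einsum('bnd,bne->bde', k_rot, V)     # sum_j RoPE(phi(K_j))^T V_j
    k_sum = phi_k.sum(dim=1)                        # sum_j phi(K_j), no RoPE
    numer = torch.einsum('bnd,bde->bne', q_rot, kv)
    denom = torch.einsum('bnd,bd->bn', phi_q, k_sum).unsqueeze(-1) + eps
    return numer / denom
```

Because the key statistics are summed once and reused for every query, the cost grows linearly with the number of tokens rather than quadratically.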

A crucial design component for temporal modeling is the Spatial–Temporal Mix-FFN, which includes a 1D temporal convolution layer, enhancing the model's ability to capture motion dynamics and maintain temporal continuity across frames.
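A minimal sketch of such a block is given below; the layer widths, kernel size, activation, and depthwise-convolution choice are assumptions for exposition and may differ from the published architecture:

```python
import torch
import torch.nn as nn

class SpatialTemporalMixFFN(nn.Module):
    """Illustrative sketch of a Mix-FFN with a 1D temporal convolution.

    Layer widths, kernel size, and ordering are assumptions for exposition;
    the published architecture may differ in detail.
    """
    def __init__(self, dim, hidden_dim, temporal_kernel=3):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        # Depthwise 1D conv mixes information across frames for each channel.
        self.temporal_conv = nn.Conv1d(hidden_dim, hidden_dim, temporal_kernel,
                                       padding=temporal_kernel // 2,
                                       groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, num_frames):
        # x: (batch, frames * tokens_per_frame, dim)
        b, n, d = x.shape
        tokens = n // num_frames
        h = self.fc1(x)                                        # (b, n, hidden)
        h = h.view(b, num_frames, tokens, -1)                  # split frame axis
        h = h.permute(0, 2, 3, 1).reshape(b * tokens, -1, num_frames)
        h = self.temporal_conv(h)                              # mix across time
        h = h.reshape(b, tokens, -1, num_frames).permute(0, 3, 1, 2)
        h = h.reshape(b, n, -1)
        return self.fc2(self.act(h))
```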

2. Block-Wise Autoregressive Constant-Memory KV Cache

Traditional transformer-based approaches use a key-value (KV) cache whose memory footprint scales with the number of processed tokens, posing a bottleneck for long-form video. SANA-Video circumvents this via a block-wise autoregressive Linear DiT with a constant-memory KV state, drawing from the cumulative properties of linear attention. At each block, the cumulative state is updated via sums over kernel-transformed keys and key-value products, maintaining a global context for generation at a fixed O(D^2) memory cost, where D is the token dimension.

This mechanism enables efficient minute-long video generation without quadratic scaling of memory or runtime, and is central to SANA-Video's ability to process extensive video sequences under resource constraints.
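The recurrence can be illustrated with a small sketch, assuming a simple per-block update of two cumulative statistics; RoPE and other details are omitted, and the function and variable names are hypothetical:

```python
import torch
import torch.nn.functional as F

def init_state(dim, value_dim, device='cpu'):
    """Cumulative linear-attention state: O(D^2) memory, independent of video length."""
    kv_state = torch.zeros(dim, value_dim, device=device)   # sum_j phi(K_j)^T V_j
    k_state = torch.zeros(dim, device=device)                # sum_j phi(K_j)
    return kv_state, k_state

def attend_block(Q_blk, K_blk, V_blk, kv_state, k_state, eps=1e-6):
    """Process one autoregressive block against the running global state.

    Q_blk, K_blk, V_blk: (block_tokens, dim).
    """
    phi_q, phi_k = F.relu(Q_blk), F.relu(K_blk)
    # Fold the current block's keys/values into the cumulative state.
    kv_state = kv_state + phi_k.t() @ V_blk
    k_state = k_state + phi_k.sum(dim=0)
    # Attend using only the fixed-size state, not the full token history.
    numer = phi_q @ kv_state                                  # (block_tokens, value_dim)
    denom = (phi_q @ k_state).unsqueeze(-1) + eps             # (block_tokens, 1)
    return numer / denom, kv_state, k_state
```

Each new block folds its keys and values into `kv_state` and `k_state`, so the memory footprint stays fixed regardless of how many frames have already been generated.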

3. Training Efficiency and Cost Reduction

The training of SANA-Video employs data filters and block autoregressive strategies to reduce overall training cost. The total training duration is reported as 12 days on 64 NVIDIA H100 GPUs, equivalent to about 1% of the cost required for MovieGen, a large-scale contemporary video generation model.

Pre-training is performed on a text-to-image model, which is then extended to video synthesis, reducing the need for full retraining. During training, the model uses a specialized block curriculum to adapt effectively to long sequence contexts.

4. Performance Benchmarks and Latency

SANA-Video's performance is benchmarked against Wan2.1-1.3B and SkyReel-V2-1.3B across latency, quality, and semantic alignment metrics. On H100 GPUs, SANA-Video (2B variant) synthesizes a 5-second 720p video in 36s, a latency reduction of up to 16× relative to comparable models. On an RTX 5090 GPU with NVFP4 quantization (SVDQuant), generation time is further reduced from 71s (BF16) to 29s, a 2.4× speedup.

Model             Latency (s)   Total Score   Quality   Semantic
Wan2.1-1.3B       400           83.38         85.67     74.22
SkyReel-V2-1.3B   132           82.67         84.70     74.53
SANA-Video-2B     36            84.05         84.63     81.73

SANA-Video exhibits a higher semantic score, indicating robust text–video alignment, while maintaining output quality comparable to competing models.

5. Deployment and Precision Optimization

SANA-Video is deployable on RTX 5090 GPUs, making it accessible for both server-grade and prosumer settings. The adoption of NVFP4 precision via SVDQuant allows substantial acceleration of inference while retaining visual fidelity—a crucial requirement for on-device synthesis and real-time generation use cases.

6. Universal Tasks and Applications

SANA-Video supports multiple generative tasks: text-to-image (T2I), text-to-video (T2V), and image-to-video (I2V), unified under a common diffusion model architecture differentiated only by condition embeddings (a schematic sketch of this conditioning follows the list below). This efficiency extends its applicability to domains such as:

  • Creative video production,
  • Interactive media and storytelling,
  • Simulation environments (autonomous driving, embodied AI, game content),
  • On-device video editing and generative augmentation.
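As a schematic of the task unification described above, the sketch below conditions one shared backbone with a learned task embedding; the task identifiers, the additive fusion, and the module interface are assumptions, not the released SANA-Video API:

```python
import torch
import torch.nn as nn

class TaskConditionedDenoiser(nn.Module):
    """Schematic: one diffusion backbone, tasks distinguished only by condition embeddings.

    Task names, additive fusion, and the backbone interface are illustrative
    assumptions rather than the actual SANA-Video implementation.
    """
    TASKS = ('t2i', 't2v', 'i2v')

    def __init__(self, dim, backbone):
        super().__init__()
        self.backbone = backbone                          # shared Linear DiT stack
        self.task_embed = nn.Embedding(len(self.TASKS), dim)

    def forward(self, noisy_latents, text_cond, task):
        # Select the learned embedding for the requested task and fold it
        # into the text conditioning; the denoising backbone is unchanged.
        task_id = torch.tensor([self.TASKS.index(task)],
                               device=noisy_latents.device)
        cond = text_cond + self.task_embed(task_id)
        return self.backbone(noisy_latents, cond)
```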

Rapid synthesis and low compute requirements democratize high-quality video generation for both research and commercial applications.

7. Implications and Impact

The design principles underlying SANA-Video—linear attention, block-wise constant-memory caching, and quantization-friendly architecture—offer a template for future efficient video synthesis. Its ability to scale to minute-length, high-resolution videos at a fraction of previous costs highlights the potential for widespread deployment in domains where low latency and resource efficiency are critical.

The ability to maintain strong semantic alignment, competitive VBench quality, and universal generative flexibility positions SANA-Video as a reference model for next-generation video synthesis systems. Its architectural innovations also suggest further directions in model efficiency for scaling temporally and spatially complex generative tasks.
