SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

Published 14 May 2026 in cs.CV | (2605.15178v1)

Abstract: We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only $\sim$213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at $36\times$ higher throughput for scalable world modeling.

Authors (9)

Summary

The paper introduces a hybrid linear diffusion transformer that integrates GDN and softmax attention for minute-scale, high-fidelity 720p video generation.
It employs dual-branch camera control with UCPE and Plücker ray mixing to achieve precise 6-DoF trajectory adherence in action-conditioned scenarios.
The distilled variant denoises a 60s 720p clip in 34s on a single RTX 5090 with NVFP4 quantization, and the model reaches up to 36x higher throughput with stronger action-following accuracy than prior open-source baselines.

Authoritative Technical Summary of "SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer"

Motivation and Context

World models that generate action-conditioned videos are crucial for simulation, embodied AI, and robotics. However, existing systems capable of minute-scale rollouts often suffer from prohibitive training and inference costs, excessive model sizes, reliance on proprietary or massive datasets, and limited support for precise camera control at high resolution. SANA-WM addresses the practical requirements for accessible, high-fidelity, action-conditioned 720p video generation on minute timescales, enabling single-GPU inference and efficient training regimes.

Architecture and Methodology

SANA-WM is a 2.6B-parameter diffusion-based world model, natively trained for one-minute 720p video generation with precise camera trajectory conditioning. The architecture integrates four core technical contributions:

Hybrid Linear Attention Backbone:

SANA-WM interleaves frame-wise Gated DeltaNet (GDN) blocks with periodic softmax attention. The GDN recurrence aggregates context in a fixed-size state, which keeps memory use low, and its decay gates and delta-rule corrections mitigate token drift. Every fourth block uses softmax attention to recover the exact long-range dependencies needed for high-fidelity minute-scale rollouts. The backbone operates over compressed LTX2 latents, producing up to 8x higher temporal efficiency than prior approaches.
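To make the interleaving and the recurrence concrete, here is a minimal pure-Python sketch. It assumes the standard gated delta-rule form S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T with readout o_t = S_t q_t, applied one token at a time. The function names, the choice of which block index in each group of four is the softmax block, and the toy dimensions are illustrative assumptions, not the paper's implementation; the frame-wise variant and fused kernels are not reproduced.

```python
def block_kinds(num_blocks, softmax_every=4):
    """Label each block; every `softmax_every`-th block uses softmax attention.

    Placing the softmax block last in each group is an assumption.
    """
    return ["softmax" if (i + 1) % softmax_every == 0 else "gdn"
            for i in range(num_blocks)]


def gated_delta_step(S, q, k, v, alpha, beta):
    """One gated delta-rule update on a state S of shape (d_v, d_k).

    alpha in [0, 1] decays the old state; beta in [0, 1] controls how
    strongly the value stored under key k is overwritten by v.
    """
    d_v, d_k = len(S), len(k)
    # Read what the state currently stores under key k: (S k).
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Erase the old association, decay, then write the new one.
    S_new = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out


# Writing twice under the same unit-norm key replaces the stored value.
S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
S, out1 = gated_delta_step(S, key, key, [2.0, 3.0], alpha=1.0, beta=1.0)
S, out2 = gated_delta_step(S, key, key, [5.0, 7.0], alpha=1.0, beta=1.0)
print(block_kinds(8))
print(out1, out2)
```

The state has a fixed size regardless of sequence length, which is what keeps memory flat as the rollout grows; the softmax blocks are the only ones whose cost depends on the full context.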

Dual-Branch Camera Control:

Precise 6-DoF camera control is achieved via a dual-rate mechanism. The coarse latent-branch employs UCPE (Unified Camera Positional Encoding) to anchor global trajectory consistency, while the fine-grained raw-frame branch uses Plücker ray mixing to restore detailed per-frame camera motion within VAE strides. This design ensures metric-scale adherence and robust action-following even under aggressive video compression and long horizons.
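As an illustration of the per-frame ray signal, the sketch below computes Plücker coordinates (direction d, moment o x d) for a single pixel of a pinhole camera. The function names, the camera-to-world convention, and the (d, m) ordering are assumptions for illustration; the paper's exact encoding and its mixing with the latent branch are not reproduced here.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def pixel_ray_plucker(u, v, fx, fy, cx, cy, R, t):
    """Plücker coordinates of the ray through pixel (u, v).

    R is an assumed camera-to-world rotation (3x3 nested lists) and t is
    the camera centre in world coordinates. Returns (dx, dy, dz, mx, my, mz).
    """
    # Back-project the pixel to a direction in the camera frame.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into the world frame and normalise.
    d_world = normalize(tuple(sum(R[i][j] * d_cam[j] for j in range(3))
                              for i in range(3)))
    # The moment encodes where the ray sits in space, not just its direction.
    moment = cross(t, d_world)
    return d_world + moment


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ray = pixel_ray_plucker(320.0, 240.0, 500.0, 500.0, 320.0, 240.0,
                        identity, (1.0, 0.0, 0.0))
print(ray)
```

Because the moment depends on the camera centre, two frames with the same orientation but different positions produce different ray maps, which is why this representation can carry fine per-frame translation.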

Two-Stage Visual Refinement:

A dedicated refiner, initialized from the 17B LTX-2 model and trained via truncated-σ flow matching, operates as a second stage. The refiner sharpens details, corrects structural artifacts, and stabilizes scene identity across minute-long rollouts. Reference conditioning and LoRA adapters ensure identity preservation and lightweight finetuning for long-horizon sequences.

Robust Annotation and Benchmarking Pipeline:

The training corpus combines ~213K metric-scale pose-annotated video clips. Annotation leverages modified VIPE with Pi3X and MoGe-2 depth for consistent metric pose extraction from public sources. Synthetic augmentation using 3D Gaussian splatting and DiFix3D refinement expands static-3D domains to minute-scale ground-truth trajectories. Rigorous filtering, camera-calibration checks, and scene-static captions yield a high-quality corpus suitable for benchmarking action-following and visual quality over 80 diverse scenes and trajectories.

Training Protocol and Efficiency Analysis

SANA-WM employs a progressive multi-stage training pipeline:

Stage 1: VAE adaptation (LTX2) for latent compression.
Stage 2: Backbone transition to hybrid GDN/softmax blocks.
Stage 3: Extension to minute-scale sequences and integration of dual camera control.
Stage 4: Chunk-causal fine-tuning and few-step autoregressive distillation.

Context-Parallel training shards minute-scale clips across 64 H100 GPUs, enabling full-context recurrent composition and halo exchange for convolutional layers. Triton-fused kernels accelerate GDN operations. Full training completes in 15 days on 64 H100s. Inference supports three modes — bidirectional, chunk-causal, and distilled — with single-GPU deployment.
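The sharding idea can be sketched as follows: each rank owns a contiguous range of latent frames and borrows a few boundary frames (the halo) from its neighbours so that temporal convolutions see the same inputs they would on an unsharded clip. The function name, the even-split policy, and the frame counts in the example are hypothetical; the paper's actual partitioning and communication schedule are not described in this summary.

```python
def shard_with_halo(num_frames, num_ranks, halo):
    """Split frame indices into contiguous shards with neighbour halos.

    Ranges are half-open (start, end). Halos are clipped at the clip
    boundaries, so the first and last ranks have an empty outer halo.
    """
    base, extra = divmod(num_frames, num_ranks)
    shards = []
    start = 0
    for rank in range(num_ranks):
        # Spread the remainder over the first `extra` ranks.
        size = base + (1 if rank < extra else 0)
        end = start + size
        shards.append({
            "rank": rank,
            "own": (start, end),
            "left_halo": (max(0, start - halo), start),
            "right_halo": (end, min(num_frames, end + halo)),
        })
        start = end
    return shards


for shard in shard_with_halo(num_frames=10, num_ranks=3, halo=1):
    print(shard)
```

For the recurrent GDN blocks, the analogous exchange would pass the running state from one rank to the next rather than raw frames; that part is omitted from this sketch.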

The distilled variant denoises a minute-long 720p clip in 34s on a single RTX 5090 with NVFP4 quantization.
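A quick back-of-the-envelope check of the reported budget, using only the figures stated above (15 days, 64 H100s, 34s for a 60s clip). The variable names are illustrative, and the 34s figure is taken as denoising time only, as the abstract words it.

```python
# Training budget in GPU-hours.
train_days, num_gpus = 15, 64
gpu_hours = train_days * 24 * num_gpus

# How much faster than playback speed the distilled model denoises.
clip_seconds, denoise_seconds = 60, 34
speed_vs_realtime = clip_seconds / denoise_seconds

print(f"{gpu_hours} H100 GPU-hours")
print(f"{speed_vs_realtime:.2f}x faster than real time (denoising only)")
```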

Quantitative Evaluation and Ablations

SANA-WM was systematically benchmarked against recent baselines including LingBot-World [7], HY-WorldPlay [6], Infinite-World [8], and Matrix-Game 3.0 [9]. Key findings:

Action-Following Accuracy:

SANA-WM exhibits the lowest rotation and camera-motion error, with RotErr dropping to 4.50°/8.34° and CamMC to 1.41/1.44 post-refinement for simple/hard trajectory splits. These represent substantial improvements over all baseline methods.
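For readers unfamiliar with rotation error, a common definition is the geodesic angle between the predicted and reference rotation matrices, sketched below. This is the textbook formula, offered as an assumption about what RotErr measures; the paper's pose recovery from generated video, trajectory alignment, and per-frame aggregation are not reproduced, and the helper names are hypothetical.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_gt^T R_pred) equals the sum of elementwise products.
    tr = sum(R_gt[i][j] * R_pred[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (tr - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def rot_z(deg):
    """Rotation about the z axis by `deg` degrees (test helper)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


print(rotation_error_deg(rot_z(10.0), rot_z(0.0)))
```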

Visual Quality:

VBench scores reach 80.62/81.89 in overall visual fidelity, matching large industrial baselines while maintaining accessible compute requirements. The refiner reduces visual drift, with ΔIQ (temporal degradation) dropping from 3.79/3.09 to 1.17/0.31.

Inference Throughput and Memory:

SANA-WM achieves 24.1 videos/hour (single-GPU) for stage-1 and 22.0/hr (pipeline with refiner), up to 36x improvement in throughput versus 480p baselines. Bidirectional and chunk-causal variants consistently fit in a single H100, unlike all-softmax architectures which run out of memory at 60s.
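The throughput figures convert to per-clip wall-clock time as shown below. The calculation uses only the two rates quoted above; reading the difference as the amortized cost of the refiner stage is an interpretation, not a number reported in the summary.

```python
def seconds_per_video(videos_per_hour):
    """Convert a throughput in videos/hour to seconds per video."""
    return 3600.0 / videos_per_hour


stage1_s = seconds_per_video(24.1)    # stage-1 only
pipeline_s = seconds_per_video(22.0)  # stage-1 plus refiner
refiner_overhead_s = pipeline_s - stage1_s

print(f"stage-1: {stage1_s:.1f} s, pipeline: {pipeline_s:.1f} s, "
      f"refiner overhead: {refiner_overhead_s:.1f} s")
```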

Revisit Memory and Scene Consistency:

PSNR/SSIM/LPIPS metrics confirm strong loop-closure memory and minimal scene drift under hard trajectories.
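As a reminder of what the simplest of these metrics measures, here is a minimal PSNR sketch over flattened pixel lists. Using it to compare a frame from the first visit to a viewpoint with the frame generated when the camera returns is an assumption about the revisit protocol; the function name and the 8-bit range are illustrative.

```python
import math


def psnr(frame_a, frame_b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)


first_visit = [10.0, 20.0, 30.0, 40.0]
revisit = [11.0, 21.0, 31.0, 41.0]  # every pixel off by one level
print(psnr(first_visit, revisit))
```

Higher PSNR on revisit pairs means the regenerated view is closer to what was shown the first time, which is the sense in which the metric probes scene memory.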

Ablation Studies:

GDN key scaling (1/√DS): Only spatially-aware scaling guarantees training stability for minute-scale context.
Camera conditioning: Dual UCPE + Plücker mixing outperforms all alternatives on OmniWorld, minimizing pose errors.
Efficiency and latency: Hybrid GDN-softmax matches visual quality and consistency with sub-linear memory scaling, whereas all-softmax fails at scale.

Implications, Limitations, and Future Directions

SANA-WM fundamentally expands practical access to long-horizon world modeling at high resolution and precise action control. The solution is particularly relevant for academic simulation, interactive content, and robotics, where compute and data efficiency are essential. The architecture's advances in hybrid attention, camera conditioning, and data annotation are highly generalizable to other generative and simulation domains.

Current limitations include lack of explicit persistent 3D scene memory, susceptibility to drift in rare or highly dynamic settings, and scale-bound visual capacity. The method assumes high-quality pose annotation and struggles to generalize to uncommon environments or cultural contexts. The streamlined efficiency may drive broader adoption and research activity; however, provenance management and responsible deployment are critical, especially given potential risks from synthetic video in high-stakes applications.

Future work should investigate scaling to larger backbone architectures, further data augmentation, development of robot or fine-grained action control, persistent memory integration, and robust, real-time refiners for streaming deployment.

Conclusion

SANA-WM demonstrates that accessible, high-fidelity minute-scale world models with strong action conditioning and visual consistency are achievable through principled architectural design, efficient training pipelines, and robust annotation. The hybrid GDN/softmax transformer, dual camera conditioning, and long-video refinement collectively enable practical single-GPU 720p generation, setting new benchmarks for action-following and throughput within open-source world modeling. The framework materially advances simulation and embodied AI, providing a scalable and efficient foundation for future research (2605.15178).

Explain it Like I'm 14

What is this paper about?

This paper introduces SANA-WM, a computer system that can create long, high-quality videos (up to one minute, 720p) from a single starting image, a short text description, and a planned camera path. Think of it like a smart movie camera inside a virtual world: you tell it where the camera should move and what the scene should look like, and it generates the whole video while keeping the scene consistent.

The big idea: make these “world models” fast, accurate, and affordable to run on a single graphics card, instead of needing huge computers.

What questions did the researchers ask?

Can we train a video generator that follows precise camera movements for a full minute without needing massive datasets, long training times, or many GPUs?
Can we keep the scene stable and realistic over a long time, even while the camera turns, moves, and revisits places?
Can we match the visual quality of very large, industrial systems while being much more efficient?

How did they build the system?

To make minute-long, 720p video generation both accurate and efficient, the team combined four key ideas. You can think of each idea as a tool that solves a different problem:

Hybrid memory and attention (efficiency): Long videos mean lots of information. SANA-WM uses a mix of “fast memory” and “exact recall.” The fast memory (called Gated DeltaNet, or GDN) is like keeping compact notes so you don’t run out of space. The exact recall (softmax attention) is like occasionally going back to the full textbook to double-check important details. Mixing the two keeps generation fast without forgetting crucial context.
Dual-branch camera control (precision): The camera’s motion is described in 6 degrees of freedom (6-DoF): move left/right, up/down, forward/backward, and rotate around three axes. SANA-WM controls this with two “branches.” One branch follows the big, overall path (like a map). The other fixes fine motion inside each compressed video chunk (like tiny step-by-step instructions), so the camera sticks to the path exactly.
Two-stage generation (quality): The model first creates the full minute video, then a separate “refiner” pass cleans it up. Think of the refiner as a detail-enhancing photo editor that sharpens structure and reduces flicker across the whole minute.
Robust data pipeline (reliable training info): The team built a pipeline to estimate accurate camera positions and directions from public videos. This gives the model trustworthy “ground truth” for where the camera should be in 3D space, so it learns to follow paths precisely.

How they trained it, step by step:

Start small: train on short clips to make the basic system stable and efficient.
Scale up: extend to one-minute videos while adding the precise camera-control branches.
Make it practical: add an autoregressive mode (generate chunk by chunk) and distill the model so it needs only a few steps to sample, speeding up deployment.
Keep it compact: use a high-compression video tokenizer (LTX2) so the model processes fewer tokens while keeping detail, like zipping a file before sending it.

Hardware and data in plain terms:

Training used about 213,000 curated clips with good camera pose labels.
Full training took around 15 days on 64 NVIDIA H100 GPUs.
At test time, it can generate each 60-second, 720p video on a single GPU. The distilled version can do it on a single RTX 5090 in about 34 seconds with a special low-precision setting (NVFP4).

What did they find, and why is it important?

Here are the main results and why they matter:

Precise camera following: SANA-WM tracks the planned camera path more accurately than prior open-source systems, even on hard, twisty paths. This is crucial for simulations, robotics, and filmmaking where movement must match a plan.
High visual quality at 720p: With the refiner, its video quality is comparable to much bigger systems, but it runs on a single GPU per video. That means good-looking results without huge computing costs.
Big efficiency gains: On their one-minute benchmark, SANA-WM matches or beats previous baselines in action-following and achieves similar visual quality while offering up to 36x higher throughput. In simple terms: much faster to generate many videos.
Single-GPU inference: You don’t need a server farm. Minute-long, 720p generation on one GPU makes research and creative projects more accessible.
Stability over a minute: The refiner reduces long-term drift (scenes changing or getting blurry late in the video), so the look stays consistent from start to finish.

What could this change?

More accessible simulation and research: Schools, labs, and small studios can create long, controlled videos for testing robots, making interactive demos, or prototyping games without enormous budgets.
Better tools for camera-driven storytelling: Filmmakers and creators can sketch a camera path and quickly see a coherent, minute-long scene at decent resolution.
Stronger foundations for embodied AI: Robots and agents that learn from videos can get long, consistent training sequences where actions and camera movements matter.

The authors also note limits and future steps: the model still lacks an explicit 3D memory of the scene, can drift in very unusual views, and could benefit from scaling up data and improving real-time refinement.

Simple explanations of tricky terms

World model: A program that generates what you would “see” in a world as you move or act, like a virtual camera exploring a scene.
6-DoF camera: A camera that can move in 3D (left/right, up/down, forward/backward) and rotate around three axes (tilt, pan, roll).
Latents/tokenizer (LTX2): Compressed video “codes.” Like zipping video so the model handles fewer pieces while keeping detail.
Attention: A way for the model to decide which past info matters right now. Hybrid attention = mostly fast, compact memory plus some exact lookups to avoid forgetting.
Refiner: A second pass that cleans and sharpens the video after the first pass, reducing artifacts and flicker.
Distillation: Teaching a smaller or faster model to mimic a bigger/slower one, so it runs faster with similar output.

Quick recap

SANA-WM is a fast, precise, and practical system for making one-minute, 720p videos that follow exact camera paths. It reaches near state-of-the-art visual quality while running on a single GPU, thanks to a smart mix of efficient memory, precise camera control, a two-step generation process, and carefully labeled training data. This can open the door to more affordable simulation, robotics research, and creative video tools.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper advances minute-scale, camera-controlled video generation but leaves several concrete gaps and open questions. The following list highlights what remains uncertain or unexplored and suggests actionable directions for future work:

Long-horizon persistence without explicit 3D memory
- The model lacks persistent 3D/4D scene memory; recurrent GDN state alone may not preserve structure over revisits or very long rollouts. Evaluate and integrate explicit scene memories (e.g., BEV/occupancy, 3D Gaussian/NeRF, or feature grids) and quantify gains on multi-minute (5–10 min) trajectories with repeated revisits.
Dynamic-scene handling and object permanence
- Authors note drift in dynamic scenes; training data contains substantial static 3DGS renderings. Measure performance with controlled dynamic-object benchmarks (moving agents, occlusions) and explore conditioning on object/point tracks or dynamic state representations.
Control modalities beyond camera trajectories
- Control is limited to 6-DoF camera motion. Assess extensibility to robot/embodied actions, explicit object controls, point/landmark tracking, or higher-level event commands, including multi-branch conditioning interactions and conflict resolution.
Real-time and interactive use
- Distilled AR generation takes ~34–48 s for a 60 s sequence on a single GPU (not real-time). Characterize per-frame latency and responsiveness for interactive rollouts (e.g., streaming 30/60 fps with <100 ms control-to-frame latency) and develop a streaming-stage refiner compatible with causal constraints.
Robustness to camera intrinsics, distortion, and sensor pathologies
- Generation assumes pinhole intrinsics; robustness to rolling shutter, varying or erroneous intrinsics, lens distortion, and fisheye cameras is untested. Conduct stress tests with perturbed intrinsics and introduce distortion-aware conditioning and training augmentation.
Sensitivity to pose-annotation errors
- Camera-control accuracy is measured using poses recovered by Pi3X with Umeyama alignment; training also relies on annotated metric poses. Quantify sensitivity by injecting controlled noise/bias in poses during training and evaluation, and report failure thresholds and calibration strategies.
Benchmark realism and diversity
- The 1-minute benchmark uses initial images synthesized by Nano Banana Pro and two engineered trajectory families, which may not reflect real-world imagery or user-driven paths. Expand to real-image initializations, richer trajectory distributions (fast spins, sudden stops, erratic paths), and multi-visit, multi-minute scenarios.
Metric validity and dependence on estimators
- Action-following metrics depend on third-party pose estimators (Pi3X), which may be biased by generative artifacts. Cross-validate with multiple estimators and propose estimator-agnostic geometric metrics (e.g., reprojection consistency against known 3D scenes or rendered ground truth).
Human evaluation and perceptual validation
- Quality is assessed mainly with VBench; no human studies are reported. Add human preference tests and task-relevant assessments (e.g., navigability, layout consistency, and motion sickness proxies for VR).
Scaling laws and capacity limits
- The paper does not quantify how performance scales with model size, data size, or sequence length. Establish scaling curves for camera-following accuracy, temporal stability, and VBench under increased parameters, data, and >60 s horizons.
Compression/tokenizer trade-offs
- LTX2 compression is described as quality-neutral on short clips; its impact on fine-detail fidelity, text legibility, and thin structures over minute-long sequences is not deeply analyzed. Compare alternative tokenizers and investigate adaptive or learned spatiotemporal rates.
Hybrid attention design choices
- Every fourth block uses softmax attention; frequency and placement are not systematically ablated. Explore different interleavings, window sizes, and content-aware scheduling to balance recall and efficiency.
GDN variants and long-range forgetting
- Key-scaling stabilizes GDN, but the long-term forgetting behavior and information bottlenecks over minutes remain underexplored. Analyze memory horizons, salient-event retention, and hybridizing with sparse exact-attention or retrieval over learned caches.
Chunk-causal reset effects and very long rollouts
- Chunk resets in reversed-time scan may impair extremely long-horizon consistency. Evaluate chunk size vs. consistency trade-offs and test hours-long rollouts, reporting ΔIQ/PSNR drift and catastrophic failure modes.
Stage-2 refiner causality and streaming constraints
- The refiner is applied post hoc and initialized from a bidirectional model; causal, online-compatible refiners are absent. Develop and benchmark streaming refiners that preserve control and temporal causality.
Distillation stability over long horizons
- Few-step self-forcing distillation reduces latency, but its influence on long-horizon drift, mode collapse, or error accumulation is not fully characterized. Provide long-run degradation curves and robustness tests across unseen trajectories.
Generalization across domains and real-data bias
- Training mixes real and synthetic videos with heavy reliance on static 3DGS augmentation. Quantify domain gaps, report per-source performance, and test on out-of-distribution domains (e.g., underwater, night, extreme FOV).
Fairness of comparisons at different resolutions/budgets
- Comparisons include 480p vs. 720p and exclude some baselines at 720p under the chosen GPU budget. Provide matched-resolution and matched-budget comparisons, and sensitivity to decoding/frame rates.
Data release and reproducibility of annotations
- It is unclear whether the 213K metric-pose dataset and the benchmark are publicly released with detailed licenses. Release pose annotations, calibration, and quality scores; provide scripts for re-annotation and reproducibility audits.
Privacy and licensing of internet videos
- The paper does not detail licensing compliance and privacy protections for internet-sourced videos. Document provenance, licenses, and face/person redaction strategies; include a data governance plan.
Content safety and watermarking
- No strategy is presented for preventing misuse (e.g., deepfake world scenes) or for provenance/watermarking of generated videos. Evaluate robust watermarking and implement safety classifiers or guardrails.
Physics and semantic consistency
- Evaluation focuses on perceptual quality and camera following, not physical plausibility (e.g., gravity, collisions) or semantic stability (object counts, layout). Propose and report physics/semantic consistency metrics and train with physics-aware priors or constraints.
Multi-sensory outputs and audio synchronization
- The system is visual-only; audio or haptics are not modeled. Explore audio-conditioned generation and assess audiovisual synchronization with camera motion.
Multi-agent and crowd scenarios
- The approach does not evaluate multi-agent interactions or crowds, where temporal consistency and control complexities grow. Construct benchmarks with agent-rich scenes and test joint camera and agent control.
Robustness to adverse conditions
- Stress tests for motion blur, low light, HDR, specularities, rain/snow, and rapid rotations are missing. Add adversarial and adverse-condition suites and augment training accordingly.
Hardware portability and quantization effects
- NVFP4 results are reported for RTX 5090 only; broader portability (e.g., edge GPUs, Apple Silicon) and quantization-aware training impacts are not studied. Benchmark across hardware and analyze quality degradation vs. precision.
Energy use and environmental impact
- While efficiency is emphasized, energy consumption and carbon footprint of training/inference are not reported. Include energy metrics and compare against baselines on a per-minute-of-video basis.
Interpretability of geometric conditioning
- The dual UCPE + Plücker design improves control, but which aspects drive gains is unclear. Provide attention/feature attributions and analyze failure cases (e.g., rapid roll, parallax-heavy scenes).
Robustness to erroneous or adversarial conditioning
- The system’s behavior under inconsistent text/camera inputs or adversarial trajectories is untested. Evaluate failure modes and add safeguards (input validation, smoothing, or fallback strategies).
End-to-end training with tokenizer and refiner
- The backbone, tokenizer, and refiner are optimized in stages; end-to-end joint training (or alternating) might further improve fidelity and control. Investigate joint objectives and stability.

Practical Applications

Immediate Applications

The following applications can be deployed now using SANA-WM’s open-source model, data/benchmark assets, and single-GPU inference variants. Each item notes sector alignment, a plausible tool/workflow, and feasibility assumptions or dependencies.

Media, Film, and Advertising

Previsualization and storyboarding with precise camera paths (sector: media/VFX)
- What: Generate 60s, 720p animatics from a first frame + text + 6-DoF camera path to explore blocking, lenses, and motion beats.
- Tool/workflow: “CameraPath→Animatic” plugin for Blender/Unreal/DaVinci; Stage-1 fast search (chunk-causal AR) → Stage-2 refiner pass for shortlisted takes.
- Assumptions/dependencies: Single H100 or RTX 5090 class GPU; realistic scenes without heavy dynamic object motion; creators provide well-defined trajectories. Model license and content provenance practices in place.
Virtual production background plates and camera move rehearsals (sector: media/VFX)
- What: Produce controlled parallax backgrounds and iterate on dolly/crane/orbit paths before booking stages.
- Tool/workflow: Batch-generate camera path variants; keep identity via reference image; refine top choices.
- Assumptions/dependencies: Accepts that scene geometry is implicit (no explicit 3D mesh); minute-scale quality acceptable after refiner; on-set usage still requires safety and continuity checks.
Product flythroughs and promo content (sector: advertising/e-commerce)
- What: Controlled pans/orbits around a product scene seeded by a hero still or packshot.
- Tool/workflow: Preset library of camera paths (orbit, spiral, dolly), one-click Stage-2 refinement for sharpness.
- Assumptions/dependencies: Single image may not reveal unseen sides; content should be clearly labeled synthetic to avoid misrepresentation.
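A preset camera path such as an orbit can be sketched as a list of camera positions with look-at directions. The function name, the y-up coordinate convention, and the output format are hypothetical; converting these poses into the conditioning format SANA-WM expects is not covered by this summary.

```python
import math


def orbit_path(center, radius, height, num_frames):
    """Camera positions on a horizontal circle, each looking at `center`.

    Assumes a y-up world. Returns a list of dicts with a 'position' and a
    unit-length 'forward' vector per frame.
    """
    path = []
    for i in range(num_frames):
        angle = 2.0 * math.pi * i / num_frames
        position = (center[0] + radius * math.cos(angle),
                    center[1] + height,
                    center[2] + radius * math.sin(angle))
        # Point the camera from its position toward the orbit centre.
        to_center = tuple(c - p for c, p in zip(center, position))
        norm = math.sqrt(sum(x * x for x in to_center))
        forward = tuple(x / norm for x in to_center)
        path.append({"position": position, "forward": forward})
    return path


poses = orbit_path(center=(0.0, 0.0, 0.0), radius=2.0, height=0.0,
                   num_frames=4)
for pose in poses:
    print(pose)
```

A spiral or dolly preset follows the same pattern with the radius or the position along an axis varied per frame.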

Game Development

Rapid cutscene and level moodboard generation (sector: gaming/software)
- What: Generate camera-driven sequences for cutscene drafts and environment tone exploration.
- Tool/workflow: Record a controller/camera path in-engine; export to 6-DoF; feed to SANA-WM; iterate with chunk-causal generator.
- Assumptions/dependencies: No game physics or gameplay logic; uses visual plausibility only; integrate via Python/REST service around the open model.
Controller-conditioned prototyping of world models (sector: gaming R&D)
- What: Prototype action-conditioned rollouts (camera/gamepad) with minute-scale persistence to evaluate interaction pacing.
- Tool/workflow: Stream short segments with chunk-causal inference; buffer and refine on completion.
- Assumptions/dependencies: Near-real-time requires lower resolution or smaller steps; latency budget dictated by hardware.

Robotics and Computer Vision Research

Data augmentation with camera-conditioned synthetic videos (sector: robotics/CV)
- What: Generate long, viewpoint-controlled sequences to diversify training for pose, depth, SLAM, or VIO tasks.
- Tool/workflow: Use SANA-WM to render sequences under predefined ego-motion distributions; label with the provided 6-DoF trajectory; optionally re-annotate with the paper’s pose engine for robustness testing.
- Assumptions/dependencies: Visual domain gap will exist; best for pretraining/augmentation rather than sole data source.
Metric-scale camera-pose annotation of public videos (sector: CV tooling)
- What: Apply the paper’s VIPE + Pi3X + MoGe-2 pipeline to recover robust metric-scale trajectories from internet videos.
- Tool/workflow: “PoseAnnotator” batch tool to ingest videos, optimize per-frame intrinsics, and export poses with quality filters.
- Assumptions/dependencies: Requires GPU for depth/BA; success rates depend on video quality and motion content; follow dataset license terms.
Benchmarking long-horizon camera control and scene persistence (sector: academia/industry labs)
- What: Adopt the 60s benchmark and metrics (VBench, Pi3X-aligned pose accuracy) to compare world models under identical budgets.
- Tool/workflow: CI-integrated evaluation harness with trajectory suites (Simple/Hard).
- Assumptions/dependencies: Community uptake; ensure fair compute/memory disclosures when reporting.

AR/VR Prototyping and Education

Head-motion path previews and immersive scene drafts (sector: AR/VR, education)
- What: Convert recorded head/hand trajectories into minute-long previews for experience design or teaching cinematography.
- Tool/workflow: Capture trajectories from a headset controller, convert to 6-DoF camera paths, render with SANA-WM, and refine.
- Assumptions/dependencies: Not real-time; usable for concept validation and teaching, not final runtime content.
Classroom modules on camera geometry and motion design (sector: education)
- What: Demonstrate UCPE/Plücker-conditioned effects by varying trajectory parameters; compare visual outcomes.
- Tool/workflow: “Preset Path Lab” exercises with S-curve, orbit, spiral, and zigzag trajectories from the paper’s benchmark.
- Assumptions/dependencies: Access to a single modern GPU for live demos; curated prompts to avoid content risks.

Real Estate and Architecture

Quick mock-up walkthroughs from a hero frame (sector: real estate/AEC)
- What: Generate indoor/outdoor tours from listings or concept renders with predefined navigation paths.
- Tool/workflow: Template paths (room lookaround, corridor fly-through) + Stage-2 refiner for client-facing clips.
- Assumptions/dependencies: No ground-truth metrology; not a replacement for accurate digital twins; disclose synthetic nature.

Software/Infrastructure

Single-GPU long-video generation service (sector: software/SaaS)
- What: Offer an API that turns an image + text + camera spline into 60s 720p clips, with optional refinement tier.
- Tool/workflow: Containerized SANA-WM with bidirectional or chunk-causal variants; NVFP4 quantization on RTX 5090; job queue and observability.
- Assumptions/dependencies: GPU availability; content moderation; prompt hygiene; watermarking.

Policy, Governance, and Safety

Transparent reporting and evaluation for long-video generators (sector: policy/standards)
- What: Standardize disclosures (data scale, compute, memory, latency), and adopt the paper’s 60s benchmark and camera-accuracy metrics.
- Tool/workflow: “WorldBench-60” guideline and reference kit; mandatory provenance tags and model-card checklists.
- Assumptions/dependencies: Voluntary industry alignment; integration with content provenance standards (e.g., C2PA).
Synthetic content provenance and labeling (sector: policy/platforms)
- What: Default watermarking/logging on generated sequences; pipeline hooks for disclosure overlays.
- Tool/workflow: Post-processing service that signs clips and embeds camera-path metadata.
- Assumptions/dependencies: Platform support; clear user consent; region-specific compliance.
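
The signing step of such a post-processing service could look roughly like the sketch below, which binds a clip's hash and its camera path into one signed record. Everything here is an assumption for illustration: the function names, the manifest fields, and the use of a shared-secret HMAC from the Python standard library. It is not the C2PA format; a production provenance system would normally use certificate-based signatures and a standardized manifest.

```python
import hashlib
import hmac
import json


def _canonical(manifest):
    """Serialize the manifest deterministically so signatures are reproducible."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_clip(video_bytes, camera_path, key):
    """Return a record tying the clip's SHA-256 and camera path to a signature.

    `camera_path` is a hypothetical JSON-serializable list of per-frame poses.
    """
    manifest = {
        "synthetic": True,
        "video_sha256": hashlib.sha256(video_bytes).hexdigest(),
        "camera_path": camera_path,
    }
    signature = hmac.new(key, _canonical(manifest), hashlib.sha256).hexdigest()
    return {"manifest": manifest, "signature": signature}


def verify_clip(video_bytes, record, key):
    """Check that both the video bytes and the manifest are unmodified."""
    manifest = record["manifest"]
    if hashlib.sha256(video_bytes).hexdigest() != manifest["video_sha256"]:
        return False
    expected = hmac.new(key, _canonical(manifest), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])


if __name__ == "__main__":
    key = b"demo-secret"  # placeholder; load real keys from a secret store
    clip = b"\x00\x01fake-video-bytes"
    record = sign_clip(clip, [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]], key)
    print(verify_clip(clip, record, key))
```

Hashing the video separately keeps the signed payload small, and signing the camera path together with the hash means neither can be swapped without invalidating the record.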

Long-Term Applications

These applications require further research, scaling, or additional components (e.g., explicit 3D memory, physical dynamics, or real-time constraints).

Robotics and Autonomy

Closed-loop world models for policy learning and planning (sector: robotics/autonomy)
- What: Use camera/action-conditioned generators for long-horizon simulation in the loop with control policies.
- Needed advances: Explicit 3D/4D memory for persistent geometry, dynamic object modeling, robot action spaces beyond camera 6-DoF, domain adaptation for sim-to-real.
- Dependencies/assumptions: Safety-critical validation; integration with RL/IL pipelines; compute-efficient streaming.
Drone and mobile-robot rehearsal and path optimization (sector: robotics/inspection)
- What: Optimize inspection or cinematography paths by simulating visibility and parallax in a controllable generator.
- Needed advances: Accurate scene physics, occlusion handling, and multi-sensor simulation.
- Dependencies/assumptions: Calibration to target facilities; safeguards against over-optimistic simulated outcomes.

Autonomous Driving and Digital Twins

Minute-scale, multi-camera scenario synthesis for rare-event training (sector: automotive)
- What: Generate long, controllable driving sequences for edge cases, domain shifts, and revisit loops.
- Needed advances: Multi-camera rigs, traffic agents with behavior models, map-grounded 3D persistence, weather/time-of-day control.
- Dependencies/assumptions: Data governance with real-world maps; rigorous metrics beyond image quality.
Hybrid 3DGS/NeRF + world-model engines for persistent digital twins (sector: AEC/smart cities)
- What: Fuse explicit 3D reconstructions with generative rollouts for rapid, lifelike, long-horizon tours and what-if scenarios.
- Needed advances: Consistent fusion between latent video generation and 3D assets; bidirectional updates (edit 3D → propagate to video and vice versa).
- Dependencies/assumptions: Scalable content pipelines; storage and versioning of multi-minute latent sequences.

AR/VR and Real-Time Media

Real-time, head-locked generative environments (sector: XR)
- What: Streamed generation responding to head/hand motion for immersive experiences.
- Needed advances: Sub-30 ms latency, streaming refiner, low-power on-device models, robust dynamic-scene handling.
- Dependencies/assumptions: Specialized hardware acceleration; safety and comfort in prolonged use.

Healthcare and Technical Training

Endoscopy/surgical camera simulation for training (sector: healthcare/edtech)
- What: Controlled 6-DoF camera rollouts in anatomy-like environments for skill practice.
- Needed advances: Domain-specific finetuning, anatomical correctness, tool-tissue dynamics and physics, strict validation.
- Dependencies/assumptions: Access to annotated medical video; regulatory approval for training use.

Creative Co-pilots and Interactive Tools

Cinematography co-pilots that suggest and synthesize camera paths (sector: creative tools)
- What: Interactive agents recommending paths based on script/scene, generating previews on the fly.
- Needed advances: Text-to-trajectory grounding, user-in-the-loop editing of UCPE/Plücker space, real-time draft–refine cycles.
- Dependencies/assumptions: UX integration with DCC tools; copyright-safe training and outputs.

Safety, Detection, and Governance

Robust detection and traceability of long synthetic videos (sector: policy/platforms)
- What: Forensic tools tuned to long-horizon diffusion signals and action-conditioning artifacts.
- Needed advances: Detectors resilient to post-processing and compression; standardized metadata pipelines.
- Dependencies/assumptions: Broad platform adoption; periodic red-teaming and public audits.

Cross-cutting assumptions and dependencies

Hardware: Immediate deployments assume access to at least one H100 (bidirectional high-quality) or a single RTX 5090 (distilled, ~40–50s end-to-end per 60s, 720p clip with NVFP4).
Inputs: Best results when users provide clean first frames, accurate 6-DoF camera paths, and concise scene text. Scene dynamics, fast parallax with thin structures, or extreme FOVs may degrade quality.
IP and provenance: Use public or licensed assets; maintain provenance/watermarking; disclose synthetic content in downstream workflows.
Limitations noted by authors: Scale-limited model, no explicit 3D scene memory, potential drift in dynamic scenes or rare viewpoints, and quality drop over very long horizons without refinement.


Glossary

3D Gaussian Splatting (3DGS): A point-based 3D scene representation/rendering method that models scenes as anisotropic Gaussians for fast novel-view synthesis. "we fit one FCGS [94] 3D Gaussian Splatting reconstruction per scene,"
6-DoF: Six degrees of freedom describing a camera’s 3D position and orientation. "metric-scale 6-DoF camera poses"
AIQ: A metric capturing long-horizon imaging-quality degradation across a video. "and long-term degradation AIQ are reported in App. E.1,"
Attention-sink tokens: Special tokens added to attention layers to stabilize context and keep memory/latency constant with local windows. "we add attention-sink tokens and local temporal windows to the softmax attention layers,"
Autoregressive generator: A model that generates sequences step-by-step, conditioning each step on previously generated outputs. "a chunk-causal autoregressive generator for sequential rollout,"
BEV (Bird’s-Eye View): A top-down spatial representation often used in driving simulators and geometry-aware models. "including BEV or occupancy-based driving simulators,"
Bidirectional generator: A generator that uses both past and future context (forward and backward scans) during inference. "a bidirectional generator for high-quality offline synthesis,"
CamMC (Camera-Motion Consistency): A metric evaluating the consistency of generated camera motion with reference trajectories. "The refined output obtains the best RotErr, TransErr, and CamMC,"
Chunk-causal: A scheme that processes long sequences in chunks with causal constraints within or across chunks. "we further fine-tune a chunk-causal variant for autoregressive rollout."
Context-Parallel (CP) training: Parallelization that shards sequences across devices and composes compact recurrent summaries instead of full activations. "Context-Parallel (CP) training shards the latent sequence along time."
Cumulative linear attention: A linear-attention formulation that accumulates key–value outer products into a fixed-size state, so memory stays constant as context length grows. "SANA-Video [25] uses ReLU-based cumulative linear attention in place of causal softmax attention."
Delta rule (in GDN): A recurrence that updates state using a gated, decayed correction toward the target value. "Gated DeltaNet (GDN) [11] augments the same recurrent state with a decay gate and a delta-rule correction:"
Diffusion forcing: A training scheme that assigns independent noise levels to different frames or tokens, so the model can denoise later frames while conditioning on cleaner earlier ones during rollout. "Long-duration generation is commonly approached through autoregressive or block-wise rollout, diffusion forcing, streaming training, and memory- or cache-aware inference"
Diffusion Transformer (DiT): A transformer backbone used as the denoiser/generator in diffusion-based models. "SANA-WM utilizes a Diffusion Transformer (DiT) architecture"
DiFix3D: A diffusion-based post-processing method that reduces artifacts in rendered 3D/novel-view videos. "We then refine the rendered videos with DiFix3D [95]"
Epipolar constraints: Geometric constraints relating corresponding points across views under known camera motion. "CamCo combines Plücker conditioning with epipolar constraints,"
FCGS: A fast 3D Gaussian Splatting reconstruction approach used to build renderable scene models. "we fit one FCGS [94] 3D Gaussian Splatting reconstruction per scene,"
FlashAttention: An efficient attention kernel that reduces memory traffic and accelerates softmax attention. "Standard softmax attention remains effective and can be accelerated by kernels such as FlashAttention [68],"
Flow matching: A training objective that learns a velocity field to transport noisy inputs toward clean targets. "The one-minute stage is trained on 961-frame clips with the standard flow-matching objective [101]."
FSDP2: A fully sharded data-parallel training framework/version for scaling large models across GPUs. "under FSDP2."
FVD (Fréchet Video Distance): A metric that compares the feature distributions of generated and real videos via the Fréchet distance; lower is better. "FVD [98] and Umeyama-aligned Pi3X metrics are shown."
Gated DeltaNet (GDN): A gated linear/recurrent attention mechanism with decay and delta updates for long-context efficiency. "Gated DeltaNet (GDN) [11] augments the same recurrent state with a decay gate and a delta-rule correction:"
Halo exchange: A distributed-convolution technique exchanging boundary context between shards to match unsharded outputs. "Halo Exchange for Convolutions."
Hybrid Linear Attention: A design that interleaves efficient linear/recurrent attention with periodic exact softmax blocks. "Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling."
Intrinsics/extrinsics: Camera parameters; intrinsics define internal calibration, extrinsics define pose relative to the world. "Camera representations include raw extrinsics and intrinsics,"
KV cache: The stored keys and values in attention layers used to reuse past context during generation. "all-softmax grows its KV cache and runs out of memory at 60s."
LoRA (Low-Rank Adaptation): A parameter-efficient finetuning method that injects low-rank adapters into model layers. "we initialize from the 17B LTX-2 model and train LoRA adapters"
LTX2-VAE: A high-compression video variational autoencoder used to reduce token counts for long videos. "we replace the baseline VAE [20, 84] with LTX2-VAE [10]"
Metric-scale: Having absolute scale consistent with real-world units in estimated geometry or poses. "metric-scale 6-DoF camera poses"
NVFP4 quantization: Quantization to NVFP4, NVIDIA's 4-bit floating-point format, to cut memory use and speed up inference on GPUs that support it. "with NVFP4 quantization to denoise a 60s 720p clip in 34s."
Occupancy-based (driving simulators): 3D scene/state representations modeling space occupancy for simulation. "including BEV or occupancy-based driving simulators,"
Plücker mixing: Injecting Plücker-parameterized camera-ray features into the model to capture fine camera motion. "Fine Branch: Raw-Frame Plücker Mixing."
PRoPE: A camera-aware positional encoding variant applied at the attention level. "attention-level PRoPE [65] and UCPE [12] greatly improve control,"
PSNR/SSIM/LPIPS: Standard perceptual and pixelwise similarity metrics for evaluating video/image fidelity. "revisit memory (PSNR/SSIM/LPIPS [97]) and long-term degradation AIQ are reported in App. E.1,"
Raymap: A per-pixel representation of camera rays (e.g., Plücker coordinates) used for geometric conditioning. "We compute pixel-wise Plücker raymaps $P_{r,p} = (d_{r,p},\, O_r \times d_{r,p}) \in \mathbb{R}^6$"
Revisit memory: The ability to maintain scene identity and consistency when revisiting previously seen views. "revisit memory (PSNR/SSIM/LPIPS [97]) and long-term degradation AIQ are reported in App. E.1,"
RoPE (Rotary Positional Encoding): A method injecting relative positional information via rotations in attention space. "standard RoPE channels."
Self-forcing distillation: A distillation strategy where the student conditions on its own predictions during training. "We then use self-forcing distillation [28] to reduce sampling to four denoising steps."
Softmax attention: The standard dot-product attention with softmax-normalized weights. "Standard softmax attention remains effective and can be accelerated by kernels such as FlashAttention [68],"
State-space models: Sequence models based on linear dynamical systems for efficient long-context processing. "toward linear attention, kernelized attention, gated linear attention, state-space models, convolutional mixers,"
Triton kernels: Custom GPU kernels written in the Triton language to fuse operations and improve efficiency. "we use custom fused Triton kernels for GDN scan and gate operations."
UCPE (Unified Camera Positional Encoding): A positional encoding that embeds camera geometry into attention features. "latent-frame UCPE [12] captures global 6-DoF pose,"
Umeyama Sim(3) alignment: A method to align trajectories/point sets via a similarity transform (rotation, translation, scale). "Umeyama Sim(3) alignment [99]"
VAE (Variational Autoencoder): A generative model that learns a latent space for reconstructing data via an encoder–decoder. "we replace the baseline VAE [20, 84] with LTX2-VAE [10]"
VBench: A multi-dimensional benchmark for assessing visual/video quality. "We report VBench [96] scores for visual quality"
VIPE: A video pose estimation and reconstruction engine used for annotating camera trajectories. "Our annotation engine is based on VIPE [13],"
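
To make the Raymap and Plücker mixing entries concrete, the sketch below computes a per-pixel Plücker raymap for one frame: each pixel gets a ray direction d and a moment o × d, matching the (direction, moment) ordering in the quoted formula. The function name, the pinhole intrinsics `K`, the camera-to-world pose convention, sampling at pixel centers, and unit-normalizing the directions are assumptions for illustration; the paper's exact conventions may differ.

```python
import numpy as np


def plucker_raymap(K, cam_to_world, height, width):
    """Return an (H, W, 6) array of Plücker ray coordinates (d, o x d).

    K            : (3, 3) pinhole intrinsics (assumed).
    cam_to_world : (4, 4) camera-to-world pose (assumed convention).
    """
    # Pixel centers in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project to camera-space directions, then rotate into world space.
    dirs_cam = pixels @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # All rays of one frame share the camera center as origin.
    origin = cam_to_world[:3, 3]
    moment = np.cross(origin, dirs_world)  # broadcasts over (H, W, 3)

    return np.concatenate([dirs_world, moment], axis=-1)


if __name__ == "__main__":
    K = np.array([[500.0, 0.0, 640.0],
                  [0.0, 500.0, 360.0],
                  [0.0, 0.0, 1.0]])
    pose = np.eye(4)
    pose[:3, 3] = [0.5, 0.0, -1.0]
    print(plucker_raymap(K, pose, 720, 1280).shape)
```

The moment term is what separates two cameras that point the same way from different positions: translating the camera leaves d unchanged but changes o × d, so the six channels together encode the full ray in space.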
