Cosmos 3: Omnimodal World Models for Physical AI

Published 1 Jun 2026 in cs.CV, cs.AI, cs.LG, cs.MM, and cs.RO | (2606.02800v1)

Abstract: We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-LLMs, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 https://openmdw.ai/license/1-1/ License at https://github.com/nvidia/cosmos}{github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3 . The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3 .

Abstract PDF Upgrade to Chat

Authors (291)

First 10 authors:

Summary

The paper introduces Cosmos 3, a unified omnimodal model that integrates language, vision, audio, and action using a Mixture-of-Transformers architecture.
The model achieves state-of-the-art performance on diverse benchmarks, excelling in tasks like text-to-image generation, video synthesis, and control policy inference.
Its design enables flexible modality handling, scalable training, and rapid specialization for applications in robotics, autonomous vehicles, and smart infrastructure.

Cosmos 3: Omnimodal World Models for Physical AI

Cosmos 3 represents a significant advancement in foundation models for Physical AI, providing a unified neural backbone for reasoning, perception, world simulation, and physically grounded action generation. Leveraging a Mixture-of-Transformers (MoT) architecture, Cosmos 3 natively models language, vision (image/video), audio, and action for both understanding and generation within a single, extensible system. This essay provides a comprehensive technical summary and analysis of Cosmos 3, focusing on its design principles, architectural innovations, training methodology, benchmark results, and future implications for the development of general-purpose embodied agents.

Unified Omnimodal Modeling for Physical AI

Cosmos 3 is motivated by the limitations of the prevailing paradigm in embodied AI, which relies on stitching together separate models for perception (VLMs), world simulation (video/dynamics models), and action (VLAs, WAMs). Cosmos 3 eliminates these architectural boundaries and subsumes language, image, video, audio, and action within a single inter-operable system, supporting both discriminative (reasoning) and generative (simulation and control) workloads.

Figure 1: Cosmos 3 as a general-purpose backbone subsuming a range of model classes for physical AI within a single architecture.

The model supports configurable input/output modalities, seamlessly transitioning between operational regimes: multimodal understanding/reasoning, text-to-image, image-to-video, video/audio-conditioned generation, forward/inverse dynamics, and closed-loop policy inference. This flexibility positions Cosmos 3 as a scalable substrate for scalable training and fine-tuning of embodied agents across tasks and embodiments.

Figure 2: Cosmos 3 enables post-training specialization for downstream applications (e.g., robotic manipulation, synthetic data, without architectural changes).

Model Architecture

Multimodal Encoders and Unified Action Representation

Cosmos 3 introduces dedicated, modality-specific encoders projecting language, vision, audio, and action into a shared embedding space. Vision uses a ViT for understanding tasks, and a video VAE (frozen) for generative modes; audio is handled via a high-throughput VAE. Actions, crucial for embodied learning, are mapped from diverse control interfaces into a unified action representation comprising relative egocentric and effector 6-DoF pose deltas and grasp states, supporting a broad spectrum of real and simulated agents.

Figure 3: Unified action representation—mapping heterogeneous control signals to compact vectors, facilitating domain transfer and policy generalization.

Mixture-of-Transformers (MoT) Backbone

The backbone adopts a dual-pathway MoT layout—an autoregressive "reasoner" and a diffusion-based "generator"—each with independent parameters, layer norms, and MLPs. Reasoner tokens are processed with causal attention, while generator tokens use bidirectional attention, attending jointly over AR and DM tokens, but never conditioning AR tokens on DM outputs, preserving causal consistency.

Figure 4: MoT architecture with dual pathway processing and joint attention—enabling native integration of reasoning and generation.

Multimodal Position Embeddings and Temporal Alignment

Cosmos 3 employs an extended 3D MRoPE scheme for position encoding, supporting spatiotemporal alignment across modalities with differing rates and semantics (e.g., aligning audio events, video frames, and action tokens). The design ensures robust handling of variable context and multi-resolution temporal signals, critical for embodied simulation, and includes offset handling to avoid initial frame artifacts.

Figure 5: 3D coordinate assignment for joint multimodal sequences, supporting FPS modulation and absolute time alignment.

Token Arrangement and Generation Modes

By interleaving autoregressive and diffusion token subsequences for various modalities, Cosmos 3 efficiently packs heterogeneous data into large context windows for training and inference. This arrangement allows support for advanced operational modes: text-to-image/video, image-to-video, video-to-video, control-conditioned transfer, forward/inverse dynamics, and joint video-action policy prediction.

Figure 6: Action sequence training configurations for different conditional generation and prediction settings.

Training Data and Infrastructure

The training pipeline follows a staged curriculum. The reasoner is trained and fine-tuned on multimodal web-scale and physical-AI task data, while the generator receives progressive multimodal pre-training and mid-training, expanding from image/video/audio to joint action/video and transfer-conditioned data. All data pipelines leverage advanced curation infrastructure (SILA), semantic deduplication, AI-judged annotation filtering, and structured captioning.

Figure 7: Curated data mixture for the reasoner, organized by capability category to ensure broad coverage and domain adaptation.

Figure 8: Generator data curriculum—modalities and modes are introduced across pre-training, mid-training, and post-training stages, allowing domain-specialized adaptation.

The action corpus covers 61.3K hours and 8.4M episodes across robotics, AV, camera motion, and egocentric data, intentionally designed for maximal domain transferability.

Figure 9: Action data distribution across major pillars in the mid-training corpus, crucial for unified world-action modeling.

Key Numerical Results and Empirical Claims

Cosmos 3 achieves the strongest performance among open-source models on nearly all evaluated Physical AI understanding and generation benchmarks:

General Reasoning: Cosmos3-Super attains a score of 73.7 (general understanding average), competitive with state-of-the-art VLMs (e.g., Gemini 3.1 Pro: 77.5), while surpassing all on robotics, smart infrastructure, and driving benchmarks.
Text-to-Image Generation: Cosmos3-Super-Text2Image achieves 91.36 on UniGenBench and leads all open-weight models on the Artificial Analysis leaderboard.
Image-to-Video Generation: Cosmos3-Super-Image2Video is ranked #1 among open models and is competitive with many proprietary models.
Audio-Visual Synchronization: Semantically grounded and temporally aligned audio-visual generations, with Cosmos3-Nano and Super leading semantic alignment metrics among open models.
Action and Policy: Cosmos3-Nano-Policy-DROID is the top open-weight policy model on RoboArena and achieves 39.7% overall success on RoboLab-120 under specific task instructions, outperforming all prior real-world policy models.
Physical and Temporal Plausibility: State-of-the-art scores are reported in physical plausibility (Physics-IQ, PAIBench-G) and human evaluation for long-horizon, physically consistent video and robotic task scenarios.

Practical and Theoretical Implications

Cosmos 3’s unified treatment of language, perception, and action establishes a template for flexible, compositional agents capable of end-to-end perception, reasoning, simulation, and control. The joint modeling of all modalities facilitates robust transfer to downstream tasks, enabling rapid specialization for robotics, AV, smart infrastructure, and synthetic simulation environments without architecture changes.

Figure 10: Qualitative results for robotics forward dynamics—Cosmos 3 closely follows action commands, surpassing specialized baselines in visual fidelity.

The synthetic data generation capabilities of Cosmos 3 present new opportunities for scalable, low-risk training of embodied agents, providing photorealistic, physically grounded visual and auditory simulation, as well as accurate world-action rollouts.

Infrastructure and Scalability

Cosmos 3’s infrastructure stack supports efficient handling of massive multimodal corpora, distributed training, and high-throughput serving, including:

Integration with advanced serving backends (TensorRT-LLM, vLLM-Omni)
Production-scale joint data loading and packing for ~74K token contexts
Variance-reduced, multi-resolution, and context-parallel sampling and training
Agentic human-evaluation, semantic upsampling, and prompt adaptation
Figure 11: Cosmos 3 infrastructure stack—spanning data, training, serving, and benchmark pipelines for large-scale model lifecycle management.

Future Directions

Cosmos 3 enables several promising future developments, including:

Agentic, closed-loop training environments where simulation and policy co-evolve within the same model
Large-scale synthesis of complex training environments for real-world deployment of embodied AI
Further exploration of self-supervised learning from unlabeled multimodal corpora via flexible conditional generation objectives
Scaling to longer contexts, richer resolutions, and broader embodiment/action spaces

Given its released code, checkpoints, synthetic datasets, and evaluation benchmarks under an open license, Cosmos 3 provides a robust platform for the community to explore next-generation Physical AI systems.

Conclusion

Cosmos 3 demonstrates that a single omnimodal, dual-pathway transformer can unify multimodal understanding and generation, subsuming most prior model classes (VLM, video generator, AV generator, WAM) for Physical AI applications. It establishes new open-source state-of-the-art performance across a range of tasks central to real-world embodied agents, and shows that the Physical AI research community can advance on transparent, auditable open benchmarks. Cosmos 3 lays a highly extensible foundation for the next generation of embodied, generalist AI systems capable of perception, simulation, and action at scale.

Markdown Report Issue

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Explain it Like I'm 14

Overview: What is this paper about?

This paper introduces Cosmos 3, a single AI system that can understand and create many kinds of things at once: text, images, videos, sounds, and even action steps for robots. Instead of using separate models for each job (like one for reading images, another for making videos, and another for controlling a robot), Cosmos 3 combines them into one “omnimodal” model. The goal is to build a strong, general “world model” that can power Physical AI—agents that see, think, and act safely and effectively in the real world.

The main questions the paper asks

Can we build one model that both understands the world and also generates (creates or predicts) what happens next—across text, images, video, audio, and actions?
Can this single model replace many separate, specialized systems (like vision-LLMs, video generators, and robot policy models)?
Will this unified approach help train smarter, safer Physical AI agents, especially by learning in simulation before trying things in real life?

How the system works (in everyday terms)

Think of Cosmos 3 as a single “brain” with two tightly connected halves that share what they learn:

One brain for many skills

Cosmos 3 accepts different kinds of inputs: words, pictures, video clips, audio, and even robot movements.
It turns everything into “tokens,” which you can imagine as small puzzle pieces that carry meaning. Different encoders turn each input type into a shared internal language so the brain can work on all of them together.

Understanding and making: two teamwork modes inside one model

Autoregressive (AR) mode = understanding and reasoning. Picture writing a sentence one word at a time, using what came before to decide the next word. Cosmos 3 uses this for reading text, looking at images/videos, and answering questions.
Diffusion (DM) mode = generation. Imagine starting with a very noisy, blurry canvas and slowly cleaning it up until a clear image or video appears. Cosmos 3 uses this to generate images, videos, audio, and actions.

Both modes run inside a Mixture-of-Transformers (MoT) design—think of it like a two-lane highway inside the same brain:

The “reasoning lane” handles AR tokens (words and vision tokens for understanding).
The “generation lane” handles the diffusion tokens (for making images, video, audio, and actions).
The generation lane can look at what the reasoning lane produced (so it follows instructions), but the reasoning lane doesn’t depend on the generation lane. This keeps instructions stable while creation happens.

Action tokens: teaching the model how to act

“Action” is treated as a first-class citizen, just like text or images. An action token sits between two video frames and represents “what changed” to get from the last frame to the next, like a tiny motion step.
Different robots or bodies have different controls (car steering, robot arm movement, hand grasps). Cosmos 3 converts them into a shared, compact “action vector” so it can learn consistent patterns across many types of movement.
This lets the model do:
- Forward dynamics: if I take these actions, what will I see next?
- Inverse dynamics: given what changed in the video, what actions caused it?
- Policy mode: predict actions and the future video together, like planning moves and imagining the outcome at the same time.

Keeping time in sync across video, audio, and actions

Different data run at different speeds: video might be 24 or 60 frames per second; audio is sampled thousands of times per second; actions might be logged at yet another rate.
Cosmos 3 uses a shared “time map” so all tokens line up on the same timeline. Think of it as a common rhythm or beat so the model knows which sound, picture, and action belong together at each moment.

Flexible input/output modes

Cosmos 3 can switch jobs without changing its architecture. Here are common modes:

Text-to-Image: write a caption, get a picture.
Text-to-Video (optionally with audio): write a script, get a video (and matching sound).
Image-to-Video or Video-to-Video: start from one image or a few frames, continue the motion.
Video transfer: give a control video (like an edge map or depth map) and text, get a stylized RGB video.
Action modes:
- Forward dynamics: clean actions in, predict future visuals.
- Inverse dynamics: clean visuals in, infer the actions that caused them.
- Policy: predict actions and their visual results together.

Different sizes for different needs

Edge (small), Nano (medium), and Super (large) versions balance speed and power. All share the same design but vary in scale.

How it’s trained

Reasoner pathway: trained on vision-language tasks (like image+text, video+text) to learn to understand and reason.
Generator pathway: trained on lots of images, videos, audio, and action data to learn to create and simulate.
The model is then “post-trained” (further tuned) for specific tasks, like better text-to-image, better image-to-video, or better robot policies.

What the researchers found and why it matters

According to the paper’s benchmarks and rankings:

Cosmos 3 reaches or sets state-of-the-art results across many understanding and generation tasks.
The post-trained Cosmos 3 models were ranked:
- Best open-source Text-to-Image and Image-to-Video by Artificial Analysis at the time of writing.
- Best policy model by RoboArena at the time of writing.
It performs competitively even against specialized models built for single tasks, while Cosmos 3 handles many tasks in one framework.
It can be adapted (“post-trained”) for particular needs—like creating higher-quality synthetic data or learning strong robot policies—without changing the architecture.

Why this is important:

Training robots directly in the real world is slow, expensive, and sometimes unsafe. Cosmos 3 can learn in simulation, generate realistic data, and predict outcomes, making training cheaper and safer.
Unifying understanding and generation helps the model plan better: it can reason about what will happen next and create those futures in its “mind,” then choose actions accordingly.
Because it’s open-weight and comes with code, datasets, and benchmarks, other researchers can build on it.

What this could change in the future

Smarter, safer Physical AI: A home robot, for example, could read instructions, look around, plan its moves, predict the results, and act—using one brain instead of stitching together separate tools.
Faster progress through synthetic worlds: Cosmos 3 can generate high-quality images, videos, and environments for training, speeding up research without risking real-world harm.
Easier specialization: The same base model can be tuned for different jobs—robotics, driving, digital humans, warehouse tasks—while sharing a common understanding of the world.

In short, Cosmos 3 is like a Swiss Army knife for AI that sees, thinks, makes, and acts. By combining many abilities into one model, it aims to become a strong backbone for the next generation of real-world AI agents.

View Paper Prompt View All Prompts

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of what remains missing, uncertain, or unexplored in the paper, phrased to guide follow-up research.

Quantify the sensitivity of the fixed temporal gap (15,000) between AR and diffusion subsequences: ablations on image/video quality, artifact suppression, and any side-effects on AR reasoning.
Systematically evaluate the absolute temporal modulation (base TPS=6) across diverse frame rates and sampling rates: robustness to variable-FPS inputs, dropped frames, clock drift, and mixed-modality resampling.
Provide ablations comparing alternative positional-encoding schemes (e.g., different RoPE variants, learned embeddings, segment-aware encodings) and their effects on long-horizon temporal coherence and cross-modal alignment.
Investigate whether the AR-not-conditioned-on-DM constraint limits iterative reasoning, e.g., text updates informed by partially denoised visuals or actions; explore controlled bidirectional coupling or multi-round conditioning.
Characterize the impact of token ordering within the unified format (vision → audio → action in DM) on conditioning strength, modality bias, and generation fidelity; test alternative orderings and interleavings.
Detail and compare generator objectives (flow matching vs. DDPM/EDM/consistency models), noise schedules, and training dynamics; report how each objective trades off sharpness, temporal consistency, and conditioning faithfulness.
Assess whether freezing the video/audio VAEs creates a fidelity bottleneck; evaluate end-to-end finetuning or tokenizer adaptation and its cost/benefit on realism and dynamics fidelity.
Examine limits of the unified action interface: coverage gaps for legged locomotion, drones, deformable manipulation, multi-contact tasks, and complex gripper/hand synergies not captured by pose deltas + simple grasp states.
Quantify the loss from discarding embodiment-specific low-level actuation/controller parameters; evaluate feasibility constraints and control stability when converting pose deltas back to executable controls.
Analyze numerical stability of the 6D rotation representation + SVD inversion for action decoding in long sequences; report failure modes and regularization needed to avoid discontinuities.
Measure cross-domain transfer and negative transfer with domain-aware input/output projections: when do shared MoT parameters help vs. hurt new embodiments, and what isolation or adapters minimize interference?
Provide a recipe for rapidly adding a new embodiment domain (data requirements, initialization of domain-specific projections, few-shot adaptation) and quantify sample efficiency.
Clarify synchronization assumptions between video, audio, and action streams; evaluate robustness to latency, jitter, and sensor desynchronization common in real data.
Report memory/latency/computation scaling for the dual-tower MoT at different sequence lengths and modes; quantify the cost-benefit of dual towers vs. single-tower baselines.
Provide long-horizon evaluations (minutes-scale) to probe temporal drift, identity/scene consistency, and physics plausibility given the VAE’s 4× temporal compression and transformer context limits.
Investigate uncertainty calibration and multi-future modeling: how many diverse samples are needed for planning; metrics for diversity vs. realism under action-conditioned generation.
Specify the generator data composition (scale, domains, real vs. synthetic ratios, action-video pair coverage), curation/filtering procedures, and their impact on downstream capabilities and biases.
Evaluate sim-to-real transfer and closed-loop control on real robots/vehicles beyond DROID post-training: robustness to sensor noise, occlusions, lighting changes, and embodiment mismatch.
Analyze catastrophic forgetting risks when post-training for different targets (T2I, I2V, policy): what freezes/regularizers or multi-task schedules preserve reasoning while improving generation/policy.
Provide audio–video synchronization metrics and results; characterize failure modes (lip-sync drift, onset misalignment) and training interventions to fix them.
Test multi-video/multi-scene conditioning and segment stitching: does resetting spatial/temporal indices per segment cause ambiguities or context leakage; propose disambiguation mechanisms.
Clarify how action feasibility is enforced in policy mode: do predicted pose deltas respect dynamics/kinematics constraints; propose training-time penalties or post-hoc projection onto feasible sets.
Benchmark the world-model’s physics fidelity (contact, momentum, friction) with standardized diagnostics; identify gaps where the model breaks physical invariants.
Evaluate planning usefulness: does forward dynamics accuracy translate into better control when used in MPC/planning loops; quantify end-to-end task improvements over baselines.
Study robustness under distribution shift and adversarial prompts across modalities (text, images, audio); propose guardrails for unsafe content or unsafe policies.
Provide detailed training schedules (which layers updated per mode, learning-rate strategies, curriculum timing) to improve reproducibility and illuminate key design choices.
Compare MoT to alternatives (e.g., cross-attention conditioning, mixture-of-experts, adapters) with controlled compute to isolate where dual towers most help.
Quantify effects of the DeepStack aggregation and text-based video timestamps on temporal reasoning vs. generation fidelity; ablate or replace with alternatives.
Describe evaluation protocols and metrics behind the headline scores (human preference criteria, inter-rater reliability, dataset coverage), including energy/latency footprints for fair comparisons.
Explore interactive/online learning: integrating RL, environment rollouts, and self-play with the unified model; report stability and sample efficiency in on-policy updates.
Address ethical risks of open synthetic data and omnimodal generation (video/audio deepfakes, action misuse); document bias audits, content filtering, and policy safeguards.
Specify interoperability with simulators and control stacks: APIs and data exchange formats to plug Cosmos 3 into standard RL/planning pipelines.
Clarify the status and roadmap for the unreleased Edge model and its on-device constraints; provide mobile/edge benchmarks and compression/distillation strategies.

View Paper Prompt View All Prompts

Practical Applications

Immediate Applications

The following applications can be deployed now using the released code, open-weight checkpoints, and datasets, or with modest post-training/fine-tuning. Each item includes suggested sectors and possible tools/workflows, plus key assumptions/dependencies.

Synthetic data engine for vision and robotics
- Sectors: robotics, autonomous driving, smart infrastructure, media
- What: Use Cosmos3-Super/Nano (and post-trained Text2Image/Image2Video) to generate high-diversity images/videos for perception model pretraining, rare-event augmentation, and domain balancing; leverage open SDG datasets (e.g., SDG-DriveSim, SDG-Warehouse, SDG-RobotSim) to bootstrap domain-specific corpora.
- Tools/workflows: “Data factory” pipelines that script prompts + control-video (depth/edges) for targeted data; automatic curriculum generation via Image-to-Video and Video transfer; integrate with MLOps data versioning and evaluation via Cosmos-HUE.
- Assumptions/dependencies: GPU capacity for diffusion inference; prompt and control-signal engineering; data governance (copyright/consent) and content provenance; sim-to-real validation.
Unified multimodal evaluation and benchmarking
- Sectors: academia, industry R&D, public sector labs
- What: Use Cosmos-HUE and the MoT-based reasoner for standardized assessment across understanding, generation, and action; compare ablations and fine-tuned variants in a single protocol.
- Tools/workflows: Continuous evaluation in CI; regression dashboards for multimodal KPIs (reasoning, T2I/T2V quality, policy success).
- Assumptions/dependencies: Agreement on metrics for cross-modality evaluation; reproducible evaluation harnesses.
Robotics policy prototyping and imitation learning
- Sectors: robotics (R&D labs, warehouse automation), education
- What: Post-train Cosmos3-Nano on task data (e.g., DROID) to obtain world-action models and policies; use forward/inverse dynamics for rollouts and action inference from demonstrations.
- Tools/workflows: “Policy studio” for dataset ingestion, embodiment adapters (domain-aware action projections), closed-loop sim evaluation, and real-robot pilots; RoboArena-style benchmarks for quick iteration.
- Assumptions/dependencies: High-quality demonstrations with calibrated action/state logs; safety guardrails; sim-to-real transfer plan; latency budget for control loops.
Previsualization and virtual production (text-to-video/image-to-video with audio)
- Sectors: media/entertainment, advertising, education
- What: Rapidly iterate storyboards, motion studies, and animatics from prompts or single images; generate synchronous audio for scenes.
- Tools/workflows: NLE/DCC plugins that drive Cosmos3 generation modes (T2V, I2V, AV sync); control-video (depth/edge) guided transfers to preserve motion/layout.
- Assumptions/dependencies: Alignment to studio style guides; IP clearance; compute provisioning and caching; watermarking/content provenance.
Smart-infrastructure and AV research sandboxes
- Sectors: transportation, city planning, AV startups
- What: Use video-to-video/transfer to synthesize traffic scenes, weather, lighting, and sensor variations for perception/regression testing; employ reasoner for spatiotemporal QA and incident annotation.
- Tools/workflows: “Scenario bank” generation on SDG-DriveSim; automatic labeling and question-answering for corner-case triage; evaluation against downstream detectors/trackers.
- Assumptions/dependencies: Coverage of local traffic rules and assets; realism thresholds for downstream models; governance for synthetic images in safety pipelines.
Interactive multimodal assistants for labs and classrooms
- Sectors: education, research
- What: VLM-style understanding (language+image/video) for lab demonstrations, procedural guidance, and video explanation; generate short illustrative clips for concepts.
- Tools/workflows: Lightweight front-ends that invoke AR reasoning for step-by-step instructions and DM generation for illustrative sequences.
- Assumptions/dependencies: Domain alignment (terminology, safety); content filters for classroom use.
Teleoperation preview and human-in-the-loop robotics
- Sectors: field robotics, logistics
- What: Use forward dynamics to visualize predicted near-term consequences of candidate actions, aiding remote operators in delicate tasks.
- Tools/workflows: Operator console overlays showing rollout videos conditioned on action tokens; adjustable horizon/uncertainty bands.
- Assumptions/dependencies: Accurate camera/pose calibration for action tokens; latency management; uncertainty communication to the operator.
Multimodal documentation and training content authoring
- Sectors: software/devtools, enterprise L&D
- What: Auto-generate short tutorial videos from textual procedures or screenshots (I2V/T2V), with synchronized narration.
- Tools/workflows: CMS integration to draft trainings; batch generation with human review; A/B testing on learner outcomes.
- Assumptions/dependencies: Corporate brand/safety policies; accessibility standards; compute cost controls.
Egocentric analytics and AR prototyping
- Sectors: wearables, AR/VR
- What: Use egocentric action tokens for wrist/head motion analysis, imitation learning, or predictive overlays in AR prototypes.
- Tools/workflows: Embodiment adapters for head/wrist pose; inverse dynamics to infer actions from video for coaching/feedback.
- Assumptions/dependencies: Accurate inertial/vision sensor fusion; privacy/consent; power/latency limits on device (Edge model pending release).
Open research on world models and MoT architectures
- Sectors: academia, open-source community
- What: Experiment with dual-tower MoT, absolute-temporal MRoPE, and mixed AR+diffusion training strategies using open weights and code.
- Tools/workflows: Reproducible training curricula for reasoner/generator; ablation suites; dataset contribution via SDG schema.
- Assumptions/dependencies: Compute access; alignment with OpenMDW-1.1 licensing.

Long-Term Applications

These applications are feasible with further research, scaling, safety validation, or systems integration. They indicate where Cosmos 3’s unified, omnimodal design can evolve into end-to-end solutions.

General-purpose home and service robots
- Sectors: consumer robotics, hospitality, eldercare
- What: Single backbone for perception, planning via forward simulation, and action generation across diverse household tasks (e.g., table clearing, tidying).
- Tools/workflows: Continual post-training per home/venue; joint video-action policy learning; on-device Edge deployment for privacy/latency.
- Assumptions/dependencies: Robust sim-to-real transfer, safety and reliability certification, long-horizon memory, task generalization, cost-effective sensors/actuators.
Closed-loop autonomous driving with world-model-based planning
- Sectors: automotive, mobility services
- What: Use joint video-action modeling for prediction and policy rollouts, integrating multiple sensors and control stacks under a unified world model.
- Tools/workflows: Scenario-driven self-play and rare-event synthesis; uncertainty-aware planning on top of DM predictions; hybrid optimization with rule-based constraints.
- Assumptions/dependencies: Multi-sensor fusion extensions, stringent safety validation, regulatory approval, robust OOD handling.
City-scale digital twins and policy stress testing
- Sectors: public policy, smart cities, insurance
- What: Generate and simulate complex urban scenes (traffic flows, events, infrastructure changes) to test interventions (signal timing, zoning, emergency response).
- Tools/workflows: Scenario authoring with control-video inputs; synthetic longitudinal datasets for impact analysis; benchmarking pipelines for KPIs.
- Assumptions/dependencies: High-fidelity asset libraries and behavioral priors; governance frameworks for synthetic evidence; integration with GIS and real telemetry.
Industrial and energy inspection robotics with predictive simulation
- Sectors: energy, manufacturing, utilities
- What: Robots that plan and execute inspection/maintenance, leveraging forward dynamics to predict effects of manipulations in constrained environments.
- Tools/workflows: Task libraries (valve turning, gauge reading, thermal inspections); policy rollouts under safety envelopes; digital twin synchronization.
- Assumptions/dependencies: Specialized sensors (thermal, depth), hazardous-environment certification, domain adaptation to materials/lighting.
Synthetic training environments as a service (EaaS) for Physical AI
- Sectors: robotics platforms, AV, defense research
- What: Programmatically generate complex, curriculum-aligned 3D environments and tasks, co-evolving with agent capabilities to minimize real-world risk.
- Tools/workflows: “Environment compiler” that uses text prompts + control signals to produce interactive scenes and tasks; automated difficulty scaling.
- Assumptions/dependencies: Interactive physics integration with the generator; scalable evaluation harnesses; guardrails to avoid overfitting to synthetic artifacts.
Multimodal AR/VR agents and embodied tutoring
- Sectors: education, healthcare training, enterprise
- What: Agents that perceive user actions/scene context and generate real-time guidance, demonstrations, or corrective feedback via video/audio overlays.
- Tools/workflows: Real-time MoT inference pipelines; egocentric action adapters; curriculum personalization via reasoner.
- Assumptions/dependencies: Edge-grade models for on-device operation, low-latency streaming, bias mitigation, user safety and privacy controls.
Assistive technologies and rehabilitation robots
- Sectors: healthcare, home care
- What: Personalized, safe action policies for mobility/rehab support, with video-action rollouts to anticipate and avoid risk.
- Tools/workflows: Clinician-in-the-loop training; safety-case documentation using synthetic stress tests; explainable action rationales via reasoner.
- Assumptions/dependencies: Medical-grade data, regulatory clearance, rigorous human-subject trials, specialized hardware and fail-safes.
Content authenticity, governance, and forensic tools for generative media
- Sectors: policy/regulation, media platforms
- What: Detection/traceability layered onto Cosmos 3 outputs (e.g., watermarking, provenance metadata), plus explainable generation settings for audits.
- Tools/workflows: Platform-side filters; provenance pipelines embedded in Cosmos-Framework; third-party audit APIs.
- Assumptions/dependencies: Standardized provenance frameworks, cross-platform adoption, adversarial robustness.
Cross-embodiment transfer and fleet learning
- Sectors: robotics platforms, logistics networks
- What: Share a common backbone with embodiment-specific action adapters to propagate skills across robot types and sites; centralize learning while respecting local constraints.
- Tools/workflows: Domain-aware projection libraries; federated post-training; capability discovery and routing.
- Assumptions/dependencies: Consistent action semantics across fleets, privacy-preserving training, robust adapter calibration.
On-device, privacy-preserving multimodal agents
- Sectors: consumer devices, wearables, automotive
- What: Edge-scale Cosmos 3 variants for real-time perception, short-horizon prediction, and local action, minimizing cloud dependency.
- Tools/workflows: Quantization/distillation; hardware-aware scheduling; hybrid on-device/cloud fallback.
- Assumptions/dependencies: Release and maturation of Cosmos3-Edge; memory and power constraints; thermal design; model safety on-device.

Notes on common dependencies and risks across applications:

Compute and latency: Diffusion-based generation and joint attention are compute-intensive; production use may require model scaling strategies (distillation, caching, progressive inference).
Data and calibration: Accurate time-aligned video/action/audio streams, correct FPS/sampling metadata (required by the absolute-temporal MRoPE), and reliable camera/pose calibration are critical.
Safety and sim-to-real: Policies derived from synthetic or simulated data require staged validation and conservative deployment; uncertainty estimation is essential.
Licensing and governance: Compliance with OpenMDW-1.1, asset/IP policies for synthetic media, and platform rules for generated content distribution.
Security and misuse: Powerful text-to-video/image capabilities necessitate watermarking/provenance and moderation pipelines to mitigate deepfake or deceptive content risks.

View Paper Prompt View All Prompts

Glossary

3D MRoPE: A 3D multimodal rotary positional embedding that splits positional encoding across temporal and spatial dimensions to align tokens across modalities. "we design a 3D MRoPE with absolute temporal indexing to align video, audio, and action tokens along the same physical temporal axis."
6D rotation representation: An over-parameterized representation of 3D rotations that improves continuity and learning stability compared to Euler angles or quaternions. "We use the 6D representation following~\citet{zhou2019continuity} and the OpenCV convention for rotations where the z-axis is along the fingers/grippers and x-axis is to the right."
Action tokens: Learned tokens representing actions as a core modality, used to connect control signals with multimodal reasoning and generation. "Cosmos 3 treats action as a core modality, introducing a dedicated class of action tokens."
Autoregressive (AR) subsequence: The front segment of the packed token sequence used for reasoning with causal attention (e.g., language and ViT-encoded vision). "The AR subsequence is responsible for reasoning and understanding."
Begin-of-generation (BOG) token: A special token marking the transition point where generation begins in the packed sequence. "ending with <EOS> and a begin-of-generation token <BOG>,"
Bidirectional attention: Attention that allows tokens to attend to both past and future positions (used for diffusion tokens). "Tokens in the DM subsequence use full bidirectional attention, with the union of AR and DM tokens serving as the keys and values."
Causal self-attention: Attention constrained to past tokens only, preserving the autoregressive property for language/understanding. "Tokens in the AR subsequence attend only to tokens within the AR subsequence using causal self-attention;"
DeepStack: A technique for aggregating multi-level vision features to enhance understanding. "we also aggregate visual features from ViT via DeepStack~\citep{meng2024deepstack}"
Diffusion (DM) subsequence: The back segment of the packed token sequence containing noisy continuous tokens to be iteratively denoised for generation. "The diffusion subsequence follows the AR subsequence and contains video and image tokens from the VAE encoder, as well as audio and action tokens."
Domain-aware input and output projection layers: Per-domain linear projections that map heterogeneous action vectors into and out of a shared latent action space. "We therefore use domain-aware input and output projection layers with separate weight matrices for each embodiment domain~\citep{zheng2026xvla}, while sharing the MoT backbone."
Dual-stream joint attention: An attention mechanism where diffusion tokens attend to both AR and DM tokens, while AR tokens attend only within AR. "tokens from the diffusion subsequence interact with the AR subsequence through a dual-stream joint attention operation."
Ego poses: The poses of the agent’s main observation frame (e.g., camera or body), used as part of the unified action representation. "actions can include up to three components: ego poses for the agent's main observation frame, effector poses for the agent's effectors, and grasp states for the manipulation state."
Effector poses: The poses of the agent’s effectors (e.g., robot grippers or wrists) included in the action representation. "actions can include up to three components: ego poses for the agent's main observation frame, effector poses for the agent's effectors, and grasp states for the manipulation state."
Flow-matching objective: A generative training objective that predicts a velocity field consistent with probability flow, used for diffusion-style generation. "trained in practice with a flow-matching objective predicting velocity; we show the clean target here for clarity"
FPS modulation: A positional-embedding adjustment that aligns tokens from different sampling rates (e.g., video, audio, actions) onto a shared physical timeline. "FPS modulation is designed to align tokens with different temporal resolutions onto a shared physical temporal axis by modulating the effective size of each temporal increment."
Forward dynamics: Predicting future states given current observations and actions, used for simulating world evolution. "Forward dynamics predicts future visual states conditioned on observed context and clean action tokens"
Hop size: The number of audio samples between successive encoded frames, determining tokens-per-second for audio. "The raw stereo audio sampled at 48\,kHz is encoded with a hop size of 1920 samples, resulting in 25 tokens per second of audio."
Inverse dynamics: Inferring the actions that caused an observed transition between states. "inverse dynamics infers the action tokens that explain an observed visual transition."
Mixture-of-Transformers (MoT) architecture: A dual-tower transformer design with separate parameter sets for reasoning (AR) and generation (DM) that share attention. "Cosmos 3 adopts a Mixture-of-Transformers (MoT) architecture that processes a unified sequence of tokens from different modalities."
Proportional–Integral–Derivative (PID): A classical control scheme characterized by proportional, integral, and derivative terms, often used in low-level actuation. "To avoid embodiment-specific controller details such as ProportionalâIntegralâDerivative (PID) parameters or low-level actuation interfaces,"
Pseudo-actions: Action representations derived from state differences (e.g., relative pose changes) rather than direct controller commands. "ego and effector poses are represented as pseudo-actions derived from state differences."
SE(3): The Lie group of 3D rigid body transformations (rotations and translations). "For consecutive $\mathrm{SE}(3)$ poses $\mathbf{T}_{t-1}$ and $\mathbf{T}_t$ , we represent motion as the relative transform $\Delta \mathbf{T}_t = \mathbf{T}_{t-1}^{-1}\mathbf{T}_t$ ."
SO(3): The Lie group of 3D rotations represented by orthogonal matrices with determinant one. "We convert the predicted 6D rotation back to a $3\times3$ $\mathrm{SO}(3)$ rotation matrix using singular value decomposition (SVD)."
Singular value decomposition (SVD): A matrix factorization used here to project 6D rotation predictions back onto a valid rotation matrix in SO(3). "We convert the predicted 6D rotation back to a $3\times3$ $\mathrm{SO}(3)$ rotation matrix using singular value decomposition (SVD)."
Temporal steps per second (TPS): A rate defining how many positional steps correspond to one second of physical time for a token stream. "We first define the temporal steps per second (TPS) to characterize the physical temporal resolution."
VAE: A variational autoencoder used to tokenize images/videos (and audio) into continuous latent tokens for diffusion-based generation. "The VAE compresses the input video temporally by $4\times$ and spatially by $32\times32$ "
ViT encoder: A Vision Transformer encoder used for visual understanding and alignment with language. "For visual understanding, we use a ViT encoder pre-trained with vision-language alignment."
Video transfer: A conditional generation task that maps control videos (e.g., edges, depth) and text to RGB videos. "Video transfer. In this task, the input consists of a control video (\eg, edge, or depth) together with a text description, and the model generates the corresponding RGB video."
Vision-Language-Action Model (VLA): A model class that predicts actions from vision and language inputs. "such as Vision-Language-Action Models (VLAs) and World-Action Models (WAMs)."
Vision-LLM (VLM): A model class that jointly processes images/videos and text for understanding and reasoning. "initialized from the weights of a pre-trained Vision-LLM (VLM), allowing Cosmos 3 to inherit strong language and visual reasoning capabilities"
World-Action Model (WAM): A model that predicts actions (and possibly state evolution) grounded in world observations. "such as Vision-Language-Action Models (VLAs) and World-Action Models (WAMs)."

View Paper Prompt View All Prompts

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Generate Now

Continue Learning

Collections

GitHub

GitHub - NVIDIA/cosmos: NVIDIA Cosmos is an open platform of world models, datasets, and tools that enables developers to build Physical AI for robots, autonomous vehicles, smart infrastructure, and more. · GitHub (8,653 stars)

Cosmos 3: Omnimodal World Models for Physical AI

Summary

Cosmos 3: Omnimodal World Models for Physical AI

Unified Omnimodal Modeling for Physical AI

Model Architecture

Multimodal Encoders and Unified Action Representation

Mixture-of-Transformers (MoT) Backbone

Multimodal Position Embeddings and Temporal Alignment

Token Arrangement and Generation Modes

Training Data and Infrastructure

Key Numerical Results and Empirical Claims

Practical and Theoretical Implications

Infrastructure and Scalability

Future Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

Overview: What is this paper about?

The main questions the paper asks

How the system works (in everyday terms)

One brain for many skills

Understanding and making: two teamwork modes inside one model

Action tokens: teaching the model how to act

Keeping time in sync across video, audio, and actions

Flexible input/output modes

Different sizes for different needs

How it’s trained

What the researchers found and why it matters

What this could change in the future

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Practical Applications

Immediate Applications

Long-Term Applications

Glossary

Open Problems

Continue Learning

Collections

GitHub

Tweets