Cosmos-Predict2.5: Unified Physical AI Simulator
- Cosmos-Predict2.5 is an open-world foundation model for Physical AI, unifying text, image, and video simulation using a scalable flow-based diffusion framework.
- A single flow-based transformer backbone handles text-, image-, and video-conditioned generation and is conditioned on Cosmos-Reason1 text embeddings, improving instruction alignment and temporal consistency.
- Training on 200 million curated video clips, followed by reinforcement-learning post-training, delivers state-of-the-art generative fidelity for robotics and autonomous systems.
Cosmos-Predict2.5 is an open-world foundation model for Physical AI, introducing a unified, scalable, flow-based generative framework for physically grounded video simulation and world modeling, with applications in robotics, autonomous systems, and embodied intelligence. Developed by NVIDIA, Cosmos-Predict2.5 integrates Text2World, Image2World, and Video2World generation in one model, leverages the Cosmos-Reason1 vision-LLM for text grounding and control, and is trained on 200 million high-quality video clips with reinforcement learning–based post-training, delivering substantial advances in fidelity, instruction alignment, and utility for Sim2Real/Real2Real world translation (NVIDIA et al., 28 Oct 2025).
1. Architectural Foundations
Cosmos-Predict2.5 employs a Flow Matching (FM) backbone, distinct from the Elucidated Diffusion Model formulation used in Cosmos-Predict1. The FM approach models the generative process in continuous time by predicting the velocity field that transports a noisy input toward the target data manifold. Specifically:
The training objective minimizes the conditional flow-matching loss

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{x_0 \sim \mathcal{N}(0, I),\; x_1 \sim p_{\mathrm{data}},\; t \sim \mathcal{U}(0,1)} \left[ \left\| u_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2 \right], \qquad x_t = (1 - t)\,x_0 + t\,x_1,$$

where $u_\theta$ is the predicted velocity field and $c$ encodes all conditioning information (text, images, or videos).
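For concreteness, below is a minimal PyTorch sketch of this conditional flow-matching objective; the `velocity_model` wrapper, latent shapes, and conditioning interface are illustrative assumptions rather than the released implementation.

```python
import torch

def flow_matching_loss(velocity_model, x1, cond):
    """Minimal conditional flow-matching loss (illustrative sketch).

    x1:   clean video latents, shape (B, C, T, H, W)
    cond: conditioning embeddings (text / image / video), passed through to the model
    """
    B = x1.shape[0]
    # Sample the noise endpoint and a per-sample time t ~ U(0, 1).
    x0 = torch.randn_like(x1)
    t = torch.rand(B, device=x1.device).view(B, 1, 1, 1, 1)

    # Linear interpolation between noise (t = 0) and data (t = 1).
    xt = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0  # constant velocity along the straight path

    # The transformer predicts the velocity field given the noisy latent, time, and conditioning.
    pred_velocity = velocity_model(xt, t.flatten(), cond)
    return torch.mean((pred_velocity - target_velocity) ** 2)
```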
The model operates in the latent space of the WAN2.1 VAE, which compresses videos along the temporal and spatial (height and width) dimensions, enabling efficient processing of high-dimensional video data. A single flexible transformer backbone supports all three generation modes, using cross-modal attention and relative 3D rotary positional embeddings (3D RoPE) to accommodate varied conditioning modalities and sequence lengths.
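The relative positional scheme can be illustrated with a factorized 3D RoPE sketch over (time, height, width) token coordinates; the split of the head dimension across axes and the base frequency below are assumptions, not the model's exact configuration.

```python
import torch

def axial_rope_angles(positions, dim, base=10000.0):
    """Rotation angles for one axis: (N,) integer positions -> (N, dim // 2)."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions.float(), freqs)

def rope_3d_angles(T, H, W, head_dim):
    """Per-token angles for every (t, h, w) position, shape (T*H*W, head_dim // 2).

    The head dimension is split across the three axes (an illustrative choice).
    """
    d_t, d_h = head_dim // 2, head_dim // 4
    d_w = head_dim - d_t - d_h
    t, h, w = torch.arange(T), torch.arange(H), torch.arange(W)
    ang_t = axial_rope_angles(t, d_t)[:, None, None, :].expand(T, H, W, -1)
    ang_h = axial_rope_angles(h, d_h)[None, :, None, :].expand(T, H, W, -1)
    ang_w = axial_rope_angles(w, d_w)[None, None, :, :].expand(T, H, W, -1)
    return torch.cat([ang_t, ang_h, ang_w], dim=-1).reshape(T * H * W, head_dim // 2)

def apply_rope(x, angles):
    """Rotate channel pairs of x (B, N, head_dim) by per-token angles (N, head_dim // 2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because the rotation depends only on relative offsets between token coordinates, such a scheme extends to longer or higher-resolution sequences without retraining absolute position tables.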
Cosmos-Reason1, a Physical AI–specialized decoder-only vision-LLM, provides rich multimodal grounding. Text embeddings are computed by concatenating activations from multiple transformer blocks, projected to a 1024-dimensional space, and injected as conditioning vectors, yielding improved context awareness and instruction following compared to previous T5 encoders.
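A minimal sketch of this conditioning path, assuming hypothetical module names: hidden states from several selected blocks of the vision-LLM are concatenated per token and projected to a 1024-dimensional conditioning space.

```python
import torch
import torch.nn as nn

class ReasonTextConditioner(nn.Module):
    """Project concatenated multi-block LLM activations to a fixed conditioning width."""

    def __init__(self, llm_hidden_dim, num_blocks, cond_dim=1024):
        super().__init__()
        self.proj = nn.Linear(llm_hidden_dim * num_blocks, cond_dim)

    def forward(self, block_activations):
        # block_activations: list of (B, seq_len, llm_hidden_dim) tensors,
        # one per selected transformer block of the vision-LLM.
        x = torch.cat(block_activations, dim=-1)  # (B, seq_len, llm_hidden_dim * num_blocks)
        return self.proj(x)                       # (B, seq_len, 1024) conditioning vectors
```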
2. Unified Generation Modes: Text2World, Image2World, Video2World
Cosmos-Predict2.5 unifies three principal generation paradigms:
- Text2World: Text-prompted video synthesis, e.g., generating a scenario such as "a robot places a red block on a table."
- Image2World: Video synthesis with both text and image conditioning, initializing the output sequence from a reference image and ensuring temporal consistency.
- Video2World: Extension or transformation of video sequences, including video continuation, re-rendering, or editing, optionally under additional textual constraints.
Input masking strategies and flexible conditioning tokenization are used to enable these modalities interchangeably. During training, loss is selectively computed only for model-predicted (unfixed) frames, enhancing temporal continuity for scenarios involving fixed context (as in Image2World and Video2World).
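A minimal sketch of this selective loss, assuming a per-frame binary mask that marks model-predicted frames (1) versus fixed conditioning frames (0); shapes and the mask convention are illustrative.

```python
import torch

def masked_frame_loss(pred_velocity, target_velocity, predicted_mask):
    """Flow-matching loss restricted to frames the model must predict.

    pred_velocity, target_velocity: (B, C, T, H, W)
    predicted_mask: (B, T) with 1 for model-predicted frames and 0 for fixed
                    conditioning frames (e.g. the reference image in Image2World
                    or the context clip in Video2World).
    """
    per_frame = ((pred_velocity - target_velocity) ** 2).mean(dim=(1, 3, 4))  # (B, T)
    mask = predicted_mask.float()
    return (per_frame * mask).sum() / mask.sum().clamp(min=1.0)
```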
3. Advanced Training, Post-Training, and Optimization
Cosmos-Predict2.5 is trained on 200 million curated, deduplicated video clips (selected out of over 6 billion, amounting to 35 million hours) processed through a 7-stage curation pipeline: shot detection, GPU transcoding, spatial cropping, multi-faceted filtering (aesthetic, motion, OCR, semantics), high-quality captioning via VLMs, semantic deduplication, and sharding. Specialized pipelines target key Physical AI domains, including robotics, autonomous driving (with 3.1 million multi-view driving sessions), human dynamics, and physics.
Training follows a progressive curriculum—starting with lower-resolution text-to-image tasks and incorporating increasingly complex Video2World and Image2World objectives, with resolution scaling from 256p to 720p. Post-training comprises:
- Supervised Fine-Tuning (SFT): On curated task-specific datasets, with domain-expert curation and multi-head InternVideo2 classifiers.
- Model Merging: Robust model soups (using TIES/DARE) fusing domain-specific and general skills; a simplified merging sketch follows this list.
- Reinforcement Learning (VideoAlign): RLHF-style reward modeling, using VLM-based automated metrics covering text alignment, motion, and visual quality, with batch normalization and regularization to mitigate reward exploits.
- Timestep Distillation: Reducing inference steps (rCM framework) while maintaining generative fidelity.
- Scaling Infrastructure: Fully Sharded Data Parallel (FSDP2), context parallelism, activation checkpointing, and elastic reward servers for efficient H100-based distributed training.
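To illustrate the merging step referenced above, a simplified uniform "model soup" over fine-tuned checkpoints is sketched below; TIES/DARE additionally trim small parameter deltas and resolve sign conflicts before averaging, which this sketch omits. The checkpoint paths are hypothetical.

```python
import torch

def uniform_model_soup(state_dicts):
    """Average parameters of several fine-tuned checkpoints with identical architecture."""
    merged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts], dim=0)
        merged[key] = stacked.mean(dim=0)
    return merged

# Usage (hypothetical paths): merge domain-specialized checkpoints into one model.
# soup = uniform_model_soup([torch.load(p, map_location="cpu") for p in checkpoint_paths])
# model.load_state_dict(soup)
```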
4. Enhancements and Advances Over Cosmos-Predict1
Substantial advancements over Cosmos-Predict1 are documented across quantitative, qualitative, and architectural axes:
- Video Quality Gains: Cosmos-Predict2.5-2B and 14B post-trained models achieve Image2World PAI-Bench scores of 0.810, matching closed-source models of substantially greater scale; in human preference evaluations, the 2B variant is competitive with (or preferred over) much larger commercial alternatives.
- Temporal Consistency and Object Permanence: RL post-training enhances video realism and accurate temporal tracking.
- Instruction Alignment: Integration with Cosmos-Reason1 and RL-based adjustment yield demonstrably stronger adherence to prompt instructions and visual-linguistic control, as measured by automated rewards and human evaluation.
- Unified and Efficient Architecture: Single-model support for all modes minimizes architectural complexity and code duplication, and removal of absolute positional embeddings permits generalization to longer and higher-resolution sequences.
- Robust Multimodal World Transfer: Cosmos-Transfer2.5 introduces a ControlNet-style world translation framework with distributed control branches (edges, blur, segmentation, depth), significantly improving Sim2Real and Real2Real transfer. Despite being 3.5× smaller than Cosmos-Transfer1, it delivers higher fidelity, reduced hallucination, and superior performance on long-horizon video generation.
Selected quantitative comparison:

| Model Variant | Scale (B params) | PAI-Bench (Image2World) | Human Preference |
|---|---|---|---|
| Cosmos-Predict2.5 (post-trained) | 2 / 14 | 0.810 | SOTA |
| Major closed-source WAN | 5–14 | 0.797–0.810 | Comparable |
5. Cosmos-Transfer2.5: Sim2Real and Real2Real World Translation
Cosmos-Transfer2.5 is a control-net style extension that distributes four control branches (edge, blur, segmentation, depth) across the Predict2.5 backbone, enabling dense, multi-modal, spatial and temporal control. Each branch can be trained independently or in concert, providing precise world translation capabilities adaptable to both simulation (Sim2Real) and real-world (Real2Real) settings.
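A schematic sketch of this ControlNet-style pattern, assuming hypothetical module names and injection points: each control branch encodes one control video and contributes zero-initialized residual features that are added to the backbone's hidden states, so an untrained branch initially leaves the base model unchanged.

```python
import torch
import torch.nn as nn

class ControlBranch(nn.Module):
    """Encode one control video (e.g. depth or edges) into per-layer residual features."""

    def __init__(self, in_channels, hidden_dim, num_injection_points):
        super().__init__()
        self.encoder = nn.Conv3d(in_channels, hidden_dim, kernel_size=3, padding=1)
        # Zero-initialized projections: the branch starts as a no-op on the backbone.
        self.zero_projs = nn.ModuleList(
            [nn.Conv3d(hidden_dim, hidden_dim, kernel_size=1) for _ in range(num_injection_points)]
        )
        for proj in self.zero_projs:
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, control_video):
        feat = self.encoder(control_video)            # (B, hidden, T, H, W)
        return [proj(feat) for proj in self.zero_projs]

def inject_controls(backbone_hiddens, control_residuals, weights):
    """Add weighted residuals from each control branch to the backbone hidden states."""
    out = []
    for i, h in enumerate(backbone_hiddens):
        for residuals, w in zip(control_residuals, weights):
            h = h + w * residuals[i]
        out.append(h)
    return out
```

Per-branch weights allow the control signals to be combined or used independently, matching the independent-or-joint training described above.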
This framework enables efficient, high-fidelity rendering of output from physical simulators and augmentation of real videos. In practical experiments, Cosmos-Transfer2.5 significantly improves downstream policy robustness for robotic manipulation, exceeding baselines trained with classical augmentations (24/30 vs. 5/30 generalization successes). RNDS-based error metrics indicate only slow quality degradation over extended video horizons, confirming robust long-term generative stability.
6. Applications in Physical AI and Embodied Intelligence
Cosmos-Predict2.5 and Cosmos-Transfer2.5 underpin a wide spectrum of applications:
- Synthetic Data Generation: Serve as scalable generators for Vision-Language-Action (VLA) datasets, supporting closed-loop data augmentation and extraction of pseudo-action labels for downstream training.
- Policy Evaluation and Closed-Loop Robotics: Empower action-conditioned world modeling, supporting planning, simulation, and policy validation, with demonstrated performance on robotics benchmarks such as Bridge (robot kitchen); a schematic closed-loop rollout sketch follows this list.
- Autonomous Driving Simulation: Multi-view video generation (with up to 7 synchronized camera streams), supporting scenario generation for perception and planning, with camera or map-based trajectory conditioning.
- Sim2Real/Real2Real Transfer: Enable robust robot policy training with realistic augmentations, improving transfer from simulation to real-world deployments or between real domains.
- Long-Horizon Video Translation: Maintain high-fidelity generative robustness across extended time horizons, supporting video augmentation and analysis in complex multi-view, multi-agent settings.
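As referenced in the Policy Evaluation item above, the following schematic rollout loop shows how an action-conditioned world model can stand in for a real environment during policy evaluation; `world_model`, `policy`, and `success_fn` are hypothetical placeholders, not the released interfaces.

```python
import torch

def closed_loop_rollout(world_model, policy, initial_frames, horizon, success_fn):
    """Evaluate a policy inside a learned world model instead of on a real robot.

    world_model(frames, action) -> predicted next frame   (hypothetical interface)
    policy(frames) -> action                               (hypothetical interface)
    success_fn(frames) -> bool, task-specific success check
    """
    frames = list(initial_frames)
    with torch.no_grad():
        for _ in range(horizon):
            action = policy(frames)                 # policy acts on the generated history
            next_frame = world_model(frames, action)  # world model predicts the consequence
            frames.append(next_frame)
            if success_fn(frames):
                return True, frames
    return False, frames
```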
7. Open Source Availability and Research Impact
Cosmos-Predict2.5 and Cosmos-Transfer2.5 are available under the NVIDIA Open Model License, with public repositories at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. The release includes source code, pretrained and post-trained checkpoints (2B, 14B, all domain-specific variants), curated captioned datasets, and open benchmarks (e.g., PAI-Bench, DreamGen). The stated goals are to accelerate research and deployment in Physical AI by providing a reproducible, scalable foundation for controllable world simulation, adaptation to custom domains, and rapid innovation in embodied intelligence and autonomous system research.