
Cosmos-Predict2.5: Unified Physical AI Simulator

Updated 4 November 2025
  • Cosmos-Predict2.5 is an open-world foundation model for Physical AI, unifying text, image, and video simulation using a scalable flow-based diffusion framework.
  • The model integrates multiple modalities in a unified transformer backbone and conditions on Cosmos-Reason1 text embeddings, enhancing instruction alignment and temporal consistency.
  • Advanced training on 200M curated video clips with reinforcement learning post-training delivers state-of-the-art generative fidelity for robotics and autonomous systems.

Cosmos-Predict2.5 is an open-world foundation model for Physical AI, introducing a unified, scalable, flow-based generative framework for physically grounded video simulation and world modeling, with applications in robotics, autonomous systems, and embodied intelligence. Developed by NVIDIA, Cosmos-Predict2.5 integrates Text2World, Image2World, and Video2World generation in one model, leverages the Cosmos-Reason1 vision-LLM for text grounding and control, and is trained on 200 million high-quality video clips with reinforcement learning–based post-training, delivering substantial advances in fidelity, instruction alignment, and utility for Sim2Real/Real2Real world translation (NVIDIA et al., 28 Oct 2025).

1. Architectural Foundations

Cosmos-Predict2.5 employs a Flow Matching (FM) diffusion backbone, distinct from the Elucidated Diffusion Models used in Cosmos-Predict1. The FM approach models the data generation process in continuous time by predicting the velocity field $\mathbf{v}_t$ that transports a noisy input toward the target data manifold. Specifically:

$$\mathbf{x}_t = (1-t)\mathbf{x} + t\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

$$\mathbf{v}_t = \epsilon - \mathbf{x}$$

The training objective minimizes

$$\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{x}, \epsilon, \mathbf{c}, t} \left\| \mathbf{u}(\mathbf{x}_t, t, \mathbf{c}; \theta) - \mathbf{v}_t \right\|^2$$

where $\mathbf{c}$ encodes all conditioning information (text, images, or videos).
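
This objective translates directly into a short training step. The PyTorch sketch below is illustrative only: the velocity network u and its conditioning interface are hypothetical stand-ins, while the interpolant and target follow the equations above.

```python
import torch

def flow_matching_loss(u, x, c):
    """One flow-matching loss evaluation.
    u: velocity network with assumed signature u(x_t, t, c);
    x: clean latents (B, ...); c: conditioning embeddings."""
    b = x.shape[0]
    t = torch.rand(b, device=x.device)            # t ~ Uniform[0, 1]
    t_ = t.view(b, *([1] * (x.dim() - 1)))        # broadcast t over latent dims
    eps = torch.randn_like(x)                     # eps ~ N(0, I)
    x_t = (1 - t_) * x + t_ * eps                 # noisy interpolant x_t
    v_t = eps - x                                 # target velocity field
    return ((u(x_t, t, c) - v_t) ** 2).mean()     # squared-error objective
```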

The model operates in the latent space of the WAN2.1 VAE (compression ratio $4 \times 8 \times 8$ for time, height, and width, respectively), enabling efficient processing of high-dimensional video data. The architecture supports all three generation modes within a single flexible transformer backbone, characterized by cross-modal attention mechanisms and relative (3D RoPE) positional embeddings, accommodating varied conditioning modalities and sequence lengths.
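
As a quick sanity check on what the $4 \times 8 \times 8$ compression implies, the back-of-envelope sketch below computes the latent grid for a hypothetical clip; the frame count and the latent channel width are illustrative assumptions, not values from the paper.

```python
def latent_shape(frames, height, width, ct=4, ch=8, cw=8, channels=16):
    """Latent grid under a (time, height, width) compression of ct x ch x cw.
    `channels` is an assumed latent width, not a documented value."""
    return (channels, frames // ct, height // ch, width // cw)

# A hypothetical 121-frame 720p clip maps to a (16, 30, 90, 160) latent,
# so the transformer operates over 30 * 90 * 160 = 432,000 spatiotemporal cells.
print(latent_shape(121, 720, 1280))
```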

Cosmos-Reason1, a Physical AI–specialized decoder-only vision-LLM, provides rich multimodal grounding. Text embeddings are computed by concatenating activations from multiple transformer blocks, projected to a 1024-dimensional space, and injected as conditioning vectors, yielding improved context awareness and instruction following compared to previous T5 encoders.
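
A minimal sketch of this conditioning path follows; the number of tapped blocks and the vision-LLM hidden width are assumptions for illustration, not Cosmos-Reason1's actual configuration.

```python
import torch
import torch.nn as nn

class TextConditioner(nn.Module):
    """Concatenate activations tapped from several vision-LLM blocks and
    project them to the 1024-dimensional conditioning space described above."""
    def __init__(self, llm_dim=4096, num_taps=4, out_dim=1024):
        super().__init__()
        self.proj = nn.Linear(llm_dim * num_taps, out_dim)  # assumed widths

    def forward(self, hidden_states):
        # hidden_states: list of num_taps tensors, each (B, L, llm_dim)
        fused = torch.cat(hidden_states, dim=-1)   # (B, L, llm_dim * num_taps)
        return self.proj(fused)                    # (B, L, out_dim) conditioning
```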

2. Unified Generation Modes: Text2World, Image2World, Video2World

Cosmos-Predict2.5 unifies three principal generation paradigms:

  • Text2World: Text-prompted video synthesis, e.g., generating a scenario such as "a robot places a red block on a table."
  • Image2World: Video synthesis with both text and image conditioning, initializing the output sequence from a reference image and ensuring temporal consistency.
  • Video2World: Extension or transformation of video sequences, including video continuation, re-rendering, or editing, optionally under additional textual constraints.

Input masking strategies and flexible conditioning tokenization enable these modes interchangeably. During training, the loss is computed only for model-predicted (unfixed) frames, enhancing temporal continuity in scenarios with fixed context (as in Image2World and Video2World).
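
A hedged sketch of this selective loss, assuming a boolean per-frame mask that marks which latent frames were supplied as fixed context:

```python
import torch

def masked_velocity_loss(pred_v, target_v, frame_is_conditioned):
    """pred_v, target_v: (B, C, T, H, W) velocity tensors;
    frame_is_conditioned: (B, T) bool, True where a frame is fixed context."""
    keep = (~frame_is_conditioned).float()[:, None, :, None, None]  # (B,1,T,1,1)
    keep = keep.expand_as(pred_v)               # broadcast to full latent shape
    se = (pred_v - target_v) ** 2 * keep        # zero out conditioned frames
    return se.sum() / keep.sum().clamp(min=1.0) # mean over predicted elements only
```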

3. Advanced Training, Post-Training, and Optimization

Cosmos-Predict2.5 is trained on 200 million curated, deduplicated video clips (selected from over 6 billion clips totaling 35 million hours) processed through a 7-stage curation pipeline: shot detection, GPU transcoding, spatial cropping, multi-faceted filtering (aesthetic, motion, OCR, semantics), high-quality captioning via VLMs, semantic deduplication, and sharding. Specialized pipelines target key Physical AI domains, including robotics, autonomous driving (with 3.1 million multi-view driving sessions), human dynamics, and physics.
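
The skeleton below mirrors the stage ordering described; every stage is a trivial placeholder standing in for real curation logic, not NVIDIA's actual pipeline code.

```python
def make_stage(name):
    """Placeholder stage: a real stage would transform or filter the clips."""
    def stage(clips):
        print(f"{name}: {len(clips)} clips in")
        return clips
    return stage

PIPELINE = [make_stage(n) for n in (
    "shot detection", "GPU transcoding", "spatial cropping",
    "filtering (aesthetic/motion/OCR/semantics)", "VLM captioning",
    "semantic deduplication", "sharding",
)]

def curate(clips):
    for stage in PIPELINE:
        clips = stage(clips)
    return clips

curate(["clip_a.mp4", "clip_b.mp4"])  # toy input
```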

Training follows a progressive curriculum—starting with lower-resolution text-to-image tasks and incorporating increasingly complex Video2World and Image2World objectives, with resolution scaling from 256p to 720p. Post-training comprises:

  • Supervised Fine-Tuning (SFT): On curated task-specific datasets, with domain-expert curation and multi-head InternVideo2 classifiers.
  • Model Merging: Robust model soups (using TIES/DARE) fusing domain-specific and global skills; a merging sketch follows this list.
  • Reinforcement Learning (VideoAlign): RLHF-style reward modeling, using VLM-based automated metrics covering text alignment, motion, and visual quality, with batch normalization and regularization to mitigate reward exploits.
  • Timestep Distillation: Reducing inference steps (rCM framework) while maintaining generative fidelity.
  • Scaling Infrastructure: Fully Sharded Data Parallel (FSDP2), context parallelism, activation checkpointing, and elastic reward servers for efficient H100-based distributed training.
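
As referenced in the Model Merging item above, the sketch below illustrates a DARE-style soup: each expert's parameter deltas are randomly dropped and rescaled before being averaged into the base model. This shows the general idea only; the exact TIES/DARE recipe used for Cosmos-Predict2.5 may differ.

```python
import torch

def dare_merge(base, experts, drop_p=0.9):
    """base: state_dict of the generalist model; experts: list of state_dicts.
    Randomly drops a fraction drop_p of each expert's delta, rescales the
    survivors to stay unbiased, then averages the deltas into the base."""
    merged = {k: v.clone() for k, v in base.items()}
    for sd in experts:
        for k, v in sd.items():
            delta = v - base[k]
            keep = (torch.rand_like(delta) > drop_p).float()
            merged[k] += (delta * keep / (1.0 - drop_p)) / len(experts)
    return merged
```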

4. Enhancements and Advances Over Cosmos-Predict1

Substantial advancements over Cosmos-Predict1 are documented across quantitative, qualitative, and architectural axes:

  • Video Quality Gains: The post-trained Cosmos-Predict2.5-2B and -14B models achieve Image2World PAI-Bench scores of 0.810, matching closed-source models of substantially greater scale. In human preference evaluations, the 2B variant is competitive with (or superior to) much larger commercial alternatives.
  • Temporal Consistency and Object Permanence: RL post-training enhances video realism and accurate temporal tracking.
  • Instruction Alignment: Integration with Cosmos-Reason1 and RL-based adjustment yield demonstrably stronger adherence to prompt instructions and visual-linguistic control, as measured by automated rewards and human evaluation.
  • Unified and Efficient Architecture: Single-model support for all modes minimizes architectural complexity and code duplication, and removal of absolute positional embeddings permits generalization to longer and higher-resolution sequences.
  • Robust Multimodal World Transfer: Cosmos-Transfer2.5 introduces a ControlNet-style world translation framework with distributed control branches (edge, blur, segmentation, depth), significantly improving Sim2Real and Real2Real transfer. Despite being 3.5× smaller than Cosmos-Transfer1, it delivers higher fidelity, reduced hallucination, and superior performance on long-horizon video generation.

Selected Quantitative Results

Model Variant               Scale (B)   PAI-Bench (I2W)   Human Preference
Cosmos-Predict2.5 (PT)      2 / 14      0.810             ≥ SOTA
Major closed-source / WAN   5–14        0.797–0.810       Comparable

5. Cosmos-Transfer2.5: Sim2Real and Real2Real World Translation

Cosmos-Transfer2.5 is a ControlNet-style extension that distributes four control branches (edge, blur, segmentation, depth) across the Predict2.5 backbone, enabling dense, multimodal, spatial and temporal control. Each branch can be trained independently or in concert, providing precise world translation capabilities adaptable to both simulation (Sim2Real) and real-world (Real2Real) settings.
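
A minimal sketch of how per-branch control features might be folded into backbone activations; the encoder shapes and the additive, weighted fusion rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ControlBranches(nn.Module):
    """One lightweight encoder per control modality, fused additively into
    the backbone features with per-branch weights."""
    def __init__(self, modalities=("edge", "blur", "segmentation", "depth"), dim=64):
        super().__init__()
        self.encoders = nn.ModuleDict({
            m: nn.Conv3d(3, dim, kernel_size=3, padding=1) for m in modalities
        })

    def forward(self, feats, controls, weights):
        # feats: (B, dim, T, H, W) backbone activations;
        # controls: {modality: (B, 3, T, H, W)}; weights: {modality: float}
        for m, x in controls.items():
            feats = feats + weights.get(m, 1.0) * self.encoders[m](x)
        return feats
```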

This framework enables efficient, high-fidelity rendering of output from physical simulators and augmentation of real videos. In practical experiments, Cosmos-Transfer2.5 significantly improves downstream policy robustness for robotic manipulation, outperforming baselines trained with classical augmentations (24/30 vs. 5/30 generalization successes). RNDS-based error metrics indicate slow degradation of quality over extended video horizons, confirming robust long-term generative stability.

6. Applications in Physical AI and Embodied Intelligence

Cosmos-Predict2.5 and Cosmos-Transfer2.5 underpin a wide spectrum of applications:

  • Synthetic Data Generation: Serve as scalable generators for Vision-Language-Action (VLA) datasets, supporting closed-loop data augmentation and extraction of pseudo-action labels for downstream training.
  • Policy Evaluation and Closed-Loop Robotics: Empower action-conditioned world modeling, supporting planning, simulation, and policy validation, with demonstrated performance on robotics benchmarks such as Bridge (robot kitchen).
  • Autonomous Driving Simulation: Multi-view video generation (with up to 7 synchronized camera streams), supporting scenario generation for perception and planning, with camera or map-based trajectory conditioning.
  • Sim2Real/Real2Real Transfer: Enable robust robot policy training with realistic augmentations, improving transfer from simulation to real-world deployments or between real domains.
  • Long-Horizon Video Translation: Maintain high-fidelity generative robustness across extended time horizons, supporting video augmentation and analysis in complex multi-view, multi-agent settings.

7. Open Source Availability and Research Impact

Cosmos-Predict2.5 and Cosmos-Transfer2.5 are available under the NVIDIA Open Model License, with public repositories at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. The release includes source code, pretrained and post-trained checkpoints (2B, 14B, all domain-specific variants), curated captioned datasets, and open benchmarks (e.g., PAI-Bench, DreamGen). The stated goals are to accelerate research and deployment in Physical AI by providing a reproducible, scalable foundation for controllable world simulation, adaptation to custom domains, and rapid innovation in embodied intelligence and autonomous system research.
