
Cosmos-Transfer2.5: High-Fidelity Video Synthesis

Updated 4 November 2025
  • Cosmos-Transfer2.5 is a control-net framework for conditional world video generation that uses distributed spatial control blocks for fine-grained video translation.
  • The methodology integrates independent training of edge, blur, segmentation, and depth branches, ensuring superior spatial alignment and reduced long-horizon error.
  • Empirical results show that Transfer2.5, with 3.5× fewer parameters than its predecessor, delivers higher video quality and robustness in both Sim2Real and domain adaptation scenarios.

Cosmos-Transfer2.5 is a control-net style framework for conditional world video generation and translation, designed for Sim2Real (simulation-to-reality) and Real2Real (real-world domain translation) applications within Physical AI, robotics, and autonomous systems. Introduced as part of the Cosmos World Model family, it represents an evolution toward highly controllable, robust, and efficient video synthesis, addressing the stringent requirements of embodied intelligence tasks where precision and temporal coherence are critical.

1. System Overview and Purpose

Cosmos-Transfer2.5 extends earlier generation architectures by enabling fine-grained spatial control over video generation tasks. Its primary objectives are to bridge the reality gap in Sim2Real applications by translating low-fidelity simulator output to photorealistic video, and to facilitate domain adaptation and augmentation across different real-world visual domains for training and evaluating embodied agents. The model is architected both for discrete scenario-to-video synthesis and for continuous, closed-loop rollouts in policy learning and evaluation.

2. Architecture and Model Design

The Cosmos-Transfer2.5 architecture advances the original control-net design by distributing its four spatial control blocks evenly throughout the main branch, as opposed to clustering them at the network entrance. Specifically, control blocks are inserted after every seven main blocks, yielding more seamless propagation of control signals:

| Model                  | Size      | Control-Block Placement                              |
|------------------------|-----------|------------------------------------------------------|
| Cosmos-Transfer1-7B    | 7B params | 4 blocks clustered at the start of the main branch   |
| Cosmos-Transfer2.5-2B  | 2B params | 1 block after every 7 main-branch blocks             |
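To make the placement concrete, the following is a minimal PyTorch-style sketch of interleaving one control block after every seventh main-branch block. The block count, hidden size, layer types, and additive injection below are illustrative assumptions, not the released implementation.

```python
import torch.nn as nn

class DistributedControlNet(nn.Module):
    """Sketch of control blocks interleaved through the main branch.

    Assumptions (illustrative, not from the release): 28 main blocks,
    one control block after every 7th main block, and additive injection
    of the control feature into the main-branch hidden state.
    """

    def __init__(self, dim=1024, n_heads=16, n_main_blocks=28, control_every=7):
        super().__init__()
        self.main_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_main_blocks)
        )
        # One control block per insertion point (28 / 7 = 4 control blocks).
        self.control_blocks = nn.ModuleDict({
            str(i): nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for i in range(n_main_blocks) if (i + 1) % control_every == 0
        })

    def forward(self, x, control_feat):
        # x: (B, T, dim) video latent tokens; control_feat: (B, T, dim) control latent.
        for i, block in enumerate(self.main_blocks):
            x = block(x)
            if str(i) in self.control_blocks:
                # Inject conditioning at this depth via a residual connection.
                x = x + self.control_blocks[str(i)](control_feat)
        return x
```

In Cosmos-Transfer1, by contrast, all four control blocks sit at the entrance of the main branch, so conditioning can only influence later layers indirectly.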

Key architectural elements include:

  • Multimodal spatial control: The model takes as input any mixture of edge maps, blurred frames, segmentation maps, or depth maps, allowing independent or combined spatial constraints (a hypothetical fusion sketch appears at the end of this section).
  • Base backbone: The core network backbone is the Cosmos-Predict2.5-2B, aligning video synthesis and control with advanced vision-language representations.

This distributed control block scheme ensures that conditioning information is injected at multiple depths in the network, improving the fidelity and prompt alignment of generated video compared to single-block or front-loaded methods.
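The multimodal spatial control described above can be pictured with a small fusion module: each available control modality is encoded separately, and whichever modalities are supplied are combined into a single control feature. The per-modality encoders, weighted summation, and tensor shapes below are assumptions for illustration, not the published design.

```python
import torch.nn as nn

class MultimodalControlEncoder(nn.Module):
    """Hypothetical fusion of edge / blur / segmentation / depth control videos."""

    MODALITIES = ("edge", "blur", "segmentation", "depth")

    def __init__(self, in_channels=3, dim=64):
        super().__init__()
        # One lightweight spatio-temporal encoder per modality (illustrative).
        self.encoders = nn.ModuleDict({
            m: nn.Conv3d(in_channels, dim, kernel_size=3, padding=1)
            for m in self.MODALITIES
        })

    def forward(self, controls, weights=None):
        # controls: dict of modality name -> (B, C, T, H, W) tensor; missing modalities are skipped.
        # weights : optional dict of per-modality scalars (defaults to 1.0).
        fused = None
        for name, video in controls.items():
            w = 1.0 if weights is None else weights.get(name, 1.0)
            feat = w * self.encoders[name](video)
            fused = feat if fused is None else fused + feat
        return fused  # (B, dim, T, H, W) control feature
```

A caller could then pass, for example, controls={"edge": edge_video, "depth": depth_video} to condition on just those two modalities.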

3. Technical Innovations and Training Protocols

Cosmos-Transfer2.5 introduces several innovations targeting robustness, efficiency, and conditioning precision:

  • Model scaling: Reduction from 7B to 2B parameters (a 3.5× decrease) with no sacrifice in output quality or alignment, translating directly to faster inference and improved deployability.
  • Conditioning improvements: Gradual rather than clustered control integration yields smoother spatial adherence and better fusion of multi-modal cues.
  • Physics-AI-centric data: The training set comprises 200M+ curated video clips with semantic deduplication and specific domain labels (robotics, driving, physics).
  • Control signal generation: Large-scale automatic labeling for each control modality using state-of-the-art models (e.g., Video Depth Anything, SAM 2); a simple edge/blur example is sketched at the end of this section.
  • Task-specific training: Each control branch (edge, blur, segmentation, depth) is first trained independently (100,000 steps, batch size 64) before the branches are merged for multimodal integration.
  • Core loss: Utilizes a flow-matching objective,

$$
\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{x},\, \epsilon,\, \mathbf{c},\, t} \left\| \mathbf{u}(\mathbf{x}_t, t, \mathbf{c}; \theta) - \mathbf{v}_t \right\|^2
$$

where the conditioning $\mathbf{c}$ includes one or more control signals.
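As an illustration of how such an objective is typically computed, below is a minimal flow-matching training step in PyTorch. It assumes a rectified-flow-style linear interpolation between the clean latent and noise, with target velocity $\mathbf{v}_t = \epsilon - \mathbf{x}_0$; the interpolation schedule, timestep weighting, and model interface used by Cosmos-Transfer2.5 may differ.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, control):
    """One flow-matching training step (illustrative, not the official recipe).

    model   : network u(x_t, t, c) predicting the velocity field
    x0      : (B, ...) clean video latents
    control : conditioning c, e.g., one or more fused control signals
    """
    b = x0.shape[0]
    eps = torch.randn_like(x0)                  # Gaussian noise sample
    t = torch.rand(b, device=x0.device)         # uniform timesteps in [0, 1]
    t_b = t.view(b, *([1] * (x0.dim() - 1)))    # broadcast over latent dims

    x_t = (1.0 - t_b) * x0 + t_b * eps          # linear interpolation path
    v_target = eps - x0                         # target velocity dx_t/dt

    v_pred = model(x_t, t, control)
    return F.mse_loss(v_pred, v_target)
```

Per the protocol above, each control branch would be trained against this objective with its own modality (100,000 steps each) before the multimodal merge.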

Training efficiency at scale is achieved through distributed infrastructure, including FSDP2, context parallelism, and memory optimizations, enabling high-resolution, long-context video training.
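Relating to the control-signal generation step listed above: edge and blur controls can be derived from raw frames with standard image operations, while depth and segmentation labels come from dedicated models (Video Depth Anything, SAM 2). The sketch below uses OpenCV's Canny detector and a Gaussian blur as stand-ins; the thresholds and kernel size are arbitrary assumptions, not the paper's settings.

```python
import cv2
import numpy as np

def make_edge_and_blur_controls(frames, canny_low=100, canny_high=200, blur_ksize=31):
    """Derive per-frame edge and blur control maps from an RGB video.

    frames : iterable of HxWx3 uint8 RGB frames
    Returns (edge_maps, blurred_frames) as stacked uint8 arrays.
    """
    edge_maps, blurred = [], []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        edge_maps.append(cv2.Canny(gray, canny_low, canny_high))              # binary edge map
        blurred.append(cv2.GaussianBlur(frame, (blur_ksize, blur_ksize), 0))  # low-frequency appearance cue
    return np.stack(edge_maps), np.stack(blurred)
```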

4. Empirical Evaluation and Benchmarking

Cosmos-Transfer2.5 is benchmarked on the PAIBench-Transfer suite, assessing spatial control adherence and video quality across domains (driving, robotics). Despite the reduced parameter count, Transfer2.5 outperforms Transfer1 on both spatial and quality measures. The following key metrics are reported:

| Model                     | Blur SSIM ↑ | Edge F1 ↑ | Depth RMSE ↓ | Seg mIoU ↑ | Quality Score ↑ |
|---------------------------|-------------|-----------|--------------|------------|-----------------|
| Transfer1-7B (Uniform)    | 0.82        | 0.26      | 0.70         | 0.74       | 9.24            |
| Transfer2.5-2B (Uniform)  | 0.87        | 0.41      | 0.67         | 0.76       | 9.31            |

Further, Transfer2.5 demonstrates substantially reduced error and hallucination over time, as measured by the averaged Relative Normalized Dover Score (RNDS), a long-horizon metric quantifying error accumulation in extended video synthesis. This addresses a principal failure mode of prior autoregressive chunked video models, and is critical for closed-loop simulation and reinforcement learning tasks.

Qualitative analysis confirms sharper outputs, better spatial alignment, and prompt-consistent generations, particularly for long-range rollouts.

5. Applications in Physical AI, Robotics, and Autonomous Systems

Cosmos-Transfer2.5 is deployed for a range of high-precision embodied intelligence scenarios:

  1. Sim2Real video translation: Converts low-fidelity simulator output (e.g., IsaacSim) to photorealistic videos, facilitating robust transfer learning by bridging domain gaps (see the hypothetical usage sketch after this list).
  2. Real2Real domain adaptation: Translates real videos between domains (e.g., weather, lighting, background changes), offering broad policy exposure.
  3. Visual policy augmentation: Produces structured, realistic visual augmentations beyond simple pixel operations, demonstrated to significantly increase real-world robot policy success rates (24/30 successes versus 5/30 for the strongest baseline in reported experiments).
  4. Autonomous driving multiview generation: Given a scenario map, produces spatiotemporally consistent, multi-view 720p video for training and evaluating perception and policy modules, outperforming Transfer1 on FVD, FID, detection, and lane metrics.
  5. Long-horizon, closed-loop simulation: Lower error accumulation enables multi-episode simulation for reinforcement learning.
  6. General translation and control tasks: Fine-tuned variants support tasks such as scene relighting, semantic-to-real video synthesis, and multi-camera robotics visualization.
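To illustrate how the Sim2Real workflow in item 1 might be wired together, here is a purely hypothetical sketch: precomputed control signals from simulator footage are passed, along with a text prompt, to a translation model. The `pipeline.generate` interface, its parameter names, and the control-weight values are invented for illustration and are not the released Cosmos-Transfer2.5 API.

```python
# Hypothetical Sim2Real translation call; all interface names are illustrative only.

def sim2real_translate(pipeline, prompt, edge_video, depth_video, blur_video):
    """Translate simulator footage into photorealistic video via spatial controls.

    pipeline  : hypothetical wrapper around a Cosmos-Transfer-style checkpoint
    prompt    : text description of the target real-world appearance
    *_video   : precomputed control signals (e.g., edges/blur from the earlier sketch,
                depth from a monocular depth estimator)
    """
    return pipeline.generate(
        prompt=prompt,
        controls={"edge": edge_video, "depth": depth_video, "blur": blur_video},
        control_weights={"edge": 1.0, "depth": 1.0, "blur": 0.5},  # structure-heavy mix
    )
```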

6. Model Release, Resources, and Community Impact

The development group has provided the following open resources under the NVIDIA Open Model License:

  • Full source code for Cosmos-Transfer2.5 architecture, training, and inference.
  • Pretrained checkpoints for all principal variants.
  • Curated benchmarks, datasets, and evaluation code supporting end-to-end reproducible research.

These public releases are intended to accelerate development and deployment within the broader Physical AI, robotics, and simulation research communities.

This open distribution supports rapid community evaluation, reproducibility, and transparent benchmarking.

7. Comparison to Prior Work and Summary of Advances

Relative to Cosmos-Transfer1, Cosmos-Transfer2.5 achieves:

  • 3.5× smaller size (2B vs 7B parameters), enabling efficient scaling.
  • Enhanced spatial and temporal fidelity, superior long-horizon coherence.
  • Improved, flexible control block placement leading to greater performance on all measured axes.
  • State-of-the-art adherence to control signals and robust prompt alignment even over long video episodes.

These technical improvements are corroborated by both quantitative (PAIBench-Transfer, RNDS) and qualitative evaluation.

Cosmos-Transfer2.5 therefore sets a new benchmark for controllable, high-fidelity world video generation tailored for embodied intelligence, substantially narrowing the Sim2Real and domain gap challenges in robotics and autonomous system research. Its open-source release is positioned to support rapid innovation, scaling, and cross-domain technology transfer across the Physical AI landscape.
