Cosmos World Foundation Model Platform for Physical AI (2501.03575v2)

Published 7 Jan 2025 in cs.CV, cs.AI, cs.LG, and cs.RO

Abstract: Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open-source and our models open-weight with permissive licenses available via https://github.com/nvidia-cosmos/cosmos-predict1.

Summary

  • The paper proposes a novel foundation model platform that pre-trains a digital twin from diverse video data to address data scarcity in Physical AI training.
  • It integrates advanced data curation, video tokenization, and both diffusion and autoregressive models to enable high-quality, controllable simulations.
  • The work demonstrates improved video generation, 3D consistency, and applicability in camera control, robotic manipulation, and autonomous driving.

The paper introduces the Cosmos World Foundation Model (WFM) Platform for Physical AI, aiming to address the data scarcity challenge in training AI systems that interact with the physical world. The core idea is to pre-train a general-purpose "world model" from large-scale, diverse video data and then fine-tune this model for specific Physical AI setups. The platform includes components for data curation, video tokenization, pre-trained world models (both diffusion-based and autoregressive), examples of post-training for various tasks, and guardrails for safe deployment. The platform and models are open-sourced via NVIDIA Cosmos and NVIDIA Cosmos Tokenizer.

Physical AI systems, equipped with sensors and actuators, require training data consisting of interleaved observations and actions. Collecting such data in the real world is difficult and potentially risky. A WFM serves as a digital twin of the world, allowing safe and efficient data generation and policy training in simulation. The paper highlights several potential applications of WFMs for Physical AI builders:

  • Policy evaluation: Testing policies in a WFM before real-world deployment.
  • Policy initialization: Using a WFM to provide a starting point for policy models.
  • Policy training: Training policies via reinforcement learning within a WFM.
  • Planning/Model-Predictive Control: Simulating future states under different actions to choose optimal sequences (see the planning sketch after this list).
  • Synthetic data generation: Creating diverse training data, potentially conditioned on metadata like depth or semantic maps for Sim2Real transfer.
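
To make the planning use case concrete, the following is a minimal random-shooting MPC loop against a generic world-model interface. The `WorldModel.rollout` method, the reward function, and all hyperparameters are hypothetical placeholders for illustration, not part of the Cosmos API.

```python
import numpy as np

def random_shooting_mpc(world_model, state, reward_fn,
                        horizon=8, n_candidates=256, action_dim=7):
    """Pick the first action of the best sampled action sequence.

    `world_model.rollout(state, actions)` is a hypothetical interface that
    returns the predicted observation sequence for a candidate action plan.
    """
    best_return, best_plan = -np.inf, None
    for _ in range(n_candidates):
        # Sample a candidate action sequence (e.g., joint velocity commands).
        plan = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        # Simulate the plan inside the world model instead of the real robot.
        predicted_obs = world_model.rollout(state, plan)
        total = sum(reward_fn(obs) for obs in predicted_obs)
        if total > best_return:
            best_return, best_plan = total, plan
    return best_plan[0]  # execute only the first action, then replan
```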

The Cosmos platform comprises several key components:

1. Data Curation:

A scalable pipeline is developed to curate high-quality video datasets for tokenizer and WFM training. The pipeline involves 5 steps:

  • Splitting: Raw videos are segmented into shorter clips without scene changes using robust shot detection (like TransNetV2 (2009.04741)) and transcoded into a consistent format (MP4 with h264_nvenc) for efficient processing. Hardware acceleration (NVDEC/NVENC on GPUs like L40S) is heavily utilized and optimized using PyNvVideoCodec within a Ray-based orchestration (1712.05889).
  • Filtering: Low-quality or irrelevant clips are removed, and valuable clips are tagged. This includes motion filtering (using optical flow; see the sketch after this list), visual quality filtering (perceptual scores via a DOVER-based model (2303.17074), aesthetic scores), text overlay filtering (using an MLP classifier on video embeddings), and video type filtering (classifying content based on a custom taxonomy using a VLM-labeled dataset).
  • Annotation: A large vision-language model (VLM), specifically an internal VILA model (2408.10188, 2406.08581) fine-tuned for video captioning, generates detailed text descriptions for each clip. Inference efficiency for VILA is boosted using FP8 quantization with TensorRT-LLM, achieving a 10x speedup.
  • Deduplication: Semantic duplicates are removed using InternVideo2 embeddings (2409.12191) and GPU-accelerated k-means clustering [RAPIDS], based on methods like SemDeDup (2303.09540) and DataComp (2409.18869). About 30% of the data is removed.
  • Sharding: Processed clips are packaged into WebDataset shards, organized by resolution, aspect ratio, and length to align with the training curriculum.
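
As an illustration of the motion-filtering step above, the sketch below scores a clip by its mean optical-flow magnitude using OpenCV's Farneback estimator and rejects near-static or erratic clips. The thresholds and sampling stride are placeholder choices, not values from the paper.

```python
import cv2
import numpy as np

def motion_score(video_path, stride=4, max_frames=64):
    """Mean optical-flow magnitude over sampled frame pairs."""
    cap = cv2.VideoCapture(video_path)
    prev, mags, idx = None, [], 0
    while len(mags) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev is not None:
                flow = cv2.calcOpticalFlowFarneback(
                    prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                mags.append(np.linalg.norm(flow, axis=-1).mean())
            prev = gray
        idx += 1
    cap.release()
    return float(np.mean(mags)) if mags else 0.0

def keep_clip(video_path, low=0.3, high=25.0):
    """Placeholder thresholds: drop near-static clips and shake outliers."""
    return low <= motion_score(video_path) <= high
```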

2. Tokenizer:

Cosmos Tokenizer is a suite of continuous and discrete visual tokenizers designed for efficiency and high reconstruction quality.

  • Architecture: An attention-based encoder-decoder structure with temporal causality is used. It operates in the wavelet space using a 3D Haar wavelet transform [lepik2014haar] followed by causal spatio-temporal factorized convolutions and attention layers. Layer Normalization is preferred over Group Normalization.
  • Token Types: Continuous tokenizers use a standard Autoencoder (AE) formulation for latent embeddings. Discrete tokenizers employ Finite Scalar Quantization (FSQ) (2309.15505) for quantized indices (e.g., 6 FSQ levels for a vocabulary size of 64,000; a minimal FSQ sketch follows this list).
  • Training: A joint image and video training strategy alternates batches. Training uses L1, Perceptual (VGG-19 features), Optical Flow (2003.02118), and Gram-matrix losses, with adversarial loss in a second fine-tuning stage.
  • Evaluation: Evaluated on standard image (MS-COCO (1405.0314), ImageNet (0909.0530)) and video (DAVIS (1604.00855)) benchmarks, and a new dataset called TokenBench (curated from BDD100K (2002.07296), EgoExo-4D (2403.14836), BridgeData V2 (2309.14381), Panda-70M (2401.03186)). Cosmos Tokenizer significantly outperforms baselines in reconstruction quality (PSNR, SSIM, rFID (1706.08500), rFVD (1904.06770)) and runtime efficiency (2x-12x faster), while having fewer parameters.
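
Since FSQ is compact enough to show directly, here is a minimal PyTorch sketch with a straight-through estimator. The six-dimensional level configuration (8, 8, 8, 5, 5, 5) is chosen because its product is 64,000, matching the reported vocabulary size; the bounding details are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

class FSQ(torch.nn.Module):
    """Finite Scalar Quantization: round each latent channel onto a small
    fixed grid; the joint grid acts as an implicit codebook. With levels
    (8, 8, 8, 5, 5, 5) the codebook has 8*8*8*5*5*5 = 64,000 entries."""

    def __init__(self, levels=(8, 8, 8, 5, 5, 5)):
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels, dtype=torch.float32))

    def forward(self, z):
        # z: (..., len(levels)) raw encoder outputs, one channel per FSQ dim.
        half = (self.levels - 1) / 2
        offset = (self.levels % 2 == 0).float() * 0.5  # even counts need a half-bin shift
        shift = torch.atanh(offset / half)
        bounded = torch.tanh(z + shift) * half - offset  # covers L round-off bins
        # Round to the grid; the straight-through estimator passes gradients.
        return bounded + (torch.round(bounded) - bounded).detach()

    def to_indices(self, z_q):
        # Flatten per-channel integer codes into one token id in [0, 63999].
        half = (self.levels - 1) / 2
        offset = (self.levels % 2 == 0).float() * 0.5
        digits = (z_q + half + offset).round().long()    # each in [0, L-1]
        bases = torch.cat([torch.ones_like(self.levels[:1]),
                           self.levels[:-1]]).cumprod(0).long()
        return (digits * bases).sum(-1)
```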

3. World Foundation Model Pre-training:

Pre-trained WFMs are built using two scalable paradigms: diffusion and autoregressive models, leveraging Transformer architectures (1706.03762). Training is performed on a cluster of 10,000 NVIDIA H100 GPUs.

  • Diffusion-based WFM:
    • Formulation: Uses the EDM approach (2401.03575, 2206.00364) for denoising score matching.
    • Architecture: Builds upon DiT (2303.09556), adapted for video. Features include 3D patchification, hybrid positional embedding (FPS-aware 3D RoPE (2104.09864) + learnable APE), cross-attention for text conditioning (using T5-XXL embeddings (2002.07296)), QK-Normalization (2309.14322), and AdaLN-LoRA (2408.14837) for parameter efficiency (reducing an 11B-parameter model to 7B).
    • Training: Joint image and video training with domain-specific normalization. Progressive training starts at lower resolution/frame count and scales up. Multi-aspect training handles various aspect ratios using reflection padding. Mixed-precision (BF16/FP32) training with AdamW (1904.00962) is used. Text conditioning uses T5-XXL and can leverage classifier-free guidance (2207.12598); see the guidance sketch at the end of this section. Video2World models are trained by conditioning on previous frames with augmented noise.
    • Scaling: Uses FSDP (2308.09819) and Context Parallelism (CP) (2310.01889, 2309.14509) (P2P variant from TransformerEngine [nvidia_transformer_engine]) to handle large memory requirements.
    • Enhancer: A Prompt Upsampler (Cosmos-1.0-PromptUpsampler-12B-Text2World), fine-tuned from Mistral-NeMo-12B-Instruct [mistral_nemo_2024] using a long-to-short captioning strategy, converts short user prompts to longer, detailed prompts matching the training distribution. Pixtral-12B (2410.07073) is used for Video2World prompt upsampling.
    • Results: 7B and 14B Text2World and Video2World models generate high-quality, dynamic videos. 14B models show finer details and better motion stability.
  • Autoregressive-based WFM:
    • Formulation: World simulation as next-token prediction using discrete video tokens from Cosmos-1.0-Tokenizer-DV8x16x16.
    • Architecture: Transformer decoder (Llama3-style (2407.21783)). Features include 3D-aware positional embeddings (YaRN-extended 3D RoPE (2309.00071) + 3D APE), cross-attention for text conditioning (using T5-XXL), QK-Normalization (2309.14322), and Z-loss [de2016z] for training stability. Vocabulary size is 64,000.
    • Scaling: Uses Tensor Parallelism (TP) (1909.08053) and Sequence Parallelism (SP) (2308.09819) to manage memory.
    • Training: Multi-stage training starts with video prediction (17 then 34 frames) and adds text conditioning in later stages with joint image/video data. A "cooling-down" phase with high-quality data is used. Models include 4B, 5B (Video2World), 12B, and 13B (Video2World) variants.
    • Inference Optimization: Leverages LLM techniques like KV caching, torch.compile, and speculative decoding (Medusa (2407.09955)) for real-time generation (demonstrated at 10 FPS at 320x512 resolution); a decoding-loop sketch appears at the end of this section. Medusa adds extra decoding heads for parallel token prediction.
    • Enhancer: A Diffusion Decoder (Cosmos-1.0-Diffusion-7B-Decoder-DV8x16x16ToCV8x8x8), fine-tuned from Cosmos-1.0-Diffusion-7B-Text2World, maps discrete tokens to higher-quality continuous tokens to mitigate blurriness from aggressive discrete tokenization.
    • Results: Larger models (12B, 13B) generate sharper videos with better motion than smaller models (4B, 5B). The diffusion decoder significantly enhances sharpness. Failure cases like objects unexpectedly appearing are observed, with higher rates for smaller models and single-frame conditioning.
  • Evaluation of Pre-trained WFMs:
    • 3D Consistency: Evaluated on static scenes from RealEstate10K (1803.05872). Metrics include geometric consistency (Sampson error [hartley2003multiple], camera pose estimation success rate using SuperPoint (1808.00891) + LightGlue (2306.13643) + COLMAP (1607.07454)) and view synthesis consistency (PSNR, SSIM, LPIPS (1801.03924) by fitting 3D Gaussian splatting (2308.07915)). Cosmos WFMs significantly outperform VideoLDM (2306.09314) and achieve geometric consistency close to real videos.
    • Physics Alignment: Evaluated using synthetic physics simulations (PhysX [PhysX], Isaac Sim [IsaacSim]). Scenarios test gravity, collision, inertia, etc. Metrics include pixel-level (PSNR, SSIM), feature-level (DreamSim (2311.16011)), and object-level (Avg. IoU of tracked objects using SAMURAI (2411.11922)). Cosmos WFMs show intuitive physics understanding but struggle with complex dynamics, highlighting the need for improved data and model design. Diffusion models achieve better pixel-level quality under 9-frame conditioning.
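
As noted in the diffusion training bullet above, sampling can use classifier-free guidance under the EDM parameterization. Below is a minimal sketch of one guided Euler step of the probability-flow ODE; `denoiser`, the null embedding, and the guidance scale are generic placeholders, not Cosmos internals.

```python
def guided_denoise(denoiser, x_t, sigma, text_emb, null_emb, guidance_scale=7.0):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the text-conditioned one."""
    d_cond = denoiser(x_t, sigma, text_emb)    # prompt-conditioned D(x; sigma)
    d_uncond = denoiser(x_t, sigma, null_emb)  # null-prompt prediction
    return d_uncond + guidance_scale * (d_cond - d_uncond)

def euler_step(denoiser, x_t, sigma, sigma_next, text_emb, null_emb, scale=7.0):
    """One explicit Euler step of the EDM probability-flow ODE,
    dx/dsigma = (x - D(x; sigma)) / sigma, using the guided prediction."""
    d = guided_denoise(denoiser, x_t, sigma, text_emb, null_emb, scale)
    return x_t + (sigma_next - sigma) * (x_t - d) / sigma
```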
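
For the autoregressive models, the decoding-loop sketch promised above shows why KV caching matters: after the prefix is encoded once, each step feeds only the newest token and reuses cached attention states. The Hugging-Face-style `past_key_values` interface is an assumption for illustration, not the Cosmos API.

```python
import torch

@torch.no_grad()
def generate_video_tokens(model, prefix_tokens, n_new, temperature=1.0):
    """Sample n_new discrete video tokens after a prefix, reusing the KV cache.

    Assumes a Hugging-Face-style interface (`past_key_values`, `use_cache`);
    this is an illustration, not the Cosmos API.
    """
    past, inp = None, prefix_tokens           # (batch, prefix_len) token ids
    generated = [prefix_tokens]
    for _ in range(n_new):
        out = model(input_ids=inp, past_key_values=past, use_cache=True)
        past = out.past_key_values            # cache grows; prefix never re-encoded
        logits = out.logits[:, -1] / max(temperature, 1e-6)
        inp = torch.multinomial(logits.softmax(dim=-1), num_samples=1)
        generated.append(inp)
    return torch.cat(generated, dim=1)        # prefix + generated tokens
```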

4. Post-trained World Foundation Models:

Pre-trained WFMs are fine-tuned for specific Physical AI tasks:

  • Camera Control: Cosmos-1.0-Diffusion-7B-Video2World-Sample-CameraCond is fine-tuned on DL3DV-10K (2406.05416) by conditioning on camera poses using Plücker embeddings (2103.02667); a ray-embedding sketch follows this list. Evaluated on RealEstate10K against CamCo (2406.02509) using FID/FVD and camera trajectory alignment (rotation/translation error after Procrustes analysis). The Cosmos model significantly outperforms CamCo, generating realistic, 3D-consistent, and controllable videos that generalize to unseen camera trajectories.
  • Robotic Manipulation: Fine-tuned for instruction-based video prediction (on Cosmos-1X dataset, egocentric videos with instructions) and action-based next-frame generation (on Bridge (2209.14916) dataset, third-person videos with action vectors). Cosmos-1.0-Diffusion-7B/Autoregressive-5B-Video2World are fine-tuned by adding conditioning for instruction (text embeddings via cross-attention) or action (MLP embedder, incorporated via cross-attention or timestep embedding). Evaluated via human evaluation (instruction following, object permanence, verity) against VideoLDM-Instruction and quantitatively (PSNR, SSIM, Latent L2, FVD) against IRASim-Action (2406.14540). Cosmos models perform better than baselines on both tasks.
  • Autonomous Driving: Fine-tuned on the internal multi-view RDS dataset (3.6M 20-second clips, 6 synchronized camera views + ego-motion). Cosmos-1.0-Diffusion-7B-Text2World is fine-tuned to generate 6 views simultaneously, using view-independent positional embeddings, view embeddings, and view-dependent cross-attention. Variants include Text2World-Sample-MultiView, Text2World-Sample-MultiView-TrajectoryCond (conditioned on future 3D trajectories), and Video2World-Sample-MultiView (conditioned on previous frames for extension). Evaluated against VideoLDM-MultiView on generation quality (FID/FVD), multi-view consistency (Temporal/Cross-view Sampson Error using a robust pose estimation pipeline (2412.03526)), and trajectory consistency (Trajectory Agreement Error using multi-view pose estimation, Trajectory Following Error against ground truth). Cosmos models significantly outperform VideoLDM-MultiView and achieve consistency close to real videos. Trajectory-conditioned models accurately follow given paths. Object detection and tracking evaluation confirms physical consistency.
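
The ray-embedding sketch referenced in the camera-control bullet: per-pixel Plücker coordinates pack each camera ray as (direction d, moment o x d), giving a dense, pose-dependent conditioning map. The normalization convention and how the six channels are injected into the network are illustrative assumptions.

```python
import torch

def plucker_embedding(K, cam_to_world, height, width):
    """Per-pixel Plücker coordinates (d, o x d) for a pinhole camera.

    K: (3, 3) intrinsics; cam_to_world: (4, 4) pose. Returns (H, W, 6).
    """
    ys, xs = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                            torch.arange(width, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    # Unproject pixel centers to camera-space rays, then rotate to world space.
    dirs = (pix @ torch.inverse(K).T) @ cam_to_world[:3, :3].T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = cam_to_world[:3, 3].expand_as(dirs)   # camera center o per pixel
    moment = torch.cross(origin, dirs, dim=-1)     # o x d
    return torch.cat([dirs, moment], dim=-1)       # (H, W, 6)
```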

5. Guardrails:

A two-stage system ensures safe usage:

  • Pre-Guard: Blocks harmful text prompts using Keyword Blocking (a blocklist search after lemmatization; see the sketch after this list) and Aegis (2404.05993), an LLM-based safety classifier fine-tuned from Llama-Guard (2312.06674).
  • Post-Guard: Blocks harmful visual outputs using a Video Content Safety Filter (frame-level classifier trained on VLM-labeled, synthetic, and human-annotated data using SigLIP embeddings (2312.14424)) and a Face Blur Filter (using RetinaFace (2007.08231)). A dedicated red team actively probes the system to improve safety.
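
A minimal sketch of the keyword-blocking stage, assuming spaCy as the lemmatizer and a toy blocklist (both stand-ins; the paper does not specify these details):

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # illustrative lemmatizer choice

BLOCKLIST = {"example_banned_term"}  # placeholder; the real list is curated

def pre_guard_keyword_block(prompt: str) -> bool:
    """Return True if the prompt should be blocked (blocklist hit after
    lemmatization); the Aegis LLM classifier runs as a separate stage."""
    lemmas = {tok.lemma_.lower() for tok in nlp(prompt)}
    return bool(lemmas & BLOCKLIST)
```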

In conclusion, Cosmos WFMs provide a platform with powerful pre-trained models and tools to build physical AI systems. While they demonstrate significant progress in visual quality, 3D consistency, and controllability across diverse domains, challenges remain in achieving perfect physics adherence and comprehensive evaluation. The choice between diffusion and autoregressive models presents trade-offs, with hybrid approaches offering potential future directions. The open-source release aims to democratize access and accelerate research in Physical AI.
