
Cosmos-Reason1: Physical AI Multimodal Models

Updated 11 July 2025
  • Cosmos-Reason1 models are multimodal large language models that integrate vision, language, and structured ontologies to enable advanced physical reasoning.
  • They employ a dual-ontology design and a four-stage training pipeline, including supervised fine-tuning and reinforcement learning, to generate detailed chain-of-thought explanations.
  • These models are openly available under the NVIDIA Open Model License, supporting research and applications in robotics, autonomous vehicles, and embodied AI.

Cosmos-Reason1 models are a class of multimodal LLMs designed for high-level reasoning about the physical world, with a particular focus on integrating physical common sense and embodied decision-making capabilities for Physical AI systems. Developed to generate detailed chain-of-thought explanations and to select appropriate next-step actions from complex visual and sensory inputs, these models build on recent advances in vision-language pretraining, ontology-driven knowledge representation, and reinforcement learning. Cosmos-Reason1 models and their pre-trained checkpoints are openly available under the NVIDIA Open Model License.

1. Theoretical Motivation and Design Objectives

Cosmos-Reason1 models are motivated by the need for intelligent agents that can both understand and act within complex physical environments. This requires two foundational competencies:

  • Physical common sense: Knowledge about space, time, and physics that enables understanding of physical scenarios beyond surface pattern recognition.
  • Embodied reasoning: The ability to reason about the effects of actions, constraints, and affordances arising from specific embodiments—spanning humans, robotic arms, humanoid robots, and autonomous vehicles.

To comprehensively cover these requirements, Cosmos-Reason1 leverages structured ontologies:

  • A hierarchical ontology for physical common sense—categorizing knowledge as Space (Relationship, Plausibility, Affordance, Environment), Time (Actions, Order, Causality, Camera, Planning), and Fundamental Physics (Attributes, States, Object Permanence, Mechanics, Electromagnetism, Thermodynamics, Anti-Physics), with 16 fine-grained subcategories.
  • A two-dimensional ontology for embodied reasoning, mapping skills and constraints across agent types.

This dual-ontology approach establishes a formal basis for model specialization and evaluation in both physical understanding and decision making (2503.15558).
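The hierarchical ontology can be represented as a simple nested mapping. The sketch below is illustrative only, assuming a plain Python dictionary keyed by the three top-level categories; the category and subcategory names follow the hierarchy above, but the data structure and helper function themselves are not part of the released code.

# Illustrative encoding of the physical common sense ontology described above.
# The structure is an assumption for exposition, not an artifact from the release.
PHYSICAL_COMMON_SENSE_ONTOLOGY = {
    "Space": ["Relationship", "Plausibility", "Affordance", "Environment"],
    "Time": ["Actions", "Order", "Causality", "Camera", "Planning"],
    "Fundamental Physics": [
        "Attributes", "States", "Object Permanence", "Mechanics",
        "Electromagnetism", "Thermodynamics", "Anti-Physics",
    ],
}

# Sanity check: the ontology defines 16 fine-grained subcategories in total.
assert sum(len(v) for v in PHYSICAL_COMMON_SENSE_ONTOLOGY.values()) == 16

def category_of(subcategory: str) -> str:
    """Return the top-level category (Space/Time/Fundamental Physics) of a subcategory."""
    for category, subs in PHYSICAL_COMMON_SENSE_ONTOLOGY.items():
        if subcategory in subs:
            return category
    raise KeyError(f"Unknown subcategory: {subcategory}")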

2. Model Architecture

Cosmos-Reason1 models adopt a decoder-only multimodal LLM pipeline designed for processing both visual and language modalities.

  • Vision Encoder: The input vision encoder (InternViT-300M-V2.5) handles both images (segmented into 448×448 tiles) and videos (sampled at 2 fps, up to 32 frames), converting them into visual tokens.
  • MLP Projector: A two-layer MLP projects visual tokens into the LLM text embedding space, allowing seamless integration of vision and language data.
  • LLM Backbone: The core model is a hybrid architecture combining:
    • Mamba layers (linear-time sequence models for handling long sequences efficiently).
    • Transformer layers (to capture detailed cross-token relationships).
    • Interposed MLP blocks for additional feature mixing.
  • Model Scales: Two principal scales are released: Cosmos-Reason1-8B and Cosmos-Reason1-56B, denoting the parameter count of the LLM backbone.

Visual Tokenization and Alignment Workflow:

Image/Video → Vision Encoder → MLP Projector → [Visual Tokens]
Text Prompt → [Text Tokens]
[Visual Tokens] + [Text Tokens] → Hybrid LLM (Mamba-MultiBlock-Transformer) → Output

  • Chain-of-Thought Output: The model always generates step-by-step rationale, culminating in a final embodied action or answer in natural language.
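The token flow above can be sketched as a minimal PyTorch-style pipeline. The code below is a schematic stand-in, not the released implementation: a small linear stub replaces InternViT-300M-V2.5, and ordinary Transformer encoder layers replace the Mamba-MultiBlock-Transformer hybrid; the module names and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class VisionEncoderStub(nn.Module):
    """Stand-in for InternViT-300M-V2.5: maps 448x448 image tiles or sampled
    video frames (2 fps, up to 32 frames) to visual tokens."""
    def __init__(self, patch_dim: int, vision_dim: int):
        super().__init__()
        self.proj = nn.Linear(patch_dim, vision_dim)

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (batch, num_tiles, patch_dim) flattened tile features
        return self.proj(tiles)

class MLPProjector(nn.Module):
    """Two-layer MLP projecting visual tokens into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        return self.net(visual_tokens)

class HybridBackboneStub(nn.Module):
    """Stand-in for the hybrid Mamba/Transformer/MLP backbone; plain Transformer
    encoder layers are used here purely to keep the sketch self-contained."""
    def __init__(self, llm_dim: int, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.blocks(tokens)

# Token flow: visual tokens are projected into the text embedding space,
# concatenated with text tokens, and processed by the backbone.
vision = VisionEncoderStub(patch_dim=768, vision_dim=1024)
projector = MLPProjector(vision_dim=1024, llm_dim=1024)
backbone = HybridBackboneStub(llm_dim=1024)

tiles = torch.randn(1, 16, 768)             # placeholder visual tile features
text_embeddings = torch.randn(1, 32, 1024)  # placeholder text token embeddings
visual_tokens = projector(vision(tiles))
hidden = backbone(torch.cat([visual_tokens, text_embeddings], dim=1))
print(hidden.shape)  # torch.Size([1, 48, 1024])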

3. Training Regime

A four-stage training pipeline is implemented to achieve domain proficiency:

  1. Vision Pre-Training: The vision encoder and projector are aligned using a large collection (∼130M) of image-text pairs, with only the projector trainable while the LLM and vision encoder remain frozen.
  2. General Supervised Fine-Tuning (SFT): End-to-end training on general-purpose multimodal tasks (6M image-text + 2M video-text examples) to develop robust cross-modal associations.
  3. Physical AI SFT: Task-specific fine-tuning on datasets constructed or augmented for physical common sense (VQA-style datasets, multiple-choice, and chain-of-thought explanations) and embodied decision making (from BridgeData V2, RoboVQA, AgiBot, HoloAssist, and autonomous vehicle datasets). A salient feature is the use of an intuitive physics curriculum, including spatial puzzles, time-reversed videos, and object permanence evaluations.
  4. Physical AI RL Post-Training: Policy refinement by reinforcement learning using rule-based string-match rewards. Questions from the SFT phase are reformatted for verifiable answer checking, and the RL pipeline uses a modified GRPO algorithm for distributed, stable updates. Reward normalization and a reference model help stabilize optimization.

These stages jointly optimize the model to ground high-level reasoning both in perception and action.
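A minimal sketch of how a rule-based, verifiable reward and the group-relative advantage normalization characteristic of GRPO could look is given below. The <answer> tag convention, function names, and group size are illustrative assumptions rather than the released RL pipeline.

import re
import statistics

def string_match_reward(completion: str, reference: str) -> float:
    """Rule-based reward: 1.0 if the model's final answer matches the reference
    after normalization, else 0.0. Assumes answers are wrapped in an
    <answer>...</answer> tag (an illustrative convention, not a confirmed format)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == reference.strip().lower() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: rewards for a group of completions sampled from the
    same prompt are normalized by the group mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: four sampled completions for one multiple-choice question.
completions = ["<answer>B</answer>", "<answer>A</answer>",
               "I think the answer is B", "<answer>b</answer>"]
rewards = [string_match_reward(c, "B") for c in completions]
print(rewards)                            # [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards)) # [1.0, -1.0, -1.0, 1.0]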

4. Evaluation Benchmarks

Evaluation covers two benchmark families:

  • Physical Common Sense Reasoning: A benchmark of 604 hand-constructed questions (binary and multiple-choice) sampled from the hierarchical ontology’s full pool, testing generalization across Space, Time, and Fundamental Physics.
  • Embodied Reasoning: Six benchmarks using data from diverse physical agents and tasks (e.g., BridgeData V2, RoboVQA, AgiBot, HoloAssist, autonomous vehicles). Assessed skills include task completion verification, action affordance judgments, and action prediction.

Additional intuitive physics tasks (spatial puzzles, arrow-of-time, object permanence) probe the model’s physics grounding. Reported results demonstrate that supervised and RL post-training stages yield accuracy gains of 8–10% or more vs. general-purpose VLMs.

| Model Variant | Physical AI SFT | RL Post-training | Physics Benchmark ΔAcc. | Embodied Benchmark ΔAcc. |
|---|---|---|---|---|
| Cosmos-Reason1-8B | Yes | Yes | +8–10% | +8–10% |
| Cosmos-Reason1-56B | Yes | Yes | +8–10% | +8–10% |

Table: Differential accuracy gains (ΔAcc.) attributed to domain-specialized SFT and RL, relative to baseline general-purpose VLMs on respective benchmarks (2503.15558).
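For intuition on how such differential accuracies can be computed from the benchmark's binary and multiple-choice questions, the sketch below scores free-form model responses against gold answers and reports the gap to a baseline; the answer-extraction heuristic and the toy data are assumptions for illustration only.

def extract_choice(response: str, choices: str = "ABCD") -> str | None:
    """Heuristic: return the last standalone choice letter mentioned in a response."""
    letters = [t.strip(".()") for t in response.upper().split()
               if len(t.strip(".()")) == 1 and t.strip(".()") in choices]
    return letters[-1] if letters else None

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of responses whose extracted choice matches the gold answer."""
    correct = sum(extract_choice(p) == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical responses from a baseline VLM and a post-trained model.
gold = ["A", "C", "B", "D"]
baseline_preds = ["The answer is A.", "B", "Likely (B)", "C"]
tuned_preds = ["A", "C", "The correct option is B.", "D"]
delta_acc = accuracy(tuned_preds, gold) - accuracy(baseline_preds, gold)
print(f"ΔAcc. = {delta_acc:+.0%}")  # +50% on this toy example, not a reported result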

5. Applications and Significance

Cosmos-Reason1 models address a wide spectrum of Physical AI scenarios:

  • Robotic Manipulation: Generating action plans in open worlds, with stepwise justification.
  • Autonomous Driving: Interpreting traffic scenes and supporting proactive, safe decision sequences.
  • Agent Embodiment Generalization: Applying the same underlying reasoning pipeline to agents with differing morphologies and constraints.
  • Intuitive Physics: Adapting to challenges such as object permanence, causality, and temporal reasoning necessary for robust real-world interaction.

This suggests that Cosmos-Reason1 offers a unified approach for deploying multimodal reasoning systems that must couple perception, inference, and action in safety-critical and real-time environments.

6. Access and Licensing

Cosmos-Reason1 models, source code, and weights are released under the NVIDIA Open Model License, enabling broad academic and commercial research use. The repository and updates are accessible at:

https://github.com/nvidia-cosmos/cosmos-reason1

7. Relationship to Broader Research in Physical AI and Generalization

The development of Cosmos-Reason1 aligns with complementary research advancing:

  • World foundation models for Physical AI pretraining, supporting customization across robot morphologies and real-world domains (2501.03575).
  • Neurosymbolic and ontological representations for world modeling and compositional generalization (2310.12690).
  • Robust benchmarks for commonsense grounding and decision assessment in multimodal systems.

A plausible implication is that Cosmos-Reason1 models, through their open architecture and benchmarking approach, set a standard for evaluating and improving physical reasoning in multimodal LLMs. Their explicit ontological grounding and dual-stage domain specialization distinguish them from prior VLMs and endow them with substantial transferability to applied settings in robotics, autonomous systems, and embodied AI research.