
GR00T N1.5: Vision-Language-Action in Robotics

Updated 17 January 2026
  • GR00T N1.5 is a vision-language-action model that integrates RGB image observations and textual instructions to generate 7-DoF actions for robotic manipulation in simulation.
  • The model is benchmarked within the REALM framework, showing lower task progression and longer execution times than the π₀ and π₀-FAST policies.
  • Key challenges include missing architectural details, ambiguous training protocols, and potential negative transfer that undermines its generalization across diverse tasks.

GR00T N1.5 is a vision-language-action (VLA) model evaluated within the REALM ("A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation") framework for assessing the generalization capabilities of robotic agents tasked with executing natural language instructions. GR00T N1.5 operates by ingesting RGB image observations from the IsaacSim–DROID simulation platform, alongside free-form textual instructions, and outputs actions in a 7-DoF joint-space format suitable for robot manipulators. While it is one of three VLA policies benchmarked in REALM, virtually all architecture and training specifics remain unpublished; consequently, only high-level operational and empirical characteristics are available for the model (Sedlacek et al., 22 Dec 2025).

1. Model Overview

The only specification provided for GR00T N1.5 concerns its deployment setting: it receives as input an RGB image observation O coupled with a language instruction L, and produces a 7-DoF action a_t. GR00T N1.5 is evaluated in the same context and action space as the π₀ and π₀-FAST policies within the IsaacSim–DROID simulation. No information is furnished regarding the internal structure of its vision encoder (e.g., CNNs, Vision Transformers), language encoder (e.g., Transformer depths, embedding sizes), or the presence and nature of any multimodal fusion operator. In addition, there are no published details on its policy function π_{GR00T N1.5}(a_t | O, L), nor any specification as to whether its action outputs are generated deterministically or stochastically.
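Because only the input–output contract is published, the model can be summarized as a black-box mapping from (O, L) to a 7-DoF action. The sketch below illustrates that contract only; the function name, shapes, and placeholder body are assumptions, since nothing about the real implementation is disclosed.

```python
import numpy as np

def groot_n15_policy(observation: np.ndarray, instruction: str) -> np.ndarray:
    """Hypothetical interface for GR00T N1.5 as benchmarked in REALM.

    observation: an RGB image (here assumed shape (H, W, 3)) from IsaacSim-DROID.
    instruction: a free-form natural-language task description.
    Returns a 7-DoF joint-space action; whether the underlying policy is
    deterministic or stochastic is not reported.
    """
    assert observation.ndim == 3 and observation.shape[-1] == 3
    # Placeholder output: the real encoders, fusion, and policy head are unpublished.
    action = np.zeros(7, dtype=np.float32)
    return action

# Example call with a dummy observation and an illustrative instruction.
a_t = groot_n15_policy(np.zeros((224, 224, 3), dtype=np.uint8), "pick up the red cube")
```

Only the signature reflects REALM's description; everything inside the function stands in for undisclosed internals.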

2. Training Protocols and Data

The documentation in REALM is restricted to a single procedural statement: “we fine-tune GR00T ourselves to operate in the same action space as the π models.” The only explicit reference to a training resource is the DROID dataset. No breakdown of the pretraining or fine-tuning data regime is provided. Crucially, the loss functions are not described: neither their functional forms (e.g., L_lang, L_vis, L_act) nor any weighting coefficients (α, β, γ) for a composite objective. This absence precludes any inference about the optimization strategy, regularization, or objective formulation used for GR00T N1.5, whether under a reinforcement learning or supervised learning paradigm.
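If GR00T N1.5 did use a weighted multi-term objective of the kind the notation above suggests, it would take a form like the following. This is purely illustrative: REALM reports neither the terms nor the weights, and there is no evidence the model uses such a decomposition.

```python
def composite_loss(l_lang: float, l_vis: float, l_act: float,
                   alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Illustrative composite objective L = alpha*L_lang + beta*L_vis + gamma*L_act.

    All terms and weights are hypothetical; REALM discloses no objective
    for GR00T N1.5.
    """
    return alpha * l_lang + beta * l_vis + gamma * l_act
```

With unit weights the objective reduces to a plain sum of the three terms; the weights would otherwise trade off language, vision, and action supervision.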

3. Integration with the REALM Benchmark

Operationally, GR00T N1.5 is benchmarked identically to π₀ and π₀-FAST. Each model is tasked with interpreting a visual observation O and a textual instruction L to emit a joint-space action a_t. The deployment pipeline involves direct ingestion of IsaacSim–DROID outputs, and action generation in the corresponding 7-DoF space. No unique deployment protocol, environment adaptation, or task scheduling is indicated for GR00T N1.5. The evaluation pertains to multiple manipulation skills, 15 environment perturbation factors, and a set of more than 3,500 objects, reflecting diverse and challenging operational conditions designed to probe model robustness and generalization. The only procedural qualification is that GR00T N1.5 was “fine-tuned” for compatibility with this shared action space.
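The shared benchmarking procedure described above amounts to a standard closed-loop rollout: observe, query the policy, step the simulator, repeat. A schematic version follows, with the environment and policy interfaces assumed rather than taken from REALM's code.

```python
def rollout(env, policy, instruction: str, max_steps: int = 100) -> float:
    """Run one episode: feed each RGB observation plus the instruction to the
    policy and apply the returned 7-DoF action in the simulator.

    Returns the final task-progression score in [0, 1], REALM's headline
    metric. The env/policy interfaces here are illustrative assumptions.
    """
    obs = env.reset(instruction)
    progression = 0.0
    for _ in range(max_steps):
        action = policy(obs, instruction)      # 7-DoF joint-space action
        obs, progression, done = env.step(action)
        if done:
            break
    return progression
```

Any policy exposing the (observation, instruction) → action signature, whether π₀, π₀-FAST, or GR00T N1.5, could be evaluated by a loop of this shape.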

4. Empirical Performance and Generalization

Figure 1 in REALM reports GR00T N1.5’s mean task progression under both nominal and perturbed scenarios (represented as the green curve). Across nearly all tasks, GR00T N1.5 clearly underperforms both π₀ and π₀-FAST, seldom exceeding a task progression score of 0.3 in settings where π₀-FAST achieves 0.6–0.7. Due to its consistently low baseline, the paper concludes that “the measured effects of individual perturbations [on GR00T] are less informative,” shifting analytical focus to the higher-performing π policies. As a result, the REALM report omits tables or figures containing GR00T N1.5’s absolute or relative numerical success rates, task completion times, or simulated-to-real correlation metrics; all such granular analyses are reserved for the other baseline models (Sedlacek et al., 22 Dec 2025).
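The headline quantity here is a mean task-progression score per policy, aggregated over episodes and evaluation conditions. A minimal sketch of that aggregation is below; the condition names and scores are invented placeholders, since REALM publishes no per-condition table for GR00T N1.5.

```python
from statistics import mean

def mean_task_progression(scores_by_condition: dict) -> float:
    """Average per-episode task progression (each score in [0, 1]) over all
    evaluated conditions, yielding one summary number per policy.

    This mirrors the kind of summary shown in REALM's Figure 1; the exact
    aggregation REALM uses is an assumption here.
    """
    all_scores = [s for scores in scores_by_condition.values() for s in scores]
    return mean(all_scores)

# Invented illustrative values only, not REALM's numbers.
example = {"nominal": [0.30, 0.25], "lighting-shift": [0.20, 0.15]}
summary = mean_task_progression(example)
```

A single pooled mean like this is what makes cross-policy comparison (e.g., ~0.3 vs. 0.6–0.7) possible at a glance.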

5. Failure Modes and Analysis

No ablation studies, error breakdowns, or systematic failure mode analyses are presented for GR00T N1.5. The only data point provided in this context highlights execution time: “GR00T generally takes around 30 seconds with significant variance,” notably higher than the approximately 20-second average shown by the π models. There is no discussion of whether this increased latency is attributable to perception, language processing, or actuation. The absence of diagnostic insight constrains the ability to assess the underlying factors responsible for the model’s low performance.

6. Limitations and Future Directions

The authors do not distinguish GR00T N1.5 in their concluding assessments. However, three general lessons are stated for all models in the evaluation, including GR00T N1.5:

  • Semantic and Behavioral Robustness: The need to advance robustness under both semantic (instructional) and behavioral (dynamic/environmental) domain shifts, as all tested models exhibit substantial vulnerabilities in these regimes.
  • VLM Backbone Generalization: The need to preserve the generalization capacity of vision-language model (VLM) backbones when fine-tuning on task-specific robotic data; it is hypothesized that “full fine-tuning on DROID data … harms the generalization capabilities” of all VLA models.
  • Visual Fidelity and Control Alignment: The importance of further improvements in visual realism and control alignment in simulation, to better correlate simulation-based evaluations with real-world deployment outcomes.

A plausible implication is that GR00T N1.5’s poor relative and absolute scores may be symptomatic of negative transfer or overfitting during robotics-specific fine-tuning, as hypothesized generally in the “Lessons Learned” section.

7. Summary Table: Reported Information on GR00T N1.5

| Category            | Information Reported in REALM       | Information Absent                 |
|---------------------|-------------------------------------|------------------------------------|
| Architecture        | None                                | Encoder, fusion, policy details    |
| Training Procedure  | Fine-tuned to DROID action space    | Pretraining data, all objectives   |
| Evaluation Protocol | Action in 7-DoF joint-space format  | Policy form, sampling method       |
| Empirical Results   | Underperforms π₀ and π₀-FAST        | Success rates, ablations, failures |

All published details on GR00T N1.5 are restricted to performance summaries and high-level integration points. Technical specifics pertaining to architecture, training methodology, policy formulation, and fine-grained benchmarking are not disclosed in REALM (Sedlacek et al., 22 Dec 2025).
