
GR00T N1.5 Vision-Language-Action Model

Updated 29 December 2025
  • GR00T N1.5 is a diffusion-based Vision-Language-Action model that predicts short robot action sequences from current multi-view images and language instructions without temporal memory.
  • It performs well on real-world tabletop manipulation tasks when paired with history-aware augmentation, but exhibits limited generalization and robustness under systematic simulation perturbations.
  • History-aware extensions like HAMLET significantly improve success rates, underscoring the practical benefits of integrating temporal context in robotic manipulation.

GR00T N1.5 is a diffusion-based Vision-Language-Action (VLA) model developed for robotic manipulation tasks, characterized by its memoryless architecture and its use of a pre-trained vision-language model (VLM) to encode observations and instructions. By design, it predicts short sequences of future robot actions based solely on the current observation and a language command. GR00T N1.5 has been used as a state-of-the-art VLA policy on various benchmarks and as a backbone for history-aware extensions, most notably HAMLET. Performance assessments reveal both strengths and weaknesses: it performs strongly on certain real-world tabletop tasks given appropriate augmentation, yet shows limited generalization and robustness under systematic environment perturbations in simulation.

1. Core Architecture and Operational Pipeline

GR00T N1.5’s architecture comprises two principal stages: perceptual encoding and diffusion-based action prediction. At each timestep $t$, the model receives as input a set of multi-view images $I_t^1, \ldots, I_t^n$, a language instruction $c$, and the proprioceptive state $s_t$.

  • VLM Encoding: Observations and language instructions are processed by a frozen pre-trained vision-language model,

$$h_t = \mathcal{F}_\theta(o_t, c)$$

where $o_t = [I_t^1, \ldots, I_t^n]$. The VLM outputs a frame-level embedding $h_t \in \mathbb{R}^d$.

  • Diffusion Action Expert: This embedding is used as input to a diffusion-based action predictor,

$$[a_t, \ldots, a_{t+k-1}] = \mathcal{A}_\psi(h_t, s_t)$$

where $\mathcal{A}_\psi$ is trained to denoise and predict a chunk of $k$ future actions, given the current frame’s perceptual and state encoding.

A distinguishing feature is that GR00T N1.5 does not aggregate information temporally: its policy’s prediction is based solely on the present snapshot $(h_t, s_t)$, making it fundamentally memoryless (Koo et al., 1 Oct 2025).
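To make the memoryless control loop concrete, the following is a minimal PyTorch sketch of the per-step inference described above. The `vlm` and `action_expert` modules and all names here are illustrative placeholders, not the actual GR00T N1.5 API.

```python
import torch
import torch.nn as nn

class MemorylessVLAPolicy(nn.Module):
    """Sketch of GR00T N1.5-style per-step inference; names are illustrative."""

    def __init__(self, vlm: nn.Module, action_expert: nn.Module, chunk_size: int = 16):
        super().__init__()
        self.vlm = vlm                      # frozen F_theta: (images, instruction) -> h_t
        self.action_expert = action_expert  # diffusion head A_psi
        self.chunk_size = chunk_size        # k: actions predicted per chunk

    @torch.no_grad()
    def act(self, images: torch.Tensor, instruction: str, proprio: torch.Tensor) -> torch.Tensor:
        # h_t = F_theta(o_t, c): one frame-level embedding per call.
        h_t = self.vlm(images, instruction)
        # [a_t, ..., a_{t+k-1}] = A_psi(h_t, s_t): denoise a chunk of k actions.
        # Nothing is cached between calls, so the policy is memoryless by construction.
        return self.action_expert(h_t, proprio, self.chunk_size)
```

Because `act` conditions only on the current frame, tasks whose correct next action depends on earlier events (e.g., “Pick-and-Place Twice”) fall outside what this interface can represent; this is exactly the gap the history-aware extension in Section 5 targets.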

2. Application Contexts and Integration in Benchmarks

In practical deployments, GR00T N1.5 has been subjected to real-world robotic evaluation and used as a testbed or backbone for extensions:

  • Empirical Deployments: It serves as a standalone controller for tabletop manipulation tasks on systems such as the Franka arm, receiving multi-view visual inputs, proprioceptive state, and a high-level instruction.
  • Benchmark Adaptations: Within the REALM simulation environment, GR00T N1.5 is fine-tuned to a 7-DoF joint-position action space to align with competing policies (π₀, π₀-FAST); inputs are restricted to a single eye-in-hand RGB image plus a language prompt, and outputs are 7-dimensional joint targets at 50 Hz (Sedlacek et al., 22 Dec 2025). This interface is summarized in the sketch below.
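As a rough illustration, the REALM adaptation above can be captured in a small configuration object; the class and field names are hypothetical and do not come from either codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RealmAdaptationConfig:
    """Hypothetical summary of the REALM fine-tuning interface (Sedlacek et al., 22 Dec 2025)."""
    action_space: str = "joint_position"                   # aligned with pi_0 and pi_0-FAST
    action_dim: int = 7                                    # 7-DoF joint targets
    control_hz: float = 50.0                               # commands emitted at 50 Hz
    camera_views: tuple[str, ...] = ("eye_in_hand_rgb",)   # single wrist camera, no external views
    language_input: bool = True                            # one high-level instruction per episode
```

The restriction to a single wrist view is notable because the real-world deployments above provide multi-view visual inputs.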

3. Performance Characteristics and Empirical Results

Evaluation of GR00T N1.5 across settings highlights divergent areas of strength and deficiency.

Real-World, History-Dependent Tasks (Tabletop Manipulation)

On three history-dependent Franka arm tasks (“Pick-and-Place Twice,” “Cover-and-Stack,” “Swap Cubes”), average success rates are reported as follows (50 demonstrations per task, 24 held-out evaluation trials per task):

Method                 Uses History   Avg. Success (%)
GR00T N1.5             No             29.2
Naïve multi-frame      4 frames       45.8
GR00T N1.5 + HAMLET    Yes            76.4

Editor's term: "Naïve multi-frame" refers to simple frame concatenation without dedicated temporal modeling. The substantial 47.2 percentage-point gain from history-aware augmentation underscores the baseline’s memory bottleneck (Koo et al., 1 Oct 2025).
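For contrast with HAMLET's learned memory, the naïve multi-frame baseline can be read as plain concatenation of the last few frame embeddings before the diffusion head; a minimal sketch, with all names hypothetical:

```python
import torch

def naive_multiframe_feature(frame_embeddings: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate the last 4 frame embeddings h_{t-3}, ..., h_t into one
    conditioning vector; no dedicated temporal module, per the editor's note above."""
    return torch.cat(frame_embeddings[-4:], dim=-1)
```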

Simulation Generalization (REALM Benchmark)

Within the REALM environment, task progression and robustness under 15 systematic environment/task perturbations are as follows:

Task Set            π₀-FAST (default)   π₀ (default)   GR00T N1.5 (default)   GR00T N1.5 (perturbed mean)
REALM-base          0.68                0.57           0.23                   0.19 ± 0.07
REALM-articulated   0.48                0.35           0.10                   0.08 ± 0.05

Binary success rates for GR00T N1.5 are near zero and omitted from detailed presentation. Most rollouts (≈70%) terminate via the 100-second timeout. No perturbation-by-perturbation breakdown is reported, as baseline performance is already so low that perturbation effects are uninformative in these regimes (Sedlacek et al., 22 Dec 2025).

4. Failure Modes and Robustness Limitations

Observed limitations of GR00T N1.5 include:

  • Low Default Performance: Task progression scores remain below 0.25 for standard pick-and-place and below 0.10 for articulated open/close tasks when evaluated in REALM.
  • High Action Variance and Unstable Completion: Large run-to-run variance, frequent timeouts, and a mean completion time of ~30 seconds per successful rollout.
  • Negligible Robustness to Perturbations: Mean change in progression under visual, semantic, and behavioral shifts is less than 0.05; since default scores are already near the floor, this reflects uniformly poor performance rather than genuine invariance, indicating minimal generalization beyond training regimes.
  • No Temporal Memory: The memoryless construction likely limits policy confidence, adaptability, and effective closed-loop control given non-Markovian observations or partial observability (Sedlacek et al., 22 Dec 2025).

5. Extensions via History-Aware Augmentation

The HAMLET framework demonstrates a significant performance uplift by augmenting GR00T N1.5 into a history-aware policy (Koo et al., 1 Oct 2025). Key augmentations are:

  • Moment Tokens: Per-timestep learnable tokens appended to the VLM input, producing time-distinctive compressed representations $m'_t$. These are initialized using time-contrastive learning (TCL), optimizing

$$\mathcal{L}_{\mathrm{TCL}} = -\sum_{t=1}^{B} \log \frac{\exp(\mathrm{sim}(z_t, z_t^+)/\tau)}{\exp(\mathrm{sim}(z_t, z_t^+)/\tau) + \exp(\mathrm{sim}(z_t, z_t^-)/\tau)}$$

where positive and negative pairs are drawn from augmented and non-matching timesteps, respectively (a code sketch of this loss follows the list).

  • Memory Transformer: A lightweight 2-layer Transformer $\mathcal{M}_\phi$ attends causally over a sliding window (length $T$) of $m'_t$ to generate a history-augmented summary $\tilde{m}'_t$.
  • Concatenated Action Input: The augmented feature $[h_t; \tilde{m}'_t]$ replaces $h_t$ as input to the diffusion head; only the memory and token modules are trained during fine-tuning.
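A compact PyTorch sketch of these two trainable pieces follows, assuming the moment-token embeddings have already been extracted from the VLM; module and function names are illustrative, not HAMLET's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def time_contrastive_loss(z: torch.Tensor, z_pos: torch.Tensor,
                          z_neg: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """L_TCL from the equation above; z, z_pos, z_neg have shape (B, d).

    For each anchor z_t, the positive z_t^+ is an augmented view of the same
    timestep and the negative z_t^- comes from a non-matching timestep.
    """
    sim_pos = F.cosine_similarity(z, z_pos, dim=-1) / tau
    sim_neg = F.cosine_similarity(z, z_neg, dim=-1) / tau
    logits = torch.stack([sim_pos, sim_neg], dim=-1)            # (B, 2)
    targets = torch.zeros(z.size(0), dtype=torch.long, device=z.device)
    # Cross-entropy against class 0 is exactly -log(e^pos / (e^pos + e^neg)).
    return F.cross_entropy(logits, targets, reduction="sum")

class MemoryTransformer(nn.Module):
    """Lightweight 2-layer causal Transformer M_phi over a window of moment tokens."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, moment_tokens: torch.Tensor) -> torch.Tensor:
        # moment_tokens: (B, T, d), the sliding window m'_{t-T+1}, ..., m'_t.
        T = moment_tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=moment_tokens.device), diagonal=1)
        out = self.encoder(moment_tokens, mask=causal)
        return out[:, -1]   # history-augmented summary \tilde{m}'_t

# The diffusion head then consumes the concatenated feature in place of h_t:
#   fused = torch.cat([h_t, memory(m_window)], dim=-1)
```

Only these modules (plus the moment tokens) are trained during fine-tuning, while the VLM and diffusion head stay fixed, which is what keeps the reported training overhead small.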

These interventions improve average success rates on complex tasks from 29.2% to 76.4% with minimal training overhead (~30 K TCL steps), and the temporal-attention modules transfer effectively across benchmarks such as LIBERO and RoboCasa (Koo et al., 1 Oct 2025).

6. Broader Implications and Open Challenges

GR00T N1.5’s benchmark results illuminate several open questions:

  • Generalization Deficit: The observed lack of robustness in simulation benchmarks such as REALM, across diverse perturbation regimes, suggests limitations in architectural invariance, data coverage, or policy regularization practices.
  • Memory Integration: Empirically demonstrated gains from HAMLET highlight the importance of explicit history modeling for real-world, long-horizon robotic tasks, especially in the presence of occlusions and non-Markovian dynamics.
  • Architecture Agnosticism: HAMLET-style memory mechanisms are backbone-agnostic and have demonstrated transferability, suggesting a modular future direction for VLA policy design.
  • Evaluation Gaps: The scant reporting of internal architectural details, hyperparameters, or layer-wise analyses in both direct and derivative literature constrains precise comparative study and ablation-based understanding.

A plausible implication is that future research should emphasize report completeness—especially regarding architecture and training regimes—to facilitate reproducibility, diagnostic analysis, and more principled benchmarking (Sedlacek et al., 22 Dec 2025, Koo et al., 1 Oct 2025).
