
Gemini Robotics 1.5: VLA Models for Robotics

Updated 3 July 2026
  • Gemini Robotics 1.5 is a family of advanced vision-language-action models that fuse visual, linguistic, and motor control capabilities for solving complex multi-step robotic tasks.
  • It incorporates an explicit motion transfer module that enables zero-shot skill adaptation across diverse robot morphologies, enhancing cross-platform functionality.
  • Internal chain-of-thought reasoning and embodied spatial planning clearly improve task decomposition and execution interpretability in real-world robotic applications.

Gemini Robotics 1.5 is a family of advanced multi-embodiment Vision-Language-Action (VLA) foundation models for robotics, featuring an integrated architecture for perception, reasoning, and dexterous physical control. It incorporates explicit motion transfer mechanisms for zero-shot skill adaptation across diverse robot morphologies and introduces internal natural-language reasoning—enabling a generalist robotic system that can perceive, think, and act to solve complex, multi-step tasks. The family comprises the core Gemini Robotics 1.5 VLA model and the Gemini Robotics-ER 1.5 embodied reasoning model, designed for hierarchical orchestration, robust spatial reasoning, and interpretable behavior explanation (Abdolmaleki et al., 2 Oct 2025).

1. Model Family Architecture

The Gemini Robotics 1.5 model family integrates several architectural innovations:

  • Vision Backbone: A multi-scale Vision Transformer (ViT) generates per-patch embeddings from raw camera inputs.
  • Language Module: An LLM from the Gemini family encodes open-ended instructions and internal reasoning as token embeddings.
  • Cross-modal Fusion: Multiple transformer cross-attention layers fuse visual and linguistic representations.
  • Action Decoder: A lightweight MLP or transformer head maps fused features to continuous action commands (e.g., end-effector poses, joint deltas, gripper actuation).
  • Motion Transfer Module: A learnable alignment network normalizes embodiment-specific trajectories to a shared canonical space, supporting cross-robot transfer.
  • Embodied Reasoning Model (GR-ER 1.5): An optimized Gemini backbone for reasoning, with specialized heads for object segmentation, pointing, trajectory prediction, progress estimation, and success/failure classification.

The overall agentic system couples GR 1.5 for low-level closed-loop control and GR-ER 1.5 for high-level planning, spatial understanding, and feedback-driven adaptation (Abdolmaleki et al., 2 Oct 2025).

2. Multimodal Perception, Reasoning, and Action

Gemini Robotics 1.5 processes a sequence of images $I_t$, an instruction $T$, and a history of past actions, fusing them to produce actionable outputs. Its reasoning loop operates as follows:

  • At each sub-step $i$, the model may generate a “thought” token $\tau_i$ (chain-of-thought reasoning) to explicitly break down the task.
  • The fused representation $H$ is dynamically updated with these internal language traces, enhancing context and subgoal decomposition.
  • The action decoder $g_{\theta_a}$ maps the current fused state to the next motion command $\Delta x_i \in \mathbb{R}^D$.
  • The reasoning loop continues until a task-completion predicate is satisfied.

Mathematical Formulation:

  • Vision encoder: $V_{\theta_v}(I) \in \mathbb{R}^{P \times d}$
  • Language encoder: $L_{\theta_l}(T, \tau_{1:k}) \in \mathbb{R}^{M \times d}$
  • Cross-modal fusion: $H = f_{\text{fuse}}\big(V_{\theta_v}(I), L_{\theta_l}(T, \tau_{1:k})\big)$
  • Action decoding: $\Delta x_i = g_{\theta_a}(H)$
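As a rough illustration of the cross-modal fusion step, the sketch below implements single-head scaled dot-product cross-attention in plain Python, with language tokens acting as queries over visual patches. The direction of attention, the absence of learned projection matrices, and the names `cross_attention`, `patches`, and `tokens` are assumptions made for illustration; the source does not specify the fusion layers at this level of detail.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: M x d list (assumed here to be language token embeddings)
    keys, values: P x d lists (assumed here to be visual patch embeddings)
    Returns an M x d list in which every row is a weighted mix of `values`.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return fused


# Toy shapes: P = 3 patches, M = 2 tokens, d = 2 (hypothetical values).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
H = cross_attention(tokens, patches, patches)
```

Each row of `H` is a convex combination of the patch embeddings, so a token attends most strongly to the patches it is most similar to. A real fusion stack would add learned projections, multiple heads, and several layers.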

The internal reasoning process is explicitly interleaved:

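  • Thought generation: the model emits $\tau_i$ conditioned on the current fused representation $H$.
  • Context update: $H = f_{\text{fuse}}\big(V_{\theta_v}(I), L_{\theta_l}(T, \tau_{1:i})\big)$
  • Action decoding: $\Delta x_i = g_{\theta_a}(H)$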

This internal “thinking” mechanism improves multi-step task performance, enables implicit detection and recovery of subtask failures, and increases user interpretability through language-trace explanation (Abdolmaleki et al., 2 Oct 2025).
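The think-then-act loop described in this section can be sketched as follows. This is a minimal outline under stated assumptions, not the authors' implementation: `observe`, `think`, `act`, and `is_done` are hypothetical callables standing in for the camera, the thought generator, the action decoder, and the task-completion predicate, and `max_steps` is an added safety bound.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class LoopState:
    """Context carried across sub-steps: instruction, thoughts, and actions."""
    instruction: str
    thoughts: List[str] = field(default_factory=list)
    actions: List[Sequence[float]] = field(default_factory=list)


def run_reasoning_loop(
    instruction: str,
    observe: Callable[[], object],
    think: Callable[[object, LoopState], str],
    act: Callable[[object, LoopState], Sequence[float]],
    is_done: Callable[[object, LoopState], bool],
    max_steps: int = 20,
) -> LoopState:
    """Interleave a thought and an action at each sub-step until done."""
    state = LoopState(instruction)
    for _ in range(max_steps):
        image = observe()
        state.thoughts.append(think(image, state))  # tau_i
        state.actions.append(act(image, state))     # Delta x_i
        if is_done(image, state):
            break
    return state


# Toy stand-ins (all hypothetical) so the loop can be run end to end.
SUBGOALS = ["reach for the cup", "grasp the cup", "lift the cup"]

final_state = run_reasoning_loop(
    instruction="pick up the cup",
    observe=lambda: "camera frame",
    think=lambda image, s: SUBGOALS[len(s.thoughts)],
    act=lambda image, s: [0.0, 0.0, 0.1],
    is_done=lambda image, s: len(s.actions) >= len(SUBGOALS),
)
```

Because the thoughts are kept in `state`, the resulting trace doubles as a human-readable explanation of what the policy was attempting at each sub-step.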

3. Motion Transfer Across Robot Embodiments

A distinguishing component is the Motion Transfer (MT) module, enabling generalized control over heterogeneous robot platforms (e.g., ALOHA, Franka, Apollo):

  • Each robot trajectory is normalized and encoded into a shared latent motion space via a learned encoder.
  • Per-robot decoders re-map this latent to actionable trajectories.
  • An alignment loss is minimized over paired demonstrations.

This mechanism obviates the need for separate robot-specific policies, supporting multi-embodiment learning and skill adaptation (Abdolmaleki et al., 2 Oct 2025).
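To make the encode, align, and decode pattern concrete, here is a toy sketch. Everything in it is an assumption for illustration: the per-robot "encoder" is just a division by a workspace scale, the alignment loss is a mean squared distance between time-aligned latent points, and the function names are hypothetical. The actual encoders, decoders, and loss used by the model are not given in this text.

```python
def encode(trajectory, scale):
    """Toy per-robot encoder: normalize each waypoint by a robot-specific scale."""
    return [[x / scale for x in waypoint] for waypoint in trajectory]


def decode(latent, scale):
    """Toy per-robot decoder: map latent points back into one robot's units."""
    return [[z * scale for z in point] for point in latent]


def alignment_loss(latent_a, latent_b):
    """Mean squared distance between time-aligned latent trajectories (assumed form)."""
    if len(latent_a) != len(latent_b):
        raise ValueError("paired demonstrations must have the same length")
    total = 0.0
    for za, zb in zip(latent_a, latent_b):
        total += sum((p - q) ** 2 for p, q in zip(za, zb))
    return total / len(latent_a)


# A paired demonstration of the same motion on two hypothetical robots
# whose workspaces differ only by scale (2.0 versus 0.5).
traj_a = [[0.0, 2.0], [2.0, 4.0]]
traj_b = [[0.0, 0.5], [0.5, 1.0]]

loss = alignment_loss(encode(traj_a, 2.0), encode(traj_b, 0.5))

# Cross-robot transfer: encode with robot A's encoder, decode with robot B's.
transferred = decode(encode(traj_a, 2.0), 0.5)
```

In this toy case the two demonstrations coincide in the latent space, so the loss is zero and decoding robot A's latent with robot B's decoder reproduces robot B's trajectory. Learned encoders would have to be trained toward that property rather than having it by construction.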

4. Training Methodologies and Evaluation Protocols

For the GR 1.5 VLA model:

  • Supervised Learning: Trained on large-scale datasets composed of robot trajectories, multimodal sensor data, and natural-language instructions. The loss objectives combine action regression or policy distillation, latent trajectory reconstruction, cross-robot alignment, and language modeling for internal “thoughts”.
  • Behavior Cloning: In Proc4Gem, training uses no reinforcement-based fine-tuning; trajectories are collected from a privileged-state RL expert and chopped into fixed-length windows.
  • Procedural Generation: Simulated environments are composed using Unity for rendering and MuJoCo for full physics, with domain randomization and asset variation (including furniture meshes).
  • Multilingual Prompts: The language interface is robust to non-English prompts (e.g., Italian), with no performance loss in simulation (Lin et al., 11 Mar 2025).
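The fixed-length windowing mentioned for behavior cloning could look like the sketch below. The function name, the `stride` parameter, and the choice to drop a trailing partial window are assumptions; the source only states that expert trajectories are chopped into fixed-length windows.

```python
def chop_into_windows(trajectory, window, stride=None):
    """Split a trajectory (a list of steps) into fixed-length windows.

    stride defaults to `window`, giving non-overlapping windows. A trailing
    remainder shorter than `window` is dropped (an assumption made here).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if stride is None:
        stride = window
    if stride <= 0:
        raise ValueError("stride must be positive")
    last_start = len(trajectory) - window
    return [trajectory[s:s + window] for s in range(0, last_start + 1, stride)]


# Ten expert steps, represented by their indices for readability.
expert_steps = list(range(10))
disjoint = chop_into_windows(expert_steps, window=4)
overlapping = chop_into_windows(expert_steps, window=4, stride=2)
```

With a window of 4, the ten steps yield two non-overlapping windows and the last two steps are discarded; a stride of 2 yields four overlapping windows that cover the full trajectory.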

Benchmarking (Representative Results):

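Robot           In-Dist   Instr-Gen   Act-Gen   Vis-Gen   Task-Gen
ALOHA           0.83      0.81        0.78      0.73      0.70
Bi-arm Franka   0.73      0.70        0.66      0.63      0.62
Apollo          0.74      0.77        0.66      0.66      0.62

Columns: in-distribution, instruction generalization, action generalization, visual generalization, task generalization.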

Motion Transfer yields up to +0.06 (ALOHA), +0.02 (Franka), +0.06 (Apollo) progress increase for generalization tasks. Enabling internal reasoning (“thinking”) yields an average gain of +0.07 across robots on multi-step progress (Abdolmaleki et al., 2 Oct 2025).
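One way to read the benchmark table is to compare each robot's in-distribution score with the mean of its four generalization scores. The snippet below does that arithmetic on the values from the table; the summary statistic itself is an illustrative choice made here, not a metric reported by the authors.

```python
# Progress scores copied from the table above:
# [In-Dist, Instr-Gen, Act-Gen, Vis-Gen, Task-Gen]
SCORES = {
    "ALOHA": [0.83, 0.81, 0.78, 0.73, 0.70],
    "Bi-arm Franka": [0.73, 0.70, 0.66, 0.63, 0.62],
    "Apollo": [0.74, 0.77, 0.66, 0.66, 0.62],
}


def generalization_summary(row):
    """Return (mean generalization score, gap to in-distribution score)."""
    in_dist, *generalization = row
    mean_gen = sum(generalization) / len(generalization)
    return round(mean_gen, 4), round(in_dist - mean_gen, 4)


summary = {robot: generalization_summary(row) for robot, row in SCORES.items()}

for robot, (mean_gen, gap) in summary.items():
    print(f"{robot}: mean generalization {mean_gen}, gap {gap}")
```

On these numbers the gap between in-distribution and mean generalization performance is below 0.08 for all three robots.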

In the Proc4Gem evaluation, a fine-tuned Gemini 1.5 model achieves about 70% real-world success in hard scenes versus 30% for the baseline, and shows robust out-of-distribution performance (e.g., 70% in the 1.5 m giraffe OOD condition versus 0% for the baseline). Simulation-to-reality transfer maintains about 80% of expert performance in simulation, with hardware transfer outperforming the baseline by 40 percentage points on hard tasks (Lin et al., 11 Mar 2025).

5. Embodied Reasoning and Agentic Integration

GR-ER 1.5 extends the Gemini backbone for spatial reasoning, task progress estimation, and embodied question answering:

  • Object-centric reasoning: Detection, segmentation, 2D and 3D pointing, trajectory prediction.
  • Task decomposition: Progress estimation and subtask completion.
  • Robust QA: Achieves state-of-the-art 59.6% embodied reasoning score vs ∼54% for prior approaches, while maintaining a generalist benchmark score of ∼62.8%.

As an orchestrator, the GR-ER 1.5 model can hierarchically plan and supervise task execution via chain-of-thought traces, integrating high-level symbolic plans with GR 1.5’s continuous control. For long-horizon agentic tasks, the combined system achieves higher progress scores than both “thinking-only” and Gemini 2.5 Flash orchestrated baselines (e.g., 0.83 vs 0.64 vs 0.56 on ALOHA) (Abdolmaleki et al., 2 Oct 2025).
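The division of labor between the orchestrator and the low-level policy can be sketched as a plan, execute, and verify loop. This is a hedged outline only: `plan`, `execute`, and `succeeded` are hypothetical stand-ins for the embodied reasoning model's task decomposition, the VLA policy, and the success/failure classifier, and the retry policy is an assumption.

```python
def orchestrate(task, plan, execute, succeeded, max_retries=2):
    """Run each planned subtask, retrying on failure.

    Returns (overall_success, log), where log holds one
    (subtask, attempt_number, ok) tuple per execution attempt.
    """
    log = []
    for subtask in plan(task):
        for attempt in range(1, max_retries + 2):
            execute(subtask)
            ok = succeeded(subtask)
            log.append((subtask, attempt, ok))
            if ok:
                break
        else:
            # Retries exhausted for this subtask: abort the whole task.
            return False, log
    return True, log


# Toy stand-ins (hypothetical): the second subtask fails on its first attempt.
attempts = {}


def toy_plan(task):
    return ["open the drawer", "pick up the sock", "place the sock in the drawer"]


def toy_execute(subtask):
    attempts[subtask] = attempts.get(subtask, 0) + 1


def toy_succeeded(subtask):
    return subtask != "pick up the sock" or attempts[subtask] >= 2


done, log = orchestrate("put the sock away", toy_plan, toy_execute, toy_succeeded)
```

Keeping verification outside the policy means a failed subtask is retried or escalated by the high-level model instead of silently derailing the rest of the plan.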

6. System Limitations and Prospective Research

  • Dexterity Plateau: Out-of-the-box dexterity matches, but does not surpass, previous Gemini Robotics generations.
  • Motion Transfer Dependence: Requires paired multi-robot data; a plausible implication is that unpaired large-scale video (e.g., third-person human manipulation) would further enhance embodiment generalization.
  • Latency: The “thinking” mode incurs additional inference latency; prospective improvements include early-exit or weight-sharing techniques.
  • Safety: Ongoing work includes automatic red-teaming (ASIMOV-2.0) for semantic and physical safety validation.
  • Simulation Complexity: Some physical detail limitations, such as non-placement of small objects on tables, limit domain realism in Proc4Gem (Lin et al., 11 Mar 2025, Abdolmaleki et al., 2 Oct 2025).

Anticipated future directions include leveraging uncurated video corpora for pretraining, end-to-end RL fine-tuning for enhanced low-level dexterity, and dynamic subtask planning with real-world perception for fully autonomous long-horizon operation.

7. Hardware Platforms and Deployment

  • Robotic Embodiments: Supports diverse platforms, including quadrupeds (e.g., Barkour) and multi-arm systems (Franka, Apollo, ALOHA), and is compatible with various sensor configurations.
  • Control Frequencies: In Proc4Gem, high-level reasoning and the VLA run at 2 Hz; low-level controllers (e.g., quadruped MPC) operate at 50 Hz (Lin et al., 11 Mar 2025).
  • Compute: Onboard PC for sensor processing; offboard workstation runs Gemini model inference via remote service, with action caching to mask network latency.
  • Inference: Pseudocode for inference and training protocols is provided in the respective works; per-step inference latency is not explicitly specified.
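The two control rates quoted above imply that each high-level command is held for 25 low-level ticks (50 Hz / 2 Hz). The sketch below simulates that schedule with a cached command. It is a simplification under stated assumptions: it counts ticks instead of running in real time, and `plan_command` and `track` are hypothetical stand-ins for remote model inference and the low-level controller.

```python
HIGH_LEVEL_HZ = 2
LOW_LEVEL_HZ = 50
TICKS_PER_PLAN = LOW_LEVEL_HZ // HIGH_LEVEL_HZ  # 25 low-level ticks per command


def run_control(num_ticks, plan_command, track):
    """Simulate a two-rate loop: replan every TICKS_PER_PLAN ticks, track every tick."""
    cached_command = None
    applied = []
    for tick in range(num_ticks):
        if tick % TICKS_PER_PLAN == 0:
            # Slow path: query the high-level model and cache its output.
            cached_command = plan_command(tick)
        # Fast path: the low-level controller keeps following the cached command.
        applied.append(track(cached_command, tick))
    return applied


# Toy stand-ins (hypothetical): record when planning happens.
plan_calls = []


def toy_plan_command(tick):
    plan_calls.append(tick)
    return {"target_speed": 0.1 * len(plan_calls)}


def toy_track(command, tick):
    return command["target_speed"]


applied = run_control(100, toy_plan_command, toy_track)  # two simulated seconds
```

Over 100 ticks the planner is queried four times while the tracker runs on every tick. Holding the cached command between queries is what lets the fast loop continue while a slow or remote inference call is pending.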

The Gemini Robotics 1.5 family, through explicit motion transfer, natural language reasoning, and hierarchical agentic integration, represents a significant advance in generalist robotics, enabling robust, interpretable, and cross-platform skill transfer for physical agents tasked with complex real-world operations (Lin et al., 11 Mar 2025, Abdolmaleki et al., 2 Oct 2025).
