Gemini Robotics 1.5: VLA Models for Robotics

Updated 3 July 2026

Gemini Robotics 1.5 is a family of advanced vision-language-action models that fuse visual, linguistic, and motor control capabilities for solving complex multi-step robotic tasks.
It incorporates an explicit motion transfer module that enables zero-shot skill adaptation across diverse robot morphologies, enhancing cross-platform functionality.
Internal chain-of-thought reasoning and embodied spatial planning improve task decomposition and make execution more interpretable in real-world robotic applications.

Gemini Robotics 1.5 is a family of advanced multi-embodiment Vision-Language-Action (VLA) foundation models for robotics, featuring an integrated architecture for perception, reasoning, and dexterous physical control. It incorporates explicit motion transfer mechanisms for zero-shot skill adaptation across diverse robot morphologies and introduces internal natural-language reasoning—enabling a generalist robotic system that can perceive, think, and act to solve complex, multi-step tasks. The family comprises the core Gemini Robotics 1.5 VLA model and the Gemini Robotics-ER 1.5 embodied reasoning model, designed for hierarchical orchestration, robust spatial reasoning, and interpretable behavior explanation (Abdolmaleki et al., 2 Oct 2025).

1. Model Family Architecture

The Gemini Robotics 1.5 model family integrates several architectural innovations:

Vision Backbone: A multi-scale Vision Transformer (ViT) generates per-patch embeddings from raw camera inputs.
Language Module: An LLM from the Gemini family encodes open-ended instructions and internal reasoning as token embeddings.
Cross-modal Fusion: Multiple transformer cross-attention layers fuse visual and linguistic representations.
Action Decoder: A lightweight MLP or transformer head maps fused features to continuous action commands (e.g., end-effector poses, joint deltas, gripper actuation).
Motion Transfer Module: A learnable alignment network normalizes embodiment-specific trajectories to a shared canonical space, supporting cross-robot transfer.
Embodied Reasoning Model (GR-ER 1.5): An optimized Gemini backbone for reasoning, with specialized heads for object segmentation, pointing, trajectory prediction, progress estimation, and success/failure classification.

The overall agentic system couples GR 1.5 for low-level closed-loop control and GR-ER 1.5 for high-level planning, spatial understanding, and feedback-driven adaptation (Abdolmaleki et al., 2 Oct 2025).

2. Multimodal Perception, Reasoning, and Action

Gemini Robotics 1.5 processes a sequence of images $I_t$, an instruction $T$, and a history of past actions, fusing them to produce actionable outputs. Its reasoning loop operates as follows:

At each sub-step $i$, the model may generate a “thought” token $\tau_i$ (chain-of-thought reasoning) to explicitly break down the task.
The fused representation $H$ is dynamically updated with these internal language traces, enhancing context and subgoal decomposition.
The action decoder $g_{\theta_a}$ maps the current fused state to the next motion command $\Delta x_i \in \mathbb{R}^D$.
The reasoning loop continues until a task-completion predicate is satisfied.

Mathematical Formulation:

Vision encoder: $V_{\theta_v}(I) \in \mathbb{R}^{P \times d}$
Language encoder: $L_{\theta_l}(T, \tau_{1:k}) \in \mathbb{R}^{M \times d}$
Cross-modal fusion: $H = f_{\text{fuse}}\big(V_{\theta_v}(I), L_{\theta_l}(T, \tau_{1:k})\big)$
Action decoding: $\Delta x_i = g_{\theta_a}(H) \in \mathbb{R}^D$
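To make the shapes in this formulation concrete, the following is a minimal pure-Python sketch of the fusion and decoding steps. It assumes a single-head cross-attention with no learned projections and a mean-pooled linear action head; the function names `cross_attend` and `decode_action`, the toy matrices, and the pooling choice are illustrative assumptions, not the published architecture.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(lang_tokens, vis_patches):
    """Toy cross-attention: language tokens (M x d) attend over visual patches (P x d).

    Assumption: no learned query/key/value projections; returns H with shape M x d.
    """
    d = len(lang_tokens[0])
    fused = []
    for q in lang_tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vis_patches]
        weights = softmax(scores)
        fused.append([sum(w * patch[j] for w, patch in zip(weights, vis_patches))
                      for j in range(d)])
    return fused

def decode_action(H, W):
    """Toy action head: mean-pool H over tokens, then apply a linear map W (D x d)."""
    M, d = len(H), len(H[0])
    pooled = [sum(row[j] for row in H) / M for j in range(d)]
    return [sum(w * p for w, p in zip(w_row, pooled)) for w_row in W]

# Hypothetical toy sizes: P = 3 patches, M = 2 language tokens, d = 2, D = 3.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # stands in for the vision encoder output
L = [[1.0, 0.0], [0.0, 1.0]]                  # stands in for the language encoder output
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # stands in for the action decoder weights

H = cross_attend(L, V)          # fused representation, M x d
delta_x = decode_action(H, W)   # motion command in R^D
```

Each row of `H` is a convex combination of the patch embeddings, so the fused state has one row per language token, and the decoded command has dimension D regardless of how many patches or tokens were fused.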

The internal reasoning process is explicitly interleaved:

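Thought generation: $\tau_i \sim p_{\theta_l}(\,\cdot \mid I_t, T, \tau_{1:i-1})$
Fusion update: $H_i = f_{\text{fuse}}\big(V_{\theta_v}(I_t), L_{\theta_l}(T, \tau_{1:i})\big)$
Action decoding: $\Delta x_i = g_{\theta_a}(H_i)$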

This internal “thinking” mechanism improves multi-step task performance, enables implicit detection and recovery of subtask failures, and increases user interpretability through language-trace explanation (Abdolmaleki et al., 2 Oct 2025).
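The interleaved think-then-act loop described in this section can be sketched roughly as follows. Everything here is a stand-in: `generate_thought`, `decode_action`, and `task_complete` are hypothetical placeholders for the language model, the action decoder, and the task-completion predicate, and the subgoal list and instruction are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeState:
    instruction: str
    remaining: list                               # hypothetical subgoals still to perform
    thoughts: list = field(default_factory=list)  # language trace, kept for interpretability
    actions: list = field(default_factory=list)   # emitted motion commands

def generate_thought(state):
    """Stand-in for thought generation: states the next subgoal in natural language."""
    return f"Next I should {state.remaining[0]}."

def decode_action(state, thought):
    """Stand-in for the action decoder: consumes the next subgoal and emits a dummy command.

    The thought is accepted as context but not parsed in this toy version.
    """
    subgoal = state.remaining.pop(0)
    return {"subgoal": subgoal, "delta_x": [0.0, 0.0, 0.0]}

def task_complete(state):
    """Stand-in completion predicate: true once no subgoals remain."""
    return not state.remaining

def run_episode(instruction, subgoals, max_steps=20):
    """Alternate thought generation and action decoding until the task is complete."""
    state = EpisodeState(instruction, list(subgoals))
    for _ in range(max_steps):
        if task_complete(state):
            break
        thought = generate_thought(state)
        state.thoughts.append(thought)
        state.actions.append(decode_action(state, thought))
    return state

state = run_episode(
    "sort the laundry",
    ["pick up the white shirt", "place it in the white bin"],
)
```

The recorded `thoughts` list is what a user would inspect to understand why each action was taken, which is the interpretability benefit the text describes.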

3. Motion Transfer Across Robot Embodiments

A distinguishing component is the Motion Transfer (MT) module, enabling generalized control over heterogeneous robot platforms (e.g., ALOHA, Franka, Apollo):

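Each robot trajectory $\xi^{(r)}$ is normalized and encoded into a shared latent motion space $\mathcal{Z}$ via $z = E_{\phi}(\xi^{(r)})$.
Per-robot decoders $D^{(r)}_{\psi}$ re-map this latent to actionable trajectories.
Alignment loss $\mathcal{L}_{\text{align}}$ is minimized for paired demonstrations:

$\mathcal{L}_{\text{align}} = \big\| E_{\phi}(\xi^{(a)}) - E_{\phi}(\xi^{(b)}) \big\|^2$

Zero-shot transfer is performed as: $\hat{\xi}^{(b)} = D^{(b)}_{\psi}\big(E_{\phi}(\xi^{(a)})\big)$.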

This mechanism obviates the need for separate robot-specific policies, supporting multi-embodiment learning and skill adaptation (Abdolmaleki et al., 2 Oct 2025).
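As a rough illustration of the encode, align, and decode pattern, the sketch below replaces the learned alignment network with fixed per-robot affine normalizations. The robot scales and offsets, the helper names `make_codec` and `alignment_loss`, and the squared-error form of the loss are all assumptions made for the example; the actual module is learned and its loss is not reproduced here.

```python
def make_codec(scale, offset):
    """Return (encode, decode) for one robot, using a fixed affine map as a stand-in
    for a learned encoder/decoder pair."""
    def encode(traj):
        # Map robot-specific values into the shared latent space.
        return [[(x - offset) / scale for x in step] for step in traj]

    def decode(latent):
        # Map shared latent values back into this robot's command space.
        return [[z * scale + offset for z in step] for step in latent]

    return encode, decode

def alignment_loss(latent_a, latent_b):
    """Mean squared distance between the latent codes of a paired demonstration."""
    total, count = 0.0, 0
    for step_a, step_b in zip(latent_a, latent_b):
        for za, zb in zip(step_a, step_b):
            total += (za - zb) ** 2
            count += 1
    return total / count

# Two hypothetical robots with different command ranges.
encode_a, decode_a = make_codec(scale=2.0, offset=0.0)
encode_b, decode_b = make_codec(scale=0.5, offset=1.0)

# A paired demonstration: the same motion recorded on each robot.
traj_a = [[0.0, 2.0], [2.0, 4.0]]
traj_b = [[1.0, 1.5], [1.5, 2.0]]

loss = alignment_loss(encode_a(traj_a), encode_b(traj_b))

# Zero-shot transfer: encode with robot A's map, decode with robot B's map.
transferred = decode_b(encode_a(traj_a))
```

In this toy setup the paired trajectories map to identical latent codes, so the alignment loss is zero and the transferred trajectory matches robot B's own demonstration. With learned networks the loss would instead be driven toward zero during training.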

4. Training Methodologies and Evaluation Protocols

For the GR 1.5 VLA model:

Supervised Learning: Trained on large-scale datasets composed of robot trajectories, multimodal sensor data, and natural-language instructions. Loss objectives combine action regression or policy distillation ($\mathcal{L}_{\text{action}}$), latent trajectory reconstruction ($\mathcal{L}_{\text{recon}}$), cross-robot alignment ($\mathcal{L}_{\text{align}}$), and language modeling for internal “thoughts” ($\mathcal{L}_{\text{LM}}$).
Behavior Cloning: No reinforcement-based fine-tuning (in Proc4Gem); trajectories are collected using a privileged-state RL expert and chopped into fixed-length windows.
Procedural Generation: Simulated environments are composed using Unity for rendering and MuJoCo for full physics, with domain randomization and asset variation (e.g., across furniture meshes).
Multilingual Prompts: Language interface robust to non-English prompts (e.g., Italian) without performance loss in sim (Lin et al., 11 Mar 2025).
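Two of the steps above lend themselves to a short sketch: chopping expert trajectories into fixed-length windows for behavior cloning, and combining several loss terms into one training objective. The window length, stride, loss values, and weights below are hypothetical, and the weighted sum is only one plausible way to combine the objectives; the source does not state how they are weighted.

```python
def chop_into_windows(trajectory, window, stride=None):
    """Split a trajectory into fixed-length windows, dropping any incomplete tail.

    stride defaults to the window length, which gives non-overlapping windows.
    """
    stride = stride or window
    last_start = len(trajectory) - window
    return [trajectory[i:i + window] for i in range(0, last_start + 1, stride)]

def total_loss(terms, weights):
    """Weighted sum of named loss terms (the weighting scheme is an assumption)."""
    return sum(weights[name] * value for name, value in terms.items())

# A hypothetical 10-step expert trajectory, represented by step indices.
expert_trajectory = list(range(10))
windows = chop_into_windows(expert_trajectory, window=4)
overlapping = chop_into_windows(expert_trajectory, window=4, stride=2)

# Hypothetical per-batch loss values and weights.
terms = {"action": 1.0, "recon": 0.5, "align": 0.2, "lm": 2.0}
weights = {"action": 1.0, "recon": 0.5, "align": 1.0, "lm": 0.1}
loss = total_loss(terms, weights)
```

With a window of 4 and the default stride, the last two steps of the 10-step trajectory are discarded; an overlapping stride keeps more of the data at the cost of correlated samples.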

Benchmarking (Representative Results):

Robot	In-distribution	Instruction generalization	Action generalization	Visual generalization	Task generalization
ALOHA	0.83	0.81	0.78	0.73	0.70
Bi-arm Franka	0.73	0.70	0.66	0.63	0.62
Apollo	0.74	0.77	0.66	0.66	0.62

Motion Transfer yields up to +0.06 (ALOHA), +0.02 (Franka), +0.06 (Apollo) progress increase for generalization tasks. Enabling internal reasoning (“thinking”) yields an average gain of +0.07 across robots on multi-step progress (Abdolmaleki et al., 2 Oct 2025).
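As a worked reading of the benchmark table, the snippet below computes each robot's mean score over the four generalization categories and its gap to the in-distribution score. The numbers are copied from the table; the summary statistics themselves are an illustrative addition, not figures reported in the cited work.

```python
# Scores in column order: in-distribution, instruction, action, visual, task generalization.
scores = {
    "ALOHA": [0.83, 0.81, 0.78, 0.73, 0.70],
    "Bi-arm Franka": [0.73, 0.70, 0.66, 0.63, 0.62],
    "Apollo": [0.74, 0.77, 0.66, 0.66, 0.62],
}

def generalization_summary(row):
    """Return (mean generalization score, in-distribution minus that mean)."""
    in_dist, generalization = row[0], row[1:]
    mean_gen = sum(generalization) / len(generalization)
    return round(mean_gen, 4), round(in_dist - mean_gen, 4)

summary = {robot: generalization_summary(row) for robot, row in scores.items()}

for robot, (mean_gen, gap) in summary.items():
    print(f"{robot}: mean generalization {mean_gen:.4f}, gap to in-distribution {gap:.4f}")
```

On these figures the mean generalization scores are 0.755 (ALOHA), 0.6525 (Bi-arm Franka), and 0.6775 (Apollo), so each robot loses between roughly 0.06 and 0.08 relative to its in-distribution score.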

In the Proc4Gem evaluation, Gemini 1.5 (fine-tuned) achieves roughly 70% real-world success in hard scenes versus 30% for the baseline, along with robust out-of-distribution performance (e.g., 70% for the 1.5 m giraffe OOD condition, against 0% for the baseline). Simulation-to-reality transfer maintains roughly 80% of expert performance in simulation, with hardware transfer outperforming the baseline by 40 percentage points on hard tasks (Lin et al., 11 Mar 2025).

5. Embodied Reasoning and Agentic Integration

GR-ER 1.5 extends the Gemini backbone for spatial reasoning, task progress estimation, and embodied question answering:

Object-centric reasoning: Detection, segmentation, 2D and 3D pointing, trajectory prediction.
Task decomposition: Progress estimation and subtask completion.
Robust QA: Achieves state-of-the-art 59.6% embodied reasoning score vs ∼54% for prior approaches, while maintaining a generalist benchmark score of ∼62.8%.

As an orchestrator, the GR-ER 1.5 model can hierarchically plan and supervise task execution via chain-of-thought traces, integrating high-level symbolic plans with GR 1.5’s continuous control. For long-horizon agentic tasks, the combined system achieves higher progress scores than both “thinking-only” and Gemini 2.5 Flash orchestrated baselines (e.g., 0.83 vs 0.64 vs 0.56 on ALOHA) (Abdolmaleki et al., 2 Oct 2025).
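The orchestration pattern described here (plan, execute, check, and retry on failure) can be sketched as below. The callables `plan_fn`, `execute_fn`, and `check_fn` are hypothetical stand-ins for the embodied reasoning model's planner, the VLA controller, and the success detector; the retry limit and the example subtasks are likewise invented.

```python
def orchestrate(task, plan_fn, execute_fn, check_fn, max_retries=2):
    """Run each planned subtask, retrying failed ones up to max_retries extra times.

    Returns (success, log), where log holds (subtask, attempt number, outcome) tuples.
    """
    log = []
    for subtask in plan_fn(task):
        for attempt in range(1, max_retries + 2):
            execute_fn(subtask)
            ok = check_fn(subtask)
            log.append((subtask, attempt, ok))
            if ok:
                break
        else:
            # Every attempt failed: abort the remaining plan.
            return False, log
    return True, log

# Toy stand-ins: the second subtask fails on its first attempt only.
attempts = {}

def plan(task):
    return ["open the drawer", "put the pen inside", "close the drawer"]

def execute(subtask):
    attempts[subtask] = attempts.get(subtask, 0) + 1

def check(subtask):
    return not (subtask == "put the pen inside" and attempts[subtask] < 2)

done, log = orchestrate("tidy the desk", plan, execute, check)
```

Separating the success check from execution is what lets the high-level model detect a failed subtask and re-issue it without the low-level controller needing any notion of the overall plan.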

6. System Limitations and Prospective Research

Dexterity Plateau: Out-of-the-box dexterity matches, but does not surpass, previous Gemini Robotics generations.
Motion Transfer Dependence: Requires paired multi-robot data; a plausible implication is that unpaired large-scale video (e.g., third-person human manipulation) would further enhance embodiment generalization.
Latency: The “thinking” mode incurs additional inference latency; prospective improvements include early-exit or weight-sharing techniques.
Safety: Ongoing work includes automatic red-teaming (ASIMOV-2.0) for semantic and physical safety validation.
Simulation Complexity: Some physical detail limitations, such as non-placement of small objects on tables, limit domain realism in Proc4Gem (Lin et al., 11 Mar 2025, Abdolmaleki et al., 2 Oct 2025).

Anticipated future directions include leveraging uncurated video corpora for pretraining, end-to-end RL fine-tuning for enhanced low-level dexterity, and dynamic subtask planning with real-world perception for fully autonomous long-horizon operation.

7. Hardware Platforms and Deployment

Robotic Embodiments: Supports diverse platforms, including quadrupeds (e.g., Barkour) and multi-arm systems (Franka, Apollo, ALOHA), and is compatible with various sensor configurations.
Control Frequencies: In Proc4Gem, high-level reasoning and the VLA run at 2 Hz; low-level controllers (e.g., quadruped MPC) operate at 50 Hz (Lin et al., 11 Mar 2025).
Compute: Onboard PC for sensor processing; offboard workstation runs Gemini model inference via remote service, with action caching to mask network latency.
Inference: Pseudocode for inference and training protocols is provided in respective works; per-step inference latency is not explicitly specified.
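The two control rates and the idea of caching actions to hide network latency can be illustrated with a small tick-based simulation. The 2 Hz and 50 Hz figures come from the text above; the `ActionCache` class, the repeat-last-command fallback, and the latency values are assumptions chosen for the example rather than details of the deployed system.

```python
from collections import deque

HIGH_LEVEL_HZ = 2     # high-level reasoning and VLA rate
LOW_LEVEL_HZ = 50     # low-level controller rate
TICKS_PER_COMMAND = LOW_LEVEL_HZ // HIGH_LEVEL_HZ   # low-level ticks per high-level command

class ActionCache:
    """Queue of received commands; repeats the last one if nothing new has arrived."""

    def __init__(self):
        self._queue = deque()
        self._last = None

    def push_chunk(self, actions):
        self._queue.extend(actions)

    def next_action(self):
        if self._queue:
            self._last = self._queue.popleft()
        return self._last

def simulate(num_ticks, latency_ticks):
    """Return the command applied at each low-level tick under a fixed network latency."""
    cache = ActionCache()
    cache.push_chunk(["cmd0"])        # warm start so the first tick has a command
    in_flight = deque()               # (arrival tick, command) for pending responses
    current = None
    applied = []
    for tick in range(num_ticks):
        # Deliver any responses that have arrived from the remote inference service.
        while in_flight and in_flight[0][0] <= tick:
            cache.push_chunk([in_flight.popleft()[1]])
        if tick % TICKS_PER_COMMAND == 0:
            current = cache.next_action()
            step = tick // TICKS_PER_COMMAND
            in_flight.append((tick + latency_ticks, f"cmd{step + 1}"))
        applied.append(current)
    return applied

fast_network = simulate(num_ticks=50, latency_ticks=10)
slow_network = simulate(num_ticks=50, latency_ticks=30)
```

When the response arrives within one high-level period, the controller switches to the new command on schedule; when it arrives late, the cache keeps serving the previous command so the 50 Hz loop never stalls.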

The Gemini Robotics 1.5 family, through explicit motion transfer, natural language reasoning, and hierarchical agentic integration, represents a significant advance in generalist robotics, enabling robust, interpretable, and cross-platform skill transfer for physical agents tasked with complex real-world operations (Lin et al., 11 Mar 2025, Abdolmaleki et al., 2 Oct 2025).


References

Abdolmaleki et al. (2 Oct 2025). Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer.

Lin et al. (11 Mar 2025). Proc4Gem: Foundation models for physical agency through procedural generation.
