Gemini Robotics 1.5: VLA Models for Robotics
- Gemini Robotics 1.5 is a family of advanced vision-language-action models that fuse visual, linguistic, and motor control capabilities for solving complex multi-step robotic tasks.
- It incorporates an explicit motion transfer module that enables zero-shot skill adaptation across diverse robot morphologies, enhancing cross-platform functionality.
- Internal chain-of-thought reasoning and embodied spatial planning improve task decomposition and make execution more interpretable in real-world robotic applications.
Gemini Robotics 1.5 is a family of advanced multi-embodiment Vision-Language-Action (VLA) foundation models for robotics, featuring an integrated architecture for perception, reasoning, and dexterous physical control. It incorporates explicit motion transfer mechanisms for zero-shot skill adaptation across diverse robot morphologies and introduces internal natural-language reasoning—enabling a generalist robotic system that can perceive, think, and act to solve complex, multi-step tasks. The family comprises the core Gemini Robotics 1.5 VLA model and the Gemini Robotics-ER 1.5 embodied reasoning model, designed for hierarchical orchestration, robust spatial reasoning, and interpretable behavior explanation (Abdolmaleki et al., 2 Oct 2025).
1. Model Family Architecture
The Gemini Robotics 1.5 model family integrates several architectural innovations:
- Vision Backbone: A multi-scale Vision Transformer (ViT) generates per-patch embeddings from raw camera inputs.
- Language Module: An LLM, originating from the Gemini family, encodes open-ended instructions and internal reasoning as token embeddings.
- Cross-modal Fusion: Multiple transformer cross-attention layers fuse visual and linguistic representations.
- Action Decoder: A lightweight MLP or transformer head maps fused features to continuous action commands (e.g., end-effector poses, joint deltas, gripper actuation).
- Motion Transfer Module: A learnable alignment network normalizes embodiment-specific trajectories to a shared canonical space, supporting cross-robot transfer.
- Embodied Reasoning Model (GR-ER 1.5): An optimized Gemini backbone for reasoning, with specialized heads for object segmentation, pointing, trajectory prediction, progress estimation, and success/failure classification.
The overall agentic system couples GR 1.5 for low-level closed-loop control and GR-ER 1.5 for high-level planning, spatial understanding, and feedback-driven adaptation (Abdolmaleki et al., 2 Oct 2025).
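The fusion and action-decoding stages listed above can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: random arrays stand in for the ViT patch embeddings and the LLM token embeddings, a single attention head replaces the stacked cross-attention layers, and the linear action head with a 7-dimensional output is hypothetical rather than taken from the cited work.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head cross-attention: each query row attends over all key/value rows."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ keys_values                   # (n_queries, d)

rng = np.random.default_rng(0)
d_model, n_patches, n_tokens, action_dim = 16, 9, 5, 7  # illustrative sizes

patch_emb = rng.normal(size=(n_patches, d_model))  # stand-in for ViT patch embeddings
token_emb = rng.normal(size=(n_tokens, d_model))   # stand-in for LLM token embeddings

# Language tokens query the visual patches, giving fused features.
fused = cross_attention(token_emb, patch_emb)

# Hypothetical lightweight action head: mean-pool, then one linear map.
W_action = rng.normal(size=(d_model, action_dim)) * 0.1
action = fused.mean(axis=0) @ W_action

print(fused.shape, action.shape)
```

A learned model would use separate query, key, and value projections and many layers; the sketch only shows how the shapes flow from two modalities to a single action vector.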
2. Multimodal Perception, Reasoning, and Action
Gemini Robotics 1.5 processes a sequence of images, a natural-language instruction, and a history of past actions, fusing them to produce actionable outputs. Its reasoning loop operates as follows:
- At each sub-step, the model may generate a “thought” token (chain-of-thought reasoning) to explicitly break down the task.
- The fused representation is dynamically updated with these internal language traces, enhancing context and subgoal decomposition.
- The action decoder maps the current fused state to the next motion command.
- The reasoning loop continues until a task-completion predicate is satisfied.
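Mathematical Formulation (with $I_{1:T}$ the image sequence, $x$ the instruction, and $a_{<t}$ the history of past actions):
- Vision encoder: $V = \mathrm{ViT}(I_{1:T})$
- Language encoder: $H = \mathrm{LLM}(x)$
- Cross-modal fusion: $F_t = \mathrm{Fuse}(V, H, a_{<t})$
- Action decoding: $a_t = \mathrm{Dec}(F_t)$
The internal reasoning process is explicitly interleaved:
- Thought generation: $z_t \sim p(z \mid F_t)$
- Context update: $F_t' = \mathrm{Fuse}(V, H \oplus z_t, a_{<t})$
- Action decoding: $a_t = \mathrm{Dec}(F_t')$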
This internal “thinking” mechanism improves multi-step task performance, enables implicit detection and recovery of subtask failures, and increases user interpretability through language-trace explanation (Abdolmaleki et al., 2 Oct 2025).
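The think-then-act loop described in this section can be sketched in a few lines of Python. The `ToyPolicy` class, its method names, and the scripted subgoals are all hypothetical stand-ins: a real VLA would generate thoughts from images and decode continuous motor commands, whereas this toy only shows the control flow of interleaving a language trace with actions until a completion predicate holds.

```python
from dataclasses import dataclass, field

@dataclass
class ToyPolicy:
    """Hypothetical stand-in for a 'thinking' VLA; not a real model API."""
    subgoals: list
    trace: list = field(default_factory=list)

    def think(self, step):
        # Emit a natural-language thought naming the next subgoal and log it.
        thought = f"step {step}: next I should {self.subgoals[step]}"
        self.trace.append(thought)
        return thought

    def act(self, thought):
        # A real decoder would emit continuous commands; here we return a label.
        return thought.split("next I should ", 1)[1]

    def done(self, step):
        # Task-completion predicate: every scripted subgoal has been issued.
        return step >= len(self.subgoals)

def run_episode(policy, max_steps=10):
    """Interleave thinking and acting until the completion predicate holds."""
    actions, step = [], 0
    while not policy.done(step) and step < max_steps:
        thought = policy.think(step)
        actions.append(policy.act(thought))
        step += 1
    return actions

policy = ToyPolicy(subgoals=["open drawer", "pick up mug", "place mug in drawer"])
actions = run_episode(policy)
print(actions)
print(policy.trace)
```

Keeping the trace as a plain list mirrors the interpretability point: the thoughts that conditioned each action remain available for inspection after the episode.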
3. Motion Transfer Across Robot Embodiments
A distinguishing component is the Motion Transfer (MT) module, enabling generalized control over heterogeneous robot platforms (e.g., ALOHA, Franka, Apollo):
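- Each robot trajectory $\tau^{(r)}$ is normalized and encoded into a shared latent motion space $\mathcal{Z}$ via an embodiment-specific encoder $E_r$, giving $z = E_r(\tau^{(r)}) \in \mathcal{Z}$.
- Per-robot decoders $D_r$ re-map this latent to actionable trajectories.
- Alignment loss $\mathcal{L}_{\text{align}}$ is minimized for paired demonstrations from robots $r$ and $r'$:
$$\mathcal{L}_{\text{align}} = d\big(E_r(\tau^{(r)}),\; E_{r'}(\tau^{(r')})\big),$$
where $d$ is a distance on $\mathcal{Z}$.
- Zero-shot transfer is performed as: $\hat{\tau}^{(r')} = D_{r'}\big(E_r(\tau^{(r)})\big)$.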
This mechanism obviates the need for separate robot-specific policies, supporting multi-embodiment learning and skill adaptation (Abdolmaleki et al., 2 Oct 2025).
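The encode, align, and decode pattern can be made concrete with a small linear sketch in NumPy. This is an illustration under strong assumptions, not the method from the cited work: both robots' trajectories are generated from a common low-dimensional motion, the encoders and decoders are linear maps named `E_a`, `E_b`, and `D_b` for this example only, the alignment loss is a mean squared distance, and the dimensions (14 and 7) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, dim_a, dim_b, n_pairs = 4, 14, 7, 200  # illustrative sizes

# Synthetic paired demonstrations: the same underlying motion rendered in
# two different action spaces (robot A and robot B).
motion = rng.normal(size=(n_pairs, latent_dim))
M_a = rng.normal(size=(latent_dim, dim_a))
M_b = rng.normal(size=(latent_dim, dim_b))
traj_a, traj_b = motion @ M_a, motion @ M_b

def fit_linear(x, y):
    """Least-squares W minimizing ||x @ W - y||^2."""
    W, *_ = np.linalg.lstsq(x, y, rcond=None)
    return W

# Encoder for A: project onto the top principal directions of A's data.
_, _, vt = np.linalg.svd(traj_a, full_matrices=False)
E_a = vt[:latent_dim].T                # (dim_a, latent_dim)
latent_a = traj_a @ E_a

# Encoder for B is fit so paired demos land on the same latent codes,
# which minimizes the (squared-distance) alignment loss.
E_b = fit_linear(traj_b, latent_a)
align_loss = np.mean((traj_b @ E_b - latent_a) ** 2)

# Decoder for B maps shared latents back to B's action space.
D_b = fit_linear(latent_a, traj_b)

# Zero-shot transfer: a new A-only demonstration is re-targeted to robot B.
new_motion = rng.normal(size=(5, latent_dim))
new_a, new_b_true = new_motion @ M_a, new_motion @ M_b
transferred = new_a @ E_a @ D_b
transfer_error = np.max(np.abs(transferred - new_b_true))

print(align_loss, transfer_error)
```

Because the toy data is exactly linear and noise-free, both the alignment loss and the transfer error come out near machine precision; with real trajectories the maps would be nonlinear networks and the error would not vanish.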
4. Training Methodologies and Evaluation Protocols
For the GR 1.5 VLA model:
- Supervised Learning: Trained on large-scale datasets composed of robot trajectories, multimodal sensor data, and natural-language instructions. Loss objectives combine (1) action regression or policy distillation, (2) latent trajectory reconstruction, (3) cross-robot alignment, and (4) language modeling for internal “thoughts”.
- Behavior Cloning: In Proc4Gem, the policy is trained by behavior cloning alone, with no reinforcement-based fine-tuning; trajectories are collected using a privileged-state RL expert and chopped into fixed-length windows.
- Procedural Generation: Simulated environments are composed using Unity for rendering and MuJoCo for full physics, with domain randomization and asset variation (varied furniture meshes).
- Multilingual Prompts: The language interface is robust to non-English prompts (e.g., Italian), with no performance loss in simulation (Lin et al., 11 Mar 2025).
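Two of the data and objective steps above lend themselves to a short sketch: chopping a trajectory into fixed-length windows and combining several loss terms into one scalar. The helper names, the choice of non-overlapping windows with the remainder dropped, and all numeric loss values and weights below are assumptions for illustration; the source does not specify them.

```python
import numpy as np

def chop_into_windows(trajectory, window):
    """Split a (T, D) trajectory into non-overlapping windows of length
    `window`, dropping any leftover steps at the end (an assumption)."""
    n = len(trajectory) // window
    return trajectory[: n * window].reshape(n, window, -1)

def total_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * terms[name] for name in terms)

# A toy 23-step trajectory with 2 action dimensions.
traj = np.arange(23 * 2, dtype=float).reshape(23, 2)
windows = chop_into_windows(traj, window=5)
print(windows.shape)  # 4 windows of 5 steps; the last 3 steps are dropped

# Hypothetical per-batch loss values and weights for the four objectives.
terms = {"action": 0.5, "recon": 0.2, "align": 0.1, "thought_lm": 1.0}
weights = {"action": 1.0, "recon": 0.5, "align": 0.5, "thought_lm": 0.1}
loss = total_loss(terms, weights)
print(loss)
```

Overlapping (sliding) windows would be an equally plausible reading of “fixed-length windows”; the non-overlapping variant is used here only because it is the simplest.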
Benchmarking (Representative Results):
| Robot | In-Dist | Instr-Gen | Act-Gen | Vis-Gen | Task-Gen |
|---|---|---|---|---|---|
| ALOHA | 0.83 | 0.81 | 0.78 | 0.73 | 0.70 |
| Bi-arm Franka | 0.73 | 0.70 | 0.66 | 0.63 | 0.62 |
| Apollo | 0.74 | 0.77 | 0.66 | 0.66 | 0.62 |
Motion Transfer yields up to +0.06 (ALOHA), +0.02 (Franka), +0.06 (Apollo) progress increase for generalization tasks. Enabling internal reasoning (“thinking”) yields an average gain of +0.07 across robots on multi-step progress (Abdolmaleki et al., 2 Oct 2025).
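For readers who want a single number per robot, the table can be summarized by averaging the four generalization columns and comparing against the in-distribution score. The sketch below only re-does arithmetic on the table values; the summary statistic itself (mean generalization score and its gap to in-distribution) is a choice made here, not a metric reported in the source.

```python
# Progress scores copied from the table above.
scores = {
    "ALOHA":         {"In-Dist": 0.83, "Instr-Gen": 0.81, "Act-Gen": 0.78, "Vis-Gen": 0.73, "Task-Gen": 0.70},
    "Bi-arm Franka": {"In-Dist": 0.73, "Instr-Gen": 0.70, "Act-Gen": 0.66, "Vis-Gen": 0.63, "Task-Gen": 0.62},
    "Apollo":        {"In-Dist": 0.74, "Instr-Gen": 0.77, "Act-Gen": 0.66, "Vis-Gen": 0.66, "Task-Gen": 0.62},
}

def generalization_summary(row):
    """Return (mean of the four generalization scores, gap to in-distribution)."""
    gen = [value for key, value in row.items() if key != "In-Dist"]
    mean_gen = sum(gen) / len(gen)
    return mean_gen, row["In-Dist"] - mean_gen

summary = {robot: generalization_summary(row) for robot, row in scores.items()}
for robot, (mean_gen, gap) in summary.items():
    print(f"{robot}: mean generalization {mean_gen:.4f}, gap {gap:.4f}")
```

On these values the mean generalization scores are 0.755 (ALOHA), 0.6525 (Bi-arm Franka), and 0.6775 (Apollo), so each robot loses between roughly 0.06 and 0.08 progress relative to its in-distribution score.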
In the Proc4Gem evaluation, Gemini 1.5 (fine-tuned) achieves 70% real-world success in hard scenes versus 30% for the baseline, and robust out-of-distribution performance (e.g., 70% for the 1.5 m giraffe OOD condition, versus 0% for the baseline). Simulation-to-reality transfer maintains 80% of expert performance in simulation, with hardware transfer outperforming the baseline by 40 percentage points on hard tasks (Lin et al., 11 Mar 2025).
5. Embodied Reasoning and Agentic Integration
GR-ER 1.5 extends the Gemini backbone for spatial reasoning, task progress estimation, and embodied question answering:
- Object-centric reasoning: Detection, segmentation, 2D and 3D pointing, trajectory prediction.
- Task decomposition: Progress estimation and subtask completion.
- Robust QA: Achieves state-of-the-art 59.6% embodied reasoning score vs ∼54% for prior approaches, while maintaining a generalist benchmark score of ∼62.8%.
As an orchestrator, the GR-ER 1.5 model can hierarchically plan and supervise task execution via chain-of-thought traces, integrating high-level symbolic plans with GR 1.5’s continuous control. For long-horizon agentic tasks, the combined system achieves higher progress scores than both “thinking-only” and Gemini 2.5 Flash orchestrated baselines (e.g., 0.83 vs 0.64 vs 0.56 on ALOHA) (Abdolmaleki et al., 2 Oct 2025).
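The division of labor between a high-level orchestrator and a low-level controller can be sketched as a plan, execute, verify loop. All function names and the scripted behaviors below are hypothetical: `plan_fn` stands in for the embodied reasoning model's task decomposition, `execute_fn` for the VLA's closed-loop control, and `success_fn` for success/failure classification. The retry policy is likewise an assumption.

```python
def orchestrate(instruction, plan_fn, execute_fn, success_fn, max_retries=2):
    """Run each planned subtask, retrying on failure up to `max_retries` times.
    Returns (overall_success, log of (subtask, attempt, ok) tuples)."""
    log = []
    for subtask in plan_fn(instruction):
        for attempt in range(1, max_retries + 2):
            observation = execute_fn(subtask, attempt)
            ok = success_fn(subtask, observation)
            log.append((subtask, attempt, ok))
            if ok:
                break
        else:
            # Every attempt failed: abort the long-horizon task.
            return False, log
    return True, log

# Toy stand-ins for the planner, the controller, and the success detector.
def plan_fn(instruction):
    return instruction.split(" then ")

def execute_fn(subtask, attempt):
    # Scripted failure: the first grasp attempt slips.
    if subtask.startswith("pick") and attempt == 1:
        return "slipped"
    return "done"

def success_fn(subtask, observation):
    return observation == "done"

success, log = orchestrate(
    "open bin then pick cup then drop cup in bin", plan_fn, execute_fn, success_fn
)
print(success)
for entry in log:
    print(entry)
```

In this toy run the grasp fails once and is retried, so the log holds four entries for three subtasks, which is the kind of feedback-driven recovery the hierarchical design is meant to support.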
6. System Limitations and Prospective Research
- Dexterity Plateau: Out-of-the-box dexterity matches, but does not surpass, previous Gemini Robotics generations.
- Motion Transfer Dependence: Requires paired multi-robot data; a plausible implication is that unpaired large-scale video (e.g., third-person human manipulation) would further enhance embodiment generalization.
- Latency: The “thinking” mode incurs additional inference latency; prospective improvements include early-exit or weight-sharing techniques.
- Safety: Ongoing work includes automatic red-teaming (ASIMOV-2.0) for semantic and physical safety validation.
- Simulation Complexity: Proc4Gem omits some physical detail (for example, small objects are not placed on tables), which limits domain realism (Lin et al., 11 Mar 2025, Abdolmaleki et al., 2 Oct 2025).
Anticipated future directions include leveraging uncurated video corpora for pretraining, end-to-end RL fine-tuning for enhanced low-level dexterity, and dynamic subtask planning with real-world perception for fully autonomous long-horizon operation.
7. Hardware Platforms and Deployment
- Robotic Embodiments: Supports diverse platforms: quadruped (e.g., Barkour), bimanual arm setups (ALOHA, bi-arm Franka), and humanoid (Apollo), with compatibility across various sensor configurations.
- Control Frequencies: In Proc4Gem, high-level reasoning and the VLA run at 2 Hz; low-level controllers (e.g., quadruped MPC) operate at 50 Hz (Lin et al., 11 Mar 2025).
- Compute: Onboard PC for sensor processing; offboard workstation runs Gemini model inference via remote service, with action caching to mask network latency.
- Inference: Pseudocode for the inference and training protocols is provided in the respective works; per-step inference latency is not explicitly specified.
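The two control rates quoted above imply that each high-level command is held for 25 low-level ticks (50 Hz divided by 2 Hz). The sketch below illustrates that timing with a simple hold-last-command cache. It is a simplification under stated assumptions: `query_model` is a hypothetical stand-in for the remote inference call, the loop is simulated rather than real-time, and the actual action-caching scheme used to mask network latency is not described in enough detail in the source to reproduce.

```python
HIGH_LEVEL_HZ = 2    # VLA / reasoning rate quoted for Proc4Gem
LOW_LEVEL_HZ = 50    # low-level controller rate quoted for Proc4Gem
TICKS_PER_COMMAND = LOW_LEVEL_HZ // HIGH_LEVEL_HZ  # 25 low-level ticks per command

def run(duration_s, query_model):
    """Simulate the low-level loop; refresh the cached high-level command
    every TICKS_PER_COMMAND ticks and reuse it in between."""
    cached = None
    applied = []
    for tick in range(duration_s * LOW_LEVEL_HZ):
        if tick % TICKS_PER_COMMAND == 0:
            cached = query_model(tick // TICKS_PER_COMMAND)
        applied.append(cached)  # the controller tracks the cached command
    return applied

applied = run(2, lambda i: f"cmd{i}")
print(len(applied), applied[0], applied[24], applied[25], applied[-1])
```

Over two simulated seconds the low-level loop runs 100 ticks and consumes four high-level commands, switching from the first to the second at tick 25.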
The Gemini Robotics 1.5 family, through explicit motion transfer, natural language reasoning, and hierarchical agentic integration, represents a significant advance in generalist robotics, enabling robust, interpretable, and cross-platform skill transfer for physical agents tasked with complex real-world operations (Lin et al., 11 Mar 2025, Abdolmaleki et al., 2 Oct 2025).