Gemini ER-1.5: Advanced Robotic Reasoning
- Gemini ER-1.5 is a state-of-the-art generalist vision-language model designed for embodied reasoning, integrating multi-modal perception with natural language-based task planning.
- It employs specialized reasoning heads for spatial pointing, trajectory prediction, progress estimation, and natural language “thinking traces” to enable interpretable multi-step robotic operations.
- The model demonstrates significant performance improvements in spatial VQA, pointing accuracy, and cross-embodiment skill transfer, reducing task failure rates in real-world deployments.
Gemini ER-1.5 (Gemini Robotics-ER 1.5) is a state-of-the-art generalist vision-language model (VLM) designed for advanced embodied reasoning in robotics. Building on the Gemini 2.5 multimodal backbone, it introduces innovations in visuo-spatial-temporal understanding, multi-level planning, and tool integration that help robotic agents perceive, reason, and act across diverse physical environments. Together with the Vision-Language-Action (VLA) model Gemini Robotics 1.5, it forms an agentic framework in which robots decompose, execute, and monitor complex multi-step tasks using explicit natural language “thinking traces,” interpretable progress signals, and skill transfer across heterogeneous hardware embodiments (Abdolmaleki et al., 2 Oct 2025).
1. Architectural Innovations and Model Structure
Gemini ER-1.5 inherits a transformer-based architecture from Gemini 2.5, employing specialized modules optimized for embodied reasoning:
- Backbone and Multimodal Fusion: The model fuses information from a vision encoder (producing image tokens) and a text encoder (producing language tokens). Cross-modal transformer layers integrate these modalities through self- and cross-attention mechanisms.
- Specialized Reasoning Heads:
  - Pointing Head: Projects the CLS token through a linear map to output 2D spatial coordinates for pointing tasks.
  - Trajectory Prediction Head: Outputs a predicted trajectory for motion planning across sequence steps.
  - Segmentation/Detection Head: Performs per-pixel segmentation and bounding box regression for object localization.
  - Progress Head: Estimates the degree of task completion.
  - Success Detection Head: Classifies each episode as a success or a failure.
  - Planning (Language) Head: Generates natural-language “thinking traces” autoregressively to facilitate chain-of-thought reasoning.
- Multi-level Reasoning Cycle: At inference, the model generates reasoning traces, executes perceptual queries (pointing, segmentation, tool use), and dynamically updates internal plans contingent on perceived success or failure.
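As a rough illustration of the read-out heads listed above, the sketch below applies linear maps to a pooled feature vector. Everything in it is an assumption made for illustration: the class name `ToyHeads`, the random weights, the sigmoid squashing to the range [0, 1], and the use of a plain Python list as the feature vector. It is not the model's actual implementation.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(w, h):
    """Inner product of two equal-length sequences."""
    return sum(wi * hi for wi, hi in zip(w, h))


class ToyHeads:
    """Hypothetical linear heads over a pooled (CLS-like) feature vector."""

    def __init__(self, dim: int, seed: int = 0):
        rng = random.Random(seed)

        def init():
            # Small random weights stand in for learned parameters.
            return [rng.gauss(0.0, 0.1) for _ in range(dim)]

        self.w_x, self.w_y = init(), init()
        self.w_progress, self.w_success = init(), init()

    def point(self, h):
        # Linear map to two values, squashed to normalized image coordinates
        # (the squashing is an assumption, not stated in the source).
        return (sigmoid(dot(self.w_x, h)), sigmoid(dot(self.w_y, h)))

    def progress(self, h) -> float:
        # Scalar task-completion estimate in [0, 1].
        return sigmoid(dot(self.w_progress, h))

    def success(self, h, threshold: float = 0.5) -> bool:
        # Binary success decision from a thresholded probability.
        return sigmoid(dot(self.w_success, h)) >= threshold
```

Calling `ToyHeads(dim=4).point([1.0, -0.5, 0.25, 2.0])` returns an `(x, y)` pair in normalized coordinates; in a trained model the weights would come from optimization rather than a random generator.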
2. Embodied Reasoning Competencies
Gemini ER-1.5 operationalizes embodied reasoning through tightly integrated subsystems:
- Visual and Spatial Understanding: Supports both single- and multi-view detection, segmentation, affordance prediction, and reasoning-augmented pointing tasks with explicit spatial targeting.
- Decompositional Task Planning: Breaks down goals into subtasks via language-based chain-of-thought (“thinking traces”), enabling interpretable internal monologues that guide action selection in open-ended workflows.
- Progress Estimation and Success Detection: Applies temporal reasoning over video frames to track progress and emit binary success signals in real time at ~5 Hz; evaluation also simulates the perceptual latency encountered in real deployments.
- Tool Use and External Data Access: Executes function calls, code, or web search queries during planning, enabling active information gathering and manipulation beyond direct observation.
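The competencies above combine into a plan, act, and monitor cycle, which can be sketched as a small control loop. The function `run_episode` and its callables `plan`, `act`, and `check_success` are hypothetical stand-ins for the reasoning model, the VLA controller, and the success detector; the retry policy is also an assumption and not something the source specifies.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EpisodeLog:
    thoughts: List[str] = field(default_factory=list)   # thinking-trace stand-in
    completed: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)


def run_episode(
    goal: str,
    plan: Callable[[str], List[str]],        # hypothetical: goal -> subtasks
    act: Callable[[str], None],              # hypothetical: execute one subtask
    check_success: Callable[[str], bool],    # hypothetical: success detector
    max_retries: int = 1,
) -> EpisodeLog:
    """Decompose a goal, execute each subtask, and retry on detected failure."""
    log = EpisodeLog()
    for subtask in plan(goal):
        log.thoughts.append(f"Next: {subtask}")
        for attempt in range(max_retries + 1):
            act(subtask)
            if check_success(subtask):
                log.completed.append(subtask)
                break
            log.thoughts.append(f"'{subtask}' failed on attempt {attempt + 1}")
        else:
            # Retries exhausted; this sketch stops where a real system would replan.
            log.failed.append(subtask)
            break
    return log
```

Keeping the planner, actor, and detector as separate callables mirrors the split between the reasoning model and the VLA, and makes each part easy to replace with a stub during testing.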
3. Motion Transfer and Multi-Embodiment Skill Alignment
The Motion Transfer (MT) mechanism, implemented in Gemini Robotics 1.5 and foundational to the agentic system, establishes a unified motion embedding space across heterogeneous robot embodiments (e.g., ALOHA, Bi-arm Franka, Apollo humanoid). Its objective pairs an MLP-based embedding of each trajectory with a contrastive term that aligns embeddings across embodiments.
This unified manifold facilitates zero-shot skill transfer between robot platforms, reducing dataset fragmentation and supporting generalization.
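To make the idea of a contrastive alignment term concrete, the sketch below uses the standard InfoNCE loss as a stand-in. The text does not give the exact MT objective, so the choice of InfoNCE, the cosine similarity, the temperature value, and the function names are all assumptions. Here `anchors[i]` and `positives[i]` would be embeddings of the same skill performed on two different robots.

```python
import math


def cosine(u, v, eps: float = 1e-12) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def info_nce(anchors, positives, temperature: float = 0.1) -> float:
    """Mean InfoNCE loss: anchors[i] should match positives[i];
    every other positives[j] acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)
```

The loss is small when matching trajectories from different embodiments land close together in the shared space and large when they do not, which is the property a unified motion embedding needs.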
4. Training Regime and Data Utilization
Comprehensive multi-modal training leverages:
- Data Sources:
  - Thousands of manipulation trajectories across multiple physically distinct robots in the ALOHA, Bi-arm Franka, and Apollo datasets.
  - Large-scale text, image, and video corpora to pre-train for general world knowledge and visual-linguistic grounding.
- Loss Function: The training objective aggregates terms for language modeling, trajectory regression, MT alignment, progress estimation, and binary success detection.
- Optimization: AdamW optimizer with cosine-decay learning rate schedule, trained on scalable TPU v4/v5p/v6e hardware.
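The two training ingredients above can be sketched in a few lines: a weighted sum of named loss terms and a cosine-decay learning rate. The weighted-sum form, the term names, the default weight of 1.0, and the absence of a warmup phase are assumptions for illustration; the text states only which terms are aggregated and that the schedule is cosine decay.

```python
import math
from typing import Dict, Optional


def total_loss(terms: Dict[str, float],
               weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted sum of named loss terms; unlisted terms get weight 1.0."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())


def cosine_decay_lr(step: int, total_steps: int,
                    peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical per-term values, not taken from the source.
example_terms = {"lm": 1.2, "trajectory": 0.4, "mt_align": 0.3,
                 "progress": 0.1, "success": 0.2}
example_total = total_loss(example_terms, weights={"mt_align": 0.5})
```

In practice the relative weights trade off language quality against control accuracy, so they would be tuned rather than fixed at the illustrative values used here.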
5. Quantitative Benchmarks and Real-World Evaluation
The system is benchmarked on spatial VQA, pointing tasks, success detection, and long-horizon real-world deployments:
Embodied Reasoning (ER) Score on 15 benchmarks (mean of spatial and VQA tasks):
| Model | Thinking | ER Score |
|---|---|---|
| GR-ER 1.5 | Yes | 59.6% |
| GR-ER 1.5 (no thinking) | No | 53.7% |
| GR-ER (prior) | No | 52.6% |
| Gemini 2.5 Flash | Yes | 51.2% |
| GPT-5 | Yes | 51.1% |
| GPT-5-mini | Yes | 47.7% |
Complex Pointing (average across five datasets):
| Model | Avg. Pointing |
|---|---|
| GR-ER 1.5 (thinking) | 52.6% |
| GR-ER 1.5 (no thinking) | 47.1% |
| GR-ER (prior) | 49.1% |
| Gemini 2.5 Flash | 39.7% |
| GPT-5 | 30.8% |
| GPT-5-mini | 27.1% |
Success Detection Binary Accuracy:
| Setting | GR-ER 1.5 | GR-ER |
|---|---|---|
| Offline multiview | 0.79 | 0.68 |
| Offline singleview | 0.80 | 0.76 |
| Real-time multiview | 0.66 | 0.63 |
| Real-time singleview | 0.51 | 0.50 |
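For reference, binary accuracy is the fraction of episodes whose predicted success label matches the ground truth. The helper below is a minimal sketch with a hypothetical function name and made-up example labels; it is not the evaluation code used for the table.

```python
from typing import Sequence


def binary_accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of predictions that agree with the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one episode is required")
    correct = sum(1 for p, y in zip(predictions, labels) if bool(p) == bool(y))
    return correct / len(labels)


# Made-up example: two of four episodes are classified correctly.
example_accuracy = binary_accuracy([True, False, True, True],
                                   [True, True, True, False])
```

If the evaluation episodes were balanced between successes and failures, which the text does not state, an accuracy near 0.5 would be close to chance level.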
On shelf inspection tasks, GR-ER 1.5 achieves IoU ≈ 70% and PointAcc ≈ 64%, outperforming prior iterations and GPT-5 baselines. In long-horizon ALOHA tasks, pairing GR-ER 1.5 with the VLA yields a progress score of 0.88, compared with 0.64 for the Gemini 2.5 Flash configuration and 0.44 for the VLA alone. The orchestration failure rate (planning, monitoring, acting) falls from 44.5% to 22.0% when GR-ER 1.5 is deployed, roughly halving it.
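The two shelf-inspection metrics can be illustrated with a short sketch. It assumes axis-aligned bounding boxes given as `(x_min, y_min, x_max, y_max)` and counts a predicted point as correct when it falls inside the target box. The source does not say whether its evaluation uses boxes or masks, so both the representation and the function names are assumptions.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]               # (x, y)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def point_accuracy(points: Sequence[Point], targets: Sequence[Box]) -> float:
    """Fraction of predicted points that land inside their target box."""
    if len(points) != len(targets) or not points:
        raise ValueError("need equally sized, non-empty inputs")
    hits = sum(
        1
        for (x, y), (x0, y0, x1, y1) in zip(points, targets)
        if x0 <= x <= x1 and y0 <= y <= y1
    )
    return hits / len(points)
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 region, so `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7.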
6. Key Contributions, Limitations, and Ongoing Work
Gemini ER-1.5 establishes a new state of the art in embodied reasoning through several advances:
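- Performance Gains: With thinking enabled, improves on the prior GR-ER by 7.0 percentage points in ER score (59.6% vs. 52.6%) and 3.5 points in average pointing accuracy (52.6% vs. 49.1%), and on Gemini 2.5 Flash by 8.4 and 12.9 points, respectively.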
- Motion Transfer: Demonstrates zero-shot cross-embodiment skill transfer, unifying dataset utilization.
- Multi-step Natural Language Reasoning: Implements a distributed “thinking paradigm,” enhancing interpretability and reliability in multi-stage agent workflows.
- Agentic Integration: Combines high-level reasoning with robust low-level control, supporting failure detection and recovery in real deployments.
Limitations:
- Manipulation dexterity remains similar to previous VLA models; plans are in place to incorporate reinforcement learning for enhanced fine motor control.
- Current scaling is constrained by action-annotated datasets; integration of unlabelled video corpora is anticipated to increase skill diversity.
- Ongoing work targets improved semantic safety alignment and robustness under dynamic real-world conditions.
Taken together, these results suggest that Gemini ER-1.5 offers a foundation for generalist physical agents that combine advanced perception, reasoning, and robust action, and a path toward safer and more capable autonomous robotics (Abdolmaleki et al., 2 Oct 2025).