Gemini ER-1.5: Advanced Robotic Reasoning

Updated 17 March 2026

Gemini ER-1.5 is a state-of-the-art generalist vision-language model designed for embodied reasoning, integrating multi-modal perception with natural language-based task planning.
It employs specialized reasoning heads for spatial pointing, trajectory prediction, progress estimation, and natural language “thinking traces” to enable interpretable multi-step robotic operations.
The model demonstrates significant performance improvements in spatial VQA, pointing accuracy, and cross-embodiment skill transfer, reducing task failure rates in real-world deployments.

Gemini ER-1.5 (Gemini Robotics-ER 1.5, abbreviated GR-ER 1.5 in the benchmark tables) is a state-of-the-art generalist vision-language model (VLM) designed for embodied reasoning in robotics. Building on the Gemini 2.5 multimodal backbone, it adds capabilities in visuo-spatial-temporal understanding, multi-level planning, and tool integration so that robotic agents can perceive, reason, and act across diverse physical environments. Paired with the Vision-Language-Action (VLA) model Gemini Robotics 1.5, Gemini ER-1.5 forms an agentic framework in which robots decompose, execute, and monitor complex multi-step tasks using explicit natural-language “thinking traces,” interpretable progress signals, and skill transfer across heterogeneous hardware embodiments (Abdolmaleki et al., 2 Oct 2025).

1. Architectural Innovations and Model Structure

Gemini ER-1.5 inherits a transformer-based architecture from Gemini 2.5, employing specialized modules optimized for embodied reasoning:

Backbone and Multimodal Fusion: The model fuses information from a vision encoder $V(\cdot)$ (producing image tokens) and a text encoder $T(\cdot)$ (producing language tokens). Cross-modal transformer layers integrate these modalities through self- and cross-attention mechanisms.
Specialized Reasoning Heads:
- Pointing Head: Projects the CLS token into $\mathbb{R}^2$ via a linear map to output spatial coordinates for 2D pointing tasks, as $\hat{\mathbf{p}} = W_p h_{\text{[CLS]}} + b_p$.
- Trajectory Prediction Head: Outputs $\{\hat{\mathbf{p}}_t\}_{t=1}^T$ for motion planning across sequence steps.
- Segmentation/Detection Head: Performs per-pixel segmentation and bounding box regression for object localization.
- Progress Head: Estimates task completion as $\hat{r} = W_r h_{\text{[CLS]}} + b_r \in [0,1]$.
- Success Detection Head: Classifies binary episode success as $\hat{s} = \sigma(W_s h_{\text{[CLS]}} + b_s)$.
- Planning (Language) Head: Generates natural-language “thinking traces” autoregressively to facilitate chain-of-thought reasoning.
Multi-level Reasoning Cycle: At inference, the model generates reasoning traces, executes perceptual queries (pointing, segmentation, tool use), and dynamically updates internal plans contingent on perceived success or failure.
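
The pointing, progress, and success heads listed above are all small maps over a pooled feature vector, which can be illustrated with a minimal plain-Python sketch. Everything concrete here is an assumption for illustration: the names `h_cls`, `linear`, and the toy weights are hypothetical, and the clamp in `progress_head` is one possible way to keep the output in $[0,1]$ (the source states the range but not how it is enforced).

```python
import math

def linear(weights, bias, h):
    """Compute W h + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(weights, bias)]

def pointing_head(h_cls, W_p, b_p):
    """Map the pooled feature to a 2D point (x, y)."""
    x, y = linear(W_p, b_p, h_cls)
    return x, y

def progress_head(h_cls, W_r, b_r):
    """Scalar progress estimate; the clamp to [0, 1] is an assumption."""
    (r,) = linear(W_r, b_r, h_cls)
    return min(1.0, max(0.0, r))

def success_head(h_cls, W_s, b_s):
    """Success probability via a sigmoid over a scalar logit."""
    (z,) = linear(W_s, b_s, h_cls)
    return 1.0 / (1.0 + math.exp(-z))

# Toy inputs (illustrative values only, not model parameters).
h_cls = [0.5, -1.0, 2.0]
point = pointing_head(h_cls, [[0.2, 0.0, 0.1], [0.0, 0.3, 0.1]], [0.1, 0.4])
progress = progress_head(h_cls, [[0.1, 0.0, 0.2]], [0.0])
success = success_head(h_cls, [[0.0, 0.0, 0.0]], [0.0])
print(point, progress, success)
```

With an all-zero success weight the logit is zero, so the sigmoid returns an uninformative probability of one half; a trained head would move this toward 0 or 1 depending on the observation.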

2. Embodied Reasoning Competencies

Gemini ER-1.5 operationalizes embodied reasoning through tightly integrated subsystems:

Visual and Spatial Understanding: Supports both single- and multi-view detection, segmentation, affordance prediction, and reasoning-augmented pointing tasks with explicit spatial targeting.
Decompositional Task Planning: Breaks down goals into subtasks via language-based chain-of-thought (“thinking traces”), enabling interpretable internal monologues that guide action selection in open-ended workflows.
Progress Estimation and Success Detection: Reasons temporally over video frames and emits binary success signals in real time (~5 Hz) for progress tracking; evaluation also simulates the perceptual latency encountered in real deployments.
Tool Use and External Data Access: Executes function calls, code, or web search queries during planning, enabling active information gathering and manipulation beyond direct observation.
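
The interplay of decomposition, execution, and success detection described above can be sketched as a simple control loop. This is a hedged illustration only: `run_task`, `plan_fn`, `act_fn`, `success_fn`, and the toy stand-ins are hypothetical names, not an API of Gemini ER-1.5 or Gemini Robotics 1.5, and the retry policy is an assumption.

```python
def run_task(goal, plan_fn, act_fn, success_fn, max_retries=2):
    """Decompose a goal, execute each subtask, and retry on detected failure.

    Returns (completed, trace), where trace is a list of plain-text
    "thinking" lines recording what the loop attempted.
    """
    trace = []
    for subtask in plan_fn(goal):
        for attempt in range(1, max_retries + 2):
            trace.append(f"thinking: attempt {attempt} of '{subtask}'")
            observation = act_fn(subtask)
            if success_fn(subtask, observation):
                break  # subtask succeeded, move on to the next one
        else:
            # No attempt succeeded within the retry budget.
            trace.append(f"thinking: giving up on '{subtask}'")
            return False, trace
    return True, trace

# Toy stand-ins for the planner, the low-level controller, and the
# success detector (all hypothetical).
attempts = {}

def toy_plan(goal):
    return ["pick up cup", "place cup on shelf"]

def toy_act(subtask):
    attempts[subtask] = attempts.get(subtask, 0) + 1
    return attempts[subtask]  # the "observation" is just the attempt count

def toy_success(subtask, observation):
    # Simulate a placement that fails on its first attempt only.
    return not (subtask.startswith("place") and observation == 1)

done, trace = run_task("put the cup on the shelf", toy_plan, toy_act, toy_success)
for line in trace:
    print(line)
```

In this toy run the placement subtask is retried once after the simulated failure, which mirrors the failure detection and recovery role the text assigns to the reasoning model.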

3. Motion Transfer and Multi-Embodiment Skill Alignment

The Motion Transfer (MT) mechanism, implemented in Gemini Robotics 1.5 and foundational to the agentic system, establishes a unified motion embedding space across heterogeneous robot embodiments (e.g., ALOHA, Bi-arm Franka, Apollo humanoid):

$L_{\rm MT} = \mathbb{E}_{(\tau^A,\tau^B)\sim\mathcal{D}} \|f(\tau^A) - f(\tau^B)\|^2_2 + \lambda_{\rm contra} L_{\rm contra}$

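where $\tau^A$ and $\tau^B$ are paired trajectories from two embodiments, $f(\tau)$ is an MLP-based embedding of a trajectory, $\lambda_{\rm contra}$ weights the contrastive term, and $L_{\rm contra}$ is:

$L_{\rm contra} = -\sum_{i}\log\frac{\exp(\mathrm{sim}(f(\tau_i^A),f(\tau_i^B))/\kappa)}{\sum_{j}\exp(\mathrm{sim}(f(\tau_i^A),f(\tau_j^B))/\kappa)}$

Here $\mathrm{sim}(\cdot,\cdot)$ is a similarity score between embeddings and $\kappa > 0$ is a temperature.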

This unified manifold facilitates zero-shot skill transfer between robot platforms, reducing dataset fragmentation and supporting generalization.
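
To make the two terms of $L_{\rm MT}$ concrete, the following is a minimal plain-Python sketch that evaluates the loss on precomputed embeddings. Several choices are assumptions rather than details from the source: cosine similarity stands in for $\mathrm{sim}$, the embeddings are given directly instead of coming from an MLP $f$, and the default weight and temperature values are arbitrary.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity (an assumed choice for sim)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def alignment_term(emb_a, emb_b):
    """Mean squared L2 distance between paired embeddings."""
    distances = [sum((a - b) ** 2 for a, b in zip(u, v))
                 for u, v in zip(emb_a, emb_b)]
    return sum(distances) / len(distances)

def contrastive_term(emb_a, emb_b, temperature):
    """Sum over i of the negative log-softmax score of the matching pair."""
    total = 0.0
    for i, u in enumerate(emb_a):
        logits = [cosine(u, v) / temperature for v in emb_b]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total -= logits[i] - log_denominator
    return total

def motion_transfer_loss(emb_a, emb_b, weight=0.5, temperature=0.1):
    return (alignment_term(emb_a, emb_b)
            + weight * contrastive_term(emb_a, emb_b, temperature))

# Toy embeddings for two embodiments (illustrative values only).
emb_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = motion_transfer_loss(emb_a, [[1.0, 0.0], [0.0, 1.0]])
swapped = motion_transfer_loss(emb_a, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

When paired trajectories map to the same embedding the loss is close to zero, and it grows sharply when the pairing is swapped, which is the behavior a shared motion embedding space relies on.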

4. Training Regime and Data Utilization

Comprehensive multi-modal training leverages:

Data Sources:
- Thousands of manipulation trajectories across multiple physically distinct robots in the ALOHA, Bi-arm Franka, and Apollo datasets.
- Large-scale text, image, and video corpora to pre-train for general world knowledge and visual-linguistic grounding.
Loss Function: The training objective aggregates language modeling, trajectory regression, MT alignment, progress estimation, and binary success detection: $\mathcal{L} = -\sum_t\log P(w_t\mid w_{<t},I) + \lambda_{\rm traj}\mathbb{E}\|\tau-\hat\tau\|^2 + \lambda_{\rm MT}L_{\rm MT} + \lambda_{\rm prog}\mathbb{E}(r-\hat r)^2 + \lambda_{\rm sd}\mathbb{E}[-s\log\hat s-(1-s)\log(1-\hat s)] + \cdots$
Optimization: AdamW optimizer with cosine-decay learning rate schedule, trained on scalable TPU v4/v5p/v6e hardware.
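
As a worked illustration of how the listed terms combine, the sketch below evaluates the weighted sum on toy numbers in plain Python. The batch layout, the function name `composite_loss`, and all values are hypothetical, the Motion Transfer term is passed in as a precomputed scalar, and the unspecified trailing terms ($\cdots$) in the objective are omitted.

```python
import math

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def composite_loss(batch, weights):
    """Weighted sum of the loss terms named in the text (toy version)."""
    terms = {
        # Language modeling: negative log-likelihood of the target tokens.
        "lm": -sum(math.log(p) for p in batch["token_probs"]),
        # Trajectory regression: mean squared norm of the error.
        "traj": mean(sum((a - b) ** 2 for a, b in zip(t, p))
                     for t, p in zip(batch["traj_true"], batch["traj_pred"])),
        # Motion Transfer alignment, assumed precomputed elsewhere.
        "mt": batch["mt_loss"],
        # Progress estimation: mean squared error on scalars.
        "prog": mean((r - r_hat) ** 2
                     for r, r_hat in zip(batch["progress_true"], batch["progress_pred"])),
        # Success detection: binary cross-entropy.
        "sd": mean(-(s * math.log(p) + (1 - s) * math.log(1 - p))
                   for s, p in zip(batch["success_true"], batch["success_prob"])),
    }
    total = terms["lm"] + sum(weights[name] * terms[name] for name in weights)
    return total, terms

# Toy batch and unit weights (illustrative values only).
batch = {
    "token_probs": [0.5, 0.25],
    "traj_true": [[0.0, 0.0], [1.0, 1.0]],
    "traj_pred": [[0.0, 1.0], [1.0, 1.0]],
    "mt_loss": 0.2,
    "progress_true": [0.5, 1.0],
    "progress_pred": [0.5, 0.8],
    "success_true": [1, 0],
    "success_prob": [0.5, 0.5],
}
weights = {"traj": 1.0, "mt": 1.0, "prog": 1.0, "sd": 1.0}
total, terms = composite_loss(batch, weights)
print(total, terms)
```

Returning the individual terms alongside the total is a convenience for inspecting how much each objective contributes; the relative weights used in actual training are not given in the source.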

5. Quantitative Benchmarks and Real-World Evaluation

The system is systematically benchmarked across spatial VQA, pointing tasks, success detection, and long-horizon real-world deployments:

Embodied Reasoning (ER) Score on 15 benchmarks (mean of spatial and VQA tasks):

Model	Thinking	ER Score
GR-ER 1.5	Yes	59.6%
GR-ER 1.5 (no thinking)	No	53.7%
GR-ER (prior)	No	52.6%
Gemini 2.5 Flash	Yes	51.2%
GPT-5	Yes	51.1%
GPT-5-mini	Yes	47.7%

Complex Pointing (average across five datasets):

Model	Avg. Pointing
GR-ER 1.5 (Thinking)	52.6%
GR-ER 1.5 (No think)	47.1%
GR-ER	49.1%
Gemini 2.5 Flash	39.7%
GPT-5	30.8%
GPT-5-mini	27.1%
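
The size of the thinking effect can be read off the two tables above with simple differences. The short Python sketch below reproduces that arithmetic using only the reported scores; the dictionary keys and the helper `gain` are illustrative names.

```python
# Scores copied from the ER Score and Complex Pointing tables (percent).
er_score = {"thinking": 59.6, "no_thinking": 53.7, "prior": 52.6}
pointing = {"thinking": 52.6, "no_thinking": 47.1, "prior": 49.1}

def gain(scores, new, old):
    """Difference in percentage points, rounded to one decimal place."""
    return round(scores[new] - scores[old], 1)

er_thinking_gain = gain(er_score, "thinking", "no_thinking")
pointing_thinking_gain = gain(pointing, "thinking", "no_thinking")
er_gain_over_prior = gain(er_score, "thinking", "prior")
pointing_gain_over_prior = gain(pointing, "thinking", "prior")
pointing_no_thinking_vs_prior = gain(pointing, "no_thinking", "prior")

print(er_thinking_gain, pointing_thinking_gain)
print(er_gain_over_prior, pointing_gain_over_prior, pointing_no_thinking_vs_prior)
```

Thinking accounts for most of the margin in both tables, and on pointing the no-thinking configuration falls below the prior GR-ER model.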

Success Detection Binary Accuracy:

Setting	GR-ER 1.5	GR-ER
Offline multiview	0.79	0.68
Offline singleview	0.80	0.76
Real-time multiview	0.66	0.63
Real-time singleview	0.51	0.50

On shelf inspection tasks (IoU/point-accuracy): GR-ER 1.5 achieves IoU ≈ 70%, PointAcc ≈ 64%, outperforming prior iterations and GPT-5 baselines. In long-horizon ALOHA tasks, integrating GR-ER 1.5 with the VLA results in a progress score of 0.88, exceeding both the Gemini 2.5 Flash (0.64) and VLA alone (0.44) configurations. Failure rate in orchestration (planning, monitoring, acting) declines from 44.5% to 22.0% when deploying GR-ER 1.5.
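
The orchestration and long-horizon figures above can be restated as absolute and relative changes. The following Python sketch does only that arithmetic on the reported numbers; the variable and function names are illustrative.

```python
def absolute_drop(before, after):
    """Change in percentage points."""
    return before - after

def relative_drop(before, after):
    """Change as a fraction of the starting value."""
    return (before - after) / before

# Orchestration failure rate (percent), as reported in the text.
failure_before, failure_after = 44.5, 22.0
failure_abs = absolute_drop(failure_before, failure_after)
failure_rel = relative_drop(failure_before, failure_after)

# Long-horizon ALOHA progress scores, as reported in the text.
progress = {"gr_er_1_5_plus_vla": 0.88, "gemini_2_5_flash": 0.64, "vla_alone": 0.44}
gain_over_flash = progress["gr_er_1_5_plus_vla"] - progress["gemini_2_5_flash"]
ratio_over_vla = progress["gr_er_1_5_plus_vla"] / progress["vla_alone"]

print(f"failure rate: -{failure_abs:.1f} points ({failure_rel:.1%} relative)")
print(f"progress: +{gain_over_flash:.2f} over Flash, {ratio_over_vla:.1f}x VLA alone")
```

In relative terms the failure rate is roughly halved, and the combined system reaches about twice the progress score of the VLA on its own.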

6. Key Contributions, Limitations, and Ongoing Work

Gemini ER-1.5 establishes new embodied reasoning state-of-the-art with several advances:

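Performance Gains: With thinking enabled, GR-ER 1.5 raises the ER score by 7.0 percentage points over the prior GR-ER (59.6% vs. 52.6%) and by about 8.5 points over Gemini 2.5 Flash and GPT-5; average complex-pointing accuracy improves by 3.5 points over GR-ER (52.6% vs. 49.1%) and by 12.9 points over Gemini 2.5 Flash (39.7%).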
Motion Transfer: Demonstrates zero-shot cross-embodiment skill transfer, unifying dataset utilization.
Multi-step Natural Language Reasoning: Implements a distributed “thinking paradigm,” enhancing interpretability and reliability in multi-stage agent workflows.
Agentic Integration: Combines high-level reasoning with robust low-level control, supporting failure detection and recovery in real deployments.

Limitations:

Manipulation dexterity remains similar to previous VLA models; plans are in place to incorporate reinforcement learning for enhanced fine motor control.
Current scaling is constrained by action-annotated datasets; integration of unlabelled video corpora is anticipated to increase skill diversity.
Ongoing work targets improved semantic safety alignment and robustness under dynamic real-world conditions.

This suggests that Gemini ER-1.5 forms a foundational architecture for generalist physical agents with advanced perception, reasoning, and robust action, providing a trajectory towards safer and more capable autonomous robotics (Abdolmaleki et al., 2 Oct 2025).

References

Abdolmaleki et al. (2 Oct 2025). Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer.
