
Gemini ER-1.5: Advanced Robotic Reasoning

Updated 17 March 2026
  • Gemini ER-1.5 is a state-of-the-art generalist vision-language model designed for embodied reasoning, integrating multi-modal perception with natural language-based task planning.
  • It employs specialized reasoning heads for spatial pointing, trajectory prediction, progress estimation, and natural language “thinking traces” to enable interpretable multi-step robotic operations.
  • The model demonstrates significant performance improvements in spatial VQA, pointing accuracy, and cross-embodiment skill transfer, reducing task failure rates in real-world deployments.

Gemini ER-1.5 (Gemini Robotics-ER 1.5) is a state-of-the-art generalist vision-language model (VLM) designed for advanced embodied reasoning in robotics. Building on the Gemini 2.5 multimodal backbone, it introduces innovations in visuo-spatial-temporal understanding, multi-level planning, and tool integration that help robotic agents perceive, reason, and act across diverse physical environments. Paired with the Vision-Language-Action (VLA) model Gemini Robotics 1.5, Gemini ER-1.5 supports an agentic framework in which robots decompose, execute, and monitor complex multi-step tasks, using explicit natural-language “thinking traces,” interpretable progress signals, and skill transfer across heterogeneous hardware embodiments (Abdolmaleki et al., 2 Oct 2025).

1. Architectural Innovations and Model Structure

Gemini ER-1.5 inherits a transformer-based architecture from Gemini 2.5, employing specialized modules optimized for embodied reasoning:

  • Backbone and Multimodal Fusion: The model fuses information from a vision encoder $V(\cdot)$ (producing image tokens) and a text encoder $T(\cdot)$ (producing language tokens). Cross-modal transformer layers integrate these modalities through self- and cross-attention mechanisms.
  • Specialized Reasoning Heads:
    • Pointing Head: Projects the CLS token into $\mathbb{R}^2$ via a linear map to output spatial coordinates for 2D pointing tasks, as $\hat{\mathbf{p}} = W_p h_{\text{[CLS]}} + b_p$.
    • Trajectory Prediction Head: Outputs $\{\hat{\mathbf{p}}_t\}_{t=1}^T$ for motion planning across sequence steps.
    • Segmentation/Detection Head: Performs per-pixel segmentation and bounding box regression for object localization.
    • Progress Head: Estimates task completion as $\hat{r} = W_r h_{\text{[CLS]}} + b_r \in [0,1]$.
    • Success Detection Head: Classifies binary episode success as $\hat{s} = \sigma(W_s h_{\text{[CLS]}} + b_s)$.
    • Planning (Language) Head: Generates natural-language “thinking traces” autoregressively to facilitate chain-of-thought reasoning.
  • Multi-level Reasoning Cycle: At inference, the model generates reasoning traces, executes perceptual queries (pointing, segmentation, tool use), and dynamically updates internal plans contingent on perceived success or failure.
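The three scalar and vector heads described above are all affine maps of the CLS feature, which can be sketched in a few lines. This is a minimal illustration, not the model's implementation: the function names, the toy weights, and the three-dimensional feature are assumptions, and clamping the progress estimate to [0, 1] is one assumed way to enforce the stated range, since the source does not say how it is enforced.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, h: Sequence[float]) -> Vector:
    """Affine map W h + b for a single feature vector h."""
    return [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z: float) -> float:
    """Logistic function used by the success detection head."""
    return 1.0 / (1.0 + math.exp(-z))


def pointing_head(W_p: Matrix, b_p: Vector, h_cls: Sequence[float]) -> Vector:
    """p_hat = W_p h_cls + b_p, where W_p has two rows (x and y)."""
    return linear(W_p, b_p, h_cls)


def progress_head(W_r: Matrix, b_r: Vector, h_cls: Sequence[float]) -> float:
    """r_hat = W_r h_cls + b_r, clamped to [0, 1] (clamping is an assumption)."""
    raw = linear(W_r, b_r, h_cls)[0]
    return min(1.0, max(0.0, raw))


def success_head(W_s: Matrix, b_s: Vector, h_cls: Sequence[float]) -> float:
    """s_hat = sigmoid(W_s h_cls + b_s), a probability of episode success."""
    return sigmoid(linear(W_s, b_s, h_cls)[0])


# Toy CLS feature and weights, chosen only for illustration.
h_cls = [0.5, -1.0, 2.0]
point = pointing_head([[1.0, 0.0, 0.0], [0.0, 0.0, 0.5]], [0.0, 0.1], h_cls)
progress = progress_head([[0.0, 0.0, 0.25]], [0.0], h_cls)
success = success_head([[0.0, 0.0, 0.0]], [0.0], h_cls)
print(point, progress, success)
```

Sharing one feature vector across several small heads keeps each output cheap to compute; the trajectory head would apply the same kind of map once per time step.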

2. Embodied Reasoning Competencies

Gemini ER-1.5 operationalizes embodied reasoning through tightly integrated subsystems:

  • Visual and Spatial Understanding: Supports both single- and multi-view detection, segmentation, affordance prediction, and reasoning-augmented pointing tasks with explicit spatial targeting.
  • Decompositional Task Planning: Breaks down goals into subtasks via language-based chain-of-thought (“thinking traces”), enabling interpretable internal monologues that guide action selection in open-ended workflows.
  • Progress Estimation and Success Detection: Applies temporal reasoning over video frames and emits real-time binary evaluation signals for progress tracking at ~5 Hz; evaluation also simulates the perceptual latency encountered in real deployments.
  • Tool Use and External Data Access: Executes function calls, code, or web search queries during planning, enabling active information gathering and manipulation beyond direct observation.
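The interplay of decomposition, acting, and success detection listed above can be sketched as a plan, act, and monitor loop. Everything here is hypothetical scaffolding: `run_episode`, `EpisodeLog`, the retry limit, and the toy planner and success detector are illustrative stand-ins, not the interfaces of Gemini ER-1.5 or the VLA.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EpisodeLog:
    """Record of one episode: the reasoning trace and the outcome."""
    trace: List[str] = field(default_factory=list)
    completed: List[str] = field(default_factory=list)
    failed_subtask: str = ""  # empty string means every subtask succeeded


def run_episode(
    goal: str,
    plan: Callable[[str], List[str]],
    act: Callable[[str], None],
    succeeded: Callable[[str], bool],
    max_attempts: int = 3,
) -> EpisodeLog:
    """Decompose the goal, execute each subtask, and retry on detected failure."""
    log = EpisodeLog()
    for subtask in plan(goal):
        log.trace.append(f"Plan: {subtask}")
        for attempt in range(1, max_attempts + 1):
            act(subtask)                      # low-level controller (stand-in)
            if succeeded(subtask):            # success detector (stand-in)
                log.completed.append(subtask)
                break
            log.trace.append(f"Attempt {attempt} at '{subtask}' failed")
        else:
            # Every attempt failed: stop and report which subtask blocked us.
            log.failed_subtask = subtask
            return log
    return log


# Toy stand-ins: the grasp fails once, then succeeds on the retry.
attempts: Dict[str, int] = {}


def toy_plan(goal: str) -> List[str]:
    return ["open drawer", "pick up cup", "place cup in drawer"]


def toy_act(subtask: str) -> None:
    attempts[subtask] = attempts.get(subtask, 0) + 1


def toy_succeeded(subtask: str) -> bool:
    return not (subtask == "pick up cup" and attempts[subtask] < 2)


log = run_episode("put the cup away", toy_plan, toy_act, toy_succeeded)
print(log.completed)
print(log.trace)
```

Passing the planner, controller, and detector in as callables keeps the loop independent of any particular model, which is why a failed check can trigger a retry without the loop knowing how success was judged.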

3. Motion Transfer and Multi-Embodiment Skill Alignment

The Motion Transfer (MT) mechanism, implemented in Gemini Robotics 1.5 and foundational to the agentic system, establishes a unified motion embedding space across heterogeneous robot embodiments (e.g., ALOHA, Bi-arm Franka, Apollo humanoid):

$$L_{\rm MT} = \mathbb{E}_{(\tau^A,\tau^B)\sim\mathcal{D}} \|f(\tau^A) - f(\tau^B)\|^2_2 + \lambda_{\rm contra} L_{\rm contra}$$

where $f(\tau)$ is an MLP-based embedding of a trajectory, and $L_{\rm contra}$ is a contrastive term:

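$$L_{\rm contra} = -\sum_{i}\log\frac{\exp(\mathrm{sim}(f(\tau_i^A),f(\tau_i^B))/\tau)}{\sum_{j}\exp(\mathrm{sim}(f(\tau_i^A),f(\tau_j^B))/\tau)}$$

Here the unindexed $\tau$ dividing the similarity is a temperature, distinct from the trajectories $\tau_i^A$ and $\tau_j^B$, and $\mathrm{sim}$ is a similarity function between embeddings.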

This unified manifold facilitates zero-shot skill transfer between robot platforms, reducing dataset fragmentation and supporting generalization.
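The two terms of the Motion Transfer objective can be evaluated on a small batch of paired embeddings as follows. This is a sketch under stated assumptions: the embeddings $f(\tau)$ are taken as given lists of floats, `sim` is assumed to be cosine similarity (the source does not specify it), the expectation is approximated by a batch mean, and the default weight and temperature are arbitrary. Embeddings are assumed to be non-zero.

```python
import math
from typing import List, Sequence

Embedding = Sequence[float]


def cosine_similarity(u: Embedding, v: Embedding) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_term(emb_a: List[Embedding], emb_b: List[Embedding]) -> float:
    """Batch mean of squared L2 distances between paired embeddings."""
    total = sum(sum((a - b) ** 2 for a, b in zip(u, v))
                for u, v in zip(emb_a, emb_b))
    return total / len(emb_a)


def contrastive_term(emb_a: List[Embedding], emb_b: List[Embedding],
                     temperature: float) -> float:
    """Sum over i of -log softmax_i of the similarity row for anchor i."""
    total = 0.0
    for i, u in enumerate(emb_a):
        logits = [cosine_similarity(u, v) / temperature for v in emb_b]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total -= logits[i] - log_denominator
    return total


def motion_transfer_loss(emb_a: List[Embedding], emb_b: List[Embedding],
                         lambda_contra: float = 1.0,
                         temperature: float = 0.1) -> float:
    """Alignment term plus weighted contrastive term."""
    return (alignment_term(emb_a, emb_b)
            + lambda_contra * contrastive_term(emb_a, emb_b, temperature))


# Two robots, two skills: matched pairs should score lower than mismatched ones.
robot_a = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]
mismatched = [[0.0, 1.0], [1.0, 0.0]]
print(motion_transfer_loss(robot_a, matched))
print(motion_transfer_loss(robot_a, mismatched))
```

The alignment term pulls paired trajectories together, while the contrastive term also pushes each anchor away from the other trajectories in the batch, which prevents the trivial solution of mapping everything to one point.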

4. Training Regime and Data Utilization

Comprehensive multi-modal training leverages:

  • Data Sources:
    • Thousands of manipulation trajectories across multiple physically distinct robots in the ALOHA, Bi-arm Franka, and Apollo datasets.
    • Large-scale text, image, and video corpora to pre-train for general world knowledge and visual-linguistic grounding.
  • Loss Function: The training objective aggregates language modeling, trajectory regression, MT alignment, progress estimation, and binary success detection: $\mathcal{L} = -\sum_t\log P(w_t\mid w_{<t},I) + \lambda_{\rm traj}\mathbb{E}\|\tau-\hat\tau\|^2 + \lambda_{\rm MT}L_{\rm MT} + \lambda_{\rm prog}\mathbb{E}(r-\hat r)^2 + \lambda_{\rm sd}\mathbb{E}[-s\log\hat s-(1-s)\log(1-\hat s)] + \cdots$
  • Optimization: AdamW optimizer with cosine-decay learning rate schedule, trained on scalable TPU v4/v5p/v6e hardware.
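As a rough illustration of the recipe above, the weighted sum of loss terms and a cosine-decay learning rate can be written as plain functions. This is a sketch, not the training code: the function names, the weight dictionary keys, and all numeric values are assumptions, the elided terms in the objective are omitted, and no warmup phase is modeled because the source does not mention one.

```python
import math
from typing import Dict


def binary_cross_entropy(s: float, s_hat: float, eps: float = 1e-7) -> float:
    """-s log(s_hat) - (1 - s) log(1 - s_hat), with s_hat clipped away from 0 and 1."""
    s_hat = min(max(s_hat, eps), 1.0 - eps)
    return -(s * math.log(s_hat) + (1.0 - s) * math.log(1.0 - s_hat))


def total_loss(lm_nll: float, traj_mse: float, mt_loss: float,
               prog_mse: float, sd_bce: float,
               weights: Dict[str, float]) -> float:
    """Language-modeling loss plus the four weighted auxiliary terms."""
    return (lm_nll
            + weights["traj"] * traj_mse
            + weights["MT"] * mt_loss
            + weights["prog"] * prog_mse
            + weights["sd"] * sd_bce)


def cosine_decay_lr(step: int, total_steps: int,
                    peak_lr: float, final_lr: float = 0.0) -> float:
    """Learning rate that falls from peak_lr to final_lr along half a cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


# Illustrative values only; the real weights and schedule are not given in the source.
weights = {"traj": 1.0, "MT": 0.5, "prog": 0.1, "sd": 0.1}
loss = total_loss(lm_nll=2.3, traj_mse=0.04, mt_loss=0.8,
                  prog_mse=0.01, sd_bce=binary_cross_entropy(1.0, 0.9),
                  weights=weights)
print(loss)
print([cosine_decay_lr(s, 100, 1e-3) for s in (0, 50, 100)])
```

In practice the decayed rate would be passed to an AdamW optimizer at each step; the weights control how strongly each auxiliary signal shapes the shared backbone.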

5. Quantitative Benchmarks and Real-World Evaluation

The system is benchmarked on spatial VQA, pointing tasks, success detection, and long-horizon real-world deployments. In the tables below, GR-ER 1.5 denotes Gemini Robotics-ER 1.5 and GR-ER denotes the prior version.

Embodied Reasoning (ER) Score on 15 benchmarks (mean of spatial and VQA tasks):

Model                     Thinking    ER Score
GR-ER 1.5                 Yes         59.6%
GR-ER 1.5 (no thinking)   No          53.7%
GR-ER (prior)             No          52.6%
Gemini 2.5 Flash          Yes         51.2%
GPT-5                     Yes         51.1%
GPT-5-mini                Yes         47.7%

Complex Pointing (average across five datasets):

Model                     Avg. Pointing
GR-ER 1.5 (Thinking)      52.6%
GR-ER 1.5 (No think)      47.1%
GR-ER                     49.1%
Gemini 2.5 Flash          39.7%
GPT-5                     30.8%
GPT-5-mini                27.1%

Success Detection Binary Accuracy:

Setting                   GR-ER 1.5    GR-ER
Offline multiview         0.79         0.68
Offline singleview        0.80         0.76
Real-time multiview       0.66         0.63
Real-time singleview      0.51         0.50

On shelf inspection tasks (IoU/point-accuracy): GR-ER 1.5 achieves IoU ≈ 70%, PointAcc ≈ 64%, outperforming prior iterations and GPT-5 baselines. In long-horizon ALOHA tasks, integrating GR-ER 1.5 with the VLA results in a progress score of 0.88, exceeding both the Gemini 2.5 Flash (0.64) and VLA alone (0.44) configurations. Failure rate in orchestration (planning, monitoring, acting) declines from 44.5% to 22.0% when deploying GR-ER 1.5.
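The margins implied by the figures reported above can be checked with a few lines of arithmetic. The numbers are copied from this section; the helper names `point_gap` and `relative_reduction` are illustrative.

```python
def point_gap(a: float, b: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(a - b, 1)


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction in percent, rounded to one decimal place."""
    return round(100.0 * (before - after) / before, 1)


# ER score (%), from the table of 15 benchmarks.
er = {"GR-ER 1.5": 59.6, "GR-ER 1.5 (no thinking)": 53.7, "GR-ER (prior)": 52.6}
# Complex pointing (%), average across five datasets.
pointing = {"GR-ER 1.5 (Thinking)": 52.6, "GR-ER 1.5 (No think)": 47.1, "GR-ER": 49.1}

thinking_gain_er = point_gap(er["GR-ER 1.5"], er["GR-ER 1.5 (no thinking)"])
thinking_gain_pointing = point_gap(pointing["GR-ER 1.5 (Thinking)"],
                                   pointing["GR-ER 1.5 (No think)"])
gain_over_prior_er = point_gap(er["GR-ER 1.5"], er["GR-ER (prior)"])
gain_over_prior_pointing = point_gap(pointing["GR-ER 1.5 (Thinking)"],
                                     pointing["GR-ER"])
# Orchestration failure rate falls from 44.5% to 22.0%.
failure_drop = relative_reduction(44.5, 22.0)

print(thinking_gain_er, thinking_gain_pointing)
print(gain_over_prior_er, gain_over_prior_pointing)
print(failure_drop)
```

Thinking accounts for most of the gain over the prior model on both metrics, and the drop in orchestration failure rate amounts to roughly halving it.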

6. Key Contributions, Limitations, and Ongoing Work

Gemini ER-1.5 sets a new state of the art in embodied reasoning through several advances:

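  • Performance Gains: With thinking enabled, GR-ER 1.5 improves on the prior GR-ER by 7.0 percentage points on the ER score (59.6% vs. 52.6%) and by 3.5 points on complex pointing (52.6% vs. 49.1%), and leads Gemini 2.5 Flash and GPT-5 by about 8.5 points on the ER score.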
  • Motion Transfer: Demonstrates zero-shot cross-embodiment skill transfer, unifying dataset utilization.
  • Multi-step Natural Language Reasoning: Implements a distributed “thinking paradigm,” enhancing interpretability and reliability in multi-stage agent workflows.
  • Agentic Integration: Combines high-level reasoning with robust low-level control, supporting failure detection and recovery in real deployments.

Limitations:

  • Manipulation dexterity remains similar to previous VLA models; plans are in place to incorporate reinforcement learning for enhanced fine motor control.
  • Current scaling is constrained by action-annotated datasets; integration of unlabelled video corpora is anticipated to increase skill diversity.
  • Ongoing work targets improved semantic safety alignment and robustness under dynamic real-world conditions.

This suggests that Gemini ER-1.5 forms a foundational architecture for generalist physical agents with advanced perception, reasoning, and robust action, providing a trajectory towards safer and more capable autonomous robotics (Abdolmaleki et al., 2 Oct 2025).
