Gemini Robotics-ER 1.5: Advanced Embodied Reasoning

Updated 13 April 2026

Gemini Robotics-ER 1.5 is an advanced embodied reasoning model integrating vision-language processing with robust dexterous control for multi-task robotic manipulation.
It employs a twin-stream transformer architecture with temporal consistency regularization and multi-view perception to enhance safety and performance.
Evaluated in both simulation and real-world settings, the model supports long-horizon tasks with dynamic re-planning and interpretable reasoning chains.

Gemini Robotics-ER 1.5 is an advanced embodied reasoning (ER) model within the Gemini Robotics 1.5 family, designed to couple high-level vision-language reasoning with robust, dexterous control for generalist robotic agents. Developed as an extension of the Gemini 2.0/2.5 multimodal backbone and evaluated in both simulated (Veo World Simulator) and real-world environments, ER 1.5 achieves strong generalization, perceptual acuity, and safety-oriented performance, particularly in long-horizon, multitask manipulation scenarios. The model is characterized by a twin-stream transformer architecture, temporal consistency regularization, multi-view perception, natural-language instruction following, and integration with robust action decoding modules (Team et al., 11 Dec 2025, Abdolmaleki et al., 2 Oct 2025, Team et al., 25 Mar 2025).

1. System and Architectural Overview

Gemini Robotics-ER 1.5 (GR-ER 1.5) is the orchestrating module in a two-stage agentic system: it handles high-level perception, language understanding, plan generation, and embodied reasoning, and it interfaces with a Vision-Language-Action (VLA) backbone, Gemini Robotics 1.5 (GR 1.5), for low-level motor control (Abdolmaleki et al., 2 Oct 2025).

Core inputs and processing:

Four simultaneous RGB image streams (top-down, side, left-wrist, right-wrist), synchronously tiled into latent representations
Natural-language instructions
Segmentation masks, environment metadata, and optional proprioceptive state

Key architectural features:

Vision encoder (Gemini 2.5 Flash backbone): grid-based tokenization of views
Language encoder: transformer-based tokenization of instructions
Twin-stream transformer head: one stream for visual features, one for language, merged via cross-attention
Action decoder: predicts 1 s action chunks at 50 Hz containing end-effector Cartesian velocity and gripper commands
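As a rough illustration of the action-chunk format described above, a 1 s chunk at 50 Hz holds 50 steps. The sketch below assumes a 6-D end-effector twist plus a scalar gripper command per step; that layout and the names `ActionStep`, `make_zero_chunk`, and `step_timestamps` are hypothetical and not taken from the source.

```python
from dataclasses import dataclass
from typing import List

CHUNK_SECONDS = 1.0  # chunk duration stated in the text
CONTROL_HZ = 50      # control rate stated in the text
STEPS_PER_CHUNK = int(CHUNK_SECONDS * CONTROL_HZ)  # 50 steps per chunk


@dataclass
class ActionStep:
    # Assumed layout: linear and angular velocity (vx, vy, vz, wx, wy, wz)
    # plus one scalar gripper command. The source does not give dimensions.
    ee_velocity: List[float]
    gripper: float


def make_zero_chunk() -> List[ActionStep]:
    """Build a placeholder chunk of zero-velocity steps with gripper command 0.0."""
    return [ActionStep([0.0] * 6, 0.0) for _ in range(STEPS_PER_CHUNK)]


def step_timestamps() -> List[float]:
    """Time offset in seconds of each step within the chunk."""
    return [i / CONTROL_HZ for i in range(STEPS_PER_CHUNK)]
```

Under these assumptions each chunk is a 50 × 7 array of commands, with consecutive steps 20 ms apart.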

Losses and regularization:

Behavioral cloning loss: mean squared error over observed demonstrations
L2 weight decay (λ = 10⁻⁴)
Temporal (feature-space) consistency loss $\mathcal{L}_C$ with weight μ = 0.1, to stabilize and align state transitions

Model size: 1.8B parameters
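The terms listed under "Losses and regularization" can be combined into a single objective. The sketch below is one plausible reading, not the published formulation: it assumes the total is the behavioral cloning MSE plus λ times the squared weight norm plus μ times the consistency term, and it assumes $\mathcal{L}_C$ is the mean squared difference between consecutive feature vectors. The source gives only the weights λ and μ, so every function name here is hypothetical.

```python
from typing import Sequence

WEIGHT_DECAY = 1e-4       # λ from the text
CONSISTENCY_WEIGHT = 0.1  # μ from the text


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def temporal_consistency(features: Sequence[Sequence[float]]) -> float:
    """Assumed form of L_C: mean squared change between consecutive feature vectors."""
    if len(features) < 2:
        return 0.0
    diffs = [mean_squared_error(a, b) for a, b in zip(features[:-1], features[1:])]
    return sum(diffs) / len(diffs)


def total_loss(pred_actions, demo_actions, weights, features) -> float:
    """Behavioral cloning MSE + λ * ||weights||^2 + μ * L_C (assumed combination)."""
    bc = mean_squared_error(pred_actions, demo_actions)
    reg = WEIGHT_DECAY * sum(w * w for w in weights)
    cons = CONSISTENCY_WEIGHT * temporal_consistency(features)
    return bc + reg + cons
```

In practice weight decay with AdamW is usually applied inside the optimizer rather than added to the loss; it is written as an explicit term here only to show the relative weighting.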

Embodied reasoning is internally structured as chains-of-thought in natural language, enabling interpretable planning, subgoal decomposition, and recovery behaviors (Abdolmaleki et al., 2 Oct 2025).

2. Training Methodologies and Datasets

Gemini Robotics-ER 1.5 leverages a multimodal, multitask dataset framework, combining:

12 months of teleoperated ALOHA 2 bimanual robot demonstrations
Real and simulated data from heterogeneous embodiments (ALOHA, Bi-arm Franka, Apollo humanoid)
Spatial QA benchmarks (ERQA), multi-view image pairs, 3D annotation corpora
Synthetic text/image data and diverse web corpora

Training strategy:

Pre-training on masked vision-language modeling and next-token prediction (Gemini 2.0/2.5)
Task-specific heads for 2D/3D spatial grounding, trajectory waypoint prediction, grasp pose estimation, and multi-view correspondence
Autoregressive action decoding trained on next-action prediction
Fine-tuning on chunked, high-frequency action rollouts with trajectory homing loss and temporal consistency regularizer
Data augmentation: visual (crops, color jitter), language (paraphrasing, distractors), action (perturbations)
Optimization: AdamW, LR warm-up/decay, batch sizes 2048–8192

Motion Transfer (MT) mechanisms enable alignment of motion primitives and action space representations across embodiments, using a mean-squared reconstruction loss and maximum mean discrepancy alignment (Abdolmaleki et al., 2 Oct 2025).
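To make the maximum mean discrepancy (MMD) alignment term concrete, the sketch below computes the standard biased estimate of squared MMD between two sets of embedding vectors using an RBF kernel. The kernel choice, the bandwidth, and the idea that the inputs are per-embodiment action embeddings are assumptions for illustration; the source does not specify how the term is implemented.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(a: Vector, b: Vector, bandwidth: float = 1.0) -> float:
    """Gaussian RBF kernel exp(-||a - b||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector], bandwidth: float) -> float:
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: Sequence[Vector], ys: Sequence[Vector], bandwidth: float = 1.0) -> float:
    """Biased estimate: E[k(x, x')] + E[k(y, y')] - 2 * E[k(x, y)]."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )
```

Minimizing this quantity pulls the two embedding distributions together: it is zero when the two sample sets coincide and grows as they separate.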

3. Functional Capabilities and Benchmarks

GR-ER 1.5 supports a comprehensive set of perception, spatial reasoning, and low-level control tasks:

Perceptual benchmarks:

2D open-vocabulary object detection: AP ≈ 48.3% (SUN-RGBD), outperforming closed-set experts
2D/3D spatial grounding: pointing, trajectory generation, grasp prediction (IoU ≥ 0.85), and multi-view correspondence
Monocular 3D box estimation: AP@15 = 48.3%
Multi-view success detection: real-time accuracy 0.79

Manipulation and control:

Zero- and few-shot execution of long-horizon tasks (folding, packing, tool use)
Long-horizon multi-step instructions, with >65% real-world few-shot success for complex in-context learning benchmarks
Robustness: <10% drop in performance over ±20 cm object translations, >80% task success under significant visual disturbance

Reasoning and planning:

Internal chains-of-thought expose plan subgoals, intermediate hypotheses, and recovery procedures
Progress estimation across subtasks, using weighted success metrics
Dynamic re-planning and interpretability via surfaced reasoning traces
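One simple way to read "weighted success metrics" for progress estimation is a weighted average of per-subtask completion scores. The sketch below assumes that form; the weights, the scores, and the function name `weighted_progress` are hypothetical, since the source does not define the metric.

```python
from typing import Sequence, Tuple


def weighted_progress(subtasks: Sequence[Tuple[float, float]]) -> float:
    """Weighted average of completion scores.

    Each entry is (weight, score), with score in [0, 1].
    """
    total_weight = sum(weight for weight, _ in subtasks)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(weight * score for weight, score in subtasks) / total_weight


# Hypothetical three-step task: two subtasks done, the last one not started.
example = [(0.5, 1.0), (0.3, 1.0), (0.2, 0.0)]
progress = weighted_progress(example)
```

With these illustrative numbers the estimate is 0.8, meaning the remaining subtask carries 20% of the total weight.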

4. Evaluation and Quantitative Performance

4.1 In-Distribution and OOD Simulation Results

Using the Veo World Simulator, ER 1.5 was evaluated on 80 nominal tasks and four axes of distribution shift (novel backgrounds, small/large distractors, novel objects):

Task	Real Success Rate	Sim Success Rate	Sim Task Time (s)
Pick & Place A	0.88 ± 0.03	0.83 ± 0.04	5.2 ± 0.4
Pick & Place B	0.84 ± 0.04	0.79 ± 0.05	5.8 ± 0.6
Stack Blocks	0.91 ± 0.02	0.86 ± 0.03	6.1 ± 0.5
Open & Close Box	0.76 ± 0.05	0.72 ± 0.06	7.0 ± 0.7
Pour Liquid	0.82 ± 0.04	0.78 ± 0.05	6.5 ± 0.5
Overall	0.84 ± 0.02	0.80 ± 0.03	6.1 ± 0.3

Success rates in simulation strongly correlate with hardware (Pearson ρ = 0.92). In out-of-distribution (OOD) scenarios, performance exhibited expected degradation, most pronounced for novel objects:

Axis	OOD Sim Success	Generalization Gap Δ
Background	0.67 ± 0.03	0.13 ± 0.03
Small Distractor	0.74 ± 0.04	0.06 ± 0.04
Large Distractor	0.71 ± 0.05	0.09 ± 0.05
Novel Object	0.55 ± 0.06	0.25 ± 0.06

All generalization gaps Δ are significant (p<0.01) except small distractors (p=0.12); simulation OOD ordering matches hardware with high rank consistency (Team et al., 11 Dec 2025).
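The gap column in the OOD table is consistent with subtracting each OOD simulation success rate from the overall nominal simulation success rate (0.80). The source does not state this definition, so it is an inference from the tabulated values; the short check below reproduces the column under that assumption.

```python
# Overall nominal simulation success rate from the first table.
NOMINAL_SIM_SUCCESS = 0.80

# OOD simulation success rates from the second table.
OOD_SIM_SUCCESS = {
    "Background": 0.67,
    "Small Distractor": 0.74,
    "Large Distractor": 0.71,
    "Novel Object": 0.55,
}


def generalization_gaps(nominal: float, ood: dict) -> dict:
    """Gap per axis, assumed to be nominal minus OOD success, rounded to 2 decimals."""
    return {axis: round(nominal - rate, 2) for axis, rate in ood.items()}


gaps = generalization_gaps(NOMINAL_SIM_SUCCESS, OOD_SIM_SUCCESS)
worst_axis = max(gaps, key=gaps.get)  # axis with the largest drop
```

The computed gaps are 0.13, 0.06, 0.09, and 0.25, matching the table, with novel objects as the largest drop.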

4.2 Comparative Analysis

Comparison with ER 1.0 and ER 2.0 showed:

Policy	Nominal success R	OOD success R	Nominal violation rate V (%)	OOD violation rate V (%)
ER 1.0	0.75	0.55	18.0	24.0
ER 1.5	0.80	0.67	12.0	16.0
ER 2.0	0.85	0.74	8.5	11.5

ER 1.5 delivers +5 pp nominal R and −6 pp nominal violation rate V versus ER 1.0, with further gains in ER 2.0. The OOD gap (nominal R minus OOD R) is reduced from 0.20 (ER 1.0) to 0.13 (ER 1.5) and 0.11 (ER 2.0), reflecting increased robustness (Team et al., 11 Dec 2025).
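The deltas and OOD gaps quoted above follow directly from the comparison table. The sketch below recomputes them, taking the OOD gap to be nominal R minus OOD R; the helper names are illustrative only.

```python
# Values copied from the comparison table (R as a fraction, V in percent).
POLICIES = {
    "ER 1.0": {"nominal_r": 0.75, "ood_r": 0.55, "nominal_v": 18.0, "ood_v": 24.0},
    "ER 1.5": {"nominal_r": 0.80, "ood_r": 0.67, "nominal_v": 12.0, "ood_v": 16.0},
    "ER 2.0": {"nominal_r": 0.85, "ood_r": 0.74, "nominal_v": 8.5, "ood_v": 11.5},
}


def ood_gap(policy: str) -> float:
    """Nominal success minus OOD success, rounded to 2 decimals."""
    row = POLICIES[policy]
    return round(row["nominal_r"] - row["ood_r"], 2)


def success_delta_pp(new: str, old: str) -> float:
    """Change in nominal success rate, in percentage points."""
    diff = POLICIES[new]["nominal_r"] - POLICIES[old]["nominal_r"]
    return round(diff * 100, 1)


def violation_delta_pp(new: str, old: str) -> float:
    """Change in nominal violation rate, in percentage points."""
    return round(POLICIES[new]["nominal_v"] - POLICIES[old]["nominal_v"], 1)
```

For example, `success_delta_pp("ER 1.5", "ER 1.0")` gives 5.0 and `violation_delta_pp("ER 1.5", "ER 1.0")` gives -6.0, in line with the figures in the text.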

5. Embodied Reasoning, Planning, and Interpretability

GR-ER 1.5 provides explicit multi-level internal reasoning that underpins its agentic behavior:

Embedded chains-of-thought are used for subtask decomposition, visual-spatial anchoring, and payload and safety assessment, directly interleaved with robot actions
Embodied Reasoning Score (ERScore): combines spatial and QA averages; GR-ER 1.5 (thinking) scores 59.6%, exceeding Gemini 2.5 Pro (51.7%) and GPT-5 (51.2%)
Pointing and Progress Estimation: Achieves 52.6% on diverse pointing benchmarks (a 15 pp gain over GPT-5) and state-of-the-art multiview success detection (0.79–0.80 real-time/offline accuracy) (Abdolmaleki et al., 2 Oct 2025).

A plausible implication is that reasoning traces not only support interpretability, but also enable more reliable recovery, error messaging, and context-aware planning, which are critical for human-aligned deployment in unstructured environments.

6. Safety, Red-Teaming, and Robustness

GR-ER 1.5 integrates multi-layered safety assurance, combining learned constraints, real-time monitors, and constitutional AI techniques:

Safety-QA Head (ASIMOV-Multimodal): Text+image classifier that vetoes or rewrites unsafe code predicted from instructions, with >95% binary accuracy on safety prompts (Team et al., 25 Mar 2025)
Runtime monitors: Enforce geofencing, force/torque bounds, contact constraints, and workspace restrictions via real-time action filtering
Red-teaming evaluation: In 100 simulated “red-team” scenes, V = 12% total violations (12.5% human-contact, 11.7% sharp-object scenarios), all confirmed in hardware. Example failures include unsafe gripper contact near human hands and hazard interaction (e.g., cutting a screen) (Team et al., 11 Dec 2025)
Fallback and rationale: Unsafe plans elicit human confirmation or internal rationale generation
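The runtime monitors above act as a filter between the policy output and the robot. The sketch below shows one way such a filter could combine a force bound, a speed cap, and a box-shaped geofence. All limits, units, and names (`SafetyLimits`, `filter_action`) are hypothetical; the source does not describe the actual monitor logic.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SafetyLimits:
    # All values are illustrative placeholders, not taken from the source.
    workspace_min: Tuple[float, float, float] = (-0.5, -0.5, 0.0)  # metres
    workspace_max: Tuple[float, float, float] = (0.5, 0.5, 0.8)    # metres
    max_speed: float = 0.25   # m/s
    max_force: float = 20.0   # newtons


def filter_action(
    position: Sequence[float],
    velocity: Sequence[float],
    measured_force: float,
    dt: float,
    limits: SafetyLimits,
) -> List[float]:
    """Return a linear velocity command that respects the limits."""
    # Force bound: stop completely if the measured contact force is too high.
    if measured_force > limits.max_force:
        return [0.0, 0.0, 0.0]

    # Speed cap: scale the command down while keeping its direction.
    speed = math.sqrt(sum(v * v for v in velocity))
    scale = min(1.0, limits.max_speed / speed) if speed > 0 else 1.0
    safe = [v * scale for v in velocity]

    # Geofence: zero any axis whose next position would leave the workspace box.
    for i in range(3):
        next_pos = position[i] + safe[i] * dt
        if next_pos < limits.workspace_min[i] or next_pos > limits.workspace_max[i]:
            safe[i] = 0.0
    return safe
```

A filter of this kind only addresses physical constraints; semantic checks, such as the Safety-QA head, would sit upstream of it.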

Relative to ER 1.0, ER 1.5 achieves a 33% relative reduction in the safety violation rate in both nominal and OOD settings. This layered design addresses both semantic and physical safety across tasks.
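The 33% figure can be checked against the violation rates in the Section 4.2 comparison table (18.0% to 12.0% nominal, 24.0% to 16.0% OOD). A minimal check, assuming the figure is a relative rather than absolute reduction:

```python
def relative_reduction_pct(old_rate: float, new_rate: float) -> float:
    """Relative reduction from old_rate to new_rate, in percent."""
    return (old_rate - new_rate) / old_rate * 100.0


# Violation rates (%) for ER 1.0 and ER 1.5 from the comparison table.
nominal_reduction = relative_reduction_pct(18.0, 12.0)
ood_reduction = relative_reduction_pct(24.0, 16.0)
```

Both values come out to one third, so the claim holds for each setting separately.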

7. Limitations and Future Prospects

While demonstrating significant advances, GR-ER 1.5 exhibits limitations:

Dexterity remains comparable to prior architectures; progress is expected from reinforcement learning-based fine-tuning and richer contact-rich data
Generalization in highly unstructured, real-world environments remains partially reliant on Veo’s world model fidelity and continued domain randomization
Data and compute bottlenecks limit the scaling of multi-embodiment, multimodal training; leveraging unlabeled human videos and synthetic simulation data is a future focus (Abdolmaleki et al., 2 Oct 2025)
Responsible deployment remains a research area, with ongoing development of ASIMOV-2.0 safety benchmarks and deployment policies for mixed human–robot workspaces

GR-ER 1.5 is positioned as a “sweet spot” within the Gemini model series: scalable capacity and consistent regularization cut safety incidents and deliver strong predictive validity both in simulation and on hardware, setting a reference point for future developments in general-purpose, robust, and interpretable robotics agents (Team et al., 11 Dec 2025, Team et al., 25 Mar 2025, Abdolmaleki et al., 2 Oct 2025).

References (3)

Evaluating Gemini Robotics Policies in a Veo World Simulator (2025)

Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer (2025)

Gemini Robotics: Bringing AI into the Physical World (2025)