
RoboBench: Multimodal Cognitive Benchmark

Updated 22 October 2025
  • RoboBench is a comprehensive benchmark that evaluates multimodal large language models as the cognitive core in robotic manipulation, spanning tasks from instruction comprehension to failure analysis.
  • It organizes evaluation around five critical cognitive dimensions—comprehension, perception, planning, affordance prediction, and failure analysis—using diverse, realistic multi-view data.
  • Experimental results reveal current MLLMs underperform in implicit instruction grounding, perception reasoning, and long-horizon planning, pinpointing actionable areas for improvement.

RoboBench is a systematic evaluation benchmark designed to assess multimodal LLMs (MLLMs) functioning as the cognitive core—termed the "embodied brain"—of robotic manipulation systems. Unlike previous benchmarks, which focus primarily on execution or cover only a subset of high-level reasoning capabilities, RoboBench spans the full manipulation pipeline. Its construction and evaluation protocols emphasize diverse embodiments, complex real-world tasks, nuanced affordance reasoning, and structured failure analysis. Through a large-scale, multi-dimensional QA framework and comprehensive dataset curation, RoboBench aims to standardize and advance the quantification of high-level cognition in embodied MLLMs (Luo et al., 20 Oct 2025).

1. Motivation and Scope

RoboBench addresses the longstanding challenge in robotics of evaluating not just low-level manipulation success but the broader cognitive capacities of MLLMs that serve as the reasoning and decision-making core in complex, dynamic settings. Previous systematic evaluations either measured only execution success or, when focusing on reasoning, were limited in scope, lacking realism in task, embodiment, and scene diversity. RoboBench is constructed explicitly to expose and quantify limitations in current MLLMs across instruction comprehension, perception, planning, action-affordance reasoning, and diagnostic feedback. Its scope covers 14 capabilities within five cognitive dimensions, across 25 representative tasks and 6092 QA pairs derived from real-world robotic data and multi-view scenes (Luo et al., 20 Oct 2025).

2. Benchmark Organization: Cognitive Dimensions and Capabilities

RoboBench is organized around five critical cognitive dimensions relevant for the embodied brain in manipulation tasks:

  1. Instruction Comprehension: Tests both explicit and implicit instruction grounding. Capabilities include understanding natural, indirect, or demand-based requests and correctly translating them to actionable plans.
  2. Perception Reasoning: Evaluates scene interpretation capabilities—robotic-centric (robot type and view recognition), object-centric (static and functional attributes), scene-centric (spatial/temporal/causal reasoning), and task-centric (identifying instruction-relevant objects).
  3. Generalized Planning: Measures the ability to decompose long-horizon instructions into partially ordered subgoal/action graphs that are physically feasible, robust across embodiment, object set, and view variations.
  4. Affordance Prediction: Focuses on predicting actionable spatial cues (e.g., contact points, trajectories, base positions) necessary to bridge high-level plans to low-level controllers.
  5. Failure Analysis: Quantifies the ability to detect and diagnose errors both at the execution level (alignment, trajectory deviation) and planning level (step omissions, ordering mistakes).

Each dimension contains several capabilities and is measured using dedicated tasks and QA items. For example, in planning, evaluation includes long-horizon plan sequence correctness, next-step decomposition, and binary task state estimation.
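
The paper's exact item schema is not reproduced here; purely as a hypothetical illustration (every field name and value below is invented), a planning-dimension QA record might be organized along these lines:

```python
# Hypothetical sketch of a RoboBench-style QA record; the real schema may differ.
qa_item = {
    "dimension": "Generalized Planning",
    "capability": "long-horizon plan generation",
    "embodiment": "dual-arm",                         # single-arm, dual-arm, mobile, humanoid, ...
    "views": ["head_cam.png", "left_wrist_cam.png"],  # multi-view observations (hypothetical paths)
    "instruction": "Set the table for two people.",
    "reference_plan": [                               # partially ordered subgoal/action steps
        {"skill": "pick", "object": "plate", "args": {}},
        {"skill": "place", "object": "plate", "args": {"location": "table"}},
    ],
}
```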

3. Data Curation and Realism

Datasets within RoboBench are explicitly curated for high realism, spanning:

  • Embodiment Diversity: Includes single-arm, dual-arm, mobile manipulation, and humanoid scenarios.
  • Attribute-rich Objects: Incorporates objects with broad variation in static (shape, material) and functional (openable, pushable, etc.) attributes, validated against world knowledge and real-world physical constraints.
  • Scene Realism and Multi-View: Scenes are constructed using multi-view setups to replicate occlusions, partial observability, and the need for memory-driven navigation.
  • Real Robotic Data Integration: Combines large-scale open-source datasets (such as RH20T, RoboMind) with rigorously labeled in-house real-robot data, narrowing the sim-to-real gap and ensuring evaluation tasks mirror operational complexity.

4. Evaluation Framework and Methodologies

A critical innovation in RoboBench is the "MLLM-as-world-simulator" evaluation for planning tasks. Rather than scoring via multiple-choice or mere string match, RoboBench simulates whether the predicted plan achieves critical object state changes within the given scene. The plan is represented as a partially ordered directed acyclic graph (DAG), with each node parameterized as ⟨skill, object, args⟩.
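
As a minimal sketch of this representation (class and field names here are assumptions, not the paper's implementation), each node and the partially ordered plan graph could be encoded as:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PlanNode:
    """One plan step, parameterized as <skill, object, args>."""
    skill: str         # e.g. "pick", "pour", "open" (illustrative skill names)
    obj: str           # the object acted on, e.g. "kettle"
    args: tuple = ()   # optional parameters, e.g. a target location

@dataclass
class PlanGraph:
    """Partially ordered plan: nodes plus precedence edges (u must finish before v)."""
    nodes: set = field(default_factory=set)   # set of PlanNode
    edges: set = field(default_factory=set)   # set of (PlanNode, PlanNode) pairs
```

Keeping nodes hashable (a frozen dataclass with tuple-valued args) lets the metrics below be computed with plain set operations.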

The evaluation metrics for long-horizon planning are formally defined:

  • NodeCorrectness:

$$\text{NodeCorrectness} = \left\lfloor \frac{|V^* \cap \hat{V}|}{|V^*|} \times 10 \right\rfloor$$

where $V^*$ is the set of reference nodes and $\hat{V}$ is the set of predicted nodes.

  • TaskCompletion:

$$\text{TaskCompletion} = \left\lfloor \frac{|\hat{S}|}{|S^*|} \times 10 \right\rfloor$$

where $S^*$ and $\hat{S}$ denote the reference and predicted object state-change sets, respectively.

The final long-horizon score is:

$$\text{LongHorizon} = \frac{\text{NodeCorrectness} + \text{TaskCompletion}}{20} \in [0, 1]$$
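
The sketch below transcribes these formulas directly into Python; obtaining the predicted state-change set from the MLLM-as-world-simulator step is assumed to happen upstream and is not shown:

```python
import math

def node_correctness(ref_nodes: set, pred_nodes: set) -> int:
    """floor(|V* intersect V_hat| / |V*| * 10): reference plan nodes recovered."""
    return math.floor(len(ref_nodes & pred_nodes) / len(ref_nodes) * 10)

def task_completion(ref_states: set, achieved_states: set) -> int:
    """floor(|S_hat| / |S*| * 10), where achieved_states is the set of critical
    object-state changes that simulated execution of the predicted plan achieves."""
    return math.floor(len(achieved_states) / len(ref_states) * 10)

def long_horizon(ref_nodes, pred_nodes, ref_states, achieved_states) -> float:
    """(NodeCorrectness + TaskCompletion) / 20, yielding a score in [0, 1]."""
    return (node_correctness(ref_nodes, pred_nodes)
            + task_completion(ref_states, achieved_states)) / 20
```

For example, recovering 3 of 4 reference nodes and achieving 2 of 3 critical state changes yields NodeCorrectness = 7, TaskCompletion = 6, and LongHorizon = 0.65.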

Affordance predictions are quantitatively compared to expert-labeled spatial cues, and error-diagnosis tasks are benchmarked for both the detection of execution errors and proper failure attribution.
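
RoboBench's precise affordance scoring rule is not reproduced here; as a generic, hypothetical illustration of how a predicted spatial cue might be checked against an expert label, one could threshold the Euclidean distance between predicted and reference contact points:

```python
import math

def contact_point_hit(pred_xy, expert_xy, tol_px: float = 10.0) -> float:
    """Hypothetical scoring rule (not the paper's published metric): count a
    predicted 2D contact point as correct if it lies within tol_px pixels of
    the expert-labeled point."""
    return 1.0 if math.dist(pred_xy, expert_xy) <= tol_px else 0.0
```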

5. Experimental Results and Insights

Comprehensive experiments with 14 state-of-the-art MLLMs revealed:

  • Models perform substantially worse (by ~30%) in implicit instruction comprehension compared to explicit instructions, indicating a deficiency in indirect intent grounding.
  • Perception reasoning remains a bottleneck, with poor robot-view recognition, temporal grounding, and causal inference scores.
  • Planning remains challenging: even with multi-view inputs and attribute augmentation, MLLMs fall short of human-level performance in long-horizon plan generation, dual-arm coordination, and correct sequencing for rare or ambiguous objects.
  • Affordance prediction lags behind human reference (top MLLM: Gemini-2.5-Pro, score 65.21; human: 82.63).
  • Failure analysis, especially in the diagnosis of execution errors, is inadequately handled by current models (typical scores 10–20), revealing a fundamental gap in the embodied brain’s error comprehension and explanation.

These findings offer actionable direction—indicating where current MLLMs are brittle (implicit intent, spatiotemporal reasoning, generalization across scenes and objects) and where further multi-modal, real-world, and temporal grounding must be advanced.

6. Implications for Embodied Intelligence Research

RoboBench establishes itself as a comprehensive scaffold for systematically quantifying embodied cognition in robotic manipulation. By addressing not just task success but the underlying abilities—comprehension, perception, planning, affordance reasoning, and diagnostic feedback—it delivers nuanced diagnostics essential for iterative MLLM development.

A plausible implication is that progress in embodied MLLMs will require:

  • Integrating deep multi-view, embodiment, and attribute-aware perception systems.
  • Developing new mechanisms for robust implicit instruction grounding and spatiotemporally-attentive planning.
  • Directly benchmarking not only plan generation, but also the physical realizability of plans and the ability to self-diagnose and recover from errors.

7. Future Directions

The paper outlines several forward-looking directions:

  • Enhanced Instruction Grounding: Developing MLLMs capable of leveraging contextual, perceptual, and world priors for resolving indirect or implicit natural language instructions.
  • Advanced Spatiotemporal Perception: Incorporating embodiment-aware scene parsing and more effective temporal/causal reasoning techniques, potentially with explicit attention mechanisms for relational reasoning.
  • Robust Plan, Affordance, and Failure Modules: Closing the gap with human reference in plan generalization, affordance spatial reasoning, and execution-level failure diagnosis through closer model-simulator integration and continual dataset diversification.
  • Continual Dataset Refinement: Expanding to include broader environments, real-world constraints, and human-in-the-loop interactions, further reducing sim-to-real discrepancies.
  • Benchmark Evolution: Updating evaluation methods as new MLLM paradigms and real-robotic control architectures arise, maintaining RoboBench’s relevance as a standard in embodied AI research.

RoboBench thus serves as both a comprehensive assessment platform and a guidepost for designing and quantifying the next generation of multimodal, high-level reasoning systems for embodied robotic manipulation (Luo et al., 20 Oct 2025).
