MomaGraph-Bench: Unified Scene Graph Evaluation
- MomaGraph-Bench is a systematic evaluation suite designed to integrate unified scene graphs into high-level planning and reasoning for mobile manipulators.
- It leverages richly annotated MomaGraph-Scenes data and a 7B RL-trained vision–language model to assess action sequencing, spatial, and functional reasoning.
- The benchmark employs multi-choice VQA metrics and real-robot experiments to validate improvements in dynamic verification, task decomposition, and multi-view consistency.
MomaGraph-Bench is a systematic evaluation suite that measures how well state-aware, unified scene graphs support embodied task planning for mobile manipulators in household environments. It is built on the MomaGraph-Scenes dataset of richly annotated, task-driven scene graphs and on the MomaGraph-R1 vision–language model. Its aim is to quantify how structured, task-oriented intermediate representations improve high-level planning, spatial and functional reasoning, and scene understanding in robotics (Ju et al., 18 Dec 2025).
1. Purpose and Motivation
MomaGraph-Bench addresses three foundational gaps in prior research on embodied agents: the absence of large-scale, richly annotated scene graphs combining spatial and functional relations; the lack of an interpretable vision–language model (VLM) that generates actionable scene graphs and leverages them for zero-shot planning; and the need for a comprehensive benchmark to evaluate these models across a spectrum of reasoning tasks, from action sequencing to fine-grained multi-view consistency. The benchmark complements the MomaGraph-Scenes dataset (~1,050 task-oriented subgraphs, >350 household scenes, 6,278 multi-view images) and the MomaGraph-R1 model (7B parameters, RL-trained) (Ju et al., 18 Dec 2025).
2. Benchmark Structure and Reasoning Capabilities
At its core, MomaGraph-Bench comprises a multi-choice visual question answering (VQA) suite that probes six essential reasoning abilities, each mapped to specialized question styles sampled on held-out environments:
- Action Sequence Reasoning: Evaluation of high-level, minimal action plans to achieve specified goals (e.g., selecting and ordering steps to boil water).
- Spatial Reasoning: Assessment of understanding positional relations (e.g., left_of, in_front_of, reachability).
- Object Affordance Reasoning: Functional reasoning over objects’ action capabilities (e.g., selecting which part of a microwave to interact with).
- Precondition & Effect Reasoning: Temporal reasoning about precondition–effect dependencies (e.g., recognizing when an action has no effect due to unmet preconditions).
- Goal Decomposition: Task generalization via decomposition into parallel or sequential subgoals.
- Visual Correspondence: Multi-view consistency checks, tracking object parts across camera perspectives.
Additionally, the benchmark extends to dynamic verification (replanning when objects or outcomes differ from expectation) and long-horizon task decomposition (multi-step replanning). The full suite spans 1,315 VQA instances across 294 held-out scenes, organized into difficulty tiers T1–T4 that increase in complexity and in the amount of multi-step reasoning required.
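The instance layout described above could be represented roughly as follows. This is a minimal sketch, not the benchmark's actual schema: every field name, the `ReasoningType` enum, and the example question are assumptions made for illustration; only the six reasoning categories and the T1–T4 tier range come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class ReasoningType(Enum):
    """The six reasoning abilities probed by the VQA suite."""
    ACTION_SEQUENCE = "action_sequence"
    SPATIAL = "spatial"
    OBJECT_AFFORDANCE = "object_affordance"
    PRECONDITION_EFFECT = "precondition_effect"
    GOAL_DECOMPOSITION = "goal_decomposition"
    VISUAL_CORRESPONDENCE = "visual_correspondence"


@dataclass(frozen=True)
class VQAInstance:
    # All field names are hypothetical; the released data format may differ.
    scene_id: str
    image_paths: Tuple[str, ...]   # multi-view images of one scene
    question: str
    choices: Tuple[str, ...]
    answer_index: int              # index of the correct entry in `choices`
    reasoning_type: ReasoningType
    tier: int                      # difficulty tier: 1 (T1) through 4 (T4)

    def __post_init__(self) -> None:
        # Reject instances that could not be scored consistently.
        if not 1 <= self.tier <= 4:
            raise ValueError("tier must be between 1 and 4")
        if not 0 <= self.answer_index < len(self.choices):
            raise ValueError("answer_index must index into choices")


# Made-up example in the spirit of the "boil water" task mentioned above.
example = VQAInstance(
    scene_id="kitchen_001",
    image_paths=("view_0.jpg", "view_1.jpg"),
    question="Which step must come first to boil water?",
    choices=("Turn on the stove", "Fill the kettle", "Pour the water"),
    answer_index=1,
    reasoning_type=ReasoningType.ACTION_SEQUENCE,
    tier=2,
)
```

Validating the tier and answer index at construction time keeps malformed items out of an evaluation run instead of silently skewing accuracy.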
3. Dataset Design and Task Construction
MomaGraph-Bench builds on MomaGraph-Scenes, annotated with both spatial (9 types, e.g., LEFT_OF, TOUCHING) and functional (6 types, e.g., OPEN_OR_CLOSE, ACTIVATE) relations at the part level. Each evaluation instance grounds its multi-choice question in a sampled subgraph, ensuring balanced coverage across all reasoning types and difficulty levels.
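A part-level graph with both relation families might be sketched as below. The class names, node identifiers such as `microwave.handle`, and the dotted part naming are all hypothetical. Only the four relation names quoted in the text are included; the remaining spatial and functional types are deliberately left out rather than guessed.

```python
from dataclasses import dataclass, field
from typing import List

# Only the relation names quoted in the text; the full sets
# (9 spatial, 6 functional) are not reproduced here.
SPATIAL_RELATIONS = {"LEFT_OF", "TOUCHING"}
FUNCTIONAL_RELATIONS = {"OPEN_OR_CLOSE", "ACTIVATE"}


@dataclass(frozen=True)
class Edge:
    source: str    # node id of an object or object part
    relation: str  # one of the relation names above
    target: str

    @property
    def kind(self) -> str:
        """Classify the edge as 'spatial' or 'functional'."""
        if self.relation in SPATIAL_RELATIONS:
            return "spatial"
        if self.relation in FUNCTIONAL_RELATIONS:
            return "functional"
        raise ValueError(f"unknown relation: {self.relation}")


@dataclass
class SceneSubgraph:
    task: str
    nodes: List[str] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

    def add_edge(self, source: str, relation: str, target: str) -> Edge:
        edge = Edge(source, relation, target)
        _ = edge.kind  # raises ValueError for an unknown relation
        for node in (source, target):
            if node not in self.nodes:
                self.nodes.append(node)
        self.edges.append(edge)
        return edge

    def edges_of_kind(self, kind: str) -> List[Edge]:
        return [e for e in self.edges if e.kind == kind]


# Hypothetical subgraph for a microwave task.
graph = SceneSubgraph(task="heat food in the microwave")
graph.add_edge("microwave.handle", "OPEN_OR_CLOSE", "microwave.door")
graph.add_edge("cup", "LEFT_OF", "microwave")
```

Keeping spatial and functional edges in one structure, distinguished only by relation name, mirrors the unified representation the benchmark is built around.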
Tasks are validated through dual human annotation rounds to filter ambiguity or ill-posed prompts, and the benchmark is structured to capture real-world complexity—covering diverse household environments and a variety of manipulation and navigation scenarios.
| Statistic | MomaGraph-Scenes | MomaGraph-Bench (Eval) |
|---|---|---|
| Scenes | >350 | 294 |
| Subgraphs | ~1,050 | 352 |
| Multi-view images | 6,278 | 1,446 |
| VQA instances | - | 1,315 |
| Task templates | 93 | - |
4. Evaluation Metrics and Protocols
Performance on MomaGraph-Bench is quantified through several protocols:
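- Multi-Choice Accuracy: the fraction of questions answered correctly, $\mathrm{Acc} = N_{\text{correct}} / N_{\text{total}}$.
- Scene Graph Generation:
  - Node Precision and Recall: in their standard form, with predicted node set $\hat{V}$ and ground-truth node set $V$, $P_{\text{node}} = |\hat{V} \cap V| / |\hat{V}|$ and $R_{\text{node}} = |\hat{V} \cap V| / |V|$.
  - Edge metrics are defined similarly over edge sets, and F1 is the harmonic mean $F_1 = 2PR / (P + R)$.
- Task Success Rate (Robotics): Reports overall, graph-generation, and planning success rates, quantifying full-pipeline efficacy in real-robot experiments.
- RL Training Rewards: MomaGraph-R1 learns from a composite per-sample reward that targets alignment with both the graph structure and downstream planning use.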
Together, these protocols cover multi-choice reasoning, scene graph prediction, and embodied real-world deployment.
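The accuracy and graph metrics can be illustrated with a short sketch. It assumes exact-match comparison of answer indices and of node or edge identifiers; the benchmark's actual matching rules (for example, how predicted nodes are aligned to annotations) are not specified here, and the function names and sample data are made up for illustration.

```python
from typing import Hashable, Sequence, Set, Tuple


def multi_choice_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of questions whose predicted choice index equals the gold index."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def precision_recall_f1(
    predicted: Set[Hashable], gold: Set[Hashable]
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 (exact match is an assumption)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Nodes are compared by id; edges as (source, relation, target) triples.
pred_nodes = {"kettle", "stove", "sink"}
gold_nodes = {"kettle", "stove", "sink", "faucet"}
node_p, node_r, node_f1 = precision_recall_f1(pred_nodes, gold_nodes)

pred_edges = {("kettle", "LEFT_OF", "stove")}
gold_edges = {("kettle", "LEFT_OF", "stove"), ("kettle", "TOUCHING", "sink")}
edge_p, edge_r, edge_f1 = precision_recall_f1(pred_edges, gold_edges)

acc = multi_choice_accuracy([0, 2, 1, 3], [0, 2, 2, 3])
```

Because the same set-based function handles both nodes and edges, the edge metrics follow directly once edges are encoded as hashable triples.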
5. Baseline Models and Comparative Results
MomaGraph-Bench evaluates a spectrum of open- and closed-source VLMs. The following table displays overall accuracy (%) in the Graph-then-Plan (“w/ Graph”) setting:
| Model | Accuracy (%) |
|---|---|
| Claude-4.5-Sonnet | 73.9 |
| GPT-5 | 71.6 |
| Gemini-2.5-Pro | 71.6 |
| MomaGraph-R1 | 71.6 |
| Qwen2.5-VL-7B | 60.2 |
| LLaVA-OneVision-7B | 55.6 |
| DeepSeek-VL2 | 54.3 |
| InternVL2.5-8B | 51.1 |
| LLaVA-V1.5-7B | 44.8 |
| InstructBLIP-7B | 39.5 |
MomaGraph-R1 (7B, RL-trained) gains 11.4 percentage points over its pretrained base, Qwen2.5-VL-7B (71.6% vs. 60.2%). In the table above it matches GPT-5 and Gemini-2.5-Pro and trails Claude-4.5-Sonnet by 2.3 points. Ablations show that unified spatial+functional graphs (71.6%) outperform both spatial-only (59.9%) and functional-only (64.9%) graphs. Among training regimes, reinforcement learning with specialized graph-alignment rewards performs best: supervised finetuning (SFT: 63.9%) and in-context learning (ICL: 60.2%) lag behind.
Reward-weight sensitivity analyses demonstrate stability within ±3.5% accuracy across hyperparameter adjustments, supporting reproducibility and robustness.
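The gaps quoted in this section are differences in percentage points, and can be recomputed from the reported accuracies. The snippet below is only a worked arithmetic check; the dictionary keys and the helper name are labels chosen for this illustration.

```python
# Accuracy values (%) copied from the comparison and ablation figures above.
ACCURACY = {
    "MomaGraph-R1 (RL, unified graph)": 71.6,
    "Qwen2.5-VL-7B (base)": 60.2,
    "SFT": 63.9,
    "ICL": 60.2,
    "spatial-only graph": 59.9,
    "functional-only graph": 64.9,
}

REFERENCE = "MomaGraph-R1 (RL, unified graph)"


def points_behind_reference(name: str) -> float:
    """Percentage-point gap between the reference configuration and `name`."""
    return round(ACCURACY[REFERENCE] - ACCURACY[name], 1)


for name in ACCURACY:
    if name != REFERENCE:
        print(f"{name}: {points_behind_reference(name)} points behind")
```

Reporting the gaps in points rather than as relative percentages avoids ambiguity: 11.4 points over a 60.2% baseline is a relative improvement of roughly 19%.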
6. Insights, Limitations, and Applications
Key insights from MomaGraph-Bench include:
- Explicit Intermediate Representations: The explicit incorporation of task-oriented scene graphs as an intermediate step (Graph-then-Plan) yields consistent improvements in both closed- and open-source VLMs.
- Graph Structure: Unified spatial–functional scene graphs are disproportionately beneficial for multi-step and complex manipulation tasks relative to single-relation graphs.
- Training Paradigms: Reinforcement learning that directly aligns graph outputs with downstream planning utility significantly outperforms SFT and ICL.
- Limitations of Current VLMs: The benchmark reveals that most vision–language models degrade sharply on Tier 4 (dynamic verification, flexible re-planning) tasks, highlighting persistent generalization bottlenecks.
- Broad Applicability: As a diagnostic tool, MomaGraph-Bench supports evaluation of embodied navigation, object manipulation (via affordance and precondition reasoning), multi-view perception, and long-horizon planning.
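The Graph-then-Plan idea from the first insight can be sketched as a two-stage query. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the `Model` callable interface, the prompt wording, the JSON graph format, and the stub model are all hypothetical.

```python
import json
from typing import Callable, Sequence

# Hypothetical interface: a model takes a text prompt and image paths, returns text.
Model = Callable[[str, Sequence[str]], str]


def graph_then_plan(
    model: Model, images: Sequence[str], question: str, choices: Sequence[str]
) -> int:
    """Return the chosen option index, or -1 if the reply cannot be parsed."""
    # Stage 1: request a task-oriented scene graph as JSON.
    graph_prompt = (
        "Describe the task-relevant scene graph as JSON with 'nodes' and "
        f"'edges' for this question: {question}"
    )
    try:
        graph = json.loads(model(graph_prompt, images))
    except json.JSONDecodeError:
        graph = {"nodes": [], "edges": []}  # fall back to an empty graph

    # Stage 2: answer the question conditioned on the generated graph.
    options = "\n".join(f"{i}. {c}" for i, c in enumerate(choices))
    answer_prompt = (
        f"Scene graph: {json.dumps(graph)}\n"
        f"Question: {question}\n{options}\n"
        "Reply with the index of the best option."
    )
    reply = model(answer_prompt, images).strip()
    try:
        index = int(reply)
    except ValueError:
        return -1
    return index if 0 <= index < len(choices) else -1


def stub_model(prompt: str, images: Sequence[str]) -> str:
    """Stand-in for a real VLM so the sketch runs without any model."""
    if prompt.startswith("Describe"):
        return json.dumps(
            {"nodes": ["kettle", "faucet"],
             "edges": [["kettle", "LEFT_OF", "faucet"]]}
        )
    return "1"


choice = graph_then_plan(
    stub_model,
    ["view_0.jpg"],
    "Which object is left of the faucet?",
    ["faucet", "kettle"],
)
```

Separating the two stages makes the intermediate graph inspectable, which is what allows graph quality and planning quality to be scored independently.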
A plausible implication is that widespread adoption of task-aligned graph intermediates could influence both robotics and multimodal AI, suggesting fruitful areas for further research in scalable graph construction, multi-modal fusion, and robust plan execution (Ju et al., 18 Dec 2025).
7. Future Directions and Research Opportunities
MomaGraph-Bench establishes a principled standard for evaluating scene graph-based planning but also highlights several open challenges:
- Scalability: Extending annotations and reasoning complexity to larger and more diverse household environments.
- Generalization: Robustness to highly novel scenes, object configurations, and manipulator types.
- Multi-Modal Expansion: Incorporating modalities such as audio or haptics into scene graph structure and associated VQA.
- Improved Cross-Modal Reasoning: Deeper integration of spatial, functional, and temporal information during policy rollout.
This suggests an ongoing research agenda centered on improving multimodal scene understanding, advancing RL-based VLM training strategies, and bridging the gap from simulated benchmarks to reliable physical robot deployments.