MomaGraph-Bench: Unified Scene Graph Evaluation

Updated 4 May 2026

MomaGraph-Bench is a systematic evaluation suite designed to integrate unified scene graphs into high-level planning and reasoning for mobile manipulators.
It leverages richly annotated MomaGraph-Scenes data and a 7B RL-trained vision–language model to assess action sequencing, spatial, and functional reasoning.
The benchmark employs multi-choice VQA metrics and real-robot experiments to validate improvements in dynamic verification, task decomposition, and multi-view consistency.

MomaGraph-Bench is an evaluation suite that measures how well state-aware, unified scene graphs support embodied task planning for mobile manipulators in household environments. It is built on the MomaGraph-Scenes dataset (richly annotated, task-driven scene graphs) and the MomaGraph-R1 vision–language model, and aims to quantify how structured, task-oriented intermediate representations improve high-level planning, spatial and functional reasoning, and scene understanding in robotics contexts (Ju et al., 18 Dec 2025).

1. Purpose and Motivation

MomaGraph-Bench addresses three foundational gaps in prior research on embodied agents: the absence of large-scale, richly annotated scene graphs combining spatial and functional relations; the lack of an interpretable vision–language model (VLM) that generates actionable scene graphs and leverages them for zero-shot planning; and the need for a comprehensive benchmark to evaluate these models across a spectrum of reasoning tasks, from action sequencing to fine-grained multi-view consistency. The benchmark complements the MomaGraph-Scenes dataset (∼1,050 task-oriented subgraphs, >350 household scenes, 6,278 multi-view images) and the MomaGraph-R1 model (7B parameters, RL-trained) (Ju et al., 18 Dec 2025).

2. Benchmark Structure and Reasoning Capabilities

At its core, MomaGraph-Bench comprises a multi-choice visual question answering (VQA) suite that probes six essential reasoning abilities, each mapped to specialized question styles sampled on held-out environments:

Action Sequence Reasoning: Evaluation of high-level, minimal action plans to achieve specified goals (e.g., selecting and ordering steps to boil water).
Spatial Reasoning: Assessment of understanding positional relations (e.g., left_of, in_front_of, reachability).
Object Affordance Reasoning: Functional reasoning over objects’ action capabilities (e.g., selecting which part of a microwave to interact with).
Precondition & Effect Reasoning: Temporal reasoning about precondition–effect dependencies (e.g., recognizing when an action has no effect due to unmet preconditions).
Goal Decomposition: Task generalization via decomposition into parallel or sequential subgoals.
Visual Correspondence: Multi-view consistency checks, tracking object parts across camera perspectives.

Additionally, the benchmark extends to dynamic verification (replanning if objects or outcomes differ from expectation) and long-horizon task decomposition (multi-step replanning). The full suite spans 1,315 VQA instances across 294 held-out scenes and includes difficulty tiers T1–T4, with increasing complexity and requirement for multi-step reasoning.
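As a rough illustration of how a suite organized this way might be scored per ability or per tier, the sketch below defines a hypothetical item record and a grouped-accuracy helper. The class name `VQAItem`, its field names, and the ability identifiers are assumptions made for this example; the source does not describe the benchmark's actual file format.

```python
from collections import defaultdict
from dataclasses import dataclass

# Identifiers are illustrative labels for the six abilities listed above.
ABILITIES = (
    "action_sequence", "spatial", "object_affordance",
    "precondition_effect", "goal_decomposition", "visual_correspondence",
)
TIERS = ("T1", "T2", "T3", "T4")


@dataclass(frozen=True)
class VQAItem:
    """Hypothetical record for one multi-choice question."""
    question: str
    options: tuple
    answer_index: int
    ability: str
    tier: str

    def __post_init__(self):
        # Reject labels outside the six abilities and four tiers.
        if self.ability not in ABILITIES:
            raise ValueError(f"unknown ability: {self.ability}")
        if self.tier not in TIERS:
            raise ValueError(f"unknown tier: {self.tier}")
        if not 0 <= self.answer_index < len(self.options):
            raise ValueError("answer_index out of range")


def accuracy_by(items, predictions, key):
    """Fraction of correct answers, grouped by an item attribute such as 'tier'."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, predicted_index in zip(items, predictions):
        group = getattr(item, key)
        total[group] += 1
        correct[group] += int(predicted_index == item.answer_index)
    return {group: correct[group] / total[group] for group in total}
```

Calling `accuracy_by(items, predictions, "ability")` instead of `"tier"` would give the per-ability view with the same helper.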

3. Dataset Design and Task Construction

MomaGraph-Bench builds on MomaGraph-Scenes, annotated with both spatial (9 types, e.g., LEFT_OF, TOUCHING) and functional (6 types, e.g., OPEN_OR_CLOSE, ACTIVATE) relations at the part level. Each evaluation instance grounds its multi-choice question in a sampled subgraph, ensuring balanced coverage across all reasoning types and difficulty levels.
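One way to picture a part-level subgraph with both relation families is the minimal sketch below. The class names (`Node`, `Edge`, `TaskSubgraph`) and the example objects are hypothetical, and only relation names that appear in the text are used, since the full sets of 9 spatial and 6 functional types are not enumerated here.

```python
from dataclasses import dataclass, field
from typing import Optional

RELATION_KINDS = ("spatial", "functional")


@dataclass(frozen=True)
class Node:
    """An object, optionally narrowed to one of its parts (e.g. a handle)."""
    object_name: str
    part: Optional[str] = None


@dataclass(frozen=True)
class Edge:
    source: Node
    relation: str   # e.g. "LEFT_OF", "TOUCHING", "OPEN_OR_CLOSE", "ACTIVATE"
    target: Node
    kind: str       # "spatial" or "functional"


@dataclass
class TaskSubgraph:
    """Hypothetical container for one task-oriented subgraph."""
    task: str
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_edge(self, source, relation, target, kind):
        if kind not in RELATION_KINDS:
            raise ValueError(f"kind must be one of {RELATION_KINDS}")
        self.nodes.update((source, target))
        self.edges.add(Edge(source, relation, target, kind))

    def relations(self, kind):
        """Return only the edges of one relation family."""
        return {edge for edge in self.edges if edge.kind == kind}


# Made-up example: a microwave handle that opens the door, and a cup beside it.
graph = TaskSubgraph(task="heat food")
graph.add_edge(Node("microwave", "handle"), "OPEN_OR_CLOSE",
               Node("microwave", "door"), "functional")
graph.add_edge(Node("cup"), "LEFT_OF", Node("microwave"), "spatial")
```

Keeping the relation family on each edge makes it straightforward to derive spatial-only or functional-only views of the same graph.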

Tasks are validated through dual human annotation rounds to filter ambiguity or ill-posed prompts, and the benchmark is structured to capture real-world complexity—covering diverse household environments and a variety of manipulation and navigation scenarios.

Statistic	MomaGraph-Scenes	MomaGraph-Bench (Eval)
Scenes	>350	294
Subgraphs	~1,050	352
Multi-view images	6,278	1,446
VQA instances	-	1,315
Task templates	93	-

4. Evaluation Metrics and Protocols

Performance on MomaGraph-Bench is quantified through several protocols:

Multi-Choice Accuracy:

$\text{Accuracy} = \frac{\#\text{correctly answered questions}}{\#\text{total questions}}$

Scene Graph Generation: Node precision and recall are

$P_N = \frac{|\mathcal{N}_{\rm pred}\cap\mathcal{N}_{\rm gt}|}{|\mathcal{N}_{\rm pred}|},\quad R_N = \frac{|\mathcal{N}_{\rm pred}\cap\mathcal{N}_{\rm gt}|}{|\mathcal{N}_{\rm gt}|}$

with edge metrics and F1 defined similarly.
Task Success Rate (Robotics): Measures overall, graph generation, and planning success, quantifying full pipeline efficacy in real-robot experiments.
RL Training Rewards: MomaGraph-R1 learns from a composite per-sample reward

$\mathcal{R} = w_a R_{\rm action} + R_{\rm edges} + R_{\rm nodes} + w_f R_{\rm format} + w_l R_{\rm length}$

targeting alignment to both the graph structure and downstream planning use.
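The composite reward can be read as a plain weighted sum. The sketch below mirrors the formula term by term and assumes each component reward has already been computed as a scalar. The function name and the weight values in the usage line are placeholders, since the source does not report the actual weights.

```python
def composite_reward(r_action, r_edges, r_nodes, r_format, r_length,
                     *, w_a, w_f, w_l):
    """R = w_a*R_action + R_edges + R_nodes + w_f*R_format + w_l*R_length.

    The edge and node terms carry no explicit weight, as in the formula above.
    """
    return (w_a * r_action
            + r_edges
            + r_nodes
            + w_f * r_format
            + w_l * r_length)


# Placeholder weights and component scores, chosen only to exercise the function.
example = composite_reward(1.0, 0.5, 0.75, 1.0, 0.0, w_a=1.0, w_f=0.5, w_l=0.25)
```

Making the weights keyword-only arguments without defaults avoids silently baking in values that the source does not specify.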

This methodology ensures scoring fidelity across synthetic reasoning, scene graph prediction, and embodied real-world deployment.
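The set-based graph metrics can be sketched in a few lines. This example assumes exact matching of node labels and of `(source, relation, target)` edge triples, which is an assumption of the sketch; the source does not state the matching rule used by the benchmark. The helper name and the sample graphs are hypothetical.

```python
def precision_recall_f1(predicted, ground_truth):
    """Set-overlap precision, recall, and F1; each is 0.0 when undefined."""
    overlap = len(predicted & ground_truth)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(ground_truth) if ground_truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator > 0 else 0.0
    return precision, recall, f1


# Nodes are compared by label; edges by (source, relation, target) triple.
pred_nodes = {"kettle", "stove", "sink"}
gt_nodes = {"kettle", "stove", "faucet", "cup"}
node_p, node_r, node_f1 = precision_recall_f1(pred_nodes, gt_nodes)

pred_edges = {("kettle", "LEFT_OF", "stove")}
gt_edges = {("kettle", "LEFT_OF", "stove"), ("faucet", "ACTIVATE", "sink")}
edge_p, edge_r, edge_f1 = precision_recall_f1(pred_edges, gt_edges)
```

Because the same function handles any hashable elements, node and edge scores share one code path, matching the "edge metrics similarly" wording above.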

5. Baseline Models and Comparative Results

MomaGraph-Bench evaluates a spectrum of open- and closed-source VLMs. The following table displays overall accuracy (%) in the Graph-then-Plan (“w/ Graph”) setting:

Model	Accuracy (%)
Claude-4.5-Sonnet	73.9
GPT-5	71.6
Gemini-2.5-Pro	71.6
MomaGraph-R1 (ours)	71.6
Qwen2.5-VL-7B	60.2
LLaVA-OneVision-7B	55.6
DeepSeek-VL2	54.3
InternVL2.5-8B	51.1
LLaVA-V1.5-7B	44.8
InstructBLIP-7B	39.5

MomaGraph-R1 (7B RL-trained) gains 11.4 percentage points over its pretrained base (Qwen2.5-VL-7B, 60.2%), matching GPT-5 and Gemini-2.5-Pro (both 71.6%) while trailing Claude-4.5-Sonnet (73.9%). Ablations show that unified spatial+functional graphs (71.6%) outperform both spatial-only (59.9%) and functional-only (64.9%) graphs. Among training regimes, reinforcement learning with specialized graph-alignment rewards performs best: supervised finetuning (SFT: 63.9%) and in-context learning (ICL: 60.2%) lag behind.
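The gaps quoted in this section can be recomputed directly from the reported accuracies. The snippet below is only a sanity check on the arithmetic, using the figures stated above; the dictionary keys and the helper name are labels chosen for this example.

```python
# Accuracies (%) as reported in this section.
reported = {
    "MomaGraph-R1": 71.6,
    "Qwen2.5-VL-7B": 60.2,
    "spatial_only": 59.9,
    "functional_only": 64.9,
    "SFT": 63.9,
    "ICL": 60.2,
}


def gap_in_points(better, worse):
    """Difference in percentage points, rounded to one decimal place."""
    return round(reported[better] - reported[worse], 1)


gain_over_base = gap_in_points("MomaGraph-R1", "Qwen2.5-VL-7B")
gain_over_spatial = gap_in_points("MomaGraph-R1", "spatial_only")
gain_over_functional = gap_in_points("MomaGraph-R1", "functional_only")
gain_over_sft = gap_in_points("MomaGraph-R1", "SFT")
```

These differences are absolute (percentage points), not relative improvements.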

Reward-weight sensitivity analyses demonstrate stability within ±3.5% accuracy across hyperparameter adjustments, supporting reproducibility and robustness.

6. Insights, Limitations, and Applications

Key insights from MomaGraph-Bench include:

Explicit Intermediate Representations: The explicit incorporation of task-oriented scene graphs as an intermediate step (Graph-then-Plan) yields consistent improvements in both closed- and open-source VLMs.
Graph Structure: Unified spatial–functional scene graphs are disproportionately beneficial for multi-step and complex manipulation tasks relative to single-relation graphs.
Training Paradigms: Reinforcement learning that directly aligns graph outputs with downstream planning utility significantly outperforms SFT and ICL.
Limitations of Current VLMs: The benchmark reveals that most vision–language models degrade sharply on Tier 4 (dynamic verification, flexible re-planning) tasks, highlighting persistent generalization bottlenecks.
Broad Applicability: As a diagnostic tool, MomaGraph-Bench supports evaluation of embodied navigation, object manipulation (via affordance and precondition reasoning), multi-view perception, and long-horizon planning.

A plausible implication is that widespread adoption of task-aligned graph intermediates could influence both robotics and multimodal AI, suggesting fruitful areas for further research in scalable graph construction, multi-modal fusion, and robust plan execution (Ju et al., 18 Dec 2025).

7. Future Directions and Research Opportunities

MomaGraph-Bench establishes a principled standard for evaluating scene graph-based planning but also highlights several open challenges:

Scalability: Extending annotations and reasoning complexity to larger and more diverse household environments.
Generalization: Robustness to highly novel scenes, object configurations, and manipulator types.
Multi-Modal Expansion: Incorporating modalities such as audio or haptics into scene graph structure and associated VQA.
Improved Cross-Modal Reasoning: Deeper integration of spatial, functional, and temporal information during policy rollout.

This suggests an ongoing research agenda centered on improving multimodal scene understanding, advancing RL-based VLM training strategies, and bridging the gap from simulated benchmarks to reliable physical robot deployments.

References (1)

MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning (2025)
