MomaGraph: Unified Scene Graphs for Embodied Tasks

Updated 22 December 2025
  • MomaGraph is a unified, dynamic scene graph formalism that models both spatial and functional affordances for embodied agents in domestic environments.
  • It leverages a large-scale, richly annotated dataset of household scenes with detailed spatial and state annotations to support zero-shot and real-robot transfer.
  • Its comprehensive evaluation suite and RL-enhanced vision-language model demonstrate improved planning accuracy and actionable reasoning in complex manipulation tasks.

MomaGraph is a unified, dynamically updated scene graph formalism and associated computational framework for representing, learning, and reasoning about household environments and manipulation tasks by embodied agents. It advances the field of embodied task planning by integrating spatial and functional affordances, supporting dynamic state updates, and enabling vision-LLMs to predict and reason over richly structured, actionable scene representations. Grounded in a large, rigorously annotated dataset and a comprehensive benchmark suite, MomaGraph supports zero-shot generalization and real-robot transfer for mobile manipulators operating in complex domestic spaces (Ju et al., 18 Dec 2025).

1. Unified Scene Graph Formalism

The foundational object in MomaGraph is the task-conditioned, dynamic scene graph

\mathcal{G}_{\mathcal{T}}^{(t)} = \big(\mathcal{V}^{(t)},\ \mathcal{E}_s^{(t)},\ \mathcal{E}_f^{(t)},\ \mathcal{S}^{(t)}\big)

where:

  • $\mathcal{V}^{(t)}$ is the set of nodes at time $t$, representing objects or part-level interactive elements (handles, knobs, buttons).
  • $\mathcal{E}_s^{(t)} \subset \mathcal{V}^{(t)} \times \mathcal{V}^{(t)}$ is the set of directed edges encoding nine spatial predicates: $\mathtt{LEFT\_OF}$, $\mathtt{IN\_FRONT\_OF}$, $\mathtt{CLOSE}$, etc.
  • $\mathcal{E}_f^{(t)} \subset \mathcal{V}^{(t)} \times \mathcal{V}^{(t)}$ is the set of directed functional edges, with six supported types: $\mathtt{OPEN\_OR\_CLOSE}$, $\mathtt{ADJUST}$, $\mathtt{CONTROL}$, $\mathtt{ACTIVATE}$, $\mathtt{POWER\_BY}$, $\mathtt{PAIR\_WITH}$.
  • $\mathcal{S}^{(t)}$ attaches state variables (on/off, open/closed, numerical parameters) to the relevant nodes.
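
A minimal sketch of how such a task-conditioned graph could be represented in code is shown below. This is an illustrative data structure only; the class and field names (Node, Edge, MomaGraph, etc.) are assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Union


class SpatialRel(Enum):
    LEFT_OF = "LEFT_OF"
    IN_FRONT_OF = "IN_FRONT_OF"
    CLOSE = "CLOSE"
    # ... the remaining spatial predicates (nine in total per the paper)


class FunctionalRel(Enum):
    OPEN_OR_CLOSE = "OPEN_OR_CLOSE"
    ADJUST = "ADJUST"
    CONTROL = "CONTROL"
    ACTIVATE = "ACTIVATE"
    POWER_BY = "POWER_BY"
    PAIR_WITH = "PAIR_WITH"


@dataclass
class Node:
    node_id: str                                # e.g. "stove_knob_2" (part-level element)
    category: str                               # object or part category
    state: Dict[str, Union[bool, float]] = field(default_factory=dict)  # entries of S^(t)


@dataclass
class Edge:
    src: str                                    # source node id
    dst: str                                    # target node id
    relation: Union[SpatialRel, FunctionalRel]


@dataclass
class MomaGraph:
    """Task-conditioned scene graph G_T^(t) = (V, E_s, E_f, S)."""
    nodes: Dict[str, Node]                      # V^(t), keyed by node id
    spatial_edges: List[Edge]                   # E_s^(t)
    functional_edges: List[Edge]                # E_f^(t)
```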

Dynamics are explicitly modeled: after action $a_t$, with new state $s_{t+1}$, the update operator

\mathcal{G}_{\mathcal{T}}^{(t+1)} = \mathcal{U}\big(\mathcal{G}_{\mathcal{T}}^{(t)},\ a_t,\ s_{t+1}\big)

performs edge consolidation/pruning and updates node states according to observed effects (e.g., which burner is actually lit after knob manipulation).
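
Continuing the hypothetical sketch above, the update operator $\mathcal{U}$ might be approximated as follows. The observation format and the pruning rule are illustrative assumptions; the paper's operator may apply richer consolidation logic.

```python
from typing import Dict, Union


def update_graph(graph: MomaGraph,
                 action: Dict[str, str],
                 observed_states: Dict[str, Dict[str, Union[bool, float]]]) -> MomaGraph:
    """U(G^(t), a_t, s_{t+1}): refresh node states and consolidate/prune edges."""
    # 1. Overwrite node states with the observed post-action states, e.g. which
    #    burner is actually lit after the knob manipulation encoded in `action`.
    for node_id, new_state in observed_states.items():
        if node_id in graph.nodes:
            graph.nodes[node_id].state.update(new_state)

    # 2. Prune edges whose endpoints are no longer present (illustrative rule;
    #    the action could also gate which relations are re-estimated).
    graph.spatial_edges = [e for e in graph.spatial_edges
                           if e.src in graph.nodes and e.dst in graph.nodes]
    graph.functional_edges = [e for e in graph.functional_edges
                              if e.src in graph.nodes and e.dst in graph.nodes]
    return graph
```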

Significance: Unlike prior scene graphs, which treat arrangements as static and separate spatial from functional roles, MomaGraph encodes both, including part-level affordances and explicit state transitions, thereby supporting robust embodied planning (Ju et al., 18 Dec 2025).

2. MomaGraph-Scenes: Scale, Annotation, and Grounding

MomaGraph-Scenes is the first large-scale dataset of task-annotated, richly grounded scene graphs in real and synthetic household environments. It comprises approximately 1,050 task-oriented subgraphs, 6,278 multi-view RGB images, over 350 distinct household scenes across kitchens, living rooms, bedrooms, and bathrooms, and 93 natural-language task instructions.

Each instance features:

  • Nodes for required objects and part-level affordances (as determined by the instruction).
  • Directed edges realizing the predefined set of nine spatial and six functional relationships.
  • State annotations: binary/continuous values (on/off, open/closed, parameterized) on nodes/edges.
  • 3–8 multi-view image links per subgraph for precise grounding.
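
For concreteness, a single annotated instance might look roughly like the following. The field names and values are hypothetical; the actual MomaGraph-Scenes schema is not reproduced here.

```python
example_instance = {
    "instruction": "Boil water on the rear-left burner.",
    "nodes": [
        {"id": "kettle_1",     "category": "kettle",     "state": {"filled": True}},
        {"id": "stove_knob_2", "category": "stove_knob", "state": {"on": False}},
        {"id": "burner_2",     "category": "burner",     "state": {"lit": False}},
    ],
    "spatial_edges": [
        {"src": "kettle_1", "dst": "burner_2", "relation": "CLOSE"},
    ],
    "functional_edges": [
        {"src": "stove_knob_2", "dst": "burner_2", "relation": "CONTROL"},
    ],
    "images": ["view_00.jpg", "view_01.jpg", "view_02.jpg"],  # 3-8 views per subgraph
}
```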

Significance: Existing corpora lack such integration of actionable affordance, state, and spatial/functional relationships at part-level fidelity, preventing prior research from benchmarking embodied task reasoning at sufficient granularity (Ju et al., 18 Dec 2025).

3. MomaGraph-Bench: Comprehensive Evaluation

MomaGraph-Bench is a systematic evaluation suite designed to probe six core capabilities required of embodied reasoning agents, each formulated as multi-choice VQA:

  1. Action Sequence Reasoning: Step ordering and dependencies.
  2. Spatial Reasoning: Reachability and spatial arrangement queries.
  3. Object Affordance Reasoning: Determination of actionable parts (turn/open/pair).
  4. Precondition–Effect Reasoning: Causal relationships, necessary preconditions, and side effects.
  5. Goal Decomposition: Task sub-goaling and hierarchical planning.
  6. Visual Correspondence: Maintenance of object identity across multi-view inputs.

The protocol uses 294 scenes, 1,446 images, 352 graphs, and 1,315 questions. Accuracy is measured in two modes: direct plan (answering with no graph intermediate) and graph-then-plan (explicit scene graph generation followed by reasoning).
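
The two modes can be pictured as two prompting pipelines around the same model. The sketch below is schematic; `vlm.generate` and the prompt wording are assumptions, not the benchmark's actual harness.

```python
def direct_plan(vlm, images, question, choices):
    """Direct-plan mode: answer the multiple-choice question straight from the views."""
    prompt = f"Question: {question}\nChoices: {choices}\nAnswer with one choice."
    return vlm.generate(images=images, prompt=prompt)


def graph_then_plan(vlm, images, question, choices):
    """Graph-then-plan mode: first predict a scene graph, then reason over it."""
    graph_text = vlm.generate(
        images=images,
        prompt="List the task-relevant objects/parts, their states, and their "
               "spatial and functional relations as a scene graph.",
    )
    prompt = (f"Scene graph:\n{graph_text}\n\n"
              f"Question: {question}\nChoices: {choices}\nAnswer with one choice.")
    return vlm.generate(images=images, prompt=prompt)
```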

Significance: The inclusion of part-level visual correspondence, precondition-effect chains, and action-plan decomposition benchmarks the reasoning chain from perception through to high-level planning, with diagnostic task tiers by complexity (Ju et al., 18 Dec 2025).

4. MomaGraph-R1: Vision-LLM, Training, and Optimization

MomaGraph-R1 is a 7B-parameter vision-LLM (VLM) built upon Qwen2.5-VL-7B-Instruct, comprising:

  • A vision encoder inherited from the Qwen2.5-VL suite (ViT-based).
  • A 32-layer, 4,096-hidden, 32-head transformer decoder.

Training objectives:

  1. Supervised fine-tuning ($\mathcal{L}_{\text{sup}}$) on ground-truth scene graphs via cross-entropy.
  2. Reinforcement learning with DAPO, maximizing a structured graph-alignment reward $R(\mathcal{G}^{\mathrm{pred}}, \mathcal{G}^{\mathrm{gt}})$: $\mathcal{L}_{\mathrm{RL}} = -\mathbb{E}_{\pi}\big[R(\mathcal{G}^{\mathrm{pred}}, \mathcal{G}^{\mathrm{gt}})\big]$.
  3. Combined training objective: $\mathcal{L} = \mathcal{L}_{\text{sup}} + \lambda\,\mathcal{L}_{\mathrm{RL}}$, where $\lambda$ controls the trade-off between supervised graph reconstruction and reward-driven policy refinement.
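
As a rough illustration, the graph-alignment reward could be computed as an edge-matching score combined with state agreement, and folded into the combined objective. This is a minimal sketch under assumed definitions (edge F1, equal weighting, a placeholder $\lambda$); the paper's exact reward terms and the DAPO machinery are not reproduced here.

```python
def graph_alignment_reward(pred_edges: set, gt_edges: set,
                           pred_states: dict, gt_states: dict) -> float:
    """Illustrative reward: F1 over predicted edges plus state-match accuracy."""
    tp = len(pred_edges & gt_edges)
    precision = tp / max(len(pred_edges), 1)
    recall = tp / max(len(gt_edges), 1)
    edge_f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    state_acc = (sum(pred_states.get(k) == v for k, v in gt_states.items())
                 / max(len(gt_states), 1))
    return 0.5 * edge_f1 + 0.5 * state_acc   # assumed equal weighting


def combined_loss(l_sup: float, l_rl: float, lam: float = 0.5) -> float:
    """L = L_sup + lambda * L_RL (the lambda value here is an assumption)."""
    return l_sup + lam * l_rl
```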

Significance: RL-driven graph alignment substantially improves task performance; ablation studies show a 7.7-point gain over pure SFT. MomaGraph-R1 directly predicts actionable scene graphs from vision and language inputs, enabling effective zero-shot and real-robot transfer (Ju et al., 18 Dec 2025).

5. Results: Accuracy, Ablations, and Generalization

MomaGraph-R1 attains 71.6% accuracy on MomaGraph-Bench (graph-then-plan), outperforming the best open-source baseline (Qwen2.5-VL-7B with graphs, 60.2%) by 11.4 percentage points. This matches GPT-5 (71.6%) and narrowly trails Claude-4.5 (73.9%) among state-of-the-art closed models. Task-complexity breakdown for graph-then-plan mode:

Tier     Description              Accuracy (%)
Tier 1   Simple                   76.4
Tier 2   Spatial + affordance     71.9
Tier 3   Precondition             70.1
Tier 4   Long-horizon             68.1

Ablations demonstrate the necessity of both spatial and functional edges: spatial-only graphs yield 59.9%, functional-only graphs 64.9%, while the full unified graph reaches 71.6%. The SFT-only baseline achieves 63.9%, with RL contributing +7.7 points.

On visual correspondence (BLINK benchmark), MomaGraph-R1 reaches 63.5%, versus 59.7% for the best open-source baseline.

Zero-shot transfer: BLINK (63.5%) and MomaGraph-Bench without graphs (65.1%). Real-robot experiments (RobotEra Q5 with a RealSense D455) show 80% graph-generation accuracy, 87.5% planning accuracy, and 70% end-to-end success across four household tasks in 10 trials.

Significance: These results establish MomaGraph as a leading open-source approach for embodied reasoning with strong generalization and transfer properties (Ju et al., 18 Dec 2025).

6. Implementation, Practical Deployment, and Future Directions

Training is executed via DAPO on 8×80 GB A100 GPUs for 13 hours (25 epochs, 175 steps), with AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay 0.01), learning rate $1\mathrm{e}{-6}$, batch sizes: actor 128, critic 256. KL penalties follow DAPO defaults, and reward-weight sensitivity analysis indicates a ±2–3% range in accuracy across settings.
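
The reported optimizer settings translate into a standard PyTorch configuration such as the sketch below (model, data, and the DAPO loop itself are omitted; only the hyperparameters stated above are taken from the paper).

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # AdamW with beta1=0.9, beta2=0.999, weight decay 0.01, learning rate 1e-6.
    return torch.optim.AdamW(
        model.parameters(),
        lr=1e-6,
        betas=(0.9, 0.999),
        weight_decay=0.01,
    )

ACTOR_BATCH_SIZE = 128    # reported actor batch size
CRITIC_BATCH_SIZE = 256   # reported critic batch size
NUM_EPOCHS = 25           # ~175 optimization steps total, per the reported schedule
```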

On physical robots, MomaGraph-R1 demonstrates high task success with active multi-view perception in both local and remote operation.

A plausible implication is that future research may extend the MomaGraph paradigm to additional affordance types, multi-agent settings, or incorporate goal-conditioned graph manipulation in more diverse environments, with further integration of part-level scene graph morphologies inspired by mathematical morphology frameworks (Najman et al., 2014).

7. Comparative Context and Theoretical Connections

MomaGraph’s architectural unification of spatial and functional affordances at the part level differentiates it from static, object-level scene graph representations in previous planning literature. The formal dynamic update operator and explicit state variables address the need for temporally consistent, actionable environment models for manipulation and navigation.

Its benchmark design spans task decomposition, causal reasoning, and perception across views, aligning with the growing focus on graph-indexed moment problems and their reflection-positive functionals in graph limit theory (Lovász et al., 2010).

Finally, the MomaGraph formalism supports new lines of research in graph-based mathematical morphology by enabling morphological reasoning (e.g., path-based openings, component analysis) over part-enriched, actionable scene graphs, which suggests further applicability at the interface of computational vision, robotics, and structure-aware graph learning (Najman et al., 2014).
