
MomaGraph-Scenes: Task-Driven Scene Graphs

Updated 22 December 2025
  • MomaGraph-Scenes is a large-scale, task-driven dataset featuring directed scene graphs that integrate spatial, functional, and state-aware annotations.
  • The annotation schema includes 93 object categories, detailed part-level affordances, and both spatial and functional relation taxonomies.
  • The dataset underpins embodied AI applications by enabling dynamic graph updates for task planning, robotic manipulation, and vision-language reasoning.

MomaGraph-Scenes constitutes the first large-scale, richly annotated dataset of task-driven scene graphs designed to support embodied manipulation and navigation in household environments. It is introduced as part of the MomaGraph project, which advances unified scene graph representations integrating spatial, functional, and actionable part-level relations for robotics and vision-language modeling in service of real-world task planning and execution (Ju et al., 18 Dec 2025).

1. Unified Scene Graph Representation

The scene representation underlying MomaGraph-Scenes formalizes each task-oriented, instruction-conditioned environment as a directed attributed graph

G = (V, E_s, E_f, A)

where:

  • V = \{v_1, \dots, v_N\}: nodes representing object instances and interactive parts (e.g., handles, knobs, buttons) relevant to a specified instruction.
  • E_s \subseteq V \times V: spatial edges encoding geometric relationships (such as LEFT_OF, IN_FRONT_OF, TOUCHING).
  • E_f \subseteq V \times V: functional edges encoding how one node can act to change the state of another (e.g., a handle OPEN_OR_CLOSEs a door).
  • A = \{A_v \mid v \in V\} \cup \{A_e \mid e \in E_s \cup E_f\}: node and edge attributes, including object category, part type, state (e.g., open/closed), and affordance.

This unified representation distinguishes itself by integrating both spatial and functional relations, associating part-level actionable elements, and capturing state variables explicitly. The node attributes A_v detail object category, part type (where applicable), and state variables (e.g., open/closed, filled/empty). Edge attributes A_e encode relation type and optionally part-level affordances (e.g., graspable, rotatable).
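
As a concrete illustration, the following Python sketch shows one way the directed attributed graph G = (V, E_s, E_f, A) could be held in memory. The class and field names (Node, Edge, SceneGraph, object_category, and so on) are illustrative assumptions for exposition, not the dataset's released API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Illustrative containers for G = (V, E_s, E_f, A); all names are assumptions.

@dataclass
class Node:
    node_id: str                      # e.g., "fridge_handle_01"
    object_category: str              # one of the 93 object categories, e.g., "fridge"
    part_type: Optional[str] = None   # e.g., "handle"; None for whole objects
    state: Dict[str, str] = field(default_factory=dict)    # e.g., {"door": "closed"}
    affordances: List[str] = field(default_factory=list)   # e.g., ["graspablePartOf"]

@dataclass
class Edge:
    source: str    # node_id of the source node
    target: str    # node_id of the target node
    relation: str  # e.g., "TOUCHING" (spatial) or "OPEN_OR_CLOSE" (functional)

@dataclass
class SceneGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)        # V with attributes A_v
    spatial_edges: List[Edge] = field(default_factory=list)     # E_s with attributes A_e
    functional_edges: List[Edge] = field(default_factory=list)  # E_f with attributes A_e
```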

2. Annotation Schema and Relation Taxonomy

MomaGraph-Scenes employs a comprehensive annotation schema comprising:

  • Objects: 93 categories including fridge, microwave, faucet, light_switch, cabinet, stove, etc.
  • Parts: Handles, knobs, buttons, drains, remote_controls, triggers, among others.
  • Node attributes: Every node is labeled with object_category, part_type (or None), and relevant state or affordance.
  • Spatial Relations (E_s): LEFT_OF, RIGHT_OF, IN_FRONT_OF, BEHIND, HIGHER_THAN, LOWER_THAN (directional), as well as CLOSE, FAR, TOUCHING (distance-based).
  • Functional Relations (E_f): OPEN_OR_CLOSE, ADJUST (parameter adjustment), CONTROL, ACTIVATE, POWER_BY, PAIR_WITH.
  • Interactive Affordances: Including graspablePartOf, rotatablePartOf, pushablePartOf, pullablePartOf.

Example annotated relations:

  • ("cabinet_handle" —OPEN_OR_CLOSE→ "cabinet_body", spatial={TOUCHING, IN_FRONT_OF})
  • ("remote_control" —CONTROL→ "TV", spatial={LOWER_THAN, CLOSE})
  • ("stove_knob" —ADJUST→ "burner", spatial={TOUCHING})

This taxonomy supports the explicit grounding of actionable parts and distinguishes contextually relevant spatial and functional connections.
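
For concreteness, the sketch below encodes two of the example relations with the illustrative SceneGraph classes from Section 1. Node identifiers, state values, and the specific affordance labels attached to each part are made up for exposition.

```python
# Encoding example annotated relations with the illustrative SceneGraph classes.
g = SceneGraph()

# "cabinet_handle" --OPEN_OR_CLOSE--> "cabinet_body", spatial = {TOUCHING, IN_FRONT_OF}
g.nodes["cabinet_handle"] = Node("cabinet_handle", "cabinet", part_type="handle",
                                 affordances=["graspablePartOf"])   # affordance value assumed
g.nodes["cabinet_body"] = Node("cabinet_body", "cabinet", state={"door": "closed"})
g.functional_edges.append(Edge("cabinet_handle", "cabinet_body", "OPEN_OR_CLOSE"))
g.spatial_edges.append(Edge("cabinet_handle", "cabinet_body", "TOUCHING"))
g.spatial_edges.append(Edge("cabinet_handle", "cabinet_body", "IN_FRONT_OF"))

# "stove_knob" --ADJUST--> "burner", spatial = {TOUCHING}
g.nodes["stove_knob"] = Node("stove_knob", "stove", part_type="knob",
                             affordances=["rotatablePartOf"])       # affordance value assumed
g.nodes["burner"] = Node("burner", "stove", state={"power": "off"})
g.functional_edges.append(Edge("stove_knob", "burner", "ADJUST"))
g.spatial_edges.append(Edge("stove_knob", "burner", "TOUCHING"))
```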

3. Dataset Construction and Statistics

MomaGraph-Scenes was constructed from over 350 distinct household scenes, including both real environments and AI2-THOR simulations, with additional scenes obtained by re-annotating public datasets. Each scene is captured in an average of 18 multi-view RGB images, yielding a total of 6278 images. The dataset comprises 1050 task-oriented subgraphs, each representing the subset of the environment directly relevant to a natural-language instruction.

Dataset Statistics

Aspect                 Value          Avg. per Scene
Distinct Scenes        350            –
Multi-view Images      6278           18
Subgraphs              1050           3.0
Nodes per Subgraph     –              3.2
Edges per Subgraph     –              3.8
Task Instructions      93             –
Train/Val/Test Split   840/105/105    –

The room-type distribution is Kitchen 36%, Living Room 28%, Bedroom 18%, and Bathroom 18%. Scene graphs are directly linked to 93 natural-language instruction types, such as "Open the fridge" or "Turn on the television," and subgraphs retain only the nodes and edges necessary for the task.
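
As a quick consistency check, the per-scene averages in the table follow directly from the reported totals:

```python
# Sanity check: per-scene averages derived from the reported dataset totals.
num_scenes, num_images, num_subgraphs = 350, 6278, 1050
print(round(num_images / num_scenes))    # ~18 multi-view images per scene (6278 / 350)
print(num_subgraphs / num_scenes)        # 3.0 task-oriented subgraphs per scene
print(840 + 105 + 105 == num_subgraphs)  # True: the split covers all 1050 subgraphs
```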

4. Task-Driven Dynamic Annotations

A distinctive attribute of MomaGraph-Scenes is the task-driven nature of its subgraphs. For each instruction, a subset V_T \subseteq V and associated edge sets E_s^T, E_f^T are selected based on the requirements of the intended action. Part-level elements are included when physically relevant (e.g., fridge_handle for "Open the fridge"). Dynamic updates are supported, where

G_T^{(t+1)} = \mathcal{U}\bigl(G_T^{(t)}, a_t, s_{t+1}\bigr)

with action a_t and new observation s_{t+1} revising the subgraph to reflect transitioned states, prune inconsistent hypotheses, and solidify confirmed relations. For example, interacting with stove_knob and observing that only a specific burner ignites results in state refinement through a graph update.
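
A minimal sketch of what such an update operator might look like is given below, building on the illustrative SceneGraph classes from Section 1. The action and observation formats are assumptions (an action names the node it manipulates; an observation maps node identifiers to newly observed states), and the pruning rule mirrors the stove_knob example rather than reproducing the paper's exact procedure.

```python
from copy import deepcopy

def update_subgraph(graph, action, observation):
    """Illustrative sketch of U(G_T^(t), a_t, s_{t+1}); data formats are assumptions."""
    new_graph = deepcopy(graph)

    # 1. Commit observed state transitions (e.g., a burner now reads "on").
    for node_id, observed_state in observation.items():
        if node_id in new_graph.nodes:
            new_graph.nodes[node_id].state.update(observed_state)

    # 2. Prune functional hypotheses contradicted by the observation: candidate
    #    targets of the acted-on node that showed no state change are dropped.
    acted_node = action["node_id"]
    confirmed_targets = set(observation.keys())
    new_graph.functional_edges = [
        e for e in new_graph.functional_edges
        if e.source != acted_node or e.target in confirmed_targets
    ]
    return new_graph
```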

Illustrative graphs include:

  • Television:

V = \{v_1 = \text{remote},\ v_2 = \text{TV}\},\quad E_f = \{v_1 \xrightarrow{\text{CONTROL}} v_2\},\quad E_s = \{v_1 \xrightarrow{\{\text{LOWER\_THAN},\,\text{IN\_FRONT\_OF},\,\text{CLOSE}\}} v_2\}

  • Fridge:

V = \{v_1 = \text{fridge\_body},\ v_2 = \text{fridge\_handle}\},\quad E_f = \{v_2 \xrightarrow{\text{OPEN\_OR\_CLOSE}} v_1\},\quad E_s = \{v_2 \xrightarrow{\{\text{TOUCHING},\,\text{GRASPABLE\_PART\_OF}\}} v_1\}
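
Expressed with the illustrative classes from Section 1, the fridge example could be written as follows (GRASPABLE_PART_OF is kept on a spatial edge, as in the graph above; state values are assumed):

```python
# "Open the fridge" subgraph, using the illustrative SceneGraph classes.
fridge = SceneGraph()
fridge.nodes["fridge_body"] = Node("fridge_body", "fridge", state={"door": "closed"})
fridge.nodes["fridge_handle"] = Node("fridge_handle", "fridge", part_type="handle",
                                     affordances=["graspablePartOf"])
fridge.functional_edges.append(Edge("fridge_handle", "fridge_body", "OPEN_OR_CLOSE"))
fridge.spatial_edges.append(Edge("fridge_handle", "fridge_body", "TOUCHING"))
fridge.spatial_edges.append(Edge("fridge_handle", "fridge_body", "GRASPABLE_PART_OF"))
```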

5. Benchmark Suite: MomaGraph-Bench

MomaGraph-Bench is a systematic evaluation framework paired with MomaGraph-Scenes, containing 294 scenes, 1446 images, 352 scene graphs, and 1315 benchmark instances formulated as multiple-choice Visual Question Answering (VQA) tasks. The evaluation metric is accuracy.

Six core reasoning capabilities are evaluated:

  1. Action Sequence Reasoning: e.g., "What is the next step after removing the mug to fill it?"
  2. Spatial Reasoning: e.g., "Which knob is to the left of the burner?"
  3. Object Affordance Reasoning: e.g., "Which object can you rotate to increase the temperature?"
  4. Precondition/Effect Reasoning: e.g., "What must you do before opening the oven?"
  5. Goal Decomposition: e.g., "Which two sub-goals can be executed in parallel to make coffee?"
  6. Visual Correspondence: e.g., "Which part in view A corresponds to the button in view B?"

Evaluation tasks cover dynamic replanning and long-horizon decomposition, with four tiers of increasing difficulty: single-step manipulation, two complementary steps, multi-step/precondition, and complex dynamic replanning.
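
Since each benchmark instance is a multiple-choice VQA item scored by accuracy, evaluation reduces to comparing a model's selected option index with the annotated answer. The sketch below assumes a hypothetical record layout (question, choices, answer, capability); it is not the official benchmark format.

```python
# Minimal accuracy scoring over multiple-choice VQA instances (hypothetical record format).

def accuracy(instances, predict):
    """`predict` maps an instance to the index of the chosen answer option."""
    correct = sum(1 for inst in instances if predict(inst) == inst["answer"])
    return correct / len(instances)

example_instances = [
    {
        "capability": "Object Affordance Reasoning",
        "question": "Which object can you rotate to increase the temperature?",
        "choices": ["stove_knob", "cabinet_handle", "light_switch", "faucet"],
        "answer": 0,
    },
]
print(accuracy(example_instances, lambda inst: 0))  # 1.0 for this toy predictor
```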

6. Distinctive Properties and Usage

MomaGraph-Scenes exhibits several distinguishing properties relative to prior scene graph datasets:

  • Simultaneous spatial and functional relations unifying object-level and part-level elements.
  • Rich annotation schema encompassing 9 spatial relation types, 6 functional relation types, 93 object categories, and explicit affordance labeling.
  • Task-oriented subgraph selection for instruction-relevant, compact scene representations.
  • Dynamic graph revision mechanisms capturing state transitions and supporting interactive hypothesis resolution.
  • Systematic linkage to MomaGraph-Bench, uniquely tying structured scene graph understanding to embodied task planning and hierarchical reasoning.

Use cases encompass zero-shot robotic mobile manipulation tasks, simulation-to-real transfer on AI2-THOR and RobotEra platforms, and benchmarking of vision-language models on structured scene understanding, temporal reasoning, and generalization.

7. Research Impact and Future Directions

The introduction of MomaGraph-Scenes and MomaGraph-Bench provides a foundation for developing and evaluating vision-language models for embodied agents. The MomaGraph-R1 model trained on this dataset demonstrates state-of-the-art performance among open-source systems, with a reported 71.6% accuracy on the benchmark (an improvement of 11.4% over the prior best baseline) and effective transfer to real-robot manipulation scenarios (Ju et al., 18 Dec 2025).

A plausible implication is that comprehensive, state-aware, and dynamically updated scene graphs tailored to task specification markedly improve the capacity for zero-shot planning and adaptive manipulation in complex environments. Future dataset expansions may include more diverse environments, more fine-grained action categories, and enhanced capabilities for capturing long-range temporal dependencies and real-world state transitions.
