MomaGraph-R1: Zero-Shot Scene Graph & Task Planning
- The paper presents a seven-billion-parameter model that leverages a vision-language backbone with cross-modal fusion to predict semantically rich scene graphs.
- It employs a dual-stage training process combining supervised fine-tuning and reinforcement learning to optimize task-relevant graph predictions for embodied agents.
- The model achieves 71.6% benchmark accuracy and demonstrates robust transfer to real-world robotics for zero-shot planning in household environments.
MomaGraph-R1 is a 7-billion-parameter vision–language model designed to predict unified, task-relevant scene graphs and support zero-shot planning for embodied agents operating in household environments. Developed on the Qwen2.5-VL-7B-Instruct backbone and trained through a combination of supervised fine-tuning and reinforcement learning, MomaGraph-R1 introduces a compact yet semantically rich representation that integrates spatial, functional, and part-level affordance relationships in household scenes. The model achieves 71.6% accuracy on the MomaGraph benchmark, outperforming prior open-source models by over 11 percentage points, and demonstrates strong generalization to both public benchmarks and physical robot deployments (Ju et al., 18 Dec 2025).
1. Model Architecture
MomaGraph-R1 employs a multi-component architecture consisting of:
- Vision–Language Backbone:
- An image encoder, composed of a ResNet-like CNN and a transformer projector, converts each RGB image into 1,024-dimensional visual tokens.
- A 32-layer transformer text encoder (hidden size 4,096) processes the task instruction.
- Cross-modal fusion is achieved by inserting visual tokens into the lower layers of the text transformer via cross-attention, yielding a unified 4,096-dimensional feature space.
- Graph-Prediction Head:
- Multimodal features are passed to a causal LLM head (vocabulary ≈32,000), producing a structured JSON string that encodes the scene graph.
- The JSON schema contains a list of nodes (with unique IDs, labels, and part-level flags) and edges (each with functional and spatial relationship types, and source/target IDs); an illustrative example appears at the end of this section.
- Node representation: each node is encoded as $h_v = [\, s_v \;\|\; p_v \;\|\; b_v \,] \in \mathbb{R}^{864}$, comprising a 768-dimensional semantic embedding $s_v$, a 32-dimensional part indicator $p_v$, and a 64-dimensional bounding-box encoding $b_v$.
- Edge embedding consists of one-hot vectors for functional (6 types: OPEN_OR_CLOSE, ADJUST, CONTROL, ACTIVATE, POWER_BY, PAIR_WITH) and spatial (9 types: LEFT_OF, RIGHT_OF, IN_FRONT_OF, BEHIND, HIGHER_THAN, LOWER_THAN, CLOSE, FAR, TOUCHING) relations, linearly mapped to a 128-dimensional feature.
- Plan-Generation Head:
- Once the scene graph is generated, it is re-tokenized and combined with the instruction. The same LLM then decodes a sequence of high-level actions (e.g., "turn knob X clockwise," "press microwave door handle").
This architecture enables MomaGraph-R1 to predict compact, action-relevant scene representations and generate executable plans in a fully neural, end-to-end fashion.
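The paper summary describes the graph JSON only at the schema level, so the following is a hypothetical example consistent with the fields listed above (IDs, labels, part-level flags, functional and spatial relation types); the exact key names (`nodes`, `edges`, `is_part`, etc.) are assumptions for illustration.

```python
# Hypothetical scene-graph JSON consistent with the schema described above.
# Field names are illustrative assumptions, not the paper's exact keys.
import json

scene_graph = {
    "nodes": [
        {"id": 0, "label": "stove", "is_part": False},
        {"id": 1, "label": "knob", "is_part": True},    # part-level node
        {"id": 2, "label": "kettle", "is_part": False},
    ],
    "edges": [
        # Functional relation: the knob controls the stove burner.
        {"source": 1, "target": 0, "functional": "CONTROL", "spatial": "TOUCHING"},
        # Spatial relation only: the kettle sits in front of the stove.
        {"source": 2, "target": 0, "functional": None, "spatial": "IN_FRONT_OF"},
    ],
}

# The model emits this structure as a JSON string, which downstream code parses.
print(json.dumps(scene_graph, indent=2))
```

Keeping the graph this compact is what allows it to be re-tokenized and fed back to the same LM head for planning.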
2. Training Procedures
Training proceeds in two stages:
- Supervised Fine-Tuning (SFT):
The model is trained on approximately 1,050 annotated graphs from the MomaGraph-Scenes dataset. The loss function is a standard token-level cross-entropy over the ground-truth graph JSON sequence $y = (y_1, \dots, y_T)$, conditioned on the images $I$ and instruction $x$:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t}, I, x\right)$$
- Reinforcement Learning (DAPO-style Policy Gradient):
Post-SFT, the model is further optimized using a reward-driven RL objective. For each sampled candidate graph $\hat{G} \sim \pi_\theta(\cdot \mid I, x)$ and ground truth $G^{*}$, the reward combines action agreement, node and edge set overlaps, format correctness, and brevity (an illustrative sketch of one possible instantiation appears at the end of this section):

$$R(\hat{G}, G^{*}) = w_{\text{act}} R_{\text{act}} + w_{\text{node}} R_{\text{node}} + w_{\text{edge}} R_{\text{edge}} + w_{\text{fmt}} R_{\text{fmt}} + w_{\text{brev}} R_{\text{brev}}$$

The RL loss is a baseline-subtracted policy gradient with a KL penalty toward the SFT policy:

$$\mathcal{L}_{\text{RL}} = -\,\mathbb{E}_{\hat{G} \sim \pi_\theta}\!\left[\bigl(R(\hat{G}, G^{*}) - b(I, x)\bigr)\,\log \pi_\theta(\hat{G} \mid I, x)\right] + \beta\, D_{\text{KL}}\!\bigl(\pi_\theta \,\|\, \pi_{\text{SFT}}\bigr)$$

where $b(I, x)$ is a learned value-function baseline and $\beta$ is the KL-penalty coefficient.
This two-phase procedure first instills precise output structure and semantics, then further aligns generation to task-relevant, compact graphs via reward shaping.
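The paper names the reward components but the summary above does not give their exact functional form or weights; the Python sketch below shows one plausible instantiation under those assumptions, using set-level F1 for node/edge overlap. The weights, the whitespace-token brevity proxy, and the function signature are all assumptions, not the paper's formulation.

```python
# Illustrative reward combining the components named above: action agreement,
# node/edge set overlap, format correctness, and brevity.
import json

def set_f1(pred, gold):
    """F1 score between two sets of hashable items."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(pred), tp / len(gold)
    return 2 * prec * rec / (prec + rec)

def edge_key(edge, id_to_label):
    """Represent an edge by endpoint labels plus relation types, so node IDs
    need not match between candidate and ground-truth graphs."""
    return (id_to_label.get(edge["source"]), id_to_label.get(edge["target"]),
            edge.get("functional"), edge.get("spatial"))

def graph_reward(candidate_json, gold_graph, cand_actions, gold_actions,
                 max_tokens=512, w=(0.4, 0.2, 0.2, 0.1, 0.1)):
    w_act, w_node, w_edge, w_fmt, w_brev = w

    # Format correctness: the candidate must parse as valid JSON.
    try:
        cand = json.loads(candidate_json)
        r_fmt = 1.0
    except json.JSONDecodeError:
        return 0.0  # a malformed graph earns no reward

    cand_ids = {n["id"]: n["label"] for n in cand["nodes"]}
    gold_ids = {n["id"]: n["label"] for n in gold_graph["nodes"]}

    # Node and edge set overlap.
    r_node = set_f1({(n["label"], n.get("is_part", False)) for n in cand["nodes"]},
                    {(n["label"], n.get("is_part", False)) for n in gold_graph["nodes"]})
    r_edge = set_f1({edge_key(e, cand_ids) for e in cand["edges"]},
                    {edge_key(e, gold_ids) for e in gold_graph["edges"]})

    # Action agreement between the plan decoded from the candidate graph
    # and the ground-truth plan.
    r_act = set_f1(cand_actions, gold_actions)

    # Brevity: penalize graphs that exceed a token budget (whitespace tokens here).
    r_brev = max(0.0, 1.0 - len(candidate_json.split()) / max_tokens)

    return (w_act * r_act + w_node * r_node + w_edge * r_edge
            + w_fmt * r_fmt + w_brev * r_brev)
```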
3. Graph-then-Plan Inference Framework
Inference in MomaGraph-R1 follows a dual-stage process:
- Scene-Graph Generation: Given multi-view images $I$ and instruction $x$, the model predicts the most likely task-relevant scene graph via beam or greedy search:

$$\hat{G} = \arg\max_{G}\; p_\theta(G \mid I, x)$$
The resulting graph represents nodes (objects/parts), edges (spatial/functional), and part-level affordances in structured JSON format.
- Zero-Shot Task Planning: The predicted graph is encoded as text and prepended to the original instruction. The LLM then generates a sequence of high-level actions:

$$\hat{a}_{1:T} = \arg\max_{a_{1:T}}\; p_\theta\bigl(a_{1:T} \mid \hat{G}, x\bigr)$$
This process requires no additional training or domain-specific engineering; both stages utilize the same underlying LM head.
Graph updates are performed post-action via

$$G_{t+1} = \mathcal{U}\bigl(G_t, a_t, o_{t+1}\bigr),$$

where $\mathcal{U}$ prunes or modifies edges according to the observed state change (with $a_t$ the executed action and $o_{t+1}$ the new observation), maintaining a dynamically consistent scene representation after each step.
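A minimal sketch of the two-stage inference loop follows. The prompt templates and the `vlm_generate` callable are assumptions standing in for whatever decoding call wraps the model's LM head (greedy or beam search); they are not the paper's exact prompts or API.

```python
# Sketch of Graph-then-Plan inference. `vlm_generate(images, prompt) -> str`
# is an assumed stand-in for the model's decoding interface.
import json
from typing import Callable, List

def graph_then_plan(images: List[bytes], instruction: str,
                    vlm_generate: Callable[[List[bytes], str], str]) -> dict:
    # Stage 1: predict the task-relevant scene graph as a JSON string.
    graph_prompt = (
        "Given the images and the task below, output the task-relevant "
        f"scene graph as JSON.\nTask: {instruction}"
    )
    scene_graph = json.loads(vlm_generate(images, graph_prompt))

    # Stage 2: re-serialize the graph, prepend it to the instruction, and decode
    # a sequence of high-level actions with the same LM head.
    plan_prompt = (
        f"Scene graph: {json.dumps(scene_graph)}\n"
        f"Task: {instruction}\n"
        "Output the high-level action sequence."
    )
    plan = vlm_generate(images, plan_prompt)

    return {"graph": scene_graph, "plan": plan}
```

Because both stages reuse the same model, no extra planner or domain-specific module is needed at deployment time.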
4. Internal Representation and Dynamic Updates
Node representations $h_v$ combine object semantics, part indicators, and geometric bounding boxes. Edge features $e_{uv}$ are composed from spatial and functional codes. Explicit part-level nodes enable fine-grained affordance detection (e.g., “knob” vs. “stove”).
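A minimal PyTorch sketch of how features with the stated dimensions (768 + 32 + 64 for nodes; 6 + 9 one-hot relation bits mapped to 128 for edges) could be composed is shown below; the module structure and layer choices are assumptions, not the paper's code.

```python
# Sketch of node/edge feature composition with the dimensions stated above.
import torch
import torch.nn as nn

NUM_FUNCTIONAL, NUM_SPATIAL = 6, 9   # relation vocabularies listed in Section 1

class NodeEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.part_embed = nn.Embedding(2, 32)   # part-level indicator -> 32-d
        self.bbox_proj = nn.Linear(4, 64)       # (x, y, w, h) -> 64-d

    def forward(self, semantic, is_part, bbox):
        # semantic: (N, 768) label embedding; is_part: (N,) long; bbox: (N, 4)
        return torch.cat(
            [semantic, self.part_embed(is_part), self.bbox_proj(bbox)], dim=-1
        )  # (N, 864)

class EdgeEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # One-hot functional + spatial codes, linearly mapped to 128-d.
        self.proj = nn.Linear(NUM_FUNCTIONAL + NUM_SPATIAL, 128)

    def forward(self, functional_onehot, spatial_onehot):
        return self.proj(torch.cat([functional_onehot, spatial_onehot], dim=-1))
```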
Dynamic scene updates ensure that after each action and state transition, the representation remains accurate:
- Edges corresponding to realized functional effects (e.g., knob controlling burner) are retained.
- Edges contradicted by new observations are removed.
- The update operator $\mathcal{U}$ is responsible for this pruning and augmentation at each step (see the sketch below).
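The following sketch illustrates $\mathcal{U}$ under the assumption that the post-action observation can be summarized as sets of confirmed and contradicted edge keys; the representation of observations and the `realized` flag are hypothetical.

```python
# Illustrative update operator U(G_t, a_t, o_{t+1}): retain edges whose
# functional effects were realized, drop edges contradicted by the new
# observation. The observation format is an assumption for this sketch.
def update_graph(graph: dict, confirmed: set, contradicted: set) -> dict:
    def key(e):
        return (e["source"], e["target"], e.get("functional"), e.get("spatial"))

    # Remove edges contradicted by the new observation.
    kept = [e for e in graph["edges"] if key(e) not in contradicted]

    # Mark realized functional effects so the planner can treat them as done.
    for e in kept:
        if key(e) in confirmed:
            e["realized"] = True

    return {"nodes": graph["nodes"], "edges": kept}
```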
5. Empirical Performance and Benchmarking
MomaGraph-Bench Results
The model is evaluated on zero-shot multiple-choice VQA (294 scenes):
| Model | Params | Tier-1 | Tier-2 | Tier-3 | Tier-4 | Overall |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 7B | 62.1% | 58.5% | 51.9% | 56.5% | 60.2% |
| LLaVA-Onevision | 7B | 60.0% | 52.4% | 58.4% | 43.4% | 55.6% |
| MomaGraph-R1 | 7B | 76.4% | 71.9% | 70.1% | 68.1% | 71.6% |
MomaGraph-R1 achieves an 11.4-percentage-point absolute improvement in overall accuracy over its base model (Qwen2.5-VL-7B), approaching the performance levels of closed-source models.
Visual Correspondence
| Model | BLINK | MomaGraph-Corr |
|---|---|---|
| DeepSeek-VL2 | 57.4% | 68.4% |
| Qwen2.5-VL-7B | 58.7% | 72.7% |
| LLaVA-Onevision | 59.7% | 70.7% |
| MomaGraph-R1 | 63.5% | 77.5% |
Real-Robot Transfer (RobotEra Q5; 10 trials)
- Graph-generation success: 80%
- Planning success (given correct graph): 87.5%
- End-to-end task completion: 70%
These results indicate robust generalization from simulation to real-world robotic manipulation (Ju et al., 18 Dec 2025).
6. Strengths, Limitations, and Prospects
Strengths:
- Unified modeling of spatial, functional, and part-level relationships within a single, updatable scene graph.
- Reinforcement learning with graph-alignment rewards yields concise and task-relevant graphs.
- The Graph-then-Plan paradigm supports zero-shot planning and strong generalization across benchmarks and hardware platforms.
Limitations:
- Dependence on high-quality graph annotations for RL training rewards poses challenges for scaling to unlabeled datasets.
- JSON-based decoding may generate malformed graphs under severe occlusion.
- Low-level control and fine-grained closed-loop corrections remain outside the model’s planning scope.
Potential Extensions:
- Self-supervised graph refinement through learning from failed execution feedback.
- Integration of 3D point-cloud data for enhanced geometric awareness.
- End-to-end training of both high-level graph prediction and low-level motor control for fully closed-loop visuomotor grounding.
MomaGraph-R1 demonstrates that explicit prediction and updating of unified, state-aware scene graphs is an effective approach for vision-language-based embodied task planning, delivering strong zero-shot performance in both virtual and real-world task executions (Ju et al., 18 Dec 2025).