
MomaGraph-R1: Zero-Shot Scene Graph & Task Planning

Updated 22 December 2025
  • The paper presents a seven-billion parameter model that leverages a vision-language backbone with cross-modal fusion to predict semantically rich scene graphs.
  • It employs a dual-stage training process combining supervised fine-tuning and reinforcement learning to optimize task-relevant graph predictions for embodied agents.
  • The model achieves 71.6% benchmark accuracy and demonstrates robust transfer to real-world robotics for zero-shot planning in household environments.

MomaGraph-R1 is a 7-billion parameter vision–LLM designed to predict unified, task-relevant scene graphs and support zero-shot planning for embodied agents operating in household environments. Developed on the Qwen2.5-VL-7B-Instruct backbone and trained through a combination of supervised learning and reinforcement learning, MomaGraph-R1 introduces a compact yet semantically rich representation that integrates spatial, functional, and part-level affordance relationships in household scenes. The model achieves 71.6% accuracy on the MomaGraph benchmark, outperforming prior open-source models by over 11 percentage points, and demonstrates strong generalization to both public benchmarks and physical robot deployments (Ju et al., 18 Dec 2025).

1. Model Architecture

MomaGraph-R1 employs a multi-component architecture consisting of:

  • Vision–Language Backbone:
    • An image encoder, composed of a ResNet-like CNN and a transformer projector, converts each RGB image into 1,024-dimensional visual tokens.
    • A 32-layer transformer text encoder (hidden size 4,096) processes the task instruction.
    • Cross-modal fusion is achieved by inserting visual tokens into the lower layers of the text transformer via cross-attention, yielding a unified 4,096-dimensional feature space (a minimal sketch of this step follows the list).
  • Graph-Prediction Head:
    • Multimodal features are passed to a causal LLM head (vocabulary ≈32,000), producing a structured JSON string that encodes the scene graph.
    • The JSON schema contains a list of nodes (with unique IDs, labels, and part-level flags) and edges (each with functional and spatial relationship types, and source/target IDs); an illustrative example is given at the end of this section.
    • Node representation $f_n \in \mathbb{R}^{864}$, comprising a 768-dimensional semantic embedding, a 32-dimensional part indicator, and a 64-dimensional bounding-box encoding.
    • Edge embedding consists of one-hot vectors for functional (6 types: OPEN_OR_CLOSE, ADJUST, CONTROL, ACTIVATE, POWER_BY, PAIR_WITH) and spatial (9 types: LEFT_OF, RIGHT_OF, IN_FRONT_OF, BEHIND, HIGHER_THAN, LOWER_THAN, CLOSE, FAR, TOUCHING) relations, linearly mapped to a 128-dimensional feature.
  • Plan-Generation Head:
    • Once the scene graph is generated, it is re-tokenized and combined with the instruction. The same LLM then decodes a sequence of high-level actions (e.g., "turn knob X clockwise," "press microwave door handle").

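The cross-modal fusion step in the backbone can be pictured as a cross-attention layer in which text hidden states attend to projected visual tokens. The following is a minimal PyTorch sketch under the dimensions stated above; the head count, projection, and residual structure are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    """Illustrative fusion layer: text hidden states attend to visual tokens."""

    def __init__(self, text_dim=4096, vision_dim=1024, num_heads=32):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, text_dim)  # lift 1,024-d visual tokens to 4,096-d
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_hidden, visual_tokens):
        # text_hidden: (B, L_text, 4096); visual_tokens: (B, L_vis, 1024)
        v = self.vision_proj(visual_tokens)
        attended, _ = self.cross_attn(query=text_hidden, key=v, value=v)
        return self.norm(text_hidden + attended)  # residual connection + layer norm
```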
This architecture enables MomaGraph-R1 to predict compact, action-relevant scene representations and generate executable plans in a fully neural, end-to-end fashion.
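The structured output of the graph-prediction head can be illustrated with a small example. This is a sketch assuming hypothetical key names; the exact schema is defined by the authors' annotation format and is not reproduced here.

```python
import json

# Hypothetical scene graph in the spirit of the schema described above.
scene_graph = {
    "nodes": [
        {"id": 0, "label": "stove", "is_part": False},
        {"id": 1, "label": "knob",  "is_part": True},   # part-level node
        {"id": 2, "label": "pot",   "is_part": False},
    ],
    "edges": [
        # functional relation: the knob controls the burner
        {"source": 1, "target": 0, "functional": "CONTROL", "spatial": "TOUCHING"},
        # purely spatial relation between pot and stove
        {"source": 2, "target": 0, "functional": None, "spatial": "IN_FRONT_OF"},
    ],
}

print(json.dumps(scene_graph, indent=2))
```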

2. Training Procedures

Training proceeds in two stages:

  • Supervised Fine-Tuning (SFT):

The model is trained on approximately 1,050 annotated graphs from the MomaGraph-Scenes dataset. The loss is a standard token-level cross-entropy over the ground-truth graph JSON sequence:

$$L_{SFT} = -\sum_{t=1}^{T} \log P_{\theta}\big(y_t^{gt} \mid y_{<t}^{gt}, I, T\big)$$
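A minimal PyTorch sketch of this objective, assuming teacher-forced next-token logits are already available (an illustration, not the authors' training code):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Token-level cross-entropy over the ground-truth graph JSON tokens.

    logits:     (T, V) next-token logits conditioned on I, T, and y_<t (teacher forcing).
    target_ids: (T,)  ground-truth token ids y^gt_1 ... y^gt_T.
    """
    # cross_entropy applies log-softmax internally, so this computes
    # -(1/T) * sum_t log P_theta(y_t | y_<t, I, T).
    return F.cross_entropy(logits, target_ids, reduction="mean")
```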

  • Reinforcement Learning (DAPO-style Policy Gradient):

Post-SFT, the model is further optimized using a reward-driven RL objective. For each sampled candidate graph $G$ and ground truth $G^{gt}$, the reward combines action agreement, node and edge set overlaps, format correctness, and brevity:

$$R(G, G^{gt}) = w_a R_{action} + R_{edges} + R_{nodes} + w_f R_{format} + w_l R_{length}$$
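A sketch of how such a reward could be assembled; the unweighted node and edge terms follow the equation above, while the weight values and the component scoring functions here are assumptions:

```python
def graph_reward(action_score, edge_overlap, node_overlap, is_valid_json, length_ratio,
                 w_a=1.0, w_f=0.5, w_l=0.1):
    """Weighted sum R(G, G^gt) of the reward components named above (illustrative)."""
    r_action = action_score                     # agreement of downstream actions with GT
    r_edges  = edge_overlap                     # e.g., F1 between predicted and GT edge sets
    r_nodes  = node_overlap                     # e.g., F1 between predicted and GT node sets
    r_format = 1.0 if is_valid_json else 0.0    # well-formed JSON graph
    r_length = max(0.0, 1.0 - length_ratio)     # brevity bonus for compact graphs
    return w_a * r_action + r_edges + r_nodes + w_f * r_format + w_l * r_length
```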

The RL loss:

$$L_{RL} = \mathbb{E}_{G \sim \pi_{\theta}} \left[ -\big(R(G, G^{gt}) - b\big) \log \pi_{\theta}(G \mid I, T) + \beta \, \mathrm{KL}\big[\pi_{\theta} \,\|\, \pi_{ref}\big] \right]$$

where $b$ is a learned value-function baseline and $\beta \approx 0.01$ is the KL-penalty coefficient.
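A simplified single-sample sketch of this objective (REINFORCE with a baseline and a KL penalty); DAPO-specific details such as group sampling and clipping are omitted, and the interfaces are assumptions:

```python
import torch

def rl_loss(logp_graph: torch.Tensor,   # sum of log pi_theta over the sampled graph tokens
            reward: torch.Tensor,        # R(G, G^gt) for the sampled graph
            baseline: torch.Tensor,      # learned value-function baseline b
            kl_to_ref: torch.Tensor,     # estimate of KL[pi_theta || pi_ref]
            beta: float = 0.01) -> torch.Tensor:
    """Policy-gradient loss matching the L_RL expression above (illustrative)."""
    advantage = (reward - baseline).detach()  # no gradient through the advantage
    return -advantage * logp_graph + beta * kl_to_ref
```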

This two-phase procedure first instills precise output structure and semantics, then further aligns generation to task-relevant, compact graphs via reward shaping.

3. Graph-then-Plan Inference Framework

Inference in MomaGraph-R1 follows a dual-stage process:

  1. Scene-Graph Generation: Given multi-view images $\{I_i\}_{i=1}^{n}$ and instruction $T$, the model predicts the most likely task-relevant scene graph $G^*$ via beam or greedy search:

$$G^* = \operatorname*{arg\,max}_{G} P_{\theta}(G \mid \{I_i\}, T)$$

The resulting graph represents nodes (objects/parts), edges (spatial/functional), and part-level affordances in structured JSON format.

  2. Zero-Shot Task Planning: The predicted graph $G^*$ is encoded as text and prepended to the original instruction. The LLM then generates a sequence of high-level actions:

$$A^* = \operatorname*{arg\,max}_{A} P_{\theta}(A \mid G^*, T)$$

This process requires no additional training or domain-specific engineering; both stages utilize the same underlying LM head.
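A sketch of the two-stage decoding loop, assuming a Hugging Face-style processor/model interface and ad hoc prompt templates (none of these names or prompts come from the paper):

```python
import json

def graph_then_plan(model, processor, images, instruction):
    # Stage 1: decode the task-relevant scene graph as a JSON string.
    graph_prompt = f"Instruction: {instruction}\nPredict the task-relevant scene graph as JSON."
    inputs = processor(images=images, text=graph_prompt, return_tensors="pt")
    graph_json = processor.batch_decode(
        model.generate(**inputs, max_new_tokens=512), skip_special_tokens=True)[0]
    scene_graph = json.loads(graph_json)  # may raise if the decoded JSON is malformed

    # Stage 2: serialize the graph, prepend it to the instruction, and decode a plan
    # with the same underlying LM head.
    plan_prompt = f"Scene graph: {graph_json}\nInstruction: {instruction}\nPlan:"
    inputs = processor(text=plan_prompt, return_tensors="pt")
    plan_text = processor.batch_decode(
        model.generate(**inputs, max_new_tokens=256), skip_special_tokens=True)[0]
    return scene_graph, plan_text
```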

Graph updates are performed post-action via:

$$G^{(t+1)} = U\big(G^{(t)}, a_t, s_{t+1}\big),$$

where $U$ prunes or modifies edges according to the observed state change, maintaining a dynamically consistent scene representation after each step.
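A minimal sketch of such an update operator; the `contradicts` and `observe_new_edges` hooks are hypothetical stand-ins for the observation-driven rules described here and in Section 4 below:

```python
from typing import Callable, Dict, List

def update_graph(graph: Dict, action: Dict, new_state: Dict,
                 contradicts: Callable[[Dict, Dict], bool],
                 observe_new_edges: Callable[[Dict, Dict], List[Dict]]) -> Dict:
    """Illustrative U(G^(t), a_t, s_(t+1)): prune contradicted edges, add observed ones."""
    kept = [e for e in graph["edges"] if not contradicts(e, new_state)]  # retain consistent edges
    new_edges = observe_new_edges(new_state, action)                     # augment with new relations
    return {"nodes": list(graph["nodes"]), "edges": kept + new_edges}
```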

4. Internal Representation and Dynamic Updates

Node representations ($d_n = 864$) combine object semantics, part indicators, and geometric bounding boxes. Edge features ($d_e = 128$) are composed from spatial and functional codes. Explicit part-level nodes enable fine-grained affordance detection (e.g., “knob” vs. “stove”).
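A sketch of how these features could be assembled from their components; the concatenation order and the linear map `W` for edges are assumptions consistent with the dimensions above:

```python
import numpy as np

def node_feature(semantic_emb, part_indicator, bbox_encoding):
    """Concatenate a (768,) semantic embedding, (32,) part indicator, and
    (64,) bounding-box encoding into the 864-d node feature f_n."""
    return np.concatenate([semantic_emb, part_indicator, bbox_encoding])  # shape (864,)

def edge_feature(functional_onehot, spatial_onehot, W):
    """Map the concatenated (6,) functional and (9,) spatial one-hot codes to a
    128-d edge feature via an assumed learned linear map W of shape (128, 15)."""
    return W @ np.concatenate([functional_onehot, spatial_onehot])  # shape (128,)
```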

Dynamic scene updates ensure that after each action and state transition, the representation remains accurate:

  • Edges corresponding to realized functional effects (e.g., knob controlling burner) are retained.
  • Edges contradicted by new observations are removed.
  • The update operator $U(\cdot)$ is responsible for this pruning and augmentation at each step.

5. Empirical Performance and Benchmarking

MomaGraph-Bench Results

The model is evaluated on zero-shot multiple-choice VQA over 294 scenes:

| Model | Params | Tier-1 | Tier-2 | Tier-3 | Tier-4 | Overall |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 7B | 62.1% | 58.5% | 51.9% | 56.5% | 60.2% |
| LLaVA-Onevision | 7B | 60.0% | 52.4% | 58.4% | 43.4% | 55.6% |
| MomaGraph-R1 | 7B | 76.4% | 71.9% | 70.1% | 68.1% | 71.6% |

MomaGraph-R1 achieves an 11.4 percentage-point absolute improvement over its base model (Qwen2.5-VL-7B), approaching the performance of closed-source models.

Visual Correspondence

| Model | BLINK | MomaGraph-Corr |
|---|---|---|
| DeepSeek-VL2 | 57.4% | 68.4% |
| Qwen2.5-VL-7B | 58.7% | 72.7% |
| LLaVA-Onevision | 59.7% | 70.7% |
| MomaGraph-R1 | 63.5% | 77.5% |

Real-Robot Transfer (RobotEra Q5; 10 trials)

  • Graph-generation success: 80%
  • Planning success (given correct graph): 87.5%
  • End-to-end task completion: 70%

These results indicate robust generalization from simulation to real-world robotic manipulation (Ju et al., 18 Dec 2025).

6. Strengths, Limitations, and Prospects

Strengths:

  • Unified modeling of spatial, functional, and part-level relationships within a single, updatable scene graph.
  • Reinforcement learning with graph-alignment rewards yields concise and task-relevant graphs.
  • The Graph-then-Plan paradigm supports zero-shot planning and strong generalization across benchmarks and hardware platforms.

Limitations:

  • Dependence on high-quality graph annotations for RL training rewards poses challenges for scaling to unlabeled datasets.
  • JSON-based decoding may generate malformed graphs under severe occlusion.
  • Low-level control and fine-grained closed-loop corrections remain outside the model’s planning scope.

Potential Extensions:

  • Self-supervised graph refinement through learning from failed execution feedback.
  • Integration of 3D point-cloud data for enhanced geometric awareness.
  • End-to-end training of both high-level graph prediction and low-level motor control for fully closed-loop visuomotor grounding.

MomaGraph-R1 demonstrates that explicit prediction and updating of unified, state-aware scene graphs is an effective approach for vision-language-based embodied task planning, delivering strong zero-shot performance in both virtual and real-world task executions (Ju et al., 18 Dec 2025).
