MomaGraph-R1: Unified Vision–Language Planning

Updated 4 May 2026

The paper introduces MomaGraph-R1, a novel 7B model using a Graph-then-Plan paradigm for unified scene graph generation and embodied task planning.
It employs a multimodal encoder–decoder architecture with a specialized graph decoder and reinforcement learning optimization via graph-alignment rewards.
Empirical results demonstrate significant improvements in planning accuracy and scene graph prediction across synthetic benchmarks and real-robot household tasks.

MomaGraph-R1 is a 7-billion-parameter vision–language model (VLM) designed to generate unified, task-aware scene graphs and perform zero-shot embodied task planning in complex household environments. Following a “Graph-then-Plan” paradigm, it integrates spatial–functional reasoning, part-level affordance modeling, and robust vision–language inference, yielding state-of-the-art results on both synthetic benchmarks and real-robot tasks (Ju et al., 18 Dec 2025).

1. Model Architecture and Representation

MomaGraph-R1 builds upon Qwen2.5-VL-7B-Instruct, a Transformer-based multimodal encoder–decoder. The model ingests a set of RGB images $\{I_1, \ldots, I_n\}$ and a text instruction $T$, encoding images via a vision projection module that produces visual tokens. These tokens are concatenated with textual tokens and processed through shared self-attention layers.

A specialized graph decoder is prompted to emit a JSON-structured scene graph $G_T = (N_T, E_s^T, E_f^T)$ for each task $T$:

$N_T$: task-relevant objects and part-level nodes.
$E_s^T$: directed spatial edges (e.g., “in_front_of,” “touching”).
$E_f^T$: directed functional edges (e.g., “control,” “open_or_close,” “adjust,” “power_by,” “activate,” “pair_with”).
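A minimal Python sketch of how a graph with these three components might be held in memory and serialized to JSON. The class and field names (`Edge`, `SceneGraph`, `to_json`) and the example task are illustrative assumptions, not the authors' schema; only the edge labels are taken from the list above.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Edge:
    source: str    # node id the directed edge starts from
    target: str    # node id the directed edge points to
    relation: str  # e.g. "touching" (spatial) or "control" (functional)


@dataclass
class SceneGraph:
    """Hypothetical container for G_T = (N_T, E_s^T, E_f^T)."""
    task: str
    nodes: list[str] = field(default_factory=list)
    spatial_edges: list[Edge] = field(default_factory=list)
    functional_edges: list[Edge] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recursively converts nested dataclasses to plain dicts.
        return json.dumps(asdict(self), indent=2)


graph = SceneGraph(
    task="Turn on the stove",
    nodes=["stove", "stove_knob"],
    spatial_edges=[Edge("stove_knob", "stove", "touching")],
    functional_edges=[Edge("stove_knob", "stove", "control")],
)
print(graph.to_json())
```

Keeping spatial and functional edges in separate lists mirrors the $(N_T, E_s^T, E_f^T)$ decomposition; a single edge list with a type tag would work equally well.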

Each semantic edge $e$ carries both a functional label and an array of spatial relations.

Dynamic scene graph updates are supported online: at each interaction time $t$, the graph $G_T^{(t)}$ is refined by an update function that takes the executed action and the observed state transition and returns the next graph $G_T^{(t+1)}$.
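The edge structure and the online update can be illustrated with a small Python sketch. Everything here is a hypothetical stand-in: the dictionary layout, the per-node `state` field, the `history` list, and the `update_graph` function are assumptions made for illustration, not the update rule used in the paper.

```python
from copy import deepcopy

# Hypothetical edge record: one functional label plus a list of spatial relations.
edge = {
    "source": "faucet_handle",
    "target": "faucet",
    "functional": "control",
    "spatial": ["touching", "higher_than"],
}

graph_t = {
    "nodes": {"faucet": {"state": "off"}, "faucet_handle": {"state": "closed"}},
    "edges": [edge],
}


def update_graph(graph, action, observed_states):
    """Return a refined copy of `graph` after `action`.

    `observed_states` maps node ids to the state seen after the action;
    it stands in for the state transition mentioned in the text.
    """
    new_graph = deepcopy(graph)  # leave the graph at time t untouched
    for node_id, state in observed_states.items():
        if node_id in new_graph["nodes"]:
            new_graph["nodes"][node_id]["state"] = state
    new_graph.setdefault("history", []).append(action)
    return new_graph


graph_next = update_graph(
    graph_t,
    action="turn(faucet_handle)",
    observed_states={"faucet_handle": "open", "faucet": "on"},
)
print(graph_next["nodes"]["faucet"]["state"])
```

Returning a copy rather than mutating in place keeps the sequence $G_T^{(t)}, G_T^{(t+1)}, \ldots$ available for inspection.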

2. Training Protocol: Supervised and Reinforcement Learning

The MomaGraph-R1 training pipeline comprises two sequential stages:

Supervised Fine-tuning (SFT): The model is initially trained on ground-truth scene graphs from the MomaGraph-Scenes dataset.
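Reinforcement Learning with DAPO: Using a PPO-style procedure, MomaGraph-R1 is optimized for graph prediction quality with a graph-alignment reward $R(\hat{G}_T, G_T)$ that scores a predicted graph $\hat{G}_T$ against the ground-truth graph $G_T$. The reward combines the following terms:

$R_{\text{edge}}$: alignment of edge predicates between predicted and ground-truth edges.
$R_{\text{node}}$: Jaccard index of the predicted and ground-truth node sets.
$R_{\text{format}}$: JSON validity constraint.
$R_{\text{length}}$: penalty for excessive output length.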
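One way the listed reward terms could be combined is sketched below in Python. The weights, the character threshold, the flat `{"nodes": [...], "edges": [[src, predicate, dst], ...]}` layout, and the use of exact-match Jaccard as a stand-in for edge-predicate alignment are all assumptions; the source does not give the actual weighting or similarity function.

```python
import json


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A & B| / |A | B|; taken to be 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def graph_alignment_reward(output_text, gt_graph, max_chars=2000,
                           w_edge=0.5, w_node=0.3, w_format=0.2, w_len=0.1):
    """Toy reward; all weights and the length threshold are assumed values."""
    try:
        pred = json.loads(output_text)
        nodes = set(pred["nodes"])
        edges = {tuple(e) for e in pred["edges"]}
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0.0  # invalid JSON or unexpected schema earns no reward

    gt_nodes = set(gt_graph["nodes"])
    gt_edges = {tuple(e) for e in gt_graph["edges"]}

    r_node = jaccard(nodes, gt_nodes)
    r_edge = jaccard(edges, gt_edges)  # exact-match stand-in for predicate alignment
    r_format = 1.0                     # reached only if parsing succeeded
    r_length = 1.0 if len(output_text) > max_chars else 0.0
    return w_edge * r_edge + w_node * r_node + w_format * r_format - w_len * r_length


gt = {"nodes": ["stove", "knob"], "edges": [["knob", "control", "stove"]]}
perfect = json.dumps(gt)
extra_node = json.dumps({"nodes": ["stove", "knob", "pan"],
                         "edges": [["knob", "control", "stove"]]})
print(graph_alignment_reward(perfect, gt))
print(graph_alignment_reward(extra_node, gt))
print(graph_alignment_reward("not json", gt))
```

With these assumed weights a perfect prediction scores 1.0, and a hallucinated extra node lowers only the node term (Jaccard 2/3).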

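The objective is to maximize the expected reward $J(\theta) = \mathbb{E}_{\hat{G}_T \sim \pi_\theta}\left[R(\hat{G}_T, G_T)\right]$, where $\pi_\theta$ is the model's policy, $\hat{G}_T$ is a sampled graph, and $R$ is the graph-alignment reward, with PPO-style policy gradient updates:

$\nabla_\theta J(\theta) \approx \mathbb{E}\left[\hat{A} \, \nabla_\theta \log \pi_\theta(\hat{G}_T \mid I_1, \ldots, I_n, T)\right]$

where $\hat{A}$ denotes the advantage function estimated by a learned value network.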
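For readers unfamiliar with PPO-style updates, the following pure-Python sketch shows the standard clipped surrogate objective for a single sample, together with a one-step advantage computed as reward minus a value baseline. The clipping form, the `eps=0.2` default, and the one-step advantage are textbook choices assumed here for illustration; the source does not specify which variant was used.

```python
import math


def advantages(rewards, values):
    """One-step advantage estimate: reward minus the value-network baseline."""
    return [r - v for r, v in zip(rewards, values)]


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped objective for one sample (a quantity to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new / pi_old
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum removes the incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


adv = advantages(rewards=[0.9, 0.2], values=[0.5, 0.5])
print(adv)
# Probability ratio of 2 with a positive advantage is capped at 1 + eps.
print(clipped_surrogate(logp_new=0.0, logp_old=math.log(0.5), advantage=1.0))
```

In practice the surrogate is averaged over a batch of sampled outputs and maximized by gradient ascent with an autodiff framework; the scalar version above only shows the arithmetic.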

3. Graph-then-Plan Inference Paradigm

MomaGraph-R1 is deployed under a two-stage “Graph-then-Plan” inference regime:

Scene-Graph Prediction: Maximizing $p_\theta(G_T \mid I_1, \ldots, I_n, T)$, the system infers a task-oriented graph $\hat{G}_T$ in response to the current instruction and visual observations.
Zero-Shot Task Planning: Given $\hat{G}_T$ and $T$, the model or a lightweight planner head generates an action sequence $A = (a_1, \ldots, a_K)$ by sampling from $p_\theta(A \mid \hat{G}_T, T)$.

This process is typically executed with two prompts:

Stage 1: “Given images … and instruction $T$, output JSON scene graph $G_T$.”
Stage 2: “Given $G_T$ and $T$, output an ordered list of primitive actions.”

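Operationally, the pipeline calls the model twice: once to predict the scene graph from the images and the instruction, and once to plan from the predicted graph and the instruction.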
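The two-prompt procedure can be sketched in Python as follows. The `VLM` callable type, the exact prompt wording, the JSON output conventions, and the `fake_model` stub are hypothetical; in particular, passing no images in stage 2 is an assumption based on the Stage 2 prompt mentioning only the graph and the instruction.

```python
import json
from typing import Callable, Sequence

# `VLM` is a hypothetical callable: (prompt, images) -> generated text.
VLM = Callable[[str, Sequence[bytes]], str]


def graph_then_plan(model: VLM, images: Sequence[bytes], instruction: str):
    # Stage 1: predict the task-oriented scene graph as JSON.
    graph_prompt = (
        f"Given the images and instruction '{instruction}', "
        "output a JSON scene graph."
    )
    graph = json.loads(model(graph_prompt, images))

    # Stage 2: plan from the predicted graph and the instruction (no images here).
    plan_prompt = (
        f"Given the scene graph {json.dumps(graph)} and instruction "
        f"'{instruction}', output an ordered JSON list of primitive actions."
    )
    actions = json.loads(model(plan_prompt, []))
    return graph, actions


def fake_model(prompt: str, images: Sequence[bytes]) -> str:
    """Stub standing in for the VLM so the sketch runs offline."""
    if "output a JSON scene graph" in prompt:
        return json.dumps({"nodes": ["lamp", "switch"],
                           "edges": [["switch", "control", "lamp"]]})
    return json.dumps(["navigate_to(switch)", "press(switch)"])


graph, actions = graph_then_plan(fake_model, [b"img0"], "Turn on the lamp")
print(graph)
print(actions)
```

Because the plan is conditioned on the serialized graph rather than on raw pixels, errors in stage 1 propagate to stage 2, which is why graph quality is the target of the reinforcement learning stage.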

4. Datasets and Annotation Schema

MomaGraph-Scenes is the foundational dataset for model training and evaluation. It incorporates:

Approximately 1,050 task-driven subgraphs derived from 6,278 multi-view RGB frames.
350+ real and simulated indoor household scenes.
93 naturally phrased task instructions (e.g., “Fill the bathtub”), some of which are under-specified.
Nodes encompass objects and part-level elements (handles, knobs, switches).
Edges include 9 spatial types (LEFT_OF, RIGHT_OF, IN_FRONT_OF, BEHIND, HIGHER_THAN, LOWER_THAN, CLOSE, FAR, TOUCHING) and 6 functional types (OPEN_OR_CLOSE, ADJUST, CONTROL, ACTIVATE, POWER_BY, PAIR_WITH).

Annotations are provided in a JSON format with fields such as subgraph_id, scene_id, action_type, function_type, task_instruction, nodes, and edges.
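A hypothetical annotation record using the listed field names, with a small completeness check, might look like this in Python. The field values, the node and edge sub-structure, and the `missing_fields` helper are invented for illustration; only the top-level field names and the edge type vocabulary come from the description above.

```python
REQUIRED_FIELDS = {
    "subgraph_id", "scene_id", "action_type", "function_type",
    "task_instruction", "nodes", "edges",
}

# Illustrative record; every value below is made up.
record = {
    "subgraph_id": "sg_0001",
    "scene_id": "scene_042",
    "action_type": "turn",
    "function_type": "CONTROL",
    "task_instruction": "Fill the bathtub",
    "nodes": ["bathtub", "faucet", "faucet_handle"],
    "edges": [
        {"source": "faucet_handle", "target": "faucet",
         "functional": "CONTROL", "spatial": ["TOUCHING"]},
        {"source": "faucet", "target": "bathtub",
         "functional": None, "spatial": ["HIGHER_THAN", "CLOSE"]},
    ],
}


def missing_fields(rec: dict) -> list[str]:
    """Return the required top-level fields absent from `rec`, sorted by name."""
    return sorted(REQUIRED_FIELDS - set(rec))


print(missing_fields(record))
print(missing_fields({"scene_id": "scene_042"}))
```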

5. Benchmark Suite and Evaluation Metrics

MomaGraph-Bench is a systematic evaluation suite designed for multi-choice reasoning over visual and structural input. It comprises:

294 held-out test scenes, 1,446 multi-view images, 352 annotated task graphs, and 1,315 VQA-style instances distributed across four difficulty tiers.
Six core reasoning competencies:
1. Action Sequence Reasoning (step ordering, dependency structure)
2. Spatial Reasoning (reachability, spatial relations)
3. Object Affordance Reasoning (functionality, tool application)
4. Precondition–Effect Reasoning (prerequisite and effect tracking)
5. Goal Decomposition (subgoal extraction, sequence vs. parallelism)
6. Visual Correspondence (multi-view identity resolution)

Performance is measured by top-1 accuracy in a 4-way multiple-choice setup.
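Top-1 accuracy in a multiple-choice setting is the fraction of instances where the selected option equals the correct one. A short Python sketch, with a per-competency breakdown; the tuple layout and the toy data are assumptions, not benchmark data.

```python
from collections import defaultdict


def top1_accuracy(instances):
    """instances: iterable of (competency, predicted_choice, correct_choice).

    Returns (overall_accuracy, {competency: accuracy}).
    Assumes at least one instance is supplied.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for competency, predicted, gold in instances:
        total[competency] += 1
        correct[competency] += int(predicted == gold)
    per_competency = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_competency


# Toy 4-way multiple-choice results (options A-D).
toy = [
    ("spatial", "A", "A"),
    ("spatial", "B", "C"),
    ("affordance", "D", "D"),
    ("affordance", "A", "A"),
]
overall, per_competency = top1_accuracy(toy)
print(overall, per_competency)
```

With four options, uniform random guessing gives an expected accuracy of 25%, which is the natural floor for interpreting reported scores.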

6. Empirical Performance and Comparative Analysis

MomaGraph-R1 demonstrates the following experimental results:

71.6% overall accuracy on MomaGraph-Bench, representing a +11.4 percentage point (pp) increase over Qwen2.5-VL-7B SFT (60.2%) and +5.9 pp over the best open-source baseline (LLaVA-OneVision at 55.6%).
All models benefit from the Graph-then-Plan strategy versus direct action decoding, with 3–9 pp improvement.
On visual correspondence (BLINK), the model reaches 63.5% (+4.8 pp over the next best).
In the Graph-then-Plan regime, closed-source models such as GPT-5 (71.6%) and Claude-4.5-Sonnet (73.9%) are matched or closely approached.
On a mobile, bimanual robot in four distinct household tasks: 80% scene-graph prediction success, 87.5% planner success given correct graphs, and 70% end-to-end success on a previously unseen, extended-horizon manipulation challenge.
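As an arithmetic aside, the reported real-robot figures are mutually consistent under a simple two-stage model in which end-to-end success requires a correct graph followed by a successful plan. Whether the 70% figure was actually obtained this way is not stated in the text, so the composition below is an assumption. The sketch uses exact fractions to avoid floating-point rounding.

```python
from fractions import Fraction

# Reported rates, expressed exactly.
graph_success = Fraction(80, 100)                  # scene-graph prediction success
planner_success_given_graph = Fraction(875, 1000)  # planner success given correct graphs

# Assumed composition: both stages must succeed.
end_to_end = graph_success * planner_success_given_graph
print(float(end_to_end))

# Benchmark gain over the Qwen2.5-VL-7B SFT baseline, in percentage points.
gain_pp = Fraction(716, 10) - Fraction(602, 10)
print(float(gain_pp))
```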

7. Significance and Implications

MomaGraph-R1 establishes a new standard for unified, state-aware, spatial-functional scene graph construction and task-oriented planning in vision–language agents. Its joint reasoning over structural representations and high-dimensional visual input, achieved via reinforcement learning on richly annotated, task-grounded data, underpins improved generalization, compositionality, and practical real-world transfer. The Graph-then-Plan paradigm outperforms direct planning from raw images and instructions, highlighting the utility of structured, task-sensitive abstractions for complex embodied decision-making. The accessibility of large-scale, high-fidelity datasets and systematic benchmarks further supports rigorous evaluation and reproducibility in this domain (Ju et al., 18 Dec 2025).

References (1)

MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning (2025)
