MomaGraph-R1: Unified Vision–Language Planning
- The paper introduces MomaGraph-R1, a novel 7B model using a Graph-then-Plan paradigm for unified scene graph generation and embodied task planning.
- It employs a multimodal encoder–decoder architecture with a specialized graph decoder and reinforcement learning optimization via graph-alignment rewards.
- Empirical results demonstrate significant improvements in planning accuracy and scene graph prediction across synthetic benchmarks and real-robot household tasks.
MomaGraph-R1 is a 7-billion-parameter vision–language model (VLM) designed to generate unified, task-aware scene graphs and to perform zero-shot embodied task planning in complex household environments. Following a “Graph-then-Plan” paradigm, it combines spatial–functional reasoning, part-level affordance modeling, and vision–language inference, and reports state-of-the-art results on both synthetic benchmarks and real-robot tasks (Ju et al., 18 Dec 2025).
1. Model Architecture and Representation
MomaGraph-R1 builds upon Qwen2.5-VL-7B-Instruct, a Transformer-based multimodal encoder–decoder. The model ingests a set of RGB images and a text instruction, encoding the images via a vision projection module that produces visual tokens. These tokens are concatenated with the textual tokens and processed through shared self-attention layers.
A specialized graph decoder is prompted to emit a JSON-structured scene graph for each task, consisting of:
- Nodes: task-relevant objects and part-level nodes.
- Spatial edges: directed spatial relations (e.g., “in_front_of,” “touching”).
- Functional edges: directed functional relations (e.g., “control,” “open_or_close,” “adjust,” “power_by,” “activate,” “pair_with”).
Each semantic edge carries both a functional label and an array of spatial relations.
Dynamic scene graph updates are supported online: at each interaction step, the graph is refined by an update function that takes the executed action and the resulting state transition as input.
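The edge structure and the online update described above can be illustrated with a minimal Python sketch. The field names (`source`, `target`, `functional`, `spatial`, `states`), the `update_graph` helper, and the stove example are assumptions made for illustration; the source does not specify the exact JSON keys or the form of the update function.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Edge:
    source: str                      # e.g. a part-level node such as a knob
    target: str                      # the object it acts on
    functional: str                  # one functional label, e.g. "control"
    spatial: list[str] = field(default_factory=list)  # e.g. ["touching"]


@dataclass
class SceneGraph:
    nodes: list[str] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)
    states: dict[str, str] = field(default_factory=dict)  # hypothetical per-node state

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


def update_graph(graph: SceneGraph, action: str, transition: dict[str, str]) -> SceneGraph:
    """Return a refined copy of the graph after one interaction step.

    `transition` maps node name -> new state; `action` is used only in the
    error message. Both are illustrative stand-ins, not the paper's interface.
    """
    unknown = set(transition) - set(graph.nodes)
    if unknown:
        raise ValueError(f"{action!r} changed nodes not in the graph: {sorted(unknown)}")
    return SceneGraph(
        nodes=list(graph.nodes),
        edges=list(graph.edges),
        states={**graph.states, **transition},
    )


graph = SceneGraph(
    nodes=["stove", "stove_knob"],
    edges=[Edge("stove_knob", "stove", "control", ["in_front_of", "touching"])],
)
updated = update_graph(graph, "turn stove_knob", {"stove": "on"})
print(updated.to_json())
```

Returning a new graph rather than mutating the old one keeps each interaction step's graph available for inspection, which is one reasonable design choice among several.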
2. Training Protocol: Supervised and Reinforcement Learning
The MomaGraph-R1 training pipeline comprises two sequential stages:
- Supervised Fine-tuning (SFT): The model is initially trained on ground-truth scene graphs from the MomaGraph-Scenes dataset.
- Reinforcement Learning with DAPO: Using a PPO-style procedure, MomaGraph-R1 is optimized for graph prediction quality with a graph-alignment reward whose terms include:
  - Edge alignment: agreement between the predicted and ground-truth edge predicates.
  - Node overlap: the Jaccard index of the predicted and ground-truth node sets.
  - Format: a JSON validity constraint on the output.
  - Length: a penalty for excessive output length.
The objective is to maximize the expected reward, using PPO-style policy-gradient updates weighted by an advantage function that is estimated by a learned value network.
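To make the reward terms concrete, the following is a minimal sketch of a graph-alignment reward built from the four components named above. The equal weights, the `max_chars` threshold, the exact-match definition of edge alignment, and the dictionary keys are all assumptions; the paper's actual formula and coefficients may differ.

```python
import json


def jaccard(a: set, b: set) -> float:
    """Jaccard index |a ∩ b| / |a ∪ b|; defined as 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def edge_alignment(pred_edges: list, gt_edges: list) -> float:
    """Fraction of ground-truth edges whose functional label is predicted exactly."""
    if not gt_edges:
        return 1.0
    pred = {(e["source"], e["target"]): e["functional"] for e in pred_edges}
    hits = sum(pred.get((e["source"], e["target"])) == e["functional"] for e in gt_edges)
    return hits / len(gt_edges)


def graph_alignment_reward(output_text: str, gt_graph: dict,
                           max_chars: int = 2000,
                           weights: tuple = (1.0, 1.0, 1.0, 1.0)) -> float:
    """Combine edge, node, format, and length terms (hypothetical weighting)."""
    w_edge, w_node, w_fmt, w_len = weights
    try:
        pred = json.loads(output_text)
        valid = isinstance(pred, dict)
    except json.JSONDecodeError:
        pred, valid = {}, False

    r_fmt = 1.0 if valid else 0.0
    # Schema validation of the parsed graph is omitted in this sketch.
    r_node = jaccard(set(pred.get("nodes", [])), set(gt_graph["nodes"])) if valid else 0.0
    r_edge = edge_alignment(pred.get("edges", []), gt_graph["edges"]) if valid else 0.0
    # Penalty grows linearly once the output exceeds max_chars.
    r_len = max(0, len(output_text) - max_chars) / max_chars
    return w_edge * r_edge + w_node * r_node + w_fmt * r_fmt - w_len * r_len


gt = {
    "nodes": ["stove", "stove_knob"],
    "edges": [{"source": "stove_knob", "target": "stove", "functional": "control"}],
}
perfect = graph_alignment_reward(json.dumps(gt), gt)
partial = graph_alignment_reward(json.dumps({"nodes": ["stove"], "edges": []}), gt)
invalid = graph_alignment_reward("not json", gt)
```

With the default weights, a perfect prediction scores highest, a valid but incomplete graph scores in between, and unparseable output receives no credit.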
3. Graph-then-Plan Inference Paradigm
MomaGraph-R1 is deployed under a two-stage “Graph-then-Plan” inference regime:
- Scene-Graph Prediction: The system infers the most likely task-oriented scene graph given the current instruction and visual observations.
- Zero-Shot Task Planning: Given the predicted graph and the instruction, the model or a lightweight planner head generates an action sequence by sampling from the plan distribution conditioned on both.
This process is typically executed with two prompts:
- Stage 1: “Given images … and the instruction, output a JSON scene graph.”
- Stage 2: “Given the scene graph and the instruction, output an ordered list of primitive actions.”
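Operationally, the two stages run back to back: the JSON graph returned by Stage 1 is inserted into the Stage 2 prompt together with the original instruction.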
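The two-prompt flow can be sketched as follows. The `generate` callable stands in for any VLM inference call and is hypothetical, as are the prompt wording, the JSON-array plan format, and the `fake_generate` stub used to make the example runnable without a model.

```python
import json
from typing import Callable, Sequence

GRAPH_MARKER = "output a JSON scene graph"


def graph_then_plan(generate: Callable[[str, Sequence[str]], str],
                    images: Sequence[str],
                    instruction: str) -> tuple[dict, list]:
    """Run Stage 1 (graph prediction) and Stage 2 (planning) in sequence."""
    # Stage 1: the model sees the images and the instruction.
    graph_prompt = f"Given the images and instruction '{instruction}', {GRAPH_MARKER}."
    graph = json.loads(generate(graph_prompt, images))

    # Stage 2: the model sees only the predicted graph and the instruction.
    plan_prompt = (
        f"Given graph {json.dumps(graph)} and instruction '{instruction}', "
        "output an ordered list of primitive actions as a JSON array."
    )
    plan = json.loads(generate(plan_prompt, []))
    return graph, plan


def fake_generate(prompt: str, images: Sequence[str]) -> str:
    """Stub model returning canned answers so the sketch runs offline."""
    if GRAPH_MARKER in prompt:
        return json.dumps({
            "nodes": ["stove", "stove_knob"],
            "edges": [{"source": "stove_knob", "target": "stove",
                       "functional": "control", "spatial": ["touching"]}],
        })
    return json.dumps(["navigate_to(stove)", "turn(stove_knob)"])


graph, plan = graph_then_plan(fake_generate, ["view_0.png"], "Turn on the stove")
print(plan)
```

Parsing the Stage 1 output before building the Stage 2 prompt means malformed graphs fail early instead of silently propagating into planning.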
4. Datasets and Annotation Schema
MomaGraph-Scenes is the foundational dataset for model training and evaluation. It incorporates:
- 41,050 task-driven subgraphs derived from 6,278 multi-view RGB frames.
- 350+ real and simulated indoor household scenes.
- 93 naturally phrased task instructions (e.g., “Fill the bathtub”), some of which are under-specified.
- Nodes encompass objects and part-level elements (handles, knobs, switches).
- Edges include 9 spatial types (LEFT_OF, RIGHT_OF, IN_FRONT_OF, BEHIND, HIGHER_THAN, LOWER_THAN, CLOSE, FAR, TOUCHING) and 6 functional types (OPEN_OR_CLOSE, ADJUST, CONTROL, ACTIVATE, POWER_BY, PAIR_WITH).
Annotations are provided in a JSON format with fields such as subgraph_id, scene_id, action_type, function_type, task_instruction, nodes, and edges.
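As an illustration of the annotation schema, the sketch below checks a record against the top-level fields and the edge vocabularies listed above. The record values and the per-edge keys (`source`, `target`, `functional`, `spatial`) are hypothetical; only the top-level field names and the label sets come from the description of the dataset.

```python
SPATIAL_TYPES = {
    "LEFT_OF", "RIGHT_OF", "IN_FRONT_OF", "BEHIND", "HIGHER_THAN",
    "LOWER_THAN", "CLOSE", "FAR", "TOUCHING",
}
FUNCTIONAL_TYPES = {
    "OPEN_OR_CLOSE", "ADJUST", "CONTROL", "ACTIVATE", "POWER_BY", "PAIR_WITH",
}
REQUIRED_FIELDS = {
    "subgraph_id", "scene_id", "action_type", "function_type",
    "task_instruction", "nodes", "edges",
}


def validate_record(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    for i, edge in enumerate(record.get("edges", [])):
        if edge.get("functional") not in FUNCTIONAL_TYPES:
            problems.append(f"edge {i}: unknown functional type {edge.get('functional')!r}")
        bad_spatial = set(edge.get("spatial", [])) - SPATIAL_TYPES
        if bad_spatial:
            problems.append(f"edge {i}: unknown spatial types {sorted(bad_spatial)}")
    return problems


# Hypothetical record; identifiers and values are made up for illustration.
example = {
    "subgraph_id": "sg_0001",
    "scene_id": "scene_0001",
    "action_type": "turn",
    "function_type": "CONTROL",
    "task_instruction": "Fill the bathtub",
    "nodes": ["bathtub", "faucet_handle"],
    "edges": [{"source": "faucet_handle", "target": "bathtub",
               "functional": "CONTROL", "spatial": ["HIGHER_THAN", "CLOSE"]}],
}
broken = {"scene_id": "scene_0002",
          "edges": [{"functional": "PUSH", "spatial": ["ABOVE"]}]}

print(validate_record(example))
print(validate_record(broken))
```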
5. Benchmark Suite and Evaluation Metrics
MomaGraph-Bench is a systematic evaluation suite designed for multiple-choice reasoning over visual and structural input. It comprises:
- 294 held-out test scenes, 1,446 multi-view images, 352 annotated task graphs, and 1,315 VQA-style instances distributed across four difficulty tiers.
- Six core reasoning competencies:
  - Action Sequence Reasoning (step ordering, dependency structure)
  - Spatial Reasoning (reachability, spatial relations)
  - Object Affordance Reasoning (functionality, tool application)
  - Precondition–Effect Reasoning (prerequisite and effect tracking)
  - Goal Decomposition (subgoal extraction, sequence vs. parallelism)
  - Visual Correspondence (multi-view identity resolution)
Performance is measured by top-1 accuracy in a 4-way multiple-choice setup.
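Top-1 accuracy in this setting is the fraction of questions where the selected option matches the answer key. The sketch below shows the computation; the option letters and the toy predictions are made up, and with four options a uniformly random guesser would score 25% in expectation.

```python
def top1_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions where the chosen option equals the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


NUM_OPTIONS = 4
chance_level = 1 / NUM_OPTIONS  # expected accuracy of uniform random guessing

# Toy example: four questions, three answered correctly.
accuracy = top1_accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
print(f"accuracy={accuracy:.2%}, chance={chance_level:.2%}")
```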
6. Empirical Performance and Comparative Analysis
MomaGraph-R1 demonstrates the following experimental results:
- 71.6% overall accuracy on MomaGraph-Bench, representing a +11.4 percentage point (pp) increase over Qwen2.5-VL-7B SFT (60.2%) and +5.9 pp over the best open-source baseline (LLaVA-OneVision at 55.6%).
- All models benefit from the Graph-then-Plan strategy versus direct action decoding, with 3–9 pp improvement.
- On visual correspondence (BLINK), the model reaches 63.5% (+4.8 pp over the next best).
- In the Graph-then-Plan regime, MomaGraph-R1 matches GPT-5 (71.6%) and closely approaches Claude-4.5-Sonnet (73.9%), both closed-source models.
- On a mobile, bimanual robot in four distinct household tasks: 80% scene-graph prediction success, 87.5% planner success given correct graphs, and 70% end-to-end success on a previously unseen, extended-horizon manipulation challenge.
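The three real-robot figures are numerically consistent with a simple two-stage composition, which the short check below illustrates with exact fractions. Treating end-to-end success as the product of graph success and planner success given a correct graph is an assumption made here for illustration; the source does not state that the 70% figure was derived this way.

```python
from fractions import Fraction

graph_success = Fraction(80, 100)         # scene-graph prediction success
planner_given_graph = Fraction(875, 1000) # planner success given a correct graph

# Assumed composition: a run succeeds only if both stages succeed.
end_to_end = graph_success * planner_given_graph

print(f"{float(end_to_end):.1%}")
```

Exact fractions are used instead of floats so that the comparison with 7/10 is not affected by rounding.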
7. Significance and Implications
MomaGraph-R1 establishes a new standard for unified, state-aware, spatial-functional scene graph construction and task-oriented planning in vision–language agents. Its joint reasoning over structural representations and high-dimensional visual input, achieved via reinforcement learning on richly annotated, task-grounded data, underpins improved generalization, compositionality, and practical real-world transfer. The Graph-then-Plan paradigm outperforms direct planning from raw images and instructions, highlighting the utility of structured, task-sensitive abstractions for complex embodied decision-making. The accessibility of large-scale, high-fidelity datasets and systematic benchmarks further supports rigorous evaluation and reproducibility in this domain (Ju et al., 18 Dec 2025).