MomaGraph-R1: Zero-Shot Scene Graph & Task Planning
- The paper presents a seven-billion-parameter model that leverages a vision-language backbone with cross-modal fusion to predict semantically rich scene graphs.
- It employs a dual-stage training process combining supervised fine-tuning and reinforcement learning to optimize task-relevant graph predictions for embodied agents.
- The model achieves 71.6% benchmark accuracy and demonstrates robust transfer to real-world robotics for zero-shot planning in household environments.
MomaGraph-R1 is a 7-billion-parameter vision–language model designed to predict unified, task-relevant scene graphs and support zero-shot planning for embodied agents operating in household environments. Developed on the Qwen2.5-VL-7B-Instruct backbone and trained through a combination of supervised fine-tuning and reinforcement learning, MomaGraph-R1 introduces a compact yet semantically rich representation that integrates spatial, functional, and part-level affordance relationships in household scenes. The model achieves 71.6% accuracy on the MomaGraph benchmark, outperforming prior open-source models by over 11 percentage points, and demonstrates strong generalization to both public benchmarks and physical robot deployments (Ju et al., 18 Dec 2025).
1. Model Architecture
MomaGraph-R1 employs a multi-component architecture consisting of:
- Vision–Language Backbone:
- An image encoder, composed of a ResNet-like CNN and a transformer projector, converts each RGB image into 1,024-dimensional visual tokens.
- A 32-layer transformer text encoder (hidden size 4,096) processes the task instruction.
- Cross-modal fusion is achieved by inserting visual tokens into the lower layers of the text transformer via cross-attention, yielding a unified 4,096-dimensional feature space.
- Graph-Prediction Head:
- Multimodal features are passed to a causal LLM head (vocabulary ≈32,000), producing a structured JSON string that encodes the scene graph.
- The JSON schema contains a list of nodes (with unique IDs, labels, and part-level flags) and edges (each with functional and spatial relationship types, and source/target IDs); an illustrative example appears at the end of this section.
- Node representation: each node is encoded as $h_v = [\, s_v \;\|\; p_v \;\|\; b_v \,] \in \mathbb{R}^{864}$, comprising a 768-dimensional semantic embedding $s_v$, a 32-dimensional part indicator $p_v$, and a 64-dimensional bounding-box encoding $b_v$.
- Edge embedding consists of one-hot vectors for functional (6 types: OPEN_OR_CLOSE, ADJUST, CONTROL, ACTIVATE, POWER_BY, PAIR_WITH) and spatial (9 types: LEFT_OF, RIGHT_OF, IN_FRONT_OF, BEHIND, HIGHER_THAN, LOWER_THAN, CLOSE, FAR, TOUCHING) relations, linearly mapped to a 128-dimensional feature.
- Plan-Generation Head:
- Once the scene graph is generated, it is re-tokenized and combined with the instruction. The same LLM then decodes a sequence of high-level actions (e.g., "turn knob X clockwise," "press microwave door handle").
This architecture enables MomaGraph-R1 to predict compact, action-relevant scene representations and generate executable plans in a fully neural, end-to-end fashion.
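The paper summary describes the graph JSON only at the schema level, so the following is a hypothetical example consistent with the fields listed above (IDs, labels, part-level flags, functional and spatial relation types); the exact key names (`nodes`, `edges`, `is_part`, etc.) are assumptions for illustration.

```python
# Hypothetical scene-graph JSON consistent with the schema described above.
# Field names are illustrative assumptions, not the paper's exact keys.
import json

scene_graph = {
    "nodes": [
        {"id": 0, "label": "stove", "is_part": False},
        {"id": 1, "label": "knob", "is_part": True},    # part-level node
        {"id": 2, "label": "kettle", "is_part": False},
    ],
    "edges": [
        # Functional relation: the knob controls the stove burner.
        {"source": 1, "target": 0, "functional": "CONTROL", "spatial": "TOUCHING"},
        # Spatial relation only: the kettle sits in front of the stove.
        {"source": 2, "target": 0, "functional": None, "spatial": "IN_FRONT_OF"},
    ],
}

# The model emits this structure as a JSON string, which downstream code parses.
print(json.dumps(scene_graph, indent=2))
```

Keeping the graph this compact is what allows it to be re-tokenized and fed back to the same LM head for planning.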
2. Training Procedures
Training proceeds in two stages:
- Supervised Fine-Tuning (SFT):
The model is trained on approximately 1,050 annotated graphs from the MomaGraph-Scenes dataset. The loss function is a standard token-level cross-entropy over the ground-truth graph JSON sequence $y = (y_1, \dots, y_T)$, conditioned on the images $I$ and instruction $x$:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t}, I, x\right)$$
- Reinforcement Learning (DAPO-style Policy Gradient):
Post-SFT, the model is further optimized using a reward-driven RL objective. For each sampled candidate graph $\hat{G} \sim \pi_\theta(\cdot \mid I, x)$ and ground truth $G^{*}$, the reward combines action agreement, node and edge set overlaps, format correctness, and brevity (an illustrative sketch of one possible instantiation appears at the end of this section):

$$R(\hat{G}, G^{*}) = w_{\text{act}} R_{\text{act}} + w_{\text{node}} R_{\text{node}} + w_{\text{edge}} R_{\text{edge}} + w_{\text{fmt}} R_{\text{fmt}} + w_{\text{brev}} R_{\text{brev}}$$

The RL loss is a baseline-subtracted policy gradient with a KL penalty toward the SFT policy:

$$\mathcal{L}_{\text{RL}} = -\,\mathbb{E}_{\hat{G} \sim \pi_\theta}\!\left[\bigl(R(\hat{G}, G^{*}) - b(I, x)\bigr)\,\log \pi_\theta(\hat{G} \mid I, x)\right] + \beta\, D_{\text{KL}}\!\bigl(\pi_\theta \,\|\, \pi_{\text{SFT}}\bigr)$$

where $b(I, x)$ is a learned value-function baseline and $\beta$ is the KL-penalty coefficient.
This two-phase procedure first instills precise output structure and semantics, then further aligns generation to task-relevant, compact graphs via reward shaping.
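The paper names the reward components but the summary above does not give their exact functional form or weights; the Python sketch below shows one plausible instantiation under those assumptions, using set-level F1 for node/edge overlap. The weights, the whitespace-token brevity proxy, and the function signature are all assumptions, not the paper's formulation.

```python
# Illustrative reward combining the components named above: action agreement,
# node/edge set overlap, format correctness, and brevity.
import json

def set_f1(pred, gold):
    """F1 score between two sets of hashable items."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(pred), tp / len(gold)
    return 2 * prec * rec / (prec + rec)

def edge_key(edge, id_to_label):
    """Represent an edge by endpoint labels plus relation types, so node IDs
    need not match between candidate and ground-truth graphs."""
    return (id_to_label.get(edge["source"]), id_to_label.get(edge["target"]),
            edge.get("functional"), edge.get("spatial"))

def graph_reward(candidate_json, gold_graph, cand_actions, gold_actions,
                 max_tokens=512, w=(0.4, 0.2, 0.2, 0.1, 0.1)):
    w_act, w_node, w_edge, w_fmt, w_brev = w

    # Format correctness: the candidate must parse as valid JSON.
    try:
        cand = json.loads(candidate_json)
        r_fmt = 1.0
    except json.JSONDecodeError:
        return 0.0  # a malformed graph earns no reward

    cand_ids = {n["id"]: n["label"] for n in cand["nodes"]}
    gold_ids = {n["id"]: n["label"] for n in gold_graph["nodes"]}

    # Node and edge set overlap.
    r_node = set_f1({(n["label"], n.get("is_part", False)) for n in cand["nodes"]},
                    {(n["label"], n.get("is_part", False)) for n in gold_graph["nodes"]})
    r_edge = set_f1({edge_key(e, cand_ids) for e in cand["edges"]},
                    {edge_key(e, gold_ids) for e in gold_graph["edges"]})

    # Action agreement between the plan decoded from the candidate graph
    # and the ground-truth plan.
    r_act = set_f1(cand_actions, gold_actions)

    # Brevity: penalize graphs that exceed a token budget (whitespace tokens here).
    r_brev = max(0.0, 1.0 - len(candidate_json.split()) / max_tokens)

    return (w_act * r_act + w_node * r_node + w_edge * r_edge
            + w_fmt * r_fmt + w_brev * r_brev)
```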
3. Graph-then-Plan Inference Framework
Inference in MomaGraph-R1 follows a dual-stage process:
- Scene-Graph Generation: Given multi-view images $I$ and instruction $x$, the model predicts the most likely task-relevant scene graph via beam or greedy search:

$$\hat{G} = \arg\max_{G}\; p_\theta(G \mid I, x)$$
The resulting graph represents nodes (objects/parts), edges (spatial/functional), and part-level affordances in structured JSON format.
- Zero-Shot Task Planning: The predicted graph is encoded as text and prepended to the original instruction. The LLM then generates a sequence of high-level actions:

$$\hat{a}_{1:T} = \arg\max_{a_{1:T}}\; p_\theta\bigl(a_{1:T} \mid \hat{G}, x\bigr)$$
This process requires no additional training or domain-specific engineering; both stages utilize the same underlying LM head.
Graph updates are performed post-action via

$$G_{t+1} = \mathcal{U}\bigl(G_t, a_t, o_{t+1}\bigr),$$

where $\mathcal{U}$ prunes or modifies edges according to the observed state change (with $a_t$ the executed action and $o_{t+1}$ the new observation), maintaining a dynamically consistent scene representation after each step.
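A minimal sketch of the two-stage inference loop follows. The prompt templates and the `vlm_generate` callable are assumptions standing in for whatever decoding call wraps the model's LM head (greedy or beam search); they are not the paper's exact prompts or API.

```python
# Sketch of Graph-then-Plan inference. `vlm_generate(images, prompt) -> str`
# is an assumed stand-in for the model's decoding interface.
import json
from typing import Callable, List

def graph_then_plan(images: List[bytes], instruction: str,
                    vlm_generate: Callable[[List[bytes], str], str]) -> dict:
    # Stage 1: predict the task-relevant scene graph as a JSON string.
    graph_prompt = (
        "Given the images and the task below, output the task-relevant "
        f"scene graph as JSON.\nTask: {instruction}"
    )
    scene_graph = json.loads(vlm_generate(images, graph_prompt))

    # Stage 2: re-serialize the graph, prepend it to the instruction, and decode
    # a sequence of high-level actions with the same LM head.
    plan_prompt = (
        f"Scene graph: {json.dumps(scene_graph)}\n"
        f"Task: {instruction}\n"
        "Output the high-level action sequence."
    )
    plan = vlm_generate(images, plan_prompt)

    return {"graph": scene_graph, "plan": plan}
```

Because both stages reuse the same model, no extra planner or domain-specific module is needed at deployment time.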
4. Internal Representation and Dynamic Updates
Node representations $h_v$ combine object semantics, part indicators, and geometric bounding boxes. Edge features $e_{uv}$ are composed from spatial and functional codes. Explicit part-level nodes enable fine-grained affordance detection (e.g., “knob” vs. “stove”).
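A minimal PyTorch sketch of how features with the stated dimensions (768 + 32 + 64 for nodes; 6 + 9 one-hot relation bits mapped to 128 for edges) could be composed is shown below; the module structure and layer choices are assumptions, not the paper's code.

```python
# Sketch of node/edge feature composition with the dimensions stated above.
import torch
import torch.nn as nn

NUM_FUNCTIONAL, NUM_SPATIAL = 6, 9   # relation vocabularies listed in Section 1

class NodeEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.part_embed = nn.Embedding(2, 32)   # part-level indicator -> 32-d
        self.bbox_proj = nn.Linear(4, 64)       # (x, y, w, h) -> 64-d

    def forward(self, semantic, is_part, bbox):
        # semantic: (N, 768) label embedding; is_part: (N,) long; bbox: (N, 4)
        return torch.cat(
            [semantic, self.part_embed(is_part), self.bbox_proj(bbox)], dim=-1
        )  # (N, 864)

class EdgeEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # One-hot functional + spatial codes, linearly mapped to 128-d.
        self.proj = nn.Linear(NUM_FUNCTIONAL + NUM_SPATIAL, 128)

    def forward(self, functional_onehot, spatial_onehot):
        return self.proj(torch.cat([functional_onehot, spatial_onehot], dim=-1))
```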
Dynamic scene updates ensure that after each action and state transition, the representation remains accurate:
- Edges corresponding to realized functional effects (e.g., knob controlling burner) are retained.
- Edges contradicted by new observations are removed.
- The update operator $\mathcal{U}$ is responsible for this pruning and augmentation at each step (see the sketch below).
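The following sketch illustrates $\mathcal{U}$ under the assumption that the post-action observation can be summarized as sets of confirmed and contradicted edge keys; the representation of observations and the `realized` flag are hypothetical.

```python
# Illustrative update operator U(G_t, a_t, o_{t+1}): retain edges whose
# functional effects were realized, drop edges contradicted by the new
# observation. The observation format is an assumption for this sketch.
def update_graph(graph: dict, confirmed: set, contradicted: set) -> dict:
    def key(e):
        return (e["source"], e["target"], e.get("functional"), e.get("spatial"))

    # Remove edges contradicted by the new observation.
    kept = [e for e in graph["edges"] if key(e) not in contradicted]

    # Mark realized functional effects so the planner can treat them as done.
    for e in kept:
        if key(e) in confirmed:
            e["realized"] = True

    return {"nodes": graph["nodes"], "edges": kept}
```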
5. Empirical Performance and Benchmarking
MomaGraph-Bench Results
The model is evaluated on zero-shot multiple-choice VQA (294 scenes):
| Model | Params | Tier-1 | Tier-2 | Tier-3 | Tier-4 | Overall |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 7B | 62.1% | 58.5% | 51.9% | 56.5% | 60.2% |
| LLaVA-Onevision | 7B | 60.0% | 52.4% | 58.4% | 43.4% | 55.6% |
| MomaGraph-R1 | 7B | 76.4% | 71.9% | 70.1% | 68.1% | 71.6% |
MomaGraph-R1 achieves an 11.4-percentage-point absolute improvement in overall accuracy over its base model (Qwen2.5-VL-7B), approaching the performance levels of closed-source models.
Visual Correspondence
| Model | BLINK | MomaGraph-Corr |
|---|---|---|
| DeepSeek-VL2 | 57.4% | 68.4% |
| Qwen2.5-VL-7B | 58.7% | 72.7% |
| LLaVA-Onevision | 59.7% | 70.7% |
| MomaGraph-R1 | 63.5% | 77.5% |
Real-Robot Transfer (RobotEra Q5; 10 trials)
- Graph-generation success: 80%
- Planning success (given correct graph): 87.5%
- End-to-end task completion: 70%
These results indicate robust generalization from simulation to real-world robotic manipulation (Ju et al., 18 Dec 2025).
6. Strengths, Limitations, and Prospects
Strengths:
- Unified modeling of spatial, functional, and part-level relationships within a single, updatable scene graph.
- Reinforcement learning with graph-alignment rewards yields concise and task-relevant graphs.
- The Graph-then-Plan paradigm supports zero-shot planning and strong generalization across benchmarks and hardware platforms.
Limitations:
- Dependence on high-quality graph annotations for RL training rewards poses challenges for scaling to unlabeled datasets.
- JSON-based decoding may generate malformed graphs under severe occlusion.
- Low-level control and fine-grained closed-loop corrections remain outside the model’s planning scope.
Potential Extensions:
- Self-supervised graph refinement through learning from failed execution feedback.
- Integration of 3D point-cloud data for enhanced geometric awareness.
- End-to-end training of both high-level graph prediction and low-level motor control for fully closed-loop visuomotor grounding.
MomaGraph-R1 demonstrates that explicit prediction and updating of unified, state-aware scene graphs is an effective approach for vision-language-based embodied task planning, delivering strong zero-shot performance in both virtual and real-world task executions (Ju et al., 18 Dec 2025).