Scene-Graph Skill Learning Framework
- The paper presents a novel framework that leverages dynamic scene graph construction and graph neural networks to enhance robotic skill execution.
- It details a method for constructing task-specific 3D representations using RGB-D segmentation, point cloud embedding, and relation inference with vision–language models.
- Experimental results in simulation and real-world settings demonstrate superior robustness and generalization in compositional tasks over traditional approaches.
A Scene-Graph Skill Learning Framework is a computational paradigm that leverages structured, graph-based representations of visual or physical environments to facilitate the acquisition, transfer, and robust execution of atomic and compositional skills in robotics and embodied AI. These frameworks integrate dynamic scene graphs—encoding spatial, semantic, and task-relevant relations among entities—with learning architectures such as graph neural networks (GNNs), diffusion-based imitation learning, and vision–language model (VLM)–driven planners to achieve generalizable, compositional skill execution in complex and variable environments (Qi et al., 19 Sep 2025).
1. Focused Scene Graph Construction
Scene-Graph Skill Learning Frameworks begin by representing the environment as a dynamic 3D scene graph, specifically tailored to each current task and skill:
- Task-Relevant Object Selection: Raw RGB-D observations are processed using a vision foundation model (e.g., Grounded SAM) that segments only those objects referenced by a task description, removing distractors and focusing the representation space on entities relevant for the current skill or sub-goal.
- 3D Embedding: The depth channel is unprojected into a point cloud using camera parameters. Downsampling (e.g., farthest point sampling) creates compact point sets for each object, which are embedded into fixed-length vectors with lightweight MLP-based encoders.
- Relation Inference: Edges are dynamically constructed by querying a VLM (e.g., ChatGPT) for spatial or semantic relations such as “grasp,” “next,” or “inside” between pairs of objects.
- Graph Structure: The scene graph is defined over the subset of the environment involving the gripper, manipulated objects, and salient obstacles, producing a focused, dynamically updated subgraph per skill invocation.
This “focused” representation significantly reduces irrelevant variability, as it is recomputed at every sub-goal, supporting policy robustness to distribution shift and clutter.
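To make this pipeline concrete, the following is a minimal, self-contained Python sketch (not the authors' implementation): the segmentation masks and the relation list are assumed to be supplied by upstream components (e.g., Grounded SAM and a VLM query), and the helper names `unproject_depth`, `farthest_point_sample`, `PointEncoder`, and `build_focused_graph` are illustrative.

```python
# Minimal sketch of focused scene-graph construction (assumed helper names;
# segmentation masks and VLM relation queries are stubbed for illustration).
import numpy as np
import torch
import torch.nn as nn

def unproject_depth(depth: np.ndarray, mask: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift masked depth pixels into a 3D point cloud using camera intrinsics K."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)  # (N, 3)

def farthest_point_sample(points: np.ndarray, n: int) -> np.ndarray:
    """Greedy FPS: keep up to n points that maximize mutual distance."""
    idx = [0]
    dists = np.linalg.norm(points - points[0], axis=1)
    for _ in range(1, min(n, len(points))):
        idx.append(int(dists.argmax()))
        dists = np.minimum(dists, np.linalg.norm(points - points[idx[-1]], axis=1))
    return points[idx]

class PointEncoder(nn.Module):
    """Lightweight per-point MLP + max pool -> fixed-length object embedding."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))
    def forward(self, pts: torch.Tensor) -> torch.Tensor:  # (N, 3) -> (dim,)
        return self.mlp(pts).max(dim=0).values

def build_focused_graph(depth, masks, K, relations, encoder):
    """masks: {name: HxW bool} for task-relevant objects only (e.g., from Grounded SAM).
    relations: [(src, rel, dst)] as returned by a VLM query (stubbed upstream)."""
    nodes, feats = list(masks), []
    for name in nodes:
        pts = farthest_point_sample(unproject_depth(depth, masks[name], K), 256)
        feats.append(encoder(torch.as_tensor(pts, dtype=torch.float32)))
    edge_index = [(nodes.index(s), nodes.index(d)) for s, _, d in relations]
    return torch.stack(feats), torch.tensor(edge_index).T  # node features, (2, E) edges

# Example with dummy data: two "objects" in a synthetic depth image.
depth = np.ones((8, 8)); K = np.array([[100., 0, 4], [0, 100., 4], [0, 0, 1]])
masks = {"gripper": np.zeros((8, 8), bool), "carrot": np.zeros((8, 8), bool)}
masks["gripper"][:4], masks["carrot"][4:] = True, True
x, edges = build_focused_graph(depth, masks, K, [("gripper", "grasp", "carrot")], PointEncoder())
print(x.shape, edges.shape)  # torch.Size([2, 64]) torch.Size([2, 1])
```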
2. Graph Neural Network Encoding and Policy Design
The focused scene graph forms the basis for downstream policy learning and execution:
- Graph Encoding: A two-layer Graph Attention Network (GAT) processes the scene graph, producing updated node embeddings via attention-based message passing:
$$h_i' = \big\Vert_{k=1}^{K} \, \sigma\!\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{k} \, W^{k} h_j \right)$$
where $\alpha_{ij}^{k}$ are attention coefficients, $W^{k}$ are head-specific projection matrices, $\sigma$ is a nonlinearity, and $\Vert$ denotes concatenation across the $K$ heads. After processing, node features are pooled (typically mean pooling) to yield a global representation $h_G$. A minimal sketch of this encoder is given after this list.
- Conditioning Policy on Scene Graphs: The encoded global feature $h_G$ is concatenated with a language embedding $\ell$ (e.g., a CLIP-encoded task description) and a robot state vector $s$. The resulting conditioning vector $c = (h_G, \ell, s)$ provides structured, context-rich input to the policy module.
- Diffusion-based Imitation Learning: Actions are generated via a conditional denoising diffusion model, which starts from Gaussian noise and iteratively denoises towards a demonstration-consistent action:
$$a^{k-1} = \alpha_k \left( a^{k} - \gamma_k \, \epsilon_\theta(a^{k}, k, c) \right) + \mathcal{N}(0, \sigma_k^2 I)$$
where $\epsilon_\theta$ is the learned noise predictor conditioned on $c$, and $\alpha_k$, $\gamma_k$, $\sigma_k$ follow the noise schedule. Training minimizes the mean squared error between the injected noise and the denoiser's prediction over the demonstration dataset; a sketch of the loss and sampling loop appears below.
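As referenced above, the following hand-rolled PyTorch sketch illustrates the multi-head GAT update, mean pooling into $h_G$, and concatenation with the language embedding and robot state. The class and variable names are assumptions for illustration, not the paper's code.

```python
# Minimal sketch of the scene-graph encoder and policy conditioning:
# a hand-rolled multi-head GAT layer matching the update rule above,
# followed by mean pooling and concatenation with language and state.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, heads: int = 4):
        super().__init__()
        self.heads, self.out_dim = heads, out_dim
        self.W = nn.Linear(in_dim, heads * out_dim, bias=False)       # per-head projections W^k
        self.a = nn.Parameter(torch.randn(heads, 2 * out_dim) * 0.1)  # per-head attention vectors

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim); edge_index: (2, E) directed edges src -> dst
        N = x.size(0)
        h = self.W(x).view(N, self.heads, self.out_dim)               # (N, K, D)
        src, dst = edge_index
        e = F.leaky_relu(
            (torch.cat([h[dst], h[src]], dim=-1) * self.a).sum(-1))   # (E, K) unnormalized scores
        denom = torch.zeros(N, self.heads).index_add_(0, dst, e.exp())
        alpha = e.exp() / denom[dst].clamp(min=1e-9)                  # softmax over incoming edges
        out = torch.zeros(N, self.heads, self.out_dim).index_add_(
            0, dst, alpha.unsqueeze(-1) * h[src])                     # attention-weighted aggregation
        return F.elu(out.reshape(N, self.heads * self.out_dim))       # concatenate heads, nonlinearity

class SceneGraphEncoder(nn.Module):
    """Two GAT layers + mean pooling -> global scene feature h_G."""
    def __init__(self, in_dim: int = 64, hid: int = 32, heads: int = 4):
        super().__init__()
        self.gat1 = GATLayer(in_dim, hid, heads)
        self.gat2 = GATLayer(hid * heads, hid, heads)
    def forward(self, x, edge_index):
        x = self.gat2(self.gat1(x, edge_index), edge_index)
        return x.mean(dim=0)  # h_G

# Conditioning vector c = concat(h_G, language embedding, robot state).
enc = SceneGraphEncoder()
x = torch.randn(3, 64)                          # 3 object-node embeddings
edge_index = torch.tensor([[0, 1], [1, 2]])     # edges 0->1, 1->2
h_G = enc(x, edge_index)
lang, state = torch.randn(512), torch.randn(8)  # e.g., CLIP text feature, joint/gripper state
c = torch.cat([h_G, lang, state])               # fed to the diffusion policy below
print(c.shape)                                  # torch.Size([648])
```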
This architecture ensures that both geometric and semantic context, as well as the task specification, critically inform low-level skill execution.
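The denoising step can be sketched as follows, assuming a simple MLP noise predictor and a DDPM-style linear noise schedule; the step count, network sizes, and conditioning dimension (matching the concatenation in the previous sketch) are illustrative choices, not values from the paper.

```python
# Minimal sketch of the conditional diffusion action head: an MLP noise
# predictor eps_theta(a_k, k, c), the noise-prediction training loss, and
# the iterative denoising loop used at inference.
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    def __init__(self, action_dim: int, cond_dim: int, steps: int = 50):
        super().__init__()
        self.emb = nn.Embedding(steps, 32)  # timestep embedding
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 32, 256), nn.ReLU(),
            nn.Linear(256, action_dim))
    def forward(self, a_k, k, c):
        return self.net(torch.cat([a_k, self.emb(k), c], dim=-1))

steps, action_dim, cond_dim = 50, 7, 648          # cond_dim = 128 + 512 + 8, as in the sketch above
betas = torch.linspace(1e-4, 0.02, steps)         # illustrative linear schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)
eps_theta = NoisePredictor(action_dim, cond_dim, steps)

def loss(a0, c):
    """MSE between injected noise and the denoiser's prediction (training objective)."""
    k = torch.randint(0, steps, (a0.size(0),))
    eps = torch.randn_like(a0)
    a_k = alpha_bar[k].sqrt().unsqueeze(-1) * a0 + (1 - alpha_bar[k]).sqrt().unsqueeze(-1) * eps
    return ((eps - eps_theta(a_k, k, c)) ** 2).mean()

@torch.no_grad()
def sample(c):
    """Start from Gaussian noise and iteratively denoise towards an action."""
    a = torch.randn(1, action_dim)
    for k in reversed(range(steps)):
        kk = torch.full((1,), k, dtype=torch.long)
        eps = eps_theta(a, kk, c)
        a = (a - (1 - alphas[k]) / (1 - alpha_bar[k]).sqrt() * eps) / alphas[k].sqrt()
        if k > 0:
            a = a + betas[k].sqrt() * torch.randn_like(a)  # stochastic correction term
    return a

c = torch.randn(1, cond_dim)  # conditioning from the scene-graph encoder
print(loss(torch.randn(4, action_dim), c.expand(4, -1)).item(), sample(c).shape)
```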
3. High-Level Planning and Skill Composition via Vision–Language Models
For long-horizon or compositional tasks, these frameworks tightly integrate high-level reasoning:
- Task Decomposition with VLMs: An external VLM (e.g., GPT-4V) receives the natural-language task description and decomposes it into a sequence of sub-goals, each corresponding to an atomic skill (e.g., “pick up carrot”, “move to bowl”).
- On-the-Fly Scene Graph Updates: For each sub-goal, the system dynamically constructs a focused scene graph by re-segmenting the current RGB-D frame for involved objects and updating relations as the environment changes.
- Iterative Skill Execution: Each updated scene graph–language pair is fed into the GNN+diffusion policy to sequentially solve each sub-goal, composing new long-horizon behaviors at inference time.
This integration eliminates the need for demonstration coverage of all task permutations; atomic skills, learned once, can be flexibly recombined at runtime via high-level planning and focused scene representation.
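A minimal sketch of this composition loop is shown below; the VLM decomposition and robot interfaces are stubbed with hypothetical callables (`observe_rgbd`, `build_focused_graph`, `skill_policy`) purely to show the control flow.

```python
# Minimal sketch of the high-level composition loop: a VLM decomposes the
# instruction into sub-goals, and each sub-goal triggers a fresh focused
# scene graph plus one rollout of the learned skill policy.
from typing import Callable, List

def decompose_task(instruction: str) -> List[str]:
    """Placeholder for a VLM (e.g., GPT-4V) call returning ordered sub-goals."""
    return ["pick up carrot", "move to bowl", "place carrot in bowl"]  # canned example

def run_compositional_task(
    instruction: str,
    observe_rgbd: Callable[[], dict],                    # current RGB-D frame + intrinsics
    build_focused_graph: Callable[[dict, str], object],  # section-1 pipeline, per sub-goal
    skill_policy: Callable[[object, str], bool],         # GNN+diffusion policy; True on success
) -> bool:
    for sub_goal in decompose_task(instruction):
        obs = observe_rgbd()                        # re-observe: the scene may have changed
        graph = build_focused_graph(obs, sub_goal)  # re-segment only sub-goal-relevant objects
        if not skill_policy(graph, sub_goal):       # execute the atomic skill to completion
            return False                            # a failed sub-goal aborts the sequence
    return True

# Usage with trivial stubs (real deployments plug in camera, VLM, and policy):
ok = run_compositional_task(
    "put the carrot in the bowl",
    observe_rgbd=lambda: {"rgb": None, "depth": None, "K": None},
    build_focused_graph=lambda obs, goal: {"goal": goal},
    skill_policy=lambda graph, goal: True,
)
print(ok)  # True
```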
4. Experimental Evaluation and Robustness
Experiments validate the efficacy of focused scene-graph skill learning in both simulated and real-world robotic manipulation:
- Simulation: In composed tasks (e.g., “Sort by Color,” “Blocks Stacking Game”), the scene-graph-based approach consistently achieves success rates of 0.78–0.93, substantially outperforming 2D/3D diffusion baselines that lack scene-structured input, especially as the number of skill compositions increases.
- Real-World Manipulation: On tasks such as vegetable picking and tool usage, the framework achieves near-perfect success rates (e.g., 1.0 in vegetable picking, 0.9 in tool usage) even in the presence of distractors or novel obstacles, demonstrating robustness to scene variation and transferability from simulation to real environments.
- Ablations: Disabling the scene graph structure, using flat 2D/3D representations, or omitting point cloud encoding each significantly reduces compositional generalization and robustness, confirming that these architectural choices are necessary for skill generalization.
5. Compositional Generalization and Modularity
Key generalization properties are supported by the following findings:
- Distributional Robustness: Focusing on only task-relevant objects and graph-based relations significantly reduces distribution shift induced by variations in scene clutter or background, enabling transfer from clean demonstration environments to real-world, visually complex deployments with minimal loss in performance.
- Compositionality without Exhaustive Demonstrations: The graph-based framing, in conjunction with VLM-driven decomposition, allows for combinatorial skill recomposition. Sub-goal skill modules learned on isolated tasks are reused in novel sequences, supporting generalization across unseen long-horizon instructions without retraining.
- Scalability: The graph neural policy supports variable-sized and variable-structure inputs, accommodating scenes with differing numbers of objects or relations without explicit adjustment, a requirement for flexible open-world skill deployment.
6. Distinctiveness and Broader Impact
Relative to prior planners that sequence pre-learned skills conditioned on raw vision or object poses, the scene-graph skill learning approach achieves:
- Robustness to unseen compositions, background distractions, and environmental changes due to explicit attention to both semantic and geometric relational structure.
- Efficiency in demonstration and annotation requirements, as atomic skills once learned are composable for broader classes of instructions.
- Interpretability stemming from the graph structure, as policies can be analyzed in terms of object-centric and relation-centric state, facilitating debugging and potential symbolic reasoning overlays.
A plausible implication is that extending this paradigm to other forms of structured representations (e.g., affordance graphs, temporal event graphs) may generalize the benefits observed in robotic manipulation to further classes of embodied and multimodal agents.
In summary, a Scene-Graph Skill Learning Framework, as instantiated in (Qi et al., 19 Sep 2025), integrates dynamic, focused scene graph representation, GNN-based conditioning, and diffusion policies with VLM-driven task decomposition. This architecture yields robust, modular, and generalizable skill execution for long-horizon, compositional tasks in both simulation and real robotic settings, addressing longstanding bottlenecks in skill transfer, robustness, and scalability in embodied intelligence.