Scene Graph-based Atomic Skills
- Scene graph-based atomic skills are fundamental visuomotor and perceptual competencies that use graph representations of objects, attributes, and relationships.
- The approach integrates structured perception, graph neural networks, and diffusion-based policy learning to support modularity and compositional generalization.
- Empirical benchmarks show that focused scene graph methods improve success rates and enhance robustness against distractors in complex, long-horizon tasks.
Scene graph-based atomic skills are a class of fundamental visuomotor and perceptual competencies represented, learned, and executed using scene graphs—graph-structured abstractions capturing objects, attributes, relationships, and possibly spatial or temporal context within an environment. This approach leverages structured graph representations to support modularity, compositional generalization, and robust transfer of skills in AI systems, particularly in embodied agents and robotic platforms. The field integrates developments in structured perception, graph neural networks, energy-based learning, high-level planning, and diffusion-based visuomotor policy learning, and targets the reliable synthesis and sequencing of atomic (indivisible, re-usable) capabilities across long-horizon, complex, and dynamic tasks.
1. Formalization of Scene Graph-based Atomic Skills
Atomic skills are defined as elemental, indivisible capabilities—such as “pick,” “place,” “push,” or “open”—that are composable to solve complex tasks. Scene graphs abstract a scene as $G = (V, E)$, where $V$ is the set of object nodes (each often parameterized by category, spatial pose, and attributes) and $E$ encodes explicit relationships as labeled edges (e.g., support, containment, proximity, kinematic joints) (Qi et al., 18 Nov 2024, Jiao et al., 2022). In manipulation contexts, contact graph+ ($cg^+$) representations augment nodes with predicate-like attributes capturing spatial affordances or accessibility, where supporting and status attributes facilitate geometric and task-level reasoning (Jiao et al., 2022).
In compositional generalization, each atomic skill is tightly associated with a subgraph focusing on the task-relevant subset of $G$; thus “scene graph-based atomic skill” denotes a policy or module whose inputs, decision logic, and effectors operate on a graph-centric representation filtered by task relevance (Qi et al., 19 Sep 2025). This design contrasts with monolithic, image-centric or unstructured policies, providing explicit, interpretable intermediates for connectivity and grounding.
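As a concrete illustration, a minimal Python sketch of such a graph representation and task-relevance filtering might look as follows; the `SceneGraph` structure and `focus` helper are hypothetical stand-ins, not taken from the cited works:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    """An object node: category, 6-DoF pose, and predicate-like attributes."""
    name: str
    category: str
    pose: tuple                                      # (x, y, z, roll, pitch, yaw)
    attributes: dict = field(default_factory=dict)   # e.g., {"status": "open"}

@dataclass
class SceneGraph:
    """G = (V, E): object nodes plus labeled relation edges."""
    nodes: dict = field(default_factory=dict)        # name -> ObjectNode
    edges: list = field(default_factory=list)        # (src, relation, dst)

    def focus(self, relevant: set) -> "SceneGraph":
        """Return the task-relevant subgraph: keep only nodes named in
        `relevant` and the edges whose endpoints both survive."""
        sub = SceneGraph()
        sub.nodes = {n: v for n, v in self.nodes.items() if n in relevant}
        sub.edges = [(s, r, d) for (s, r, d) in self.edges
                     if s in sub.nodes and d in sub.nodes]
        return sub

# Usage: focusing "pick up carrot" onto the carrot and its supporting table.
g = SceneGraph()
g.nodes["carrot"] = ObjectNode("carrot", "vegetable", (0.4, 0.1, 0.8, 0, 0, 0))
g.nodes["table"] = ObjectNode("table", "furniture", (0.0, 0.0, 0.7, 0, 0, 0))
g.nodes["mug"] = ObjectNode("mug", "distractor", (0.6, -0.2, 0.8, 0, 0, 0))
g.edges += [("table", "supports", "carrot"), ("table", "supports", "mug")]
focused = g.focus({"carrot", "table"})   # drops the distractor mug
```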
2. Skill Learning and Policy Architectures
Scene graph-based atomic skill frameworks often employ a sequence of modules for perception, graph construction/focusing, GNN-based encoding, and skill policy inference:
- Perception and Graph Focusing: Scene segmentation (e.g., via Grounded SAM) extracts masks, point clouds, and bounding boxes for objects of interest. A vision-language model (VLM) can further filter scene content to retain only those objects and relations germane to the current skill instance (Qi et al., 19 Sep 2025).
- Graph Neural Networks (GNNs): Node features (object embeddings) and edge features (relations) are processed by multi-layer GNNs, including Graph Attention Networks (GAT), to learn embeddings that encode both spatial and semantic context; a standard GAT layer updates node $i$ as $h_i^{(l+1)} = \sigma\big(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha_{ij} W^{(l)} h_j^{(l)}\big)$, with learned attention coefficients $\alpha_{ij}$. Mean-pooling over nodes, $z = \tfrac{1}{|V|} \sum_{i \in V} h_i^{(L)}$, provides a global scene feature for downstream policies (Qi et al., 19 Sep 2025); see the encoder sketch after this list.
- Skill Policies: For low-level control, diffusion-based imitation learning is used, conditioned on the graph feature $z$, a language embedding (e.g., a CLIP encoding of the subgoal description), and a pose history. The denoising diffusion process iteratively refines noisy action samples $a^K \rightarrow a^{K-1} \rightarrow \cdots \rightarrow a^0$, producing robust, multi-modal action distributions. Training minimizes the mean squared error between predicted and true noise, as in the standard DDPM objective $\mathcal{L} = \mathbb{E}_{k,\epsilon}\,\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_k}\,a^0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\, k,\, c\big)\big\|^2$, with conditioning $c$ formed from the graph, language, and pose features (Qi et al., 19 Sep 2025); a training-loss sketch follows this list. Some frameworks also use hierarchical architectures in which a high-level planner decomposes a long-horizon task into a sequence of atomic skills and subgoals, often assigning a focused scene graph to each (Lu et al., 2022).
- Action Grounding: In manipulation, graph edit operations (e.g., INSERT, DELETE, SUBSTITUTE on the scene graph) correspond directly to robot actions (e.g., Pick, Place, Open/Close), with their sequential feasibility enforced via topological sorting over precedence constraints (Jiao et al., 2022); see the ordering sketch after this list.
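A minimal sketch of the graph encoder, assuming PyTorch Geometric's `GATConv` and `global_mean_pool`; layer sizes and feature dimensions are illustrative rather than those of the cited work:

```python
import torch
from torch_geometric.nn import GATConv, global_mean_pool

class SceneGraphEncoder(torch.nn.Module):
    """Two GAT layers followed by mean-pooling into a global scene feature z."""
    def __init__(self, node_dim=64, hidden=128, out_dim=256, heads=4):
        super().__init__()
        self.gat1 = GATConv(node_dim, hidden, heads=heads, concat=True)
        self.gat2 = GATConv(hidden * heads, out_dim, heads=1, concat=False)

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.gat1(x, edge_index))
        h = torch.relu(self.gat2(h, edge_index))
        return global_mean_pool(h, batch)   # z: one feature vector per graph

# Usage on a toy scene: 3 object nodes, 2 directed relation edges.
x = torch.randn(3, 64)                          # node (object) embeddings
edge_index = torch.tensor([[0, 0], [1, 2]])     # table->carrot, table->mug
batch = torch.zeros(3, dtype=torch.long)        # all nodes belong to scene 0
z = SceneGraphEncoder()(x, edge_index, batch)   # shape: (1, 256)
```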
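The conditional denoising training step can be sketched as follows; `eps_net` is a hypothetical noise-prediction network, and the conditioning layout (concatenating graph, language, and pose features) is illustrative:

```python
import torch

def ddpm_loss(eps_net, a0, z, lang_emb, pose_hist, alpha_bar):
    """One DDPM training step for a graph- and language-conditioned policy.

    a0:        (B, T, action_dim) expert action chunks
    z:         (B, d_g)  global scene-graph feature
    lang_emb:  (B, d_l)  CLIP embedding of the subgoal description
    pose_hist: (B, d_p)  flattened recent poses
    alpha_bar: (K,)      cumulative noise schedule
    """
    B, K = a0.shape[0], alpha_bar.shape[0]
    k = torch.randint(0, K, (B,))                    # random diffusion step
    ab = alpha_bar[k].view(B, 1, 1)
    eps = torch.randn_like(a0)                       # true noise
    a_k = ab.sqrt() * a0 + (1 - ab).sqrt() * eps     # noised actions
    cond = torch.cat([z, lang_emb, pose_hist], dim=-1)
    eps_pred = eps_net(a_k, k, cond)                 # predicted noise
    return torch.nn.functional.mse_loss(eps_pred, eps)
```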
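Finally, ordering graph-edit actions under precedence constraints reduces to a topological sort; a minimal sketch using Python's standard `graphlib` (the specific edits and constraints are illustrative):

```python
from graphlib import TopologicalSorter

# Precedence constraints among graph edits: each edit maps to the set of
# edits that must happen first (e.g., open the cabinet before placing).
constraints = {
    "DELETE(cup, on, table)": set(),                          # Pick(cup)
    "SUBSTITUTE(cabinet, closed, open)": set(),               # Open(cabinet)
    "INSERT(cup, in, cabinet)": {"DELETE(cup, on, table)",    # Place(cup)
                                 "SUBSTITUTE(cabinet, closed, open)"},
}
order = list(TopologicalSorter(constraints).static_order())
# -> feasible action sequence: Pick/Open (in either order), then Place
```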
3. Compositional Generalization and Robustness
One of the principal motivations is compositional generalization: the ability to sequence and combine atomic skills, each robustly executable across varied scenes, to solve previously unseen long-horizon tasks.
- Mitigating Distribution Shift: By focusing only on the task-relevant portion of the scene graph (nodes and relations most pertinent to the current step), the skill policy becomes less sensitive to irrelevant scene variability (“distractor” objects or spatial layouts) (Qi et al., 19 Sep 2025).
- Interfacing with High-Level Planners: High-level planners—typically VLMs such as ChatGPT-4V—decompose goals (e.g., “put all vegetables in the basket”) into atomic subskills (e.g., “pick up carrot”), dynamically constructing “focused” subgraphs for each subskill (Qi et al., 19 Sep 2025). This modular pipeline scales better than monolithic end-to-end policies and, as shown experimentally, maintains high success rates even as the number and complexity of subtasks grow; a planner-loop sketch follows this list.
- Parallel and Incremental Graph Operations: Parallel graph construction via attention mechanisms (rather than sequential parsing) leads to more efficient and accurate representations, as in the Attention Graph model, which lifts graph structure directly from the outputs of a Transformer (Andrews et al., 2019). Incremental scene graph expansion (e.g., insertions or edge modifications in response to language queries) enables granular, interpretable atomic skill tracking and localized updates (Hu et al., 2022).
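A schematic of this planner–skill loop, with `vlm_decompose`, `vlm_focus_objects`, `skill_policy`, and the `robot` interface as hypothetical stand-ins for the components described above:

```python
def execute_task(goal: str, scene_graph, vlm_decompose, vlm_focus_objects,
                 skill_policy, robot):
    """Hierarchical execution: decompose a goal into atomic subskills, focus
    the scene graph per subskill, and run the low-level policy on each."""
    subgoals = vlm_decompose(goal)             # e.g., ["pick up carrot", ...]
    for subgoal in subgoals:
        relevant = vlm_focus_objects(subgoal, scene_graph)   # task-relevant set
        focused = scene_graph.focus(relevant)                # drop distractors
        actions = skill_policy(focused, subgoal)             # diffusion policy
        robot.execute(actions)
        scene_graph = robot.observe()          # re-perceive after each skill
```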
4. Empirical Outcomes and Benchmarks
Empirical studies have established the practical benefits of scene graph-based atomic skill frameworks:
| Task/Benchmark | Structured Scene Graph Approach | Baseline/Alternative | Performance Trend |
|---|---|---|---|
| Long-horizon simulated manipulation (Qi et al., 19 Sep 2025) | Focused scene graph + diffusion policy | Flat diffusion/2D/3D-only | Success rates 0.78–0.93 vs. <0.50 for baselines |
| Real-world sequential tool usage (Qi et al., 19 Sep 2025) | Focused scene graph system | Unstructured policies | Maintains robustness with distractors/unseen obstacles |
| Multitask embodied AI (Lu et al., 2022) | Atomic skill + hierarchical arch. | Black-box monolithic nets | 2×–4× higher success in novel settings |
| Scene graph parsing (Andrews et al., 2019) | Attention Graph Transformer | Dependency tree/transition | F-score improved to 52.21% (SPICE metric, +2.5% over prior) |
| Surgical scene interpretation (Shin et al., 21 Jul 2025) | Action-centric scene graphs | Spatial graphs (no actions) | mAP: Triplet recognition up to 24.2 (vs. 18.0) |
As shown, explicit graph-based skill learning consistently yields improvements in both success rate and robustness compared to unstructured or sequence-based alternatives, especially as task, scene, or instruction complexity increases.
5. Extensions: Structured Evaluation, Tactile Feedback, and Sim2Real Dynamics
- Controllable Generation and Feedback: In generative tasks, the iteration of atomic skills is paralleled by scene graph–based evaluation (SGScore), which decomposes consistency into object and relationship recall, and a feedback refinement loop that “edits” generations to better match input graphs. This supports atomic operations such as selective relation correction (“move cup to be next to plate”) (Chen et al., 23 Nov 2024); a recall-style scoring sketch follows this list.
- Skill Transfer and Tactile Integration: In manipulation, skill libraries organized via task, scene, and state graphs facilitate transfer via high-level reasoning (using LLMs) and low-level adaptation (with path planning, adaptive contours from tactile sensors) (Qi et al., 18 Nov 2024). Tactile sensing is mapped back to the scene graph to inform contact-based atomic skills.
- Dynamic and Hierarchical Environments: Frameworks such as FOGMACHINE combine discrete-event simulation with dynamic scene graphs to simulate stochastic, multi-agent activities under partial observation, enabling agents to learn and benchmark atomic skills like efficient navigation, uncertainty-aware planning, and online belief update in large, hierarchical environments (Ohnemus et al., 10 Oct 2025).
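A sketch of such graph-consistency scoring, assuming the object/relation recall decomposition described above; the function and its inputs are illustrative, not the exact SGScore definition:

```python
def sg_score(target_nodes, target_edges, detected_nodes, detected_edges):
    """Decompose scene-graph consistency into object and relationship recall.

    target_*   : nodes/edges of the input scene graph
    detected_* : nodes/edges parsed back from the generated image
    Returns (object_recall, relation_recall).
    """
    obj_recall = len(set(target_nodes) & set(detected_nodes)) / len(target_nodes)
    rel_recall = len(set(target_edges) & set(detected_edges)) / len(target_edges)
    return obj_recall, rel_recall

# Usage: the generation missed the "next to" relation -> a candidate for the
# feedback loop's selective relation correction.
obj_r, rel_r = sg_score(
    ["cup", "plate"], [("cup", "next to", "plate")],
    ["cup", "plate"], [],
)   # obj_r = 1.0, rel_r = 0.0
```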
6. Challenges and Future Directions
Despite their success, several challenges persist:
- Scalability of Graph Reasoning: As the number of objects and relationships increases, efficient subgraph extraction and reasoning become bottlenecks. Future work will require advances in scalable GNN architectures and attention mechanisms for variable-size graphs (Qi et al., 19 Sep 2025, Kassab et al., 2 Dec 2024).
- Flexible Relation Modeling: Current frameworks often model limited edge/parent connectivities. Extensions—such as multi-head Transformers predicting multiple concurrent edges per node—are proposed to better model densely connected or multi-relation environments (Andrews et al., 2019).
- Integration of Heterogeneous Modalities: Skills must be robust not only to visual but also to semantic, tactile, and language cues. Methods that update the scene graph online using multi-modal data—such as gestures, speech, and tactile interaction—can further improve robustness in dynamic, open-world tasks (Shirasaka et al., 25 Jun 2025, Colombani et al., 22 Nov 2024).
- Atomicity and Skill Isolation: Recent evidence reveals that even state-of-the-art VLMs often lack robust atomic skill performance (e.g., 2D spatial reasoning). Purpose-built datasets and targeted pretraining for atomic skills are necessary to close the gap between synthetic compositional generalization and real-world applicability (Chae et al., 26 May 2025).
- Transfer and Sim2Real: Scene-graph centric architectures facilitate modular transfer, but aligning graph semantics and action primitives between simulation and real hardware remains a nontrivial open challenge (Qi et al., 18 Nov 2024, Ohnemus et al., 10 Oct 2025).
7. Applications and Impact
Scene graph-based atomic skills underpin major strides in robotics, vision-language interfaces, dynamic human–robot collaboration, and controllable generation in both simulated and real domains:
- Embodied AI and Robotics: Modular skill learning via scene graphs enables scalable multi-task learning, rapid adaptation to scene changes, and robust sequencing of actions, serving complex environments from domestic settings to surgery (Qi et al., 19 Sep 2025, Shin et al., 21 Jul 2025).
- Navigation and Spatial Reasoning: Agents leveraging graph-based spatial priors achieve higher sample efficiency, generalization, and path success in navigation benchmarks (Seymour et al., 2022, Zhang et al., 15 Oct 2024, Ma et al., 11 Aug 2025).
- Language-Conditioned Planning and Feedback: Graph-centric skill representations support fine-grained interpretability and atomicity in complex task decompositions, allowing LLM-guided reasoning and plan correction based on semantic map updates (Colombani et al., 22 Nov 2024, Chen et al., 23 Nov 2024).
- Real-time Adaptation and Simulation: Online updating of semantic scene graphs and their tight coupling to discrete-event simulation enable robust multi-agent systems that act under uncertainty and build atomic skills for dynamic real-world environments (Shirasaka et al., 25 Jun 2025, Ohnemus et al., 10 Oct 2025).
In summary, scene graph-based atomic skills combine structured, explainable perception and learning with modular, robust control, providing the architectural basis for scalable, generalist AI agents capable of compositional behavior across diverse, real-world domains.