Embodied Graph: Linking Perception & Action

Updated 4 July 2026

An embodied graph is a graph-structured representation that explicitly links sensorimotor data with memory, reasoning, and control.
Embodied graphs model diverse structures such as robot APIs, 3D scene hierarchies, and robot morphology, reflecting both external environments and internal states.
They support adaptive planning, real-time memory updates, and integrated control by transforming complex embodied experience into actionable graph form.

Embodied graph denotes a family of graph-structured representations in which the graph is coupled to embodied perception, memory, reasoning, planning, or control, rather than serving only as an offline symbolic artifact. Across recent work, the graph may encode robot APIs and task dependencies, online 3D metric-semantic scene structure, morphology–controller couplings, semantic-spatial memory, or multimodal concept knowledge. In each case, the graph mediates between a situated agent and the world it observes or acts upon, so that language, sensing, and action are organized through explicit relational structure rather than only through raw sequences or latent vectors (Chen et al., 18 Feb 2025, Saxena et al., 2024, Wang et al., 20 Mar 2026).

1. Scope and terminological boundaries

The term is not standardized. In robotics and embodied AI, “embodied graph” usually refers to a graph that is grounded in sensorimotor interaction, online memory, or executable action. In the cited literature, this includes text-attributed API graphs for medical robotics, hierarchical 3D scene graphs for embodied question answering, observation-centric graphs for retrieval and exploration, semantic-spatial memory graphs for robot interaction, and morphology graphs for soft-robot co-design (Chen et al., 18 Feb 2025, Saxena et al., 2024, Lee et al., 23 Jun 2026, Riva et al., 20 Apr 2026, Wang et al., 20 Mar 2026).

A common source of confusion is the distinction between embodied graph and embedded-graph. “Embedded-graph theory” defines a graph $G=(V,E,X)$ in which each edge is assigned an embedding vector $X(e)\in\mathbb{R}^d$, so the emphasis is semantic vectorization of relations rather than physical grounding or sensorimotor coupling (Yokoyama, 2017). This is relevant to representation learning, but it is not an embodiment framework in the robotics sense.

A second boundary concerns embodied-symbolic formulations. The conceptual program of dual embodied-symbolic concept representations treats the embodied level as modality-specific feature vectors and the symbolic level as concept graphs or knowledge graphs, arguing that both are needed for deep learning and symbolic AI integration (Chang, 2022). This broadens the notion of embodied graph beyond mobile robots: a graph can be “embodied” whenever graph-structured symbolic knowledge is explicitly linked to grounded perceptual or structural representations. The chemistry formulation that pairs molecular graphs with chemical knowledge graphs is a concrete instance of this broader usage (Chang, 2022).

2. Representational forms

Across the literature, embodied graphs recur in several canonical forms. Some graphs encode world structure; some encode action structure; some encode body structure; and some encode memory structure. Representative formulations include $G=(V,E,T)$ for a text-attributed API/task graph in robotic ultrasound, $\mathbf{S}=(\mathbf{O},\mathbf{E})$ for a multi-modal 3D scene graph in zero-shot navigation, $G=(\mathcal{N},\mathcal{E})$ for a mutable scene graph in episodic-memory question answering, and $G=(V,E)$ for a morphology graph in soft robotics (Chen et al., 18 Feb 2025, Huang et al., 13 Nov 2025, Ali et al., 1 Jun 2025, Wang et al., 20 Mar 2026).

Formulation	Graph role	Embodied coupling
$G=(V,E,T)$	API/task/planning graph	Constrains executable robot actions
$\mathbf{S}=(\mathbf{O},\mathbf{E})$	Multi-modal 3D scene graph	Built from RGB-D exploration
$G=(\mathcal{N},\mathcal{E})$	Mutable scene memory	Updated during task inference
$G=(V,E)$	Morphology graph	Body-aware control and transfer

The simplest embodied graph stores entities and pairwise relations, but recent systems generally exceed that baseline. M3DSG preserves visual cues by replacing textual relational edges with dynamically assigned images, so an edge is not only a predicate but also a set of embodied observations showing the relation (Huang et al., 13 Nov 2025). ObsGraph uses a room–view–object hierarchy in which room nodes provide semantic anchors, view nodes preserve contextual object co-visibility, and object nodes retain fine-grained visual evidence (Lee et al., 23 Jun 2026). GraphPad extends the usual scene graph into a mutable workspace containing a 3D scene graph, navigation log, scratch-pad, and frame memory, all exposed through callable APIs (Ali et al., 1 Jun 2025).

A notable pattern is that node and edge semantics are rarely purely symbolic. In scene-centric systems, nodes often store 3D position, point cloud, room affiliation, captions, image provenance, or visibility histories, while edges may encode traversability, containment, attachment, support, API dependency, or object co-occurrence (Saxena et al., 2024, Ali et al., 1 Jun 2025, Huang et al., 13 Nov 2025). In morphology-centric systems, node features combine local and global embodiment information, and edge features may include spatial offsets between connected nodes (Wang et al., 20 Mar 2026). This suggests that embodied graphs are best understood as relational carriers of grounded state, not merely as symbolic adjacency structures.
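To make the idea of a "relational carrier of grounded state" concrete, the following is a minimal sketch of a node and edge layout that keeps metric and provenance attributes next to the symbolic relation. The class names (`Node`, `Edge`, `GroundedGraph`), the field names, and the example objects are illustrative assumptions and are not taken from any of the cited systems.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    node_id: str
    label: str
    position: Tuple[float, float, float]      # metric 3D position
    room: Optional[str] = None                # room affiliation, if known
    seen_in_frames: List[int] = field(default_factory=list)  # image provenance


@dataclass
class Edge:
    source: str
    target: str
    relation: str                             # e.g. "support", "containment"


class GroundedGraph:
    """Hypothetical container: symbolic relations plus grounded node state."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: List[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, target: str, relation: str) -> None:
        # Refuse relations between entities that are not in the graph yet.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.edges.append(Edge(source, target, relation))

    def related(self, node_id: str, relation: str) -> List[Node]:
        """Return the targets reachable from node_id via the given relation."""
        return [self.nodes[e.target] for e in self.edges
                if e.source == node_id and e.relation == relation]


graph = GroundedGraph()
graph.add_node(Node("table_1", "table", (1.0, 2.0, 0.0), room="kitchen", seen_in_frames=[3, 4]))
graph.add_node(Node("mug_1", "mug", (1.1, 2.0, 0.8), room="kitchen", seen_in_frames=[4]))
graph.add_edge("table_1", "mug_1", "support")
```

A query such as `graph.related("table_1", "support")` returns full `Node` records, so a downstream planner receives position and frame provenance together with the relation rather than a bare label.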

3. Embodied memory and world modeling

A major research line uses graphs as persistent embodied memory. In GraphEQA, the agent incrementally constructs a real-time 3D metric-semantic scene graph with Hydra, enriches room nodes with semantic labels, and adds frontier nodes connected to nearby objects so that unexplored space becomes semantically queryable rather than geometrically anonymous (Saxena et al., 2024). The graph is then serialized into structured language for a VLM planner, which alternates between object-directed revisitation and frontier-guided exploration.
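The frontier-linking idea can be sketched in a few lines. This is not GraphEQA's implementation: the distance radius, the dictionary layout, the JSON serialization, and the function names `link_frontiers` and `serialize` are all assumptions chosen for illustration.

```python
import json
import math


def link_frontiers(objects, frontiers, radius=2.0):
    """Map each frontier id to the sorted names of objects within `radius` metres."""
    return {
        fid: sorted(name for name, pos in objects.items()
                    if math.dist(fpos, pos) <= radius)
        for fid, fpos in frontiers.items()
    }


def serialize(objects, frontiers, links):
    """Flatten the graph into JSON text that could be placed in a planner prompt."""
    return json.dumps(
        {"objects": objects, "frontiers": frontiers, "frontier_links": links},
        sort_keys=True,
    )


# Hypothetical scene: two known objects and two unexplored frontiers.
objects = {"sofa": (0.0, 0.0, 0.0), "fridge": (5.0, 0.0, 0.0)}
frontiers = {"frontier_0": (1.0, 0.0, 0.0), "frontier_1": (5.0, 1.0, 0.0)}
links = link_frontiers(objects, frontiers)
prompt_text = serialize(objects, frontiers, links)
```

With links of this kind, a planner can reason that a frontier next to the fridge is a more promising place to look for kitchen items than one next to the sofa, which is the sense in which unexplored space becomes semantically queryable.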

GraphPad makes the memory graph explicitly editable at inference time. Its Structured Scene Memory combines a mutable 3D scene graph, a navigation log, a graphical scratch-pad, and frame memory; a VLM can call find_objects, analyze_objects, or analyze_frame to insert nodes, edges, or notes as the question demands (Ali et al., 1 Jun 2025). The key point is that the graph is not fixed before the task is known. It is revised after inspection of selected frames, which converts memory from a static scene summary into a task-conditioned workspace.

ObsGraph pushes this memory perspective further by making the graph observation-centric and hierarchical. Retrieval proceeds coarse-to-fine over room, view, and object layers under a bounded budget, and the abstraction level of retrieved evidence determines the next exploration mode: room exploration, view refinement, or frontier exploration (Lee et al., 23 Jun 2026). Here the graph does not merely store what has been seen; it structures what should be explored next.
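Coarse-to-fine retrieval under a bounded budget can be illustrated with a greedy descent over a room, view, and object hierarchy. The sketch below is an assumption-laden simplification: the nested-dictionary layout, the greedy single-path descent, the budget counted in scored nodes, and the name `coarse_to_fine` are hypothetical, and the relevance function is supplied by the caller.

```python
def coarse_to_fine(hierarchy, score, budget):
    """Greedy descent through {room: {view: [objects]}}.

    `score` maps a node name to a relevance value; `budget` caps how many
    nodes may be scored in total. Returns the (level, name) path reached.
    """
    path = []
    remaining = budget
    node = hierarchy
    candidates = list(hierarchy)              # start at the room layer
    for level in ("room", "view", "object"):
        scored = [(score(name), name) for name in candidates[:remaining]]
        if not scored:                        # budget exhausted or layer empty
            break
        remaining -= len(scored)
        _, choice = max(scored)
        path.append((level, choice))
        if level == "object":
            break
        node = node[choice]                   # descend one layer
        candidates = list(node)
    return path


# Hypothetical memory and hand-written relevance scores for a kettle query.
hierarchy = {
    "kitchen": {"view_1": ["mug", "kettle"], "view_2": ["fridge"]},
    "bedroom": {"view_3": ["bed"]},
}
relevance = {"kitchen": 0.9, "bedroom": 0.1, "view_1": 0.8, "view_2": 0.3,
             "mug": 0.7, "kettle": 0.95}
full_path = coarse_to_fine(hierarchy, lambda n: relevance.get(n, 0.0), budget=10)
short_path = coarse_to_fine(hierarchy, lambda n: relevance.get(n, 0.0), budget=3)
```

The depth of the returned path is the abstraction level of the retrieved evidence. With a tight budget the descent stops at a view instead of an object, and that coarser result is the kind of signal that could select a different exploration mode.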

EmbodiedLGR adopts a lighter-weight design. Its graph memory is an object-centric semantic-spatial-temporal memory built from atomic tuples in which an object-label embedding is bound to pose and time (Riva et al., 20 Apr 2026). Richer scene semantics are kept separately in a vector database. This hybrid architecture shows a different embodied-graph trade-off: reduce redundancy and query latency by reserving the graph for atomic grounded facts while delegating broader semantic descriptions to retrieval.
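A memory of atomic grounded facts can be sketched as a list of records that bind a label embedding to a pose and a timestamp. The example below is a hypothetical illustration, not EmbodiedLGR's code: the two-dimensional toy embeddings, the cosine threshold, and the names `Fact`, `cosine`, and `latest_match` are assumptions.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Fact:
    label_embedding: Tuple[float, ...]   # embedding of the object label
    pose: Tuple[float, float, float]     # where the object was observed
    timestamp: float                     # when it was observed


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def latest_match(memory, query_embedding, threshold=0.9):
    """Return the most recent fact whose label embedding matches the query."""
    hits = [f for f in memory if cosine(f.label_embedding, query_embedding) >= threshold]
    return max(hits, key=lambda f: f.timestamp) if hits else None


# Toy embeddings: (1, 0) stands for "mug", (0, 1) for "chair".
memory = [
    Fact((1.0, 0.0), (1.0, 2.0, 0.0), timestamp=10.0),
    Fact((1.0, 0.0), (3.0, 1.0, 0.0), timestamp=25.0),
    Fact((0.0, 1.0), (0.0, 0.0, 0.0), timestamp=30.0),
]
where_is_mug = latest_match(memory, (0.9, 0.1))
```

Keeping each record this small is what makes lookups cheap; longer scene descriptions would live in a separate retrieval store, as the paragraph above describes.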

In dynamic outdoor navigation, CausalNav uses a multi-level semantic scene graph called the Embodied Graph, integrating building nodes from offline maps, fine-grained object nodes from online perception, ego-vehicle trajectory nodes, and hierarchical cluster nodes derived by LLM summarization (Duan et al., 5 Jan 2026). Dynamic objects are filtered through a spatial-temporal corridor mechanism, and the graph becomes a retrievable semantic-topological memory for long-range navigation under open-vocabulary queries.

LookPlanGraph applies the same memory principle to instruction following. It initializes a graph with places, assets, and uncertain object priors, marks prior objects as unseen, and then updates a memory graph online through VLM-based graph augmentation during discover_objects actions (Onishchenko et al., 24 Dec 2025). The graph is therefore both prior memory and corrigible world state.

4. Planning, control, and executable action structure

A second major line treats the embodied graph as the interface between language and constrained action generation. USPilot is exemplary: robotic ultrasound capabilities are represented as a text-attributed task graph $G=(V,E,T)$, where vertices are robotic ultrasound APIs and edges encode dependency relations (Chen et al., 18 Feb 2025). The planner, LLMEG, performs node selection with an LLM-enhanced GNN and then uses an LLM-based subgraph generator to assign execution order. The graph narrows the planning space to legal capabilities and legal dependencies, avoiding free-form low-level control generation in a safety-critical medical domain.
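The way a dependency graph constrains execution order can be shown with a plain topological sort. This is a stand-in, not USPilot's method: the paper assigns order with an LLM-based subgraph generator, whereas the sketch below uses Python's `graphlib`. The API names, the dependency table, and the choice to pull in all prerequisites of the selected nodes are hypothetical.

```python
from graphlib import TopologicalSorter


def order_selected(dependencies, selected):
    """Return a legal execution order for `selected` APIs and their prerequisites.

    `dependencies` maps each API name to the set of APIs that must run first.
    """
    needed = set()
    stack = list(selected)
    while stack:                              # collect transitive prerequisites
        api = stack.pop()
        if api in needed:
            continue
        needed.add(api)
        stack.extend(dependencies.get(api, ()))
    subgraph = {api: set(dependencies.get(api, ())) for api in sorted(needed)}
    return list(TopologicalSorter(subgraph).static_order())


# Hypothetical API graph; none of these names come from the cited system.
dependencies = {
    "localize_patient": set(),
    "apply_gel": set(),
    "move_to_pose": {"localize_patient"},
    "scan": {"move_to_pose", "apply_gel"},
    "save_image": {"scan"},
}
plan = order_selected(dependencies, ["scan"])
```

Because the order is derived from the graph, a plan that scans before moving to the pose cannot be produced. `TopologicalSorter` also raises `graphlib.CycleError` if the dependency table is circular.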

GiG generalizes this action-centric view through a Graph-in-Graph architecture (Li et al., 29 Jan 2026). The inner graph is a scene graph encoding the current embodied world state; a GAT maps each scene graph to an embedding, and these embeddings become nodes in an outer execution-trace graph whose edges are action labels. Retrieval operates over structurally similar prior graph states, and bounded lookahead uses symbolic transition logic to enumerate grounded successor states. The resulting planner is graph-centric at both the world-model and experience-memory levels.

Hypothesis Graph Refinement extends planning graphs into the epistemic domain. It defines a persistent graph with observed nodes, hypothesis nodes, navigability edges, and a dependency DAG used for verification-driven cascade correction (Chen et al., 5 Apr 2026). Frontier-conditioned semantic predictions are represented as revisable hypothesis nodes rather than facts. When on-site observations contradict a prediction, the refuted node and all of its descendants in the dependency DAG are pruned. This makes the embodied graph non-monotonic: it can grow by prediction and shrink by correction.
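Cascade correction amounts to removing a refuted node together with everything derived from it. The following sketch assumes the dependency DAG is stored as a parent-to-children mapping. The function name `cascade_prune`, the node names, and the data layout are illustrative and not taken from the paper.

```python
from collections import deque


def cascade_prune(nodes, derived_from, refuted):
    """Remove `refuted` and all of its descendants in the dependency DAG.

    `derived_from` maps a node to the set of hypothesis nodes derived from it.
    Returns the surviving node set and the surviving dependency mapping.
    """
    doomed = {refuted}
    queue = deque([refuted])
    while queue:                              # breadth-first walk over descendants
        current = queue.popleft()
        for child in derived_from.get(current, ()):
            if child not in doomed:
                doomed.add(child)
                queue.append(child)
    kept_nodes = nodes - doomed
    kept_deps = {
        parent: {c for c in children if c not in doomed}
        for parent, children in derived_from.items()
        if parent not in doomed
    }
    return kept_nodes, kept_deps


# "hall" is observed; the "h_" nodes are hypotheses predicted beyond a frontier.
nodes = {"hall", "h_kitchen", "h_fridge", "h_milk", "h_bedroom"}
derived_from = {
    "hall": {"h_kitchen", "h_bedroom"},
    "h_kitchen": {"h_fridge"},
    "h_fridge": {"h_milk"},
}
kept_nodes, kept_deps = cascade_prune(nodes, derived_from, "h_kitchen")
```

Refuting the kitchen hypothesis also removes the fridge and milk hypotheses that depended on it, while the unrelated bedroom hypothesis survives. Lowering the confidence of the kitchen node alone would have left both false downstream inferences in place.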

Embodied manipulation under occlusion motivates a simpler but still action-coupled graph. In cluttered Manipulation Question Answering, a dynamic scene graph is rebuilt after each push action, aligning graph snapshots across timesteps so that active exploration reveals previously hidden nodes and relations (Deng et al., 2022). Graph reasoning itself is symbolic (enumeration and BFS rather than GNN inference), but the graph is nonetheless action-conditioned and updated by the robot’s interventions.

Embodied graph construction can itself be the control objective. In embodied semantic scene graph generation, navigation policy is optimized to maximize node recall of the evolving global semantic scene graph under a 40-step horizon, with PPO-based control, stagnation signals derived from graph embedding change, and explicit stop rewards (Kueble et al., 26 Mar 2026). In this formulation, the graph is not just a planning aid; it is the world model whose completeness is being optimized.
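Node recall and a recall-based step reward can be written down directly. The sketch below is a simplified assumption, not the paper's reward: it uses only the per-step gain in recall and omits the stagnation signals and stop rewards mentioned above. The function names and the toy episode are hypothetical.

```python
def node_recall(discovered, ground_truth):
    """Fraction of ground-truth nodes present in the discovered set."""
    truth = set(ground_truth)
    if not truth:
        return 0.0
    return len(set(discovered) & truth) / len(truth)


def recall_gain_rewards(discovered_per_step, ground_truth, horizon=40):
    """Per-step reward equal to the increase in node recall, capped at `horizon` steps."""
    rewards = []
    previous = 0.0
    for discovered in discovered_per_step[:horizon]:
        current = node_recall(discovered, ground_truth)
        rewards.append(current - previous)
        previous = current
    return rewards


# Toy episode: the graph grows from one to three of four ground-truth nodes.
ground_truth = {"a", "b", "c", "d"}
episode = [{"a"}, {"a", "b"}, {"a", "b"}, {"a", "b", "d"}]
rewards = recall_gain_rewards(episode, ground_truth)
```

Under this shaping the rewards sum to the final recall, so a step that adds no node earns nothing. That is one reason a separate stagnation signal is useful for deciding when to stop.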

5. Morphology and co-design

In soft robotics, embodied graph refers to a graph representation of the robot’s own body. The co-design framework for soft robots represents each morphology as $G=(V,E)$, with nodes corresponding to position sensors and edges capturing spatial adjacency in EvoGym (Wang et al., 20 Mar 2026). Node features combine global properties such as orientation with local information such as coordinates, voxel type, and velocity; edge features include relative offsets between adjacent nodes. A Graph Attention Network performs one attention-based message passing round, pooled node embeddings feed lightweight MLP heads, and actor–critic policies are optimized with PPO inside a genetic algorithm loop.
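One round of attention-based message passing followed by pooling can be sketched in pure Python. This follows the standard single-head graph attention formulation and is not the paper's network: it ignores edge features such as relative offsets, omits the output nonlinearity, and uses hand-set weights. The names `gat_round` and `mean_pool` are illustrative.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def gat_round(features, neighbors, W, a):
    """Single-head graph attention update.

    features:  {node: feature vector}
    neighbors: {node: non-empty list of neighbor nodes, self-loop included}
    W:         projection matrix (out_dim x in_dim)
    a:         attention vector of length 2 * out_dim
    """
    projected = {n: matvec(W, h) for n, h in features.items()}
    updated = {}
    for i, nbrs in neighbors.items():
        # Unnormalized attention logit for each neighbor j of node i.
        logits = [leaky_relu(sum(ak * v for ak, v in zip(a, projected[i] + projected[j])))
                  for j in nbrs]
        top = max(logits)                     # subtract max for numerical stability
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        alphas = [e / total for e in exps]    # softmax over the neighborhood
        dim = len(projected[i])
        updated[i] = [sum(alpha * projected[j][k] for alpha, j in zip(alphas, nbrs))
                      for k in range(dim)]
    return updated


def mean_pool(embeddings):
    """Average node embeddings into one graph-level vector."""
    vectors = list(embeddings.values())
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(len(vectors[0]))]


# Three-node chain 0 - 1 - 2 with self-loops and identity projection.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
identity = [[1.0, 0.0], [0.0, 1.0]]
uniform = gat_round(features, neighbors, identity, a=[0.0, 0.0, 0.0, 0.0])
pooled = mean_pool(uniform)
```

Because the update is defined per node over its neighborhood, the same weights `W` and `a` apply to a body with any number of nodes. This is the property that lets a controller follow a morphology as it mutates.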

The central significance is not merely that a GNN is used for control. The graph is the embodiment model through which morphology changes become computable. A topology-consistent inheritance mechanism maps controller parameters across mutations by reusing shared GAT layers and hidden MLP layers while copying matched actuator outputs and randomly initializing unmatched ones. This allows learned control to survive body changes more gracefully than fixed-input/output MLPs. The paper explicitly frames graph-structured policies as “an effective interface between evolving bodies and brains,” which is one of the clearest formulations of embodied graph as morphology-aware control structure (Wang et al., 20 Mar 2026).
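The output-layer part of this inheritance can be sketched as a copy-or-reinitialize rule. The example assumes actuators are matched by grid coordinate and that the output head stores one weight vector per actuator. Both are assumptions made for illustration, as are the name `inherit_output_head` and the initialization range.

```python
import random


def inherit_output_head(parent_head, child_actuators, hidden_dim, rng):
    """Build a child's output head from its parent's.

    parent_head:     {actuator key: weight vector} learned by the parent
    child_actuators: actuator keys present in the mutated child body
    Matched actuators copy the parent's weights; new ones start random.
    Actuators that no longer exist in the child are dropped.
    """
    child_head = {}
    for key in child_actuators:
        if key in parent_head:
            child_head[key] = list(parent_head[key])   # copy, do not alias
        else:
            child_head[key] = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
    return child_head


# Hypothetical mutation: actuator at (1, 0) is removed, one at (2, 0) is added.
parent_head = {(0, 0): [1.0, 2.0], (1, 0): [3.0, 4.0]}
child_head = inherit_output_head(parent_head, [(0, 0), (2, 0)], hidden_dim=2,
                                 rng=random.Random(0))
```

The shared graph-attention and hidden layers would be reused unchanged, so only the weights for newly added actuators need to be learned from scratch after a mutation.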

A plausible implication is that embodied-graph research spans at least two distinct but compatible senses of embodiment: world-grounded graphs, where the graph models an external environment, and body-grounded graphs, where the graph models the agent’s own morphology. The literature on soft-robot co-design makes this distinction explicit without separating it from the broader embodied-intelligence agenda.

6. Multimodal and knowledge-augmented formulations

A broader line of work treats embodied graph as the coupling of grounded representations with symbolic graph knowledge. The dual embodied-symbolic framework argues that concept representations should include both embodied feature vectors and symbolic concept graphs or knowledge-graph embeddings, with examples including scene graph generation with knowledge-graph bridging and multimodal knowledge graphs (Chang, 2022). In this view, a graph is embodied when graph-structured symbols are explicitly linked to modality-specific features.

Molecular graph learning provides a structurally grounded version of this idea. “Embodied-Symbolic Contrastive Graph Self-Supervised Learning for Molecular Graphs” treats the molecular graph as the embodied representation and chemical knowledge graphs as symbolic augmentation (Chang, 2022). Semantic augmentation adds property nodes or moiety nodes to the molecular graph, and exemplar-based contrastive learning aligns the original molecular graph with its semantically augmented counterpart. Here “embodied” does not mean sensorimotor robotics; it means grounding in the molecule’s concrete graph structure.

Scene-MMKG extends the same logic to embodied robotics by constructing a scene-driven multimodal knowledge graph that combines conventional knowledge engineering with LLM-assisted schema design (Yaoxian et al., 2023). Its instantiated ManipMob-MMKG supports manipulation and mobility through scene-bounded textual and visual knowledge, which is retrieved, denoised against the current observation, encoded with a GCN, and injected into downstream models without major architectural redesign. This makes the graph an explicit scene-knowledge substrate rather than a generic commonsense resource.

Long-horizon egocentric video understanding adopts yet another variant. FocusGraph converts clips into graph-based scene captions, embeds those captions, and uses a Scene-Caption LLM Selector to retrieve relevant clips before training-free keyframe extraction (Zemskova et al., 4 Mar 2026). The graph here is textual and clip-level rather than geometric, but it still functions as a compact episodic memory for embodied experience.

Taken together, these works show that embodied graph can denote a continuum from physically situated robotic scene graphs to embodied-symbolic graph learning in domains where the “body” is a structured substrate such as a molecule. The common property is not a single modality, but the explicit linking of graph structure to grounded evidence.

7. Limitations, controversies, and open directions

Several limitations recur. First, many embodied graphs remain hand-defined or small-scale. USPilot’s deployed robot uses a graph with 21 APIs and 24 edges, and the paper explicitly notes that scalability to richer clinical workflows remains untested (Chen et al., 18 Feb 2025). Similar concerns appear in scene-memory systems where graph growth increases prompt size and reasoning cost (Ali et al., 1 Jun 2025).

Second, graph quality is often bottlenecked by perception and LLMs. GraphPad notes the lack of a verification mechanism once incorrect objects or relations are inserted (Ali et al., 1 Jun 2025). LookPlanGraph reports VLM graph-extraction errors, especially with repeated instances, and identifies erroneous discovery actions as a primary cause of failures (Onishchenko et al., 24 Dec 2025). MSGNav emphasizes that even with M3DSG, VFM and VLM inference latency remains a major barrier to real-time deployment (Huang et al., 13 Nov 2025).

Third, dynamic embodied graphs face the problem of error accumulation. Hypothesis Graph Refinement is explicit that confidence attenuation alone cannot resolve structurally wrong predictions; it introduces cascade correction precisely because additive graph growth can preserve false downstream inferences (Chen et al., 5 Apr 2026). A related concern appears implicitly in any system that stores generated notes or inferred room labels without formal verification.

Fourth, many systems still separate graph reasoning from full closed-loop replanning. USPilot notes that the current system lacks replanning ability, even though low-level scanning adjusts probe orientation and force during execution (Chen et al., 18 Feb 2025). This suggests a gap between graph-conditioned high-level planning and graph updates driven by downstream task success or failure.

Finally, there remains a conceptual controversy over the term itself. In some papers, “embodied graph” refers to online scene graphs for mobile robots; in others, it refers to morphology graphs, semantic-spatial memory graphs, clip-level scene-caption graphs, or embodied-symbolic graph pairings. The contrast with “embedded-graph” in edge-embedding theory makes the ambiguity sharper (Yokoyama, 2017). A reasonable synthesis is that embodied graph is not a single formalism but a research pattern: graph structure is used as the explicit interface by which grounded experience, memory, and action constrain one another.

Current directions point toward richer and more adaptive graph substrates: unified graphs that model both sensors and actuators in evolving morphologies (Wang et al., 20 Mar 2026), hybrid graph-plus-retrieval memories for low-latency robot interaction (Riva et al., 20 Apr 2026), editable scene graphs that remain task-conditioned at inference time (Ali et al., 1 Jun 2025), and revisable hypothesis graphs that support non-monotonic memory correction (Chen et al., 5 Apr 2026). Across these directions, the defining ambition is stable: to make relational structure an operational component of embodied intelligence rather than a post hoc description of it.