Semantic-Geometric Task Graph Representation
- Semantic-geometric task graphs are structured models integrating high-level semantic descriptors with precise geometric attributes for detailed spatial reasoning.
- They dynamically capture spatial, physical, and temporal relations through nodes and edges, enabling real-time scene understanding and autonomous task planning.
- This integrated framework advances robotics by supporting planning, manipulation, and skill transfer through joint optimization of semantic cues and geometric data.
A semantic-geometric task graph representation is a structured data model that integrates semantic, geometric, and often temporal information in a graph-based format, supporting decision making, reasoning, planning, and transfer in robotics, embodied AI, and scene understanding. Each node typically encodes both high-level object/region semantics and precise geometric descriptors, while edges capture spatial, physical, or task-oriented relations, often with learned or symbolic attributes. Recent research has advanced this framework from static offline representations to online, dynamic, multimodal, and temporally evolving scene/task graphs that drive autonomous task execution, manipulation, navigation, and skill transfer.
1. Formal Graph-Based Definitions and Core Structures
At its foundation, a semantic-geometric task graph is an attributed directed or undirected graph $G = (V, E)$, where $V$ is the set of nodes (objects, locations, agents, body parts, regions) and $E \subseteq V \times R \times V$ is the set of typed edges with relation vocabulary $R$ (e.g., "on", "next_to", "support", "contains"). Each node carries a feature vector $\mathbf{x}_v = [\mathbf{p}_v, \mathbf{g}_v, \mathbf{s}_v]$, concatenating 3D position $\mathbf{p}_v \in \mathbb{R}^3$, geometric/shape attributes $\mathbf{g}_v$, and semantic descriptors $\mathbf{s}_v$ (such as CLIP embeddings or one-hot class labels). Edge features $\mathbf{e}_{uv} = [\mathbf{x}_u, \mathbf{x}_v, \mathbf{r}_{uv}]$ include the endpoint node features and a relation embedding $\mathbf{r}_{uv}$ tuned for each predicate type (Shirasaka et al., 25 Jun 2025).
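This node/edge structure can be sketched minimally in code. The class names (`Node`, `SemGeoGraph`), the attribute layout, and the toy embedding values are illustrative assumptions, not the API of any cited system:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Object/region node: 3D position p, geometric attrs g, semantic vector s."""
    node_id: str
    p: tuple   # 3D position (x, y, z)
    g: dict    # geometric/shape attributes, e.g. {"bbox": (dx, dy, dz)}
    s: list    # semantic descriptor, e.g. a CLIP-style embedding (toy values here)

@dataclass
class SemGeoGraph:
    """Attributed graph G = (V, E) with typed, scored edges (u, relation, v)."""
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: dict = field(default_factory=dict)   # (u, v) -> {relation: score}

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def relate(self, u, relation, v, score=1.0):
        self.edges.setdefault((u, v), {})[relation] = score

    def neighbors(self, u, relation=None):
        return [v for (a, v), rels in self.edges.items()
                if a == u and (relation is None or relation in rels)]

g = SemGeoGraph()
g.add_node(Node("cup", (0.4, 0.1, 0.8), {"bbox": (0.08, 0.08, 0.10)}, [0.2, 0.7]))
g.add_node(Node("table", (0.5, 0.0, 0.7), {"bbox": (1.2, 0.8, 0.05)}, [0.9, 0.1]))
g.relate("cup", "on", "table", score=0.97)
print(g.neighbors("cup", "on"))  # ['table']
```

The dictionary-of-dictionaries edge store mirrors the adjacency-list structures mentioned later for real-time updates: relations can be added, rescored, or removed per edge without touching the rest of the graph.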
Many frameworks expand this base:
- Contact graph+ (cg⁺): Adds mesh-level or bounding-box primitives and support/containment surfaces, plus container "opened/closed" attributes, and encodes spatial support/precedence relations central to sequential manipulation planning (Jiao et al., 2022).
- Hierarchical scene-task-state graph triplets: Distinguish a task graph (procedure DAG), scene graph (world geometry/kinematics), and state graph (requirements/observations linking task and scene) for skill libraries and transfer (Qi et al., 2024).
- Spatial-temporal task graphs: Model framewise temporal evolution, with node features as histories and edge features as multi-hot semantic/physical relations over trajectories (Herbert et al., 16 Jan 2026).
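The scene-task-state triplet described above can be sketched as three linked structures; the dictionary layout, subtask names, and `ready` predicate are hypothetical simplifications, not the cited framework's representation:

```python
# Hypothetical sketch of the task/scene/state graph triplet: a task DAG
# ("what"), a scene graph ("where"), and a state graph linking the two
# via requirement edges.
task_graph = {                       # procedure DAG: subtask -> successors
    "grasp_cup": ["place_cup"],
    "place_cup": [],
}
scene_graph = {                      # world geometry/kinematics
    "cup":   {"pose": (0.4, 0.1, 0.8), "class": "mug"},
    "shelf": {"pose": (0.9, 0.0, 1.2), "class": "shelf"},
}
state_graph = {                      # requirement edges: subtask -> scene entities
    "grasp_cup": {"requires": ["cup"]},
    "place_cup": {"requires": ["cup", "shelf"]},
}

def ready(subtask):
    """A subtask is executable when all required scene entities are observed."""
    return all(e in scene_graph for e in state_graph[subtask]["requires"])

print([t for t in task_graph if ready(t)])  # ['grasp_cup', 'place_cup']
```

Keeping the three graphs separate is what enables transfer: the task DAG is reusable as-is, while the state graph is re-grounded against whatever scene graph the new environment provides.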
2. Semantic and Geometric Encoding in Nodes and Edges
Joint semantic-geometric encoding is central:
- Semantic attributes ($\mathbf{s}_v$): Typically learned embeddings (CLIP, LLM, or language-model features), one-hot class encodings, or symbolic taxonomies. Tasks, skills, predicates, and roles are often factored as one-hot IDs or softmax embeddings (Shirasaka et al., 25 Jun 2025, Qi et al., 2024).
- Geometric attributes ($\mathbf{p}_v$, $\mathbf{g}_v$): Include 3D position, orientation, object bounding boxes (size, centroid), convex hulls, normals, or even rigid-body pose matrices $T \in SE(3)$. At higher spatial resolution, mesh/point-cloud segments support fine manipulation and contact (Jiao et al., 2022, Li et al., 24 Sep 2025).
- Edge relations: Encoded as tuples $(u, r, v)$, where $r$ is a binary or higher-order spatial, physical, or procedural relation (e.g., "on", "adjacent", "contact", "reachable"). Edges may carry soft confidence scores, symbolic attributes, or learned neural embeddings (e.g., ELMO, GNN outputs) (Shirasaka et al., 25 Jun 2025, Millan-Romera et al., 2024). For SLAM, edges can encode geometric constraint factors optimized for pose/structure consistency (Millan-Romera et al., 2024).
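Analytic spatial predicates of this kind can be computed directly from node geometry. The following sketch derives soft "on" and "next_to" scores from centroids and axis-aligned bounding boxes; the kernel form and all thresholds (`tol`, `radius`) are illustrative assumptions:

```python
import math

def on(p_a, bbox_a, p_b, bbox_b, tol=0.02):
    """Soft 'on' predicate: a rests on b if a's bottom face sits near b's top
    face and their horizontal footprints overlap. Thresholds are illustrative."""
    bottom_a = p_a[2] - bbox_a[2] / 2
    top_b = p_b[2] + bbox_b[2] / 2
    vertical_gap = abs(bottom_a - top_b)
    overlap_xy = (abs(p_a[0] - p_b[0]) < (bbox_a[0] + bbox_b[0]) / 2 and
                  abs(p_a[1] - p_b[1]) < (bbox_a[1] + bbox_b[1]) / 2)
    # Kernelized score in [0, 1]: 1 at contact, decaying with the gap.
    return math.exp(-vertical_gap / tol) if overlap_xy else 0.0

def next_to(p_a, p_b, radius=0.5):
    """Soft proximity predicate from horizontal centroid distance."""
    d = math.dist(p_a[:2], p_b[:2])
    return max(0.0, 1.0 - d / radius)

cup   = ((0.4, 0.1, 0.775), (0.08, 0.08, 0.10))
table = ((0.5, 0.0, 0.70),  (1.20, 0.80, 0.05))
print(on(*cup, *table))   # cup bottom meets table top -> score 1.0
```

Scores like these, rather than hard booleans, are what let downstream modules threshold or rank candidate relations before committing them to the graph.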
Graph data structures (adjacency lists, attribute dictionaries, scene-state mappings) are updated in real time in online systems as new observations or semantic cues arrive.
3. Online Construction, Update, and Temporal Evolution
Real-world deployment requires continuous, online synthesis and update:
- Observation buffers collect geometric (point clouds, SLAM, RGB-D) and semantic (speech, OCR, visual detections, gesture) streams (Shirasaka et al., 25 Jun 2025).
- Geometric update modules perform nearest-neighbor association, Kalman-like state fusion, and node birth/death for new observations. Edge relations (spatial, reachable) are recomputed via analytic predicates (e.g., spatial proximity, support, shortest-path reachability) with probabilistic or kernelized scoring (Shirasaka et al., 25 Jun 2025).
- Semantic update modules integrate LLM/GPT outputs, object-detection, and contextual queries, updating existing edges/nodes or creating new graph elements. Confidence scores are thresholded to gate incorporation (Shirasaka et al., 25 Jun 2025, Li et al., 24 Sep 2025).
- Temporal structure is captured via framewise node/edge histories, with MPNN/graph attention propagating structure through time (Herbert et al., 16 Jan 2026). This supports long-horizon reasoning and stateful prediction (see Table 1 for examples).
| Framework | Node Features | Edge Relations |
|---|---|---|
| SPARK (Shirasaka et al., 25 Jun 2025) | $[\mathbf{p}, \mathbf{g}, \mathbf{s}]$ (position, shape, semantics) | $r$: on, next_to, reachable (score) |
| cg+ (Jiao et al., 2022) | mesh set, class, bounding box, container status | support, proximity (collision), open/close |
| Task/Scene (Qi et al., 2024) | task/subtask/action id, input/output requirements | contain, next, start/end, require/obtain |
| QSR (Li et al., 24 Sep 2025) | object instance, point-cloud, NeRF, multimodal | on, next-to, contains |
| MPNN-Task (Herbert et al., 16 Jan 2026) | class ID, 3D trajectory history | semantic relations (multi-hot labels) |
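The geometric update loop described above (nearest-neighbor association, Kalman-like fusion, node birth) can be sketched with a constant-gain blend; the class name, gate radius, and gain value are illustrative assumptions rather than any cited system's parameters:

```python
import math

class OnlineNodeTracker:
    """Nearest-neighbor data association with a constant-gain ("Kalman-like")
    position blend. Gate and gain values are illustrative assumptions."""
    def __init__(self, gate=0.3, gain=0.5):
        self.nodes = {}          # node_id -> fused 3D position
        self.gate, self.gain = gate, gain
        self._next = 0

    def update(self, detection):
        """Associate a detected 3D position with an existing node or spawn one."""
        best, best_d = None, self.gate
        for nid, pos in self.nodes.items():
            d = math.dist(pos, detection)
            if d < best_d:
                best, best_d = nid, d
        if best is None:                       # node birth: no match inside gate
            best = f"obj_{self._next}"; self._next += 1
            self.nodes[best] = tuple(detection)
        else:                                  # fuse old estimate with new obs
            old = self.nodes[best]
            self.nodes[best] = tuple(o + self.gain * (z - o)
                                     for o, z in zip(old, detection))
        return best

trk = OnlineNodeTracker()
trk.update((0.40, 0.10, 0.80))      # spawns obj_0
trk.update((0.42, 0.10, 0.80))      # within gate: associates and smooths obj_0
trk.update((2.00, 0.00, 0.50))      # outside gate: spawns obj_1
print(trk.nodes)
```

A full pipeline would replace the constant gain with per-node covariance (a proper Kalman filter) and add node death for stale tracks, but the association-then-fuse skeleton is the same.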
4. Integration into Planning, Reasoning, and Control
These representations unify perception, memory, planning, and execution:
- Task planning: Subgraphs encode goal predicates (e.g., on(cup, table)), and symbolic planners (A*, Dijkstra, PDDL, GRU/Transformer controllers) search for grounding subgraphs and action sequences that minimize task/replan cost (Shirasaka et al., 25 Jun 2025, Li et al., 24 Sep 2025).
- Sequential manipulation: Graph edit distance (GED) between current and goal graphs produces a sequence of actions (Pick/Place/Open/Close), subject to symbolic and geometric preconditions and access constraints (Jiao et al., 2022).
- Skill transfer: Hierarchical graphs separate "what" (task graph) and "how/where" (scene graph), with a state graph fusing requirement/observation edges, enabling subtask lifting from demonstration to new objects/regions (Qi et al., 2024).
- Closed-loop control: Temporal message-passing through semantic-geometric graphs supports long-horizon forecasting, action-object prediction, and motion trajectory generation, enabling robust execution even in high-variability tasks (e.g., bimanual manipulation) (Herbert et al., 16 Jan 2026).
- Downstream interfacing: Graphs support queryability (e.g., “Where is the red cup?”), visual question answering, navigation waypoint generation, and adaptive subgraph selection for constrained search (e.g., terrain-aware navigation) (Samuelson et al., 6 Jun 2025, Li et al., 24 Sep 2025, Kim et al., 2019).
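The graph-difference planning idea behind the sequential-manipulation bullet can be illustrated with a deliberately simplified sketch: diff the "on" edges of current and goal graphs, and emit one Pick/Place pair per misplaced object. Real GED-based planners additionally respect preconditions, access constraints, and edit-cost ordering, which are omitted here:

```python
def plan_actions(current_on, goal_on):
    """Derive a Pick/Place sequence from the symbolic difference between
    current and goal 'on' edges. Simplified GED-style sketch: one Pick and
    one Place per misplaced object; preconditions and ordering omitted."""
    actions = []
    for obj, support in sorted(goal_on.items()):
        if current_on.get(obj) != support:
            actions.append(("Pick", obj, current_on.get(obj)))
            actions.append(("Place", obj, support))
    return actions

current = {"cup": "table", "book": "shelf"}
goal    = {"cup": "shelf", "book": "shelf"}
print(plan_actions(current, goal))
# [('Pick', 'cup', 'table'), ('Place', 'cup', 'shelf')]
```

The "book" entry already satisfies its goal edge, so only the cup generates actions; this is the sense in which the edit distance between graphs bounds the length of the action sequence.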
5. Algorithmic and Learning Frameworks
Various algorithmic strategies underpin graph construction, alignment, and optimization:
- Message Passing Neural Networks (MPNNs) and Graph Attention: Learn to propagate both semantic and geometric features, updating node/edge attributes, and enabling task-relevant pooling (e.g., for policy learning and forecasting) (Herbert et al., 16 Jan 2026, Seymour et al., 2022).
- Graph Transformers and Convolutions: Enable multi-hop reasoning, composite relation learning, and node/edge feature refinement, supporting efficient navigation policies (Seymour et al., 2022, Xie et al., 2024).
- Loss functions: Jointly optimize cross-entropy for semantic node/edge classification, MSE for geometric trajectory prediction, margin ranking or contrastive objectives for temporal ordering and cross-object alignment (Herbert et al., 16 Jan 2026, Jin et al., 2022).
- Partial matching and fusion: For registration, alignment, and multi-view consistency, semantic-geometric feature fusion and differentiable matching (e.g., Sinkhorn, SoftTopK, learned rescoring) are critical for robust point/graph-level correspondences (Xie et al., 2024).
- Factor graph optimization: In SLAM and mapping, semantic-geometric graphs constitute the variable/factor structure for joint probabilistic inference over poses, planes, rooms, and wall segments, with GNNs learning factor parameters and helping cluster geometric primitives into semantic entities (Millan-Romera et al., 2024).
- Online, real-time scalability: Efficient data structures (adjacency lists, hashmaps) and streaming update operations are employed to support the dynamics of robot-environment interaction (Shirasaka et al., 25 Jun 2025).
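As a minimal illustration of the message-passing idea in the list above, the sketch below runs one round of mean-aggregation over a toy graph. It uses no learned weights, so it is an illustrative stand-in for an MPNN/graph-attention layer, not an implementation of any cited model:

```python
def message_pass(features, edges):
    """One round of mean-aggregation message passing: each node's feature
    becomes the average of its own feature and its neighbors' features.
    No learned weights -- an illustrative stand-in for an MPNN layer."""
    neighbors = {n: [] for n in features}
    for u, v in edges:                     # treat edges as undirected
        neighbors[u].append(v)
        neighbors[v].append(u)
    updated = {}
    for n, f in features.items():
        msgs = [features[m] for m in neighbors[n]] + [f]
        updated[n] = [sum(vals) / len(msgs) for vals in zip(*msgs)]
    return updated

feats = {"cup": [1.0, 0.0], "table": [0.0, 1.0], "shelf": [0.0, 0.0]}
out = message_pass(feats, [("cup", "table"), ("table", "shelf")])
print(out["table"])   # averages cup, shelf, and table features
```

A learned layer would replace the plain average with weight matrices and attention coefficients, and stacking several rounds lets information flow across multi-hop paths, which is what enables the task-relevant pooling described above.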
6. Empirical Impact and Application Domains
Semantic-geometric task graph representations have demonstrated impact in:
- Robotic service and household manipulation: Adaptive, online graph updates facilitate robust goal achievement in dynamic human environments and with unconventional semantic cues (e.g., gestures, speech), yielding measurable gains in plan-tracking accuracy (Shirasaka et al., 25 Jun 2025).
- Embodied navigation and visual reasoning: Graph-based intermediate representations accelerate learning, improve sample efficiency, and increase interpretability in navigation and language-instruction following (Seymour et al., 2022, Xie et al., 2024).
- Long-horizon manipulation and skill transfer: Hierarchical and state-linked graphs enable LLM-informed subtask transfer, motion re-planning, and tactile-driven refinement, generalizing structured skills across categories and physical contexts (Qi et al., 2024).
- SLAM and 3D mapping: Semantic-geometric factor graphs allow for joint optimization of geometric structure and semantic grouping in challenging environments, improving data association and leading to real-time tractability (Millan-Romera et al., 2024).
- Vision-language and task-based retrieval: Multimodal scene graphs and object-centric retrieval directly ground high-level queries into geometric substructures used in planning and control (Li et al., 24 Sep 2025).
- Demonstration learning: Temporal task graphs extracted from video demonstrations outperform sequence-only models for tasks with high object and action variability, and enable zero-shot transfer to physical systems (Herbert et al., 16 Jan 2026).
7. Limitations and Future Research Directions
Current research highlights several open challenges and frontiers:
- Online semantic fusion in highly dynamic, cluttered spaces remains challenging; continual learning and error correction are required for reliable long-term deployment (Shirasaka et al., 25 Jun 2025).
- Scalability: Efficient subgraph search, partial matching, and compact encoding of dense, large-scale scenes are critical for applications in open-world and outdoor environments (Samuelson et al., 6 Jun 2025).
- Skill transfer boundaries: The mapping between high-level task structure and low-level physical execution in transfer to novel scenes or object types is not always straightforward; state graphs and tactile feedback help but are not a complete solution (Qi et al., 2024).
- Unified multimodal representations: Integration of point clouds, NeRF/radiance fields, panoptic segmentation, and language requires modular, extensible designs as in 3D QSR (Li et al., 24 Sep 2025).
A plausible implication is that semantic-geometric task graph representations will remain a foundational abstraction for robotics and multimodal embodied AI, with future work focusing on deeper semantic grounding, lifelong adaptation, and integration across vision, language, and physical control.