
ArtiSG: Functional 3D Scene Graphs

Updated 4 July 2026
  • ArtiSG is a framework that converts human-demonstrated articulated object manipulation into functional 3D scene graphs that encode semantic, kinematic, and functional attributes.
  • It integrates multi-view RGB-D perception with techniques like DBSCAN, Grounding DINO, and SAM to robustly detect and refine small functional elements.
  • The system shifts scene encoding from static description to interaction-grounded memory; in the reported real-world experiments, adding human demonstrations raises functional-element recall from 55.8 to 88.5 over the static-perception ablation.

ArtiSG is a framework for constructing functional 3D scene graphs from human-demonstrated articulated-object manipulation. In contrast to 3D scene graphs that primarily encode semantic structure for navigation and planning, ArtiSG augments the scene representation with functional elements, articulation type, articulation axis, and demonstrated motion trajectories, thereby treating human manipulation as structured robotic memory rather than incidental observation. Its central premise is that static visual inference is insufficient for articulated objects because articulation mechanisms are visually ambiguous, state-change methods often depend on constrained sensing setups, and inconspicuous functional parts such as small handles are frequently missed by general perception pipelines (Gu et al., 31 Dec 2025).

1. Problem formulation and scope

ArtiSG addresses the gap between semantic scene understanding and manipulation-oriented scene understanding. Standard 3D scene graphs can represent object identity and spatial structure, but they generally do not encode the information required to operate articulated objects: which component is actionable, what joint family governs its motion, where the articulation axis lies, and how that motion was previously executed (Gu et al., 31 Dec 2025).

The framework is motivated by three limitations. First, static articulation inference is ambiguous: methods that infer articulation from appearance alone can fail when visually similar objects obey different kinematics. Second, state-change-based estimation is brittle: methods that estimate articulation from state transitions often assume fixed cameras or unobstructed views. Third, functional elements are often too small or inconspicuous: handles, knobs, and hidden grips may be missed by generic object detectors or 3D instance segmentation pipelines. ArtiSG addresses these limitations by incorporating human demonstrations as an additional information source, so that the graph reflects not only appearance but also observed interaction (Gu et al., 31 Dec 2025).

A plausible implication is that ArtiSG shifts scene-graph construction from descriptive scene encoding toward interaction-grounded functional abstraction. In that view, a scene graph is not merely a semantic map but a memory substrate for later open-vocabulary querying and language-directed execution.

2. Graph representation and stored functional memory

ArtiSG defines a scene graph as

$$\mathcal{G} = \{\mathcal{N}^{\mathrm{obj}}, \mathcal{N}^{\mathrm{ele}}, \mathcal{E}\}$$

where $\mathcal{N}^{\mathrm{obj}}$ are object nodes, $\mathcal{N}^{\mathrm{ele}}$ are functional element nodes, and $\mathcal{E}$ are edges linking each functional element to its parent object (Gu et al., 31 Dec 2025).

The representation is explicitly hierarchical. Object nodes represent static object bodies and store a category label, an open-vocabulary semantic feature, and a point cloud. Functional element nodes represent actionable parts and store a functional label, articulation type, articulation axis

$$\mathbf{A}_j = \{\mathbf{p}_c, \mathbf{p}_d\},$$

and a demonstrated trajectory

$$\mathcal{T}_j = \{\mathbf{p}_1, \dots, \mathbf{p}_n\}.$$

Here $\mathbf{p}_c \in \mathbb{R}^3$ is the axis center, $\mathbf{p}_d \in \mathbb{R}^3$ is the axis direction, and each $\mathbf{p}_k \in \mathbb{R}^7$ is a 6-DoF pose in the manipulation sequence. The node relation is one-to-many: one object may own multiple functional elements, while each functional element belongs to exactly one parent object (Gu et al., 31 Dec 2025).
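The hierarchy described above can be sketched as a small data structure. This is only an illustration, not the paper's implementation: the class and field names are hypothetical, and the 7-vector pose layout (position plus quaternion) is an assumption, since the source only states that poses live in $\mathbb{R}^7$.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Vec3 = Tuple[float, float, float]
# Assumed layout for a 7-vector pose: x, y, z, qx, qy, qz, qw.
Pose7 = Tuple[float, float, float, float, float, float, float]


@dataclass
class ObjectNode:
    node_id: int
    category: str              # category label
    feature: List[float]       # open-vocabulary semantic feature
    points: List[Vec3]         # object point cloud


@dataclass
class ElementNode:
    node_id: int
    label: str                                  # functional label, e.g. "handle"
    joint_type: Optional[str] = None            # "prismatic" or "revolute"
    axis_center: Optional[Vec3] = None          # p_c
    axis_direction: Optional[Vec3] = None       # p_d
    trajectory: List[Pose7] = field(default_factory=list)  # T_j


@dataclass
class SceneGraph:
    objects: Dict[int, ObjectNode] = field(default_factory=dict)
    elements: Dict[int, ElementNode] = field(default_factory=dict)
    parent_of: Dict[int, int] = field(default_factory=dict)  # element id -> object id

    def add_object(self, node: ObjectNode) -> None:
        self.objects[node.node_id] = node

    def add_element(self, node: ElementNode, parent_id: int) -> None:
        # Each functional element belongs to exactly one existing parent object.
        if parent_id not in self.objects:
            raise KeyError(f"unknown parent object {parent_id}")
        self.elements[node.node_id] = node
        self.parent_of[node.node_id] = parent_id

    def elements_of(self, object_id: int) -> List[ElementNode]:
        # One object may own several functional elements.
        return [self.elements[e] for e, o in self.parent_of.items() if o == object_id]
```

Storing the edge set as an element-to-parent map makes the one-parent constraint structural: an element cannot be linked to two objects.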

This graph is designed as functional memory. When queried by language, a robot can retrieve not only an object identity but also the interaction point and stored kinematic prior. The framework therefore encodes “what to touch” and “how it moves,” not only “what it is.” This suggests that ArtiSG treats trajectories and articulation axes as first-class symbolic-physical attributes within the scene graph, rather than as auxiliary metadata.

3. System pipeline and initialization from static perception

ArtiSG is organized into three stages: functional scene graph initialization, viewpoint-robust articulation estimation, and interaction-augmented graph refinement (Gu et al., 31 Dec 2025).

During initialization, the environment is scanned using RGB-D observations to obtain posed frames and a scene point cloud. An off-the-shelf 3D instance segmentation model is used to create object instances, and DBSCAN is applied to remove outliers before instantiating object nodes. Functional elements are then sought by selecting informative views. For each object, ArtiSG computes a per-frame contribution score $s_{t,i}$ by projecting object points into each camera view, discarding points outside the image or with inconsistent depth, and defining $s_{t,i}$ as the fraction of valid projected points relative to all object points. The top-$k$ frames are selected for each object (Gu et al., 31 Dec 2025).
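The view-scoring step can be illustrated with a short NumPy sketch. It assumes a pinhole camera with intrinsics `K`, a world-to-camera transform, and a metric depth image; the function names, the depth tolerance `tol`, and the rounding to the nearest pixel are illustrative choices, not details taken from the paper.

```python
import numpy as np


def visibility_score(points_w, T_cam_world, K, depth, tol=0.05):
    """Fraction of object points that project validly into one frame.

    points_w: (N, 3) world points; T_cam_world: 4x4 world-to-camera transform;
    K: 3x3 pinhole intrinsics; depth: (H, W) depth image in metres.
    """
    pts = np.asarray(points_w, dtype=float)
    n = len(pts)
    if n == 0:
        return 0.0
    # Transform points into the camera frame.
    pts_c = pts @ T_cam_world[:3, :3].T + T_cam_world[:3, 3]
    z = pts_c[:, 2]
    front = z > 1e-6  # points behind the camera can never be valid
    u = np.full(n, -1, dtype=int)
    v = np.full(n, -1, dtype=int)
    proj = pts_c[front] @ K.T
    u[front] = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v[front] = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    h, w = depth.shape
    inside = front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(inside)
    # Keep only points whose depth agrees with the sensor reading (occlusion check).
    consistent = np.abs(depth[v[idx], u[idx]] - z[idx]) < tol
    return float(consistent.sum()) / n


def top_k_frames(scores, k):
    """Indices of the k highest-scoring frames, best first."""
    order = np.argsort(np.asarray(scores))[::-1][:k]
    return [int(i) for i in order]
```

Dividing by the total number of object points, rather than by the number of in-image points, penalizes frames that see only a sliver of the object.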

Using those selected views, the system crops the image around the object, runs Grounding DINO with prompts such as “handle” or “knob,” applies SAM for pixel-accurate masks, back-projects the masks into 3D, aggregates multi-view lifted points into a unified functional-element point cloud, and applies DBSCAN again for denoising. Open-vocabulary node embeddings for both object and element nodes are extracted with SigLIP 2 from the selected views and combined as a weighted average using the visibility scores $s_{t,i}$ (Gu et al., 31 Dec 2025).
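The final fusion step, a visibility-weighted average of per-view embeddings, might look like the sketch below. The per-view feature vectors are assumed to come from an image encoder that is not shown, and the final L2 normalization is an added assumption (convenient for later cosine-similarity queries), not something the source states.

```python
import numpy as np


def fuse_features(view_features, scores):
    """Weighted average of per-view embeddings using visibility scores as weights."""
    f = np.asarray(view_features, dtype=float)   # shape (num_views, dim)
    s = np.asarray(scores, dtype=float)          # shape (num_views,)
    if s.sum() <= 0:
        raise ValueError("at least one view must have a positive score")
    w = s / s.sum()
    fused = (w[:, None] * f).sum(axis=0)
    # Assumption: normalize so the node feature can be compared by cosine similarity.
    norm = np.linalg.norm(fused)
    return fused / norm if norm > 0 else fused
```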

The significance of this stage lies in its attempt to compensate for the viewpoint sensitivity of functional-part detection. ArtiSG does not rely on a single canonical image; it explicitly ranks views by object visibility and uses multi-view lifting to recover tiny parts that a purely 3D detector may fail to isolate.

4. Portable articulation capture under camera ego-motion

A major technical component of ArtiSG is its data-collection setup for recovering articulation trajectories when the observing camera is itself moving. The hardware consists of a head-mounted RGB-D camera, a UMI gripper, and a custom polyhedral sphere carrying dense ArUco markers. The gripper serves as a rigid manipulation interface, providing a stable proxy for the interaction point, while the head-mounted camera supplies scene observations and SLAM-based world-frame poses (Gu et al., 31 Dec 2025).

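ArUco detections on the sphere yield 2D–3D correspondences, from which the sphere pose in the camera frame is estimated via PnP. Writing $\mathbf{T}^{a}_{b}$ for the pose of frame $b$ expressed in frame $a$ (notation chosen here for exposition), this gives $\mathbf{T}^{\mathrm{cam}}_{\mathrm{sphere}}$. Using the camera pose $\mathbf{T}^{\mathrm{world}}_{\mathrm{cam}}$ from SLAM, the estimate is transformed into the world frame: $\mathbf{T}^{\mathrm{world}}_{\mathrm{sphere}} = \mathbf{T}^{\mathrm{world}}_{\mathrm{cam}}\,\mathbf{T}^{\mathrm{cam}}_{\mathrm{sphere}}$. A known rigid transform $\mathbf{T}^{\mathrm{sphere}}_{\mathrm{tip}}$ from sphere center to gripper tip then gives

$$\mathbf{T}^{\mathrm{world}}_{\mathrm{tip}} = \mathbf{T}^{\mathrm{world}}_{\mathrm{sphere}}\,\mathbf{T}^{\mathrm{sphere}}_{\mathrm{tip}}.$$

The resulting world-frame tool-tip poses constitute the demonstrated trajectory $\mathcal{T}_j$ (Gu et al., 31 Dec 2025).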
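The chain of frame changes described here (camera pose in the world, sphere pose in the camera, tip offset in the sphere frame) reduces to multiplying homogeneous transforms. The sketch below assumes 4x4 matrices and leaves out the PnP solve itself; the helper names are hypothetical.

```python
import numpy as np


def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = np.asarray(R, dtype=float)
    T[:3, 3] = np.asarray(t, dtype=float)
    return T


def tip_pose_world(T_world_cam, T_cam_sphere, T_sphere_tip):
    """Compose SLAM camera pose, PnP sphere pose, and the fixed sphere-to-tip offset."""
    return T_world_cam @ T_cam_sphere @ T_sphere_tip
```

Applying this per frame yields one world-frame tip pose per observation, which is what makes the trajectory independent of the wearer's head motion.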

Raw pose estimates are refined by an adaptive Kalman filter that smooths jitter from hand tremor and marker noise, handles rotational wrap-around by rotation unwrapping, and adjusts confidence based on PnP reprojection error. This stage is critical because articulation estimation is highly sensitive to trajectory noise. A plausible implication is that ArtiSG treats robust egocentric capture not as a convenience but as a prerequisite for converting demonstrations into reusable graph-level kinematic priors.
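To make the idea of confidence-adjusted filtering concrete, here is a deliberately simplified sketch: a random-walk Kalman filter on 3D positions whose measurement noise grows with the PnP reprojection error, plus angle unwrapping via `numpy.unwrap`. The motion model, the noise parameters `q`, `r0`, `gain`, and the noise-scaling rule are all assumptions for illustration; the paper's filter design is not specified at this level of detail in the text above.

```python
import numpy as np


def adaptive_kalman_smooth(positions, reproj_errors, q=1e-4, r0=1e-3, gain=1.0):
    """Smooth a sequence of 3D positions, trusting low-reprojection-error frames more.

    positions: (n, 3) measured tip positions; reproj_errors: (n,) PnP errors in pixels.
    Uses a scalar covariance shared by the three axes (a simplifying assumption).
    """
    z = np.asarray(positions, dtype=float)
    e = np.asarray(reproj_errors, dtype=float)
    x = z[0].copy()   # state estimate
    p = r0            # state variance
    out = [x.copy()]
    for k in range(1, len(z)):
        p = p + q                                  # predict under a random-walk model
        r = r0 * (1.0 + gain * e[k]) ** 2          # inflate noise for unreliable frames
        kalman_gain = p / (p + r)
        x = x + kalman_gain * (z[k] - x)           # update with the new measurement
        p = (1.0 - kalman_gain) * p
        out.append(x.copy())
    return np.array(out)


def unwrap_angles(angles):
    """Remove 2*pi jumps from an angle sequence so it varies continuously."""
    return np.unwrap(np.asarray(angles, dtype=float))
```

With this scaling, a frame whose markers reproject poorly barely moves the estimate, while clean frames are followed closely.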

5. Articulation estimation and interaction-augmented graph refinement

From the smoothed trajectory $\mathcal{T}_j$, ArtiSG infers articulation mechanism and joint family. It considers two joint types: prismatic and revolute (Gu et al., 31 Dec 2025).

For prismatic joints, the trajectory is approximated by a 3D line using SVD/PCA on centered trajectory points. The axis direction $\mathbf{p}_d$ is taken as the principal direction, that is, the singular vector associated with the largest singular value, and the axis center $\mathbf{p}_c$ is the centroid. For revolute joints, the trajectory is modeled as a circular arc: the axis direction is estimated from the singular vector corresponding to the smallest singular value (the normal of the best-fit plane), the trajectory is projected onto that plane, and the rotation center is solved by nonlinear least squares minimizing radial deviation. Joint-family selection compares reconstruction residuals for the prismatic and revolute fits, with an added penalty for model complexity, producing the articulation axis

$$\mathbf{A}_j = \{\mathbf{p}_c, \mathbf{p}_d\}$$

(Gu et al., 31 Dec 2025).
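A compact NumPy sketch of the two fits and the selection rule follows. Two simplifications are worth flagging: the circle center is found with an algebraic (Kåsa) linear least-squares fit instead of the nonlinear least squares mentioned above, and the complexity penalty is a fixed offset in metres. Both, along with all function names, are assumptions made for illustration.

```python
import numpy as np


def fit_prismatic(pts):
    """Line fit: centroid as p_c, first principal direction as p_d."""
    c = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - c, full_matrices=False)
    d = vt[0]
    along = (pts - c) @ d
    resid = (pts - c) - np.outer(along, d)
    return c, d, float(np.sqrt((resid ** 2).sum(axis=1).mean()))


def fit_revolute(pts):
    """Arc fit: plane normal as axis direction, circle center as axis center."""
    c = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - c, full_matrices=False)
    u, v, normal = vt[0], vt[1], vt[2]
    xy = np.column_stack([(pts - c) @ u, (pts - c) @ v])  # in-plane coordinates
    # Kasa fit: x^2 + y^2 = 2*a*x + 2*b*y + c0, linear in (a, b, c0).
    A = np.column_stack([2.0 * xy, np.ones(len(xy))])
    rhs = (xy ** 2).sum(axis=1)
    (a, b, c0), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    radius = np.sqrt(max(c0 + a * a + b * b, 0.0))
    center = c + a * u + b * v
    radial = np.linalg.norm(xy - np.array([a, b]), axis=1) - radius
    off_plane = (pts - c) @ normal
    rmse = float(np.sqrt((radial ** 2 + off_plane ** 2).mean()))
    return center, normal, rmse


def classify_joint(points, penalty=0.002):
    """Return (joint_type, p_c, p_d); the simpler prismatic model wins near-ties."""
    pts = np.asarray(points, dtype=float)
    pc, pd, err_line = fit_prismatic(pts)
    rc, rd, err_arc = fit_revolute(pts)
    if err_line <= err_arc + penalty:
        return "prismatic", pc, pd
    return "revolute", rc, rd
```

The penalty biases the decision toward the line model, which guards against short, noisy drawer pulls being explained as arcs of very large radius.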

The estimated interaction is then integrated into the graph through trajectory-to-node association. The starting pose $\mathbf{p}_1$ is treated as the initial contact point and compared with nearby functional-element centroids. If the nearest node lies within a threshold, the system associates the trajectory with that node and attaches the articulation axis, joint type, and full trajectory. If no node is sufficiently close, ArtiSG assumes a missed functional element, instantiates a new element node centered at $\mathbf{p}_1$, attaches its kinematic attributes, and links it to the nearest parent object (Gu et al., 31 Dec 2025).
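The association rule can be sketched as a nearest-centroid test with a fallback. The 10 cm threshold, the use of object centroids to pick the "nearest parent object", and the return convention are hypothetical choices for this illustration; the source does not give these specifics.

```python
import numpy as np


def associate_trajectory(start_pos, element_centroids, object_centroids, threshold=0.10):
    """Decide whether a demonstration attaches to an existing element or creates one.

    Returns ("existing", element_index) when the nearest element centroid is within
    `threshold` metres of the contact point, otherwise ("new", parent_object_index).
    """
    start = np.asarray(start_pos, dtype=float)
    if len(element_centroids) > 0:
        d_ele = np.linalg.norm(np.asarray(element_centroids, dtype=float) - start, axis=1)
        j = int(np.argmin(d_ele))
        if d_ele[j] <= threshold:
            return "existing", j
    # No element close enough: assume perception missed it and attach a new
    # element to the closest object (by centroid distance, an assumption).
    d_obj = np.linalg.norm(np.asarray(object_centroids, dtype=float) - start, axis=1)
    return "new", int(np.argmin(d_obj))
```

For example, a contact point 3 cm from a detected handle is attached to that handle, while a contact point 50 cm from every detected element produces a new node under the nearest object.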

This refinement mechanism directly addresses a common misconception about scene graphs for manipulation: that all relevant actionable structure must be visually observable. ArtiSG explicitly rejects that premise. Functional elements may be visually weak yet interactionally decisive, and the framework uses demonstration to elevate them into persistent graph entities.

6. Empirical performance and downstream manipulation use

ArtiSG is evaluated in BEHAVIOR-1K simulation scenes and in real environments including a kitchen, an office pantry, and a tabletop scene, covering 79 articulated objects and 139 functional elements. The framework is compared against Lost&Found and OpenFunGraph for functional scene graph construction, and against GFlow, CoTracker, and MediaPipe for articulation tracking. In downstream language-directed manipulation, it is compared with a strong VLM baseline (Gu et al., 31 Dec 2025).

For functional scene graph construction, the reported gains are concentrated in recall. In simulation, ArtiSG w.o. human achieves recall 78.6, precision 70.4, and F1 74.2, while full ArtiSG reaches recall 82.6, precision 71.4, and F1 76.6. In real-world settings, ArtiSG w.o. human attains recall 55.8, precision 41.0, and F1 47.2, whereas full ArtiSG reaches recall 88.5, precision 51.6, and F1 65.2. The largest reported gain is the increase in real-world recall from 55.8% to 88.5%, supporting the claim that human demonstrations recover functional elements missed by static perception (Gu et al., 31 Dec 2025).
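As a quick consistency check on these figures, F1 is the harmonic mean of precision and recall, $F_1 = 2PR/(P+R)$. The short script below recomputes it from the reported precision and recall; the dictionary keys are just labels chosen here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2.0 * precision * recall / (precision + recall)


# (recall, precision, reported F1) as quoted in the text above.
reported = {
    "sim, w.o. human": (78.6, 70.4, 74.2),
    "sim, full": (82.6, 71.4, 76.6),
    "real, w.o. human": (55.8, 41.0, 47.2),
    "real, full": (88.5, 51.6, 65.2),
}

for name, (recall, precision, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(f"{name}: recomputed F1 = {f1:.2f}, reported = {f1_reported}")
```

The recomputed values agree with the reported ones to within about 0.1, a gap consistent with the inputs having been rounded to one decimal place.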

For articulation tracking, ArtiSG reports substantially lower trajectory and axis errors than the baselines. The values below are listed as a trajectory error in cm followed by an axis angular error in degrees; revolute joints carry a third value in cm, presumably the axis position error. In static settings, for prismatic joints, ArtiSG achieves 0.976 cm / 1.026°; for revolute joints, it achieves 1.092 cm / 1.627° / 0.811 cm. In dynamic settings, for prismatic joints the framework reports 0.820 cm / 1.314°, and for revolute joints 0.899 cm / 2.322° / 1.225 cm. The paper notes that in dynamic scenarios ArtiSG reduces trajectory RMSE by about 70% compared to the best baseline on revolute joints (Gu et al., 31 Dec 2025).

In downstream manipulation, the task is expressed as language such as “Open the [object]”. The reported advantage is that ArtiSG retrieves the correct functional element node, its demonstrated trajectory, and its stored kinematic prior, whereas a VLM baseline may miss the interaction point, hallucinate the action direction, or choose the wrong articulation model. This suggests that the principal contribution is not only improved perception metrics but a more reliable bridge from language to physical execution.
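The retrieval step can be pictured as a nearest-neighbor lookup over node embeddings. The sketch assumes the language query has already been embedded into the same space as the node features by a text encoder that is not shown; the function name and the toy vectors in the usage example are hypothetical.

```python
import numpy as np


def retrieve_node(query_vec, node_features, node_ids):
    """Return the id and cosine similarity of the node best matching the query."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    feats = np.asarray(node_features, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = feats @ q
    best = int(np.argmax(sims))
    return node_ids[best], float(sims[best])


# Toy usage with made-up 3-D embeddings.
ids = ["drawer_handle", "oven_knob"]
features = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
best_id, similarity = retrieve_node([1.0, 0.0, 0.0], features, ids)
```

Once the node is found, its stored joint type, axis, and trajectory are available directly, so no further inference about how the part moves is needed at execution time.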

| Setting | Method | Key reported result |
| --- | --- | --- |
| Real-world graph construction | ArtiSG w.o. human | Recall 55.8, Precision 41.0, F1 47.2 |
| Real-world graph construction | ArtiSG | Recall 88.5, Precision 51.6, F1 65.2 |
| Dynamic revolute tracking | ArtiSG | 0.899 cm / 2.322° / 1.225 cm |
| Dynamic prismatic tracking | ArtiSG | 0.820 cm / 1.314° |

7. Relation to adjacent articulated-object modeling and limitations

ArtiSG belongs to a broader research trajectory on articulated-object understanding, but its emphasis is distinctive. SINGAPO formulates articulated object creation as a single-image conditional generative problem and produces structured 3D articulated assets from one resting-state RGB image (Liu et al., 2024). ArtGS instead focuses on interactive visual-physical modeling using 3D Gaussian Splatting, combining multi-view RGB-D reconstruction, VLM-based structure inference, and differentiable optimization for manipulation (Yu et al., 3 Jul 2025). ArtiSG differs from both by centering the representation on functional 3D scene graphs built from human-demonstrated interaction rather than from single-image generation or joint-aware visual-physical reconstruction.

This contrast clarifies ArtiSG’s niche. SINGAPO addresses ambiguity in object geometry and kinematics from impoverished visual input; ArtGS constructs a physically consistent articulated digital twin from multi-view sensing and interaction; ArtiSG stores interaction-derived kinematic and functional knowledge in an open-vocabulary graph suitable for later retrieval and execution. A plausible implication is that these approaches are complementary rather than interchangeable: ArtiSG provides functional memory, whereas SINGAPO and ArtGS emphasize asset generation and articulated visual-physical modeling, respectively (Liu et al., 2024, Yu et al., 3 Jul 2025).

The paper notes several limitations. The current hardware uses markers on the gripper sphere, so the system is not fully markerless. The articulation model currently focuses on prismatic and revolute joints. Graph coverage depends on what humans have actually demonstrated, so unobserved affordances remain absent from memory. Future directions identified in the paper include markerless tracking and integration with more general manipulation policies (Gu et al., 31 Dec 2025).

Taken together, ArtiSG can be understood as an attempt to make 3D scene graphs operational for robotics by grounding them in demonstrated manipulation. Its defining contribution is the conversion of interaction traces into persistent symbolic-kinematic structure: object nodes become function-bearing entities, functional elements become explicit graph nodes, and human demonstrations become reusable robotic priors.
