Spatial Perception eNgine (SPIN)
- SPIN is a fully automatic framework for constructing layered 3D dynamic scene graphs that integrate visual-inertial SLAM, semantic understanding, and dynamic agent tracking.
- It leverages advanced techniques such as IMU-aided optical-flow, TEASER++ registration, and Graph-CNN based human mesh tracking to enhance spatial perception.
- The framework supports autonomy, planning, and human–robot interaction by providing robust, hierarchical representations tested in both simulated and real-world conditions.
The Spatial Perception eNgine (SPIN) is a fully automatic framework for constructing layered 3D Dynamic Scene Graphs (DSGs) from sensor data, integrating modules for visual-inertial SLAM, semantic scene understanding, and dynamic agent tracking. SPIN establishes a unified representation that enables actionable spatial perception, with direct implications for autonomy, planning, human-robot interaction, and scene prediction. It is structured to ingest synchronized stereo imagery and inertial measurements, outputting a hierarchical, multi-layer directed graph encoding metric, semantic, topological, and spatio-temporal information about complex, dynamic environments (Rosinol et al., 2020).
1. System Architecture and Modular Components
SPIN operates as a streaming pipeline, comprising several main modules:
- Visual-Inertial SLAM Front-End: Utilizes an IMU-aided optical-flow tracker, leveraging IMU integration for short-term feature prediction and employing 2-point RANSAC for outlier rejection based on IMU-measured rotations.
- SLAM Back-End: State estimation is formulated over robot poses and 3D landmarks and optimized as a factor graph incorporating IMU preintegration and stereo reprojection factors; loop closure constraints are handled by the robust pose graph optimizer Kimera-RPGO.
- Object Detection and Pose Estimation: Frame-wise panoptic segmentation differentiates both “stuff” and “instance” classes. For CAD-modeled objects, robust 6-DOF pose is recovered via keypoint matching and TEASER++ registration; for “unknown shape” objects, clusters in the metric-semantic mesh are extracted and assigned bounding boxes.
- Dense Human Mesh Tracking: SMPL meshes are regressed per cropped human image using a Graph-CNN method. Full spatial registration adopts a PnP formulation. Temporal associations are encoded in per-human pose graphs, with strict data association logic and dynamic masking to prevent moving-object bias in the reconstructed map.
- Graph Construction: The DSG comprises five layers: (1) metric-semantic mesh, (2) agents/objects, (3) places/structures, (4) rooms/corridors, and (5) the overall building. Edges represent inclusion, adjacency, traversability, and temporal association at multiple abstraction levels (Rosinol et al., 2020).
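The layered graph described above can be sketched as a small data structure; node attributes and relation names below are simplified placeholders, not the exact schema of the paper:

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    layer: int              # 1=mesh, 2=agents/objects, 3=places/structures,
                            # 4=rooms/corridors, 5=building
    semantic_class: str
    pose: tuple = (0.0, 0.0, 0.0)

class DynamicSceneGraph:
    """Toy layered DSG: typed nodes plus directed, relation-labeled edges."""
    def __init__(self):
        self.nodes = {}
        self.edges = []     # (src, dst, relation)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, dst, relation):
        self.edges.append((src, dst, relation))

    def children(self, node_id, relation="contains"):
        return [d for s, d, r in self.edges if s == node_id and r == relation]

dsg = DynamicSceneGraph()
dsg.add_node(Node("building", 5, "building"))
dsg.add_node(Node("room_3", 4, "room"))
dsg.add_node(Node("place_7", 3, "place", (2.0, 1.0, 0.0)))
dsg.add_node(Node("couch_1", 2, "couch", (2.3, 1.1, 0.0)))
dsg.add_edge("building", "room_3", "contains")
dsg.add_edge("room_3", "place_7", "contains")
dsg.add_edge("place_7", "couch_1", "contains")

print(dsg.children("room_3"))   # ['place_7']
```

Inclusion edges give the vertical hierarchy (building → room → place → object); adjacency, traversability, and temporal edges would be added with the same `add_edge` mechanism under different relation labels.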
2. Formal Foundations: Graph Model and Optimization Objectives
The DSG is defined as a directed graph with nodes partitioned into layers, where each node is attributed with type, pose, geometry, and semantic class label. Relation edges represent “contains,” “adjacent_to,” “co-visible,” and analogous spatial or temporal associations.
SLAM optimization within SPIN targets minimization of the reprojection and preintegration errors:

$$\min_{\mathcal{X}} \; \sum_{i} \left\| z_i - h_i(\mathcal{X}) \right\|^2_{\Sigma_i} \; + \; \rho(\mathcal{X}),$$

where the $z_i$ are feature and IMU measurements, the $h_i$ the corresponding observation models, the $\Sigma_i$ the noise covariances, and $\rho$ a smoothness regularizer (e.g., zero-velocity prior) (Rosinol et al., 2020).
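The cost being minimized is a sum of Mahalanobis-weighted residuals $\|z_i - h_i(\mathcal{X})\|^2_{\Sigma_i}$. A minimal sketch of evaluating such a cost for a few synthetic measurements (all values are made up for illustration):

```python
import numpy as np

def mahalanobis_sq(residual, sigma):
    """Squared Mahalanobis norm r^T Sigma^{-1} r."""
    return float(residual @ np.linalg.solve(sigma, residual))

# Synthetic measurements z_i, predictions h_i(x), and covariances Sigma_i
# (names mirror the equation above; values are illustrative).
z = [np.array([1.0, 2.0]), np.array([0.5, -1.0])]
h = [np.array([1.1, 1.9]), np.array([0.4, -1.2])]
sigmas = [0.01 * np.eye(2), 0.04 * np.eye(2)]

cost = sum(mahalanobis_sq(zi - hi, si) for zi, hi, si in zip(z, h, sigmas))
print(round(cost, 3))   # 3.25
```

A full back-end such as Kimera iteratively re-linearizes $h_i$ and solves for the $\mathcal{X}$ minimizing this sum; the snippet only shows how a single evaluation of the objective is weighted by the noise covariances.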
Room segmentation leverages energy minimization over an ESDF grid labeling $\ell$:

$$E(\ell) = \sum_{v} \psi_v(\ell_v) + \sum_{(u,v) \in \mathcal{N}} \psi_{uv}(\ell_u, \ell_v),$$

where $\psi_v$ encodes unary free-space terms and $\psi_{uv}$ penalizes label discontinuities except at detected doorways.
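A minimal sketch of this energy on a 1-D strip of grid cells, assuming a Potts-style pairwise term whose discontinuity penalty is waived at doorway boundaries (weights and layout are illustrative, not from the paper):

```python
def room_energy(labels, free_space, doorway, unary_w=1.0, pair_w=2.0):
    """Toy version of the room-labeling energy: unary free-space terms
    plus a Potts penalty on label changes, except across doorways."""
    # Unary: penalize assigning a room label to non-free-space cells.
    unary = unary_w * sum(0.0 if free_space[i] else 1.0
                          for i in range(len(labels)))
    # Pairwise: charge neighboring label discontinuities unless the
    # boundary coincides with a detected doorway.
    pair = 0.0
    for i in range(len(labels) - 1):
        if labels[i] != labels[i + 1] and not doorway[i]:
            pair += pair_w
    return unary + pair

free = [True] * 6
door = [False, False, True, False, False]   # doorway between cells 2 and 3
good = [0, 0, 0, 1, 1, 1]                   # room boundary at the doorway
bad  = [0, 0, 0, 0, 1, 1]                   # boundary inside a room

print(room_energy(good, free, door), room_energy(bad, free, door))  # 0.0 2.0
```

The labeling that places the room boundary exactly at the doorway incurs no penalty, which is why minimizing this energy tends to split rooms at doorways.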
3. Hierarchical Environment Parsing and Scene Decomposition
SPIN constructs hierarchical representations using sequential algorithms:
- Structural Extraction: Mesh sections with “wall,” “floor,” or “ceiling” panoptic labels are segmented, axis-aligned bounding boxes fitted, and canonical orientations assigned.
- Room and Place Segmentation: Global ESDF computation via Voxblox facilitates free-space sampling to define “places”; connected components on 2D ESDF slices segment rooms, with place, room, and adjacency relations constructed via majority voting and neighborhood connectivity.
- Topological Graph Construction: Places (Layer 3) serve as vertices in a navigation roadmap; objects are linked to their nearest place nodes, and agent timesteps to their spatial position, supporting both high-level (“go to the red couch in Room 3”) and low-level (geodesic traversal) planning.
4. Spatio-Temporal Association and Dynamic Reasoning
SPIN’s dynamic modeling incorporates:
- Temporal Edges: Consecutive torso poses of agents in the DSG are temporally linked, encoding continuous motion.
- Occlusion and Masking: Human pixels are excluded from the carving of ESDF free-space, resolving dynamic artifacts.
- Co-visibility Edges: Entities observed simultaneously are linked, enabling relational queries and supporting dynamic scene understanding.
- Data Association and Consistency: Human identity tracking employs bounding-box overlap thresholds, velocity plausibility checks, and pose continuity; a new agent is spawned when no correspondence can be established.
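The association logic above can be sketched as a gate on bounding-box overlap followed by a velocity plausibility check; the thresholds and the IoU criterion below are illustrative assumptions, not the paper's exact rules:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def associate(track, detection, dt, iou_thresh=0.3, v_max=3.0):
    """True if the detection continues the track; otherwise the caller
    spawns a new agent. Thresholds are illustrative."""
    if iou(track["box"], detection["box"]) < iou_thresh:
        return False
    dx = detection["pos"][0] - track["pos"][0]
    dy = detection["pos"][1] - track["pos"][1]
    speed = (dx * dx + dy * dy) ** 0.5 / dt
    return speed <= v_max           # reject physically implausible jumps

track = {"box": (0, 0, 10, 20), "pos": (1.0, 2.0)}
near  = {"box": (1, 0, 11, 20), "pos": (1.2, 2.1)}
far   = {"box": (1, 0, 11, 20), "pos": (9.0, 2.0)}

print(associate(track, near, dt=0.1), associate(track, far, dt=0.1))  # True False
```

The `far` detection passes the 2-D overlap gate but implies an 80 m/s jump in 3-D position, so the consistency check rejects it and a new agent would be spawned instead.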
5. Benchmark Evaluation and Implementation
SPIN was evaluated in a Unity-based office simulator (65 m × 65 m) with the uHumans dataset (uH_01–uH_03), featuring up to 60 dynamic agents and corresponding ground-truth trajectories.
Metrics:
- Visual-Inertial Odometry (VIO) ATE: On uH_03, Kimera-VIO achieved 0.92 m, improved to 0.63 m with IMU-aware tracking.
- Mesh Reconstruction: RMSE on uH_02 was reduced from 0.133 m to 0.061 m using dynamic masking.
- Human Mesh Localization: Full tracking yielded 0.65 m mean torso error on uH_01.
- Object Localization: 1.31 m (unknown shape), 0.20 m (with CAD alignment).
- Room Labeling: Achieved 99.9% precision and 99.8% recall on uH_01 (Rosinol et al., 2020).
Qualitative examples demonstrate the layered DSG structure in crowded, dynamic conditions and the impact of dynamic masking on mesh fidelity.
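The ATE numbers above are RMS translational errors between estimated and ground-truth trajectories after alignment. A simplified sketch of the metric, using centroid (translation-only) alignment rather than the full SE(3)/Umeyama alignment, on synthetic 2-D trajectories:

```python
import numpy as np

def ate_rmse(est, gt):
    """RMSE of positional error after removing the mean offset
    (translation-only special case of trajectory alignment)."""
    est, gt = np.asarray(est, float), np.asarray(gt, float)
    est_aligned = est - est.mean(axis=0) + gt.mean(axis=0)
    err = np.linalg.norm(est_aligned - gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

gt  = [(0, 0), (1, 0), (2, 0), (3, 0)]
est = [(0.1, 0.0), (1.1, 0.0), (2.1, 0.0), (3.1, 0.0)]  # constant offset

print(ate_rmse(est, gt))   # ~0: a pure offset is removed by alignment
```

A constant offset vanishes under alignment, while shape errors (drift, scale error) survive, which is what makes ATE a measure of trajectory consistency rather than raw position error.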
6. Integration of ViPE within the SPIN Framework
ViPE functions as a low-level geometric module within SPIN, outputting per-frame camera intrinsics, globally consistent poses, and dense near-metric depth maps from unconstrained video (Huang et al., 12 Aug 2025). Its principal contributions are:
- Calibration: GeoCalib and bundle adjustment refine intrinsics and pose for arbitrary camera models, including pinhole, wide-angle, and 360° configurations.
- Odometry: Keyframe-based bundle adjustment supports real-time, metric-scale trajectory tracking across diverse scenarios.
- Depth Estimation: ViPE fuses video depth priors and sparse 3D structure, producing full-resolution, temporally coherent, scale-consistent depths via affine alignment and filtering.
- Dynamic Masking: Semantic segmentation-driven masking mitigates moving-object bias, complementing SPIN’s dynamic scene modeling.
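The affine alignment step above fits a single scale and shift mapping the relative depth prior onto metric depths from sparse 3D structure. A least-squares sketch with synthetic data (the scale/shift model is stated in the description; the data and fitting routine here are illustrative):

```python
import numpy as np

def fit_affine(d_prior, d_sparse):
    """Least-squares fit of (a, b) so that a * d_prior + b ~= d_sparse."""
    A = np.stack([d_prior, np.ones_like(d_prior)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, d_sparse, rcond=None)
    return a, b

# Monocular prior is off by scale 2 and shift 0.5 relative to metric depth.
d_metric = np.array([1.0, 2.0, 3.0, 4.0])
d_prior = (d_metric - 0.5) / 2.0

a, b = fit_affine(d_prior, d_metric)
aligned = a * d_prior + b
print(round(a, 3), round(b, 3))   # 2.0 0.5
```

In practice the sparse depths come only at tracked feature locations, so the fitted `(a, b)` is then applied to the full-resolution depth map, with outlier filtering on the residuals.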
ViPE demonstrated superior performance: on TUM RGB-D, it reduced ATE from 4.4 cm (DROID-SLAM) to 3.6 cm (static) and from 2.7 cm to 1.5 cm (dynamic); on KITTI, ATE improved from 21.3 m (MASt3R-SLAM) to 9.2 m, representing 18–50% relative gains (Huang et al., 12 Aug 2025).
A large-scale dataset (96 million frames) annotated by ViPE underpins advanced SPIN usage, spanning natural videos, synthetic sequences, and panoramic content.
7. Applications, Limitations, and Prospects
SPIN’s richly structured DSG enables:
- Task and Motion Planning: Multiscale graphs form optimal roadmaps and support efficient collision checking through hierarchy-encoded BVH strategies.
- Human–Robot Interaction: Enables querying of agent trajectories and object interactions (“Where was human H at time t?”).
- Long-Term Autonomy: Supports memory-efficient abstraction and pruning through layered graph representations.
- Scene Prediction: Integrates pose graphs and mesh geometry with physics simulators for anticipation of short-term dynamics.
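The "Where was human H at time t?" query reduces to a lookup along an agent's chain of temporal edges; a sketch assuming each agent stores timestamped torso poses (the class and data below are illustrative):

```python
import bisect

class AgentTrack:
    """Timestamped torso poses linked by temporal edges; supports
    'where was this agent at time t?' queries."""
    def __init__(self):
        self.times, self.poses = [], []

    def add(self, t, pose):
        self.times.append(t)    # assumes poses arrive in time order
        self.poses.append(pose)

    def where_at(self, t):
        i = bisect.bisect_right(self.times, t) - 1
        if i < 0:
            return None          # agent not yet observed at time t
        return self.poses[i]     # latest pose at or before t

human_h = AgentTrack()
human_h.add(0.0, (0.0, 0.0))
human_h.add(1.0, (1.0, 0.5))
human_h.add(2.0, (2.0, 1.0))

print(human_h.where_at(1.5))   # (1.0, 0.5)
```

Interpolating between the bracketing poses, or following co-visibility edges from the returned node, would answer richer relational queries over the same structure.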
Identified limitations include the need for offline parsing for full DSG inference, current support limited to single-story indoor environments, and reliance on panoptic segmentation. Prospective research directions include incremental online DSG updates, extension to large-scale and outdoor environments, joint optimization of semantic segmentation and graph inference, and enrichment of node representations with material or affordance attributes (Rosinol et al., 2020), as well as fusion of additional sensor modalities (IMU, LiDAR) and differentiable bundle adjustment in ViPE (Huang et al., 12 Aug 2025).