Panoptic Multi-TSDFs for Dynamic Mapping

Updated 22 December 2025
  • Panoptic Multi-TSDFs are advanced 3D mapping techniques that decompose scenes into independent TSDF submaps using object and semantic segmentation to optimize memory and computation.
  • They fuse depth data with panoptic segmentation masks to update per-object submaps in real time, ensuring high-resolution geometry and semantic consistency under dynamic changes.
  • Empirical evaluations demonstrate improved runtime, scene coverage, and labeling accuracy, addressing limitations of traditional monolithic TSDF methods.

Panoptic Multi-TSDFs enable online, multi-resolution volumetric mapping in dynamic, long-term environments by leveraging per-object (instance) and semantic segmentation to allocate and update a set of Truncated Signed Distance Field (TSDF) submaps. By reasoning at the object and class level, this representation maintains up-to-date, high-coverage scene geometry and semantic consistency over time, even as objects move, appear, or disappear. Panoptic Multi-TSDFs address the primary limitations of conventional, monolithic TSDF and semantic-TSDF methods: inefficient memory usage and inability to reliably update geometry amid dynamic scene changes. Several works realize this paradigm, notably "Panoptic Multi-TSDFs" (Schmid et al., 2021), "DHP-Mapping" (Hu et al., 25 Mar 2024), and TSDF++ (Grinvald et al., 2021), each adapting data structures and algorithms to promote scalability, accuracy, and temporal coherence in online multimodal mapping.

1. Fundamental Principles and Representation

The Panoptic Multi-TSDF approach decomposes the global scene volume into a set of submaps, each representing a single semantic object instance (“thing”), a background region (“stuff”), or free space, and each maintained as an independent TSDF volume at a selectable resolution. Let $S = \{S_1, \ldots, S_N\}$ be the set of TSDF submaps. Each submap $S_i$ is associated with a class/instance label, a voxel size $\nu_i$, a sparse set of allocated blocks $B_i \subset \mathbb{Z}^3$, and per-voxel TSDF values and weights. This design enables multi-resolution mapping; for instance, small or geometrically complex objects can use fine voxels (e.g., 2–4 cm), while large furniture or background regions use coarser ones (e.g., up to 30 cm) (Schmid et al., 2021, Hu et al., 25 Mar 2024).
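
A minimal data-structure sketch of one such submap is given below, assuming a sparse block hash with lazily allocated 8×8×8 voxel blocks; the class and field names (`TsdfSubmap`, `blocks`, `allocate_block`) and the fixed block size are illustrative assumptions, not the exact structures of the cited systems.

```python
from dataclasses import dataclass, field
import numpy as np

BLOCK_SIDE = 8  # voxels per block edge (assumed)

@dataclass
class TsdfSubmap:
    submap_id: int
    class_label: str          # e.g. "chair" (thing) or "floor" (stuff)
    instance_id: int | None   # None for background ("stuff") submaps
    voxel_size: float         # nu_i, e.g. 0.02 m for small objects, 0.30 m for background
    truncation: float
    # Sparse block storage: integer block index -> per-voxel TSDF and weight arrays.
    blocks: dict[tuple[int, int, int], dict[str, np.ndarray]] = field(default_factory=dict)

    def allocate_block(self, block_idx: tuple[int, int, int]) -> dict[str, np.ndarray]:
        """Lazily allocate one block of TSDF values and fusion weights."""
        if block_idx not in self.blocks:
            shape = (BLOCK_SIDE,) * 3
            self.blocks[block_idx] = {
                "tsdf": np.full(shape, self.truncation, dtype=np.float32),
                "weight": np.zeros(shape, dtype=np.float32),
            }
        return self.blocks[block_idx]
```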

The global TSDF at a query point $x$ is defined as

$$\mathrm{sdf}_{\mathrm{global}}(x) = \min_{S_i \in S_{\mathrm{present}}} \mathrm{interpTSDF}_{S_i}(x)$$

where $S_{\mathrm{present}}$ denotes the set of active or persistent submaps.
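
As a concrete illustration of this query, the sketch below takes the minimum over all present submaps, using a nearest-voxel lookup in place of trilinear interpolation for brevity; it assumes the hypothetical `TsdfSubmap` layout sketched above.

```python
import numpy as np

BLOCK_SIDE = 8  # must match the submap sketch above

def nearest_voxel_tsdf(submap, x: np.ndarray) -> float:
    """Nearest-voxel TSDF lookup inside one submap (a stand-in for
    trilinear interpolation); unallocated space returns +truncation."""
    v = np.floor(np.asarray(x, dtype=float) / submap.voxel_size).astype(int)
    block_idx = tuple(v // BLOCK_SIDE)
    local = tuple(v % BLOCK_SIDE)
    block = submap.blocks.get(block_idx)
    if block is None or block["weight"][local] <= 0.0:
        return submap.truncation
    return float(block["tsdf"][local])

def global_sdf(x: np.ndarray, present_submaps) -> float:
    """sdf_global(x): minimum interpolated TSDF over all present submaps."""
    return min(nearest_voxel_tsdf(s, x) for s in present_submaps)
```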

Submap allocation, deletion, and management are triggered by incoming panoptic segmentation masks, yielding a direct one-to-one correspondence between submaps and high-level semantic entities.

2. TSDF Fusion and Label Integration

Depth data integration follows conventional TSDF update schemes but is performed independently for each submap:

$$s_{\text{new}}(v) = \frac{w(v)\, s(v) + w_{\text{in}}(v)\, \operatorname{tsdf}(v)}{w(v) + w_{\text{in}}(v)}, \qquad w(v) \leftarrow w(v) + w_{\text{in}}(v)$$

with the projective-bias weight

$$w_{\text{in}}(v) = \frac{f_x f_y\, \nu_i^2}{z(v)^4}$$

where $(f_x, f_y)$ are camera intrinsics and $z(v)$ is the depth at the projected voxel (Schmid et al., 2021).
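
The per-voxel update can be written directly from the two equations above; the sketch below follows the notation of the text, with the weight cap (`max_weight`) added as an extra assumption.

```python
def fuse_voxel(s: float, w: float, tsdf_in: float,
               fx: float, fy: float, voxel_size: float, depth_z: float,
               max_weight: float = 1e4) -> tuple[float, float]:
    """Update one voxel of one submap with a new projective TSDF measurement.

    s, w        -- stored TSDF value and fusion weight of the voxel
    tsdf_in     -- projective truncated signed distance from the current frame
    fx, fy      -- camera intrinsics (focal lengths)
    voxel_size  -- submap voxel size nu_i
    depth_z     -- depth z(v) at the voxel's projection in the current frame
    """
    # Projective-bias weight: w_in = fx * fy * nu_i^2 / z(v)^4
    w_in = fx * fy * voxel_size ** 2 / depth_z ** 4
    s_new = (w * s + w_in * tsdf_in) / (w + w_in)
    w_new = min(w + w_in, max_weight)  # weight cap is an assumption, not from the text
    return s_new, w_new
```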

To ensure semantic consistency, each voxel maintains either an exponentially decaying belonging probability $P_b(v)$ (Schmid et al., 2021) or explicit label/instance counts $c_s^v(l)$, $c_i^v(k)$ that determine distributions over semantic and instance IDs (Hu et al., 25 Mar 2024). Instance label fusion, as in DHP-Mapping, further eliminates label duplications among overlapping submaps and enforces spatial exclusivity.
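
A minimal sketch of the count-based variant is shown below; the class name `VoxelLabels` and its methods are hypothetical, and the exponentially decaying belonging probability of Schmid et al. would replace the counters with a single decayed scalar.

```python
from collections import Counter

class VoxelLabels:
    """Per-voxel semantic/instance label counters (count-based variant)."""

    def __init__(self) -> None:
        self.semantic_counts: Counter = Counter()   # c_s^v(l)
        self.instance_counts: Counter = Counter()   # c_i^v(k)

    def integrate(self, semantic_label: int, instance_id: int) -> None:
        """Accumulate one panoptic observation for this voxel."""
        self.semantic_counts[semantic_label] += 1
        self.instance_counts[instance_id] += 1

    def most_likely(self) -> tuple[int, int]:
        """Current MAP estimate of (semantic label, instance id);
        assumes at least one observation has been integrated."""
        return (self.semantic_counts.most_common(1)[0][0],
                self.instance_counts.most_common(1)[0][0])
```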

In DHP-Mapping, a fully connected CRF is deployed to refine both semantic and instance labels per submap, $P(X_S, X_I) \propto \exp[-E(X_S, X_I)]$, with unary and pairwise potentials designed over voxel pairs, leveraging spatial and color proximity (Hu et al., 25 Mar 2024).
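
For illustration, the sketch below writes out one plausible form of such potentials: count-based unaries and a Gaussian-kernel Potts pairwise term. The kernel widths and weights are placeholder values, not those used in DHP-Mapping.

```python
import numpy as np

def unary_potential(label: int, label_counts: dict[int, int]) -> float:
    """Negative log-probability of `label` from accumulated per-voxel counts."""
    total = sum(label_counts.values())
    p = label_counts.get(label, 0) / max(total, 1)
    return -float(np.log(max(p, 1e-6)))

def pairwise_potential(label_a: int, label_b: int,
                       pos_a: np.ndarray, pos_b: np.ndarray,
                       col_a: np.ndarray, col_b: np.ndarray,
                       theta_pos: float = 0.1, theta_col: float = 20.0,
                       w: float = 1.0) -> float:
    """Potts-style penalty: large for nearby, similarly colored voxels
    assigned different labels, zero when the labels agree."""
    if label_a == label_b:
        return 0.0
    k = np.exp(-np.sum((pos_a - pos_b) ** 2) / (2 * theta_pos ** 2)
               - np.sum((col_a - col_b) ** 2) / (2 * theta_col ** 2))
    return w * float(k)
```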

3. Pipeline for Panoptic Segmentation and Submap Management

The mapping pipeline processes each input RGB-D frame as follows:

  • Panoptic segmentation generates both “stuff” (background) and “thing” (object instance) masks.
  • Each active submap’s surface is projected to the image; IoU with panoptic masks yields associations (IoU threshold $\ge 0.1$ in Schmid et al., 2021); a minimal association sketch follows this list.
  • Unmatched panoptic masks allocate new submaps, with resolution heuristically matched to semantic class.
  • Integration fuses depth points and label information per-segment into the corresponding submap TSDF.
  • A spatial hash or k-d tree over submap bounding spheres accelerates region queries for both integration and submap lookup.
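
The association step can be sketched as follows, assuming each active submap has already been rendered to a boolean mask in the current view. The greedy best-IoU matching and the function names (`mask_iou`, `associate_masks`) are illustrative choices; only the 0.1 threshold is taken from the paper.

```python
import numpy as np

IOU_THRESHOLD = 0.1  # association threshold reported by Schmid et al., 2021

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of equal shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def associate_masks(panoptic_masks: dict[int, np.ndarray],
                    rendered_submap_masks: dict[int, np.ndarray]):
    """Greedily associate each panoptic mask with the best-overlapping submap."""
    matches, unmatched = {}, []
    for mask_id, mask in panoptic_masks.items():
        best_id, best_iou = None, IOU_THRESHOLD
        for submap_id, rendered in rendered_submap_masks.items():
            iou = mask_iou(mask, rendered)
            if iou >= best_iou:
                best_id, best_iou = submap_id, iou
        if best_id is None:
            unmatched.append(mask_id)   # triggers allocation of a new submap
        else:
            matches[mask_id] = best_id
    return matches, unmatched
```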

Inactive or unobserved submaps undergo lifecycle transitions (active, inactive, persistent, absent), and change detection routines compare inactive/active submap iso-surfaces to decide whether old geometry should be preserved or pruned.
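
A hedged sketch of such a change-detection test is given below: iso-surface points of an old submap are evaluated against the signed distance of the currently observed map, agreements and conflicts are counted, and a confidence vote decides the submap's state. The tolerance and confidence values and the three-way outcome labels are illustrative assumptions.

```python
import numpy as np

def classify_old_submap(iso_surface_points: np.ndarray,
                        current_sdf,                 # callable: 3D point -> signed distance
                        agree_tol: float = 0.05,     # metres; assumed value
                        min_confidence: float = 0.6  # assumed value
                        ) -> str:
    """Vote on whether an inactive submap's surface is still present.

    Points whose current signed distance is near zero count as agreement;
    points that now lie in observed free space count as conflict; points
    hidden behind new surfaces are treated as unobserved and ignored."""
    d = np.array([current_sdf(p) for p in iso_surface_points], dtype=float)
    agreements = int(np.count_nonzero(np.abs(d) <= agree_tol))
    conflicts = int(np.count_nonzero(d > agree_tol))
    observed = agreements + conflicts
    if observed == 0:
        return "unobserved"
    return "persistent" if agreements / observed >= min_confidence else "absent"
```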

4. Handling Scene Dynamics and Long-Term Consistency

Panoptic Multi-TSDFs maintain geometric and semantic consistency over time, even as the scene evolves:

  • Submaps become inactive if unobserved over a configurable threshold and are tagged with persistent/absent/unobserved state.
  • Change detection: New observations are used to compare old submaps’ iso-surfaces with the current map via signed distance evaluation and conflict/agreement counting; a confidence-based mechanism determines whether to keep, delete, or merge submaps (Schmid et al., 2021).
  • Submap merging re-integrates previously removed or unseen objects, reusing historical geometry and label data.
  • TSDF++ instead uses multi-layer per-voxel storage, allowing a single volumetric grid to represent multiple overlapping object surfaces and restoring occluded geometry upon re-observation (Grinvald et al., 2021).
  • CRF-based refinement further promotes temporal and spatial label consistency after data association and fusion (Hu et al., 25 Mar 2024).

5. Data Structures and Scalability

Each submap is implemented as a hash table of allocated voxel blocks (keyed by spatial location). Higher-level bookkeeping maintains a SubmapCollection, indexed by submap bounding spheres or spatial hash keys, supporting $O(\log N)$ query/retrieval (Schmid et al., 2021, Hu et al., 25 Mar 2024).
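
The sketch below illustrates this collection-level bookkeeping under stated assumptions: submaps are indexed by bounding spheres, and a query point is first filtered by a sphere containment test before any voxel-level access. A KD-tree or spatial hash over sphere centres, as mentioned above, would replace the linear scan for large $N$.

```python
import numpy as np

class SubmapCollection:
    """Bounding-sphere index over submaps (illustrative, not the cited implementation)."""

    def __init__(self) -> None:
        self.submaps = {}   # submap_id -> submap object
        self.spheres = {}   # submap_id -> (centre: np.ndarray, radius: float)

    def add(self, submap_id: int, submap, centre, radius: float) -> None:
        self.submaps[submap_id] = submap
        self.spheres[submap_id] = (np.asarray(centre, dtype=float), float(radius))

    def candidates(self, x) -> list[int]:
        """IDs of submaps whose bounding sphere contains the query point x."""
        x = np.asarray(x, dtype=float)
        return [sid for sid, (c, r) in self.spheres.items()
                if np.linalg.norm(x - c) <= r]
```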

This decomposed submap architecture significantly reduces computational cost for object retrieval and surface meshing: in DHP-Mapping, scanning all voxels for an object in a monolithic 80M-voxel TSDF takes ~628 ms; in the multi-TSDF approach, only the submap’s 61k voxels are queried ($\ll 10$ ms) (Hu et al., 25 Mar 2024). Submap-level operations naturally amortize the cost of heavy combinatorial routines (label fusion, CRF) across the mapping process.

6. Empirical Performance and Evaluation

Experiments conducted on synthetic and real datasets demonstrate accurate geometry and improved panoptic label performance under both simulated and real scene changes, as summarized in the table below.

| Metric | Panoptic Multi-TSDFs (Schmid et al., 2021) | DHP-Mapping (Hu et al., 25 Mar 2024) |
| --- | --- | --- |
| Mean abs. distance | ~1.4 cm (≈ voxel size) | 0.037–0.055 m (Chamfer-L1) |
| Scene coverage | up to 100% | F-score up to 89.8% |
| Labeling accuracy | PQ up to 0.708 | PQ up to 0.708 |
| Runtime | 5–6 Hz (CPU, 640×480) | $\ll 10$ ms per submap query |

With perfect segmentation, Panoptic Multi-TSDFs achieve geometry errors at the voxel-size limit and rapidly recover scene coverage by incorporating prior submaps. DHP-Mapping delivers state-of-the-art geometry and labeling metrics, with CRF post-processing yielding up to 7% absolute improvement in PQ (panoptic quality) and significant runtime advantages in large-scale scenes (Schmid et al., 2021, Hu et al., 25 Mar 2024).

TSDF++ preserves 100% of occluded surfaces, eliminating holes after dynamic occlusions, and maintains sub-centimeter per-object tracking errors on synthetic sequences (Grinvald et al., 2021).

7. Limitations and Research Directions

Current limitations include reliance on panoptic segmentation quality; over- and under-segmentation can introduce or propagate errors in the submap decomposition. While the submap collection avoids redundant geometry, global drift in robot odometry can result in spatial misalignment across submaps over long trajectories; pose-graph SLAM or loop-closure mechanisms are plausible remedies (Schmid et al., 2021).

Short-term dynamic tracking (e.g., fast-moving objects) is not deeply integrated. Scene representation can be further improved by fusing geometric cues into the segmentation pipeline, employing multi-layer per-voxel models for severe clutter, and accelerating key routines via parallelization or GPU hardware. The use of EM-style data association and joint optimization over submap poses and labels are suggested as further avenues (Grinvald et al., 2021, Hu et al., 25 Mar 2024).

Panoptic Multi-TSDFs represent a scalable, principled, and empirically validated solution for online multi-resolution, object-centric mapping of dynamic environments, supporting robust long-term consistency in both geometry and semantics (Schmid et al., 2021, Hu et al., 25 Mar 2024, Grinvald et al., 2021).
