ReScene4D: Temporal 4D Segmentation
- ReScene4D is a framework for temporally consistent 4D semantic instance segmentation of indoor 3D scenes, handling sparse and discontinuous scans.
- It leverages a 3DSIS backbone with spatio-temporal transformers and novel modules like cross-time contrastive loss and 4D masking to fuse temporal information.
- The framework achieves significant gains in t-mAP and mAP by preserving object identity consistency across scans, outperforming state-of-the-art non-temporal approaches.
ReScene4D is a framework for temporally consistent semantic instance segmentation of evolving indoor 3D scenes. It formalizes and addresses the 4D semantic instance segmentation (4DSIS) task: given a set of temporally distinct 3D scans—potentially with long gaps and object rearrangements—the objective is to both segment and temporally associate the full set of object instances, preserving consistent identities for objects as they move, appear, or disappear over time. ReScene4D demonstrates substantial gains over prior art on public datasets, and introduces technical innovations necessary for robust temporal correspondence in sparse, non-continuous 4D settings (Steiner et al., 16 Jan 2026).
1. Formal Problem Specification
ReScene4D operates on a sequence of temporally separated 3D point clouds $P_1, \dots, P_T$ from an evolving scene. For all stages, the model predicts a set of binary masks $M_i$ (across the $N$ total points of the union of all scans) and class labels $c_i$, such that each $M_i$ selects the union of points belonging to a single object instance throughout the sequence, regardless of spatial or temporal discontiguity. Temporal identity consistency is a requirement: an object's mask must maintain its semantic and instance identity over time, even as its geometry or position changes, or if it is temporarily unobserved.
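A minimal illustration of this output format (the helper name and toy numbers are hypothetical): one trajectory mask is defined over the concatenated points of all scans and can be sliced back into per-stage masks.

```python
import numpy as np

def split_mask_by_stage(mask, stage_ids, num_stages):
    """Slice one trajectory mask (length N, over the concatenated scans)
    into per-stage masks using each point's scan/stage index."""
    return [mask & (stage_ids == t) for t in range(num_stages)]

# Toy example: 3 scans with 4, 5, and 3 points respectively (N = 12).
stage_ids = np.array([0]*4 + [1]*5 + [2]*3)
chair = np.zeros(12, dtype=bool)
chair[[1, 2, 6, 7]] = True     # the same chair observed in scans 0 and 1, unobserved in scan 2
print([int(m.sum()) for m in split_mask_by_stage(chair, stage_ids, 3)])   # -> [2, 2, 0]
```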
This temporally sparse regime sharply contrasts with assumptions in 4D LiDAR tracking (which exploits dense, high-frequency data) or conventional 3DSIS (which does not encode temporal reasoning), necessitating explicit temporal association mechanisms.
2. Model Architecture and Temporal Correspondence
ReScene4D adapts a 3DSIS backbone (e.g., Mask3D, Mask2Former) to fuse and reason over sparse temporal sequences. It ingests the union of all scans, voxelized into a 4D grid with time as the fourth dimension. For each scan $P_t$, a hierarchical feature extractor (a sparse-convolutional U-Net (Minkowski) or a pre-trained point transformer, PTv3 Sonata/Concerto) produces per-stage features at multiple spatial resolutions.
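A sketch of how such a 4D grid can be assembled; the coordinate convention and quantization shown here are assumptions, not the paper's exact implementation.

```python
# Sketch: quantize each scan at a fixed voxel size and append the scan index as a
# fourth, temporal coordinate before concatenating all scans into one 4D point set.
import torch

def build_4d_coordinates(scans, voxel_size=0.02):
    """scans: list of (num_points, 3) float tensors, one per temporal stage."""
    coords = []
    for t, pts in enumerate(scans):
        xyz = torch.floor(pts / voxel_size).int()                  # spatial voxel indices
        tt = torch.full((xyz.shape[0], 1), t, dtype=torch.int32)   # temporal index
        coords.append(torch.cat([xyz, tt], dim=1))                 # (num_points, 4)
    return torch.cat(coords, dim=0)                                # union over all stages
```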
Instance queries are maintained and decoded through a spatio-temporal masked transformer. At each decoder layer, masked cross-attention is performed over the concatenated features of all stages, with queries augmented by 4D Fourier positional encodings on $(x, y, z, t)$. Self-attention among queries encourages mutual exclusivity. Each query predicts masks and class logits at every temporal stage.
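A minimal sketch of 4D Fourier positional encodings on $(x, y, z, t)$; the number of frequencies and their spacing are assumptions, not the paper's exact recipe.

```python
import torch

def fourier_encode_4d(coords, num_freqs=6):
    """coords: (..., 4) tensor of (x, y, z, t) positions.
    Returns sin/cos features at geometrically spaced frequencies."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=coords.dtype, device=coords.device)  # (F,)
    angles = coords[..., None] * freqs                               # (..., 4, F)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., 4, 2F)
    return enc.flatten(start_dim=-2)                                 # (..., 4 * 2F) per query/point
```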
To enable robust temporal information sharing, ReScene4D introduces three modules:
- Cross-Time Contrastive Loss: Enforces similarity between superpoint features from the same object across scans using InfoNCE with a log-odds normalized cosine similarity (a simplified sketch of this loss, together with the masking below, follows this list).
- Spatio-Temporal Masking: Pools per-query masks across stages via logical OR and uses the resulting spatial support to restrict cross-attention, guiding queries to be consistent over time.
- Spatio-Temporal Decoder Serialization: Randomizes neighborhood serialization patterns (including 4D space-filling curves) during transformer decoding to enhance temporal receptive field, especially when using PTv3 backbones.
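A simplified sketch of the first two modules, under stated assumptions: a standard InfoNCE over temperature-scaled cosine similarities stands in for the log-odds normalized similarity, and the OR-pooling assumes per-stage masks defined over a common, aligned set of points.

```python
# Simplified sketches (not the exact paper formulation) of two temporal-fusion modules.
import torch
import torch.nn.functional as F

def cross_time_infonce(feats_a, feats_b, temperature=0.07):
    """Cross-time contrastive loss: feats_a and feats_b are (M, dim) superpoint
    features of the same M objects observed in two different scans, row-aligned."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.t() / temperature                     # temperature-scaled cosine similarities
    targets = torch.arange(a.shape[0], device=a.device)
    return F.cross_entropy(logits, targets)              # pull same-object pairs together

def pooled_attention_support(per_stage_masks):
    """Spatio-temporal masking: OR-pool one query's per-stage masks so its
    cross-attention support covers the object's points at every stage.
    per_stage_masks: (num_stages, num_points) boolean tensor over aligned points."""
    return per_stage_masks.any(dim=0)
```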
Instance prediction is optimized by bipartite Hungarian assignment, minimizing a composite loss of classification, binary cross-entropy, and Dice terms, plus the temporal contrastive loss weighted by a scalar coefficient. Non-object queries are discouraged via an additional penalty.
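A sketch of the bipartite assignment step; the cost terms mirror the loss components named above, but the weights and exact cost formulation are assumptions rather than the paper's values.

```python
# Hungarian matching of instance queries to ground-truth instances with a
# composite (class + BCE + Dice) cost; weights here are illustrative only.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries_to_gt(cls_prob, pred_masks, gt_labels, gt_masks,
                        w_cls=2.0, w_bce=5.0, w_dice=2.0):
    """cls_prob: (Q, num_classes) softmax scores; pred_masks: (Q, N) in [0, 1];
    gt_labels: (G,) class indices; gt_masks: (G, N) binary. Returns matched index pairs."""
    eps = 1e-6
    cost_cls = -cls_prob[:, gt_labels]                                    # (Q, G)
    bce = -(gt_masks[None] * np.log(pred_masks[:, None] + eps)
            + (1 - gt_masks[None]) * np.log(1 - pred_masks[:, None] + eps)).mean(-1)
    inter = (pred_masks[:, None] * gt_masks[None]).sum(-1)
    dice = 1 - (2 * inter + eps) / (pred_masks.sum(-1)[:, None] + gt_masks.sum(-1)[None] + eps)
    cost = w_cls * cost_cls + w_bce * bce + w_dice * dice
    return linear_sum_assignment(cost)     # optimal one-to-one query/GT assignment
```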
3. Temporal mAP (t-mAP): A Metric for 4DSIS
Standard mAP metrics fail to penalize identity switches or fragmentation across time, as they treat each scan independently. ReScene4D introduces the t-mAP metric, explicitly designed for 4DSIS:
- t-IoU Definition: For a predicted and a ground-truth instance trajectory, t-IoU is the minimum IoU achieved across all stages (sketched below). Stages where both are absent are ignored; if a prediction exists at a stage where the GT is absent, the IoU at that stage is 0.
- t-mAP Calculation: A greedy assignment ensures a unique predicted trajectory per ground-truth instance. Precision–recall is computed per class over all predictions at a set of t-IoU thresholds, and the results are averaged to obtain t-AP per class and the final t-mAP.
By construction, t-mAP collapses to standard mAP for single-scan sequences ($T = 1$), but penalizes temporal inconsistencies, including identity switches, merges, and fragmentations, as sequence length increases.
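A minimal sketch of the t-IoU computation described above (the greedy trajectory assignment and precision–recall aggregation are omitted):

```python
import numpy as np

def temporal_iou(pred_stages, gt_stages):
    """pred_stages, gt_stages: per-stage boolean point masks for one predicted
    and one ground-truth trajectory. Returns the minimum per-stage IoU."""
    ious = []
    for pred, gt in zip(pred_stages, gt_stages):
        if not pred.any() and not gt.any():
            continue                                  # both absent: stage ignored
        if pred.any() and not gt.any():
            ious.append(0.0)                          # predicted where GT is absent
            continue
        union = np.logical_or(pred, gt).sum()
        inter = np.logical_and(pred, gt).sum()
        ious.append(inter / union if union > 0 else 0.0)
    return min(ious) if ious else 0.0
```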
4. Experimental Protocols and Quantitative Results
Experiments are conducted on the 3RScan dataset (478 environments, 1428 scans), with all length-2 sequences built by pairing each primary scan with its rescans. Single-scan ScanNet data is added to improve coverage. Inputs are processed at a 2 cm voxel size, with a fixed budget of instance queries and mixed-stage batches. Training runs for 450 epochs with AdamW and a 1-cycle learning rate schedule.
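A sketch of the reported optimization setup; the learning rate, weight decay, and steps-per-epoch below are placeholders, not values from the paper.

```python
import torch

model = torch.nn.Linear(16, 16)           # stand-in for the ReScene4D model
steps_per_epoch, epochs = 1000, 450       # 450 epochs per the text; steps_per_epoch assumed

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-4, epochs=epochs, steps_per_epoch=steps_per_epoch)

# Per training step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```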
Ablation studies use frozen PTv3 (Concerto) encoders with the decoder trained from scratch, or substitute other backbones (Minkowski, Sonata). Hardware configurations range from 2 to 8 NVIDIA H100 GPUs, with runtimes of 26–42 hours depending on the architecture.
ReScene4D shows substantial improvements over prior methods:
| Method | t-mAP | mAP |
|---|---|---|
| Mask3D+geo match | 20.7 | 29.7 |
| ReScene4D (C) | 34.8 | 43.3 |
ReScene4D attains +14 t-mAP and +21 mAP over the best non-temporal baseline, with even the Minkowski variant outpacing all baselines by 11 t-mAP. Per-stage 3DSIS mAP is also improved, demonstrating that temporal sharing not only yields identity consistency but also fuses information to enhance per-scan performance.
5. Analysis of Temporal Fusion Mechanisms
Ablation analysis reveals the impact of each temporal information sharing module:
- Cross-time contrastive loss yields a +5.7 t-mAP boost over the transformer baseline alone.
- Spatio-temporal serialization combined with contrastive loss gives the highest overall t-mAP (34.8) and excels on rigid instances.
- Spatio-temporal masking achieves highest recall for non-rigid deformations.
- Combining all three modules delivers the highest recall on non-rigid objects, though at a marginally lower overall t-mAP; this pattern suggests diminishing returns and reflects the dominance of rigid or unchanged instances in current datasets.
Table summarizing ablation outcomes (PTv3 Concerto):
| Modules | t-mAP | t-mRec | Rigid | Non-Rigid |
|---|---|---|---|---|
| None | 28.4 | 41.8 | 44.9 | 62.1 |
| Contrastive | 34.1 | 49.6 | 48.4 | 63.2 |
| ST-serial+contr. | 34.8 | 52.1 | 48.6 | 66.5 |
| All combined | 33.3 | 53.0 | 56.4 | 68.0 |
Pretrained backbones exhibit different preferences for fusion modules, depending on their initial feature representational capacity.
6. Comparative Context and Future Directions
Compared to prior 4DSIS and 3DSIS pipelines, ReScene4D is the first end-to-end framework optimized for sparse 4D temporal reasoning. Unlike LiDAR tracking, which leverages dense, high-frame-rate data, or multi-view reconstruction approaches such as U4D (Mustafa et al., 2019), ReScene4D directly addresses the problem posed by infrequent, incomplete 3D scans with complex object transformations.
The introduction of t-mAP as a metric aligns evaluation with the unique demands of the evolving indoor scene domain—rewarding accurate instance identity preservation rather than per-frame mask accuracy alone.
Current limitations include the scarcity of objects with substantial geometric or semantic changes in existing datasets, which makes it difficult to reach the theoretical benefits of full temporal fusion. Computational cost scales with sequence length, motivating research into efficient attention/query architectures and scalability for longer sequences or larger scene graphs. Integration with advanced query/backbone architectures (e.g., Relation3D, Competitor) and extension to more dynamic scenes are indicated as promising future directions (Steiner et al., 16 Jan 2026).
7. Relationship to Related 4D Methods
ReScene4D fundamentally differs from direct 4D reconstruction approaches such as:
- Unsupervised multi-view dynamic scene mesh and flow estimation: Systems such as U4D produce joint mesh, label, and flow sequences through global graph-cut inference over per-view depth, labels, and flow, rely on strong photometric and pose-based priors, and require temporally dense, calibrated multi-view video (Mustafa et al., 2019).
- Neural field-based decomposition (e.g., DRSM): Methods like DRSM focus on dense radiance field modeling with explicit plane factorization, operate principally in stationary monocular or multi-view video settings, and emphasize photometric/color reconstruction and fast supervised optimization (Xie et al., 2024). They typically do not solve the instance-level segmentation or temporal identity tracking with sparse scans addressed by ReScene4D.
A plausible implication is that ReScene4D’s approach to temporally consistent instance-level reasoning complements, rather than supersedes, dense geometric or neural field 4D scene representations, as it operates robustly in scenarios with few or incomplete temporal observations and prioritizes semantic consistency over radiance replication.
References:
- "ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes" (Steiner et al., 16 Jan 2026)
- "U4D: Unsupervised 4D Dynamic Scene Understanding" (Mustafa et al., 2019)
- "DRSM: efficient neural 4d decomposition for dynamic reconstruction in stationary monocular cameras" (Xie et al., 2024)