Towards Learning to Complete Anything in Lidar
The research paper titled Towards Learning to Complete Anything in Lidar introduces a novel approach to Lidar-based zero-shot shape completion, moving beyond taxonomy-bound methods by inferring complete 3D object shapes and scene layouts from partial observations. The paper addresses the limitations of existing Lidar semantic and panoptic scene completion systems, which rely on predefined semantic categories and supervised learning from labels in existing Lidar datasets. The authors propose a model termed Complete Anything in Lidar (CAL), which completes a scene from a single Lidar scan and recognizes objects regardless of whether they belong to previously labeled categories.
Methodology
The key innovation of the proposed method lies in decoupling the dependency on fixed vocabulary labels by leveraging the temporal context of sequences of multi-modal sensor data. The approach mines object shapes and semantic features from visual data sequences, which are then distilled into a Lidar-only model for instance-level object shape completion and recognition. The model learns to infer complete shapes from partial observations distributed across the training data.
In practice, the method employs visual foundation models for segmentation, such as SAM, to segment and track objects across the RGB video frames. These foundation models, trained on large and diverse datasets, localize object instances across the sequence. Once localized, the 2D object mask proposals are lifted into the Lidar coordinate space using known camera-to-Lidar transformations. The resulting masklets (per-object mask tracks) are temporally aggregated and backprojected to form voxelized 3D shape representations. In parallel, semantic representations are extracted as CLIP features, which are likewise aggregated over time so that objects can be recognized beyond a fixed set of class labels via text prompts.
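The core geometric step, projecting Lidar points into the image and testing them against a 2D instance mask, can be illustrated with a short sketch. The function name, the 4x4 camera-from-Lidar extrinsic T_cam_lidar, and the intrinsic matrix K below are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of lifting a 2D instance mask into Lidar space, assuming a known
# 4x4 camera-from-Lidar extrinsic T_cam_lidar and a 3x3 pinhole intrinsic matrix K.
import numpy as np

def lift_mask_to_lidar(points_lidar, mask_2d, T_cam_lidar, K):
    """Return the subset of Lidar points whose image projection falls inside mask_2d."""
    # Homogeneous transform of the point cloud into the camera frame.
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    in_front = pts_cam[:, 2] > 0
    pts_cam = pts_cam[in_front]

    # Pinhole projection to pixel coordinates.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)

    # Bounds check against the image, then test membership in the instance mask.
    H, W = mask_2d.shape
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    hit = np.zeros(valid.shape, dtype=bool)
    hit[valid] = mask_2d[v[valid], u[valid]]

    return points_lidar[in_front][hit]
```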
To address the partial visibility inherent in Lidar data, a pseudo-labeling engine aggregates multi-view temporal observations and refines the resulting shape pseudo-labels with a Conditional Random Field (CRF) step. This operationalizes the transfer of 2D visual knowledge to 3D Lidar scenes, coupling semantic features with spatial masks.
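The aggregation idea can be sketched as pooling backprojected masklet points from multiple frames in a shared world frame and voxelizing them. The names, voxel size, and hit threshold below are assumptions, and the CRF refinement step is not reproduced here.

```python
# Illustrative sketch of temporal aggregation: masklet points from several frames are
# pooled in a common world frame and voxelized, keeping voxels observed often enough.
import numpy as np
from collections import Counter

def aggregate_masklet(points_per_frame, poses, voxel_size=0.2, min_hits=2):
    """points_per_frame: list of (N_i, 3) arrays in the sensor frame.
    poses: list of 4x4 sensor-to-world transforms, one per frame."""
    counts = Counter()
    for pts, T in zip(points_per_frame, poses):
        # Transform each frame's points into the shared world frame.
        pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])
        world = (T @ pts_h.T).T[:, :3]
        # Quantize to voxel indices and count observations per voxel.
        idx = np.floor(world / voxel_size).astype(int)
        counts.update(map(tuple, idx))
    # Keep voxels supported by enough observations as the pseudo-label shape
    # (returned here as voxel corner coordinates).
    return np.array([v for v, c in counts.items() if c >= min_hits]) * voxel_size
```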
Model Architecture and Training
The model consists of a sparse generative U-Net backbone with a dual-path architecture. Occupancy estimates are produced at multiple decoding scales, while instance-level predictions come from a novel transformer-based instance decoder. For each instance, this decoder predicts both an occupancy mask and a CLIP-based semantic embedding, so the model produces segmentation masks alongside open-vocabulary semantic features that can be queried for semantic completion tasks at inference time.
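A minimal sketch of such a query-based instance head, built from standard PyTorch components, illustrates the dual outputs: per-query mask logits over scene features and a per-query CLIP-sized embedding. This is an illustration of the idea, not the paper's actual architecture.

```python
# Hypothetical query-based instance head: each learned query attends to scene features
# and yields both mask logits and an L2-normalized, CLIP-dimensional embedding.
import torch
import torch.nn as nn

class InstanceHead(nn.Module):
    def __init__(self, d_model=256, clip_dim=512, num_queries=100):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.mask_proj = nn.Linear(d_model, d_model)   # projects queries for the mask dot-product
        self.clip_proj = nn.Linear(d_model, clip_dim)  # per-instance open-vocabulary embedding

    def forward(self, voxel_feats):                    # voxel_feats: (B, V, d_model)
        B = voxel_feats.shape[0]
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        q, _ = self.attn(q, voxel_feats, voxel_feats)  # queries attend to scene features
        # Mask logits: similarity between each query and each voxel feature.
        mask_logits = torch.einsum('bqd,bvd->bqv', self.mask_proj(q), voxel_feats)
        clip_embed = nn.functional.normalize(self.clip_proj(q), dim=-1)
        return mask_logits, clip_embed
```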
Loss functions used during training include an occupancy loss, a mask loss, and a cosine-similarity loss for CLIP feature alignment, ensuring the network learns to segment, complete, and semantically classify scene objects.
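As a rough illustration, the three training signals could be combined as follows; the specific loss forms and weights are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of the combined training objective: BCE for occupancy, BCE for instance
# masks, and a cosine-distance term aligning predicted embeddings with CLIP targets.
import torch
import torch.nn.functional as F

def completion_loss(occ_logits, occ_gt, mask_logits, mask_gt, clip_pred, clip_target,
                    w_occ=1.0, w_mask=1.0, w_clip=1.0):
    # Scene-level occupancy supervision.
    l_occ = F.binary_cross_entropy_with_logits(occ_logits, occ_gt)
    # Instance-level mask supervision (matched predictions vs. pseudo-label masks).
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)
    # Cosine distance between predicted and aggregated target CLIP features.
    l_clip = (1.0 - F.cosine_similarity(clip_pred, clip_target, dim=-1)).mean()
    return w_occ * l_occ + w_mask * l_mask + w_clip * l_clip
```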
Results and Implications
The results demonstrate the model's capability to perform zero-shot completion and recognition, achieving promising performance on the SemanticKITTI and SSCBench-KITTI360 benchmarks relative to supervised baselines. Even without training data labeled for specific classes, CAL achieves a meaningful portion of the accuracy of state-of-the-art supervised systems. An auxiliary learning task based on semantic feature prototypes further improves robustness by introducing implicit semantic structure during training.
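At inference, open-vocabulary recognition of this kind typically reduces to comparing predicted instance embeddings against CLIP text embeddings of prompted class names; the sketch below assumes precomputed text features and a simple argmax rule, which may differ from the paper's exact procedure.

```python
# Illustrative zero-shot classification: match each instance embedding to the most
# similar text embedding (text features would come from a CLIP text encoder, not shown).
import torch
import torch.nn.functional as F

def classify_instances(instance_embeds, text_embeds, class_names):
    """instance_embeds: (N, D) predicted per-instance features.
    text_embeds: (C, D) CLIP text features for the prompted class names."""
    sims = F.normalize(instance_embeds, dim=-1) @ F.normalize(text_embeds, dim=-1).T
    best = sims.argmax(dim=-1)
    return [class_names[i] for i in best]
```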
The implications of this research extend to various domains, notably autonomous driving and robotics, where robust scene understanding and dynamic adaptation of perception systems are critical. By utilizing a zero-shot paradigm, this method promises adaptability in environments with evolving class definitions and provides a framework for training on large volumes of unlabeled data, reducing the reliance on labor-intensive annotation processes. Additionally, the ability to recognize and categorize unseen objects potentially streamlines object detection pipelines in novel settings.
Conclusion
This paper marks a significant step in the evolution of Lidar-based scene completion, proposing a methodology that circumvents label dependency through cross-modal semantic distillation and open-vocabulary recognition. While challenges remain in fully closing the accuracy gap with supervised methods, particularly for rare object classes and challenging scene configurations, the work lays essential groundwork for future adaptable and generalizable Lidar perception systems.