Learning Class Prototypes for Unified Sparse Supervised 3D Object Detection

Published 27 Mar 2025 in cs.CV | (2503.21099v2)

Abstract: Both indoor and outdoor scene perceptions are essential for embodied intelligence. However, current sparse supervised 3D object detection methods focus solely on outdoor scenes without considering indoor settings. To this end, we propose a unified sparse supervised 3D object detection method for both indoor and outdoor scenes through learning class prototypes to effectively utilize unlabeled objects. Specifically, we first propose a prototype-based object mining module that converts the unlabeled object mining into a matching problem between class prototypes and unlabeled features. By using optimal transport matching results, we assign prototype labels to high-confidence features, thereby achieving the mining of unlabeled objects. We then present a multi-label cooperative refinement module to effectively recover missed detections through pseudo label quality control and prototype label cooperation. Experiments show that our method achieves state-of-the-art performance under the one object per scene sparse supervised setting across indoor and outdoor datasets. With only one labeled object per scene, our method achieves about 78%, 90%, and 96% performance compared to the fully supervised detector on ScanNet V2, SUN RGB-D, and KITTI, respectively, highlighting the scalability of our method. Code is available at https://github.com/zyrant/CPDet3D.


Authors (6)

Summary

The paper introduces a prototype-based mining module that clusters features of labeled objects into class prototypes and matches them against unlabeled features to assign labels to unlabeled objects.
It integrates a multi-label cooperative refinement module that combines pseudo and prototype labels to recover missed detections.
With only one labeled object per scene, the method reaches about 78%, 90%, and 96% of fully supervised performance on ScanNet V2, SUN RGB-D, and KITTI, respectively.

Sparse Supervised 3D Object Detection via Class Prototypes

The paper "Learning Class Prototypes for Unified Sparse Supervised 3D Object Detection" (2503.21099) introduces a novel approach to 3D object detection under sparse supervision, applicable to both indoor and outdoor environments. It addresses the limitations of existing sparse supervision methods that are primarily designed for outdoor scenes and rely on ground truth (GT) sampling strategies unsuitable for indoor environments. The method leverages class prototypes to mine unlabeled objects effectively and refines detections through a multi-label cooperative approach. With only one labeled object per scene, the method achieves approximately 78%, 90%, and 96% of the performance of a fully supervised detector on ScanNet V2, SUN RGB-D, and KITTI datasets, respectively.

Prototype-based Object Mining

The paper addresses the challenge of limited annotations by introducing a prototype-based object mining module (Figure 1). This module converts the problem of mining unlabeled objects into a matching problem between class prototypes and unlabeled features.

Figure 1: Comparison of sparse supervised 3D object detection methods. Previous methods rely on the premise of full category coverage within each scene, achieved through a GT sampling strategy, as represented in (a). This approach encounters limitations in indoor scenes, which have scene-specific categories. In contrast, we propose a unified sparse 3D object detection scheme (b) applicable to both indoor and outdoor scenes, utilizing nearest prototype retrieval for effective object mining.

The module consists of two key components: class-aware prototype clustering and prototype label matching. Class-aware prototype clustering learns class-specific prototypes by clustering features of labeled objects across different scenes. This allows the model to capture the semantic feature distributions of different classes. Prototype label matching then assigns labels to unlabeled objects based on the learned prototypes. The method uses optimal transport to establish correspondences between prototypes and unlabeled features, assigning category labels to high-confidence features. This approach enables the mining of unlabeled objects beyond the limitations of individual scenes.
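The two components can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not taken from the paper: prototypes are approximated by the mean of L2-normalized labeled features per class, the optimal transport problem is solved with a few Sinkhorn iterations under uniform marginals, and the names `class_prototypes`, `sinkhorn`, `mine_labels`, `epsilon`, and `conf_thresh` are hypothetical. The authors' actual clustering and matching procedure may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def class_prototypes(features, labels, num_classes):
    """Assumed stand-in for clustering: one mean feature per class."""
    feats = l2_normalize(features)
    protos = np.zeros((num_classes, feats.shape[1]))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = feats[mask].mean(axis=0)
    return l2_normalize(protos)

def sinkhorn(scores, epsilon=0.05, n_iters=50):
    """Entropic OT plan between N features (rows) and K prototypes
    (columns), assuming uniform row and column marginals."""
    q = np.exp((scores - scores.max()) / epsilon)
    n, k = q.shape
    q /= q.sum()
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True) * n  # each row sums to 1/n
        q /= q.sum(axis=0, keepdims=True) * k  # each column sums to 1/k
    # Renormalize rows so each one is a distribution over prototypes.
    return q / q.sum(axis=1, keepdims=True)

def mine_labels(unlabeled, protos, conf_thresh=0.9):
    """Assign a prototype label to confident features, -1 otherwise."""
    scores = l2_normalize(unlabeled) @ protos.T  # cosine similarity
    plan = sinkhorn(scores)
    labels = plan.argmax(axis=1)
    conf = plan.max(axis=1)
    labels[conf < conf_thresh] = -1  # leave ambiguous features unlabeled
    return labels, conf
```

Features whose transport mass is split across several prototypes fall below `conf_thresh` and stay unlabeled, which mirrors the idea of assigning labels only to high-confidence features. The uniform column marginal assumes roughly balanced classes, which is a simplification.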

The t-SNE visualization (Figure 2) illustrates the distribution of class-aware prototypes before and after the warm-up phase on ScanNet V2.

Figure 2: t-SNE results of class-aware prototypes before and after warm-up on ScanNet V2.

To address missed detections, the paper introduces a multi-label cooperative refinement module that integrates pseudo-labels with prototype labels. The module has two parts. Iterative pseudo-labeling generates high-quality pseudo-labels by filtering out inaccurate predictions with classification score thresholds and IoU filtering. Prototype label cooperation then uses prototype labels to fill in missed detections, assigning labels to undetected objects in the residual foreground areas. Together, these steps refine the detection results and recover missed objects.
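The pseudo-label quality control step can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: boxes are axis-aligned in the format `(x1, y1, z1, x2, y2, z2)`, "IoU filtering" is interpreted as discarding predictions that overlap an existing sparse label or a higher-scoring pseudo-label by more than `alpha_iou`, and the threshold values and the function names are hypothetical.

```python
import numpy as np

def iou_3d(box, boxes):
    """IoU between one axis-aligned box and an (M, 6) array of boxes."""
    lo = np.maximum(box[:3], boxes[:, :3])
    hi = np.minimum(box[3:], boxes[:, 3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None), axis=1)
    vol = np.prod(box[3:] - box[:3])
    vols = np.prod(boxes[:, 3:] - boxes[:, :3], axis=1)
    return inter / (vol + vols - inter)

def filter_pseudo_labels(pred_boxes, pred_scores, sparse_boxes,
                         score_thresh=0.9, alpha_iou=0.25):
    """Return indices of predictions kept as pseudo-labels."""
    kept = [np.asarray(b, dtype=float) for b in sparse_boxes]
    selected = []
    # Visit predictions from most to least confident.
    for i in np.argsort(-pred_scores):
        if pred_scores[i] < score_thresh:
            break  # all remaining predictions score lower
        # Assumed IoU filter: skip boxes overlapping an already kept label.
        if kept and iou_3d(pred_boxes[i], np.stack(kept)).max() > alpha_iou:
            continue
        kept.append(pred_boxes[i])
        selected.append(int(i))
    return selected
```

In this reading, a stricter `score_thresh` trades recall for precision, and regions left uncovered after filtering are where prototype labels would be expected to contribute.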

Architecture and Implementation Details

The overall architecture of the proposed method is depicted in Figure 3.

Figure 3: The architecture of our method for sparse supervised 3D object detection is as follows. Given a point cloud and a detector, we first project the features from the detector and cluster them into class-aware prototypes. Based on the learned similarity between prototypes and features, we assign pseudo labels to unlabeled objects. Next, we introduce an effective refinement module that cooperatively utilizes sparse, pseudo, and prototype labels to reduce missed detections during iterative training.

Given a point cloud and a detector, the features are projected and clustered into class-aware prototypes. Based on the similarity between prototypes and features, pseudo-labels are assigned to unlabeled objects. The refinement module then uses sparse, pseudo, and prototype labels to reduce missed detections during iterative training.

The method employs a two-stage training paradigm. In the first stage, an initial detector is trained using sparse annotations and the prototype-based mining module. The loss function for this stage includes a detection loss $\mathcal{L}_{det}$, a prototype classification loss $\mathcal{L}_{pcls}$, and an InfoNCE loss $\mathcal{L}_{pcon}$. In the second stage, the initial model generates pseudo-labels, and the multi-label cooperative refinement module is introduced. The total loss for the second stage is the sum of the first-stage loss and a refinement loss $\mathcal{L}_{ref}$.
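Written out, the two-stage objective described above takes roughly the following form. The unit weights on each term and the symbols $\mathcal{L}_{stage1}$ and $\mathcal{L}_{stage2}$ are assumptions made for illustration, since the summary does not state any weighting coefficients. The last line gives the generic InfoNCE form for a feature $f$, its positive prototype $p^{+}$, the $K$ class prototypes $p_k$, and a temperature $\tau$; the paper's exact contrastive term may differ.

```latex
\begin{align}
\mathcal{L}_{stage1} &= \mathcal{L}_{det} + \mathcal{L}_{pcls} + \mathcal{L}_{pcon}, \\
\mathcal{L}_{stage2} &= \mathcal{L}_{stage1} + \mathcal{L}_{ref}, \\
\mathcal{L}_{pcon} &= -\log \frac{\exp(f \cdot p^{+} / \tau)}{\sum_{k=1}^{K} \exp(f \cdot p_k / \tau)}.
\end{align}
```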

Experimental Results

The experimental results demonstrate the effectiveness of the proposed method on both indoor and outdoor datasets. With only one labeled object per scene, the method achieves about 78% and 90% of the performance of a fully supervised detector on ScanNet V2 and SUN RGB-D, respectively, and about 96% on KITTI. Figure 4 shows qualitative results on the ScanNet V2, SUN RGB-D, and KITTI validation sets.

Figure 4: Visualization results of our method on the ScanNet V2, SUN RGB-D, and KITTI validation sets trained under one object per scene sparse supervised setting.

Ablation studies on the ScanNet V2 dataset validate the contribution of each component of the method. Figure 5 shows the ablation over $\alpha_{iou}$.

Figure 5: Ablation study of $\alpha_{iou}$

The results show that the multi-label cooperative refinement module and the class-aware prototype clustering contribute significantly to the overall performance. The paper also analyzes the precision and recall of the mined labels: the pseudo-labels reach a precision of 95.5%, while the prototype labels reach about 71.1%. Combining sparse, prototype, and pseudo labels yields a mean average recall (mAR) of 67.1%, confirming that the three label types are complementary. Figures 6 and 7 show per-category analyses of the number of labels on ScanNet V2 and SUN RGB-D, respectively.
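To make the label-quality metrics concrete, the following sketch computes precision and recall of a set of mined labels against ground-truth boxes. It is an illustration only: the matching protocol (greedy, class-aware, axis-aligned IoU with a hypothetical `iou_thresh` of 0.5) is an assumption, the function names are invented here, and the sketch reports overall recall rather than the class-averaged mAR quoted above.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, z1, x2, y2, z2)."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    union = np.prod(a[3:] - a[:3]) + np.prod(b[3:] - b[:3]) - inter
    return inter / union

def precision_recall(mined_boxes, mined_cls, gt_boxes, gt_cls, iou_thresh=0.5):
    """Greedy one-to-one matching of mined labels to same-class GT boxes."""
    mined_boxes = np.asarray(mined_boxes, dtype=float)
    gt_boxes = np.asarray(gt_boxes, dtype=float)
    matched_gt = set()
    true_pos = 0
    for box, c in zip(mined_boxes, mined_cls):
        best_iou, best_j = 0.0, -1
        for j, (gt, gc) in enumerate(zip(gt_boxes, gt_cls)):
            if j in matched_gt or gc != c:
                continue  # GT already used, or class mismatch
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thresh:
            matched_gt.add(best_j)
            true_pos += 1
    precision = true_pos / len(mined_boxes) if len(mined_boxes) else 0.0
    recall = true_pos / len(gt_boxes) if len(gt_boxes) else 0.0
    return precision, recall
```

Under this definition, a label source can have high precision but low recall on its own, which is why combining several sources can raise recall.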

Conclusion

The paper presents a unified sparse supervised 3D object detection method for indoor and outdoor environments. The method leverages class prototypes and multi-label cooperative refinement to effectively utilize unlabeled objects and improve detection accuracy. The experimental results demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance on multiple datasets under sparse supervision. The approach of learning and leveraging class prototypes offers a promising direction for future research in 3D object detection with limited annotations.
