- The paper introduces a mixed-grained supervision framework that blends abundant coarse labels with scarce fine labels to maintain robust 3D detection performance.
- It redesigns the label assignment process for various detectors and incorporates PointSAM for automated coarse labeling via instance segmentation.
- Experiments on nuScenes, Waymo, and KITTI demonstrate that MixSup achieves over 90% of fully supervised performance while using as little as 10% of the precise labels.
MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection
In the domain of autonomous driving, LiDAR-based 3D object detection systems are hampered by the high cost and complexity of obtaining accurately labeled training data. The paper MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection addresses this challenge with a novel paradigm called MixSup, which combines large quantities of inexpensive, coarse-grained labels with a small number of precise, fine-grained labels. This hybrid approach aims to improve label efficiency while preserving the robustness and performance of the detection models.
Observations and Motivation
The authors start by examining several intrinsic properties of point clouds:
- Texture Absence: Point clouds lack distinctive textures and appearances, complicating semantic learning tasks.
- Scale Invariance: Objects in point clouds retain their real-world scale regardless of distance from the sensor, unlike objects in 2D images, whose apparent size shrinks with distance.
- Geometric Richness: Point clouds are inherently rich in geometric information, making the estimation of object shapes and poses more straightforward.
These observations form the foundation of the MixSup paradigm. The authors posit that semantic learning is the hard part and therefore demands massive labels, whereas geometry estimation is comparatively easy, so accurate geometric labels are needed only in small quantities. They accordingly propose coarse clusters of points for semantic supervision and a limited set of accurately labeled bounding boxes for geometric learning; a minimal sketch of this split follows.
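To make this division of labor concrete, here is a minimal NumPy sketch of how a per-object training loss could combine the two supervision signals. The names (`MixedLabel`, `mixed_loss`) and the plain L1 box term are illustrative assumptions, not the paper's actual API or loss.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class MixedLabel:
    """One annotated object under mixed-grained supervision (illustrative)."""
    class_id: int              # cheap semantic label attached to a coarse cluster
    box: Optional[np.ndarray]  # (7,) accurate box [x, y, z, l, w, h, yaw];
                               # present for only a small fraction of objects

def cross_entropy(logits: np.ndarray, target: int) -> float:
    z = logits - logits.max()            # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[target])

def mixed_loss(cls_logits: np.ndarray, box_pred: np.ndarray,
               label: MixedLabel) -> float:
    # Semantic term: supervised by every object, since cluster-level
    # class labels are abundant and cheap.
    loss = cross_entropy(cls_logits, label.class_id)
    # Geometric term: only where an accurate box exists (the scarce ~10%).
    if label.box is not None:
        loss += float(np.abs(box_pred - label.box).mean())  # smooth-L1 in practice
    return loss

# Example: an object with only a coarse semantic label contributes no box loss.
coarse_only = MixedLabel(class_id=1, box=None)
print(mixed_loss(np.array([0.2, 1.5, -0.3]), np.zeros(7), coarse_only))
```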
Key Contributions
Redesigning Label Assignment
The most crucial part of adapting MixSup is redesigning the label assignment module for integration with existing detectors. This adaptation ensures that the coarse and fine labels work seamlessly with different detector families: point-based, voxel-based, and hybrid. The paper categorizes label assignments into two types (both are sketched in code after this list):
- Center-based Assignment: Redefines centers using the cluster labels to adapt the detection pipeline for models like CenterPoint.
- Box-based Assignment: Uses a newly defined box-cluster IoU metric to match cluster labels with proposals, which is essential for two-stage detectors like PV-RCNN and anchor-based methods like SECOND.
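A minimal NumPy sketch of both schemes follows. The point-in-box ratio here is a simple stand-in for the paper's box-cluster IoU, and the axis-aligned box (ignoring yaw) is a simplification; both are assumptions for illustration only.

```python
import numpy as np

def cluster_center(cluster_pts: np.ndarray) -> np.ndarray:
    # Center-based assignment: a coarse cluster has no exact box center,
    # so the centroid of its annotated points stands in as the target.
    return cluster_pts.mean(axis=0)

def box_cluster_overlap(cluster_pts: np.ndarray, box: np.ndarray) -> float:
    # Box-based assignment needs an IoU-like score between a proposal and
    # a cluster. As a simple surrogate, score the fraction of cluster
    # points inside the proposal. Box is axis-aligned [cx, cy, cz, l, w, h].
    center, dims = box[:3], box[3:6]
    inside = np.all(np.abs(cluster_pts - center) <= dims / 2.0, axis=1)
    return float(inside.mean())

def assign_clusters(proposals: np.ndarray, clusters: list, thresh: float = 0.3):
    # Match each cluster to its best-overlapping proposal, as a two-stage
    # or anchor-based detector would do during training.
    matches = []
    for ci, pts in enumerate(clusters):
        scores = [box_cluster_overlap(pts, box) for box in proposals]
        best = int(np.argmax(scores))
        if scores[best] >= thresh:
            matches.append((best, ci, scores[best]))
    return matches

# Example: one 10-point cluster near the origin, two candidate proposals.
cluster = np.random.randn(10, 3) * 0.3
proposals = np.array([[0, 0, 0, 2, 2, 2], [5, 5, 0, 2, 2, 2]], dtype=float)
print(assign_clusters(proposals, [cluster]))
```

In practice, rotated boxes and per-class thresholds would replace the axis-aligned simplification.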
Introduction of PointSAM
To further reduce the burden of producing even the coarse labels, the authors propose PointSAM, which leverages the Segment Anything Model (SAM). PointSAM generates coarse cluster labels automatically by running instance segmentation on 2D projections of the point cloud and lifting the resulting masks back to 3D. Because a single 2D mask can cover points from distinct, spatially separated objects, the method introduces Separability-Aware Refinement (SAR), which splits such merged instances via connected component analysis in 3D.
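The sketch below illustrates both stages under simplifying assumptions: `K` and `T` stand for a pinhole intrinsic matrix and a LiDAR-to-camera transform, `mask_ids` is an (H, W) array of per-pixel SAM instance ids (0 = background), and the 0.5 m connectivity radius is an arbitrary choice; none of these specifics come from the paper.

```python
import numpy as np

def lift_masks_to_points(points: np.ndarray, mask_ids: np.ndarray,
                         K: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Assign each LiDAR point the SAM instance id of the pixel it hits."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T @ pts_h.T).T[:, :3]             # LiDAR -> camera frame
    in_front = cam[:, 2] > 0.1               # keep points ahead of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
    H, W = mask_ids.shape
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    ids = np.zeros(len(points), dtype=int)
    ids[valid] = mask_ids[v[valid], u[valid]]
    return ids

def connected_components(pts: np.ndarray, radius: float = 0.5) -> np.ndarray:
    """O(N^2) BFS over radius neighbors; fine for a sketch."""
    labels, current = -np.ones(len(pts), dtype=int), 0
    for seed in range(len(pts)):
        if labels[seed] != -1:
            continue
        labels[seed], queue = current, [seed]
        while queue:
            i = queue.pop()
            near = np.where((labels == -1) &
                            (np.linalg.norm(pts - pts[i], axis=1) <= radius))[0]
            labels[near] = current
            queue.extend(near.tolist())
        current += 1
    return labels

def separability_aware_refine(points: np.ndarray, ids: np.ndarray) -> np.ndarray:
    # A single 2D mask can cover points from objects at different depths.
    # Split every lifted instance into its 3D connected components so that
    # spatially separable objects receive distinct cluster labels.
    refined, next_id = np.zeros_like(ids), 1
    for inst in np.unique(ids[ids > 0]):
        idx = np.where(ids == inst)[0]
        comps = connected_components(points[idx])
        for c in np.unique(comps):
            refined[idx[comps == c]] = next_id
            next_id += 1
    return refined
```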
Experimental Validation
The effectiveness of MixSup is validated on three prominent datasets: nuScenes, Waymo Open Dataset (WOD), and KITTI. Across different detectors, MixSup achieves up to 97.31% of fully supervised performance while significantly reducing labeling cost. Notable results include:
- On Waymo, MixSup reaches an average of 96.41% of fully supervised performance while using only 10% of the accurate box labels.
- For KITTI, MixSup maintains a performance ratio exceeding 90% across various detection models.
Comparative Analysis
The paper provides a comparative analysis against other label-efficient frameworks, such as semi-supervised and weakly supervised methods. It notes that while MixSup does not surpass every existing method in every scenario, it consistently delivers robust, practical gains across varied settings and detectors, and it remains compatible with complementary techniques such as self-training, common in semi-supervised learning.
Practical and Theoretical Implications
The practical implications of MixSup are significant for industries reliant on LiDAR-based perception systems. By drastically reducing the training data annotation cost while maintaining high detection performance, MixSup allows more agile and cost-effective deployment of autonomous driving systems. Theoretically, the work challenges the reliance on large quantities of fine-grained labels and showcases the potential of mixed-grained supervision paradigms in complex detection tasks.
Future Directions
The paper opens up avenues for future research, particularly in:
- Integration with Semi-supervised Methods: Further exploration of synergies between MixSup and sophisticated semi-supervised techniques could yield better performance and efficiency.
- Advanced Auto-labeling Methods: Incorporating recent advancements in automatic labeling tools could enhance the quality and reliability of the coarse labels generated.
- Application to Other 3D Understanding Tasks: Extending the principles of MixSup to other 3D tasks such as segmentation and tracking could prove beneficial.
In conclusion, the MixSup framework presents a practical and efficient approach to LiDAR-based 3D object detection by leveraging the complementary strengths of coarse and fine labels. By integrating innovative label assignment techniques and automated labeling with PointSAM, this approach sets a new benchmark for label-efficient learning in autonomous perception systems.