MonoLSS: Learnable Sample Selection For Monocular 3D Detection (2312.14474v2)
Abstract: In the field of autonomous driving, monocular 3D detection is a critical task that estimates the 3D properties (depth, dimension, and orientation) of objects from a single RGB image. Previous works have used features heuristically to learn 3D properties, without considering that inappropriate features can have adverse effects. In this paper, we introduce sample selection, so that only suitable samples are used to train the regression of 3D properties. To select samples adaptively, we propose a Learnable Sample Selection (LSS) module based on Gumbel-Softmax and a relative-distance sample divider. The LSS module operates under a warm-up strategy, which improves training stability. Additionally, since the LSS module dedicated to 3D property sample selection relies on object-level features, we further develop a data augmentation method named MixUp3D that enriches 3D property samples, conforms to imaging principles, and introduces no ambiguity. As two orthogonal methods, the LSS module and MixUp3D can be used independently or in conjunction. Extensive experiments show that combining them has a synergistic effect, with improvements exceeding the sum of their individual gains. Leveraging the LSS module and MixUp3D, without any extra data, our method, named MonoLSS, ranks 1st in all three categories (Car, Cyclist, and Pedestrian) on the KITTI 3D object detection benchmark and achieves competitive results on both the Waymo dataset and the KITTI-nuScenes cross-dataset evaluation. The code is included in the supplementary material and will be released to facilitate related academic and industrial studies.
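The abstract names three ingredients of the LSS module: a Gumbel-Softmax over candidate samples, a relative-distance divider that splits samples into kept and dropped, and a warm-up strategy. Below is a minimal pure-Python sketch of how these pieces could fit together for one object's samples. It is an illustration under stated assumptions, not the MonoLSS implementation. The divider rule (keep a sample if its distance to the best score is within `ratio` of the score range), the parameters `tau`, `ratio`, and `warmup_epochs`, and all function names are hypothetical. A real model would compute the relaxation inside an autodiff framework, for example with PyTorch's `torch.nn.functional.gumbel_softmax`, so that gradients reach the selection logits.

```python
import math
import random
from typing import List, Optional, Tuple


def gumbel_softmax(logits: List[float], tau: float = 1.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Relaxed categorical sample: softmax((logits + Gumbel(0, 1) noise) / tau)."""
    rng = rng or random.Random()
    noisy = []
    for x in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        g = -math.log(-math.log(u))  # Gumbel(0, 1) via inverse CDF
        noisy.append((x + g) / tau)
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def relative_distance_divider(probs: List[float], ratio: float = 0.5) -> List[bool]:
    """HYPOTHETICAL divider (assumption, not the paper's exact rule):
    keep a sample if its gap to the best score is at most `ratio` of the
    score range. The best-scoring sample is therefore always kept."""
    hi, lo = max(probs), min(probs)
    span = hi - lo
    if span == 0.0:
        return [True] * len(probs)  # all scores equal: keep everything
    return [(hi - p) <= ratio * span for p in probs]


def lss_loss(logits: List[float], per_sample_losses: List[float], epoch: int,
             warmup_epochs: int = 5, tau: float = 1.0, ratio: float = 0.5,
             rng: Optional[random.Random] = None) -> Tuple[float, List[bool]]:
    """Average the regression loss over selected samples only.

    During the (assumed) warm-up phase every sample is trained, which is one
    plausible reading of a warm-up that stabilises early training.
    """
    if epoch < warmup_epochs:
        mask = [True] * len(per_sample_losses)
    else:
        probs = gumbel_softmax(logits, tau=tau, rng=rng)
        mask = relative_distance_divider(probs, ratio=ratio)
    selected = [loss for loss, keep in zip(per_sample_losses, mask) if keep]
    return sum(selected) / len(selected), mask  # never empty: argmax is kept
```

For example, `lss_loss([0.2, 1.5, -0.3], [0.8, 0.4, 1.1], epoch=10, rng=random.Random(0))` returns the mean loss over whichever samples survive the divider, together with the keep mask. Because the Gumbel noise is resampled every call, which samples are kept can vary between iterations, which encourages exploration while the logits are still being learned.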
- Zhenjia Li
- Jinrang Jia
- Yifeng Shi