MOSE: Boosting Vision-based Roadside 3D Object Detection with Scene Cues (2404.05280v1)
Abstract: 3D object detection based on roadside cameras offers autonomous driving an additional way to alleviate the occlusion and short perception range of vehicle cameras. Previous methods for roadside 3D object detection mainly focus on modeling the depth or height of objects, neglecting the fact that roadside cameras are stationary and that the scene is consistent across frames. In this work, we propose a novel framework, namely MOSE, for MOnocular 3D object detection with Scene cuEs. The scene cues are frame-invariant, scene-specific features that are crucial for object localization and can be intuitively regarded as the height between the surface of the real road and the virtual ground plane. In the proposed framework, a scene cue bank is designed to aggregate scene cues from multiple frames of the same scene with a carefully designed extrinsic augmentation strategy. Then, a transformer-based decoder lifts the aggregated scene cues together with the 3D position embeddings for 3D object localization, which boosts generalization ability in heterologous scenes. Extensive experimental results on two public benchmarks demonstrate the state-of-the-art performance of the proposed method, which surpasses existing methods by a large margin.
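The idea of a scene cue bank, which accumulates frame-invariant features from several frames of the same camera scene, can be illustrated with a toy sketch. The abstract does not state the aggregation rule, so the exponential moving average used here is an assumption, and the names `SceneCueBank`, `update`, `lookup`, and `momentum` are hypothetical and not taken from the paper.

```python
from typing import Dict, List


class SceneCueBank:
    """Toy per-scene feature store (hypothetical; not the paper's implementation)."""

    def __init__(self, momentum: float = 0.9) -> None:
        if not 0.0 <= momentum < 1.0:
            raise ValueError("momentum must be in [0, 1)")
        self.momentum = momentum
        self._bank: Dict[str, List[float]] = {}

    def update(self, scene_id: str, frame_cue: List[float]) -> List[float]:
        """Blend one frame's cue vector into the stored cue for its scene."""
        stored = self._bank.get(scene_id)
        if stored is None:
            # First frame seen for this scene: store a copy unchanged.
            blended = list(frame_cue)
        else:
            if len(stored) != len(frame_cue):
                raise ValueError("cue dimension mismatch")
            m = self.momentum
            # Assumed rule: exponential moving average over frames.
            blended = [m * s + (1.0 - m) * f for s, f in zip(stored, frame_cue)]
        self._bank[scene_id] = blended
        return blended

    def lookup(self, scene_id: str) -> List[float]:
        """Return a copy of the aggregated cue for a scene."""
        return list(self._bank[scene_id])


# Two frames from the same (hypothetical) roadside camera "cam_01".
bank = SceneCueBank(momentum=0.5)
bank.update("cam_01", [1.0, 3.0])
bank.update("cam_01", [3.0, 1.0])
print(bank.lookup("cam_01"))
```

Keying the store by scene reflects the observation that a roadside camera is stationary, so features tied to the road geometry can be shared across all frames from that camera rather than re-estimated per frame.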