SGV3D: Towards Scenario Generalization for Vision-based Roadside 3D Object Detection (2401.16110v2)
Abstract: Roadside perception can greatly improve the safety of autonomous vehicles by extending their perception beyond the visual range and covering blind spots. However, current state-of-the-art vision-based roadside detection methods achieve high accuracy on labeled scenes but perform worse on new scenes. This is because roadside cameras remain stationary after installation and can only collect data from a single scene, so the algorithm overfits to those roadside backgrounds and camera poses. To address this issue, we propose a Scenario Generalization Framework for Vision-based Roadside 3D Object Detection, dubbed SGV3D. Specifically, we employ a Background-suppressed Module (BSM) to mitigate background overfitting in vision-centric pipelines by attenuating background features during the 2D to bird's-eye-view projection. Furthermore, we introduce a Semi-supervised Data Generation Pipeline (SSDG) that uses unlabeled images from new scenes to generate diverse instance foregrounds with varying camera poses, reducing the risk of overfitting to specific camera poses. We evaluate our method on two large-scale roadside benchmarks. Our method surpasses all previous methods by a significant margin in new scenes, including +42.57% for vehicle, +5.87% for pedestrian, and +14.89% for cyclist compared to BEVHeight on the DAIR-V2X-I heterologous benchmark. On the larger-scale Rope3D heterologous benchmark, we achieve notable gains of 14.48% for car and 12.41% for large vehicle. We hope to contribute insights to the exploration of roadside perception techniques, emphasizing their capability for scenario generalization. The code will be available at https://github.com/yanglei18/SGV3D
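The idea of attenuating background features before the 2D to bird's-eye-view projection can be illustrated with a minimal sketch. This is not the paper's BSM implementation: the function name `suppress_background`, the `bg_weight` parameter, and the assumption that a per-pixel foreground mask is already available are all hypothetical choices made for illustration.

```python
import numpy as np


def suppress_background(features, fg_mask, bg_weight=0.1):
    """Scale down image features at background positions.

    features : array of shape (C, H, W), image feature map.
    fg_mask  : array of shape (H, W) with values in [0, 1],
               where 1 marks foreground (assumed to be given).
    bg_weight: hypothetical residual weight kept for background
               positions (0 removes them entirely).
    """
    features = np.asarray(features, dtype=np.float32)
    fg_mask = np.asarray(fg_mask, dtype=np.float32)
    if features.ndim != 3 or features.shape[1:] != fg_mask.shape:
        raise ValueError("features must be (C, H, W) and fg_mask (H, W)")

    # Foreground keeps weight 1.0; background falls back to bg_weight.
    weights = bg_weight + (1.0 - bg_weight) * fg_mask

    # Broadcast the (H, W) weights over the channel axis.
    return features * weights[None, :, :]


if __name__ == "__main__":
    feats = np.ones((2, 2, 2), dtype=np.float32)
    mask = np.array([[1.0, 0.0], [0.0, 1.0]])
    print(suppress_background(feats, mask))
```

In a full pipeline, the weighted features would then be lifted into the bird's-eye-view grid, so background regions contribute less to the BEV representation. How the mask is obtained and where exactly the weighting is applied are details the abstract does not specify.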