Unleashing HyDRa: Hybrid Fusion, Depth Consistency and Radar for Unified 3D Perception (2403.07746v2)
Abstract: Low-cost, vision-centric 3D perception systems for autonomous driving have made significant progress in recent years, narrowing the gap to expensive LiDAR-based methods. The primary challenge in becoming a fully reliable alternative lies in robust depth prediction capabilities, as camera-based systems struggle with long detection ranges and adverse lighting and weather conditions. In this work, we introduce HyDRa, a novel camera-radar fusion architecture for diverse 3D perception tasks. Building upon the principles of dense BEV (Bird's Eye View)-based architectures, HyDRa introduces a hybrid fusion approach to combine the strengths of complementary camera and radar features in two distinct representation spaces. Our Height Association Transformer module leverages radar features already in the perspective view to produce more robust and accurate depth predictions. In the BEV, we refine the initial sparse representation by a Radar-weighted Depth Consistency. HyDRa achieves a new state-of-the-art for camera-radar fusion of 64.2 NDS (+1.8) and 58.4 AMOTA (+1.5) on the public nuScenes dataset. Moreover, our new semantically rich and spatially accurate BEV features can be directly converted into a powerful occupancy representation, beating all previous camera-based methods on the Occ3D benchmark by an impressive 3.7 mIoU. Code and models are available at https://github.com/phi-wol/hydra.