CRN: Camera Radar Net for Accurate, Robust, Efficient 3D Perception (2304.00670v3)
Abstract: Autonomous driving requires an accurate and fast 3D perception system that includes 3D object detection, tracking, and segmentation. Although recent low-cost camera-based approaches have shown promising results, they are susceptible to poor illumination and bad weather and have large localization error. Hence, fusing cameras with low-cost radar, which provides precise long-range measurements and operates reliably in all environments, is promising but has not yet been thoroughly investigated. In this paper, we propose Camera Radar Net (CRN), a novel camera-radar fusion framework that generates a semantically rich and spatially accurate bird's-eye-view (BEV) feature map for various tasks. To overcome the lack of spatial information in an image, we transform perspective-view image features to BEV with the help of sparse but accurate radar points. We further aggregate image and radar feature maps in BEV using multi-modal deformable attention designed to tackle the spatial misalignment between inputs. In its real-time setting, CRN operates at 20 FPS while achieving performance comparable to LiDAR detectors on nuScenes, and even outperforms them at far distances in the 100 m setting. Moreover, in its offline setting, CRN yields 62.4% NDS and 57.5% mAP on the nuScenes test set and ranks first among all camera and camera-radar 3D object detectors.
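The abstract does not spell out how radar points assist the perspective-to-BEV transformation, so the following is only an illustrative sketch, not the paper's implementation. It assumes a lift-style transform in which the feature of one image column is spread over depth bins along its camera ray, and in which the image-predicted depth distribution is re-weighted by radar returns in the same bins. The function name `radar_gated_lift`, the multiplicative gating rule, and the `eps` fallback are all hypothetical choices made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def radar_gated_lift(image_feat, depth_logits, radar_occupancy, eps=1e-6):
    """Spread one image column's feature over D depth bins (hypothetical rule).

    image_feat:      list of C floats, feature of one image column
    depth_logits:    list of D floats, image-predicted depth scores
    radar_occupancy: list of D floats in [0, 1], radar returns per depth bin
    Returns a D x C nested list: the feature placed along the camera ray.
    """
    depth_prob = softmax(depth_logits)
    # Assumed gating: multiply image depth probability by radar evidence.
    # eps keeps the image-only distribution when radar sees nothing.
    weights = [p * (eps + r) for p, r in zip(depth_prob, radar_occupancy)]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [[w * f for f in image_feat] for w in weights]


if __name__ == "__main__":
    feat = [1.0, 2.0]                  # C = 2 channels
    logits = [0.0, 0.0, 0.0, 0.0]      # image is uncertain over D = 4 bins
    radar = [0.0, 0.0, 1.0, 0.0]       # radar reports a return in bin 2
    for depth_bin, row in enumerate(radar_gated_lift(feat, logits, radar)):
        print(depth_bin, row)
```

With an uninformative image depth distribution, the single radar return pulls nearly all of the feature mass into its depth bin; with no radar returns at all, the weights reduce to the image-only depth distribution. In a full pipeline the per-ray outputs would then be pooled into BEV grid cells, a step this sketch omits.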