Zero-BEV: Zero-shot Projection of Any First-Person Modality to BEV Maps (2402.13848v2)
Abstract: Bird's-eye view (BEV) maps are an important geometrically structured representation widely used in robotics, in particular for self-driving vehicles and terrestrial robots. Existing algorithms either require depth information for the geometric projection, which is not always reliably available, or are trained end-to-end in a fully supervised way to map visual first-person observations to BEV representations, and are therefore restricted to the output modality they were trained for. In contrast, we propose a new model capable of performing zero-shot projections of any modality available in a first-person view to the corresponding BEV map. This is achieved by disentangling the geometric inverse perspective projection from the modality transformation, e.g. RGB to occupancy. The method is general, and we showcase experiments projecting three different modalities to BEV: semantic segmentation, motion vectors, and object bounding boxes detected in the first-person view. We experimentally show that the model outperforms competing methods, in particular the widely used baseline resorting to monocular depth estimation.
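The depth-based baseline the abstract refers to projects each first-person pixel to the ground plane geometrically: back-project pixels to 3D using depth and camera intrinsics, then bin them into a top-down grid. The sketch below illustrates this classic pipeline under simplifying assumptions (an upright pinhole camera, metric depth, nearest-cell scattering without z-buffering); all function and parameter names are illustrative, not the paper's implementation.

```python
import numpy as np

def fpv_to_bev(feature_map, depth, K, bev_size=100, cell=0.05):
    """Project a per-pixel first-person modality to a BEV grid using depth.

    feature_map : (H, W) per-pixel values (e.g. semantic labels)
    depth       : (H, W) metric depth in meters
    K           : (3, 3) pinhole camera intrinsics
    bev_size    : BEV grid side length in cells
    cell        : cell size in meters
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    # Back-project pixels to camera coordinates (x right, y down, z forward).
    x = (u - K[0, 2]) * z / K[0, 0]
    # Top-down view: x becomes the lateral axis, z the forward axis.
    col = np.round(x / cell + bev_size / 2).astype(int)
    row = np.round(z / cell).astype(int)
    bev = np.zeros((bev_size, bev_size), dtype=feature_map.dtype)
    valid = (row >= 0) & (row < bev_size) & (col >= 0) & (col < bev_size)
    # Scatter feature values into the grid (last write wins per cell).
    bev[row[valid], col[valid]] = feature_map[valid]
    return bev
```

When depth is unreliable or unavailable, this projection fails; the paper's contribution is to learn the geometric transformation so it can be applied zero-shot to any first-person modality without depth at inference time.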