Pre-Trained Masked Image Model for Mobile Robot Navigation (2310.07021v2)
Abstract: 2D top-down maps are commonly used for the navigation and exploration of mobile robots through unknown areas. Typically, the robot builds the navigation maps incrementally from local observations using onboard sensors. Recent works have shown that predicting the structural patterns in the environment through learning-based approaches can greatly enhance task efficiency. While many such works build task-specific networks using limited datasets, we show that existing foundational vision networks can accomplish the same without any fine-tuning. Specifically, we use Masked Autoencoders, pre-trained on street images, to present novel applications for field-of-view expansion, single-agent topological exploration, and multi-agent exploration for indoor mapping, across different input modalities. Our work motivates the use of foundational vision models for generalized structure-prediction-driven applications, especially when training data is scarce. For more qualitative results, see https://raaslab.org/projects/MIM4Robots.
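The abstract does not spell out how a partially observed map is handed to a masked autoencoder, so the following is only a minimal sketch of one plausible preprocessing step for field-of-view expansion: turning a per-pixel "observed" map into a patch-level mask that marks which patches the model would be asked to predict. The function name `patch_mask_from_observed`, the 224x224 map size, the 16-pixel patch size, and the `min_observed_frac` threshold are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def patch_mask_from_observed(observed, patch=16, min_observed_frac=1.0):
    """Return a boolean patch grid; True marks a patch to be predicted.

    observed: 2D boolean array, True where the robot has sensed the cell.
    patch: side length of a square patch (assumed value, ViT-style).
    min_observed_frac: a patch counts as visible only if at least this
        fraction of its cells has been observed (assumed threshold).
    """
    h, w = observed.shape
    if h % patch or w % patch:
        raise ValueError("map size must be a multiple of the patch size")
    # Split the map into (patch_rows, patch, patch_cols, patch) blocks and
    # compute the observed fraction inside each block.
    blocks = observed.reshape(h // patch, patch, w // patch, patch)
    observed_frac = blocks.mean(axis=(1, 3))
    return observed_frac < min_observed_frac

# Hypothetical example: the robot has observed the central 96x96 window
# of a 224x224 top-down map; everything else is unknown.
observed = np.zeros((224, 224), dtype=bool)
observed[64:160, 64:160] = True
mask = patch_mask_from_observed(observed)
print(mask.shape, int(mask.sum()))
```

In this sketch the visible patches would be passed to the encoder and the masked ones reconstructed by the decoder; how the actual pipeline selects and feeds patches may differ.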