CoViS-Net: A Cooperative Visual Spatial Foundation Model for Multi-Robot Applications (2405.01107v3)

Published 2 May 2024 in cs.RO, cs.MA, cs.SY, and eess.SY

Abstract: Autonomous robot operation in unstructured environments is often underpinned by spatial understanding through vision. Systems composed of multiple concurrently operating robots additionally require access to frequent, accurate and reliable pose estimates. In this work, we propose CoViS-Net, a decentralized visual spatial foundation model that learns spatial priors from data, enabling pose estimation as well as spatial comprehension. Our model is fully decentralized, platform-agnostic, executable in real-time using onboard compute, and does not require existing networking infrastructure. CoViS-Net provides relative pose estimates and a local bird's-eye-view (BEV) representation, even without camera overlap between robots (in contrast to classical methods). We demonstrate its use in a multi-robot formation control task across various real-world settings. We provide code, models and supplementary material online at https://proroklab.github.io/CoViS-Net/

Authors (4)
  1. Jan Blumenkamp (12 papers)
  2. Steven Morad (15 papers)
  3. Jennifer Gielis (4 papers)
  4. Amanda Prorok (66 papers)
