Learning Generalizable Feature Fields for Mobile Manipulation (2403.07563v2)

Published 12 Mar 2024 in cs.RO, cs.CV, and cs.LG

Abstract: An open problem in mobile manipulation is how to represent objects and scenes in a unified manner so that robots can use both for navigation and manipulation. The latter requires capturing intricate geometry while understanding fine-grained semantics, whereas the former involves capturing the complexity inherent at an expansive physical scale. In this work, we present GeFF (Generalizable Feature Fields), a scene-level generalizable neural feature field that acts as a unified representation for both navigation and manipulation that performs in real-time. To do so, we treat generative novel view synthesis as a pre-training task, and then align the resulting rich scene priors with natural language via CLIP feature distillation. We demonstrate the effectiveness of this approach by deploying GeFF on a quadrupedal robot equipped with a manipulator. We quantitatively evaluate GeFF's ability for open-vocabulary object-/part-level manipulation and show that GeFF outperforms point-based baselines in runtime and storage-accuracy trade-offs, with qualitative examples of semantics-aware navigation and articulated object manipulation.

References (72)
  1. Robot learning in homes: Improving generalization and reducing dataset bias. Advances in Neural Information Processing Systems, 31:9094–9104, 2018.
  2. Commodity telepresence with team avatrina’s nursebot in the ana avatar xprize finals. In ICRA 2023 2nd Workshop on Toward Robot Avatars, 2023.
  3. Tidybot: Personalized robot assistance with large language models. Autonomous Robots, 2023.
  4. Kimera-Multi: Robust, Distributed, Dense Metric-Semantic SLAM for Multi-Robot Systems. IEEE Transactions on Robotics (T-RO), 38(4):2022–2038, 2022.
  5. Semantic OcTree Mapping and Shannon Mutual Information Computation for Robot Exploration. IEEE Transactions on Robotics (T-RO), 39(3):1910–1928, 2023.
  6. Gnm: A general navigation model to drive any robot. In ICRA, 2023.
  7. Vint: A foundation model for visual navigation. In CORL, 2023.
  8. Stable bin packing of non-convex 3d objects with a robot manipulator. In ICRA, 2019.
  9. Homerobot: Open-vocabulary mobile manipulation. arXiv preprint arXiv:2306.11565, 2023.
  10. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  11. pixelnerf: Neural radiance fields from one or few images. In CVPR, 2021.
  12. Featurenerf: Learning generalizable nerfs by distilling foundation models. In ICCV, 2023.
  13. Learning transferable visual models from natural language supervision. In ICML. PMLR, 2021.
  14. Lerf: Language embedded radiance fields. In ICCV, 2023.
  15. Grf: Learning a general radiance field for 3d representation and rendering. In ICCV, 2021.
  16. F2-nerf: Fast neural radiance field training with free camera trajectories. In CVPR, 2023.
  17. Zip-nerf: Anti-aliased grid-based neural radiance fields. arXiv preprint arXiv:2304.06706, 2023.
  18. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
  19. Is attention all that nerf needs? In ICLR, 2023.
  20. Actorsnerf: Animatable few-shot human rendering with generalizable nerfs. In ICCV, pages 18391–18401, 2023.
  21. Advances in neural rendering. arXiv preprint arXiv:2111.05849, 2021.
  22. Lolnerf: Learn from one look. In CVPR, 2022.
  23. Decomposing nerf for editing via feature field distillation. NeurIPS, 2022.
  24. Neural feature fusion fields: 3d distillation of self-supervised 2d image representations. In International Conference on 3D Vision (3DV), 2022.
  25. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
  26. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  27. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  28. Distilled feature fields enable few-shot language-guided manipulation. arXiv preprint arXiv:2308.07931, 2023.
  29. Gnfactor: Multi-task real robot learning with generalizable neural feature fields. In CoRL, 2023.
  30. Poni: Potential functions for objectgoal navigation with interaction-free learning. In CVPR, 2022.
  31. Navigating to objects in the real world. Science Robotics, 2023.
  32. Clip-fields: Weakly supervised semantic fields for robotic memory. In RSS, 2023.
  33. Open-vocabulary queryable scene representations for real world planning. In ICRA, 2023.
  34. Stubborn: A strong baseline for indoor object navigation. In IROS, 2022.
  35. Navigation with large language models: Semantic guesswork as a heuristic for planning. In Conference on Robot Learning (CoRL), 2023.
  36. Reasoning with scene graphs for robot planning under partial observability. IEEE Robotics and Automation Letters (RAL), 7(2):5560–5567, 2022.
  37. Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks. In IEEE International Conference on Robotics and Automation (ICRA), pages 9272–9279, 2022.
  38. Object goal navigation using goal-oriented semantic exploration. In Neural Information Processing Systems (NeurIPS), 2020.
  39. Ovrl-v2: A simple state-of-art baseline for imagenav and objectnav. arXiv preprint arXiv:2303.07798, 2023.
  40. Topological semantic graph memory for image-goal navigation. In CoRL, 2022.
  41. Multi-object navigation with dynamically learned neural implicit representations. In ICCV, 2023.
  42. Unifying perception, estimation and action for mobile manipulation via belief space planning. In ICRA, 2012.
  43. Fully autonomous real-world reinforcement learning with applications to mobile manipulation. In CoRL, 2021.
  44. Error-aware imitation learning from teleoperation data for mobile manipulation. In CoRL, 2021.
  45. Multi-skill mobile manipulation for object rearrangement. In ICLR, 2023.
  46. Slap: Spatial-language attention policies. In CoRL, 2023.
  47. Relmogen: Integrating motion generation in reinforcement learning for mobile manipulation. In ICRA, 2021.
  48. Asc: Adaptive skill coordination for robotic mobile manipulation. arXiv preprint arXiv:2304.00410, 2023.
  49. Skill transformer: A monolithic policy for mobile manipulation. In ICCV, 2023.
  50. Open-world object manipulation using pre-trained vision-language model. In CoRL, 2023.
  51. Go fetch: Mobile manipulation in unstructured environments. arXiv preprint arXiv:2004.00899, 2020.
  52. Go fetch! - dynamic grasps using boston dynamics spot with external robotic arm. In ICRA, 2021.
  53. Kinematically-decoupled impedance control for fast object visual servoing and grasping on quadruped manipulators. In IROS, 2023.
  54. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In CVPR, 2023.
  55. Ok-robot: What really matters in integrating open-knowledge models for robotics. arXiv preprint arXiv:2401.12202, 2024.
  56. Visual language maps for robot navigation. In ICRA, 2023.
  57. Conceptfusion: Open-set multimodal 3d mapping. arXiv preprint arXiv:2302.07241, 2023.
  58. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  59. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
  60. 3d reconstruction with generalizable neural fields using scene priors. arXiv preprint arXiv:2309.15164, 2023.
  61. Ponder: Point cloud pre-training via neural rendering. In ICCV, 2023.
  62. Extract free dense labels from clip. In ECCV, 2022.
  63. isdf: Real-time neural signed distance fields for robot perception. In RSS, 2022.
  64. Implicit geometric regularization for learning shapes. In ICML. PMLR, 2020.
  65. 3d object detection with pointformer. In CVPR, 2021.
  66. Point transformer. In ICCV, 2021.
  67. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
  68. Deep residual learning for image recognition. In CVPR, 2016.
  69. Hybvio: Pushing the limits of real-time visual-inertial odometry. In WACV, 2022.
  70. The Open Motion Planning Library. IEEE Robotics & Automation Magazine, 19(4):72–82, December 2012. https://ompl.kavrakilab.org.
  71. https://gazebosim.org/home.
  72. Habitat 2.0: Training home assistants to rearrange their habitat. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
Authors (11)
  1. Ri-Zhao Qiu (9 papers)
  2. Yafei Hu (7 papers)
  3. Ge Yang (49 papers)
  4. Yuchen Song (16 papers)
  5. Yang Fu (43 papers)
  6. Jianglong Ye (11 papers)
  7. Jiteng Mu (10 papers)
  8. Ruihan Yang (43 papers)
  9. Nikolay Atanasov (101 papers)
  10. Sebastian Scherer (163 papers)
  11. Xiaolong Wang (243 papers)
Citations (17)

Summary

Unifying Navigation and Manipulation through Generalizable Feature Fields in Real-Time Mobile Robotics

Introduction

The exploration of unified scene representations suitable for both robot navigation and manipulation remains a significant frontier in robotics research. Typical approaches often treat navigation and manipulation as separate challenges, employing distinct strategies and representations for each task. Navigation typically leverages large-scale geometric or topological maps, while manipulation relies on precise, continuous scene representations for object interaction. The discrepancy between these approaches complicates tasks requiring integrated visuomotor skills, particularly in dynamic, real-world environments.

In this work, the authors introduce Generalizable Feature Fields (GeFF), a scene-level neural feature representation designed for real-time, unified use in navigation and manipulation. GeFF builds on generalizable neural radiance fields, extending their utility beyond novel view synthesis to capture rich semantic and geometric scene priors. Notably, GeFF incorporates language-aligned semantics via feature distillation from a vision-language model (VLM), enabling open-vocabulary tasks. This integration allows a robot to act on natural-language instructions and interact efficiently with dynamic environments.
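
As a rough illustration of this two-stage recipe, the sketch below composes a training objective in PyTorch: a photometric reconstruction loss for the novel-view-synthesis pre-training stage, plus a CLIP feature distillation term for the alignment stage. The specific loss forms and the weighting constant LAMBDA_FEAT are assumptions for illustration, not the authors' exact formulation; the rendered quantities would come from the feature field described in the Methodology section below.

```python
import torch
import torch.nn.functional as F

LAMBDA_FEAT = 0.5  # assumed weighting between reconstruction and distillation


def pretrain_loss(rendered_rgb, target_rgb):
    # Stage 1: generative novel view synthesis as pre-training,
    # supervised here with a simple photometric (MSE) reconstruction loss.
    return F.mse_loss(rendered_rgb, target_rgb)


def alignment_loss(rendered_rgb, target_rgb, rendered_feat, clip_feat):
    # Stage 2: keep the reconstruction term and add a CLIP feature
    # distillation term so rendered features land in CLIP's embedding space.
    photometric = pretrain_loss(rendered_rgb, target_rgb)
    distill = 1.0 - F.cosine_similarity(rendered_feat, clip_feat, dim=-1).mean()
    return photometric + LAMBDA_FEAT * distill
```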

Generalizable Feature Fields: Methodology

GeFF distinguishes itself by merging scene-level generalizable Neural Radiance Fields (NeRF) with feature distillation, creating a representation capable of real-time updates and language alignment. The methodology encompasses two principal components:

  • Real-time Scene Representation: GeFF employs an encoder-decoder framework where the encoder processes input RGB-D streams, generating a latent representation dynamically updated as the robot navigates and manipulates within its environment. This process supports incremental scene understanding and manipulation planning in a unified manner.
  • Semantic Alignment through Feature Distillation: Beyond capturing geometry, GeFF enriches the scene representation with semantics by distilling features from a pre-trained vision-language model (VLM), specifically CLIP. This alignment enables robots to understand and execute tasks described in natural language, addressing both specific objects and their broader semantic context; a minimal encoder-decoder sketch follows this list.
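
As referenced above, the following is a minimal PyTorch sketch of such an encoder-decoder. It is illustrative only: the layer sizes, the global pooling of the scene latent (generalizable fields such as pixelNeRF typically use pixel-aligned features instead), and the 512-dimensional CLIP feature width are assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn


class GeFFSketch(nn.Module):
    """Toy encoder-decoder: RGB-D frames -> scene latent -> per-point density
    and a CLIP-aligned semantic feature. Illustrative only."""

    def __init__(self, latent_dim=128, clip_dim=512):
        super().__init__()
        # Encoder: 4-channel RGB-D image -> downsampled latent feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: (3D query point, scene latent) -> [density, CLIP-aligned feature].
        self.decoder = nn.Sequential(
            nn.Linear(3 + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 1 + clip_dim),
        )

    def forward(self, rgbd, points):
        # rgbd: (B, 4, H, W); points: (B, N, 3) query locations in the scene.
        latent = self.encoder(rgbd)               # (B, latent_dim, H/4, W/4)
        scene_code = latent.mean(dim=(2, 3))      # global pooling, for brevity only
        code = scene_code.unsqueeze(1).expand(-1, points.shape[1], -1)
        out = self.decoder(torch.cat([points, code], dim=-1))
        density, feature = out[..., :1], out[..., 1:]
        return density, feature


# Usage with dummy data:
model = GeFFSketch()
density, feat = model(torch.randn(1, 4, 240, 320), torch.rand(1, 1024, 3))
```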

Empirical Evaluation

The efficacy of GeFF is demonstrated through deployment on a quadrupedal robot equipped with a manipulator. Evaluations across diverse real-world scenarios, ranging from lab spaces to community kitchens, showcase GeFF's robustness and versatility. Notably, GeFF achieves an average 52.9% success rate in open-vocabulary mobile manipulation tasks, significantly outperforming baseline approaches such as LERF. These results are underpinned by GeFF's ability to provide detailed scene representations and to update in real time in response to dynamic changes, enhancing both navigation and manipulation capabilities.
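
For intuition, the sketch below shows how such an open-vocabulary query might be resolved at inference time, assuming the decoded per-point features live in CLIP's embedding space (which the distillation above is meant to ensure). The function name, the goal-point interpretation, and the use of the OpenAI clip package are illustrative assumptions, not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

import clip  # OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git


@torch.no_grad()
def locate_object(point_features, points, prompt, device="cpu"):
    """Rank candidate 3D points against a natural-language prompt.

    point_features: (N, 512) CLIP-aligned features decoded from the field.
    points: (N, 3) corresponding 3D locations.
    Returns the best-matching location, a stand-in for the goal that a
    navigation/manipulation planner would be handed.
    """
    model, _ = clip.load("ViT-B/32", device=device)
    text_feat = model.encode_text(clip.tokenize([prompt]).to(device)).float()  # (1, 512)
    sims = F.cosine_similarity(point_features.to(device), text_feat, dim=-1)   # (N,)
    return points[sims.argmax()]
```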

Implications and Future Directions

The development of GeFF marks an important step towards realizing robots capable of integrated navigation and manipulation in dynamic, real-world settings. By bridging the gap between geometric navigation maps and manipulation-centric scene representations, GeFF facilitates a broader range of autonomous robotic tasks. The approach's real-time performance and open-vocabulary capability expand the potential for robots to interact with their environments in a more natural and intuitive manner.

Looking forward, the work opens several avenues for further research. Enhancements in feature distillation could refine semantic understanding and alignment with language, while advances in incremental learning may bolster GeFF's adaptability to novel environments and tasks. Moreover, integrating GeFF with end-to-end learning strategies for visuomotor control could further unify the navigation-manipulation pipeline, leading to more sophisticated and autonomous robotic systems.

In essence, Generalizable Feature Fields represent a significant stride towards versatile, language-aware robotic systems capable of navigating and manipulating within complex, ever-changing environments. This work not only advances our understanding of unified scene representations but also lays the groundwork for future innovations in robot autonomy.