Learning Generalizable Feature Fields for Mobile Manipulation (2403.07563v2)

Published 12 Mar 2024 in cs.RO, cs.CV, and cs.LG

Abstract: An open problem in mobile manipulation is how to represent objects and scenes in a unified manner so that robots can use both for navigation and manipulation. The latter requires capturing intricate geometry while understanding fine-grained semantics, whereas the former involves capturing the complexity inherent at an expansive physical scale. In this work, we present GeFF (Generalizable Feature Fields), a scene-level generalizable neural feature field that acts as a unified representation for both navigation and manipulation that performs in real-time. To do so, we treat generative novel view synthesis as a pre-training task, and then align the resulting rich scene priors with natural language via CLIP feature distillation. We demonstrate the effectiveness of this approach by deploying GeFF on a quadrupedal robot equipped with a manipulator. We quantitatively evaluate GeFF's ability for open-vocabulary object-/part-level manipulation and show that GeFF outperforms point-based baselines in runtime and storage-accuracy trade-offs, with qualitative examples of semantics-aware navigation and articulated object manipulation.

References (72)
  1. Robot learning in homes: Improving generalization and reducing dataset bias. Advances in Neural Information Processing Systems, 31:9094–9104, 2018.
  2. Commodity telepresence with team avatrina’s nursebot in the ana avatar xprize finals. In ICRA 2023 2nd Workshop on Toward Robot Avatars, 2023.
  3. Tidybot: Personalized robot assistance with large language models. Autonomous Robots, 2023.
  4. Kimera-Multi: Robust, Distributed, Dense Metric-Semantic SLAM for Multi-Robot Systems. IEEE Transactions on Robotics (T-RO), 38(4):2022–2038, 2022.
  5. Semantic OcTree Mapping and Shannon Mutual Information Computation for Robot Exploration. IEEE Transactions on Robotics (T-RO), 39(3):1910–1928, 2023.
  6. Gnm: A general navigation model to drive any robot. In ICRA, 2023.
  7. Vint: A foundation model for visual navigation. In CORL, 2023.
  8. Stable bin packing of non-convex 3d objects with a robot manipulator. In ICRA, 2019.
  9. Homerobot: Open-vocabulary mobile manipulation. arXiv preprint arXiv:2306.11565, 2023.
  10. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  11. pixelnerf: Neural radiance fields from one or few images. In CVPR, 2021.
  12. Featurenerf: Learning generalizable nerfs by distilling foundation models. In ICCV, 2023.
  13. Learning transferable visual models from natural language supervision. In ICML. PMLR, 2021.
  14. Lerf: Language embedded radiance fields. In ICCV, 2023.
  15. Grf: Learning a general radiance field for 3d representation and rendering. In ICCV, 2021.
  16. F2-nerf: Fast neural radiance field training with free camera trajectories. In CVPR, 2023.
  17. Zip-nerf: Anti-aliased grid-based neural radiance fields. arXiv preprint arXiv:2304.06706, 2023.
  18. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
  19. Is attention all that nerf needs? In ICLR, 2023.
  20. Actorsnerf: Animatable few-shot human rendering with generalizable nerfs. In ICCV, pages 18391–18401, 2023.
  21. Advances in neural rendering. arXiv preprint arXiv:2111.05849, 2021.
  22. Lolnerf: Learn from one look. In CVPR, 2022.
  23. Decomposing nerf for editing via feature field distillation. NeurIPS, 2022.
  24. Neural feature fusion fields: 3d distillation of self-supervised 2d image representations. In International Conference on 3D Vision (3DV), 2022.
  25. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
  26. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  27. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  28. Distilled feature fields enable few-shot language-guided manipulation. arXiv preprint arXiv:2308.07931, 2023.
  29. Gnfactor: Multi-task real robot learning with generalizable neural feature fields. In CoRL, 2023.
  30. Poni: Potential functions for objectgoal navigation with interaction-free learning. In CVPR, 2022.
  31. Navigating to objects in the real world. Science Robotics, 2023.
  32. Clip-fields: Weakly supervised semantic fields for robotic memory. In RSS, 2023.
  33. Open-vocabulary queryable scene representations for real world planning. In ICRA, 2023.
  34. Stubborn: A strong baseline for indoor object navigation. In IROS, 2022.
  35. Navigation with large language models: Semantic guesswork as a heuristic for planning. In Conference on Robot Learning (CoRL), 2023.
  36. Reasoning with scene graphs for robot planning under partial observability. IEEE Robotics and Automation Letters (RAL), 7(2):5560–5567, 2022.
  37. Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks. In IEEE International Conference on Robotics and Automation (ICRA), pages 9272–9279, 2022.
  38. Object goal navigation using goal-oriented semantic exploration. In Neural Information Processing Systems (NeurIPS), 2020.
  39. Ovrl-v2: A simple state-of-art baseline for imagenav and objectnav. arXiv preprint arXiv:2303.07798, 2023.
  40. Topological semantic graph memory for image-goal navigation. In CoRL, 2022.
  41. Multi-object navigation with dynamically learned neural implicit representations. In ICCV, 2023.
  42. Unifying perception, estimation and action for mobile manipulation via belief space planning. In ICRA, 2012.
  43. Fully autonomous real-world reinforcement learning with applications to mobile manipulation. In CoRL, 2021.
  44. Error-aware imitation learning from teleoperation data for mobile manipulation. In CoRL, 2021.
  45. Multi-skill mobile manipulation for object rearrangement. In ICLR, 2023.
  46. Slap: Spatial-language attention policies. In CoRL, 2023.
  47. Relmogen: Integrating motion generation in reinforcement learning for mobile manipulation. In ICRA, 2021.
  48. Asc: Adaptive skill coordination for robotic mobile manipulation. arXiv preprint arXiv:2304.00410, 2023.
  49. Skill transformer: A monolithic policy for mobile manipulation. In ICCV, 2023.
  50. Open-world object manipulation using pre-trained vision-language model. In CoRL, 2023.
  51. Go fetch: Mobile manipulation in unstructured environments. arXiv preprint arXiv:2004.00899, 2020.
  52. Go fetch! - dynamic grasps using boston dynamics spot with external robotic arm. In ICRA, 2021.
  53. Kinematically-decoupled impedance control for fast object visual servoing and grasping on quadruped manipulators. In IROS, 2023.
  54. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In CVPR, 2023.
  55. Ok-robot: What really matters in integrating open-knowledge models for robotics. arXiv preprint arXiv:2401.12202, 2024.
  56. Visual language maps for robot navigation. In ICRA, 2023.
  57. Conceptfusion: Open-set multimodal 3d mapping. arXiv preprint arXiv:2302.07241, 2023.
  58. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  59. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
  60. 3d reconstruction with generalizable neural fields using scene priors. arXiv preprint arXiv:2309.15164, 2023.
  61. Ponder: Point cloud pre-training via neural rendering. In ICCV, 2023.
  62. Extract free dense labels from clip. In ECCV, 2022.
  63. isdf: Real-time neural signed distance fields for robot perception. In RSS, 2022.
  64. Implicit geometric regularization for learning shapes. In ICML. PMLR, 2020.
  65. 3d object detection with pointformer. In CVPR, 2021.
  66. Point transformer. In ICCV, 2021.
  67. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
  68. Deep residual learning for image recognition. In CVPR, 2016.
  69. Hybvio: Pushing the limits of real-time visual-inertial odometry. In WACV, 2022.
  70. The Open Motion Planning Library. IEEE Robotics & Automation Magazine, 19(4):72–82, December 2012. https://ompl.kavrakilab.org.
  71. https://gazebosim.org/home.
  72. Habitat 2.0: Training home assistants to rearrange their habitat. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
Authors (11)
  1. Ri-Zhao Qiu (9 papers)
  2. Yafei Hu (7 papers)
  3. Ge Yang (49 papers)
  4. Yuchen Song (16 papers)
  5. Yang Fu (43 papers)
  6. Jianglong Ye (11 papers)
  7. Jiteng Mu (10 papers)
  8. Ruihan Yang (43 papers)
  9. Nikolay Atanasov (101 papers)
  10. Sebastian Scherer (163 papers)
  11. Xiaolong Wang (243 papers)
Citations (17)

Summary

Unifying Navigation and Manipulation through Generalizable Feature Fields in Real-Time Mobile Robotics

Introduction

The exploration of unified scene representations suitable for both robot navigation and manipulation remains a significant frontier in robotics research. Typical approaches often treat navigation and manipulation as separate challenges, employing distinct strategies and representations for each task. Navigation typically leverages large-scale geometric or topological maps, while manipulation relies on precise, continuous scene representations for object interaction. The discrepancy between these approaches complicates tasks requiring integrated visuomotor skills, particularly in dynamic, real-world environments.

In this work, the authors introduce Generalizable Feature Fields (GeFF), a scene-level neural feature representation designed for real-time, unified use in navigation and manipulation. GeFF builds on generalizable neural radiance fields, extending their utility beyond novel view synthesis to capture rich semantic and geometric scene priors. Notably, GeFF incorporates language-aligned semantics via feature distillation from a vision-language model (VLM), enabling open-vocabulary tasks. This integration allows a robot to act on natural-language instructions and interact efficiently with dynamic environments.
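
As a rough illustration of this two-stage recipe, the sketch below composes a training objective in PyTorch: a photometric reconstruction loss for the novel-view-synthesis pre-training stage, plus a CLIP feature distillation term for the alignment stage. The specific loss forms and the weighting constant LAMBDA_FEAT are assumptions for illustration, not the authors' exact formulation; the rendered quantities would come from the feature field described in the Methodology section below.

```python
import torch
import torch.nn.functional as F

LAMBDA_FEAT = 0.5  # assumed weighting between reconstruction and distillation


def pretrain_loss(rendered_rgb, target_rgb):
    # Stage 1: generative novel view synthesis as pre-training,
    # supervised here with a simple photometric (MSE) reconstruction loss.
    return F.mse_loss(rendered_rgb, target_rgb)


def alignment_loss(rendered_rgb, target_rgb, rendered_feat, clip_feat):
    # Stage 2: keep the reconstruction term and add a CLIP feature
    # distillation term so rendered features land in CLIP's embedding space.
    photometric = pretrain_loss(rendered_rgb, target_rgb)
    distill = 1.0 - F.cosine_similarity(rendered_feat, clip_feat, dim=-1).mean()
    return photometric + LAMBDA_FEAT * distill
```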

Generalizable Feature Fields: Methodology

GeFF distinguishes itself by merging scene-level generalizable Neural Radiance Fields (NeRF) with feature distillation, creating a representation capable of real-time updates and language alignment. The methodology encompasses two principal components:

  • Real-time Scene Representation: GeFF employs an encoder-decoder framework where the encoder processes input RGB-D streams, generating a latent representation dynamically updated as the robot navigates and manipulates within its environment. This process supports incremental scene understanding and manipulation planning in a unified manner.
  • Semantic Alignment through Feature Distillation: Beyond capturing geometry, GeFF enriches the scene representation with semantics by distilling features from a pre-trained vision-language model (VLM), specifically CLIP. This alignment enables robots to understand and execute tasks described in natural language, addressing both specific objects and their broader semantic context; a minimal encoder-decoder sketch follows this list.
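
As referenced above, the following is a minimal PyTorch sketch of such an encoder-decoder. It is illustrative only: the layer sizes, the global pooling of the scene latent (generalizable fields such as pixelNeRF typically use pixel-aligned features instead), and the 512-dimensional CLIP feature width are assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn


class GeFFSketch(nn.Module):
    """Toy encoder-decoder: RGB-D frames -> scene latent -> per-point density
    and a CLIP-aligned semantic feature. Illustrative only."""

    def __init__(self, latent_dim=128, clip_dim=512):
        super().__init__()
        # Encoder: 4-channel RGB-D image -> downsampled latent feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: (3D query point, scene latent) -> [density, CLIP-aligned feature].
        self.decoder = nn.Sequential(
            nn.Linear(3 + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 1 + clip_dim),
        )

    def forward(self, rgbd, points):
        # rgbd: (B, 4, H, W); points: (B, N, 3) query locations in the scene.
        latent = self.encoder(rgbd)               # (B, latent_dim, H/4, W/4)
        scene_code = latent.mean(dim=(2, 3))      # global pooling, for brevity only
        code = scene_code.unsqueeze(1).expand(-1, points.shape[1], -1)
        out = self.decoder(torch.cat([points, code], dim=-1))
        density, feature = out[..., :1], out[..., 1:]
        return density, feature


# Usage with dummy data:
model = GeFFSketch()
density, feat = model(torch.randn(1, 4, 240, 320), torch.rand(1, 1024, 3))
```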

Empirical Evaluation

The efficacy of GeFF is demonstrated through deployment on a quadrupedal robot equipped with a manipulator. Evaluations across diverse real-world scenarios, ranging from lab spaces to community kitchens, showcase GeFF's robustness and versatility. Notably, GeFF achieves an average 52.9% success rate in open-vocabulary mobile manipulation tasks, significantly outperforming baseline approaches such as LERF. These results are underpinned by GeFF's ability to provide detailed scene representations and to update in real time in response to dynamic changes, enhancing both navigation and manipulation capabilities.
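
For intuition, the sketch below shows how such an open-vocabulary query might be resolved at inference time, assuming the decoded per-point features live in CLIP's embedding space (which the distillation above is meant to ensure). The function name, the goal-point interpretation, and the use of the OpenAI clip package are illustrative assumptions, not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

import clip  # OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git


@torch.no_grad()
def locate_object(point_features, points, prompt, device="cpu"):
    """Rank candidate 3D points against a natural-language prompt.

    point_features: (N, 512) CLIP-aligned features decoded from the field.
    points: (N, 3) corresponding 3D locations.
    Returns the best-matching location, a stand-in for the goal that a
    navigation/manipulation planner would be handed.
    """
    model, _ = clip.load("ViT-B/32", device=device)
    text_feat = model.encode_text(clip.tokenize([prompt]).to(device)).float()  # (1, 512)
    sims = F.cosine_similarity(point_features.to(device), text_feat, dim=-1)   # (N,)
    return points[sims.argmax()]
```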

Implications and Future Directions

The development of GeFF marks an important step towards realizing robots capable of integrated navigation and manipulation in dynamic, real-world settings. By bridging the gap between geometric navigation maps and manipulation-centric scene representations, GeFF facilitates a broader range of autonomous robotic tasks. The approach's real-time performance and open-vocabulary capability expand the potential for robots to interact with their environments in a more natural and intuitive manner.

Looking forward, the work opens several avenues for further research. Enhancements in feature distillation could refine semantic understanding and alignment with language, while advances in incremental learning may bolster GeFF's adaptability to novel environments and tasks. Moreover, integrating GeFF with end-to-end learning strategies for visuomotor control could further unify the navigation-manipulation pipeline, leading to more sophisticated and autonomous robotic systems.

In essence, Generalizable Feature Fields represent a significant stride towards versatile, language-aware robotic systems capable of navigating and manipulating within complex, ever-changing environments. This work not only advances our understanding of unified scene representations but also lays the groundwork for future innovations in robot autonomy.