
Unified scene representation for navigation and manipulation

Determine a unified representation of objects and scenes for mobile manipulation that supports both robot navigation over expansive environments and object manipulation requiring intricate geometry and fine-grained semantic understanding, so that a single representation serves both tasks.


Background

The paper tackles the challenge of creating a single scene representation that supports both navigation and manipulation. Traditional navigation approaches often rely on geometric or topological maps that scale to large environments but lack the fine-grained geometry needed for manipulation. Conversely, manipulation methods frequently use continuous implicit surfaces or meshes that enable precise grasping but are not readily integrated with large-scale navigation representations.

To address this discrepancy, the authors propose GeFF (Generalizable Feature Fields), which uses generalizable neural radiance fields with feature distillation from language-aligned vision models (e.g., CLIP) to produce a real-time, unified representation supporting both geometric queries (e.g., SDF, meshes, point clouds) and semantic queries aligned with natural language. Despite proposing a solution, the paper explicitly flags the unification of representations for both navigation and manipulation as an open problem.
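The paper does not prescribe a concrete API for such a representation. As a rough illustration only, the sketch below shows what a unified query interface might look like: one object that answers both coarse geometric queries for navigation (signed distance) and language-aligned semantic queries for manipulation. The class `UnifiedFeatureField`, its methods, and the placeholder internals (an analytic sphere SDF and random features) are all hypothetical and merely stand in for what a learned field such as GeFF would provide.

```python
import numpy as np


class UnifiedFeatureField:
    """Hypothetical interface for a scene representation serving both
    navigation (coarse geometry over large areas) and manipulation
    (fine geometry plus language-aligned semantics).

    Internals are placeholders; a real system would populate them from a
    generalizable NeRF with distilled CLIP-style features.
    """

    def __init__(self, feature_dim: int = 512, seed: int = 0):
        self.feature_dim = feature_dim
        self.rng = np.random.default_rng(seed)

    def query_sdf(self, points: np.ndarray) -> np.ndarray:
        """Signed distance at 3D query points, shape (N, 3) -> (N,).
        Placeholder: distance to a unit sphere at the origin."""
        return np.linalg.norm(points, axis=-1) - 1.0

    def query_features(self, points: np.ndarray) -> np.ndarray:
        """Language-aligned feature vectors at 3D points, (N, 3) -> (N, D).
        Placeholder: random unit vectors standing in for distilled features."""
        feats = self.rng.normal(size=(points.shape[0], self.feature_dim))
        return feats / np.linalg.norm(feats, axis=-1, keepdims=True)

    def semantic_similarity(self, points: np.ndarray,
                            text_embedding: np.ndarray) -> np.ndarray:
        """Cosine similarity between per-point features and a text embedding,
        the kind of open-vocabulary query used to locate a named object."""
        feats = self.query_features(points)
        text = text_embedding / np.linalg.norm(text_embedding)
        return feats @ text


# Example: the same field answers a navigation-style free-space check
# and a manipulation-style semantic query over candidate points.
field = UnifiedFeatureField()
waypoints = np.array([[2.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
print("free space?", field.query_sdf(waypoints) > 0.0)
text_emb = np.random.default_rng(1).normal(size=512)  # stand-in for a CLIP text embedding
print("similarity:", field.semantic_similarity(waypoints, text_emb))
```

The design point this sketch tries to make concrete is that both downstream consumers (a planner checking collisions and a grasping module ranking object regions) query the same underlying field rather than two separate maps.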

References

An open problem in mobile manipulation is how to represent objects and scenes in a unified manner so that robots can use it both for navigating in the environment and manipulating objects.

Learning Generalizable Feature Fields for Mobile Manipulation (arXiv:2403.07563, Qiu et al., 12 Mar 2024), Abstract (page 1)