FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real (2502.17894v1)

Published 25 Feb 2025 in cs.RO and cs.CV

Abstract: Object fetching from cluttered shelves is an important capability for robots to assist humans in real-world scenarios. Achieving this task demands robotic behaviors that prioritize safety by minimizing disturbances to surrounding objects, an essential but highly challenging requirement due to restricted motion space, limited fields of view, and complex object dynamics. In this paper, we introduce FetchBot, a sim-to-real framework designed to enable zero-shot generalizable and safety-aware object fetching from cluttered shelves in real-world settings. To address data scarcity, we propose an efficient voxel-based method for generating diverse simulated cluttered shelf scenes at scale and train a dynamics-aware reinforcement learning (RL) policy to generate object fetching trajectories within these scenes. This RL policy, which leverages oracle information, is subsequently distilled into a vision-based policy for real-world deployment. Considering that sim-to-real discrepancies stem from texture variations mostly while from geometric dimensions rarely, we propose to adopt depth information estimated by full-fledged depth foundation models as the input for the vision-based policy to mitigate sim-to-real gap. To tackle the challenge of limited views, we design a novel architecture for learning multi-view representations, allowing for comprehensive encoding of cluttered shelf scenes. This enables FetchBot to effectively minimize collisions while fetching objects from varying positions and depths, ensuring robust and safety-aware operation. Both simulation and real-robot experiments demonstrate FetchBot's superior generalization ability, particularly in handling a broad range of real-world scenarios, includ

Summary

Overview of "FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real"

FetchBot is a sim-to-real framework for safety-aware object fetching from cluttered shelves. The paper combines scalable synthetic scene generation, dynamics-aware policy learning, and a multi-view 3D scene representation, which together allow the learned policy to be deployed on a real robot without real-world training or fine-tuning (zero-shot deployment).

Core Contributions and Methodological Advancements

FetchBot addresses the problem of retrieving a target object from a cluttered shelf, a setting common in warehouses and homes, while minimizing disturbance to the surrounding objects. Central to the paper is UniVoxGen, a voxel-based cluttered scene generator that efficiently creates realistic and diverse shelf environments to address data scarcity. UniVoxGen accelerates large-scale scene generation by performing collision checks directly in voxel space under predefined generation rules, avoiding the inefficiency of conventional simulation-based methods.
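The idea of checking collisions directly in voxel space can be illustrated with a minimal sketch. This is not UniVoxGen itself: the sparse set-of-voxels representation, the box-shaped objects, the "objects rest on the shelf floor" rule, and all function names (`box_voxels`, `try_place`, `generate_scene`) are assumptions made for illustration. The point is only that a placement test reduces to a bounds check plus a set-intersection test, with no physics simulation step.

```python
import random


def box_voxels(size):
    """Voxel offsets occupied by an axis-aligned box (hypothetical stand-in for a voxelized object)."""
    sx, sy, sz = size
    return {(x, y, z) for x in range(sx) for y in range(sy) for z in range(sz)}


def translate(voxels, offset):
    """Shift every voxel by an integer offset."""
    ox, oy, oz = offset
    return {(x + ox, y + oy, z + oz) for x, y, z in voxels}


def in_bounds(voxels, shelf_size):
    """True if every voxel lies inside the shelf volume."""
    X, Y, Z = shelf_size
    return all(0 <= x < X and 0 <= y < Y and 0 <= z < Z for x, y, z in voxels)


def try_place(occupied, obj, offset, shelf_size):
    """Place obj at offset if it fits and overlaps nothing; mutate occupied on success."""
    placed = translate(obj, offset)
    # The collision check is a set-disjointness test in voxel space.
    if not in_bounds(placed, shelf_size) or not occupied.isdisjoint(placed):
        return False
    occupied |= placed
    return True


def generate_scene(shelf_size, object_sizes, attempts_per_object=50, seed=0):
    """Rejection-sample a placement for each object (assumed rule: objects sit on the floor, z = 0)."""
    rng = random.Random(seed)
    occupied, placements = set(), []
    for size in object_sizes:
        obj = box_voxels(size)
        for _ in range(attempts_per_object):
            offset = (rng.randrange(shelf_size[0]), rng.randrange(shelf_size[1]), 0)
            if try_place(occupied, obj, offset, shelf_size):
                placements.append((size, offset))
                break
    return occupied, placements


occupied, placements = generate_scene((10, 6, 8), [(2, 2, 4), (3, 2, 5), (2, 3, 3)])
print(len(placements), "objects placed,", len(occupied), "voxels occupied")
```

Because each candidate placement costs only a few set operations, many scenes can be sampled cheaply; a dense boolean array with a logical AND would serve the same purpose.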

A dynamics-aware reinforcement learning (RL) policy, which has access to oracle information, is trained on these generated scenes and then used to produce expert fetching trajectories. These trajectories account for the restricted motion space and for dynamic interactions with neighboring objects, providing a foundation for robust, collision-averse fetching policies. Unlike pure motion planning, the RL policy incorporates environmental feedback, which matters most in densely packed scenes where some contact is often unavoidable for task completion.

The RL policy is distilled into a multi-view 3D vision-based policy for deployment. Because the sim-to-real gap stems predominantly from texture differences rather than geometry, the system feeds the policy depth maps estimated by a depth foundation model, such as DepthAnything, instead of raw RGB images. It also introduces a novel architecture for learning a coherent 3D scene representation from multiple views, using auxiliary tasks such as occupancy prediction to encode the geometric features needed for manipulation.
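An auxiliary occupancy-prediction objective of this kind is commonly written as a binary cross-entropy over voxels, added to the main policy loss. The sketch below assumes that form; the paper's actual loss, its weighting, and the names `occupancy_bce`, `total_loss`, and `aux_weight` are hypothetical and not taken from the source.

```python
import math


def occupancy_bce(pred_probs, target_occ, eps=1e-7):
    """Mean binary cross-entropy between predicted occupancy probabilities and 0/1 targets.

    pred_probs and target_occ are flat lists over the voxels of a grid.
    """
    assert len(pred_probs) == len(target_occ) and len(pred_probs) > 0
    total = 0.0
    for p, t in zip(pred_probs, target_occ):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred_probs)


def total_loss(action_loss, pred_probs, target_occ, aux_weight=0.5):
    """Main imitation loss plus the weighted auxiliary term (aux_weight is an assumed hyperparameter)."""
    return action_loss + aux_weight * occupancy_bce(pred_probs, target_occ)


# Toy example: four voxels, two of them occupied.
print(total_loss(1.0, [0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

The auxiliary term pushes the shared encoder to retain scene geometry, including regions that are only visible from some of the views, rather than only the features that directly predict actions.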

Experimental Evaluation

Experiments in simulation and on real-world hardware, using datasets of over a million scenes, demonstrate FetchBot's strong generalization across diverse shelf configurations. The 3D vision policy achieves an 81.46% success rate on varied test scenarios in simulation and a 76.67% success rate under challenging real-world conditions, outperforming heuristic, motion-planning, and other learning-based baselines.

Implications and Future Directions

FetchBot's zero-shot sim-to-real transfer offers a practical template for robotic systems that must manipulate objects in cluttered environments. It shows that training entirely in simulation can reduce the cost of data collection, with potential relevance to manufacturing, logistics, and domestic robotics. Its combination of large-scale synthetic data with a robust 3D scene representation may also carry over to other applications that require precise, adaptive manipulation in cluttered, unstructured settings.

Looking ahead, extending the framework to a wider range of object shapes, or to non-suction manipulation, could address its current limitations on object surface and size. Increasing the spatial resolution of the occupancy prediction model could further improve adaptability and accuracy, provided this can be done without a large increase in computational cost. Both directions are open topics for further sim-to-real research.