GeT-USE: Learning Generalized Tool Usage for Bimanual Mobile Manipulation via Simulated Embodiment Extensions

Published 29 Oct 2025 in cs.RO | (2510.25754v1)

Abstract: The ability to use random objects as tools in a generalizable manner is a missing piece in robots' intelligence today to boost their versatility and problem-solving capabilities. State-of-the-art robotic tool usage methods focused on procedurally generating or crowd-sourcing datasets of tools for a task to learn how to grasp and manipulate them for that task. However, these methods assume that only one object is provided and that it is possible, with the correct grasp, to perform the task; they are not capable of identifying, grasping, and using the best object for a task when many are available, especially when the optimal tool is absent. In this work, we propose GeT-USE, a two-step procedure that learns to perform real-robot generalized tool usage by learning first to extend the robot's embodiment in simulation and then transferring the learned strategies to real-robot visuomotor policies. Our key insight is that by exploring a robot's embodiment extensions (i.e., building new end-effectors) in simulation, the robot can identify the general tool geometries most beneficial for a task. This learned geometric knowledge can then be distilled to perform generalized tool usage tasks by selecting and using the best available real-world object as a tool. On a real robot with 22 degrees of freedom (DOFs), GeT-USE outperforms state-of-the-art methods by 30-60% success rates across three vision-based bimanual mobile manipulation tool-usage tasks.


Authors (4)

Summary

The paper introduces GeT-USE, a framework that learns generalized tool usage through simulated embodiment extensions to enhance bimanual manipulation.
It leverages a dual-phase process: first, it builds tools in simulation via reinforcement learning, then distills vision-based policies for real-world deployment.
Experimental results show a 30-60% improvement over baselines, emphasizing the importance of 6-DOF control and robust tool selection in challenging tasks.

GeT-USE: Generalized Tool Usage for Bimanual Mobile Manipulation via Simulated Embodiment Extensions

Introduction and Motivation

The paper introduces GeT-USE, a framework for learning generalized tool usage in bimanual mobile manipulation tasks by leveraging simulated embodiment extensions. The central premise is that robots can acquire versatile manipulation capabilities by first exploring and learning to extend their own embodiment in simulation, and then transferring the learned geometric and control strategies to real-world visuomotor policies. This approach addresses the limitations of prior work, which typically assumes the presence of a single, ideal tool and does not generalize to scenarios where the optimal tool is absent or multiple objects are available.

Figure 1: GeT-USE enables a TIAGo robot to solve bimanual mobile manipulation tasks by learning in simulation to build tools and transferring the strategy to real-world vision-based modules.

Framework Overview

GeT-USE operates in two main phases:

Simulated Embodiment Extension: The robot incrementally builds tools by appending small blocks to its wrists in simulation, guided by a reinforcement learning policy. The policy receives depth images and outputs the position and size of new blocks, terminating when a suitable tool is constructed. Success is determined by executing a predefined manipulation strategy using privileged information.
Vision-Based Module Training and Sim2Real Transfer: The geometric and morphological properties of successful simulated tools are distilled into three vision-based modules:
- Generalized Tool Selector: Trained to rank real-world objects by their suitability for the task using depth images.
- Visuomotor Tool Grasping Policy: Learns to grasp selected objects using depth images and proprioception.
- Visuomotor Tool Manipulation Policy: Controls the robot to perform the task with the grasped tool.

Figure 2: The GeT-USE framework: training in simulation (top) and deployment in the real world (bottom), with modules for tool selection, grasping, and manipulation.

Simulated Tool-Building Policy

The tool-building policy, $\pi_\mathit{gtb}$, is trained via RL to explore the space of possible tool geometries by appending blocks to the robot's wrists. The action space includes both the relative position and size of each block. The policy is rewarded for constructing tools that enable successful task completion, as determined by downstream manipulation strategies. This process generates a diverse set of both optimal and suboptimal tool geometries, which are critical for training robust selection and manipulation modules.
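
The incremental building loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `Block`, `BuildAction`, `run_build_episode`, `scripted_policy`, and `reaches_far_enough` are hypothetical, the policy here observes the list of placed blocks instead of depth images, and the sparse 0/1 reward and the reach-based success check are assumptions standing in for the simulated manipulation strategy.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Block:
    center: Vec3  # block center in the wrist frame (metres)
    size: Vec3    # edge lengths (metres)


@dataclass(frozen=True)
class BuildAction:
    offset: Vec3        # position relative to the previously placed block
    size: Vec3          # size of the new block
    stop: bool = False  # True ends the episode without placing a block


def run_build_episode(
    policy: Callable[[List[Block]], BuildAction],
    task_succeeds: Callable[[List[Block]], bool],
    max_blocks: int = 8,
) -> Tuple[List[Block], float]:
    """Append blocks until the policy stops, then score the finished tool."""
    blocks: List[Block] = []
    anchor: Vec3 = (0.0, 0.0, 0.0)  # start building at the wrist origin
    for _ in range(max_blocks):
        action = policy(blocks)
        if action.stop:
            break
        center = tuple(a + o for a, o in zip(anchor, action.offset))
        blocks.append(Block(center, action.size))
        anchor = center
    # Assumed sparse reward: 1 if the finished tool solves the task, else 0.
    reward = 1.0 if task_succeeds(blocks) else 0.0
    return blocks, reward


def scripted_policy(blocks: List[Block]) -> BuildAction:
    """Stand-in for a learned policy: build a straight three-block stick."""
    if len(blocks) >= 3:
        return BuildAction((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), stop=True)
    return BuildAction(offset=(0.1, 0.0, 0.0), size=(0.1, 0.02, 0.02))


def reaches_far_enough(blocks: List[Block], min_reach: float = 0.3) -> bool:
    """Toy success check: does the tool extend at least min_reach along x?"""
    reach = max((b.center[0] + b.size[0] / 2 for b in blocks), default=0.0)
    return reach >= min_reach
```

Calling `run_build_episode(scripted_policy, reaches_far_enough)` yields a three-block tool that passes the toy check. Episodes that fail the check are still worth keeping, since the text notes that suboptimal geometries are also used to train the downstream modules.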

Figure 3: Example rollouts of GeT-USE's tool-building policy for Sweeping, Hook, and Decanting tasks, showing incremental construction of complex tools.

Generalized Tool Selection

The tool selector module, $\mathcal{D}_\mathit{gts}$, is trained using depth images of simulated tools labeled with success or failure. Successful tools are annotated with binary masks indicating the graspable region. The module outputs a likelihood map for each candidate object, enabling the robot to "make the best of what it has" even when the ideal tool is absent.
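
One way to turn per-object likelihood maps into a ranking is sketched below. The summary does not state how the maps are aggregated, so scoring each candidate by its maximum pixel value is an assumption made for illustration, as are the function names `best_pixel` and `rank_candidates` and the representation of a map as a nested list of values in [0, 1].

```python
from typing import Dict, List, Tuple

LikelihoodMap = List[List[float]]  # rows of per-pixel values in [0, 1]


def best_pixel(likelihood_map: LikelihoodMap) -> Tuple[float, Tuple[int, int]]:
    """Return the highest likelihood and its (row, col) location."""
    best_score, best_rc = -1.0, (0, 0)
    for r, row in enumerate(likelihood_map):
        for c, p in enumerate(row):
            if p > best_score:
                best_score, best_rc = p, (r, c)
    return best_score, best_rc


def rank_candidates(
    maps: Dict[str, LikelihoodMap],
) -> List[Tuple[str, float, Tuple[int, int]]]:
    """Sort candidates by their peak likelihood, best first.

    Assumption: the peak value summarises a candidate's suitability and
    its location is a reasonable place to grasp.
    """
    scored = []
    for name, likelihood_map in maps.items():
        score, rc = best_pixel(likelihood_map)
        scored.append((name, score, rc))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored
```

Because the ranking is relative, the top entry is simply the best object among those present, which matches the "make the best of what it has" behaviour even when no candidate scores highly.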

Figure 4: GeT-USE's tool selector ranks objects for Sweeping, Hook, and Decanting tasks, preferring those with suitable geometric features.

Visuomotor Grasping and Manipulation Policies

The grasping policy, $\pi_\mathit{gtg}$, and manipulation policy, $\pi_\mathit{gtm}$, are trained in simulation using depth images and proprioceptive data. The grasping policy learns to pick up objects in a manner consistent with their intended use as tools, while the manipulation policy controls all 22 DOFs of the TIAGo robot to execute the task. Both policies are supported by success detectors trained to autonomously identify successful execution from visual input.

Real-World Deployment and Evaluation

At test time, the robot uses an object detector to generate candidate patches, applies the tool selector to choose the best object, and then executes the grasping and manipulation policies. The system is evaluated on three tasks (Sweeping, Hook, and Decanting) using a diverse set of real-world objects, including both useful and adversarial items.
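
The test-time sequence above can be summarised as a short pipeline. The sketch below is only a schematic of the control flow: `run_task` and its callable arguments (`detect`, `select`, `grasp`, `manipulate`) are hypothetical interfaces, and the boolean return values stand in for the success detectors mentioned earlier.

```python
from typing import Callable, List, TypeVar

Patch = TypeVar("Patch")  # whatever the object detector returns per candidate


def run_task(
    depth_image: object,
    detect: Callable[[object], List[Patch]],
    select: Callable[[Patch], float],
    grasp: Callable[[Patch], bool],
    manipulate: Callable[[Patch], bool],
) -> str:
    """Detect candidates, pick the best-scoring one, grasp it, then use it."""
    patches = detect(depth_image)
    if not patches:
        return "no_candidates"
    best = max(patches, key=select)  # highest tool-selector score wins
    if not grasp(best):
        return "grasp_failed"
    if not manipulate(best):
        return "manipulation_failed"
    return "success"
```

Returning a distinct status for each stage makes it easier to attribute a failed trial to detection, grasping, or manipulation.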

Figure 5: Simulated and real-world versions of Sweeping, Hook, and Decanting tasks, demonstrating GeT-USE's sim2real generalization.

Figure 6: All real-world objects used in experiments, illustrating the diversity and challenge of generalized tool usage.

Experimental Results

GeT-USE achieves 30-60% higher success rates than state-of-the-art baselines (TOG-Net variants) across all three tasks. Notably, TOG-Net fails completely when restricted to top-down grasping and manipulation or when it lacks a tool selector. GeT-USE's full 6-DOF control and its ability to select the most suitable object are both critical for success: ablation studies show that removing either one results in a 50-60% drop in performance.

Failure Analysis

Failures in Sweeping and Hook are primarily attributed to sim-to-real dynamics gaps, such as objects sliding under the tool or being pushed out of reach. Decanting failures are due to hardware limitations causing vibration during pouring. These limitations highlight the need for improved simulation fidelity and more robust real-world controllers.

Implications and Future Directions

GeT-USE demonstrates that simulated embodiment extension is an effective strategy for learning generalized tool usage, enabling robots to adapt to novel objects and tasks without requiring extensive real-world data collection. The approach is scalable and leverages depth-based vision for robust sim2real transfer. However, the framework currently assumes accurate simulation of rigid-body dynamics and is limited to parallel-jaw grippers. Extending GeT-USE to deformable objects, articulated tools, and multi-fingered hands represents a promising direction for future research.

Conclusion

GeT-USE provides a principled framework for learning and deploying generalized tool usage in bimanual mobile manipulation. By combining simulated embodiment extension with vision-based policy distillation, it achieves superior performance and generalization compared to prior methods. The results underscore the importance of geometric reasoning, robust selection mechanisms, and full DOF control in enabling versatile robotic manipulation. Future work should address sim-to-real gaps, richer object dynamics, and more dexterous manipulation capabilities.
