
SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation

Published 18 Feb 2026 in cs.RO and cs.AI | (2602.16863v1)

Abstract: The ability to manipulate tools significantly expands the set of tasks a robot can perform. Yet, tool manipulation represents a challenging class of dexterity, requiring grasping thin objects, in-hand object rotations, and forceful interactions. Since collecting teleoperation data for these behaviors is challenging, sim-to-real reinforcement learning (RL) is a promising alternative. However, prior approaches typically require substantial engineering effort to model objects and tune reward functions for each task. In this work, we propose SimToolReal, taking a step towards generalizing sim-to-real RL policies for tool manipulation. Instead of focusing on a single object and task, we procedurally generate a large variety of tool-like object primitives in simulation and train a single RL policy with the universal goal of manipulating each object to random goal poses. This approach enables SimToolReal to perform general dexterous tool manipulation at test-time without any object or task-specific training. We demonstrate that SimToolReal outperforms prior retargeting and fixed-grasp methods by 37% while matching the performance of specialist RL policies trained on specific target objects and tasks. Finally, we show that SimToolReal generalizes across a diverse set of everyday tools, achieving strong zero-shot performance over 120 real-world rollouts spanning 24 tasks, 12 object instances, and 6 tool categories.

Summary

  • The paper introduces an object-centric reinforcement learning approach that reduces dexterous tool manipulation to goal pose tracking.
  • It leverages procedural object generation and SAPG-based training to generalize across diverse tasks and novel real-world tools.
  • Experimental results show a 37% performance boost over baselines, demonstrating robust zero-shot sim-to-real transfer.

SimToolReal: An Object-Centric Framework for Zero-Shot Dexterous Tool Manipulation

Problem Formulation and Approach

Dexterous tool manipulation extends robot capability across a broad range of household and industrial tasks. However, robust tool use remains challenging: the robot must grasp thin objects lying flat on a surface (e.g., markers, hammers), rotate them in hand into functional poses, and exert controlled force during environmental interaction. Because collecting teleoperated demonstrations for these behaviors is difficult, sim-to-real reinforcement learning (RL) is a promising alternative, but existing approaches typically demand substantial simulation setup, per-object modeling, and per-task reward engineering.

SimToolReal circumvents these barriers by casting dexterous tool use as an object-centric RL problem: moving arbitrary tool-like objects through sequences of goal poses. This formulation enables training a single goal-conditioned policy in simulation on procedurally generated tool primitives, each paired with random goal poses. At inference, the policy executes real-world tool-use behaviors by tracking pose trajectories extracted from RGB-D human videos, without object- or task-specific fine-tuning.

Figure 1: The SimToolReal framework trains a goal-conditioned RL policy in simulation using procedurally-generated objects and then deploys it zero-shot on real tools leveraging human-demonstrated tool trajectories.

Object-Centric Policy Design and Perception

A core contribution of SimToolReal is an object-centric observation space, consisting minimally of:

  1. The current 6D tool pose
  2. A 3D grasp-region bounding box
  3. A current goal pose (from a demonstration)
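The three inputs listed above can be pictured as one flat vector. The following is a minimal sketch only: the `Pose` and `GraspBox` types, the (w, x, y, z) quaternion convention, the center-plus-half-extents box encoding, and the function name `build_observation` are all illustrative assumptions, not the paper's actual layout. Robot proprioception, which a real policy would presumably also receive, is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z); convention assumed here


@dataclass
class Pose:
    position: Vec3      # metres
    orientation: Quat   # unit quaternion


@dataclass
class GraspBox:
    # Hypothetical encoding of the 3D grasp-region bounding box.
    center: Vec3        # box center, expressed in the object frame (assumed)
    half_extents: Vec3  # half side lengths in metres (assumed)


def build_observation(tool_pose: Pose, grasp_box: GraspBox, goal_pose: Pose) -> List[float]:
    """Concatenate tool pose (7), grasp box (6) and goal pose (7) into 20 floats."""
    obs: List[float] = []
    for part in (
        tool_pose.position,
        tool_pose.orientation,
        grasp_box.center,
        grasp_box.half_extents,
        goal_pose.position,
        goal_pose.orientation,
    ):
        obs.extend(float(x) for x in part)
    return obs
```

Nothing in this vector depends on the tool's appearance or category, which is what lets the same policy input be filled in for a hammer, a marker, or a procedurally generated primitive.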

The policy uses an LSTM core so that it can exploit temporal context and implicitly infer unmodeled physical properties. At deployment, tool representations are extracted with a vision pipeline combining SAM 3D (for mesh and segmentation) and FoundationPose (for 6D tracking). Because the policy consumes poses rather than images, this enables sim-to-real transfer across novel objects while sidestepping the visual domain gap.

Figure 2: Real-world deployment pipeline showing human video processing (extraction of the object mesh, grasp-region segmentation, and pose trajectory) and closed-loop policy execution for dexterous manipulation tasks.

Training Protocol and Procedural Object Generation

The RL training environment utilizes a highly parallel simulation (IsaacGym on GPU), randomizing:

  • Robot and object initialization
  • Physical parameters (geometry, density, mass)
  • Action, observation, and pose delays
  • Observation and actuation noise
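As a rough illustration of the randomization list above, the sketch below samples one set of per-episode parameters and models a delay as a fixed-length buffer. Every field name and numeric range is a placeholder chosen for this example; the summary does not state the ranges the authors used, and this is not the IsaacGym API.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class EpisodeRandomization:
    object_mass_scale: float       # multiplier on nominal mass (range assumed)
    action_delay_steps: int        # control steps of actuation latency (assumed)
    observation_delay_steps: int   # control steps of proprioceptive latency (assumed)
    pose_delay_steps: int          # control steps of object-pose latency (assumed)
    observation_noise_std: float   # std of additive observation noise (assumed)
    action_noise_std: float        # std of additive action noise (assumed)


def sample_randomization(rng: random.Random) -> EpisodeRandomization:
    """Draw one set of episode parameters from hypothetical uniform ranges."""
    return EpisodeRandomization(
        object_mass_scale=rng.uniform(0.5, 1.5),
        action_delay_steps=rng.randint(0, 2),
        observation_delay_steps=rng.randint(0, 2),
        pose_delay_steps=rng.randint(0, 3),
        observation_noise_std=rng.uniform(0.0, 0.01),
        action_noise_std=rng.uniform(0.0, 0.05),
    )


class DelayBuffer:
    """Returns the value that was pushed `delay_steps` calls earlier."""

    def __init__(self, delay_steps: int, initial):
        size = delay_steps + 1
        self._buf = deque([initial] * size, maxlen=size)

    def push(self, value):
        self._buf.append(value)  # the oldest entry is dropped automatically
        return self._buf[0]
```

A separate `DelayBuffer` per signal (action, observation, object pose) would let each latency be randomized independently, which matters because a vision-based pose tracker is typically slower than joint encoders.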

Policy optimization uses SAPG (Split and Aggregate Policy Gradients), a PPO variant for massively parallel simulation that mitigates exploration bottlenecks by training a diverse population of policies and aggregating their data with importance weighting. The critic is asymmetric: it accesses privileged simulator state to stabilize value estimation and accelerate policy convergence. The reward combines smoothness, grasping, and keypoint-based goal terms, promoting both dexterous reorientation and trajectory following.
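A keypoint-based goal term can be sketched as follows. The idea is to attach a few fixed points to the object, transform them by both the current and the goal pose, and measure how far apart the corresponding points are, so that position and orientation error collapse into a single distance in metres. The keypoint layout (`BOX_KEYPOINTS`), the use of the maximum distance, and the exponential shaping with `scale` are assumptions made for this example; the summary does not give the authors' exact formula.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z), assumed unit quaternion


def _cross(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate(q: Quat, v: Vec3) -> Vec3:
    """Rotate v by unit quaternion q via v' = v + w*t + u x t, with t = 2*(u x v)."""
    w, u = q[0], q[1:]
    t = tuple(2.0 * c for c in _cross(u, v))
    ut = _cross(u, t)
    return tuple(v[i] + w * t[i] + ut[i] for i in range(3))


def transform_keypoints(position: Vec3, orientation: Quat,
                        keypoints: Sequence[Vec3]) -> List[Vec3]:
    """Map object-frame keypoints into the world frame."""
    return [tuple(r + p for r, p in zip(rotate(orientation, k), position))
            for k in keypoints]


def keypoint_distance(pos: Vec3, quat: Quat, goal_pos: Vec3, goal_quat: Quat,
                      keypoints: Sequence[Vec3]) -> float:
    """Largest distance between corresponding current and goal keypoints."""
    current = transform_keypoints(pos, quat, keypoints)
    goal = transform_keypoints(goal_pos, goal_quat, keypoints)
    return max(math.dist(c, g) for c, g in zip(current, goal))


def goal_reward(distance: float, scale: float = 10.0) -> float:
    """Hypothetical shaping: 1.0 at the goal, decaying with keypoint distance."""
    return math.exp(-scale * distance)


# Hypothetical keypoints: the 8 corners of a 10 cm x 4 cm x 4 cm box.
BOX_KEYPOINTS = [(sx * 0.05, sy * 0.02, sz * 0.02)
                 for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
```

One consequence of this design is that a pure rotation error still yields a nonzero distance, because the keypoints sit away from the object origin; no separate weighting between translation and rotation terms is needed.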

Procedural tool primitives are synthesized by randomly sampling handle and head geometries (cylinders, cuboids) and densities. This covers much of the geometric and inertial variation typical of real-world tools without per-object asset modeling.
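The sampling described above might look roughly like the sketch below, which draws a handle and a head from the two shape families and derives mass from volume and density. The dimension and density ranges, the `Primitive` and `Tool` types, and the function names are invented for illustration; the summary does not specify the authors' distributions or how the two parts are attached.

```python
import math
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Primitive:
    shape: str                 # "cylinder" or "cuboid"
    dims: Tuple[float, ...]    # cylinder: (radius, length); cuboid: (x, y, z), metres
    density: float             # kg/m^3

    @property
    def volume(self) -> float:
        if self.shape == "cylinder":
            radius, length = self.dims
            return math.pi * radius ** 2 * length
        x, y, z = self.dims
        return x * y * z

    @property
    def mass(self) -> float:
        return self.density * self.volume


@dataclass
class Tool:
    handle: Primitive
    head: Primitive

    @property
    def mass(self) -> float:
        return self.handle.mass + self.head.mass


def _sample_primitive(rng: random.Random, length_range, thickness_range) -> Primitive:
    """Sample one part; all ranges passed in are hypothetical."""
    shape = rng.choice(["cylinder", "cuboid"])
    length = rng.uniform(*length_range)
    density = rng.uniform(300.0, 3000.0)  # assumed range, kg/m^3
    if shape == "cylinder":
        dims = (rng.uniform(*thickness_range) / 2.0, length)
    else:
        dims = (length, rng.uniform(*thickness_range), rng.uniform(*thickness_range))
    return Primitive(shape, dims, density)


def sample_tool(rng: random.Random) -> Tool:
    """Sample a long thin handle and a shorter, thicker head."""
    handle = _sample_primitive(rng, (0.08, 0.25), (0.01, 0.04))
    head = _sample_primitive(rng, (0.03, 0.12), (0.02, 0.08))
    return Tool(handle, head)
```

Sampling density separately for the handle and the head shifts the center of mass along the tool, which is one simple way to cover both a light marker and a head-heavy hammer with the same generator.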

DexToolBench: Benchmark for Generalization

To validate generalization, the authors introduce the DexToolBench benchmark. It comprises 24 everyday tool-use tasks (e.g., hammer swings, whiteboard writing, erasing, table brushing, spatula flipping, screwdriver spinning) spanning 12 object instances in 6 tool categories. Each task is specified by an RGB-D human demonstration.

Figure 3: SimToolReal demonstrates strong generalization to previously unseen DexToolBench tasks and tool instances in real-world deployments.

Experimental Results

Zero-Shot Real-World Performance

Policy generalization is evaluated on previously unseen objects and task trajectories. Across 120 rollouts, the policy achieves high zero-shot task progress (measured as the proportion of trajectory goal poses reached within 2 cm), including on tasks involving in-hand reorientation, forceful contact, and arm-hand coordination. Performance varies with tool geometry and mass: tasks requiring minimal in-hand rotation (e.g., eraser translation) yield near-perfect scores, while thin or heavy tools (e.g., a flat spatula or a mallet) are harder because of grasp reliability and pose estimation limitations.
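The task-progress metric described above can be sketched as a walk through the goal sequence. This version checks position only, using the 2 cm tolerance mentioned in the text, and assumes goals must be reached in order with at most one goal advanced per control step; whether the authors also apply an orientation tolerance is not stated in this summary, so none is modeled. The function name `task_progress` is illustrative.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def task_progress(tool_positions: Sequence[Vec3],
                  goal_positions: Sequence[Vec3],
                  tol: float = 0.02) -> float:
    """Fraction of goals reached in order, within `tol` metres of the tool position."""
    if not goal_positions:
        raise ValueError("goal_positions must not be empty")
    reached = 0
    for position in tool_positions:
        if reached == len(goal_positions):
            break  # every goal has already been reached
        # Only the current goal is tested, so goals cannot be skipped.
        if math.dist(position, goal_positions[reached]) <= tol:
            reached += 1
    return reached / len(goal_positions)


# Example: the tool reaches the first two goals but stalls 5 cm short of the third.
goals = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)]
rollout = [(0.005, 0.0, 0.0), (0.1, 0.01, 0.0), (0.15, 0.0, 0.0)]
progress = task_progress(rollout, goals)  # 2 of 3 goals reached
```

Because goals are consumed in order, a rollout that drops the tool midway keeps the credit it earned up to that point, which makes the score a measure of partial progress rather than binary success.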

Baseline Comparisons

SimToolReal is extensively compared to:

  • Kinematic Retargeting: Retargets human hand-object pose kinematics to the robot via IK, but fails to establish robust contacts and is incapable of grasping and interacting dynamically.
  • Fixed Grasp: The RL policy establishes a grasp, which is then held static while object trajectories are executed via arm motion and trajectory optimization. This baseline fails on tasks involving significant object rotation due to kinematic constraints and environmental collisions.

Figure 4: SimToolReal surpasses both Kinematic Retargeting and Fixed Grasp across real-world tasks requiring dexterous in-hand tool rotation and environmental contact reasoning.

Quantitatively, SimToolReal outperforms prior retargeting and fixed-grasp methods by 37% in task progress.

Specialist Comparison

Specialist RL policies, each trained for a single object and trajectory, are competitive only when evaluated on their exact training conditions; performance degrades sharply if either the object or the trajectory changes. SimToolReal achieves specialist-level performance on both training and novel objects and tasks without any task-specific adaptation.

Figure 5: In simulation, SimToolReal maintains robust performance across object and trajectory variations, while specialists sharply overfit and degrade outside training conditions.

Training-Objective Generalization Correlation

Downstream task performance on DexToolBench is strongly correlated with improvements in training reward on procedural-object pose reaching. This suggests that the pose-reaching training objective is well aligned with generalized dexterous tool-use skills, supporting the object-centric, pose-based abstraction.

Ablation Studies

Two ablations indicate that (1) SAPG is crucial, as standard PPO fails at scale due to exploration saturation, and (2) privileged critic information is essential for overcoming partial observability during training.

Limitations

The current approach assumes rigid objects and follows a fixed goal-pose sequence without dynamic replanning. Because the policy is conditioned solely on object pose and is therefore environment-agnostic, it can collide with obstacles in cluttered or unmodeled scenes. The framework also does not guarantee functional task completion for high-force tasks, though it achieves robust pose tracking and in-hand manipulation.

Implications and Future Directions

SimToolReal demonstrates that a single general-purpose object-centric policy can generalize broadly across tools and tasks with zero-shot sim-to-real transfer. This brings several lines of dexterous manipulation research under one versatile framework, reducing the need for costly object modeling, reward engineering, or teleoperated demonstration.

Practical implications include rapid deployment of dexterous manipulation skills for newly encountered tools and tasks, and extensibility to perceptually guided, language-conditioned, or function-based manipulation as vision-language and trajectory extraction models improve. Theoretically, the work supports object-centric, pose-based conditioning as a robust abstraction for high-DoF manipulation.

Future research should explore closed-loop, dynamically replanned interaction, integration with tactile and force feedback, functional generalization to non-rigid or articulated tools, and multi-object manipulation in semantically rich scenes.

Conclusion

SimToolReal provides compelling evidence for the efficacy of unified, object-centric RL policies in endowing robots with generalizable dexterous tool-manipulation capabilities. By abstracting manipulation as object pose tracking through sequences derived from demonstration, the framework achieves robust zero-shot sim-to-real transfer, outperforming prior kinematic retargeting and fixed-grasp baselines while matching specialist RL policies trained on specific objects and tasks. The paradigm’s scalability, generalization, and practical deployment potential make it a promising foundation for advanced dexterous manipulation research in robotics.


Reference:

"SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation" (2602.16863)
