RealDexUMI: A Wearable Universal Manipulation Interface for Dexterous Robot Learning

Published 4 Jun 2026 in cs.RO | (2606.06033v2)

Abstract: Learning dexterous manipulation requires demonstrations that preserve fine hand-object interactions while remaining executable at deployment. Existing pipelines either lose deployable dexterity through retargeting or embodiment conversion, or rely on robot-specific teleoperation that is costly to scale and often lacks intuitive, contact-aware control for dexterous data collection. We present RealDexUMI, a wearable universal manipulation interface built around a shared dexterous end-effector module that integrates a lightweight dexterous hand, in-hand vision, and fingertip tactile sensing. A palm-side isomorphic teleoperation glove maps human finger inputs to robot-hand joint commands, enabling real-time, retargeting-free, intuitive, and precise hand control. The shared hand and sensing modules yield zero-gap end-effector data, with matched in-hand observations, tactile signals, contacts, and hand actions between collection and deployment. Across eight real-robot tasks spanning fine-grained, contact-rich, long-horizon, and bimanual manipulation, policies trained on RealDexUMI data achieve an average success rate of 88.75%, generalize to unseen initial poses, and transfer across three embodiments. Website: https://research.beingbeyond.com/realdexumi


Authors (10)

Summary

The paper introduces RealDexUMI, a wearable interface that maps human finger motions directly to a robotic hand, eliminating retargeting errors.
It synchronizes multi-modal data—tactile, vision, hand state, and glove commands—to capture deployable action–state correspondence during complex tasks.
Experimental results demonstrate high success rates and robust cross-embodiment policy transfer, outperforming state-only supervision methods.


Motivation and Problem Statement

Dexterous manipulation is challenging in part because of the gap between human demonstrations and executable robot actions. Most demonstration interfaces either require extensive retargeting, which introduces errors in hand–object contact and observation alignment, or rely on robot-specific teleoperation hardware, which impedes scalability and cross-embodiment policy transfer. Crucially, data collection pipelines often fail to guarantee that the captured dexterity remains deployable end to end, that is, that hand actions, contacts, tactile signals, and end-effector observations are preserved between collection and deployment.

System Architecture and Methodology

RealDexUMI addresses the deployable dexterity gap by design. The central idea is to use a shared dexterous end-effector module as both the wearable demonstration interface and the robot’s deployed hand. This module comprises a lightweight, servo-actuated multi-DoF hand, integrated fingertip tactile sensing, and an in-hand camera. A palm-side isomorphic teleoperation glove maps operator finger motions directly into the hand’s command space in real time, so no kinematic retargeting step is needed.

Figure 2: Single-joint magnetic encoder in the glove: (a) interface circuit using AS5600L communicating over I2C, (b) diametric magnet physically aligned with the joint rotation axis for absolute angular measurement.

This hardware configuration ensures that all data streams (RGB vision, tactile arrays, hand state, and hand command) are generated by the same physical hand that will execute policies at deployment. Absolute magnetic encoders within the glove, shown in Figure 2, capture joint commands with high fidelity, supporting drift-free, high-frequency, multi-axis measurement.
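To make the encoder readout concrete, the following is a minimal sketch of how a 12-bit absolute angle from one glove joint could be decoded, assuming the sensor exposes the angle as two register bytes. The I2C address, register offset, and function names are assumptions for illustration, taken from a reading of the AS5600-family datasheet and not from the paper; verify them against the datasheet for the actual part. The bus read itself is left out so the example runs without hardware.

```python
# Assumed constants (check against the AS5600L datasheet before use).
AS5600L_DEFAULT_ADDR = 0x40  # assumed default I2C address of the AS5600L
RAW_ANGLE_REG = 0x0C         # assumed: RAW ANGLE high byte; low byte follows at 0x0D
COUNTS_PER_REV = 4096        # 12-bit absolute angle


def decode_raw_angle(high_byte: int, low_byte: int) -> int:
    """Combine the two register bytes into a 12-bit count in 0..4095."""
    return ((high_byte & 0x0F) << 8) | (low_byte & 0xFF)


def counts_to_degrees(counts: int, zero_offset: int = 0) -> float:
    """Convert counts to degrees relative to a calibrated zero pose, wrapped to [0, 360)."""
    return ((counts - zero_offset) % COUNTS_PER_REV) * 360.0 / COUNTS_PER_REV


# Hypothetical readout: bytes 0x08, 0x00 correspond to half a revolution.
half_turn = counts_to_degrees(decode_raw_angle(0x08, 0x00))
```

Because the reading is absolute, a joint angle is recovered from a single sample and a fixed calibration offset, which is one way to read the "drift-free" property described above.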

Action-State Correspondence and Data Collection

A core challenge in dexterous robot learning is maintaining consistent action–state correspondence, especially under contact constraints. RealDexUMI records not only the measured hand state (as realized after interacting with objects) but also the glove command issued at each timestep, capturing the operator's intended corrective actions even when contact prevents the hand from fully reaching the commanded pose. This lets policies learn contact-aware corrections instead of relying on state-only supervision, which cannot distinguish intended from achieved motion under environmental interaction.
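The difference between command-based and state-only supervision can be illustrated with a small sketch. The `Step` record, the `action_labels` helper, and the choice of "next measured state" as the state-only proxy action are all assumptions made for illustration; the paper may define its state-only baseline differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    hand_state: List[float]     # measured joint positions (what the hand achieved)
    glove_command: List[float]  # commanded joint targets (what the operator asked for)


def action_labels(episode: List[Step], use_command: bool = True) -> List[List[float]]:
    """Per-step supervision targets.

    use_command=True  -> label is the recorded glove command.
    use_command=False -> label is the next measured state (assumed state-only proxy).
    """
    labels = []
    for i, step in enumerate(episode):
        if use_command:
            labels.append(step.glove_command)
        else:
            nxt = episode[min(i + 1, len(episode) - 1)]
            labels.append(nxt.hand_state)
    return labels


# Hypothetical one-joint episode: the finger is blocked by an object at 0.4
# while the operator keeps closing the glove to squeeze harder.
episode = [
    Step(hand_state=[0.2], glove_command=[0.4]),
    Step(hand_state=[0.4], glove_command=[0.7]),
    Step(hand_state=[0.4], glove_command=[0.9]),
]
command_labels = action_labels(episode, use_command=True)
state_labels = action_labels(episode, use_command=False)
```

In this toy episode the state-only labels stay flat at 0.4 once contact is made, so the increasing squeeze intent is only visible in the command labels.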

Collected episodes contain time-aligned streams: in-hand RGB, 5×10×4 tactile arrays, 6-D hand joint state, glove command vector, and 6-DoF tracker pose. These are synchronized using a latest-sample protocol anchored on RGB timestamps. Over 100 hours of demonstrations across eight complex tasks—including long-horizon, contact-rich, and bimanual scenarios—were gathered.
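A latest-sample protocol anchored on RGB timestamps can be sketched as follows: for every RGB frame, each other stream contributes its most recent sample at or before that frame's timestamp. This is a minimal illustration under that reading; the function names and the dictionary layout are hypothetical, and details such as staleness limits or clock offsets are not modeled.

```python
from bisect import bisect_right


def latest_sample(timestamps, values, t):
    """Return the most recent value with timestamp <= t, or None if none exists.

    Assumes `timestamps` is sorted in ascending order.
    """
    i = bisect_right(timestamps, t)
    return values[i - 1] if i > 0 else None


def align_to_rgb(rgb_timestamps, streams):
    """Build one aligned record per RGB frame.

    streams: dict mapping stream name -> (sorted timestamps, values).
    """
    frames = []
    for t in rgb_timestamps:
        record = {"t": t}
        for name, (ts, vals) in streams.items():
            record[name] = latest_sample(ts, vals, t)
        frames.append(record)
    return frames


# Hypothetical streams sampled faster than the camera.
streams = {
    "tactile": ([0.00, 0.01, 0.02, 0.05], ["a", "b", "c", "d"]),
    "glove_command": ([0.00, 0.03], ["g0", "g1"]),
}
aligned = align_to_rgb([0.015, 0.05], streams)
```

Anchoring on the camera means the episode length equals the number of RGB frames, and no record ever contains a sample from the future relative to its image.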

Learning and Policy Architecture

Policy learning is performed in the end-effector reference frame using a chunked prediction interface. For each observation, the policy predicts a temporally coherent sequence comprising local (hand-frame) relative translation, rotation, and the subsequent executable glove command. This representation is invariant to the robot body and does not require global pose or workspace alignment, simplifying cross-embodiment and cross-setup generalization.
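The hand-frame action representation can be sketched by expressing future tracker poses relative to the current end-effector pose, T_rel = T_now^{-1} T_future. The code below is an illustrative assumption, not the paper's implementation: in particular, whether each chunk step is relative to the chunk's first pose or to the previous step, and which command index is paired with each step, are choices made here for concreteness. Poses are (3x3 rotation, translation) pairs in plain Python lists.

```python
def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]


def relative_pose(R_a, t_a, R_b, t_b):
    """Pose b expressed in frame a: R_rel = R_a^T R_b, t_rel = R_a^T (t_b - t_a)."""
    R_a_inv = transpose(R_a)  # the inverse of a rotation matrix is its transpose
    R_rel = matmul(R_a_inv, R_b)
    t_rel = matvec(R_a_inv, [t_b[i] - t_a[i] for i in range(3)])
    return R_rel, t_rel


def action_chunk(poses, commands, t, horizon):
    """Chunk of (relative rotation, relative translation, glove command),
    all expressed in the hand frame at step t (an assumed convention)."""
    R0, p0 = poses[t]
    chunk = []
    for k in range(1, horizon + 1):
        R_k, p_k = poses[t + k]
        R_rel, t_rel = relative_pose(R0, p0, R_k, p_k)
        chunk.append((R_rel, t_rel, commands[t + k]))
    return chunk


# Hypothetical check: hand rotated 90 degrees about z, then moved 1 unit along world y.
Rz = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
poses = [(Rz, [1, 0, 0]), (Rz, [1, 1, 0])]
chunk = action_chunk(poses, commands=["c0", "c1"], t=0, horizon=1)
```

In this example the world-frame motion along y appears as motion along the hand's own x axis, and the result would be unchanged if both poses were shifted or rotated together, which is the invariance the paragraph above relies on.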

The policy backbone is primarily ACT with a ResNet-18 vision encoder, and policies are trained on 200 demonstrations per task. A parallel evaluation using Diffusion Policy indicates that the RealDexUMI interface is not tied to a particular policy backend.

Experimental Results

Deployable Dexterity and Performance

Evaluation on a real Franka FR3 equipped with the RealDexUMI hand yields an average full-task success rate of 88.75% across eight diverse manipulation tasks. These include cube pick-and-place, plug insertion, precision tool use, long-horizon drawer opening/closing, and bimanual coordination.

Initial-pose robustness experiments show zero failures under significant robot pose variation, substantiating the claim that the hand-frame action parameterization yields intrinsic robustness to deployment configuration.

Ablation Analysis

Removing tactile feedback reduces average success to 70.00%, with the largest impact on tasks where contact is hard to infer from vision alone. Replacing command-based action labels with state-only supervision degrades success further, to 51.25%, highlighting the need for paired (command, state) annotations for contact-aware skill acquisition.

Cross-Embodiment Policy Transfer

A key result is cross-embodiment deployment: the same dexterous policy checkpoint is directly deployable on three distinct robot arms (Franka FR3, RealMan RM65, and PND Adam-U), achieving consistently high success without retraining. Only the base robot’s IK and low-level controller are swapped, leveraging the decoupled, hand-centric action interface.
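One way to picture this decoupling is an adapter interface in which the policy output never mentions the arm. The sketch below is a hypothetical design, not the paper's software: `ArmBackend`, `execute_step`, `FakeArm`, and `hand_send` are illustrative names, and real backends would wrap each robot's own IK and controller behind `move_ee_to`.

```python
from typing import Callable, List, Protocol, Tuple

Mat3 = List[List[float]]
Vec3 = List[float]


class ArmBackend(Protocol):
    """Hypothetical per-robot adapter: the only part assumed to change across arms."""

    def current_ee_pose(self) -> Tuple[Mat3, Vec3]: ...
    def move_ee_to(self, R: Mat3, p: Vec3) -> None: ...


def compose(R_c: Mat3, p_c: Vec3, R_rel: Mat3, t_rel: Vec3) -> Tuple[Mat3, Vec3]:
    """Target pose from current pose and a hand-frame relative action:
    R_t = R_c R_rel, p_t = p_c + R_c t_rel."""
    R_t = [[sum(R_c[i][k] * R_rel[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
    p_t = [p_c[i] + sum(R_c[i][k] * t_rel[k] for k in range(3)) for i in range(3)]
    return R_t, p_t


def execute_step(arm: ArmBackend, hand_send: Callable, R_rel, t_rel, glove_command):
    """Apply one policy action; arm-specific IK and control live behind move_ee_to."""
    R_c, p_c = arm.current_ee_pose()
    R_t, p_t = compose(R_c, p_c, R_rel, t_rel)
    arm.move_ee_to(R_t, p_t)
    hand_send(glove_command)  # the shared hand takes the same command space on every arm
    return R_t, p_t


class FakeArm:
    """Stand-in backend used only to exercise the sketch."""

    def __init__(self, R, p):
        self.R, self.p = R, p

    def current_ee_pose(self):
        return self.R, self.p

    def move_ee_to(self, R, p):
        self.R, self.p = R, p


I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Rz = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
sent = []
arm = FakeArm(Rz, [1, 0, 0])
target = execute_step(arm, sent.append, I3, [1, 0, 0], "close_index")
```

Swapping robots in this sketch means providing another object with the same two methods; the policy checkpoint, the action composition, and the hand command path stay untouched.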

Operator Usability and Teleoperation Complexity

Control efficiency during demonstration collection was benchmarked against AVP-based arm–hand teleoperation and motion-capture glove retargeting. RealDexUMI achieved the highest success rate and the lowest completion times, especially in tasks sensitive to contact precision (e.g., tweezer-based tea picking). The advantage stems from direct, command-space teleoperation, in contrast to high-DoF but non-isomorphic human-to-robot mappings that are prone to losing tactile and action alignment.

Figure 4: Survey instrument used in perceived teleoperation complexity evaluation; interfaces are rated by ease of setup, wearing, and operation.

Perceived teleoperation complexity, assessed with the survey in Figure 4, was rated 'Low' for RealDexUMI, supporting its usability and scalability for broad data collection.

Practical and Theoretical Implications

The RealDexUMI system establishes a precedent for zero-gap dexterous data collection, facilitating deployable dexterous skill learning without the retargeting or post-processing barriers typical in existing approaches. The system robustly addresses the practical necessity of matching data collection and deployment interfaces at the physical and informational level—tactile, observable, and actionable—thus maximizing data utility for imitation learning frameworks.

From a theoretical standpoint, the action–state-coupled data paradigm, realized via the isomorphic glove, could generalize to other domains where tight environment–controller feedback under constraints is essential, potentially impacting haptics, shared autonomy, and nonprehensile interaction policies.

Limitations and Future Directions

The current system prioritizes end-effector alignment; as such, global or egocentric sensing for high-level planning, search, or progress estimation is constrained. Extending the interface to higher-DoF hands, or integrating additional global sensing in a manner consistent with collection–deployment alignment, is an open challenge. Furthermore, while tactile and vision streams are tightly coupled, certain long-horizon, cognitive task elements remain underexplored in this actuation-rich, observation-local regime.

Conclusion

RealDexUMI demonstrates a scalable, practical, and technically rigorous approach for deployable dexterous data collection and policy transfer. By unifying the collection and deployment end-effector modules—and instrumenting the command interface—the system enables robust, intuitive demonstration, preserves critical hand–object contact modalities, and supports cross-embodiment generalization in robot learning pipelines. The methodology is expected to inform future work on universal, robot-agnostic dexterous manipulation interfaces, and may catalyze advances in scalable robot learning from human demonstrations.
