Glove2Robot Policy Pipeline
- Glove2Robot pipelines are end-to-end methodologies that transfer precise human manipulation skills to robots by aligning human demonstrations with robot embodiments.
- They integrate diverse sensor modalities—vision, IMU, and tactile gloves—to achieve robust temporal and spatial alignment of demonstration data.
- Empirical evaluations demonstrate that these pipelines deliver near teleoperation-level performance while reducing hardware cost and bridging the embodiment gap.
The Glove2Robot policy pipeline refers to a family of end-to-end methodologies for transferring fine-grained human manipulation skills to autonomous robot systems via human-hand demonstration, with or without instrumented tactile gloves. These pipelines are characterized by: (i) direct sensing of human hand motion (via vision, IMU, or tactile glove), (ii) systematic pre-processing and alignment to the robot embodiment, (iii) generative or kinematic translation of demonstration data to robot-observable space, (iv) SE(3) and/or joint-state action extraction, and (v) final integration into policy learning frameworks such as conditional diffusion models or behavior cloning. Across recent instantiations, Glove2Robot approaches enable scalable collection of high-fidelity demonstrations suitable for closed-loop policy deployment that matches or approaches performance from direct teleoperation, at reduced hardware cost and complexity (Heng et al., 5 Jul 2025, Yin et al., 9 Dec 2025).
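The five stages above can be read as a simple sequential composition. The minimal Python sketch below is purely illustrative; all stage functions are hypothetical placeholders, not components released with either cited system.

```python
# Minimal sketch of the five Glove2Robot stages as a composable pipeline.
# All stage callables below are hypothetical placeholders; each would be backed
# by the concrete components described in Sections 1-5.
from typing import Any, Callable, List

Stage = Callable[[Any], Any]

def glove2robot_pipeline(stages: List[Stage]) -> Callable[[Any], Any]:
    """Compose capture -> alignment -> translation -> action extraction -> policy data."""
    def run(raw_demonstration: Any) -> Any:
        x = raw_demonstration
        for stage in stages:
            x = stage(x)
        return x
    return run

# Example wiring with identity placeholders, for illustration only.
pipeline = glove2robot_pipeline([
    lambda d: d,  # (i)   sense human hand motion (vision / IMU / tactile glove)
    lambda d: d,  # (ii)  temporal + spatial alignment to the robot embodiment
    lambda d: d,  # (iii) generative or kinematic translation to robot-observable space
    lambda d: d,  # (iv)  SE(3) / joint-state action extraction
    lambda d: d,  # (v)   packaging for diffusion-policy or behavior-cloning training
])
```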
1. Human Demonstration Capture Modalities
Glove2Robot pipelines support a range of demonstration capture modalities. The approach in "RwoR" (Heng et al., 5 Jul 2025) uses a GoPro Hero 9 camera with a Max Lens Mod 1.0 fisheye module, wrist-mounted via a custom 3D-printed adapter and optimized for a 160° field of view that reliably frames both hand and object. The wrist adapter allows tuning of both wrist-axis rotation and translation (12–15 cm from the fingertips), visually mimicking the robot's onboard wrist camera. Pose information is inferred indirectly from camera and IMU, dispensing with the need for an instrumented glove; demonstrations are recorded at 4K resolution and 30 Hz, with a synchronized 200 Hz IMU stream.
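For concreteness, the following dataclass collects the stated capture parameters; the class itself is a hypothetical convenience for bookkeeping, not code from RwoR.

```python
# Illustrative capture configuration mirroring the wrist-camera setup above.
# Field values are transcribed from the text; the dataclass is an assumption.
from dataclasses import dataclass

@dataclass
class WristCameraCapture:
    video_resolution: tuple = (3840, 2160)     # 4K video frames
    video_rate_hz: float = 30.0                # camera frame rate
    imu_rate_hz: float = 200.0                 # synchronized GoPro IMU stream
    fov_deg: float = 160.0                     # fisheye field of view (Max Lens Mod)
    fingertip_offset_cm: tuple = (12.0, 15.0)  # tunable camera-to-fingertip distance

capture = WristCameraCapture()
imu_samples_per_frame = capture.imu_rate_hz / capture.video_rate_hz  # ~6.7 IMU samples per video frame
```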
In contrast, the OSMO pipeline (Yin et al., 9 Dec 2025) employs a custom open-source tactile glove featuring 12 3-axis magnetic-tactile taxels (five on the fingertips and three on the palm), each comprising paired Bosch BMM350 magnetometers and an integrated 6-axis BHI360 IMU, supporting a force range of 0.3 N to 80 N. Sensor boards communicate via I²C to an STM32 microcontroller, sampled at 100 Hz and logged, along with Intel RealSense RGB + stereo IR frames, via ROS 2 at 25 Hz. The tactile glove facilitates direct measurement of continuous shear and normal force, minimizing the embodiment gap and eliminating reliance on force inference through image processing.
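The sketch below illustrates one way the 100 Hz glove stream could be brought onto the 25 Hz camera clock; the array layout and the naive stride-based downsampling are assumptions for illustration, not the OSMO logging format.

```python
# Hypothetical handling of one tactile-glove recording, matching the sensor
# counts and rates described above (12 three-axis taxels at 100 Hz, camera
# frames logged at 25 Hz via ROS 2).
import numpy as np

NUM_TAXELS = 12        # total 3-axis magnetic-tactile taxels on the glove
GLOVE_RATE_HZ = 100.0  # tactile/IMU sampling rate
CAMERA_RATE_HZ = 25.0  # RealSense RGB + stereo IR logging rate

def synchronize_to_camera(tactile_stream: np.ndarray) -> np.ndarray:
    """Downsample a (T, NUM_TAXELS, 3) tactile stream to the 25 Hz camera clock
    by keeping every 4th sample (100 Hz / 25 Hz); a real pipeline would
    interpolate against logged timestamps instead."""
    step = int(GLOVE_RATE_HZ // CAMERA_RATE_HZ)
    return tactile_stream[::step]

tactile = np.zeros((1000, NUM_TAXELS, 3))  # dummy 10 s recording
aligned = synchronize_to_camera(tactile)   # shape (250, 12, 3)
```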
2. Data Pre-processing and Alignment
Both vision- and glove-based pipelines require rigorous pre-processing to resolve temporal and spatial misalignments between human demonstrations and the robot's sensory modalities.
For vision-based approaches (Heng et al., 5 Jul 2025), raw human video frames $I^h_{1:T_h}$ and paired robot gripper demonstration frames $I^r_{1:T_r}$ are recorded separately. Temporal alignment leverages Temporal Cycle-Consistency (TCC) to learn a shared embedding function $\phi(\cdot)$, aligning each human frame to its nearest robot frame in this latent space:

$$ j(i) = \arg\min_{j \in \{1,\dots,T_r\}} \left\| \phi(I^h_i) - \phi(I^r_j) \right\|_2 . $$
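A minimal sketch of this nearest-neighbor alignment, assuming embeddings from a TCC-trained encoder $\phi$ are already available:

```python
# Pair each human frame with its closest robot frame in the shared TCC latent
# space. Embeddings stand in for phi(I^h) and phi(I^r); the encoder itself is
# assumed to be trained separately.
import numpy as np

def align_frames(human_emb: np.ndarray, robot_emb: np.ndarray) -> np.ndarray:
    """human_emb: (T_h, D), robot_emb: (T_r, D) -> indices j(i) of shape (T_h,)."""
    diff = human_emb[:, None, :] - robot_emb[None, :, :]  # (T_h, T_r, D)
    dists = np.linalg.norm(diff, axis=-1)                 # pairwise Euclidean distances
    return dists.argmin(axis=1)                           # nearest robot frame per human frame

# Toy usage with random embeddings in place of real TCC features.
rng = np.random.default_rng(0)
j = align_frames(rng.normal(size=(120, 128)), rng.normal(size=(90, 128)))
```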
Remaining spatial/viewpoint discrepancies are addressed by compositing: foreground masks for the gripper and object (from SAM2) are used to inpaint human background into aligned robot frames, yielding synthesized ground-truth robot images:

$$ \tilde{I}^r_i = M_{j(i)} \odot I^r_{j(i)} + \left(1 - M_{j(i)}\right) \odot I^h_i , $$

where $M_{j(i)}$ is the foreground mask of the aligned robot frame.
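The compositing step can be sketched as a per-pixel blend, assuming the SAM2 foreground mask is given; array conventions here are illustrative assumptions.

```python
# Keep the gripper/object foreground of the aligned robot frame and fill the
# background from the human frame.
import numpy as np

def composite(robot_frame: np.ndarray, human_frame: np.ndarray,
              fg_mask: np.ndarray) -> np.ndarray:
    """robot_frame, human_frame: (H, W, 3) uint8 images;
    fg_mask: (H, W) boolean mask of gripper + object pixels in the robot frame."""
    m = fg_mask[..., None].astype(robot_frame.dtype)  # (H, W, 1) in {0, 1}
    return robot_frame * m + human_frame * (1 - m)    # robot foreground, human background
```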
The tactile-glove pipeline (Yin et al., 9 Dec 2025) processes raw magnetic field vectors by channel-wise differential subtraction (to mitigate cross-talk) and robust percentile-based normalization. Hand pose alignment is solved by first segmenting hands with SAM2, then extracting mesh and wrist pose with HaMeR, and refining wrist height via back-projected IR stereo; trajectories are smoothed with a Savitzky-Golay filter.
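A minimal sketch of this pre-processing, with illustrative (assumed) percentile bounds and Savitzky-Golay window settings rather than the values used in OSMO:

```python
# Channel-wise differential subtraction against a resting baseline,
# percentile-based normalization, and Savitzky-Golay smoothing of the wrist
# trajectory. Percentiles and window sizes are illustrative assumptions.
import numpy as np
from scipy.signal import savgol_filter

def preprocess_tactile(raw: np.ndarray, baseline: np.ndarray,
                       lo_pct: float = 1.0, hi_pct: float = 99.0) -> np.ndarray:
    """raw: (T, C) magnetic-field channels; baseline: (C,) resting reading."""
    diff = raw - baseline                          # channel-wise differential subtraction
    lo = np.percentile(diff, lo_pct, axis=0)
    hi = np.percentile(diff, hi_pct, axis=0)
    return np.clip((diff - lo) / (hi - lo + 1e-8), 0.0, 1.0)  # robust normalization

def smooth_trajectory(wrist_xyz: np.ndarray) -> np.ndarray:
    """wrist_xyz: (T, 3) positions; smooth each axis with a Savitzky-Golay filter."""
    return savgol_filter(wrist_xyz, window_length=11, polyorder=3, axis=0)
```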
3. Human-to-Robot Demonstration Translation
Vision-based Glove2Robot pipelines employ generative models to transform human hand frames into robot-equivalent observations. The RwoR approach (Heng et al., 5 Jul 2025) fine-tunes InstructPix2Pix, a conditional diffusion model based on Stable Diffusion, to map each human observation $I^h_i$ and a text prompt to a robot image $\tilde{I}^r_i$. Training uses 15,000 aligned pairs over 3 epochs, with batch size 4 and a linear noise schedule (1,000 steps). The generative loss is the standard conditional noise-prediction objective:

$$ \mathcal{L} = \mathbb{E}_{z_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}\!\left[ \left\| \epsilon - \epsilon_\theta\!\left(z_t, t, c_{\mathrm{img}}, c_{\mathrm{text}}\right) \right\|_2^2 \right] , $$

where $z_t$ is the noised latent of the target robot image and $c_{\mathrm{img}}$, $c_{\mathrm{text}}$ are the image and text conditions.
At inference, each human observation $I^h_i$ is mapped to a predicted robot frame $\hat{I}^r_i$.
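A minimal PyTorch sketch of this epsilon-prediction objective; `denoiser` stands in for the fine-tuned InstructPix2Pix U-Net, and the schedule endpoints and conditioning interface are assumptions, not the exact RwoR configuration.

```python
# Epsilon-prediction MSE for a conditional latent diffusion model.
import torch
import torch.nn.functional as F

def diffusion_loss(denoiser, z0, cond_image, cond_text, num_steps=1000):
    """z0: (B, C, H, W) latent of the target (synthesized) robot frame."""
    betas = torch.linspace(1e-4, 2e-2, num_steps)        # linear schedule (endpoints assumed)
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, num_steps, (z0.shape[0],))      # random timestep per sample
    a = alphas_cum[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps         # forward noising of the latent
    eps_hat = denoiser(z_t, t, cond_image, cond_text)    # conditional noise prediction
    return F.mse_loss(eps_hat, eps)                      # epsilon-prediction MSE
```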
Tactile-glove pipelines (Yin et al., 9 Dec 2025) bypass this vision-to-robot transfer. By equipping the human demonstrator and the robot with identical sensorized gloves, all tactile and kinematic signals are natively matched across domains, negating the need for generative visual translation or force inference.
4. Action Extraction and Retargeting
A central component of Glove2Robot is extracting SE(3) or joint-state actions corresponding to demonstrator intent.
In wrist-camera pipelines (Heng et al., 5 Jul 2025), ORB-SLAM3 recovers camera poses $T^{\mathrm{world}}_{\mathrm{cam},\,t}$ from the human video, refined via the GoPro IMU. A fixed extrinsic from camera to "fingertip" then yields the end-effector pose:

$$ T^{\mathrm{world}}_{\mathrm{tip},\,t} = T^{\mathrm{world}}_{\mathrm{cam},\,t}\, T^{\mathrm{cam}}_{\mathrm{tip}} . $$

Actions are then incremental SE(3) transforms between consecutive frames:

$$ a_t = \left(T^{\mathrm{world}}_{\mathrm{tip},\,t}\right)^{-1} T^{\mathrm{world}}_{\mathrm{tip},\,t+1} . $$
Gripper open/closed states are binary, detected via foreground mask transitions.
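A minimal sketch of the action extraction, assuming per-frame fingertip poses are already available as 4×4 homogeneous transforms (i.e., after composing SLAM camera poses with the fixed camera-to-fingertip extrinsic):

```python
# Incremental SE(3) actions between consecutive fingertip poses.
import numpy as np

def incremental_se3_actions(T_fingertip: np.ndarray) -> np.ndarray:
    """T_fingertip: (T, 4, 4) world-frame poses -> (T-1, 4, 4) relative actions a_t."""
    actions = []
    for t in range(len(T_fingertip) - 1):
        # a_t = inv(T_t) @ T_{t+1}: the transform taking frame t to frame t+1.
        actions.append(np.linalg.inv(T_fingertip[t]) @ T_fingertip[t + 1])
    return np.stack(actions)
```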
For glove-based pipelines (Yin et al., 9 Dec 2025), Cartesian positions from hand tracking and tactile signals are retargeted through the Mink IK solver (MuJoCo backend) to generate 7-DoF arm + 6-DoF hand joint sequences. Unsafe kinematic states are handled by frame skipping or pose repetition. Both proprioceptive and tactile information are synchronized across the trajectory.
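A schematic version of this retargeting loop is sketched below; `solve_ik` and `is_safe` are hypothetical stand-ins (the real Mink API is not reproduced here), while the frame-skip / pose-repeat handling mirrors the description above.

```python
# Retarget per-frame Cartesian hand-tracking targets to arm + hand joints.
import numpy as np

def retarget(wrist_poses, finger_targets, solve_ik, is_safe):
    """Returns a (T, 13) sequence of 7-DoF arm + 6-DoF hand joint states."""
    joints, last_valid = [], None
    for wrist, fingers in zip(wrist_poses, finger_targets):
        q = solve_ik(wrist, fingers)      # hypothetical IK call (Mink / MuJoCo backend)
        if q is None or not is_safe(q):
            if last_valid is None:
                continue                  # skip frames before the first valid solution
            q = last_valid                # repeat the previous safe pose
        joints.append(q)
        last_valid = q
    return np.stack(joints)
```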
5. Policy Learning and Integration
Resulting demonstration sequences are structured for integration into policy learning frameworks. The vision-based pipeline (Heng et al., 5 Jul 2025) constructs per-timestep tuples of generated robot observation, incremental SE(3) action, and binary gripper state:

$$ \left( \hat{I}^r_t,\; a_t,\; g_t \right), \qquad t = 1, \dots, T . $$
A conditional diffusion policy is then trained following Chi et al. (2023), with a 1,000-step noise schedule, batch size 32, and the Adam optimizer, over 100 epochs. Policy inference is conducted closed-loop with online wrist-camera input, generating SE(3) actions and gripper commands for the robot.
The tactile-glove pipeline (Yin et al., 9 Dec 2025) adopts a conditional diffusion-policy learning framework: image features (DINOv2), proprioceptive vectors, and normalized tactile features are embedded, concatenated, and used to condition a FiLM-adapted U-Net denoiser across 100 DDPM steps. The policy predicts future joint states over a 16-step horizon during training and executes 4 steps per chunk at test time. Tactile and visual signals enter directly as input features, with no separate behavioral-cloning or force loss.
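The chunked receding-horizon execution can be sketched as follows, with `policy` and `env` as hypothetical interfaces:

```python
# Predict a 16-step joint-state horizon, execute only the first 4 steps,
# then re-plan from the new observation.
HORIZON = 16      # prediction horizon used at training time
EXEC_STEPS = 4    # steps executed per chunk at test time

def run_policy(policy, env, num_chunks=50):
    obs = env.reset()
    for _ in range(num_chunks):
        action_chunk = policy.predict(obs)   # shape (HORIZON, action_dim)
        for a in action_chunk[:EXEC_STEPS]:  # execute only the first 4 actions
            obs = env.step(a)
    return obs
```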
6. Empirical Evaluation and Performance
Performance evaluations highlight the capacity of Glove2Robot pipelines to approach or match direct teleoperation performance on real-world manipulation tasks.
- In RwoR (Heng et al., 5 Jul 2025), nine tasks on a Franka Emika R3 with a UMI gripper yield a mean success rate of 0.78, near the 0.82 upper bound from direct-UMI demonstrations and substantially exceeding a rule-based baseline (0.37). Ablations show incremental value from composited inpainting and the full generative pipeline, with PSNR/SSIM improving from 31.5/0.77 (raw) to 33.4/0.84 (inpaint-only) to 33.8/0.86 (full pipeline). Generalization to unseen actions and object instances yields success rates of 0.80–0.87.
- The OSMO pipeline (Yin et al., 9 Dec 2025) reports a 72% ± 27.4% success rate on a contact-rich whiteboard-wiping task, outperforming vision-only (55.8% ± 30.0%) and proprio-only (27.1% ± 32.4%) policies. Qualitative analyses indicate vision-based policies suffer contact failure modes (over/under-pressing, slip), while the tactile-inclusive policy maintains robust contact and pressure.
These results collectively indicate that Glove2Robot pipelines, leveraging either wrist-mounted vision with generative translation or high-fidelity tactile gloves with explicit force conditioning, enable robot policy imitation with minimal hardware overhead and improved robustness to the embodiment gap.
7. Embodiment Gap, Limitations, and Extensions
A fundamental challenge addressed by Glove2Robot pipelines is the visual and sensory “embodiment gap” between human demonstrations and robot execution. The vision-only pipelines require sophisticated timestamp alignment, image compositing, and hand-to-gripper generative translation to reconcile differences, whereas the tactile-glove approach in OSMO (Yin et al., 9 Dec 2025) sidesteps these issues by standardizing hardware/sensing across both domains.
Implications include:
- Settings devoid of tactile instrumentation require careful generative and pre-processing components to align human and robot domains; in contrast, demonstration collection with matched tactile gloves simplifies policy transfer and improves contact-relevant task performance.
- The selection of policy learning paradigms (diffusion policies, conditional imitation learning) is directly informed by the structure and alignment of the demonstration data.
- Tasks that critically depend on force feedback or contact are particularly sensitive to embodiment-gap minimization; OSMO data demonstrates performance improvements in such domains.
A plausible implication is that further generalization and scalability of the Glove2Robot paradigm may depend on improvements in hand tracking accuracy, tactile-glove cost reduction, and policy architectures robust to diverse demonstration domains.
References:
- "RwoR: Generating Robot Demonstrations from Human Hand Collection for Policy Learning without Robot" (Heng et al., 5 Jul 2025).
- "OSMO: Open-Source Tactile Glove for Human-to-Robot Skill Transfer" (Yin et al., 9 Dec 2025).