Custom Hoi! Gripper: Handheld UMI Innovation
- Custom Hoi! Gripper is a handheld UMI gripper that employs both granular jamming and parallel-jaw designs to adapt to diverse object shapes.
- It integrates multimodal sensors, including visual, inertial, and tactile inputs, for precise real-time pose estimation and force feedback.
- Its open-source, modular design accelerates data collection and cross-embodiment learning, driving advances in human-in-the-loop robotic manipulation.
A handheld UMI gripper, or Universal Manipulation Interface gripper, is a portable, operator-actuated device for universal object acquisition that bridges human-in-the-loop demonstration and robotic execution. It achieves object grasping and manipulation for a broad array of shapes and surfaces either through mechanisms inspired by parallel-jaw robotics or by exploiting physical principles such as granular jamming. This term also encompasses a family of research tools used for data collection in cross-embodiment robot learning and force-grounded manipulation research.
1. Core Mechanical and Actuation Principles
Handheld UMI grippers are primarily realized in two variants: (1) jammed granular-material-based universal grippers, and (2) parallel-jaw, single-DOF antipodal devices. In both, the essential hardware goal is to provide a low-cost, lightweight, easy-to-operate end-effector that can be used interchangeably by a human or a robot.
1.1 Jammed Granular Grippers
A flexible membrane filled with loose grains (50–200 μm; e.g., glass beads, coffee, or polymer microspheres) is pressed onto an object. The granular medium exhibits a rapid transition from a fluid-like (unjammed) state to a solid-like (jammed) state when a mild vacuum (ΔP ≈ 30–80 kPa) is applied across the membrane, causing a <0.5% contraction in volume. This transition enables the gripper to conform to complex geometries, then rigidify and immobilize the object for lifting or manipulation. Gripping may result from frictional contact, airtight suction, or geometric interlocking, with relative contributions dependent on object shape and surface compliance. The device may be actuated by a compact electric or manual vacuum pump, with a simple valve for unjamming, and membrane thickness is tuned (0.2–0.5 mm) for optimal flexibility and abrasion resistance (Brown et al., 2010).
1.2 Parallel-Jaw and Three-Fingered Variants
The canonical mechanical design is a two-finger parallel jaw:
- Two rigid links (Lf ≈ 60 mm) mounted on a drive carriage (lead-screw or belt drive), offering a single prismatic DoF controlling jaw width q (0 ≤ q ≤ 30 mm).
- Symmetrical actuation ensures antipodal, parallel grasping.
- Some variants employ three-jaw configurations with two compliant (e.g., TPU) jaws and one rigid jaw for reduced actuation force, with a single linear actuator for aperture control (Rayyan et al., 23 Sep 2025, Engelbracht et al., 4 Dec 2025).
- Modifications include embedding tactile sensors (e.g., GelSight Mini), marker mounts, or extended flanges to facilitate pose tracking or tactile data collection (Helmut et al., 15 Oct 2025, San-Miguel-Tello et al., 11 Jun 2025).
Actuation is typically manual during demonstration (hand-squeezed or thumb-lever), or via miniaturized electric or belt/pulley drives when robot-mounted, with identical end-effector kinematics for cross-embodiment learning (Rayyan et al., 23 Sep 2025, San-Miguel-Tello et al., 11 Jun 2025).
2. Multimodal Sensing and Pose Estimation
Handheld UMI grippers are typically paired with a suite of sensors to record the context, pose, and dynamic interactions during demonstration:
- Visual sensing: Rigidly mounted first-person (FPV) cameras (e.g., GoPro Hero 9/10) capture RGB video (≈30 Hz, ≈120° FOV) directly from the gripper or operator’s wrist (Rayyan et al., 23 Sep 2025, Engelbracht et al., 4 Dec 2025).
- IMUs: Integrated or externally affixed IMUs (e.g., Bosch BNO055) provide 9-axis orientation/acceleration data at ≈200 Hz, which is fused with visual tracking to estimate real-time 6-DoF gripper pose.
- Marker systems: ArUco or AprilTag markers strategically positioned on the gripper (fingertips, lateral flanges) enable external localization from static scene cameras; typical pose accuracy is ±5 mm, orientation ±0.5° (San-Miguel-Tello et al., 11 Jun 2025).
- Tactile/force sensors: Some variants integrate high-resolution tactile sensors (e.g., GelSight Mini), supplying per-pixel normal/shear stress, summing to a total inferred grip force for closed-loop control (Helmut et al., 15 Oct 2025).
- Multi-view: Dual-camera setups (FPV plus third-person RealSense or smartphone cameras) expand the observable workspace, overcome occlusions, and support multi-embodiment learning (Rayyan et al., 23 Sep 2025, Engelbracht et al., 4 Dec 2025).
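The tactile bullet above describes summing per-pixel stress into a total grip force. A minimal sketch of that integration follows; the array shape, the pixel area, and the function name `total_grip_force` are illustrative assumptions and do not reflect the GelSight software interface.

```python
import numpy as np

def total_grip_force(normal_stress_kpa: np.ndarray, pixel_area_mm2: float) -> float:
    """Integrate a per-pixel normal-stress map (kPa) into a total normal force (N)."""
    # 1 kPa acting on 1 mm^2 equals 1e3 Pa * 1e-6 m^2 = 1e-3 N.
    return float(np.sum(normal_stress_kpa) * pixel_area_mm2 * 1e-3)

# Hypothetical stress map: uniform 2 kPa over a 240 x 320 pixel contact image.
stress = np.full((240, 320), 2.0)
force = total_grip_force(stress, pixel_area_mm2=0.004)
print(f"inferred grip force: {force:.4f} N")
```

In a closed-loop setting, this scalar could serve as the feedback signal for a grip-force controller; shear stress would be integrated separately per axis in the same way.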
Pose fusion is accomplished with extended Kalman filter (EKF) pipelines combining IMU and external visual marker data, yielding sub-centimeter tracking accuracy, low drift, and robust operation in challenging field conditions (San-Miguel-Tello et al., 11 Jun 2025).
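To illustrate the predict/update structure of such a fusion pipeline, the sketch below uses a deliberately simplified one-dimensional linear Kalman filter rather than a full 6-DoF EKF. IMU acceleration drives the prediction step and marker-based position drives the correction step. The class name, noise values, and the 20 Hz marker rate are assumptions chosen for illustration; the marker variance corresponds to the ±5 mm figure quoted above.

```python
import numpy as np

class PoseFilter1D:
    """Linear Kalman filter over [position (m), velocity (m/s)] along one axis."""

    def __init__(self, accel_var: float = 0.5, marker_var: float = 25e-6):
        self.x = np.zeros(2)                 # state estimate
        self.P = np.eye(2)                   # state covariance
        self.accel_var = accel_var           # assumed IMU acceleration noise
        self.R = np.array([[marker_var]])    # (5 mm)^2 marker noise
        self.H = np.array([[1.0, 0.0]])      # markers observe position only

    def predict(self, accel: float, dt: float) -> None:
        """Propagate the state using an IMU acceleration sample."""
        F = np.array([[1.0, dt], [0.0, 1.0]])
        B = np.array([0.5 * dt**2, dt])
        self.x = F @ self.x + B * accel
        Q = np.outer(B, B) * self.accel_var
        self.P = F @ self.P @ F.T + Q

    def update(self, marker_pos: float) -> None:
        """Correct the state using a marker-derived position measurement."""
        y = np.array([marker_pos]) - self.H @ self.x      # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)          # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P

# Stationary gripper at 0.10 m: IMU at 200 Hz, marker fix every 10th sample.
kf = PoseFilter1D()
for i in range(200):
    kf.predict(accel=0.0, dt=1.0 / 200.0)
    if i % 10 == 0:
        kf.update(marker_pos=0.10)
print(kf.x)
```

A real pipeline would extend the state to 3D position and orientation and linearize the nonlinear rotation dynamics, which is what makes the filter "extended".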
3. Grasping Mechanisms, Kinematic Models, and Mechanical Models
3.1 Grasping Force Models
For granular jamming grippers, the grasp is decomposed into three mechanisms (Brown et al., 2010):
- Frictional grip: For objects with surface slope θ > θ_c = arctan(1/μ), normal pinching stress σ_y, itself proportional to vacuum ΔP (σ_y ≈ αΔP with α∼0.5–1.0), creates frictional resistance F_f = μσ_yA_c.
- Suction grip: If an airtight patch A_p is formed, vacuum generates suction force F_s = ΔP_gA_p, with ΔP_g ≤ σ_y.
- Geometric interlocking: For wrap-around >90°, resistance to removal is governed by either shell bending (small overwrap) or shell yield (large overwrap): F_i,max ∼ σ_fRt.
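The friction and suction terms above can be evaluated directly. The following sketch codes those two relations as written; the numerical inputs are hypothetical, and the interlocking term is omitted because only a scaling relation is given for it.

```python
import math

def critical_angle_deg(mu: float) -> float:
    """Critical surface slope theta_c = arctan(1/mu), in degrees."""
    return math.degrees(math.atan(1.0 / mu))

def frictional_force(mu: float, alpha: float, delta_p_pa: float, contact_area_m2: float) -> float:
    """F_f = mu * sigma_y * A_c, with sigma_y = alpha * delta_P."""
    sigma_y = alpha * delta_p_pa
    return mu * sigma_y * contact_area_m2

def suction_force(delta_p_g_pa: float, patch_area_m2: float, sigma_y_pa: float) -> float:
    """F_s = delta_P_g * A_p, enforcing the bound delta_P_g <= sigma_y."""
    return min(delta_p_g_pa, sigma_y_pa) * patch_area_m2

# Hypothetical example: 50 kPa vacuum, alpha = 0.5, mu = 0.5.
sigma_y = 0.5 * 50e3                                   # 25 kPa pinching stress
f_f = frictional_force(0.5, 0.5, 50e3, 2e-4)           # 2 cm^2 contact area
f_s = suction_force(30e3, 1e-4, sigma_y)               # 1 cm^2 sealed patch
print(f"friction: {f_f:.2f} N, suction: {f_s:.2f} N")
```

Here the requested gap pressure of 30 kPa exceeds σ_y, so the suction term is capped at 25 kPa, reflecting the stated constraint ΔP_g ≤ σ_y.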
3.2 Rigid-Jaw Kinematics
A standard parallel-jaw UMI gripper with a single prismatic DoF is parameterized by the half-width opening q and the finger length L_f (Engelbracht et al., 4 Dec 2025). Because the jaws translate symmetrically along a single axis, the mapping from q to fingertip position is a direct translation, which enables simple, easily retargetable demonstrations.
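A minimal sketch of what such a single-DoF model might look like is given below. It is an assumed form, not the cited formulation: fingertips are placed symmetrically at ±q along the gripper x-axis and at L_f along the approach (z) axis, then mapped into the world frame by the gripper pose. The axis convention, the function name, and the use of 30 mm as the limit on q are all assumptions.

```python
import numpy as np

Q_MAX_M = 0.030  # assumed opening limit, borrowed from the 0-30 mm range quoted earlier

def fingertip_positions(q_m, finger_length_m, rotation=None, translation=None):
    """Return (left, right) fingertip positions in the world frame.

    q_m is treated as the half-width opening; rotation (3x3) and translation (3,)
    give the gripper pose in the world frame and default to identity.
    """
    rotation = np.eye(3) if rotation is None else np.asarray(rotation, dtype=float)
    translation = np.zeros(3) if translation is None else np.asarray(translation, dtype=float)
    q = float(np.clip(q_m, 0.0, Q_MAX_M))
    left_local = np.array([+q, 0.0, finger_length_m])
    right_local = np.array([-q, 0.0, finger_length_m])
    return rotation @ left_local + translation, rotation @ right_local + translation

left, right = fingertip_positions(0.020, 0.060)
aperture = float(np.linalg.norm(left - right))  # full jaw separation, 2q
print(left, right, aperture)
```

Because the model has a single scalar input, retargeting a demonstration to a robot-mounted gripper of the same geometry reduces to replaying the recorded pose and q.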
4. Interface Design, Control, and Ergonomics
Handheld UMI grippers use ergonomic enclosures (e.g., pistol-grip mount) weighing <300 g (Rayyan et al., 23 Sep 2025), with thumb levers, triggers, or servo-driven mechanisms for jaw actuation (San-Miguel-Tello et al., 11 Jun 2025). Control is typically open-loop during demonstration, relying on the operator's natural proprioception or finger resistance. Tactile or force feedback, if present, is recorded for later use in tactile-conditioned policy learning (Helmut et al., 15 Oct 2025).
Independent motor/servo drives ensure reliable opening and closure; early designs use MG90S analog micro-servos or DYNAMIXEL XL430-W250-T motors for precise control and compatibility with robotic mounting (San-Miguel-Tello et al., 11 Jun 2025, Helmut et al., 15 Oct 2025). CAD files and control electronics are conventionally released open-source (Helmut et al., 15 Oct 2025).
5. Data Collection, Learning Frameworks, and Cross-Embodiment Usage
Handheld UMI grippers are essential tools for robot learning from demonstration (LfD), supporting various data modalities (RGB, IMU, pose, tactile):
- Multi-view policies: In frameworks such as MV-UMI, synchronized multi-view perception (FPV and third-person) is used to close the context gap between human demonstration and robot deployment. SAM-2 segmentation and inpainting remove embodiment cues from third-person frames. Diffusion policy architectures fuse joint visual features and output short-horizon action sequences (pose, jaw width) for direct robot playback. This improves performance on context-heavy manipulation tasks by about 47% compared to single-view baselines (Rayyan et al., 23 Sep 2025).
- Tactile-conditioned policies: Integrated tactile sensors (e.g., GelSight Mini) provide high-frequency, force-resolved data, enabling force-aware action policies (grip width, grip force) via networks like FARM, which jointly predict and regulate tactile-grounded actions (Helmut et al., 15 Oct 2025).
- EKF fusion for tracking: In application domains such as agricultural LfD, multimodal data is fused via EKF to reliably recover gripper pose and orientation, substantially reducing sample idle times and operator cognitive load (San-Miguel-Tello et al., 11 Jun 2025).
- Cross-embodiment alignment: In datasets like Hoi!, the UMI gripper is used alongside diverse sensor platforms (egocentric, exocentric, wrist-mounted cameras), with temporal alignment via QR codes and spatial registration to LiDAR scene scans. This supports precise benchmarking of articulation understanding, vision-force prediction, and policy retargeting across embodiments (Engelbracht et al., 4 Dec 2025).
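Temporal alignment of streams recorded at different rates, as in the last bullet, usually ends with a nearest-timestamp match once a common clock offset is known. The sketch below shows that final matching step only; the `offset_s` parameter stands in for whatever offset a QR-code synchronization step would produce, and the function name and sample timestamps are hypothetical.

```python
import numpy as np

def nearest_indices(ref_t, query_t, offset_s: float = 0.0) -> np.ndarray:
    """For each query timestamp, return the index of the nearest reference timestamp.

    ref_t must be sorted ascending and contain at least two entries.
    offset_s is added to query_t to bring both streams onto a common clock.
    """
    ref_t = np.asarray(ref_t, dtype=float)
    query_t = np.asarray(query_t, dtype=float) + offset_s
    idx = np.searchsorted(ref_t, query_t)          # insertion points
    idx = np.clip(idx, 1, len(ref_t) - 1)          # keep idx-1 and idx valid
    left, right = ref_t[idx - 1], ref_t[idx]
    # Step back one index where the left neighbour is strictly closer.
    return np.where((query_t - left) < (right - query_t), idx - 1, idx)

camera_t = [0.000, 0.033, 0.066, 0.100]   # ~30 Hz video frames
sensor_t = [0.010, 0.050, 0.090]          # samples from another stream
matches = nearest_indices(camera_t, sensor_t)
print(matches.tolist())
```

In practice a maximum allowed time gap would also be enforced, so that samples with no sufficiently close frame are dropped rather than mismatched.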
6. Performance Metrics, Application Domains, and Limitations
6.1 Quantitative Metrics
- Gripping force (granular UMI): 10–50 N/cm² hold force per 10 kPa vacuum, payload/gripper weight ratio up to 50× (Brown et al., 2010).
- Tracking accuracy: EKF-fused pipelines achieve 15 mm position RMSE, 2.8° orientation RMSE, 80% idle time reduction vs. visual SLAM alone (San-Miguel-Tello et al., 11 Jun 2025).
- Task success rate: MV-UMI achieves ~87% average success in context-heavy tasks, comparative single-view UMI ~40% (Rayyan et al., 23 Sep 2025).
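For reference, the tracking and task metrics listed above are typically computed as follows. This is a generic sketch with made-up sample data, not the evaluation code of the cited works.

```python
import numpy as np

def position_rmse(estimated, ground_truth) -> float:
    """Root-mean-square of the Euclidean position error over a trajectory (N x 3)."""
    err = np.linalg.norm(np.asarray(estimated) - np.asarray(ground_truth), axis=1)
    return float(np.sqrt(np.mean(err**2)))

def success_rate(outcomes) -> float:
    """Fraction of successful trials, given a sequence of 0/1 outcomes."""
    return float(np.mean(outcomes))

# Hypothetical data: every estimate is off by (3 mm, 4 mm, 0), i.e. 5 mm in norm.
gt = np.zeros((50, 3))
est = gt + np.array([0.003, 0.004, 0.0])
rmse = position_rmse(est, gt)
rate = success_rate([1, 1, 0, 1])
print(f"RMSE = {rmse * 1000:.1f} mm, success = {rate:.0%}")
```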
6.2 Use Cases
- Rapid demonstration collection for robot learning in object manipulation, articulated scene interaction (drawers, doors), fruit harvesting, and dexterous pick-and-place (Brown et al., 2010, San-Miguel-Tello et al., 11 Jun 2025, Engelbracht et al., 4 Dec 2025).
- General-purpose grasping without sensing: granular UMI grippers are empirically robust to ±10 mm misalignment, support "any shape" manipulation, and typically need 0.2–1 s for pump-down and <0.1 s for release (Brown et al., 2010).
6.3 Limitations
- The standard UMI gripper has no inherent sensing: it cannot record force/torque or high-resolution tactile data without a retrofit (Engelbracht et al., 4 Dec 2025).
- Granular grippers are limited on soft or porous objects (no seal can form) and on extreme geometric interlocks, which require a multi-bag design (Brown et al., 2010).
- In-field pose estimation is subject to marker occlusion and ambient lighting changes; tactile options increase complexity and mass (San-Miguel-Tello et al., 11 Jun 2025, Helmut et al., 15 Oct 2025).
7. Open Research Directions and Advanced Variants
The modularity of the handheld UMI gripper admits diverse extensions:
- Soft, multi-modal grippers (DexGrip-based): Incorporation of actively actuated suction-cups and belt-driven rolling contact surfaces yields in-hand rotation and re-positioning capacity (e.g., 360° in all axes, continuous re-orientation without regrasp), at the cost of added mass and reduced payload in handheld form (≤ 500 g, payload ~80–100 g) (Wang et al., 26 Nov 2024).
- Force-aware policies: Learning frameworks that exploit tactile-conditioned actions mark a departure from purely kinematic reproduction, enabling robust manipulation of fragile or deformable objects under force control (Helmut et al., 15 Oct 2025).
- Cross-embodiment datasets: Synchronized recordings across operator, gripper, and scene (Hoi!) foster robust evaluation of transferable policies and force-visual inference techniques (Engelbracht et al., 4 Dec 2025).
A plausible implication is that future handheld UMI design will increasingly integrate compact tactile and force-sensing, high-fidelity multi-modal tracking, and actuation schemes enabling true in-hand dexterity at handheld scale. Open-source hardware and software releases have accelerated this trajectory, standardizing the UMI gripper as a baseline research tool for robot learning from human demonstration.
References:
- Brown et al., 2010
- Wang et al., 26 Nov 2024
- San-Miguel-Tello et al., 11 Jun 2025
- Rayyan et al., 23 Sep 2025
- Helmut et al., 15 Oct 2025
- Engelbracht et al., 4 Dec 2025