Hoi! Gripper: Parallel-Jaw Robotic End-Effector
- Hoi! Gripper is a research-grade parallel-jaw robotic end-effector designed for collecting multimodal datasets of articulated object manipulation.
- It integrates high-fidelity force-torque, tactile imaging, visual, and inertial sensors to capture synchronized, spatially-aligned data for advanced control tasks.
- Its robust mechanical design, calibrated sensing suite, and kinematic modeling enable cross-embodiment analysis and facilitate research in articulated manipulation.
Hoi! Gripper
The Hoi! Gripper is a research-grade, custom-built parallel-jaw robotic end-effector designed for articulated manipulation in cross-embodiment, multimodal interaction datasets. Developed for the "Hoi!" dataset, its design is characterized by a rigid, two-finger parallel mechanism, high-fidelity force-torque sensing, high-resolution tactile imaging, and full integration with extrinsic and egocentric vision and inertial measurement systems. The device is engineered to bridge human demonstration and robot-grade sensing, enabling tightly synchronized, spatially-aligned visual, force, and tactile data capture for real-world object interaction (Engelbracht et al., 4 Dec 2025).
1. Mechanical Design and Actuation
The Hoi! Gripper employs a classic antipodal parallel-jaw architecture with the following construction:
- Fingers and Joints: Two opposing rigid fingers, each affixed to a single translation stage, enable parallel closure/opening with no additional joints or underactuated couplings. Motion is confined to a single axis, the finger gap axis.
- Actuation: The gripper is driven by a Dynamixel XM430-W350-T servo, which incorporates an internal planetary gearbox for high output torque at low velocities. The actuation system is mounted on a rigid handle so that the gripper can be used in human demonstration scenarios, and a handle-integrated load cell converts the operator's pull force into a reference gripper current.
- Materials and Construction: Finger bodies and the main chassis are CNC-milled from aluminum for rigidity and minimal mass. GelSight Digit tactile sensors are housed in 3D-printed mounts at each fingertip. The complete package (gripper, force-torque sensor, vision, compute) is engineered for portability and can be deployed fully untethered, with the electronics and power housed in a wearable backpack (Engelbracht et al., 4 Dec 2025).
2. Multimodal Sensing Suite
The gripper delivers synchronized, co-located, and highly calibrated signals necessary for advanced manipulation research:
- 6-DoF Force–Torque Sensing: A Bota SensONE sensor in the wrist measures Cartesian forces and torques in the sensor frame S. The sensor has a 100 N force range and a 10 N·m torque range, with resolutions of 0.01 N and 0.001 N·m, and is sampled at 1 kHz.
- High-Resolution Fingertip Tactile Imaging: Dual GelSight Digit sensors at the fingertips acquire grayscale tactile images at 30 Hz with approximately 0.07 mm/pixel spatial detail. Output includes both calibrated 3D point clouds and raw images under controlled illumination.
- Operator Load Measurement: The handle-embedded load cell measures the operator's pull force, which is converted into a gripper actuation force by applying an offset and a scale factor.
- Motor Current/Torque Telemetry: The Dynamixel servo reports its current $I$. Estimated motor-side torque is modeled as $\tau_m = k_1 I + k_2$, with $k_1 = 1.769\,\mathrm{N\cdot m/A}$ and $k_2 = -0.2214\,\mathrm{N\cdot m}$.
- Visual and Inertial Sensing: A wrist-mounted ZED Mini stereo camera provides RGB-D, while Project Aria AR glasses supply egocentric RGB (30–60 Hz), real-time pose, and IMU data at 200 Hz. These modalities are tightly synchronized for cross-view analysis.
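The current-to-torque model above can be sketched in a few lines of Python. The units of the two constants (N·m/A and N·m) suggest an affine map from current to torque, and the sketch assumes that form; the function and constant names are illustrative and do not come from the gripper's software.

```python
# Constants quoted in the text; the affine form tau = k1 * I + k2 is an
# assumption inferred from their units.
K1_NM_PER_A = 1.769   # slope, N*m per ampere
K2_NM = -0.2214       # offset, N*m


def motor_torque_from_current(current_a: float) -> float:
    """Estimate motor-side torque (N*m) from servo current (A)."""
    return K1_NM_PER_A * current_a + K2_NM


if __name__ == "__main__":
    # Hypothetical current readings, for illustration only.
    for current_a in (0.25, 0.5, 1.0):
        torque = motor_torque_from_current(current_a)
        print(f"I = {current_a:.2f} A -> tau = {torque:.4f} N*m")
```

Because the offset is negative, the model returns a negative torque for very small currents, so a real pipeline would likely need a dead-band or clamping policy near zero; the text does not specify one.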
3. Signal Processing, Synchronization, and Calibration
- Force–Torque Signal Correction: Raw F/T data are compensated for gravity and bias, both calibrated from “no load” recording sessions. The correction uses the mass of the gripper assembly, the transformation between coordinate frames, and the known sensor biases, leaving residuals of 0.1 N and 0.01 N·m in static conditions.
- Temporal and Spatial Alignment:
  - Temporal: All camera streams and the robot’s proprioceptive signals are aligned via a 25 Hz QR code overlay encoding precise Unix time, achieving inter-stream accuracy of 10–25 ms (95% CI).
  - Spatial: High-resolution 3D scans (Leica RTC360) provide the environmental world frame. Camera and gripper trajectories (from SLAM or motion capture) are registered to this point cloud using hierarchical 2D–3D matching, achieving millimeter-level accuracy.
- Frame Transformations for Contact Modeling: Contact forces for learning tasks are projected into local interaction frames at each fingertip (aligned with the tactile sensors) using rigid transformations computed from the known kinematics and sensor mounting geometry.
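The gravity and bias correction described above can be illustrated with a minimal sketch. It assumes the usual rigid-payload model, in which the measured wrench is the external wrench plus the payload weight (and its moment about the sensor origin) plus a constant bias. The z-up world frame, the center-of-mass vector, and all function and parameter names are assumptions made for illustration, not details from the source.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]

GRAVITY_WORLD: Vec3 = (0.0, 0.0, -9.81)  # m/s^2, assumes a z-up world frame


def mat_vec(rot: Mat3, v: Vec3) -> Vec3:
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(rot[i][j] * v[j] for j in range(3)) for i in range(3))


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Cross product a x b."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def compensate_wrench(force_meas: Vec3, torque_meas: Vec3,
                      rot_sensor_from_world: Mat3, mass_kg: float,
                      com_in_sensor: Vec3, force_bias: Vec3,
                      torque_bias: Vec3) -> Tuple[Vec3, Vec3]:
    """Remove payload weight and constant bias from a sensor-frame wrench."""
    # Payload weight expressed in the sensor frame.
    weight = tuple(mass_kg * g
                   for g in mat_vec(rot_sensor_from_world, GRAVITY_WORLD))
    # Moment of that weight about the sensor origin.
    gravity_torque = cross(com_in_sensor, weight)
    force = tuple(f - w - b
                  for f, w, b in zip(force_meas, weight, force_bias))
    torque = tuple(t - g - b
                   for t, g, b in zip(torque_meas, gravity_torque, torque_bias))
    return force, torque


if __name__ == "__main__":
    identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    # Hypothetical no-load reading: payload weight plus a small bias.
    force, torque = compensate_wrench(
        (0.2, 0.0, -4.905), (0.0, 0.4905, 0.01), identity,
        mass_kg=0.5, com_in_sensor=(0.1, 0.0, 0.0),
        force_bias=(0.2, 0.0, 0.0), torque_bias=(0.0, 0.0, 0.01))
    print("residual force:", force)
    print("residual torque:", torque)
```

In a no-load session the compensated wrench should be close to zero, so any remaining value is the residual that the text reports as 0.1 N and 0.01 N·m.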
4. Kinematic and Dynamic Modeling
- Gripper Kinematics: The single-DoF translation stage is modeled through a Jacobian $J(\theta)$, a function of the finger gap angle $\theta$, which together with an empirically fit efficiency factor $\eta$ relates input motor torque to output fingertip force. The angle $\theta$ is read from the servo’s encoder, and the motor torque is estimated from current feedback.
- Force Decomposition: For analysis, gripper-level external forces are rotated into the local interaction frame at the tactile sensor, using the rotation between the gripper frame and that interaction frame.
- Performance Characterization:
  - Force–torque residuals: 0.1 N in force and 0.01 N·m in torque under no load, after compensation.
  - Trajectory pose accuracy: errors up to 0.006 m and 0.016 rad (aligned to motion capture).
  - Tactile prediction RMSE: 3.9 N (across six environments); visual-force RMSE: 2.2–2.6 N. These errors are an order of magnitude higher than on tabletop benchmarks, which reflects real-world complexity.
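The torque-to-force mapping and the force rotation in this section can be sketched as follows. The source does not preserve the exact equations, so the sketch assumes a common single-DoF form in which the Jacobian is the finger travel per unit servo angle (m/rad), giving a force of efficiency times torque divided by the Jacobian. The crank-style Jacobian, the efficiency value, and every name in the code are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def fingertip_force(motor_torque_nm: float, theta_rad: float,
                    jacobian: Callable[[float], float],
                    efficiency: float) -> float:
    """Map motor torque to fingertip force for a single-DoF stage.

    Assumes jacobian(theta) = dx/dtheta in m/rad, so that an ideal stage
    satisfies torque * dtheta = force * dx; efficiency scales the result.
    """
    j = jacobian(theta_rad)
    if abs(j) < 1e-9:
        raise ValueError("Jacobian is near zero; force is not well defined")
    return efficiency * motor_torque_nm / j


def rotate_into_frame(rot_local_from_gripper: Sequence[Sequence[float]],
                      force_gripper: Vec3) -> Vec3:
    """Express a gripper-frame force in a local interaction frame."""
    return tuple(
        sum(rot_local_from_gripper[i][k] * force_gripper[k] for k in range(3))
        for i in range(3))


def crank_jacobian(theta_rad: float) -> float:
    """Hypothetical Jacobian: a 2 cm crank, x = L sin(theta)."""
    return 0.02 * math.cos(theta_rad)


if __name__ == "__main__":
    # Illustrative values only; none are measured gripper parameters.
    force_n = fingertip_force(0.5, 0.0, crank_jacobian, efficiency=0.8)
    print(f"fingertip force: {force_n:.2f} N")
    # Interaction frame rotated 90 degrees about z relative to the gripper.
    rot = ((0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
    print("local force:", rotate_into_frame(rot, (force_n, 0.0, 0.0)))
```

Passing the Jacobian as a function keeps the sketch independent of the actual linkage geometry, which would have to come from the gripper's CAD model or a calibration fit.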
5. Integration with Hoi! Dataset and Benchmarking
The Hoi! Gripper constitutes the primary tool embodiment for robot-grade data within the Hoi! dataset (3048 sequences, 381 object parts, 38 environments). Each manipulation episode includes:
- Exocentric and egocentric RGB-D video streams
- Wrist-view RGB-D
- End-effector 6-DoF force-torque
- Fingertip tactile images (2× GelSight)
- Motor current, finger force, and joint torque
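One way to picture such an episode is as a set of named, timestamped streams that are queried at a common time. The sketch below is a hypothetical in-memory layout, not the dataset's actual file format or loader; the stream names and the 25 ms skew tolerance (borrowed from the upper end of the synchronization accuracy quoted earlier) are assumptions.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Stream:
    """One sensor modality: sorted Unix timestamps (s) and matching samples."""
    timestamps: List[float]
    samples: List[Any]

    def nearest(self, t: float) -> Tuple[float, Any]:
        """Return the (timestamp, sample) pair closest in time to t."""
        if not self.timestamps:
            raise ValueError("empty stream")
        i = bisect_left(self.timestamps, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self.timestamps)]
        best = min(candidates, key=lambda j: abs(self.timestamps[j] - t))
        return self.timestamps[best], self.samples[best]


@dataclass
class Episode:
    """Hypothetical container for one manipulation episode."""
    streams: Dict[str, Stream] = field(default_factory=dict)

    def snapshot(self, t: float, max_skew_s: float = 0.025) -> Dict[str, Any]:
        """Collect the nearest sample per stream, dropping stale ones."""
        out: Dict[str, Any] = {}
        for name, stream in self.streams.items():
            ts, sample = stream.nearest(t)
            if abs(ts - t) <= max_skew_s:
                out[name] = sample
        return out


if __name__ == "__main__":
    episode = Episode({
        "tactile_left": Stream([0.0, 0.033, 0.066], ["img0", "img1", "img2"]),
        "wrench": Stream([0.0, 0.5], ["w0", "w1"]),
    })
    print(episode.snapshot(0.04))
```

Nearest-timestamp lookup is the simplest alignment policy; interpolation would suit the 1 kHz force-torque stream better, but it does not apply to images.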
All data are aligned in space and time to a shared world frame, permitting:
- Cross-view studies of physical contact and force from multiple visual perspectives
- Analysis of embodiment transfer, e.g., visual-to-force prediction from human or gripper demonstration
- Benchmarks for articulation estimation, tactile-force regression, vision-force affordance prediction, and unified vision-touch-force learning for articulated object manipulation
6. Research Impact and Applications
The Hoi! Gripper establishes a new standard in the collection of force-grounded manipulation data for articulated objects in real-world environments. By enabling synchronized visual, tactile, and force data from a parallel-jaw end-effector with robot-like mechanics, it provides:
- A consistent physical interface for comparing human, teleoperated, and autonomous manipulation
- A platform for benchmarking multimodal perception (vision, touch, force), visuo-tactile policy learning, and cross-embodiment transfer
- Datasets supporting the training and evaluation of advanced control and learning algorithms for generalizable manipulation, especially in domains—such as furniture assembly or tool use—where cross-modal and cross-view reasoning is critical
Data collected with the gripper support studies on tactile-force regression, vision-tactile fusion via neural architectures, and policies for robust manipulation of articulated household objects (Engelbracht et al., 4 Dec 2025).