Hoi! Gripper: Parallel-Jaw Robotic End-Effector

Updated 4 March 2026

Hoi! Gripper is a research-grade parallel-jaw robotic end-effector designed for collecting multimodal datasets of articulated object manipulation.
It integrates high-fidelity force-torque, tactile imaging, visual, and inertial sensors to capture synchronized, spatially-aligned data for advanced control tasks.
Its robust mechanical design, calibrated sensing suite, and kinematic modeling enable cross-embodiment analysis and facilitate research in articulated manipulation.

The Hoi! Gripper is a research-grade, custom-built parallel-jaw robotic end-effector designed for articulated manipulation in cross-embodiment, multimodal interaction datasets. Developed for the "Hoi!" dataset, it combines a rigid, two-finger parallel mechanism, high-fidelity force-torque sensing, high-resolution tactile imaging, and full integration with exocentric and egocentric vision and inertial measurement systems. The device is engineered to bridge human demonstration and robot-grade sensing, enabling tightly synchronized, spatially aligned visual, force, and tactile data capture for real-world object interaction (Engelbracht et al., 4 Dec 2025).

1. Mechanical Design and Actuation

The Hoi! Gripper employs a classic antipodal parallel-jaw architecture with the following construction:

Fingers and Joints: Two opposing rigid fingers, each affixed to a single translation stage, open and close in parallel with no additional joints or underactuated couplings. Motion is confined to a single axis, the finger gap axis.
Actuation: Actuated by a Dynamixel XM430-W350-T servo, which incorporates an internal planetary gearbox for high-output torque at low velocities. The actuation system is physically mounted on a rigid handle, facilitating use in human demonstration scenarios, while a handle-integrated load cell transduces operator-applied pull force into a reference gripper current.
Materials and Construction: Finger bodies and the main chassis are CNC-milled from aluminum for rigidity and minimal mass. GelSight Digit tactile sensors are housed in 3D-printed mounts at each fingertip. The complete package (gripper, force-torque sensor, vision, compute) is engineered for portability and can be deployed fully untethered, with the electronics and power housed in a wearable backpack (Engelbracht et al., 4 Dec 2025).

2. Multimodal Sensing Suite

The gripper delivers synchronized, co-located, and highly calibrated signals necessary for advanced manipulation research:

6-DoF Force–Torque Sensing: A Bota SensONE sensor in the wrist measures Cartesian forces $f_S = [f_x, f_y, f_z]^T$ and torques $\tau_S = [\tau_x, \tau_y, \tau_z]^T$ in the sensor frame $S$. The sensor has a range of $\pm 100$ N in force and $\pm 10$ N·m in torque, a resolution of 0.01 N and 0.001 N·m, and is sampled at 1 kHz.
High-Resolution Fingertip Tactile Imaging: Dual GelSight Digit sensors at the fingertips acquire $160 \times 240$ px grayscale tactile images at 30 Hz, covering approximately $18 \times 12$ mm with $0.07$ mm/pixel spatial detail. Output includes both calibrated 3D point clouds and raw images under controlled illumination.
Operator Load Measurement: The handle-embedded load cell outputs a voltage $V_{out}$ in response to the applied pull force. After offset removal and scaling, this yields the gripper actuation force $F_{grip} = k_l (V_{out} - V_{bias})$, with $k_l \approx 20\,\text{N/V}$.
Motor Current/Torque Telemetry: The Dynamixel servo reports its current $I_{motor}$. The motor-side torque is estimated with the linear model $\tau_{motor}(I) = k_1 I + k_2$, with $k_1 = 1.769\,\text{N·m/A}$ and $k_2 = -0.2214\,\text{N·m}$.
Visual and Inertial Sensing: Wrist-mounted ZED Mini stereo camera provides $30\,\text{Hz}$ RGB-D, while Project Aria AR glasses supply egocentric RGB (30–60 Hz), real-time pose, and IMU at 200 Hz. These modalities are tightly synchronized for cross-view analysis.
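The two linear calibrations in the list above (load cell and motor current) can be sketched in a few lines of Python. The constants are the values quoted in the text. The function names and the sample readings are hypothetical and only illustrate the arithmetic; they are not taken from the Hoi! software.

```python
# Constants quoted in the text above.
K_L = 20.0      # N/V, approximate load-cell scale factor k_l
K_1 = 1.769     # N·m/A, slope k_1 of the motor torque model
K_2 = -0.2214   # N·m, offset k_2 of the motor torque model


def grip_force_from_load_cell(v_out: float, v_bias: float) -> float:
    """Load-cell voltage (V) to grip force (N): F_grip = k_l * (V_out - V_bias)."""
    return K_L * (v_out - v_bias)


def motor_torque(current_a: float) -> float:
    """Motor current (A) to estimated motor-side torque (N·m): tau = k_1 * I + k_2."""
    return K_1 * current_a + K_2


if __name__ == "__main__":
    # Hypothetical readings, chosen only for illustration.
    print(grip_force_from_load_cell(v_out=1.25, v_bias=0.25))  # 20.0
    print(round(motor_torque(1.0), 4))                         # 1.5476
```

Because $k_2$ is negative, the linear torque model returns negative values for currents below roughly 0.125 A, so it is presumably meant to be used only above that range.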

3. Signal Processing, Synchronization, and Calibration

Force–Torque Signal Correction: Raw F/T data are compensated for gravity and bias using no-load calibration sessions. The correction incorporates the mass mounted on the sensor, the transformation between coordinate frames, and the known sensor biases, yielding residuals of $\pm 0.1$ N and $\pm 0.01$ N·m in static conditions.
Temporal and Spatial Alignment:
- Temporal: All camera streams and the robot’s proprioceptive signals are aligned via a 25 Hz QR code overlay encoding precise Unix time, achieving inter-stream accuracy of 10–25 ms (95% CI).
- Spatial: High-resolution 3D scans (Leica RTC360) provide the environmental world frame. Camera and gripper trajectories (from SLAM or motion capture) are registered to this point cloud using hierarchical 2D–3D matching, achieving millimeter-level accuracy.
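Once all streams carry timestamps on a common clock, samples from streams with different rates still have to be paired. The following is a minimal sketch of nearest-timestamp matching, assuming the QR-code decoding has already mapped every stream onto Unix time. The function names and the `max_gap_s` tolerance are hypothetical; the source does not describe the pairing procedure that was actually used.

```python
from bisect import bisect_left


def nearest_index(ref_t, t):
    """Index of the timestamp in the ascending list ref_t that is closest to t."""
    i = bisect_left(ref_t, t)
    if i == 0:
        return 0
    if i == len(ref_t):
        return len(ref_t) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i - 1 if (t - ref_t[i - 1]) <= (ref_t[i] - t) else i


def align(ref_t, query_t, max_gap_s):
    """Pair each query timestamp with its nearest reference sample.

    Returns (query_index, ref_index) pairs whose time gap is at most max_gap_s.
    """
    pairs = []
    for qi, t in enumerate(query_t):
        ri = nearest_index(ref_t, t)
        if abs(ref_t[ri] - t) <= max_gap_s:
            pairs.append((qi, ri))
    return pairs
```

For example, `align(ft_times, camera_times, max_gap_s=0.025)` would attach one force-torque sample to each camera frame and drop frames with no sample within 25 ms; the variable names and the threshold are again illustrative.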
Frame Transformations for Contact Modeling: Contact forces for learning tasks are projected into local interaction frames at each fingertip (aligned with the tactile sensors) using rigid transformations computed from the known kinematics and sensor mounting geometry.
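The gravity and bias compensation described in this section can be illustrated with the standard rigid-payload model: the weight of the hardware mounted on the sensor is rotated into the sensor frame and subtracted, together with the static biases. This is a sketch of that common formulation, not necessarily the exact procedure used for the Hoi! Gripper. The function name, the argument names, and the assumption that the world z-axis points up are all introduced here for illustration.

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def cross(a, b):
    """Cross product a x b of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def compensate_wrench(f_meas, tau_meas, R_sw, mass, com_s,
                      f_bias, tau_bias, g_w=(0.0, 0.0, -9.81)):
    """Remove sensor bias and payload gravity from a sensor-frame wrench.

    R_sw  : rotation taking world-frame vectors into sensor frame S
    mass  : payload mass mounted on the sensor (kg)
    com_s : payload centre of mass expressed in frame S (m)
    g_w   : gravity in the world frame (assumes world z points up)
    """
    weight_s = mat_vec(R_sw, [mass * g for g in g_w])   # payload weight in S
    gravity_tau = cross(com_s, weight_s)                # torque caused by that weight
    f_ext = [m - b - w for m, b, w in zip(f_meas, f_bias, weight_s)]
    tau_ext = [m - b - t for m, b, t in zip(tau_meas, tau_bias, gravity_tau)]
    return f_ext, tau_ext
```

In a no-load recording the compensated wrench should be close to zero, which is how residuals such as those quoted above can be checked.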

4. Kinematic and Dynamic Modeling

Gripper Kinematics: The single-DoF translation stage is mapped through a Jacobian $J(q)$, a function of the servo angle $q$ that sets the finger gap, relating input motor torque to output fingertip force:

$F_{grip}(q,I) = \eta(I) \cdot J(q) \cdot \tau_{motor}(I)$

where $\eta(I)$ is an empirically fit efficiency factor. $q$ is read from the servo’s encoder, and $I$ from current feedback.
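The structure of this model can be sketched as a small Python function. The source does not give the functional forms of $J(q)$ or $\eta(I)$, so both are passed in as callables, and the constant placeholders used in the example are arbitrary values chosen only to make the sketch runnable. Only $k_1$ and $k_2$ come from the text.

```python
K_1 = 1.769     # N·m/A, from the motor torque model in the text
K_2 = -0.2214   # N·m, from the motor torque model in the text


def motor_torque(current_a):
    """Estimated motor-side torque (N·m) from motor current (A)."""
    return K_1 * current_a + K_2


def grip_force(q, current_a, jacobian, efficiency):
    """F_grip(q, I) = eta(I) * J(q) * tau_motor(I).

    jacobian(q)   : torque-to-force map in 1/m (form not given in the source)
    efficiency(I) : dimensionless empirical factor (form not given in the source)
    """
    return efficiency(current_a) * jacobian(q) * motor_torque(current_a)


if __name__ == "__main__":
    # Arbitrary placeholder models, NOT measured values.
    constant_jacobian = lambda q: 100.0      # 1/m
    constant_efficiency = lambda i: 0.8      # dimensionless
    print(grip_force(0.0, 1.0, constant_jacobian, constant_efficiency))
```

Passing the two unknown factors as callables keeps the sketch independent of any particular transmission geometry or efficiency fit.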

Force Decomposition: Gripper-level external forces are rotated into the local interaction frame at the tactile sensor (via $R_{i\leftarrow S}$ ) for analysis:

$F_{tang,i} = \lVert[f_{xi}, f_{yi}]^T\rVert_2,\quad F_{norm,i} = |f_{zi}|,\quad F_{comb,i} = \sqrt{F_{tang,i}^2 + F_{norm,i}^2}$
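The decomposition above can be written out directly. This minimal sketch assumes the rotation $R_{i\leftarrow S}$ is available as a 3×3 nested list and that the local z-axis is the contact normal, as the formulas imply; the function name is hypothetical.

```python
import math


def decompose_contact_force(f_s, R_is):
    """Rotate a sensor-frame force into fingertip frame i and split it.

    f_s  : force [f_x, f_y, f_z] in sensor frame S (N)
    R_is : 3x3 rotation taking vectors from frame S to fingertip frame i
    Returns (F_tang, F_norm, F_comb) in newtons.
    """
    f_i = [sum(R_is[r][c] * f_s[c] for c in range(3)) for r in range(3)]
    f_tang = math.hypot(f_i[0], f_i[1])   # in-plane (shear) magnitude
    f_norm = abs(f_i[2])                  # magnitude along the local z-axis
    f_comb = math.hypot(f_tang, f_norm)   # combined magnitude
    return f_tang, f_norm, f_comb


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(decompose_contact_force([3.0, 4.0, 12.0], identity))  # (5.0, 12.0, 13.0)
```

Note that $F_{comb,i}$ equals the Euclidean norm of the force vector and is therefore unchanged by the rotation; only the tangential/normal split depends on $R_{i\leftarrow S}$.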

Performance Characterization:
- Force–torque residuals: $\pm 0.1$ N in force and $\pm 0.01$ N·m in torque under no-load conditions, after compensation.
- Trajectory pose accuracy: $\mathrm{RMSE}_{position} \approx 0.005$–$0.006$ m, $\mathrm{RMSE}_{rotation} \approx 0.012$–$0.016$ rad (aligned to motion capture).
- Force prediction error: tactile-based RMSE of $3.9$ N (across six environments) and visual-force RMSE of $2.2$–$2.6$ N, an order of magnitude higher than on tabletop benchmarks, reflecting the complexity of real-world interaction.

5. Integration with Hoi! Dataset and Benchmarking

The Hoi! Gripper constitutes the primary tool embodiment for robot-grade data within the Hoi! dataset (3048 sequences, 381 object parts, 38 environments). Each manipulation episode includes:

Exocentric and egocentric RGB-D video streams
Wrist-view RGB-D
End-effector 6-DoF force-torque
Fingertip tactile images (2× GelSight)
Motor current, finger force, and joint torque

All data are aligned in space and time to a shared world frame, permitting:

Cross-view studies of physical contact and force from multiple visual perspectives
Analysis of embodiment transfer, e.g., visual-to-force prediction from human or gripper demonstration
Benchmarks for articulation estimation, tactile-force regression, vision-force affordance prediction, and unified vision-touch-force learning for articulated object manipulation

6. Research Impact and Applications

The Hoi! Gripper establishes a new standard in the collection of force-grounded manipulation data for articulated objects in real-world environments. By enabling synchronized visual, tactile, and force data from a parallel-jaw end-effector with robot-like mechanics, it provides:

A consistent physical interface for comparing human, teleoperated, and autonomous manipulation
A platform for benchmarking multimodal perception (vision, touch, force), visuo-tactile policy learning, and cross-embodiment transfer
Datasets supporting the training and evaluation of advanced control and learning algorithms for generalizable manipulation, especially in domains—such as furniture assembly or tool use—where cross-modal and cross-view reasoning is critical

Data collected with the gripper support studies on tactile-force regression, vision-tactile fusion via neural architectures, and policies for robust manipulation of articulated household objects (Engelbracht et al., 4 Dec 2025).

References (1)

Hoi! -- A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation (2025)
