Egocentric Full-Hand Tactile Dataset
- Egocentric full-hand tactile datasets are multimodal resources capturing naturalistic hand-object interactions with high-resolution tactile and pose data.
- They integrate calibrated tactile sensing with synchronized RGB/RGB-D video and hand pose tracking to support embodied perception and manipulation analyses.
- Baseline models demonstrate improved contact and grasp inference, fostering research in human–robot interaction and dexterous robotics.
An egocentric full-hand tactile dataset is a comprehensive, multimodal resource for studying naturalistic hand-object interactions from a first-person perspective with fine-grained touch and force data. Such datasets combine synchronized first-person RGB or RGB-D video, tactile signals capturing per-point hand contact pressure, and detailed hand pose information, enabling research in embodied perception, contact-rich manipulation, vision-based tactile inference, and human-robot interaction. Recent advances have produced datasets of unprecedented quality in terms of both spatial-temporal resolution and ecological diversity, such as “EgoPressure” (Zhao et al., 2024) and “OpenTouch” (Song et al., 18 Dec 2025), which collectively define the state of the art in this domain.
1. Dataset Architectures and Data Acquisition
The physical and technical design of egocentric full-hand tactile datasets is a primary driver of both fidelity and ecological validity. The following table summarizes the core architectures of EgoPressure and OpenTouch, the two leading datasets in this space:
| Dataset | Tactile Sensor | Video Capture | Hand Pose Tracking |
|---|---|---|---|
| EgoPressure | Sensel Morph (160×168) | 8 Kinect DK (1 ego, 7 exo) | Multi-view RGB-D+PnP+MANO |
| OpenTouch | FPCB Glove (16×16) | Meta Aria Glasses (egocentric) | Rokoko Smartglove (IMU+EMF) |
EgoPressure (Zhao et al., 2024):
- Utilizes a high-density Sensel Morph touchpad (160×168, ≈1.5 mm pitch, 120 Hz) beneath interchangeable paper textures, synchronized with RGB-D captured at 30 Hz using one head-mounted and seven stationary Azure Kinect DK cameras. A global trigger and IR marker system achieves cross-modal timestamp alignment (±1 ms), and egocentric camera pose is tracked via active LED arrays and PnP.
- Participants: 21 adults (23–32 years), together contributing ≈5 hours of hand-surface gestures (31 categories, both hands).
OpenTouch (Song et al., 18 Dec 2025):
- Employs a custom flexible printed circuit board glove (16×16 taxels, Δx ≈3.7 mm, 30 Hz), calibrated taxel-by-taxel (0.02–50 kPa, linear model per taxel; see the calibration sketch after this list), with force readings transmitted wirelessly. Video is recorded with Meta Aria Profile 28 glasses at 1408×1408 px, 30 Hz; hand pose is concurrently captured with a Rokoko Smartglove (7×6-DOF IMU+EMF, 21 keypoints, 30 Hz), with timestamp anchoring based on a visual cue at recording start.
- Participants: Multiple right-hand-dominant subjects performing contact-rich, in-the-wild manipulations with ≈8,000 unique objects across 14 environments.
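The per-taxel linear calibration described above can be pictured with a short sketch. The coefficient arrays, raw-frame format, and clamping behavior below are illustrative assumptions, not the released calibration procedure:

```python
import numpy as np

def calibrate_frame(raw, gain, offset, p_min=0.02, p_max=50.0):
    """Convert a raw 16x16 taxel frame to pressure in kPa.

    Assumes a per-taxel linear model p = gain * raw + offset, with `gain`
    and `offset` fitted independently for each taxel (hypothetical arrays
    of shape (16, 16)); values are clamped to the reported 0.02-50 kPa range.
    """
    pressure = gain * raw + offset
    return np.clip(pressure, p_min, p_max)

# Example with placeholder readings and coefficients.
raw_frame = np.random.randint(0, 4096, size=(16, 16)).astype(np.float32)
gain = np.full((16, 16), 0.012, dtype=np.float32)    # illustrative per-taxel gains
offset = np.full((16, 16), 0.02, dtype=np.float32)   # illustrative per-taxel offsets
pressure_kpa = calibrate_frame(raw_frame, gain, offset)
```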
2. Data Annotation, Synchronization, and Structure
Robust temporal and spatial alignment across modalities is essential. Both datasets synchronize tactile and video streams at high precision (sub-frame synchronization via IR markers in EgoPressure, visual-cue alignment in OpenTouch) and align the hand pose streams to the same unified timebase.
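As a concrete illustration of bringing streams with different rates onto one timebase (e.g., 120 Hz pressure against 30 Hz RGB-D in EgoPressure), the following minimal sketch matches each reference frame to the nearest sample of another stream after both have been anchored to a shared clock; the maximum-gap threshold is an assumption:

```python
import numpy as np

def align_to_reference(ref_ts, other_ts, max_gap=0.02):
    """For each reference timestamp, return the index of the nearest sample
    in another stream, or -1 where the gap exceeds `max_gap` seconds.

    `ref_ts` and `other_ts` are sorted 1-D arrays of timestamps (seconds)
    on a shared clock, e.g. after trigger- or visual-cue-based anchoring.
    """
    idx = np.clip(np.searchsorted(other_ts, ref_ts), 1, len(other_ts) - 1)
    left, right = other_ts[idx - 1], other_ts[idx]
    nearest = np.where(ref_ts - left < right - ref_ts, idx - 1, idx)
    gap = np.abs(other_ts[nearest] - ref_ts)
    return np.where(gap <= max_gap, nearest, -1)

# Map each 30 Hz video frame to the closest 120 Hz pressure frame.
video_ts = np.arange(0, 10, 1 / 30)
pressure_ts = np.arange(0, 10, 1 / 120)
pressure_idx = align_to_reference(video_ts, pressure_ts)
```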
Annotation and Organization:
- EgoPressure:
- Frames are organized by participant, session, and gesture, with each frame containing: RGB and depth images from all eight cameras, pressure maps ([160×168] float32, kPa), full MANO mesh parameters (θ, β, per-vertex offsets D, translation t), camera extrinsics, and PnP-based head pose.
- Pressure is projected from the pad onto the hand mesh using differentiable rendering and optimized alignment in the UV atlas.
- Provides a Python API, PyTorch Dataset interface, and WebGL-based visualizer.
- OpenTouch:
- Curated into 2,900 densely annotated multimodal clips (≈3 hours), each with RGB video, tactile arrays (N, 16, 16), 21-keypoint 3D poses, and a JSON annotation (object name, type, environment, action verb, grasp type via the Feix taxonomy, free-form description); a clip-filtering sketch follows this list.
- Directory structure supports efficient retrieval, with splits.csv defining train/val/test for each clip.
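To make the clip organization concrete, the sketch below filters OpenTouch clips by split and grasp type using the per-clip JSON annotation and splits.csv. The directory layout, column names, and JSON keys are assumptions based on the description above, not the published schema:

```python
import csv, json
from pathlib import Path

root = Path("OpenTouch")  # assumed dataset root

# splits.csv is assumed to map a clip id to its train/val/test split.
with open(root / "splits.csv") as f:
    split_of = {row["clip_id"]: row["split"] for row in csv.DictReader(f)}

train_power_grasps = []
for ann_path in sorted((root / "annotation").glob("*.json")):
    ann = json.loads(ann_path.read_text())
    clip_id = ann_path.stem.replace("_annotation", "")
    # Hypothetical keys; the paper lists object name/type, environment,
    # action verb, Feix grasp type, and a free-form description.
    if split_of.get(clip_id) == "train" and "power" in ann.get("grasp_type", "").lower():
        train_power_grasps.append(clip_id)
```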
3. Hand Pose and Pressure Mapping Methodologies
Precise hand modeling and contact localization underpin the analytic capabilities of these datasets.
EgoPressure (Zhao et al., 2024):
- Initial hand pose per frame is estimated via HaMeR, then refined in a two-stage process:
- Stage 1—Pose Optimization:
- Optimizes MANO θ and global translation t by minimizing an objective comprising mask IoU, RGB alignment (MSE), depth alignment, and a self-intersection penalty over all cameras, leveraging differentiable rendering (DIB-R); a schematic of this multi-term objective is sketched after this list.
- Stage 2—Shape and Pressure Alignment:
- With pose fixed, per-vertex offsets improve shape fidelity and align mesh contact regions to empirical pressure via an additional pressure consistency loss and ARAP/Laplacian/offset regularization. Temporal smoothness across batches ensures consistency.
- Reaches mask-IoU < 0.06, depth-IoU > 0.87, and 3D fingertip error ≈ 5.7 mm against manual triangulations after refinement.
- High-resolution pressure is mapped to hand-mesh UV using a differentiable projection from pad to hand mesh, enforcing both intensity and zero-gap constraints.
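A schematic of how such a multi-term Stage 1 objective can be assembled is given below. The term weights, tensor layout, and the way the differentiable renderer and self-intersection penalty are supplied are placeholders; the paper's exact DIB-R-based formulation is not reproduced here:

```python
import torch
import torch.nn.functional as F

def stage1_objective(rendered, observed, self_penetration,
                     weights=(1.0, 1.0, 1.0, 0.1)):
    """Schematic pose objective summed over all cameras.

    `rendered` and `observed` are dicts of per-camera tensors (silhouette
    masks, RGB, depth) from a differentiable renderer and the captures;
    `self_penetration` is an assumed precomputed mesh self-intersection score.
    """
    w_mask, w_rgb, w_depth, w_pen = weights
    loss = torch.zeros(())
    for cam in rendered["mask"]:
        # Soft mask term: 1 - IoU between rendered and observed silhouettes.
        inter = (rendered["mask"][cam] * observed["mask"][cam]).sum()
        union = (rendered["mask"][cam] + observed["mask"][cam]).clamp(max=1).sum()
        loss = loss + w_mask * (1.0 - inter / union.clamp(min=1e-6))
        # Photometric and depth alignment terms (MSE).
        loss = loss + w_rgb * F.mse_loss(rendered["rgb"][cam], observed["rgb"][cam])
        loss = loss + w_depth * F.mse_loss(rendered["depth"][cam], observed["depth"][cam])
    return loss + w_pen * self_penetration
```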
OpenTouch (Song et al., 18 Dec 2025):
- Hand pose is captured by the sensor-instrumented glove and reported as an array of 3D positions with shape (N frames × 21 keypoints × 3).
- Pressure per taxel can be projected onto a triangular hand mesh via inverse-distance interpolation (sketched below):

  $$p_f = \sum_{j} w_{fj}\, p_j, \qquad w_{fj} = \frac{d_{fj}^{-1}}{\sum_{k} d_{fk}^{-1}},$$

  where $w_{fj}$ is the weight between mesh face $f$ and taxel $j$, $p_j$ is the taxel pressure, and $d_{fj}$ is the distance between face $f$ and taxel $j$.
- No mesh-based optimization or differentiable rendering is performed; pose and tactile streams are strictly sensor-based.
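A minimal sketch of this projection is shown below, assuming each taxel has a known 3D position on the posed glove and each mesh face is represented by its centroid; names and shapes are illustrative:

```python
import numpy as np

def project_pressure(face_centroids, taxel_positions, taxel_pressure, eps=1e-6):
    """Project per-taxel pressure onto mesh faces via inverse-distance weights.

    face_centroids:  (F, 3) centroids of triangular hand-mesh faces
    taxel_positions: (T, 3) 3D positions of the 16x16 = 256 taxels
    taxel_pressure:  (T,)   calibrated pressure per taxel, kPa
    """
    # Pairwise distances d_fj between face f and taxel j.
    d = np.linalg.norm(face_centroids[:, None, :] - taxel_positions[None, :, :], axis=-1)
    w = 1.0 / (d + eps)                    # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)      # normalize over taxels
    return w @ taxel_pressure              # (F,) per-face pressure p_f
```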
4. Baseline Models, Benchmarks, and Evaluation Metrics
Both datasets supply baseline models for cross-modal inference and recognition.
EgoPressure (Zhao et al., 2024):
- PressureVisionNet (PV):
SEResNeXt50 encoder (ImageNet-initialized) with an FPN decoder; predicts 16 pressure classes from egocentric RGB. The loss is cross-entropy plus a temporal smoothness term (sketched after this list). Evaluated on Contact-IoU, volumetric-IoU, MAE [Pa], and temporal consistency.
- RGB + 2.5D Keypoints:
Augments the encoder's first layer with joint heatmaps; improves volumetric-IoU by 2.8–2.9% when ground-truth keypoints are used.
- PressureFormer:
Takes HaMeR's ViT features, applies transformer cross-attention from mesh vertices to image, decodes UV pressure map (20 force bins). Loss combines CE on UV pressure and image-projected pressure. Achieves Contact-IoU=43.0% and UV-Contact-IoU=33.1%. Accurately localizes pressure on occluded surfaces via mesh UV.
- Joint pose and pressure estimation is proposed as future work.
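To illustrate the PressureVisionNet training objective (cross-entropy over discretized pressure classes plus a temporal smoothness term), here is a minimal PyTorch sketch; the smoothness weight and the pairing of consecutive frames are assumptions rather than the published recipe:

```python
import torch.nn.functional as F

def pressure_loss(logits_t, logits_prev, target_t, smooth_weight=0.1):
    """Cross-entropy over pressure classes plus temporal smoothness.

    logits_t, logits_prev: (B, 16, H, W) class logits at frames t and t-1
    target_t:              (B, H, W)     integer pressure-class labels at t
    """
    ce = F.cross_entropy(logits_t, target_t)
    # Penalize frame-to-frame changes in the predicted class distributions.
    smooth = F.mse_loss(logits_t.softmax(dim=1), logits_prev.softmax(dim=1))
    return ce + smooth_weight * smooth
```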
OpenTouch (Song et al., 18 Dec 2025):
- Cross-modal Retrieval:
- Video→Tactile: Recall@1=7.15%, mAP=15.47%
- Tactile→Pose: Recall@1=7.15%, mAP=13.43%
- Multi-modal fusion (Video+Pose→Tactile): Recall@1=14.08%
- CCA and PLSCA baselines perform near chance.
- Pattern Classification:
- Best action accuracy: Video=40.26%, Tactile=31.59%, T+V=32.73%, T+P+V=37.32%.
- Best grasp accuracy: Tactile=57.12%, Video=57.45%, T+V=65.47%, T+P+V=68.09%.
- Tactile data is the strongest unimodal cue for grasp recognition, while fusion improves performance on both tasks; a minimal sketch of the Recall@1 retrieval metric follows.
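For reference, the retrieval numbers above can be computed as follows for paired embeddings; the embedding model and the cosine-similarity choice are assumptions, and only Recall@1 is shown:

```python
import numpy as np

def recall_at_1(query_emb, gallery_emb):
    """Recall@1 for paired cross-modal retrieval (e.g., video -> tactile).

    query_emb, gallery_emb: (N, D) L2-normalized embeddings where row i of
    the query modality corresponds to row i of the gallery modality.
    """
    sim = query_emb @ gallery_emb.T            # cosine similarity matrix
    top1 = sim.argmax(axis=1)                  # best gallery match per query
    return float((top1 == np.arange(len(sim))).mean())
```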
5. Dataset Formats, Access, and Usage
Each dataset provides programmatic access and recommended split protocols.
EgoPressure (Zhao et al., 2024):
- Distributed as a per-participant/session/gesture folder hierarchy with per-frame files (RGB/depth images, pressure .npy, MANO .json, camera extrinsics), accompanied by a Python API.
- Provides train/test splits, pressure-to-mesh projection utilities, UV visualization, and playback via Python and the WebGL viewer; a minimal loader in the style of the PyTorch Dataset interface is sketched below.
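A minimal loader over the folder hierarchy might look like the sketch below; the directory globbing and file names (pressure_*.npy, mano_*.json) are assumptions, and the official Python API should be preferred in practice:

```python
import json
from pathlib import Path

import numpy as np
from torch.utils.data import Dataset

class EgoPressureFrames(Dataset):
    """Iterates per-frame records under <root>/<participant>/<session>/<gesture>/."""

    def __init__(self, root):
        # Hypothetical layout: one pressure_*.npy and one mano_*.json per frame.
        self.records = sorted(Path(root).glob("*/*/*/pressure_*.npy"))

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        p_path = self.records[i]
        pressure = np.load(p_path)            # (160, 168) float32, kPa
        mano_path = p_path.with_name(p_path.name.replace("pressure", "mano")).with_suffix(".json")
        mano = json.loads(mano_path.read_text())
        return {"pressure": pressure, "mano": mano}
```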
OpenTouch (Song et al., 18 Dec 2025):
- Clip-oriented directory: mp4 RGB, .npz tactile, .json pose, .json annotation per clip, plus splits.csv (80/10/10 protocol).
- Processing example (Python):
```python
import numpy as np, json, cv2

d = np.load("tactile/clip_012_tactile.npz")
pressure = d["arr_0"]                                # (N, 16, 16), kPa
pose = json.load(open("pose/clip_012_pose.json"))
cap = cv2.VideoCapture("rgb/clip_012_rgb.mp4")
ret, frame = cap.read()
pressure_norm = (pressure - 0.02) / (50.0 - 0.02)    # map calibrated range to [0, 1]
```
- Recommended use includes embodied learning (e.g., cross-modal hallucination), robotic manipulation (policy network finetuning), and data augmentation (pressure binning, spatial jitter; sketched below).
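The augmentations mentioned above can be sketched as follows; the number of force bins and the jitter range are illustrative choices:

```python
import numpy as np

def bin_pressure(pressure, n_bins=20, p_min=0.02, p_max=50.0):
    """Discretize continuous kPa pressure into integer force bins."""
    edges = np.linspace(p_min, p_max, n_bins + 1)
    return np.clip(np.digitize(pressure, edges) - 1, 0, n_bins - 1)

def spatial_jitter(pressure, max_shift=1, rng=np.random):
    """Shift an (N, 16, 16) tactile array by up to `max_shift` taxels,
    zero-filling the border that the shift exposes."""
    dy, dx = rng.randint(-max_shift, max_shift + 1, size=2)
    out = np.zeros_like(pressure)
    src = pressure[:, max(0, -dy):16 - max(0, dy), max(0, -dx):16 - max(0, dx)]
    out[:, max(0, dy):16 - max(0, -dy), max(0, dx):16 - max(0, -dx)] = src
    return out
```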
6. Applications and Research Implications
Egocentric full-hand tactile datasets enable new research paradigms and applications, including:
- AR/VR Input: Pressure-sensitive virtual keyboards, musical interfaces, and adaptive UIs leveraging direct tactile signals (Zhao et al., 2024).
- Human–Robot Interaction (HRI): Grasp and handover learning for robotic hands using direct supervision from human tactile data (e.g., Mandikal & Grauman).
- Dexterous Robotics: Vision-guided pressure imitation improving grip stability and task adaptability (cf. Christen et al., Collins et al.; D-Grasp, Visual Gripper Pressure).
- Behavior Analysis and Skill Assessment: Analysis of fine-motor patterns in rehabilitation or expertise studies (Ego4D, OpenTouch).
- Multimodal Perception: Cross-sensory embedding and retrieval, aligning tactile with video and pose data to facilitate embodied learning and manipulation tasks.
This suggests that future models may exploit such datasets for joint pose-pressure inference, multi-task embodied agents, and transfer to real-world manipulation scenarios that were previously out of reach due to data limitations.
7. Limitations and Directions for Future Research
Current datasets differ in their controlled (EgoPressure: pad-based, high spatial/temporal pressure resolution) versus in-the-wild (OpenTouch: sensor glove, unconstrained environments) designs. A plausible implication is that combining high-resolution pressure with ecological diversity remains an open challenge.
Future research is poised to explore fully end-to-end learning from egocentric video to hand mesh and UV pressure distributions, data augmentation protocols reflecting sensor noise, and transfer learning for cross-domain embodied agents.
References
- "EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision" (Zhao et al., 2024)
- "OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction" (Song et al., 18 Dec 2025)