OakInk Repository: Hand-Object Interaction Data
- OakInk is a large-scale multimodal repository that integrates detailed hand-object interaction data with object affordance annotations and human intent capture.
- It features dual knowledge bases – Oak for object-centric affordances and Ink for human-centric interactions – populated via multi-view imaging, motion capture, and 3D pose annotation.
- The repository supports tasks like 3D hand pose estimation, grasp synthesis, and interactive modeling, offering extensive benchmarks for vision and robotics research.
OakInk is a large-scale, multimodal repository designed for the systematic study of hand-object interaction, unifying object-centric affordance annotations (“Oak”) and human-centric interaction demonstrations (“Ink”). Its dual-knowledge structure captures both affordances—what objects are designed to enable—and the rich variety of intent-driven interactions performed by human subjects. OakInk provides granular part-level annotations, multi-view imaging and motion capture, pose and contact metadata, and interaction transfer across an extensive taxonomized object set. The resource enables detailed benchmarking and analysis for tasks such as 3D hand pose estimation, hand-object pose prediction, grasp synthesis, and higher-level interactive modeling, with all code and data available to the community (Yang et al., 2022).
1. Core Design and Knowledge Base Structure
OakInk is architected around two complementary databases:
- Oak Base (Object Affordance Knowledge): Comprises 1,800 single-hand manipulable household objects sourced from online vendors, ShapeNet, YCB, and ContactDB. Each object is classified in a two-level taxonomy (top level: maniptool for tools with handles and end-effectors, and functool for self-contained function objects); the sub-level consists of 32 WordNet-based categories. Every object receives part-level affordance annotations using consensus phrases reflecting potential interactions or functions.
- Ink Base (Interaction Knowledge): Contains detailed records of human-object interaction. For 100 selected "source" objects, 12 human subjects perform up to five distinct interaction intents (use, hold, lift-up, hand-out, receive), captured via synchronized RGB-D streams and motion capture (230,064 frames). Using a novel Tink transfer pipeline, these interactions are algorithmically mapped to all visually similar "target" objects (1,700 instances), resulting in 50,000 affordance- and intent-aware interaction samples.
The repository thus integrates detailed 3D object geometry, multi-level functional labeling, annotated hand and object poses, hand-object contact statistics, intent labels, and transferred interaction exemplars. OakInk offers two main dataset releases:
- OakInk-Image: 230,000 RGB-D frames with MoCap and associated 3D annotations, covering 100 source objects.
- OakInk-Shape: 50,000 distinct hand-object interaction samples incorporating geometry and pose, spanning the full set of 1,800 objects.
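To make the two releases concrete, the following is a purely illustrative Python sketch of what one sample from each release might carry; the field names and types are assumptions for exposition, not the repository's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical, simplified records; field names are illustrative only.

@dataclass
class ImageFrame:
    """One OakInk-Image frame: multi-view RGB-D with MoCap-derived labels."""
    object_id: str               # one of the 100 recorded "source" objects
    subject_id: int              # 1..12
    intent: str                  # use / hold / lift-up / hand-out / receive
    rgbd_paths: List[str]        # four synchronized RGB-D views
    hand_pose_mano: List[float]  # MANO pose parameters
    object_pose: List[float]     # 6-DoF object pose from MoCap
    joints_3d: List[List[float]]

@dataclass
class ShapeSample:
    """One OakInk-Shape sample: geometry-level hand-object interaction."""
    object_id: str                 # any of the 1,800 taxonomized objects
    source_object_id: str          # recorded object the grasp was transferred from
    intent: str
    hand_pose_mano: List[float]
    contact_map: Dict[int, float]  # per-vertex contactness
    transferred: bool              # True if produced by the Tink pipeline
```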
2. Annotation Schema and Representation
Object Taxonomy and Affordances
Objects are categorized first as either maniptool (tools with a handle and an end-effector, e.g., mugs, knives) or functool (functionally self-contained items, e.g., cameras, headphones). Each object is mapped to one of 32 WordNet-based sub-categories.
Each object or part is annotated with affordance phrases collected via a volunteer consensus process, totaling 30 phrases. These have the canonical form ⟨verb (+ preposition), something⟩ (e.g., ⟨cut, something⟩ for a knife blade, ⟨handled (by), something⟩ for handles), with ⟨no function⟩ designated for nonfunctional components.
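As a minimal illustration of this schema (identifiers and container layout assumed, not the repository's actual file format), the two-level taxonomy and part-level phrases could be encoded as follows:

```python
# Illustrative encoding of the two-level taxonomy and the part-level
# affordance phrases in the canonical <verb (+ preposition), something> form.
knife_annotation = {
    "object_id": "knife_01",      # hypothetical identifier
    "top_level": "maniptool",     # maniptool vs. functool
    "category": "knife",          # one of 32 WordNet-based sub-categories
    "parts": {
        "blade":  ("cut", "something"),
        "handle": ("handled (by)", "something"),
    },
}

camera_annotation = {
    "object_id": "camera_01",
    "top_level": "functool",      # functionally self-contained item
    "category": "camera",
    "parts": {
        "shutter_button": ("press", "something"),  # assumed example phrase
        "strap_loop":     ("no function", None),   # nonfunctional component
    },
}
```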
Intent and Interaction Coding
Five discrete intent classes are recognized for human actions: use, hold, lift-up, hand-out, and receive; each recorded sequence is labeled with its intent.
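The five intents form a small closed vocabulary. A minimal sketch of a sequence-level intent label (the label set is from the text; the container format is assumed):

```python
from enum import Enum

class Intent(Enum):
    USE = "use"
    HOLD = "hold"
    LIFT_UP = "lift-up"
    HAND_OUT = "hand-out"
    RECEIVE = "receive"

# Hypothetical sequence-level label attached to one recorded interaction.
sequence_label = {
    "object_id": "mug_01",
    "subject_id": 3,
    "intent": Intent.USE,
}
```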
Contactness Quantification
Hand and object meshes are co-registered in common MoCap coordinates. For each frame, a "contactness" value is computed for every object vertex with respect to each of 17 anatomical hand anchors, based on vertex-to-anchor proximity in the shared coordinate frame. This yields vertex-level dense contact maps, enabling spatial and intent-based analysis of interaction patterns.
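A minimal sketch of such a proximity-based contact map; the exponential decay, the sigma value, and the unit assumptions are illustrative choices, not the repository's exact contactness definition.

```python
import numpy as np

def contact_map(obj_vertices: np.ndarray,   # (V, 3) object vertices, MoCap frame
                hand_anchors: np.ndarray,   # (17, 3) anatomical hand anchors
                sigma: float = 0.01) -> np.ndarray:
    """Per-vertex contactness from vertex-to-anchor proximity.

    The Gaussian decay and sigma (in meters) are assumptions for illustration.
    """
    # Pairwise distances between every object vertex and every hand anchor.
    d = np.linalg.norm(obj_vertices[:, None, :] - hand_anchors[None, :, :], axis=-1)  # (V, 17)
    # The closest anchor dominates; map distance to a [0, 1] contactness score.
    return np.exp(-(d.min(axis=1) ** 2) / (2.0 * sigma ** 2))  # (V,)

# Usage: dense map over a mesh with V vertices for one frame.
verts = np.random.rand(2000, 3) * 0.2
anchors = np.random.rand(17, 3) * 0.2
cmap = contact_map(verts, anchors)  # values near 1 indicate likely contact
```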
3. Data Collection, Sensing Stack, and Tink Transfer Pipeline
Recording Infrastructure
Data capture occurs in a multi-sensor enclosure equipped with four RGB-D cameras and eight infrared MoCap cameras. Human subjects initiate each sequence with the hand at rest, then perform each of the five possible intents with a given object for an interval of ~5 seconds; non-informative frames are removed in manual post-processing.
Surface-attached reflective markers track object pose in the MoCap system, which is subsequently calibrated to the camera system. Hand pose is parametrized by the MANO model with pose parameters θ and shape parameters β. The pose estimation optimizes reprojection error on multi-view 2D keypoints, with auxiliary losses enforcing silhouette overlap, interpenetration minimization, temporal smoothness, and anatomical feasibility.
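A minimal sketch of such a multi-term fitting loop; the optimizer, weights, and step count are assumptions, and the individual loss terms (reprojection, silhouette, interpenetration, temporal smoothness, anatomy) are passed in as callables rather than implemented here.

```python
import torch

def fit_hand_pose(init_theta, init_beta, loss_terms, steps=200, lr=1e-2):
    """Multi-term MANO fitting loop.

    `loss_terms` maps a term name to (weight, callable(theta, beta) -> scalar
    tensor). Weights, optimizer, and schedule are illustrative assumptions.
    """
    theta = init_theta.clone().requires_grad_(True)  # MANO pose parameters
    beta = init_beta.clone().requires_grad_(True)    # MANO shape parameters
    opt = torch.optim.Adam([theta, beta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        total = sum(w * fn(theta, beta) for w, fn in loss_terms.values())
        total.backward()
        opt.step()
    return theta.detach(), beta.detach()
```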
Tink: Automated Interaction Transfer
To expand Ink beyond the 100 recorded objects, the Tink algorithm systematically transfers existing hand-object interaction patterns to the remaining 1,700 “target” objects of similar category. The Tink procedure comprises:
a) Shape Interpolation: DeepSDF models learn signed distance functions for the source and target objects, producing latent shape codes for each. Linear interpolation between the two codes generates intermediate geometric forms, which are reconstructed via Marching Cubes.
b) Contact Mapping: For every hand part anchor, Tink aligns contact patches from the source shape through the intermediate shapes to the target shape using iterative closest point (ICP) registration.
c) Pose Refinement: The optimal hand pose parameters for the target shape minimize a total objective combining a contact-region alignment term, an anatomical regularization term, and an interpenetration penalty; final results are screened for perceptual plausibility by a panel of five human raters.
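A schematic sketch of the three Tink stages as described above; the function bodies delegate DeepSDF decoding, Marching Cubes, ICP, and the optimizer to stand-in callables, so the interfaces are assumptions rather than any specific library's API.

```python
import numpy as np

def interpolate_shapes(z_source, z_target, decode_sdf, marching_cubes, n_steps=5):
    """a) Shape interpolation: walk linearly between DeepSDF latent codes and
    reconstruct each intermediate shape as a mesh."""
    meshes = []
    for t in np.linspace(0.0, 1.0, n_steps):
        z_t = (1.0 - t) * z_source + t * z_target  # linear latent interpolation
        meshes.append(marching_cubes(decode_sdf(z_t)))
    return meshes

def map_contacts(source_contacts, meshes, icp_align):
    """b) Contact mapping: propagate per-anchor contact patches shape by shape,
    source -> intermediates -> target, via ICP alignment."""
    contacts = source_contacts
    for prev_mesh, next_mesh in zip(meshes[:-1], meshes[1:]):
        contacts = icp_align(contacts, prev_mesh, next_mesh)
    return contacts

def refine_pose(init_pose, target_mesh, contacts, e_contact, e_anat, e_pen, optimize):
    """c) Pose refinement: minimize the combined objective
    E = E_contact + E_anat + E_pen over the hand pose parameters."""
    objective = lambda pose: (e_contact(pose, target_mesh, contacts)
                              + e_anat(pose)
                              + e_pen(pose, target_mesh))
    return optimize(objective, init_pose)
```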
4. Dataset Composition, Statistics, and Analysis
| Dataset | Object Count | Interactions | Frames | Modalities |
|---|---|---|---|---|
| OakInk-Image | 100 | - | 230,000 | RGB-D, MoCap, 2D/3D keypoints, intent labels |
| OakInk-Shape | 1,800 | 50,000 | - | 3D geometry, hand pose, object pose, contact |
- Objects: 1,800 total; 100 with direct recordings, 1,700 with transferred interactions.
- Interactions: 50,000 unique hand-object pairs, annotated with affordance, intent, pose, contactness, and transfer metadata.
- Coverage: Annotations leverage 30 consensus affordance phrases across 32 categories; heatmaps illustrate dense contact frequencies on functional object regions.
Significance: This scale and annotation richness support hypothesis-driven analysis of affordance saliency, part-centric grasp strategies, and intent-conditioned interaction variance.
5. Benchmarking and Evaluation Protocols
OakInk is benchmarked on several canonical tasks:
- Hand Mesh Recovery (HMR) (OakInk-Image):
- Evaluation splits: by view (SP0), by subject (SP1), by object (SP2).
- Methods: I2L-MeshNet, HandTailor.
- Performance metrics: mean per-joint position error (MPJPE, mm), AUC@[0,50 mm], mean per-vertex position error (MPVPE).
- Results (SP0): I2L-MeshNet MPJPE=12.10 mm (AUC=0.784), MPVPE=12.29 mm; HandTailor MPJPE=11.20 mm (AUC=0.884), MPVPE=11.75 mm.
- Hand–Object Pose Estimation (HOPE) (OakInk-Image):
- Methods: Tekin et al., Hasson et al.
- Metrics: MPJPE (mm), mean corner position error (MPCPE; mm).
- Results: Hasson et al.—MPJPE=27.26 mm, MPCPE=56.09 mm; Tekin et al.—MPJPE=23.52 mm, MPCPE=52.16 mm.
- Grasp Generation (GraspGen) (OakInk-Shape):
- Method: GrabNet (conditional variational autoencoder).
- Metrics: penetration depth (cm), intersection volume (cm³), simulated displacement mean/std (cm), perceptual Likert score (1–5).
- Test split results: Pen. depth=0.67, Vol=6.60, Disp μ=1.21, Disp σ=2.05, Score=3.66.
Context: These protocols enable standardized, reproducible evaluation for learning-based interaction models and 3D vision pipelines, especially in settings with cross-object, cross-intent, and cross-participant generalization.
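For reference, the two primary hand-pose metrics used above (MPJPE and AUC over the [0, 50 mm] threshold range) can be computed as in the sketch below; root alignment conventions and the threshold sampling are common practice assumed here, not taken from the benchmark code.

```python
import numpy as np

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error in mm; inputs are (N, J, 3) arrays in mm."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

def auc_pck(pred_joints, gt_joints, max_thresh_mm=50.0, n_steps=100):
    """Area under the PCK curve over thresholds in [0, max_thresh_mm], in [0, 1]."""
    errors = np.linalg.norm(pred_joints - gt_joints, axis=-1).ravel()  # per-joint errors
    thresholds = np.linspace(0.0, max_thresh_mm, n_steps)
    pck = np.array([(errors <= t).mean() for t in thresholds])         # fraction correct
    return np.trapz(pck, thresholds) / max_thresh_mm                   # normalize

# Usage on dummy data with 21 hand joints (MPVPE is analogous, over mesh vertices).
pred = np.random.rand(8, 21, 3) * 10
gt = np.random.rand(8, 21, 3) * 10
print(mpjpe(pred, gt), auc_pck(pred, gt))
```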
6. Applications and Research Use Cases
- Intent-based Interaction Generation (IntGen): Models hand pose synthesis conditioned on both object shape and recognized intent. By extending GrabNet with an intent embedding, the model generates functional, intent-aligned hand-object interactions. For example, generating a "use" grasp for a mug yields penetration depth=0.45 cm, intersection vol=4.22 cm³, disp μ=0.86 cm, σ=1.51 cm, Likert score=3.86.
- Human-to-Human Handover Generation (HoverGen): Targets bi-manual coordination, producing receiver hand poses given an object shape and the giver’s root pose. Employs a two-stage architecture (CoarseNet, RefineNet) leveraging Chamfer distance for hand-to-hand spatial constraints; final performance: pen. depth=0.62 cm, intersection vol=6.99 cm³, disp μ=1.30 cm, σ=2.03 cm, Likert score=4.03.
These tools address research questions in action understanding, autonomous grasp planning, human-robot collaboration, and simulation-to-reality transfer, leveraging the bidirectional mapping between affordance, intention, and interaction geometry.
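A minimal sketch of the intent-conditioning idea behind IntGen, assuming a GrabNet-style decoder that consumes an object shape feature and a latent code; the embedding size, layer widths, and output dimensionality below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class IntentConditionedDecoder(nn.Module):
    """Decode hand pose parameters from (object feature, latent code, intent)."""
    def __init__(self, obj_feat_dim=1024, latent_dim=16, n_intents=5,
                 intent_dim=32, hand_dim=61):
        super().__init__()
        self.intent_embed = nn.Embedding(n_intents, intent_dim)  # learned intent embedding
        self.net = nn.Sequential(
            nn.Linear(obj_feat_dim + latent_dim + intent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, hand_dim),  # MANO pose (+ translation) parameters, size assumed
        )

    def forward(self, obj_feat, z, intent_id):
        e = self.intent_embed(intent_id)                      # (B, intent_dim)
        return self.net(torch.cat([obj_feat, z, e], dim=-1))  # (B, hand_dim)

# Usage: generate a "use" grasp hypothesis for a batch of two objects.
dec = IntentConditionedDecoder()
obj_feat = torch.randn(2, 1024)       # e.g., a point-cloud encoder output
z = torch.randn(2, 16)                # sampled latent code
intent_id = torch.tensor([0, 0])      # index of the "use" intent
hand_params = dec(obj_feat, z, intent_id)
```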
7. Data Access, API, and Implementation Details
OakInk’s dataset and codebase are openly available at https://github.com/lixiny/OakInk. The repository is organized into clearly separated modules:
```
oakink/
├─ oak/                  # Oak: taxonomy, affordance, 3D models
│  ├─ objects.json
│  ├─ models/
│  └─ affordances.json
└─ ink/                  # Ink: interaction data
   ├─ image/
   ├─ mocap/
   ├─ annotations/
   └─ shapes/
```
A minimal usage example of the Python API:

```python
import oakink

# Load the two knowledge bases from their local paths.
oak = oakink.OakBase('/path/to/oak')
ink = oakink.InkBase('/path/to/ink')

objs = oak.list_objects()            # all 1,800 taxonomized objects
aff = oak.get_affordance('mug')      # part-level affordance phrases

# Retrieve and visualize one intent-labeled interaction sample.
sample = ink.get_interaction(object='mug_01', intent='use', idx=5)
oakink.visualize_interaction(sample)
```
This infrastructure supports direct experimental replication, dataset augmentation, and integration with existing vision, graphics, and robotics pipelines, providing a comprehensive platform for the study of human-level hand-object interaction (Yang et al., 2022).