OakInk Repository: Hand-Object Interaction Data
- OakInk is a large-scale multimodal repository that integrates detailed hand-object interaction data with object affordance annotations and human intent capture.
- It features dual knowledge bases – Oak for object-centric affordances and Ink for human-centric interactions – populated via multi-view imaging, motion capture, and 3D pose annotation.
- The repository supports tasks like 3D hand pose estimation, grasp synthesis, and interactive modeling, offering extensive benchmarks for vision and robotics research.
OakInk is a large-scale, multimodal repository designed for the systematic study of hand-object interaction, unifying object-centric affordance annotations (“Oak”) and human-centric interaction demonstrations (“Ink”). Its dual-knowledge structure captures both affordances—what objects are designed to enable—and the rich variety of intent-driven interactions performed by human subjects. OakInk provides granular part-level annotations, multi-view imaging and motion capture, pose and contact metadata, and interaction transfer across an extensive taxonomized object set. The resource enables detailed benchmarking and analysis for tasks such as 3D hand pose estimation, hand-object pose prediction, grasp synthesis, and higher-level interactive modeling, with all code and data available to the community (Yang et al., 2022).
1. Core Design and Knowledge Base Structure
OakInk is architected around two complementary databases:
- Oak Base (Object Affordance Knowledge): Comprises 1,800 single-hand manipulable household objects sourced from online vendors, ShapeNet, YCB, and ContactDB. Each object is classified in a two-level taxonomy (top level: maniptool for tools with handles and end-effectors, and functool for self-contained function objects); the sub-level consists of 32 WordNet-based categories. Every object receives part-level affordance annotations using consensus phrases reflecting potential interactions or functions.
- Ink Base (Interaction Knowledge): Contains detailed records of human-object interaction. For 100 selected "source" objects, 12 human subjects perform up to five distinct interaction intents (use, hold, lift-up, hand-out, receive), captured via synchronized RGB-D streams and motion capture (230,064 frames). Using a novel Tink transfer pipeline, these interactions are algorithmically mapped to all visually similar "target" objects (1,700 instances), resulting in 50,000 affordance- and intent-aware interaction samples.
The repository thus integrates detailed 3D object geometry, multi-level functional labeling, annotated hand and object poses, hand-object contact statistics, intent labels, and transferred interaction exemplars. OakInk offers two main dataset releases:
- OakInk-Image: 230,000 RGB-D frames with MoCap and associated 3D annotations, covering 100 source objects.
- OakInk-Shape: 50,000 distinct hand-object interaction samples incorporating geometry and pose, spanning the full set of 1,800 objects.
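To make the two releases concrete, the following is a purely illustrative Python sketch of what one sample from each release might carry; the field names and types are assumptions for exposition, not the repository's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical, simplified records; field names are illustrative only.

@dataclass
class ImageFrame:
    """One OakInk-Image frame: multi-view RGB-D with MoCap-derived labels."""
    object_id: str               # one of the 100 recorded "source" objects
    subject_id: int              # 1..12
    intent: str                  # use / hold / lift-up / hand-out / receive
    rgbd_paths: List[str]        # four synchronized RGB-D views
    hand_pose_mano: List[float]  # MANO pose parameters
    object_pose: List[float]     # 6-DoF object pose from MoCap
    joints_3d: List[List[float]]

@dataclass
class ShapeSample:
    """One OakInk-Shape sample: geometry-level hand-object interaction."""
    object_id: str                 # any of the 1,800 taxonomized objects
    source_object_id: str          # recorded object the grasp was transferred from
    intent: str
    hand_pose_mano: List[float]
    contact_map: Dict[int, float]  # per-vertex contactness
    transferred: bool              # True if produced by the Tink pipeline
```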
2. Annotation Schema and Representation
Object Taxonomy and Affordances
Objects are categorized first as either maniptool (tools with a handle and an end-effector, e.g., mugs, knives) or functool (functionally self-contained items, e.g., cameras, headphones). Each object is mapped to one of 32 WordNet-based sub-categories.
Each object or part is annotated with affordance phrases collected via a volunteer consensus process, totaling 30 phrases. These have the canonical form ⟨verb (+ preposition), something⟩ (e.g., ⟨cut, something⟩ for a knife blade, ⟨handled (by), something⟩ for handles), with ⟨no function⟩ designated for nonfunctional components.
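As a minimal illustration of this schema (identifiers and container layout assumed, not the repository's actual file format), the two-level taxonomy and part-level phrases could be encoded as follows:

```python
# Illustrative encoding of the two-level taxonomy and the part-level
# affordance phrases in the canonical <verb (+ preposition), something> form.
knife_annotation = {
    "object_id": "knife_01",      # hypothetical identifier
    "top_level": "maniptool",     # maniptool vs. functool
    "category": "knife",          # one of 32 WordNet-based sub-categories
    "parts": {
        "blade":  ("cut", "something"),
        "handle": ("handled (by)", "something"),
    },
}

camera_annotation = {
    "object_id": "camera_01",
    "top_level": "functool",      # functionally self-contained item
    "category": "camera",
    "parts": {
        "shutter_button": ("press", "something"),  # assumed example phrase
        "strap_loop":     ("no function", None),   # nonfunctional component
    },
}
```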
Intent and Interaction Coding
Five discrete intent classes are recognized for human actions: use, hold, lift-up, hand-out, and receive; each recorded sequence is labeled with its intent.
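The five intents form a small closed vocabulary. A minimal sketch of a sequence-level intent label (the label set is from the text; the container format is assumed):

```python
from enum import Enum

class Intent(Enum):
    USE = "use"
    HOLD = "hold"
    LIFT_UP = "lift-up"
    HAND_OUT = "hand-out"
    RECEIVE = "receive"

# Hypothetical sequence-level label attached to one recorded interaction.
sequence_label = {
    "object_id": "mug_01",
    "subject_id": 3,
    "intent": Intent.USE,
}
```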
Contactness Quantification
Hand and object meshes are co-registered in common MoCap coordinates. For each frame, a "contactness" value is computed for every object vertex with respect to each of 17 anatomical hand anchors, based on vertex-to-anchor proximity in the shared coordinate frame. This yields vertex-level dense contact maps, enabling spatial and intent-based analysis of interaction patterns.
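A minimal sketch of such a proximity-based contact map; the exponential decay, the sigma value, and the unit assumptions are illustrative choices, not the repository's exact contactness definition.

```python
import numpy as np

def contact_map(obj_vertices: np.ndarray,   # (V, 3) object vertices, MoCap frame
                hand_anchors: np.ndarray,   # (17, 3) anatomical hand anchors
                sigma: float = 0.01) -> np.ndarray:
    """Per-vertex contactness from vertex-to-anchor proximity.

    The Gaussian decay and sigma (in meters) are assumptions for illustration.
    """
    # Pairwise distances between every object vertex and every hand anchor.
    d = np.linalg.norm(obj_vertices[:, None, :] - hand_anchors[None, :, :], axis=-1)  # (V, 17)
    # The closest anchor dominates; map distance to a [0, 1] contactness score.
    return np.exp(-(d.min(axis=1) ** 2) / (2.0 * sigma ** 2))  # (V,)

# Usage: dense map over a mesh with V vertices for one frame.
verts = np.random.rand(2000, 3) * 0.2
anchors = np.random.rand(17, 3) * 0.2
cmap = contact_map(verts, anchors)  # values near 1 indicate likely contact
```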
3. Data Collection, Sensing Stack, and Tink Transfer Pipeline
Recording Infrastructure
Data capture occurs in a multi-sensor enclosure equipped with four RGB-D cameras and eight infrared MoCap cameras. Human subjects initiate each sequence with the hand at rest, then perform each of the five possible intents with a given object for an interval of ~5 seconds; non-informative frames are removed in manual post-processing.
Surface-attached reflective markers track object pose in the MoCap system, which is subsequently calibrated to the camera system. Hand pose is parametrized by the MANO model with pose parameters θ and shape parameters β. The pose estimation optimizes reprojection error on multi-view 2D keypoints, with auxiliary losses enforcing silhouette overlap, interpenetration minimization, temporal smoothness, and anatomical feasibility.
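A minimal sketch of such a multi-term fitting loop; the optimizer, weights, and step count are assumptions, and the individual loss terms (reprojection, silhouette, interpenetration, temporal smoothness, anatomy) are passed in as callables rather than implemented here.

```python
import torch

def fit_hand_pose(init_theta, init_beta, loss_terms, steps=200, lr=1e-2):
    """Multi-term MANO fitting loop.

    `loss_terms` maps a term name to (weight, callable(theta, beta) -> scalar
    tensor). Weights, optimizer, and schedule are illustrative assumptions.
    """
    theta = init_theta.clone().requires_grad_(True)  # MANO pose parameters
    beta = init_beta.clone().requires_grad_(True)    # MANO shape parameters
    opt = torch.optim.Adam([theta, beta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        total = sum(w * fn(theta, beta) for w, fn in loss_terms.values())
        total.backward()
        opt.step()
    return theta.detach(), beta.detach()
```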
Tink: Automated Interaction Transfer
To expand Ink beyond the 100 recorded objects, the Tink algorithm systematically transfers existing hand-object interaction patterns to the remaining 1,700 “target” objects of similar category. The Tink procedure comprises:
a) Shape Interpolation: DeepSDF models learn signed distance functions for the source and target objects, producing latent shape codes for each. Linear interpolation between the two codes generates intermediate geometric forms, which are reconstructed via Marching Cubes.
b) Contact Mapping: For every hand part anchor, Tink aligns contact patches from the source shape through the intermediate shapes to the target shape using iterative closest point (ICP) registration.
c) Pose Refinement: The optimal hand pose parameters for the target shape minimize a total objective combining a contact-region alignment term, an anatomical regularization term, and an interpenetration penalty; final results are screened for perceptual plausibility by a panel of five human raters.
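A schematic sketch of the three Tink stages as described above; the function bodies delegate DeepSDF decoding, Marching Cubes, ICP, and the optimizer to stand-in callables, so the interfaces are assumptions rather than any specific library's API.

```python
import numpy as np

def interpolate_shapes(z_source, z_target, decode_sdf, marching_cubes, n_steps=5):
    """a) Shape interpolation: walk linearly between DeepSDF latent codes and
    reconstruct each intermediate shape as a mesh."""
    meshes = []
    for t in np.linspace(0.0, 1.0, n_steps):
        z_t = (1.0 - t) * z_source + t * z_target  # linear latent interpolation
        meshes.append(marching_cubes(decode_sdf(z_t)))
    return meshes

def map_contacts(source_contacts, meshes, icp_align):
    """b) Contact mapping: propagate per-anchor contact patches shape by shape,
    source -> intermediates -> target, via ICP alignment."""
    contacts = source_contacts
    for prev_mesh, next_mesh in zip(meshes[:-1], meshes[1:]):
        contacts = icp_align(contacts, prev_mesh, next_mesh)
    return contacts

def refine_pose(init_pose, target_mesh, contacts, e_contact, e_anat, e_pen, optimize):
    """c) Pose refinement: minimize the combined objective
    E = E_contact + E_anat + E_pen over the hand pose parameters."""
    objective = lambda pose: (e_contact(pose, target_mesh, contacts)
                              + e_anat(pose)
                              + e_pen(pose, target_mesh))
    return optimize(objective, init_pose)
```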
4. Dataset Composition, Statistics, and Analysis
| Dataset | Object Count | Interactions | Frames | Modalities |
|---|---|---|---|---|
| OakInk-Image | 100 | - | 230,000 | RGB-D, MoCap, 2D/3D keypoints, intent labels |
| OakInk-Shape | 1,800 | 50,000 | - | 3D geometry, hand pose, object pose, contact |
- Objects: 1,800 total; 100 with direct recordings, 1,700 with transferred interactions.
- Interactions: 50,000 unique hand-object pairs, annotated with affordance, intent, pose, contactness, and transfer metadata.
- Coverage: Annotations leverage 30 consensus affordance phrases across 32 categories; heatmaps illustrate dense contact frequencies on functional object regions.
Significance: This scale and annotation richness support hypothesis-driven analysis of affordance saliency, part-centric grasp strategies, and intent-conditioned interaction variance.
5. Benchmarking and Evaluation Protocols
OakInk is benchmarked on several canonical tasks:
- Hand Mesh Recovery (HMR) (OakInk-Image):
- Evaluation splits: by view (SP0), by subject (SP1), by object (SP2).
- Methods: I2L-MeshNet, HandTailor.
- Performance metrics: mean per-joint position error (MPJPE, mm), AUC@[0,50 mm], mean per-vertex position error (MPVPE).
- Results (SP0): I2L-MeshNet MPJPE=12.10 mm (AUC=0.784), MPVPE=12.29 mm; HandTailor MPJPE=11.20 mm (AUC=0.884), MPVPE=11.75 mm.
- Hand–Object Pose Estimation (HOPE) (OakInk-Image):
- Methods: Tekin et al., Hasson et al.
- Metrics: MPJPE (mm), mean corner position error (MPCPE; mm).
- Results: Hasson et al.—MPJPE=27.26 mm, MPCPE=56.09 mm; Tekin et al.—MPJPE=23.52 mm, MPCPE=52.16 mm.
- Grasp Generation (GraspGen) (OakInk-Shape):
- Method: GrabNet (conditional variational autoencoder).
- Metrics: penetration depth (cm), intersection volume (cm³), simulated displacement mean/std (cm), perceptual Likert score (1–5).
- Test split results: Pen. depth=0.67, Vol=6.60, Disp μ=1.21, Disp σ=2.05, Score=3.66.
Context: These protocols enable standardized, reproducible evaluation for learning-based interaction models and 3D vision pipelines, especially in settings with cross-object, cross-intent, and cross-participant generalization.
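For reference, the two primary hand-pose metrics used above (MPJPE and AUC over the [0, 50 mm] threshold range) can be computed as in the sketch below; root alignment conventions and the threshold sampling are common practice assumed here, not taken from the benchmark code.

```python
import numpy as np

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error in mm; inputs are (N, J, 3) arrays in mm."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

def auc_pck(pred_joints, gt_joints, max_thresh_mm=50.0, n_steps=100):
    """Area under the PCK curve over thresholds in [0, max_thresh_mm], in [0, 1]."""
    errors = np.linalg.norm(pred_joints - gt_joints, axis=-1).ravel()  # per-joint errors
    thresholds = np.linspace(0.0, max_thresh_mm, n_steps)
    pck = np.array([(errors <= t).mean() for t in thresholds])         # fraction correct
    return np.trapz(pck, thresholds) / max_thresh_mm                   # normalize

# Usage on dummy data with 21 hand joints (MPVPE is analogous, over mesh vertices).
pred = np.random.rand(8, 21, 3) * 10
gt = np.random.rand(8, 21, 3) * 10
print(mpjpe(pred, gt), auc_pck(pred, gt))
```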
6. Applications and Research Use Cases
- Intent-based Interaction Generation (IntGen): Models hand pose synthesis conditioned on both object shape and recognized intent. By extending GrabNet with an intent embedding, the model generates functional, intent-aligned hand-object interactions. For example, generating a "use" grasp for a mug yields penetration depth=0.45 cm, intersection vol=4.22 cm³, disp μ=0.86 cm, σ=1.51 cm, Likert score=3.86.
- Human-to-Human Handover Generation (HoverGen): Targets bi-manual coordination, producing receiver hand poses given an object shape and the giver’s root pose. Employs a two-stage architecture (CoarseNet, RefineNet) leveraging Chamfer distance for hand-to-hand spatial constraints; final performance: pen. depth=0.62 cm, intersection vol=6.99 cm³, disp μ=1.30 cm, σ=2.03 cm, Likert score=4.03.
These tools address research questions in action understanding, autonomous grasp planning, human-robot collaboration, and simulation-to-reality transfer, leveraging the bidirectional mapping between affordance, intention, and interaction geometry.
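A minimal sketch of the intent-conditioning idea behind IntGen, assuming a GrabNet-style decoder that consumes an object shape feature and a latent code; the embedding size, layer widths, and output dimensionality below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class IntentConditionedDecoder(nn.Module):
    """Decode hand pose parameters from (object feature, latent code, intent)."""
    def __init__(self, obj_feat_dim=1024, latent_dim=16, n_intents=5,
                 intent_dim=32, hand_dim=61):
        super().__init__()
        self.intent_embed = nn.Embedding(n_intents, intent_dim)  # learned intent embedding
        self.net = nn.Sequential(
            nn.Linear(obj_feat_dim + latent_dim + intent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, hand_dim),  # MANO pose (+ translation) parameters, size assumed
        )

    def forward(self, obj_feat, z, intent_id):
        e = self.intent_embed(intent_id)                      # (B, intent_dim)
        return self.net(torch.cat([obj_feat, z, e], dim=-1))  # (B, hand_dim)

# Usage: generate a "use" grasp hypothesis for a batch of two objects.
dec = IntentConditionedDecoder()
obj_feat = torch.randn(2, 1024)       # e.g., a point-cloud encoder output
z = torch.randn(2, 16)                # sampled latent code
intent_id = torch.tensor([0, 0])      # index of the "use" intent
hand_params = dec(obj_feat, z, intent_id)
```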
7. Data Access, API, and Implementation Details
OakInk’s dataset and codebase are openly available at https://github.com/lixiny/OakInk. The repository is organized into clearly separated modules:
```
oakink/
├─ oak/                  # Oak: taxonomy, affordance, 3D models
│  ├─ objects.json
│  ├─ models/
│  └─ affordances.json
└─ ink/                  # Ink: interaction data
   ├─ image/
   ├─ mocap/
   ├─ annotations/
   └─ shapes/
```
A minimal usage example of the Python API:

```python
import oakink

# Load the two knowledge bases from their local paths.
oak = oakink.OakBase('/path/to/oak')
ink = oakink.InkBase('/path/to/ink')

objs = oak.list_objects()            # all 1,800 taxonomized objects
aff = oak.get_affordance('mug')      # part-level affordance phrases

# Retrieve and visualize one intent-labeled interaction sample.
sample = ink.get_interaction(object='mug_01', intent='use', idx=5)
oakink.visualize_interaction(sample)
```
This infrastructure supports direct experimental replication, dataset augmentation, and integration with existing vision, graphics, and robotics pipelines, providing a comprehensive platform for the study of human-level hand-object interaction (Yang et al., 2022).