AffordPose: 3D Hand–Object Affordance Dataset
- AffordPose is a large-scale 3D dataset annotated at the object-part level with eight distinct hand-centered affordances using full 16-DOF MANO hand models.
- It employs a rigorous two-stage process combining semantic segmentation and simulated hand manipulation in GraspIt! to ensure physically plausible interactions.
- The dataset supports tasks such as affordance classification, grasp synthesis, and hand-pose recovery, with standardized splits and baselines for reproducibility.
AffordPose is a large-scale 3D dataset designed to support detailed analysis, learning, and generation of human hand–object interactions, with a particular focus on how intended affordances shape hand poses. Every sample is explicitly annotated at the object-part level with one of eight fine-grained "hand-centered" affordances, supporting rigorous research into plausible, functionally guided manipulation well beyond generic "use" or "handover" labels. Both its scale (26,712 interactions spanning 641 objects) and its granularity (full 16-DOF MANO hand parameterizations with contact-driven quality control) distinguish it as a foundational asset for computational manipulation and affordance-driven learning research (Jian et al., 2023).
1. Dataset Construction and Representations
AffordPose comprises two fundamental components per sample: a 3D hand–object interaction and an explicit part-level affordance label indicating the intended manipulation type. The collection follows a two-stage process. First, each object is semantically segmented (using PartNet/PartNet-Mobility labels), and every object part is assigned exactly one affordance label (or “no affordance”) by a group of five trained annotators, with consensus enforced by majority. Second, for every afforded part, annotators manually adjust a MANO hand model within the GraspIt! simulator to realize a physically plausible configuration exhibiting the specified affordance. Physical plausibility is validated via the GraspIt! force analysis solver, which restricts mesh interpenetration and ensures an executable interaction.
Table 1 summarizes the composition and annotation paradigms.
| Objects | Object Classes | Hand–Object Interactions | Affordance Labels |
|---|---|---|---|
| 641 | 13 | 26,712 | 8 |
Objects are stored as watertight triangle meshes (.obj/.ply) normalized to life scale. Hand annotations consist solely of pose and joint angles: the MANO hand model comprises a 16-dimensional articulation vector $\theta \in \mathbb{R}^{16}$ (articulated DOFs, with fixed shape) and a global pose (translation $t \in \mathbb{R}^3$ plus an $SO(3)$ rotation represented as a quaternion $q$). Thus, each annotation is the tuple $(t, q, \theta)$, unifying intrinsic joint configuration with extrinsic spatial embedding relative to the object (Jian et al., 2023).
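To make the representation concrete, here is a minimal sketch of assembling one annotation tuple $(t, q, \theta)$ into a 4×4 object-frame hand transform. This is illustrative, not the released tooling; in particular, the (x, y, z, w) quaternion ordering is an assumption.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def hand_annotation_to_transform(t, q, theta):
    """Return the 16-D articulation vector and the 4x4 object-frame hand transform.

    The (x, y, z, w) quaternion ordering is an assumption; the released
    data may use (w, x, y, z).
    """
    theta = np.asarray(theta, dtype=np.float64)
    assert theta.shape == (16,), "MANO articulation uses 16 DOFs here"
    T = np.eye(4)
    T[:3, :3] = Rotation.from_quat(q).as_matrix()  # scipy expects (x, y, z, w)
    T[:3, 3] = t
    return theta, T
```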
2. Affordance Taxonomy and Labeling Protocols
AffordPose delineates eight recurring hand-centered affordances:
- handle-grasp: Enclosing one or more fingers around a handle or loop.
- press: Applying force downward onto a button or surface.
- lift: Elevating a flat or shallow part (e.g., lids).
- pull: Extracting or peeling (e.g., zippers, drawers).
- twist: Rotational manipulation about the object axis (caps, knobs).
- wrap-grasp: Encompassing the body of an object (bottles, jars) cylindrically.
- support: Underlying, plate-like support from below.
- lever: Manipulating about a pivot point as with a switch.
For each part resulting from fine semantic segmentation, five annotators assign a single affordance label (or “none” if inapplicable) by majority, ensuring consistent interpretation across samples. This protocol yields affordance labels that are both generalizable (not object-instance-specific) and specific enough to support nuanced, segment-driven manipulation research (Jian et al., 2023).
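As an illustration of the consensus rule, here is a minimal sketch; the tie-breaking behavior is an assumption, since the source only states that a majority decides.

```python
from collections import Counter

AFFORDANCES = ["handle-grasp", "press", "lift", "pull",
               "twist", "wrap-grasp", "support", "lever", "none"]

def consensus_label(votes):
    """Majority vote over five annotator labels for one object part.

    Tie handling is not specified in the source; here we conservatively
    fall back to "none" when no label reaches a strict majority.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) // 2 else "none"

# e.g., consensus_label(["pull", "pull", "pull", "lift", "none"]) -> "pull"
```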
3. Quantitative Analysis of Hand–Object Interactions
Comprehensive quantitative analyses of hand pose and contact are central to AffordPose.
- Pose Distribution: For each affordance, per-joint statistics (mean $\mu$, standard deviation $\sigma$) are computed across all samples to characterize typicality and diversity. For instance, "pull" affordances demonstrate tight constraints (low $\sigma$) at finger bases and higher variability at distal phalanges. The archetypal pose for each affordance is defined by selecting the sample $\theta_i$ minimizing the distance to $\mu$.
- Finger-Contact Frequency: For each affordance class, the empirical contact probability of each finger is reported. Contact is defined per hand vertex $v$ whenever $d(v) \le \epsilon$ for a small threshold $\epsilon$, where $d(v)$ is the signed distance from $v$ to the object surface.
- Penetration and Solid Intersection: Penetration depth is quantified as

$$P = \max\Bigl(0,\; -\min_{v \in \mathcal{H}} d(v)\Bigr),$$

where $\mathcal{H}$ denotes the set of hand vertices and negative $d(v)$ denotes interpenetration. Solid intersection volume is also measured (e.g., by voxel or mesh intersection); depths are reported in cm and volumes in cm³ during method evaluation (Jian et al., 2023). A minimal computational sketch of these metrics follows this list.
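Below is a minimal sketch of these per-interaction metrics using trimesh. Note that trimesh's signed-distance convention is positive inside the mesh, the opposite of the convention above, so the sign is flipped; the 2 mm threshold is an assumed placeholder, not the paper's value.

```python
import numpy as np
import trimesh

def contact_and_penetration(obj_mesh, hand_vertices, eps=0.002):
    """Contact flags and penetration depth for one hand-object interaction.

    The text defines contact as d(v) <= eps, with d(v) the signed distance
    to the object surface and negative d(v) meaning interpenetration.
    trimesh's signed_distance is positive INSIDE the mesh, so negate it.
    eps = 2 mm is an assumed placeholder threshold.
    """
    d = -trimesh.proximity.signed_distance(obj_mesh, hand_vertices)
    contact = d <= eps                # per-vertex contact flags
    penetration = max(0.0, -d.min())  # deepest interpenetration, in mesh units
    return contact, penetration

def archetypal_pose(thetas):
    """Select the sample closest to the per-affordance mean pose."""
    thetas = np.asarray(thetas)       # shape (N, 16)
    mu = thetas.mean(axis=0)
    return thetas[np.linalg.norm(thetas - mu, axis=1).argmin()]
```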
4. Data Format, Access, and Organization
AffordPose is distributed under a Creative Commons non-commercial license and is publicly accessible at https://github.com/GentlesJan/AffordPose. The dataset is meticulously structured for immediate integration:
```
├── meshes/        # 3D object meshes (.obj)
├── part_labels/   # Semantic part-to-affordance labels (.npy/.json)
├── hand_poses/    # Folders per interaction: pose.json + affordance.txt
├── splits/        # Train/val/test split lists
└── README.md      # Format/coordinate conventions
```
- Meshes: .obj/.ply files, centroid at origin, right-handed coordinate frame, life scale relative to the MANO hand.
- Labels: JSON/Numpy mapping of mesh vertices to affordance classes.
- Hand poses: JSON with keys "t" (translation), "q" (quaternion), and "theta" (16-D joint angles), plus a per-interaction text affordance label; a minimal loader is sketched after this list.
- Splits: Standardized lists for benchmarking reproducibility (Jian et al., 2023).
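The following loading sketch is based on the layout and key names above; the exact file schema is an assumption, so the repository README remains authoritative.

```python
import json
from pathlib import Path
import numpy as np

def load_interaction(root, interaction_id):
    """Load one interaction's hand pose and affordance label.

    File locations and key names ("t", "q", "theta") follow the layout
    sketched above; the release may differ, so treat this as illustrative.
    """
    folder = Path(root) / "hand_poses" / interaction_id
    pose = json.loads((folder / "pose.json").read_text())
    affordance = (folder / "affordance.txt").read_text().strip()
    t = np.asarray(pose["t"])          # global translation, shape (3,)
    q = np.asarray(pose["q"])          # rotation quaternion, shape (4,)
    theta = np.asarray(pose["theta"])  # articulation, shape (16,)
    return t, q, theta, affordance
```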
5. Benchmark Tasks and Evaluation Protocols
AffordPose provides baseline benchmarks for two critical downstream tasks:
5.1 Hand–Object Affordance Understanding
Goal: Given a demonstration hand pose $\theta$ and a target object point cloud $P$, predict either the object-level affordance class or per-point affordance labels.
- Architecture: The hand intrinsics $\theta$ are passed through a four-layer MLP, concatenated to the per-point features of $P$, and processed via a DGCNN backbone (EdgeConv). For classification: global max pool, FC, softmax over 8 classes. For segmentation: per-point FC and softmax over 9 labels ("no affordance" + 8 affordances). A simplified architectural sketch follows this list.
- Losses: Standard categorical cross-entropy for both classification and segmentation.
- Performance: Affordance classification reaches 98.4% accuracy when the full hand parameterization is provided as input; part-localization IoU reaches up to 96.3% (Jian et al., 2023).
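A simplified PyTorch sketch of this baseline, with a PointNet-style shared MLP and global max pool standing in for the full EdgeConv/DGCNN backbone, and all layer widths assumed:

```python
import torch
import torch.nn as nn

class AffordanceClassifier(nn.Module):
    """Simplified stand-in for the affordance-understanding baseline.

    A four-layer MLP embeds the 16-D hand parameters; the embedding is
    broadcast to every object point, and a PointNet-style shared MLP with
    a global max pool stands in for the EdgeConv/DGCNN backbone.
    """
    def __init__(self, n_classes=8):
        super().__init__()
        self.hand_mlp = nn.Sequential(
            nn.Linear(16, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64),
        )
        self.point_mlp = nn.Sequential(
            nn.Linear(3 + 64, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
        )
        self.head = nn.Linear(256, n_classes)

    def forward(self, points, theta):
        # points: (B, N, 3) object point cloud; theta: (B, 16) hand DOFs
        h = self.hand_mlp(theta)                            # (B, 64)
        h = h.unsqueeze(1).expand(-1, points.shape[1], -1)  # (B, N, 64)
        feat = self.point_mlp(torch.cat([points, h], -1))   # (B, N, 256)
        return self.head(feat.max(dim=1).values)            # (B, n_classes)
```

Both this classification head and a per-point segmentation head over 9 labels would be trained with `nn.CrossEntropyLoss`, matching the losses described above.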
5.2 Affordance-Oriented Interaction Generation
Goal: Given an object $O$ and a desired affordance label $a$, predict the full hand pose $(t, q, \theta)$.
- Baseline: GrabNet predicts unconditioned grasps.
- AffordPoseNet: Features from a PointNet++/DGCNN encoder on $O$, concatenated with a one-hot encoding of $a$, are decoded via FC layers to $(t, q)$ and $\theta$, optionally augmented with a contact loss term penalizing penetration (a simplified sketch follows this list).
- Metrics: Mean penetration depth, solid intersection volume, contact ratio, and affordance accuracy (fraction of samples contacting the correct affordance region).
- Findings: Penetration and contact are on par with state-of-the-art (mean penetration 0.89 cm; contact ratio 96%), while affordance accuracy is 83.5% for AffordPoseNet (cannot be measured for affordance-agnostic baselines) (Jian et al., 2023).
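A simplified sketch of such an affordance-conditioned generator, with a PointNet-style encoder standing in for the PointNet++/DGCNN features and all widths assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionGenerator(nn.Module):
    """Affordance-conditioned pose generator in the spirit of AffordPoseNet.

    The global object feature is concatenated with a one-hot affordance
    code and decoded by FC layers into (t, q, theta).
    """
    def __init__(self, n_affordances=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(256 + n_affordances, 256), nn.ReLU(),
            nn.Linear(256, 3 + 4 + 16),  # t (3) + q (4) + theta (16)
        )

    def forward(self, points, affordance_onehot):
        feat = self.encoder(points).max(dim=1).values  # (B, 256) global feature
        out = self.decoder(torch.cat([feat, affordance_onehot], -1))
        t, q, theta = out[:, :3], out[:, 3:7], out[:, 7:]
        q = F.normalize(q, dim=-1)  # project onto unit quaternions
        return t, q, theta
```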
5.3 Image-Based Tasks
Each interaction is rendered from three random viewpoints to generate RGB images.
- Interaction classification (RGB): ResNet-18 backbone, 97% precision/recall (a minimal fine-tuning sketch follows this list).
- Single-image hand mesh recovery: I2L-MeshNet variant, mean per-vertex error 16.4 mm, per-joint angular error 0.189 rad (Jian et al., 2023).
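A minimal sketch of the RGB classification setup, assuming a standard ImageNet-pretrained ResNet-18 with its final layer swapped for an 8-way affordance head (training details are not specified in the source):

```python
import torch
import torchvision

# ResNet-18 backbone with an 8-way affordance classification head.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 8)
criterion = torch.nn.CrossEntropyLoss()  # assumed training loss
```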
6. Positioning Within the Affordance and Manipulation Dataset Ecosystem
AffordPose provides interaction-centric, affordance-driven hand annotations at a scale and semantic granularity not found in other datasets. Unlike HANDAL (Guo et al., 2023), which targets category-level object pose from RGB-only input and binary "handle" affordance segmentation for robot manipulation, AffordPose covers full hand articulation and provides explicit, multi-class affordance labels grounded in human annotation. The Geometric Pose Affordance dataset (Wang et al., 2019), though occasionally referred to as "AffordPose," focuses on full-body human pose in synthetic box scenes with extensive multi-view geometry and does not annotate hand–object affordance interactions at the part level. In contrast to recent work on language-conditioned, gripper-based pose generation in 3D point clouds (Nguyen et al., 2023), AffordPose is distinguished by its focus on human hands, fully parameterized via MANO, and its strict part-level affordance labeling protocol.
7. Applications, Significance, and Limitations
AffordPose enables research in affordance classification, affordance-driven grasp synthesis, hand-pose estimation from both 3D and RGB views, and part-aware hand–object reasoning. By disentangling object parts, semantic intent, and detailed anatomical configuration, it bridges functional affordances and physical realization in hand-centric manipulation. The dataset’s annotated diversity supports generalization and robustness studies for affordance-aware models.
While AffordPose provides detailed parameter statistics and metrics for each affordance class, further work may extend its semantic granularity or cover more diverse object domains and manipulation contexts, mirroring the expansion seen in object-centric grasping datasets. The released code and documentation facilitate integration and reproducibility in experimental pipelines for affordance-learning research (Jian et al., 2023).