
Million-Scale Pose Dataset

Updated 30 November 2025
  • Million-scale pose datasets are comprehensive resources offering over one million annotated poses across objects, hands, and human bodies for robust training.
  • They integrate diverse modalities such as RGB, depth, and synthetic data with automated and manual annotation pipelines to ensure high precision.
  • These datasets drive innovation in applications like AR/VR, robotics, and activity parsing while revealing challenges in occlusion, symmetry, and sim-to-real transfer.

A million-scale pose dataset is a large-scale resource containing more than one million ground-truth pose annotations for objects, hands, or human bodies in images or videos. Such datasets are foundational for advancing deep learning methods in computer vision, providing the data scale and diversity necessary for robust training, benchmarking, and evaluation of pose estimation, tracking, and understanding across domains including object manipulation, hand interaction, human activity, and group motion. These datasets typically span multiple modalities (e.g., RGB, depth, stereo), cover diverse scene conditions (e.g., clutter, occlusion, articulated items), and offer meticulously curated or physically simulated ground truth, often at fine spatial and temporal scales. They catalyze algorithmic advances by surfacing new challenges related to occlusion, symmetry, category-level generalization, and sim-to-real transfer.
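
To make the notion of a pose annotation concrete, the sketch below shows one plausible record layout for a single object pose. The field names are illustrative assumptions, not the release schema of any particular dataset.

```python
# Hypothetical layout for one pose annotation record; all field names are
# illustrative assumptions, not any dataset's actual release format.
from dataclasses import dataclass

import numpy as np


@dataclass
class PoseAnnotation:
    object_id: str           # instance or category identifier
    rotation: np.ndarray     # 3x3 rotation matrix, R in SO(3)
    translation: np.ndarray  # 3-vector t in R^3, camera frame, meters
    frame_index: int         # position in the video sequence, if any
    modality: str            # e.g. "rgb", "depth", or "stereo"
```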

1. Definitions, Modalities, and Scope

Million-scale pose datasets encompass collections ranging from 1M to several million annotated instances, supporting a variety of tasks:

  • Object 6D pose estimation: Annotating the rotation and translation ($R \in SO(3)$, $t \in \mathbb{R}^3$) of objects relative to the camera, as in Omni6DPose (Zhang et al., 6 Jun 2024), PACE (You et al., 2023), and StereOBJ-1M (Liu et al., 2021); a minimal transform sketch follows this list.
  • Human and hand pose estimation: High-DOF keypoint annotation for full bodies (SMPL) or hands, including multi-person and interacting-hand settings, exemplified by InterHand2.6M (Moon et al., 2020), WorldPose (Jiang et al., 6 Jan 2025).
  • Synthetic and real modalities: Mixed-reality and photoreal rendering (RenderIH (Li et al., 2023), SOPE in Omni6DPose (Zhang et al., 6 Jun 2024)) supplement real-world capture for increased domain and pose diversity.
  • Annotation formats: COCO-style JSON for objects, SMPL parameters for body models, and 3D keypoint sets for hands, together with multi-view geometrically consistent pose parameters.
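
As referenced in the first bullet above, a 6D pose is the pair (R, t) mapping model-frame points into the camera frame. A minimal sketch, assuming row-vector points and metric units:

```python
# Minimal 6D-pose transform: x' = R x + t, with R in SO(3), t in R^3.
import numpy as np


def is_rotation(R: np.ndarray, tol: float = 1e-6) -> bool:
    """True if R is orthogonal with determinant +1, i.e. R lies in SO(3)."""
    return (np.allclose(R @ R.T, np.eye(3), atol=tol)
            and np.isclose(np.linalg.det(R), 1.0, atol=tol))


def apply_pose(R: np.ndarray, t: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Map an (N, 3) array of model-frame points into the camera frame."""
    assert is_rotation(R), "R must be a proper rotation"
    return points @ R.T + t
```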

These datasets frequently employ RGB, depth, or stereo capture, with increasing prevalence of multi-view synchronization, ground-truth 3D shape models, and per-frame segmentation masks.

2. Construction Methodologies and Annotation Pipelines

Achieving million-scale, high-quality pose data necessitates specialized acquisition and annotation protocols:

  • Sparse manual + optimization propagation: StereOBJ-1M (Liu et al., 2021) manually labels keypoints on only a few keyframes per object, then propagates accurate 6D poses to all frames via multi-view bundle adjustment, delivering a >300× annotation-efficiency gain over frame-by-frame manual labeling while preserving sub-millimeter pose accuracy even for challenging materials (see the PnP sketch after this list).
  • High-resolution, multi-view capture: InterHand2.6M (Moon et al., 2020) deploys 80–140 synchronized cameras and multi-light capture to triangulate hand keypoints, later refined by deep network predictions and multi-frame triangulation (mean 3D error: 2.78 mm).
  • Automated physical simulation and rendering: RenderIH (Li et al., 2023) generates 1M photorealistic images via Blender with random backgrounds, skin textures, and camera viewpoints, leveraging a pose optimization pipeline that ensures inter-hand contact, anatomical constraint, and zero mesh penetration using anchor-based attraction, SDF collision, and adversarial filtering.
  • Category-level canonicalization: Omni6DPose (Zhang et al., 6 Jun 2024) performs high-precision scanning and aligns all object meshes into common canonical frames, enabling generalization across a large set of object categories and symmetry types.
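
The per-frame core of sparse-keypoint pipelines such as StereOBJ-1M's can be sketched as a PnP solve from a few labeled 2D-3D correspondences. The keypoints and intrinsics below are placeholders, and the multi-view bundle adjustment that the real pipeline layers on top is omitted:

```python
# Sketch of per-keyframe pose recovery from sparse manual keypoints via PnP.
# All numbers are placeholders; real pipelines refine these poses further
# with multi-view bundle adjustment before propagating them to all frames.
import cv2
import numpy as np

# 3D keypoint locations in the object's model frame (meters).
model_pts = np.array([
    [0.00, 0.00, 0.00], [0.10, 0.00, 0.00], [0.00, 0.10, 0.00],
    [0.00, 0.00, 0.10], [0.10, 0.10, 0.00], [0.10, 0.00, 0.10],
], dtype=np.float64)

# Manually clicked pixel locations of the same keypoints in one keyframe.
image_pts = np.array([
    [320.0, 240.0], [410.0, 238.0], [322.0, 150.0],
    [318.0, 242.0], [412.0, 148.0], [409.0, 241.0],
], dtype=np.float64)

# Calibrated pinhole intrinsics (fx, fy, cx, cy); distortion assumed zero.
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])

ok, rvec, tvec = cv2.solvePnP(model_pts, image_pts, K, distCoeffs=None)
R, _ = cv2.Rodrigues(rvec)  # axis-angle -> 3x3 rotation matrix
print("solved:", ok, "\nR =\n", R, "\nt =", tvec.ravel())
```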

Table: Key Construction Strategies

| Dataset | Real/Synth | Annotation Workflow | Modalities |
| --- | --- | --- | --- |
| StereOBJ-1M | Real | Sparse keypoints + optimization | RGB, stereo |
| InterHand2.6M | Real | Multi-view triangulation + DNN | RGB |
| RenderIH | Synth | Pose opt. + photoreal rendering | RGB |
| Omni6DPose | Real/Synth | Precise scan + SfM + rendering | RGB, depth |
| PACE | Real/Synth | Multi-view capture + annotation | RGB-D |

3. Dataset Content and Diversity

Million-scale pose datasets support wide variation in:

  • Object/subject classes: Omni6DPose annotates 581 real and 4162 synthetic object instances across 149 categories; PACE covers 576 instances (44 categories), including articulated items with hierarchical kinematics.
  • Material and physical diversity: StereOBJ-1M intentionally contains transparent, reflective, and symmetric objects; Omni6DPose balances ~70% diffuse, 15% transparent, and 15% specular instances, with explicit symmetry type annotation.
  • Scene complexity: PACE’s real and synthetic subsets yield heavily cluttered multi-object frames (2–6 objects per image) with extensive occlusion (significant occlusion in ≥40% of frames). RenderIH produces hand-hand interactions with dense contact and varied viewpoints and backgrounds.
  • Temporal or sequence data: Datasets like InterHand2.6M, WorldPose (Jiang et al., 6 Jan 2025), and StereOBJ-1M include video sequences to enable tracking and action recognition. WorldPose provides 2.5M full-body pose tracks from multi-person, multi-view field sports.

4. Benchmark Tasks, Metrics, and Model Evaluation

Common benchmark tasks in million-scale pose datasets include:

  • Single-frame pose estimation (6D/3D): Object pose (Omni6DPose: ADD, ADD-S, AUC@IoU), hand keypoint estimation (InterHand2.6M: MPJPE, MRPE), and human full-body/SMPL fitting (WorldPose: MPJPE, PA-MPJPE, G-MPJPE); code sketches for the core metrics follow below.
  • Pose tracking (video): Object and human pose tracking over time, with metrics such as AUC of ADD(-S), mIoU, MOTA, MOTP.
  • Category-level evaluation: Generalization to unseen object instances or categories, as emphasized in Omni6DPose and PACE, surfaces challenges in scaling model capacity.
  • Robustness to clutter, occlusion, and material: Benchmarks report sharp performance drops for deep models under heavy occlusion, high intra-class variability, or challenging materials (PACE, StereOBJ-1M).
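
A minimal sketch of ADD and ADD-S as commonly defined for 6D object-pose evaluation (and referenced in the table below); benchmark implementations may differ in model-point sampling and thresholding:

```python
# ADD: mean distance between corresponding transformed model points.
# ADD-S: mean nearest-neighbor distance, tolerant to object symmetries.
import numpy as np
from scipy.spatial import cKDTree


def add_error(R_gt, t_gt, R_pred, t_pred, model_pts):
    """ADD over an (N, 3) model point cloud."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    return np.linalg.norm(gt - pred, axis=1).mean()


def add_s_error(R_gt, t_gt, R_pred, t_pred, model_pts):
    """ADD-S: symmetry-aware variant using nearest-neighbor matching."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    dists, _ = cKDTree(pred).query(gt, k=1)
    return dists.mean()
```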

Evaluation Metrics Overview

| Metric | Description | Usage |
| --- | --- | --- |
| ADD / ADD-S | Average model-point distance (asymmetric / symmetry-aware) | 6D object pose |
| MPJPE | Mean per-joint position error | Body/hand pose |
| VUS@θ°, d cm | Volume under the pose-error curve at rotation/translation thresholds | Object pose |
| mIoU | Mean intersection over union (3D masks) | Tracking |
| AR, AP@θ, δ | Average recall/precision at angular and translation cutoffs | Detection |
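
Similarly, MPJPE and its Procrustes-aligned variant can be sketched as follows, using the standard Kabsch/Umeyama similarity alignment; official evaluation scripts may differ in details such as root-joint handling:

```python
# MPJPE and PA-MPJPE over (J, 3) arrays of predicted and ground-truth joints.
import numpy as np


def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())


def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """MPJPE after optimal similarity (Procrustes) alignment to gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    Xp, Xg = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(Xp.T @ Xg)           # Kabsch/Umeyama
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflection
    R = Vt.T @ D @ U.T                            # optimal rotation
    s = (S * np.diag(D)).sum() / (Xp ** 2).sum()  # optimal scale
    return mpjpe(s * Xp @ R.T + mu_g, gt)
```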

Notably, GenPose++ (Zhang et al., 6 Jun 2024) achieves AUC@IoU₂₅ = 39.0 and VUS@10°, 5 cm = 29.4 on Omni6DPose, outperforming all regression-based baselines due to generative sampling and semantic-aware feature fusion.

5. Applications Across Research Domains

Million-scale pose datasets underpin a broad spectrum of research directions:

  • End-to-end training of robust pose networks: Enables deep models to generalize across object families, materials, and occlusion conditions (Omni6DPose, InterHand2.6M).
  • Sim-to-real adaptation: PACE, Omni6DPose, and RenderIH exploit paired synthetic/real data to study domain transfer, with observed sim-to-real gaps motivating domain randomization and synthetic augmentation strategies (see the randomization sketch after this list).
  • Action recognition and activity parsing: IKEA ASM (not detailed here), WorldPose, and InterHand2.6M support holistic learning of activities, leveraging atomic action and temporal cues.
  • Articulated and interaction modeling: RenderIH and PACE provide minimally ambiguous ground truth for interacting hands and multi-part objects, exposing unique challenges for part-aware estimation and explicit kinematic priors.
  • Foundation model pretraining: The multi-million annotation count and scale-realism balance position these datasets as viable sources for pretraining pose “foundation” models for downstream robotic manipulation, AR/VR, and group analysis.
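
As a flavor of the domain-randomization strategies mentioned in the sim-to-real bullet above, the sketch below composites a rendered object over a random background and applies simple photometric jitter. It is a toy illustration; production pipelines also randomize lighting, materials, camera poses, and distractor objects:

```python
# Toy domain-randomization step: random background + photometric jitter.
import numpy as np

rng = np.random.default_rng(0)


def randomize(render: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """render: (H, W, 3) floats in [0, 1]; mask: (H, W) bool object mask."""
    h, w, _ = render.shape
    background = rng.uniform(0.0, 1.0, size=(h, w, 3))   # random backdrop
    out = np.where(mask[..., None], render, background)  # composite object
    out = out * rng.uniform(0.7, 1.3, size=(1, 1, 3))    # per-channel gain
    out = out + rng.normal(0.0, 0.02, size=out.shape)    # sensor-like noise
    return np.clip(out, 0.0, 1.0)
```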

6. Open Challenges and Future Directions

Despite their scale, million-pose datasets reveal persistent limitations and research opportunities:

  • Object and scene scale: Extending datasets to hundreds or thousands of object models (PACE, Omni6DPose) is ongoing; most real datasets remain category- or instance-limited.
  • Dynamic and multi-agent motion: Datasets such as StereOBJ-1M are quasi-static; introducing dynamic motion, handheld scanning, or spontaneous multi-agent interactions remains difficult.
  • Annotation scalability: While automatic propagation and synthetic data reduce labeling cost, true in-the-wild, diverse, and finely annotated million-pose collections still tax manual and computational resources.
  • Evaluation under real-world conditions: Sim-to-real gaps, heavy occlusion, and articulated objects still degrade performance of current pose networks, highlighting the need for new architectures and domain adaptation techniques.
  • Universal models and metrics: The diversity of object types, symmetries, and scene structures observed in these datasets motivates the development of universal pose representations and robust, generalizable error metrics (as in GenPose++ (Zhang et al., 6 Jun 2024)).

This suggests that future datasets may further unify multi-modal, multi-category, and multi-task annotations, while annotation and benchmarking protocols continue to evolve to meet the dual demands of large scale and high fidelity.
