UniKPT: Unified Keypoint Detection Dataset
- UniKPT dataset is a unified benchmark for multi-object, multi-category keypoint detection, aggregating 13 datasets with 338 keypoint types over 1,237 classes.
- It employs balanced sampling and semantic unification strategies to reconcile annotation heterogeneity and ensure consistent labeling across diverse domains.
- The dataset supports promptable keypoint detection, 3D reconstruction, and cross-domain pose understanding, serving as a foundation for advanced computer vision tasks.
The UniKPT dataset is a unified, large-scale keypoint detection resource developed to facilitate generalized, multi-object, and multi-category keypoint estimation under diverse real-world scenarios. It aggregates and harmonizes numerous previously disjoint datasets covering articulated, rigid, soft, biological, and non-biological objects, providing a standardized benchmark for open-world keypoint localization tasks in computer vision. Originally introduced by the X-Pose framework and subsequently expanded and restructured for use in geometry-centric methods such as SirenPose, UniKPT serves as a backbone for advanced research in promptable keypoint detection, 3D reconstruction, and cross-domain pose understanding (Yang et al., 2023, Cai et al., 23 Dec 2025).
1. Dataset Construction and Scope
UniKPT is designed as a comprehensive corpus for multi-object, multi-category 2D and 3D keypoint detection. The initial incarnation, as described in X-Pose (Yang et al., 2023), consolidates 13 separate keypoint datasets spanning:
- Humans (COCO, Human-Art)
- Faces (300W-Face)
- Hands (OneHand10K)
- Quadrupeds and diverse mammals (AP-10K, APT-36K, MacaquePose, Animal Kingdom, AnimalWeb)
- Invertebrates (Vinegar Fly, Desert Locust)
- Rigid and soft objects (Keypoint-5, MP-100)
The unified dataset encapsulates:
- 226,547 images
- 418,487 annotated object instances
- 338 unique keypoint types
- 1,237 object categories (including 1,216 biological species)
SirenPose (Cai et al., 23 Dec 2025) expands UniKPT to approximately 600,000 annotated instances, maintaining the same categorical structure but focusing exclusively on category-agnostic keypoints for 3D supervision.
| Release | Instances | Images | Keypoints | Categories |
|---|---|---|---|---|
| X-Pose | 418,487 | 226,547 | 338 | 1,237 |
| SirenPose | ~600,000 | — (not given) | 338+ | 1,237 |
The image count for SirenPose is not specified; all 600k instances are used for training, with no held-out split provided.
2. Unification Strategies and Semantic Hierarchy
UniKPT employs an explicit strategy to reconcile annotation heterogeneity and taxonomic diversity:
- Balanced sampling: Equalizes appearance, style, pose, occlusion, and viewpoint distributions within and across categories.
- Keypoint unification: Assigns each keypoint type a single textual label and merges identical or homologous anatomical locations across species (e.g., “left eye” for all animals).
- Category hierarchy: Super-class organization (e.g., humans, mammals, invertebrates, objects), each with a consistent set of keypoints. Per-category keypoint counts are consolidated using textual and anatomical correspondences—for example, MP-100’s 561 annotated points are merged to 293 unified types.
- Semantic naming: Keypoints reflect anatomical or functionally descriptive locations such as “snout tip,” “left front hoof,” or “cap brim center,” supporting transfer learning and prompt-based querying.
This enables open-domain promptability and supports learning models that generalize across wide-ranging visual concepts and object types (Yang et al., 2023).
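The unification strategy above can be sketched as a lookup that maps dataset-specific keypoint labels to one canonical semantic name per anatomical location. The mapping entries and dataset identifiers below are hypothetical illustrations, not UniKPT's actual correspondence table.

```python
# Hypothetical sketch of UniKPT-style keypoint unification: each
# (source dataset, source label) pair maps to one unified textual label.
# The entries below are invented for illustration only.
UNIFIED_LABELS = {
    ("coco", "left_eye"): "left eye",
    ("ap10k", "l_eye"): "left eye",
    ("animalweb", "left_eye_center"): "left eye",
    ("coco", "nose"): "nose",
}

def unify_keypoint(dataset: str, label: str) -> str:
    """Map a source-dataset keypoint label to its unified semantic name,
    falling back to the original label when no correspondence is known."""
    return UNIFIED_LABELS.get((dataset, label), label)

print(unify_keypoint("coco", "left_eye"))   # -> left eye
print(unify_keypoint("ap10k", "l_eye"))     # -> left eye
```

In practice such a table would be built from the textual and anatomical correspondences described above, so homologous points across species share one promptable name.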
3. Annotation Schema, Formatting, and Storage
Annotation is provided in COCO-style JSON files for the original UniKPT release:
- Each instance includes per-keypoint x, y pixel coordinates, a visibility flag (COCO convention: 0 = not labeled, 1 = labeled but not visible, 2 = labeled and visible), a category ID, a bounding box, and a human-readable label.
- The file structure comprises images/ and annotations/ directories partitioned by split.
- The list of keypoints per instance is represented by a flattened (x, y, v) array of length $3K$, where $K$ is the number of keypoint types defined for the instance's category (at most 338 in the unified set).
For the SirenPose expansion:
- Annotations are stored under a top-level directory containing per-category JSON files, each structured as:
```json
{
  "instance_id": ...,
  "category": ...,
  "keypoints": [[x_1, y_1, z_1], ..., [x_M, y_M, z_M]],
  "adjacency": [[i_1, j_1], ..., [i_{|E|}, j_{|E|}]]
}
```
- Keypoints are given as 3D coordinates and are treated as fully visible; there are no image files, bounding boxes, or visibility flags in this release. Optionally, shape proxies (.obj files) are included as coarse models per instance.
- The adjacency field defines the undirected graph structure over keypoints, supporting geometric consistency in downstream methods (Cai et al., 23 Dec 2025).
4. Training Splits, Licensing, and Download
For the initial dataset (Yang et al., 2023):
- Recommended splits:
- Train: 204,000 images
- Val: 11,000 images
- Test: 11,547 images
- Splits preserve class and dataset balance to maximize representation across object types.
Each original dataset retains its individual license (COCO: CC-BY-4.0, Animal Kingdom: CC-BY-NC, etc.). The UniKPT wrapper and tooling are MIT-licensed for research use; aggregate usage is for non-commercial/academic research, as outlined in LICENSE.txt provided in the repository.
For SirenPose (Cai et al., 23 Dec 2025), no explicit split or held-out test set is defined; all 600,000 examples are used for supervised training.
The primary repositories are:
- https://github.com/IDEA-Research/X-Pose (X-Pose, original UniKPT)
- Prebuilt scripts download and process the source data; manifests and integrity checks are included.
5. Evaluation Metrics and Protocols
UniKPT establishes standard evaluation criteria suitable for promptable and unified keypoint detection:
- Average Precision (AP): Computed over object-keypoint predictions using Object Keypoint Similarity (OKS), analogous to COCO methodology. Evaluated at multiple thresholds (AP@[.50:.95], AP@.50, AP@.75).
- Percentage of Correct Keypoints (PCK): $\mathrm{PCK}@\alpha = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\!\left[\lVert \hat{p}_k - p_k \rVert_2 \le \alpha \cdot d\right]$, where $d$ is the maximum of the predicted bounding box width/height and $\alpha$ is a normalized threshold (commonly 0.1 or 0.2).
- Metrics are reported per super-category (e.g., human, mammal, invertebrate) and averaged across all 1,237 categories. Classes with fewer than 20 test instances are omitted to suppress variance.
In the context of SirenPose, keypoint annotations are leveraged to compute L2 positional loss and a geometric consistency loss, with the adjacency graph governing pairwise keypoint relationships. No additional dataset-specific losses are introduced (Cai et al., 23 Dec 2025).
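The two loss terms can be sketched as follows. The exact SirenPose formulation is not reproduced here; the edge-length penalty below is one plausible instantiation of a geometric consistency term over the adjacency graph, and all inputs are synthetic.

```python
import math

# Hedged sketch of the two supervision signals described above: a mean
# squared (L2) positional loss over 3D keypoints, plus a geometric
# consistency term that penalizes deviation of predicted edge lengths
# from ground-truth edge lengths along the adjacency graph. This is an
# illustrative instantiation, not SirenPose's published loss.
def keypoint_losses(pred, gt, edges):
    """pred, gt: lists of (x, y, z) keypoints; edges: (i, j) index pairs."""
    l2 = sum(sum((p - g) ** 2 for p, g in zip(pk, gk))
             for pk, gk in zip(pred, gt)) / len(gt)
    geo = sum((math.dist(pred[i], pred[j]) - math.dist(gt[i], gt[j])) ** 2
              for i, j in edges) / len(edges)
    return l2, geo

gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (1.1, 1.0, 0.0)]
l2, geo = keypoint_losses(pred, gt, [(0, 1), (1, 2)])
print(round(l2, 4), round(geo, 4))
```

Note that a rigid translation of all predicted keypoints leaves the geometric term at zero while the positional term grows, which is why the two losses are complementary.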
6. Application Domains and Methodological Relevance
UniKPT is employed in a breadth of challenging vision problems including:
- Open-world animal behavior analysis and ethology
- Fine-grained human-computer interaction (face/hand/pose tracking)
- Robotic manipulation based on contact and handle point estimation
- AR/VR anchoring via prompt-based multi-object pose detection
- Dynamic scene reconstruction and geometric supervision for monocular video (as in SirenPose (Cai et al., 23 Dec 2025))
Its unprecedented breadth—338 keypoint types over 1,237 classes—enables cross-domain transfer learning and robust generalization to both seen and novel categories.
7. Limitations and Known Challenges
Despite the advantages of scope and unification, UniKPT presents several limitations:
- Residual class imbalance remains, with humans, domestic animals, and common objects overrepresented.
- Annotation heterogeneity, especially in the distribution and style (face-only vs. full-body) of keypoints, can hinder prompt consistency and model transfer.
- Certain semantic definitions (e.g., “mid-spine”) are ambiguous across taxonomic groups.
- Licensing varies by source; commercial use is not universally permitted.
- For the SirenPose expansion, only fully visible 3D keypoints are included; no image-space, bounding box, or occlusion metadata is provided. The dataset also lacks explicit train/test splits in this version (Cai et al., 23 Dec 2025).
A plausible implication is that, while UniKPT provides a foundation for any-keypoint detection, its utility in 3D geometric contexts is best realized by additional domain-specific preprocessing or augmentation, depending on the downstream framework.
References
- "X-Pose: Detecting Any Keypoints" (Yang et al., 2023)
- "SirenPose: Dynamic Scene Reconstruction via Geometric Supervision" (Cai et al., 23 Dec 2025)