Skeleton-Based Action Recognition Benchmarks
- Skeleton-based action recognition benchmarks are standardized datasets and protocols designed to measure algorithm performance using defined splits and metrics.
- They evaluate diverse modalities, including 3D Kinect-captured and 2D RGB-derived skeletons, with key metrics like Top-1 accuracy and mean class accuracy.
- These benchmarks guide methodological improvements by assessing performance under supervised, noisy-label, unsupervised, and open-set conditions.
Skeleton-based action recognition benchmarks provide standardized datasets, protocols, and evaluation criteria for the development and fair assessment of algorithms tasked with recognizing human actions from articulated body pose sequences. The field encompasses varying sensor modalities, labeling conditions, action taxonomies, and real-world challenges. Benchmarks are critical for measuring progress, elucidating method strengths and weaknesses, and exposing domain transfer and generalizability issues within skeleton-based action recognition.
1. Major Datasets and Modalities
The most extensively adopted skeleton-based benchmarks are dominated by large-scale 3D Kinect-captured datasets, augmented in recent years by 2D skeleton benchmarks derived from RGB pose estimation in-the-wild. The canonical datasets include:
| Dataset | #Classes | #Subjects | #Joints | Modality | Typical Splits |
|---|---|---|---|---|---|
| NTU RGB+D 60 | 60 | 40 | 25 | 3D | Cross-Subject, Cross-View |
| NTU RGB+D 120 | 120 | 106 | 25 | 3D | Cross-Subject, Cross-Setup |
| Kinetics-Skeleton | 400 | – | 18 | 2D | Standard train/val |
| ANUBIS | 102 | 80 | 32 | 3D | Cross-Subject, Cross-View |
| FineGYM | 99 | – | 18 | 2D | Mean per-class Top-1 |
| Skeletics-152 | 152 | – | 18–25 | 3D | Standard splits |
| Toyota SmartHome | 31 | – | 15 | 3D | Open-set protocols |
NTU RGB+D (NTU-60/NTU-120) remains the de facto 3D laboratory benchmark, utilizing Microsoft Kinect V2 for full-body 3D joint capture. Kinetics-Skeleton, Skeletics-152, and FineGYM represent in-the-wild, 2D or pose-tracked datasets where skeletons are extracted by vision-based pose estimators on unconstrained RGB video. Key modalities considered are joint coordinates $\mathbf{x}_{t,v}$, bone vectors $\mathbf{b}_{t,v} = \mathbf{x}_{t,v} - \mathbf{x}_{t,\mathrm{par}(v)}$ (from joint $v$'s kinematic parent), and joint motion $\mathbf{m}_{t,v} = \mathbf{x}_{t+1,v} - \mathbf{x}_{t,v}$, often processed as parallel input streams (Xu et al., 2024).
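A minimal sketch of how the three streams can be derived from raw joint coordinates, assuming a NumPy array of shape (frames, joints, coordinates) and an illustrative parent list rather than the actual NTU kinematic tree:

```python
import numpy as np

def build_streams(joints, parents):
    """Derive the three standard input streams from raw skeleton data.

    joints  : array of shape (T, V, C) -- T frames, V joints, C coords (2 or 3)
    parents : length-V list, parents[v] = index of joint v's parent
              (the root may point to itself)
    Returns (joint, bone, motion) arrays, each of shape (T, V, C).
    """
    joint = joints.astype(np.float32)

    # Bone stream: vector from each joint's parent to the joint itself.
    bone = joint - joint[:, parents, :]

    # Motion stream: frame-to-frame displacement of each joint.
    motion = np.zeros_like(joint)
    motion[:-1] = joint[1:] - joint[:-1]

    return joint, bone, motion

# Illustrative usage with random data and a toy 5-joint chain (not a real skeleton).
if __name__ == "__main__":
    x = np.random.randn(64, 5, 3)       # 64 frames, 5 joints, 3D
    parents = [0, 0, 1, 2, 3]           # joint 0 is its own root
    j, b, m = build_streams(x, parents)
    print(j.shape, b.shape, m.shape)    # (64, 5, 3) each
```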
2. Evaluation Protocols and Performance Metrics
Benchmarks employ dataset-specific, rigorously defined splits and evaluation measures to ensure comparability:
- Cross-Subject (CS/XSub): Training and test sets partitioned by subject IDs (see the sketch after this list).
- Cross-View (CV/XView): Partitioning by camera viewpoint(s).
- Cross-Setup (CSet/XSet): Split by capture setups or environments.
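A minimal sketch of a subject-based partition, assuming each sample is a dict carrying a `subject_id` field; the training-subject IDs shown are placeholders, not the official NTU lists:

```python
def cross_subject_split(samples, train_subjects):
    """Partition samples by subject ID, as in cross-subject (XSub) protocols.

    samples        : iterable of dicts, each with a 'subject_id' key (assumed schema)
    train_subjects : set of subject IDs reserved for training
    """
    train = [s for s in samples if s["subject_id"] in train_subjects]
    test = [s for s in samples if s["subject_id"] not in train_subjects]
    return train, test

# Placeholder subject IDs for illustration only (not the official benchmark split).
train_ids = {1, 2, 4, 5, 8}
```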
Performance is typically measured by Top-1 accuracy,

$$\mathrm{Acc}_{\text{Top-1}} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\hat{y}_i = y_i\right],$$

and, when class imbalance is significant, mean class accuracy,

$$\mathrm{Acc}_{\text{mean}} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i:\,y_i = c}\mathbb{1}\left[\hat{y}_i = c\right],$$

where $N_c$ is the number of test samples in class $c$. For detection or multi-label tasks (e.g., AVA, PKU-MMD), mean average precision (mAP) at fixed IoU thresholds is used (Duan et al., 2023, Zhang et al., 2024).
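Both headline metrics reduce to simple computations over predicted and ground-truth label vectors; a minimal NumPy sketch:

```python
import numpy as np

def top1_accuracy(pred, target):
    """Fraction of samples whose predicted label equals the ground truth."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float((pred == target).mean())

def mean_class_accuracy(pred, target):
    """Average of per-class recalls; robust to class imbalance."""
    pred, target = np.asarray(pred), np.asarray(target)
    per_class = [(pred[target == c] == c).mean() for c in np.unique(target)]
    return float(np.mean(per_class))
```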
Open-set protocols gauge the capacity to reject unseen classes, using metrics such as O-AUROC/AUPR alongside Top-1 (Peng et al., 2023). Benchmarks for noisy labels employ strictly clean test sets, with varying noise ratios on the train set, and report mean Top-1 over multiple noise levels (Xu et al., 2024).
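A common open-set baseline scores each test sample by its maximum softmax confidence and measures how well that score separates seen from unseen classes; the sketch below (using scikit-learn) illustrates this generic max-softmax baseline, not the CrossMax method itself:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def open_set_scores(probs, is_known):
    """Evaluate open-set rejection from closed-set softmax outputs.

    probs    : (N, C) softmax probabilities over the seen (closed-set) classes
    is_known : (N,) boolean array, True if the sample belongs to a seen class
    Returns (O-AUROC, AUPR), treating 'known' as the positive class.
    """
    confidence = probs.max(axis=1)                 # max-softmax confidence score
    auroc = roc_auc_score(is_known, confidence)
    aupr = average_precision_score(is_known, confidence)
    return auroc, aupr
```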
3. Benchmark Methodologies and Labeling Conditions
Benchmarks increasingly target varied annotation regimes and environmental realism:
- Fully Supervised: Standard protocol; e.g., NTU60/120 with exhaustive manual activity labels per sequence (Duan et al., 2022).
- Noisy-Label Regime: Controlled injection of symmetric label noise, with sample labels replaced by random incorrect classes at noise ratio $\eta$ (Xu et al., 2024); a minimal injection sketch follows this list. The NoiseEraSAR benchmark uniquely quantifies recognition robustness under such noise.
- Unsupervised/Self-Supervised: Benchmarks support unsupervised (no label access) (Su et al., 2019), and self-supervised protocols with context-based, generative, and contrastive pretexts, e.g., ST-Puzzle, MSM, SkeletonCLR (Zhang et al., 2024). Downstream evaluations span recognition, retrieval, detection, and few-shot learning.
- Open-Set Recognition: Not all action classes are known at train time; models must classify seen classes and reject the unknown. Protocols sample per-split sets of unknowns, reporting open/closed-set metrics (Peng et al., 2023).
- Multi-Person and Group Activities: Datasets such as Volleyball encode inter-person interactions; benchmarks report per-clip accuracies and, in some cases, spatio-temporal detection metrics (Duan et al., 2023).
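A minimal sketch of symmetric label-noise injection at ratio $\eta$: a randomly chosen fraction of training labels is replaced by uniformly drawn incorrect classes (the seeding and offset scheme here are illustrative):

```python
import numpy as np

def inject_symmetric_noise(labels, num_classes, eta, seed=0):
    """Replace a fraction eta of labels with uniformly sampled incorrect classes.

    labels      : (N,) integer ground-truth labels
    num_classes : total number of classes
    eta         : noise ratio in [0, 1], e.g. 0.2 ... 0.8
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    n_noisy = int(round(eta * len(labels)))
    idx = rng.choice(len(labels), size=n_noisy, replace=False)
    for i in idx:
        # Shift by 1..num_classes-1 so the corrupted label is always incorrect.
        offset = rng.integers(1, num_classes)
        labels[i] = (labels[i] + offset) % num_classes
    return labels
```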
4. Comparative Results and Baseline Architectures
Benchmarks drive the assessment of GCN-based (2s-AGCN, MS-G3D, CTR-GCN, ST-GCN++), CNN-based (AFE-CNN, PoseC3D), and Transformer-based strategies, often under standardized augmentations and preprocessing (Duan et al., 2022, Guan et al., 2022). Representative benchmark results:
| Dataset | Method | Metric | Accuracy (%) |
|---|---|---|---|
| NTU60 XSub | CTR-GCN | Top-1 | 92.1–92.4 |
| NTU60 XView | ST-GCN++ | Top-1 | 97.0 |
| NTU120 XSub | MS-G3D | Top-1 | 87.8 |
| Kinetics-400 | PoseC3D | Top-1 | 49.1 |
| ANUBIS XView | 2s-AGCN | Top-1 | 59.1 |
| FineGYM | PoseC3D | Mean Class | 94.1 |
| Skeletics-152 | 4s-ShiftGCN | Top-1 | 57.0 |
Performance varies substantially with the data domain: controlled 3D settings (NTU) consistently reach 87% Top-1 or higher, whereas in-the-wild benchmarks fall well below that (e.g., 57.0% on Skeletics-152 and 49.1% on Kinetics-400 in the table above). Robustness to noisy labels, open-set conditions, and domain shift is an emerging focus area, with baseline methods (SOP, NPC, naive GCNs) often yielding only marginal gains compared to specialized frameworks such as NoiseEraSAR or CrossMax (Xu et al., 2024, Peng et al., 2023).
5. Preprocessing, Augmentation, and Good Practices
Benchmark protocols standardize preprocessing for fair evaluation:
- 3D Skeletons: Centering on the hip/root joint, normalization by body height, and alignment of the first-frame spine to the canonical axis (Duan et al., 2022, Xu et al., 2024); a preprocessing sketch follows this list.
- 2D Skeletons: Min–max normalization of the $(x, y)$ joint coordinates to $[0, 1]$ per clip, denoising (dropping clearly erroneous frames), and simple tracklet association for multi-person frames (Duan et al., 2022).
- Temporal Handling: Uniform resampling (not zero-padding), optional loop padding for sequence length regularization.
- Augmentations: Random rotation, scale, temporal jitter, and joint-level dropout, essential for generalization especially under cross-dataset evaluation (Liu et al., 2024).
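A minimal sketch of the 3D preprocessing and temporal handling above, plus one rotation augmentation; the root/neck joint indices and the use of the root-to-neck distance as a body-size proxy are assumptions for illustration:

```python
import numpy as np

def preprocess_3d(joints, root=0, neck=2, num_frames=64):
    """Prepare a 3D skeleton clip for training/evaluation.

    joints : (T, V, 3) array of 3D joint coordinates
    root   : index of the hip/root joint (assumed 0 here)
    neck   : index of the neck joint used as a scale reference (assumed)
    Returns a (num_frames, V, 3) array.
    """
    x = joints.astype(np.float32)

    # Center every frame on the root (hip) joint.
    x = x - x[:, root:root + 1, :]

    # Normalize scale by the first-frame root-to-neck distance (body-size proxy).
    scale = np.linalg.norm(x[0, neck] - x[0, root]) + 1e-6
    x = x / scale

    # Uniform temporal resampling to a fixed length (no zero-padding).
    idx = np.linspace(0, len(x) - 1, num_frames).round().astype(int)
    return x[idx]

def random_rotation_z(joints, max_deg=30.0, rng=None):
    """Augmentation: rotate the whole sequence by a random angle about the vertical (z) axis."""
    rng = rng or np.random.default_rng()
    theta = np.deg2rad(rng.uniform(-max_deg, max_deg))
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]], dtype=np.float32)
    return joints @ rot.T
```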
Ablation studies confirm that augmentations, uniform sampling, and spatial alignment directly impact test accuracy and inter-benchmark comparability.
6. Benchmark Extensions: Multimodality and Cross-Domain
Recent benchmarks and methods have embraced multimodal, cross-dataset, and open-world recognition challenges:
- Multimodality: Methods such as MMCL incorporate contrastive and soft-label refinement using RGB and text features during training while maintaining skeleton-only efficiency at inference (Liu et al., 2024). Improvements in zero-shot transfer (e.g., SYSU-Action 27.5%→42.5%) are directly reported.
- Domain Generalization: Evaluations on untrimmed, noisy, and multi-person sequences (e.g., Skeletics-152, Volleyball, Toyota SmartHome) expose domain gaps and highlight the limited transferability from lab to wild (Gupta et al., 2020).
- Open-Set and Unsupervised Protocols: OS-SAR benchmarks prescribe held-out class splits, with methods such as CrossMax addressing latent alignment across modalities and outperforming standard open-set baselines by 5–10 points in O-AUROC (Peng et al., 2023). In the unsupervised setting, Predict-and-Cluster approaches the accuracy of supervised models on NW-UCLA, UWA3D, and NTU-60 (Su et al., 2019).
7. Benchmark Recommendations and Future Directions
Benchmark design is evolving to quantify robustness, transferability, and open-world competence:
- For noisy labels: Recommend injecting symmetric noise at multiple rates (20–80%), reporting performance under both cross-subject and cross-view splits, and comparing co-teaching, sample selection, and multi-modality fusion (Xu et al., 2024).
- For domain transfer: evaluate on both controlled (NTU-120) and in-the-wild datasets (Skeletics-152, Metaphorics), with and without pre-training, to quantify transferability (Gupta et al., 2020).
- For open-set/unlabeled action recognition: Protocols should include standard open/closed splits, O-AUROC/AUPR/closed-set accuracy, and ablations under varying noise and occlusion.
- Multi-person/group activity: Recommend metrics that capture inter-person interaction accuracy and per-class AP for group activities (Duan et al., 2023).
Open research avenues highlighted by benchmarks: finger-/hand-level skeletons for fine-grained action, multi-modal (RGB, depth, inertial) fusion, unsupervised/self-supervised representation learning for label-scarce settings, dynamic/online recognition, and benchmark tasks explicitly measuring cross-domain and in-the-wild robustness (Qin et al., 2022, Zhang et al., 2024, Guan et al., 2022).
By consolidating large-scale, systematic, and diverse benchmarks, the field of skeleton-based action recognition is positioned to rigorously assess algorithmic advances, facilitate fair comparison under a wide array of conditions, and guide the development of future intelligent systems robust to the complexities of real-world human activity understanding (Xu et al., 2024, Zhang et al., 2024, Gupta et al., 2020, Qin et al., 2022, Liu et al., 2024, Liu et al., 2025, Peng et al., 2023, Duan et al., 2023, Duan et al., 2022, Guan et al., 2022, Su et al., 2019).