
Skeleton-Based Action Recognition Benchmarks

Updated 18 January 2026
  • Skeleton-based action recognition benchmarks are standardized datasets and protocols designed to measure algorithm performance using defined splits and metrics.
  • They evaluate diverse modalities, including 3D Kinect-captured and 2D RGB-derived skeletons, with key metrics like Top-1 accuracy and mean class accuracy.
  • These benchmarks guide methodological improvements by assessing performance under supervised, noisy-label, unsupervised, and open-set conditions.

Skeleton-based action recognition benchmarks provide standardized datasets, protocols, and evaluation criteria for the development and fair assessment of algorithms tasked with recognizing human actions from articulated body pose sequences. The field encompasses varying sensor modalities, labeling conditions, action taxonomies, and real-world challenges. Benchmarks are critical for measuring progress, elucidating method strengths and weaknesses, and exposing domain transfer and generalizability issues within skeleton-based action recognition.

1. Major Datasets and Modalities

The most extensively adopted skeleton-based benchmarks are dominated by large-scale 3D Kinect-captured datasets, augmented in recent years by 2D skeleton benchmarks derived from RGB pose estimation in-the-wild. The canonical datasets include:

| Dataset | #Classes | #Subjects | #Joints | Modality | Typical Splits |
|---|---|---|---|---|---|
| NTU RGB+D 60 | 60 | 40 | 25 | 3D | Cross-Subject, Cross-View |
| NTU RGB+D 120 | 120 | 106 | 25 | 3D | Cross-Subject, Cross-Setup |
| Kinetics-Skeleton | 400 | – | 18 | 2D | Standard train/val |
| ANUBIS | 102 | 80 | 32 | 3D | Cross-Subject, Cross-View |
| FineGYM | 99 | – | 18 | 2D | Mean per-class Top-1 |
| Skeletics-152 | 152 | – | 18–25 | 3D | Standard splits |
| Toyota SmartHome | 31 | – | 15 | 3D | Open-set protocols |

NTU RGB+D (NTU-60/NTU-120) remains the de facto 3D laboratory benchmark, utilizing Microsoft Kinect V2 for full-body 3D joint capture. Kinetics-Skeleton, Skeletics-152, and FineGYM represent in-the-wild, 2D or pose-tracked datasets where skeletons are extracted by vision-based pose estimators on unconstrained RGB video. Key modalities are joint coordinates $\mathbf{j}_{i,t} = (x_{i,t}, y_{i,t}, z_{i,t})$, bone vectors $\mathbf{b}_{i,j,t} = \mathbf{j}_{j,t} - \mathbf{j}_{i,t}$, and joint motion $\mathbf{m}_{i,t} = \mathbf{j}_{i,t+1} - \mathbf{j}_{i,t}$, often processed as parallel input streams (Xu et al., 2024).
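
The three streams are straightforward to derive from a raw joint sequence. Below is a minimal sketch, assuming a clip stored as a NumPy array of shape (T, V, C) and a per-joint parent index; the function and variable names are illustrative rather than taken from any benchmark toolkit.

```python
import numpy as np

def build_streams(joints: np.ndarray, parents: list) -> tuple:
    """Return joint, bone, and joint-motion streams as parallel arrays.

    joints: array of shape (T, V, C) -- T frames, V joints, C coords (2 or 3).
    parents: parent index for each joint (the dataset's kinematic tree).
    """
    # Bone stream: vector from each joint's parent to the joint itself.
    bones = joints - joints[:, parents, :]
    # Motion stream: frame-to-frame displacement; last frame padded with zeros.
    motion = np.zeros_like(joints)
    motion[:-1] = joints[1:] - joints[:-1]
    return joints, bones, motion

# Toy example: 64 frames of an NTU-style 25-joint 3D skeleton (random data).
T, V, C = 64, 25, 3
joints = np.random.randn(T, V, C).astype(np.float32)
parents = [0] + list(range(V - 1))  # placeholder chain; real datasets define their own tree
joint_stream, bone_stream, motion_stream = build_streams(joints, parents)
```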

2. Evaluation Protocols and Performance Metrics

Benchmarks employ dataset-specific, rigorously defined splits and evaluation measures to ensure comparability (a minimal split sketch follows the list below):

  • Cross-Subject (CS/XSub): Training and test sets partitioned by subject IDs.
  • Cross-View (CV/XView): Partitioning by camera viewpoint(s).
  • Cross-Setup (CSet/XSet): Split by capture setups or environments.
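
As a concrete illustration of the cross-subject protocol, the sketch below partitions a sample list by performer ID; the subject IDs and sample records are placeholders, not the official split definition of any dataset.

```python
# Minimal cross-subject split: samples whose performer ID is in the training
# subject set go to train, the rest to test. IDs and records are placeholders.
TRAIN_SUBJECTS = {1, 2, 4, 5, 8}

samples = [
    {"name": "S001C001P001R001A001", "subject": 1, "label": 0},
    {"name": "S001C002P003R001A007", "subject": 3, "label": 6},
]

train_set = [s for s in samples if s["subject"] in TRAIN_SUBJECTS]
test_set = [s for s in samples if s["subject"] not in TRAIN_SUBJECTS]
```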

Performance is typically measured by Top-1 accuracy, $\mathrm{Acc}_1 = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\hat{y}_i = y_i]$, and, when class imbalance is significant, by mean class accuracy, $A_{\mathrm{mC}@1} = \frac{1}{K} \sum_{c=1}^{K} \frac{1}{N_c} \sum_{i: y_i = c} \mathbf{1}[\hat{y}_i = c]$. For detection or multi-label tasks (e.g., AVA, PKU-MMD), mean average precision (mAP) at fixed IoU thresholds is used (Duan et al., 2023, Zhang et al., 2024).
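
Both metrics reduce to a few lines of array arithmetic; the following sketch assumes NumPy arrays of predicted and ground-truth integer labels (function names are ours).

```python
import numpy as np

def top1_accuracy(pred: np.ndarray, target: np.ndarray) -> float:
    """Fraction of samples whose predicted label matches the ground truth."""
    return float(np.mean(pred == target))

def mean_class_accuracy(pred: np.ndarray, target: np.ndarray) -> float:
    """Average of per-class accuracies, robust to class imbalance."""
    classes = np.unique(target)
    per_class = [np.mean(pred[target == c] == c) for c in classes]
    return float(np.mean(per_class))

pred = np.array([0, 1, 1, 2, 2, 2])
target = np.array([0, 1, 2, 2, 2, 2])
print(top1_accuracy(pred, target))        # 5/6 = 0.833...
print(mean_class_accuracy(pred, target))  # (1.0 + 1.0 + 0.75) / 3 = 0.9166...
```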

Open-set protocols gauge the capacity to reject unseen classes, using metrics such as O-AUROC/AUPR alongside Top-1 (Peng et al., 2023). Benchmarks for noisy labels employ strictly clean test sets with varying noise ratios $r$ on the training set, and report mean Top-1 over multiple noise levels (Xu et al., 2024).
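
As one way to compute the open-set metric, the sketch below derives an O-AUROC from maximum softmax probability used as a "known-ness" score, a common open-set baseline; the cited benchmarks may employ different scoring functions, and the toy scores here are purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy softmax outputs for four test clips over three seen classes.
softmax_scores = np.array([
    [0.90, 0.05, 0.05],   # confident -> likely a seen class
    [0.40, 0.35, 0.25],   # uncertain -> likely an unseen class
    [0.85, 0.10, 0.05],
    [0.34, 0.33, 0.33],
])
is_seen = np.array([1, 0, 1, 0])  # 1 = sample belongs to a seen class

knownness = softmax_scores.max(axis=1)       # higher = more likely seen
o_auroc = roc_auc_score(is_seen, knownness)  # 1.0 on this toy example
print(o_auroc)
```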

3. Benchmark Methodologies and Labeling Conditions

Benchmarks increasingly target varied annotation regimes and environmental realism:

  • Fully Supervised: Standard protocol; e.g., NTU60/120 with exhaustive manual activity labels per sequence (Duan et al., 2022).
  • Noisy-Label Regime: Controlled injection of symmetric label noise, with sample labels $y$ replaced by random incorrect classes at noise ratio $r$ (Xu et al., 2024); see the sketch after this list. The NoiseEraSAR benchmark uniquely quantifies recognition robustness under such noise.
  • Unsupervised/Self-Supervised: Benchmarks support unsupervised protocols with no label access (Su et al., 2019) and self-supervised protocols with context-based, generative, and contrastive pretexts, e.g., ST-Puzzle, MSM, SkeletonCLR (Zhang et al., 2024). Downstream evaluations span recognition, retrieval, detection, and few-shot learning.
  • Open-Set Recognition: Not all action classes are known at train time; models must classify seen classes and reject the unknown. Protocols sample per-split sets of unknowns, reporting open/closed-set metrics (Peng et al., 2023).
  • Multi-Person and Group Activities: Datasets such as Volleyball encode inter-person interactions; benchmarks report per-clip accuracies and, in some cases, spatio-temporal detection metrics (Duan et al., 2023).
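
A minimal sketch of the symmetric noise injection referenced above, assuming integer class labels; the function and variable names are ours, not taken from the benchmark's reference code.

```python
import numpy as np

def inject_symmetric_noise(labels: np.ndarray, num_classes: int, r: float,
                           seed: int = 0) -> np.ndarray:
    """Replace a fraction r of labels with uniformly drawn *incorrect* classes."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n_noisy = int(round(r * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_noisy, replace=False)
    for i in flip_idx:
        wrong_classes = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(wrong_classes)
    return noisy

clean_labels = np.random.default_rng(1).integers(0, 60, size=1000)  # e.g. NTU-60 labels
noisy_labels = inject_symmetric_noise(clean_labels, num_classes=60, r=0.4)
print((noisy_labels != clean_labels).mean())  # ~0.4
```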

4. Comparative Results and Baseline Architectures

Benchmarks drive the assessment of GCN, 2s-AGCN, MS-G3D, CTR-GCN, ST-GCN++, AFE-CNN, and Transformer-based strategies, often under standardized augmentations and preprocessing (Duan et al., 2022, Guan et al., 2022). Representative benchmark results:

| Dataset | Method | Metric | Accuracy (%) |
|---|---|---|---|
| NTU60 XSub | CTR-GCN | Top-1 | 92.1–92.4 |
| NTU60 XView | ST-GCN++ | Top-1 | 97.0 |
| NTU120 XSub | MS-G3D | Top-1 | 87.8 |
| Kinetics-400 | PoseC3D | Top-1 | 49.1 |
| ANUBIS XView | 2s-AGCN | Top-1 | 59.1 |
| FineGYM | PoseC3D | Mean Class | 94.1 |
| Skeletics-152 | 4s-ShiftGCN | Top-1 | 57.0 |

Performance varies substantially with data domain: controlled 3D settings (NTU) consistently exceed 87% Top-1, whereas in-the-wild benchmarks (Skeletics-152, Kinetics-Skeleton) drop to roughly 55%–57%. Robustness to noisy labels, open-set conditions, and domain shift is an emergent focus area, with baseline methods (SOP, NPC, naive GCNs) often yielding only marginal gains compared to specialized frameworks such as NoiseEraSAR or CrossMax (Xu et al., 2024, Peng et al., 2023).

5. Preprocessing, Augmentation, and Good Practices

Benchmark protocols standardize preprocessing for fair evaluation:

  • 3D Skeletons: Centering by hip/root, normalization by body height, and alignment of the first-frame spine to the canonical axis (Duan et al., 2022, Xu et al., 2024).
  • 2D Skeletons: Min–max normalization of $x, y$ coordinates to $[-1, 1]$ per clip, denoising (dropping clearly erroneous frames), and simple tracklet association for multi-person frames (Duan et al., 2022).
  • Temporal Handling: Uniform resampling (not zero-padding), with optional loop padding for sequence-length regularization.
  • Augmentations: Random rotation, scaling, temporal jitter, and joint-level dropout, which are essential for generalization, especially under cross-dataset evaluation (Liu et al., 2024). A minimal preprocessing and augmentation sketch follows this list.
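
A minimal sketch of the 3D preprocessing and augmentation steps listed above, assuming a clip of shape (T, V, 3); the root-joint index, target clip length, and rotation range are illustrative choices rather than values prescribed by any particular benchmark.

```python
import numpy as np

ROOT_JOINT = 0       # e.g. pelvis/hip; dataset-dependent
TARGET_FRAMES = 64   # fixed clip length after uniform resampling

def preprocess(clip: np.ndarray) -> np.ndarray:
    """Root-center, scale-normalize, and uniformly resample a (T, V, 3) clip."""
    clip = clip - clip[:, ROOT_JOINT:ROOT_JOINT + 1, :]       # center on the root joint
    scale = np.linalg.norm(clip, axis=-1).max()               # rough body scale
    clip = clip / (scale + 1e-6)
    idx = np.linspace(0, len(clip) - 1, TARGET_FRAMES).round().astype(int)
    return clip[idx]                                          # uniform resampling, no zero padding

def augment(clip: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random rotation about the vertical axis plus a global scale jitter."""
    theta = rng.uniform(-np.pi / 6, np.pi / 6)
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0,            0.0,           1.0]])
    return (clip @ rot.T) * rng.uniform(0.9, 1.1)
```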

Ablation studies confirm that augmentations, uniform sampling, and spatial alignment directly impact test accuracy and inter-benchmark comparability.

6. Benchmark Extensions: Multimodality and Cross-Domain

Recent benchmarks and methods have embraced multimodal, cross-dataset, and open-world recognition challenges:

  • Multimodality: Benchmarks such as MMCL incorporate contrastive and soft-label refinement using RGB and text features at training, maintaining skeleton-only efficiency at inference (Liu et al., 2024). Improvements in zero-shot transfer (e.g., SYSU-Action 27.5%→42.5%) are directly reported.
  • Domain Generalization: Evaluations on untrimmed, noisy, and multi-person sequences (e.g., Skeletics-152, Volleyball, Toyota SmartHome) expose domain gaps and highlight the limited transferability from lab to wild (Gupta et al., 2020).
  • Open-Set and Unsupervised Protocols: OS-SAR benchmarks prescribe held-out class splits, with methods such as CrossMax addressing latent alignment across modalities and outperforming standard open-set baselines by 5–10 points in O-AUROC (Peng et al., 2023). Predict-and-Cluster benchmarks in the unsupervised scenario approach the accuracy of supervised models on NW-UCLA, UWA3D, NTU60 (Su et al., 2019).

7. Benchmark Recommendations and Future Directions

Benchmark design is evolving to quantify robustness, transferability, and open-world competence:

  • For noisy labels: Recommend injecting symmetric noise at multiple rates (20–80%), reporting performance under both cross-subject and cross-view splits, and comparing co-teaching, sample selection, and multi-modality fusion (Xu et al., 2024).
  • For domain transfer: Recommend benchmarking on both controlled (NTU-120) and in-the-wild datasets (Skeletics-152, Metaphorics), with and without pre-training, to quantify transferability (Gupta et al., 2020).
  • For open-set/unlabeled action recognition: Protocols should include standard open/closed splits, O-AUROC/AUPR/closed-set accuracy, and ablations under varying noise and occlusion.
  • Multi-person/group activity: Recommend metrics that capture inter-person interaction accuracy and per-class AP for group activities (Duan et al., 2023).

Open research avenues highlighted by benchmarks: finger-/hand-level skeletons for fine-grained action, multi-modal (RGB, depth, inertial) fusion, unsupervised/self-supervised representation learning for label-scarce settings, dynamic/online recognition, and benchmark tasks explicitly measuring cross-domain and in-the-wild robustness (Qin et al., 2022, Zhang et al., 2024, Guan et al., 2022).


By consolidating large-scale, systematic, and diverse benchmarks, the field of skeleton-based action recognition is positioned to rigorously assess algorithmic advances, facilitate fair comparison under a wide array of conditions, and guide the development of future intelligent systems robust to the complexities of real-world human activity understanding (Xu et al., 2024, Zhang et al., 2024, Gupta et al., 2020, Qin et al., 2022, Liu et al., 2024, Liu et al., 2025, Peng et al., 2023, Duan et al., 2023, Duan et al., 2022, Guan et al., 2022, Su et al., 2019).
