NTU RGB+D 60/120 Benchmarks
- NTU RGB+D 60/120 are comprehensive benchmarks featuring multi-modal and 3D skeleton data for evaluating human action recognition.
- They employ standardized evaluation protocols, including cross-subject and cross-view splits as well as seen/unseen class splits, to rigorously test supervised, zero-shot, and generalized zero-shot recognition.
- The datasets drive methodological advances by providing rich semantic annotations and performance metrics that support fine-grained action analysis.
NTU RGB+D 60/120 refers to two widely used large-scale benchmarks for skeleton-based human action recognition, NTU RGB+D 60 and NTU RGB+D 120. Both datasets were collected using Microsoft Kinect sensors and provide multi-modal, multi-angle recordings of complex human activities, focusing specifically on 3D skeleton data for advancing both supervised and zero-shot action recognition. These benchmarks are foundational for evaluating recent methodological advances in skeleton-based zero-shot action recognition (SZAR), generalized zero-shot learning (GZSL), and related cross-modal learning paradigms.
1. Dataset Structure, Collection Protocols, and Modalities
NTU RGB+D 60 ("NTU-60") comprises 60 action classes performed by 40 subjects, totaling 56,578 video samples. It features multi-view RGB, depth, infrared, and 3D skeleton modalities, with each skeleton comprising 25 body joints captured at 30 FPS. Actions span daily, interactional, and health-related movements, with each instance annotated for action label, skeleton sequence, and helper attributes. NTU RGB+D 120 ("NTU-120") extends this design with 120 action classes, 106 subjects, and over 114,000 samples, significantly diversifying the pose and viewpoint space.
Data are partitioned under the "cross-subject" (Xsub) protocol and either the "cross-view" (Xview, NTU-60) or "cross-setup" (Xsetup, NTU-120) protocol. Xsub splits subjects into disjoint train/test sets, while Xview/Xsetup divides samples by camera placement or collection setup. For zero-shot evaluation, these partitions are complemented by class-level splits that demand zero sample overlap between training ("seen") and test ("unseen") classes (Zhu et al., 19 Jun 2024).
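As an illustration of the cross-subject protocol, the sketch below assigns samples to train or test by parsing the performer ID from the standard NTU sample name (e.g., S001C001P001R001A001). The training-subject list reproduces the commonly used NTU-60 Xsub definition but should be verified against the official dataset release.

```python
# Commonly used NTU-60 cross-subject training performers
# (verify against the official dataset documentation).
XSUB_TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25,
                       27, 28, 31, 34, 35, 38}

def subject_id(sample_name: str) -> int:
    """Extract the performer ID from names such as 'S001C001P001R001A001'."""
    p = sample_name.index("P")
    return int(sample_name[p + 1:p + 4])

def xsub_split(sample_names):
    """Partition sample names into disjoint train/test sets by performer."""
    train = [n for n in sample_names if subject_id(n) in XSUB_TRAIN_SUBJECTS]
    test = [n for n in sample_names if subject_id(n) not in XSUB_TRAIN_SUBJECTS]
    return train, test

train, test = xsub_split(["S001C001P001R001A001", "S001C002P003R001A015"])
```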
Each sample in both datasets provides synchronized access to RGB streams, depth, infrared, RGB-D point clouds, and most importantly, accurate, temporally aligned 3D skeleton sequences.
2. Relevance as Benchmarks for Skeleton-based Zero-Shot and Generalized Zero-Shot Recognition
NTU RGB+D 60/120 are established as the primary testbeds for skeleton-based ZSL and GZSL. The large number of action classes and diverse subject pool, combined with high-quality 3D pose labels, enable protocol splits compatible with both strictly zero-shot and generalized zero-shot evaluation. It has become standard in recent literature to report Top-1 accuracy on a variety of seen/unseen splits, notably 55/5 and 48/12 for NTU-60 and 110/10 and 96/24 for NTU-120; these splits are supported by several benchmark papers and are adopted almost verbatim in recent competitive work (Zhu et al., 12 Dec 2025, Zhu et al., 19 Jun 2024, Li et al., 18 Jul 2024, Li et al., 2023, Chen et al., 18 Nov 2024, Chen et al., 12 Nov 2025).
Protocol design allows for:
- Strict ZSL: Model is trained only on seen classes, tested exclusively on instances whose labels were never observed.
- GZSL: Joint classification over both seen and unseen actions during testing, measuring calibration and transfer under open-set conditions.
Performance metrics include Top-1 and Top-5 accuracy for ZSL, and seen accuracy, unseen accuracy, and harmonic mean (H) for GZSL (Li et al., 2023, Chen et al., 12 Nov 2025).
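A minimal sketch of these metrics using NumPy only; the score and label arrays are hypothetical placeholders for a model's outputs on the test set.

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """scores: (N, num_classes) class scores; labels: (N,) ground-truth indices."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

def gzsl_metrics(scores: np.ndarray, labels: np.ndarray, seen_classes) -> dict:
    """Seen accuracy, unseen accuracy, and their harmonic mean H.

    Assumes the test set contains samples of both seen and unseen classes.
    """
    preds = scores.argmax(axis=1)
    seen_mask = np.isin(labels, list(seen_classes))
    acc_seen = float((preds[seen_mask] == labels[seen_mask]).mean())
    acc_unseen = float((preds[~seen_mask] == labels[~seen_mask]).mean())
    h = (2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
         if (acc_seen + acc_unseen) > 0 else 0.0)
    return {"seen": acc_seen, "unseen": acc_unseen, "H": h}
```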
3. Canonical Splits and Evaluation Practices
Canonical evaluation splits are as follows:
- NTU-60: 55/5 (seen/unseen), 48/12, and occasionally larger (e.g., 40/20, 30/30) for increasing zero-shot difficulty (Zhu et al., 19 Jun 2024, Do et al., 16 Nov 2024).
- NTU-120: 110/10, 96/24, 80/40, 60/60 (Zhu et al., 19 Jun 2024, Zhu et al., 12 Dec 2025).
For each split, multiple random instantiations are typically averaged to ensure robustness, as single splits (especially with 5 or 10 unseen classes) can introduce significant variance (Zhu et al., 12 Dec 2025, Li et al., 18 Jul 2024). Some protocols further distinguish between "easy" (unseen semantically close to seen), "medium," and "hard" (unseen semantically furthest from seen) splits (Jasani et al., 2019).
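The averaging practice can be made concrete as below: several random seen/unseen class partitions are drawn and the resulting accuracies are averaged. The `evaluate_split` callback is a hypothetical stand-in for a full train-and-evaluate run on one split.

```python
import random
import statistics

def random_class_splits(num_classes: int, num_unseen: int, num_trials: int, seed: int = 0):
    """Yield `num_trials` random seen/unseen class partitions (e.g., 55/5 on NTU-60)."""
    rng = random.Random(seed)
    classes = list(range(num_classes))
    for _ in range(num_trials):
        unseen = set(rng.sample(classes, num_unseen))
        yield [c for c in classes if c not in unseen], sorted(unseen)

def averaged_zsl_accuracy(evaluate_split, num_classes=60, num_unseen=5, num_trials=3):
    """Average Top-1 accuracy over several random split instantiations.

    `evaluate_split(seen, unseen)` is a hypothetical callback that trains on the
    seen classes and returns Top-1 accuracy on the unseen classes.
    """
    accs = [evaluate_split(seen, unseen)
            for seen, unseen in random_class_splits(num_classes, num_unseen, num_trials)]
    return statistics.mean(accs), statistics.pstdev(accs)
```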
The evaluation strictly precludes training on any skeleton data from unseen classes, requiring all semantic transfer to operate through explicit class-wise side information (label embeddings, descriptions, or LLM-derived semantics).
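A simple instantiation of this semantic transfer encodes class names or descriptions with a frozen text encoder and classifies a skeleton feature by cosine similarity to the class anchors. The sketch below assumes Sentence-BERT via the `sentence-transformers` package; it is a generic illustration, not the pipeline of any specific method discussed in Section 4, and the skeleton feature is assumed to already be projected into the text-embedding space.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # frozen text encoder

def build_class_anchors(class_texts):
    """Encode per-class side information (labels or descriptions) once."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    anchors = np.asarray(encoder.encode(class_texts))            # (C, d)
    return anchors / np.linalg.norm(anchors, axis=1, keepdims=True)

def classify_by_similarity(skeleton_feature: np.ndarray, anchors: np.ndarray) -> int:
    """Return the index of the most similar class anchor (cosine similarity)."""
    f = skeleton_feature / np.linalg.norm(skeleton_feature)
    return int((anchors @ f).argmax())

# Hypothetical usage: the skeleton feature must share the dimensionality of the
# text embeddings (384-d for this encoder) or be projected into it.
anchors = build_class_anchors(["drink water", "throw", "sit down"])
prediction = classify_by_similarity(np.random.randn(anchors.shape[1]), anchors)
```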
4. Role in Methodological Progress and Comparative Baselines
The NTU datasets have molded the trajectory of SZAR research. All major recent frameworks benchmark on NTU-60/120:
| Model | Type | Alignment Mechanism | Key Semantic Sources | ZSL Acc. NTU-60 (55/5) | ZSL Acc. NTU-120 (110/10) |
|---|---|---|---|---|---|
| SynSE (Gupta et al., 2021) | generative VAE | PoS-guided cross-modal VAE | label PoS tags | 75.8% | 62.7% |
| SMIE (Zhou et al., 2023) | explicit MI | Joint mutual information maximization | label/desc. (Sentence-BERT/CLIP) | 77.98% | 65.74% |
| PGFA (Zhou et al., 1 Jul 2025) | contrastive + prototype | End-to-end contrastive + prototype alignment | full GPT-desc. | 93.2% | 71.4% |
| STAR (Chen et al., 11 Apr 2024) | part-aware | Topology-driven dual-prompt w/ side information | part+global (GPT-3.5+CLIP) | 81.4% | 63.3% |
| PURLS (Zhu et al., 19 Jun 2024) | cross-attention alignment | Multi-scale (body/temporal/global) prompt fusion | GPT-3 + CLIP desc. | 79.23% | 71.95% |
| Neuron (Chen et al., 18 Nov 2024) | adaptive prototyping | Micro-prototypes with context-aware side info | multi-turn LLM embedding | 86.9% | — |
| DynaPURLS (Zhu et al., 12 Dec 2025) | dynamic refinement | Adaptive part-wise fusion + memory-bank adaptation | LLM-based multi-scale desc. | 88.5% | 89.1% |
| SA-DVAE (Li et al., 18 Jul 2024) | disentangled VAE | Disentanglement + adversarial correction | CLIP/SBERT label embeddings | 82.37% | 68.77% |
| Flora (Chen et al., 12 Nov 2025) | neighbor-attuned/flow | Neighbor-aug. semantic + token-level flow matching | CLIP token distill. | 86.3% | 79.6% |
| SUGAR (Ye et al., 13 Nov 2025) | CLIP-LM contrastive | Multimodal text prior + Q-Former LLM interface | CLIP/GPT motion+frame desc. | 65.3%* | — |
*Via cross-dataset transfer (NTU-60 → NTU-120) (Ye et al., 13 Nov 2025).
These datasets enable rigorous ablations (e.g., semantic granularity, backbone architectures, contrastive regime) and easy reproduction of results, facilitating fair comparison and meta-analysis (Zhou et al., 1 Jul 2025, Zhu et al., 12 Dec 2025).
5. Semantic Annotation and Side Information Integration
The NTU RGB+D benchmarks are well suited to research on semantic transfer because each class admits fine-grained, per-class side information:
- Action labels: canonical phrase names for each class.
- Action & Motion Descriptions: Manually curated, LLM-generated, or dictionary-based definitions (Li et al., 2023, Wu et al., 27 Jun 2025).
- Body-part/Temporal Descriptors: Granular descriptions of localized or phase-specific motion, encoded via frozen text encoders (e.g., the CLIP text encoder, Sentence-BERT) (Zhu et al., 19 Jun 2024, Zhu et al., 12 Dec 2025).
- LLM-derived Semantics: Recent methods incorporate GPT-3/3.5/4, CLIP-Text, or hybrid textual anchors, facilitating multi-granularity and part-aware semantic space modeling (global, spatial-local, temporal-local) (Zhu et al., 12 Dec 2025, Zhu et al., 19 Jun 2024).
Side information forms the primary means for models to generalize from seen to unseen action categories.
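To illustrate how such multi-granularity side information is commonly organized, the sketch below builds global, body-part, and temporal-phase prompts for one class and encodes each group with a frozen text encoder (Sentence-BERT as a stand-in). The prompt templates and the `llm_descriptions` dictionary are illustrative assumptions, not the templates of any cited method.

```python
from sentence_transformers import SentenceTransformer

BODY_PARTS = ["head", "hands", "arms", "hips", "legs"]
PHASES = ["beginning", "middle", "end"]

def build_prompts(action: str, llm_descriptions: dict) -> dict:
    """Group side information by granularity: global, spatial-local, temporal-local.

    `llm_descriptions` maps a body part or phase name to a (typically
    LLM-generated) text snippet; here it is a hypothetical dictionary.
    """
    return {
        "global": [f"a person performing the action: {action}"],
        "parts": [f"{action}: the {p} {llm_descriptions.get(p, 'move')}" for p in BODY_PARTS],
        "phases": [f"{action}, {ph} of the motion: {llm_descriptions.get(ph, '')}" for ph in PHASES],
    }

def encode_prompts(prompts: dict) -> dict:
    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # frozen; never fine-tuned
    return {group: encoder.encode(texts) for group, texts in prompts.items()}

features = encode_prompts(build_prompts("throw", {"arms": "swing forward rapidly"}))
```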
6. Impact on the Field and Observations from Empirical Studies
The NTU RGB+D 60/120 datasets have catalyzed several methodological advances:
- Multi-granular Alignment: Methods demonstrate substantial gains by moving beyond coarse label semantics to integrate action and motion descriptions, body-part and temporal phase cues, and LLM-generated multi-turn descriptions (Zhu et al., 12 Dec 2025, Zhu et al., 19 Jun 2024, Li et al., 2023).
- Test-Time Adaptation: Approaches such as Skeleton-Cache and DynaPURLS show that dynamic, retrieval-based, or online memory-driven adaptation on NTU splits yields significant improvements for GZSL (Zhu et al., 12 Dec 2025); a generic sketch of the cache mechanism follows this list.
- Dynamic and Region-Aware Decision Boundaries: Fine-grained, token- or region-level classifiers (e.g., Flora) demonstrate improved calibration and transfer, especially under large unseen sets (Chen et al., 12 Nov 2025).
- Consistency Across Benchmarks: Results on NTU 60/120 are well correlated with secondary datasets (PKU-MMD, Kinetics-skeleton), underscoring their representativeness and practical value (Zhu et al., 19 Jun 2024, Zhu et al., 12 Dec 2025).
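The retrieval/cache idea behind such test-time adaptation can be outlined generically: keep the most confident test-time features per predicted class and blend their mean back into the semantic class anchors. The sketch below illustrates this general mechanism under those assumptions; it is not the DynaPURLS or Skeleton-Cache implementation.

```python
import numpy as np

class FeatureCache:
    """Class-wise test-time feature cache with anchor refinement (illustrative)."""

    def __init__(self, anchors: np.ndarray, per_class: int = 8, blend: float = 0.3):
        self.anchors = anchors.copy()        # (C, d) semantic class anchors
        self.per_class = per_class           # cache size per class
        self.blend = blend                   # how strongly cached features pull anchors
        self.bank = {c: [] for c in range(len(anchors))}   # class -> [(conf, feat)]

    def update(self, feature: np.ndarray, pred_class: int, confidence: float):
        """Insert a confident test-time feature and refine the class anchor."""
        entries = self.bank[pred_class]
        entries.append((confidence, feature))
        entries.sort(key=lambda e: -e[0])
        del entries[self.per_class:]         # keep only the most confident features
        cached_mean = np.mean([f for _, f in entries], axis=0)
        self.anchors[pred_class] = ((1 - self.blend) * self.anchors[pred_class]
                                    + self.blend * cached_mean)

    def classify(self, feature: np.ndarray) -> int:
        """Cosine-similarity classification against the (refined) anchors."""
        f = feature / np.linalg.norm(feature)
        a = self.anchors / np.linalg.norm(self.anchors, axis=1, keepdims=True)
        return int((a @ f).argmax())
```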
NTU RGB+D 60/120 thus underpin the majority of reproducible, state-of-the-art research on skeleton-based ZSL/GZSL, with the evolution of evaluation protocols closely tied to these resources.
7. Controversies, Limitations, and Future Recommendations
Although NTU RGB+D 60/120 provide standardized, comprehensive testbeds, certain limitations exist:
- Action granularity and semantic overlap (e.g., "drinking water" vs. "drinking tea") can inflate confusion for methods lacking fine-grained alignment (Chen et al., 12 Nov 2025, Xu et al., 2 Jun 2024).
- Domain shift between seen and unseen classes remains the primary challenge, motivating dynamic alignment and test-time adaptation as in DynaPURLS, Skeleton-Cache, and Flora (Zhu et al., 12 Dec 2025, Chen et al., 12 Nov 2025).
- Split-size sensitivity: Small "unseen" splits introduce high variance; averaging over multiple random splits is recommended for meaningful benchmarking (Li et al., 18 Jul 2024, Zhu et al., 12 Dec 2025).
- The 3D pose modality, though robust to occlusion and view changes, is intrinsically information-poor, requiring compensation through semantics-rich prompts and side information (Xu et al., 2 Jun 2024, Wu et al., 27 Jun 2025).
- Future work should develop richer semantic codebooks, continual/lifelong protocols, and multimodal integration beyond skeletons alone (Zhu et al., 12 Dec 2025).
NTU RGB+D 60/120 remain the anchor benchmarks for research into skeleton-based transfer, cross-modal learning, and zero-shot generalization. Efforts to augment their impact include third-party splits, integration with new sensor modalities, and the design of more difficult and compositional evaluation schemes.