
NTU RGB+D 60/120 Benchmarks

Updated 19 December 2025
  • NTU RGB+D 60/120 are comprehensive benchmarks featuring multi-modal and 3D skeleton data for evaluating human action recognition.
  • They employ standardized evaluation protocols such as cross-subject and cross-view splits to rigorously test zero-shot and generalized recognition.
  • The datasets drive methodological advances by providing rich semantic annotations and performance metrics that support fine-grained action analysis.

NTU RGB+D 60/120 refers to two widely used large-scale benchmarks for skeleton-based human action recognition, NTU RGB+D 60 and NTU RGB+D 120. Both datasets were collected using Microsoft Kinect sensors and provide multi-modal, multi-angle recordings of complex human activities, focusing specifically on 3D skeleton data for advancing both supervised and zero-shot action recognition. These benchmarks are foundational for evaluating recent methodological advances in skeleton-based zero-shot action recognition (SZAR), generalized zero-shot action recognition (GZSL), and related cross-modal learning paradigms.

1. Dataset Structure, Collection Protocols, and Modalities

NTU RGB+D 60 ("NTU-60") comprises 60 action classes performed by 40 subjects, totaling 56,578 video samples. It features multi-view RGB, depth, infrared, and 3D skeleton modalities, with each skeleton comprising 25 body joints captured at 30 FPS. Actions span daily, interactional, and health-related movements, with each instance annotated for action label, skeleton sequence, and helper attributes. NTU RGB+D 120 ("NTU-120") extends this design with 120 action classes, 106 subjects, and over 114,000 samples, significantly diversifying the pose and viewpoint space.

Data are partitioned under the "cross-subject" (Xsub) protocol on both datasets, plus "cross-view" (Xview) on NTU-60 and "cross-setup" (Xset) on NTU-120. In Xsub, subjects are split into disjoint train/test sets, while Xview/Xset divide samples by camera placement or recording setup. These protocols support robust generalization evaluation, especially for splits demanding zero overlap between training ("seen") and test ("unseen") classes (Zhu et al., 19 Jun 2024).

Each sample in both datasets provides synchronized access to RGB streams, depth, infrared, RGB-D point clouds, and most importantly, accurate, temporally aligned 3D skeleton sequences.
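
As a concrete illustration, the standard protocol splits can be derived directly from the NTU file-naming convention (SsssCcccPpppRrrrAaaa, e.g. S001C002P003R002A013.skeleton, encoding setup, camera, performer, replication, and action IDs). The Python sketch below parses these IDs and assigns samples to train/test under each protocol; the subject and camera ID lists are the ones commonly quoted for the official NTU-60 protocols and should be verified against the dataset release.

```python
import re

# NTU file-naming convention: SsssCcccPpppRrrrAaaa,
# e.g. "S001C002P003R002A013.skeleton" -> setup 1, camera 2,
# performer 3, replication 2, action 13.
NAME_RE = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

# Training-subject IDs for the NTU-60 cross-subject (Xsub) protocol and
# training-camera IDs for cross-view (Xview), as commonly quoted from the
# NTU papers; NTU-120 Xsub uses a different (larger) subject list.
XSUB_TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25,
                       27, 28, 31, 34, 35, 38}
XVIEW_TRAIN_CAMERAS = {2, 3}

def parse_name(filename: str) -> dict:
    """Extract setup/camera/performer/replication/action IDs from a file name."""
    m = NAME_RE.search(filename)
    if m is None:
        raise ValueError(f"not an NTU-style file name: {filename}")
    setup, camera, performer, replication, action = map(int, m.groups())
    return {"setup": setup, "camera": camera, "performer": performer,
            "replication": replication, "action": action}

def split_of(filename: str, protocol: str) -> str:
    """Assign a sample to 'train' or 'test' under Xsub, Xview, or Xset."""
    ids = parse_name(filename)
    if protocol == "xsub":   # disjoint subject sets
        return "train" if ids["performer"] in XSUB_TRAIN_SUBJECTS else "test"
    if protocol == "xview":  # NTU-60 only: split by camera placement
        return "train" if ids["camera"] in XVIEW_TRAIN_CAMERAS else "test"
    if protocol == "xset":   # NTU-120: even-numbered setups train
        return "train" if ids["setup"] % 2 == 0 else "test"
    raise ValueError(f"unknown protocol: {protocol}")

print(split_of("S001C002P003R002A013.skeleton", "xsub"))  # -> 'test' (P3 unseen)
```

The zero-shot splits of Section 3 are then layered on top of these protocols: the action ID parsed here is what the seen/unseen class partitions operate on.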

2. Relevance as Benchmarks for Skeleton-based Zero-Shot and Generalized Zero-Shot Recognition

NTU RGB+D 60/120 are established as the primary testbeds for skeleton-based ZSL and GZSL. The large number of action classes and diverse subject pool, combined with high-quality 3D pose labels, enable protocol splits compatible with both strictly zero-shot and generalized zero-shot evaluation. Recent literature standardly reports Top-1 accuracy on a range of seen/unseen splits, notably 55/5 and 48/12 for NTU-60, and 110/10 and 96/24 for NTU-120; these splits are standardized across several benchmark papers and adopted almost verbatim in recent competitive work (Zhu et al., 12 Dec 2025, Zhu et al., 19 Jun 2024, Li et al., 18 Jul 2024, Li et al., 2023, Chen et al., 18 Nov 2024, Chen et al., 12 Nov 2025).

Protocol design allows for:

  • Strict ZSL: Model is trained only on seen classes, tested exclusively on instances whose labels were never observed.
  • GZSL: Joint classification over both seen and unseen actions during testing, measuring calibration and transfer under open-set conditions.

Performance metrics include Top-1 and Top-5 accuracy for ZSL, and seen accuracy, unseen accuracy, and harmonic mean (H) for GZSL (Li et al., 2023, Chen et al., 12 Nov 2025).
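
These metrics reduce to a few lines of array arithmetic. The NumPy sketch below (function and variable names are illustrative, not from any benchmark codebase) computes Top-k accuracy over class scores and the GZSL harmonic mean H = 2·Acc_seen·Acc_unseen / (Acc_seen + Acc_unseen).

```python
import numpy as np

def topk_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: (N, C) class scores; labels: (N,) integer ground-truth labels.
    """
    topk = np.argsort(scores, axis=1)[:, -k:]   # indices of the k best classes
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

def gzsl_harmonic_mean(scores: np.ndarray, labels: np.ndarray,
                       seen_classes) -> dict:
    """Seen/unseen accuracy and harmonic mean H for generalized zero-shot eval.

    Scores are over the joint (seen + unseen) label space; empty groups are
    not guarded against in this sketch.
    """
    seen_mask = np.isin(labels, list(seen_classes))
    acc_s = topk_accuracy(scores[seen_mask], labels[seen_mask])
    acc_u = topk_accuracy(scores[~seen_mask], labels[~seen_mask])
    h = 2 * acc_s * acc_u / (acc_s + acc_u) if (acc_s + acc_u) > 0 else 0.0
    return {"seen": acc_s, "unseen": acc_u, "H": h}
```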

3. Canonical Splits and Evaluation Practices

Canonical evaluation splits are as follows:

  • NTU-60: 55 seen / 5 unseen classes (the 55/5 split) and 48 seen / 12 unseen (48/12).
  • NTU-120: 110 seen / 10 unseen (110/10) and 96 seen / 24 unseen (96/24).

For each split, multiple random instantiations are typically averaged to ensure robustness, as single splits (especially with 5 or 10 unseen classes) can introduce significant variance (Zhu et al., 12 Dec 2025, Li et al., 18 Jul 2024). Some protocols further distinguish between "easy" (unseen semantically close to seen), "medium," and "hard" (unseen semantically furthest from seen) splits (Jasani et al., 2019).

The evaluation strictly precludes training on any skeleton data from unseen classes, requiring all semantic transfer to operate through explicit class-wise side information (label embeddings, descriptions, or LLM-derived semantics).
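
A minimal sketch of how such a protocol can be instantiated, assuming a toy sample list: draw several random seen/unseen partitions, strictly filter unseen-class samples from the training pool, and average accuracy over the instantiations.

```python
import random

def make_split(num_classes: int, num_unseen: int, seed: int):
    """Draw one random seen/unseen class partition, e.g. 55/5 for NTU-60."""
    rng = random.Random(seed)
    unseen = set(rng.sample(range(num_classes), num_unseen))
    return set(range(num_classes)) - unseen, unseen

def filter_train_pool(samples, unseen_classes):
    """Strict ZSL: drop every sample whose label belongs to an unseen class."""
    return [(x, y) for (x, y) in samples if y not in unseen_classes]

# Toy usage: three random 55/5 instantiations over NTU-60 class IDs 0..59.
fake_samples = [(f"clip_{i}", i % 60) for i in range(600)]   # (sample_id, label)
for seed in range(3):
    seen, unseen = make_split(num_classes=60, num_unseen=5, seed=seed)
    train_pool = filter_train_pool(fake_samples, unseen)
    assert all(y not in unseen for _, y in train_pool)       # zero class overlap
    print(f"seed {seed}: unseen={sorted(unseen)}, train pool={len(train_pool)}")
# Reported numbers would then be the mean accuracy over such instantiations.
```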

4. Role in Methodological Progress and Comparative Baselines

The NTU datasets have molded the trajectory of SZAR research. All major recent frameworks benchmark on NTU-60/120:

| Model | Type | Alignment Mechanism | Key Semantic Sources | ZSL Acc. NTU-60 (55/5) | ZSL Acc. NTU-120 (110/10) |
|---|---|---|---|---|---|
| SynSE (Gupta et al., 2021) | generative VAE | PoS-guided cross-modal VAE | label PoS tags | 75.8% | 62.7% |
| SMIE (Zhou et al., 2023) | explicit MI | joint mutual information maximization | label/desc. (Sentence-BERT/CLIP) | 77.98% | 65.74% |
| PGFA (Zhou et al., 1 Jul 2025) | contrastive + prototype | end-to-end contrastive + prototype alignment | full GPT descriptions | 93.2% | 71.4% |
| STAR (Chen et al., 11 Apr 2024) | part-aware | topology-driven dual prompts w/ side information | part + global (GPT-3.5 + CLIP) | 81.4% | 63.3% |
| PURLS (Zhu et al., 19 Jun 2024) | cross-attention alignment | multi-scale (body/temporal/global) prompt fusion | GPT-3 + CLIP descriptions | 79.23% | 71.95% |
| Neuron (Chen et al., 18 Nov 2024) | adaptive prototyping | micro-prototypes with context-aware side info | multi-turn LLM embedding | 86.9% | — |
| DynaPURLS (Zhu et al., 12 Dec 2025) | dynamic refinement | adaptive part-wise fusion + memory-bank adaptation | LLM-based multi-scale descriptions | 88.5% | 89.1% |
| SA-DVAE (Li et al., 18 Jul 2024) | disentangled VAE | disentanglement + adversarial correction | CLIP/SBERT label embeddings | 82.37% | 68.77% |
| Flora (Chen et al., 12 Nov 2025) | neighbor-attuned/flow | neighbor-augmented semantics + token-level flow matching | CLIP token distillation | 86.3% | 79.6% |
| SUGAR (Ye et al., 13 Nov 2025) | CLIP-LM contrastive | multimodal text prior + Q-Former LLM interface | CLIP/GPT motion + frame descriptions | — | 65.3%* |

*Via cross-dataset transfer (NTU-60 → NTU-120) (Ye et al., 13 Nov 2025).

These datasets enable rigorous ablations (e.g., semantic granularity, backbone architectures, contrastive regime) and easy reproduction of results, facilitating fair comparison and meta-analysis (Zhou et al., 1 Jul 2025, Zhu et al., 12 Dec 2025).

5. Semantic Annotation and Side Information Integration

NTU RGB+D splits are uniquely suitable for research into semantic transfer due to their support for fine-grained, per-class side information:

  • Action labels: canonical phrase names for each class.
  • Action and motion descriptions: manually curated, LLM-generated, or dictionary-based definitions (Li et al., 2023, Wu et al., 27 Jun 2025).
  • Body-part/temporal descriptors: granular descriptions of localized or phase-specific motion, encoded via frozen text encoders (e.g., CLIP's text encoder, Sentence-BERT) (Zhu et al., 19 Jun 2024, Zhu et al., 12 Dec 2025).
  • LLM-derived semantics: recent methods incorporate GPT-3/3.5/4, CLIP-Text, or hybrid textual anchors, facilitating multi-granularity and part-aware semantic space modeling (global, spatial-local, temporal-local) (Zhu et al., 12 Dec 2025, Zhu et al., 19 Jun 2024).

Side information forms the primary means for models to generalize from seen to unseen action categories.
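
To make this concrete, the sketch below follows the canonical recipe: encode each class's label or description with a frozen text encoder, then classify a skeleton embedding by cosine similarity against those class anchors. The sentence-transformers calls follow that library's public API; the skeleton embedding itself is a stand-in assumption, since producing skeleton features aligned to the text space is precisely what the surveyed methods differ on.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # frozen text encoder

# Per-class side information: labels or richer (e.g. LLM-generated) descriptions.
class_texts = {
    "drink water": "a person lifts a cup to the mouth and drinks",
    "jump up": "a person bends the knees and jumps vertically",
    # ... one entry per unseen class
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any frozen encoder works
names = list(class_texts)
class_emb = encoder.encode([class_texts[n] for n in names])          # (C, D)
class_emb /= np.linalg.norm(class_emb, axis=1, keepdims=True)

def classify(skeleton_embedding: np.ndarray) -> str:
    """Nearest class by cosine similarity in the shared semantic space.

    `skeleton_embedding` is assumed to come from a skeleton encoder (e.g. an
    ST-GCN) already projected into the text-embedding space.
    """
    z = skeleton_embedding / np.linalg.norm(skeleton_embedding)
    return names[int(np.argmax(class_emb @ z))]
```

Swapping in CLIP's text tower or multi-granularity LLM descriptions changes only the class_texts and encoder lines; the nearest-anchor decision rule stays the same.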

6. Impact on the Field and Observations from Empirical Studies

The NTU RGB+D 60/120 datasets have catalyzed several methodological advances, including:

  • the shift from generative cross-modal VAEs (SynSE, SA-DVAE) toward contrastive and prototype-based alignment (SMIE, PGFA);
  • part-aware and multi-scale semantic modeling of skeletons (STAR, PURLS, DynaPURLS);
  • the adoption of LLM-derived, multi-granularity side information as the semantic bridge to unseen classes (Neuron, Flora, SUGAR).

NTU RGB+D 60/120 thus underpin the majority of reproducible, state-of-the-art research on skeleton-based ZSL/GZSL, with the evolution of evaluation protocols closely tied to these resources.

7. Controversies, Limitations, and Future Recommendations

Although NTU RGB+D 60/120 provide standardized, comprehensive testbeds, certain limitations exist: recordings are captured in controlled laboratory settings, which limits evidence about in-the-wild generalization; Kinect-estimated skeletons carry sensor noise; small unseen splits (5 or 10 classes) yield high variance across instantiations (Zhu et al., 12 Dec 2025, Li et al., 18 Jul 2024); and seen/unseen split choices are not always consistent across papers, complicating direct comparison.

NTU RGB+D 60/120 remain the anchor benchmarks for research into skeleton-based transfer, cross-modal learning, and zero-shot generalization. Efforts to augment their impact include third-party splits, integration with new sensor modalities, and the design of more difficult and compositional evaluation schemes.
