Skeleton-based Zero-Shot Action Recognition

Updated 9 February 2026

Skeleton-based Zero-Shot Action Recognition (SZAR) uses 3D skeleton sequences and semantic descriptors to identify previously unseen human actions without labeled samples.
It employs advanced cross-modal alignment techniques, including generative models, contrastive objectives, and part-aware feature partitioning to boost generalization.
Dynamic test-time adaptation and neural embedding methods enhance scalability and robustness, addressing severe distribution shifts in action recognition tasks.

Skeleton-based Zero-Shot Action Recognition (SZAR) is a paradigm in human action recognition that targets the identification of previously unseen action categories, relying solely on 3D body pose (skeleton) input and high-level semantic information, without requiring labeled skeleton samples for unseen classes. SZAR has become a cornerstone for developing action recognition systems with scalability and generalization beyond a closed set of categories, enabled by cross-modal mapping between skeleton motion patterns and textual or semantic descriptors.

1. Formulation and Semantics of Skeleton-based ZSL

Let 𝒞ₛ and 𝒞ᵤ denote disjoint seen and unseen action class sets (𝒞ₛ ∩ 𝒞ᵤ=∅), with training data 𝒟ₛ = { (Xᵢ, yᵢ) | yᵢ ∈ 𝒞ₛ; Xᵢ ∈ ℝ^{L×J×M×3} } comprising skeleton sequences (L frames, J joints, M actors), and 𝒟ᵤ = { (Xⱼ, yⱼ) | yⱼ ∈ 𝒞ᵤ } for evaluation. Associated with each class y∈𝒞ₛ∪𝒞ᵤ is a semantic descriptor, ranging from simple label text to rich natural language or part-aware descriptions. The goal is to learn recognition models on 𝒟ₛ (𝒞ₛ only) that generalize to correctly classify elements of 𝒟ᵤ.

The problem is studied in two settings: Zero-Shot Learning (ZSL), where test samples come from 𝒞ᵤ only, and Generalized Zero-Shot Learning (GZSL), where test samples come from 𝒞ₛ∪𝒞ᵤ. Success hinges on learning cross-modal alignment structures that transfer from seen to unseen classes under severe distribution shift.
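
The difference between the two settings can be illustrated with a minimal sketch, assuming skeleton features and class descriptors have already been mapped into a shared embedding space and are compared by cosine similarity. The function names, the toy embeddings, and the 3-D space are hypothetical and not taken from any cited method; the only thing that changes between ZSL and GZSL here is the candidate label set.

```python
import numpy as np

def cosine_scores(z, class_embs):
    """Cosine similarity between one skeleton embedding and every class embedding."""
    z = z / np.linalg.norm(z)
    c = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    return c @ z

def predict(z, class_embs, candidate_ids):
    """Return the candidate class whose semantic embedding is most similar to z."""
    scores = cosine_scores(z, class_embs)
    candidate_ids = np.asarray(candidate_ids)
    return int(candidate_ids[np.argmax(scores[candidate_ids])])

# Toy setup (made-up values): 4 classes in a 3-D shared space.
# Classes 0-1 play the role of seen classes, classes 2-3 of unseen classes.
class_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
seen, unseen = [0, 1], [2, 3]

z = np.array([0.9, 0.2, 0.1])  # hypothetical skeleton embedding of a test sample

zsl_pred = predict(z, class_embs, unseen)          # ZSL: search unseen classes only -> 3
gzsl_pred = predict(z, class_embs, seen + unseen)  # GZSL: search all classes -> 0
```

In this toy case the same sample is assigned to an unseen class under ZSL but to a seen class under GZSL, which is one reason the generalized setting is usually the harder of the two.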

2. Evolution of Cross-Modal Alignment Methods

SZAR research has evolved from global, static semantic alignment to intricate multi-scale visual-semantic correspondence, driven by the need to capture subtle motion cues and address domain shift.

Global Embedding Alignment: Early works projected skeleton dynamics and class name embeddings (e.g., Sent2Vec or Sentence-BERT encodings) into a shared space using linear projections or relation networks, optimizing ranking or MSE losses (Jasani et al., 2019). These approaches suffered from limited discriminativeness, especially as unseen classes diverged further from the seen ones.
Contrastive and Information-Theoretic Objectives: Methods such as SMIE (Zhou et al., 2023) utilized mutual information objectives to bring skeleton features and text embeddings together, while also accounting for temporal structure via frame-level selection, boosting transfer under complex inter-class variation.
Generative Cross-Modal Models and VAEs: A major development is the use of dual-branch VAEs (e.g., SynSE (Gupta et al., 2021), MSF (Li et al., 2023), SA-DVAE (Li et al., 2024), FS-VAE (Wu et al., 27 Jun 2025)), which learn to encode both skeleton and text semantics into a joint latent space, enforced by reconstruction and cross-modal alignment losses. SA-DVAE introduced explicit disentanglement of semantic and style latent subspaces, improving transfer robustness (Li et al., 2024). In parallel, generative alignment guided by part-of-speech (PoS) structure, together with cross-modal latent regularization, proved important for compositionality.
Part-aware/Hierarchical Alignment: Recognizing the inadequacy of global alignment for fine-grained actions, PURLS (Zhu et al., 2024) and DynaPURLS (Zhu et al., 12 Dec 2025) leverage GPT-3 to generate multi-granularity textual descriptions (global, four body parts, three temporal segments), matching these via attention-based adaptive partitioning to corresponding skeleton features. STAR (Chen et al., 2024) similarly decomposes the skeleton and semantic space at the body part level, aligning them with dedicated prompts to enforce both intra-class compactness and inter-class separability.
Diffusion and Flow-based Dynamics: Instead of classical alignment/classification, TDSM (Do et al., 2024) aligns skeleton and text features via conditional reverse diffusion, combining a DDPM-style loss with triplet supervision for robust matching in a shared feature space. Flora (Chen et al., 12 Nov 2025) replaces static classifiers with a distribution-aware flow field between semantic and skeleton latent embeddings, coupled with neighbor-aware semantic aggregation, facilitating more nuanced and robust open-form decision boundaries.
Test-time and Dynamic Refinement: DynaPURLS (Zhu et al., 12 Dec 2025) pushes adaptive alignment further by introducing a dynamic refinement module at inference, learning per-sample scale and bias corrections to textual representations based on the incoming visual input, stabilized by a confidence-aware, class-balanced memory bank populated online with high-confidence test pseudo-labels.
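
As an illustration of the contrastive family of objectives listed above, the following is a minimal NumPy sketch of a symmetric InfoNCE loss over a batch of paired skeleton and text embeddings. It is a generic formulation and should not be read as the SMIE objective or as any other cited method's loss; the function names and the temperature value are assumptions made for the example.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def info_nce(skel, text, temperature=0.1):
    """Symmetric InfoNCE: row i of `skel` is the positive for row i of `text`."""
    s, t = l2_normalize(skel), l2_normalize(text)
    logits = s @ t.T / temperature            # (B, B) scaled cosine similarities
    diag = np.arange(len(s))
    loss_s2t = -log_softmax(logits)[diag, diag].mean()    # skeleton -> text
    loss_t2s = -log_softmax(logits.T)[diag, diag].mean()  # text -> skeleton
    return 0.5 * (loss_s2t + loss_t2s)

# Toy check with random (made-up) embeddings: well-aligned pairs should
# yield a much lower loss than the same embeddings paired in reverse order.
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))
aligned = text + 0.01 * rng.normal(size=(8, 16))
loss_aligned = info_nce(aligned, text)
loss_shuffled = info_nce(aligned[::-1], text)
```

Minimizing this loss pulls each skeleton embedding toward its own class description and pushes it away from the other descriptions in the batch, which is the basic mechanism that the more elaborate alignment schemes build on.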

3. Semantic Enrichment and Prompt Design

Semantic information—in both label and description form—has proven essential to SZAR's transferability:

Simple class names are insufficient for fine-grained action discrimination. Enriched natural language descriptions, part-specific prompts, temporally phased descriptions, and crowdsourced motion walkthroughs increase mutual information between skeleton dynamics and action semantics (Xu et al., 2024, Li et al., 2023, Zhu et al., 2024).
Selection and generation of side information via LLMs (GPT-3, GPT-4) has become standard to supply both global and part-aware cues (Zhu et al., 12 Dec 2025, Chen et al., 2024, Zhu et al., 2024).
Selective feature ensemble (SFE) and attention mechanisms act on large prompt sets (e.g., 100 GPT-generated phrases/class in InfoCPL (Xu et al., 2024)), with multi-level alignment providing branch-wise hyperplanes for robust class separation.

Semantic richness is vital for classes distinguished by subtle differences in local motion or temporally ordered events, and for generalization when class names themselves are lexically similar or ambiguous.
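
One simple way to use many descriptions per class is to score each class by its best-matching prompts rather than by a single class-name embedding. The sketch below is a hypothetical top-k prompt ensemble written for illustration only; it is not the selective feature ensemble of InfoCPL, and the function name, `top_k` value, and random embeddings are all assumptions.

```python
import numpy as np

def class_scores(z, prompt_embs, top_k=3):
    """Score classes by the mean of their top-k prompt similarities.

    z:           (D,) skeleton embedding
    prompt_embs: (C, P, D) embeddings of P text prompts for each of C classes
    """
    z = z / np.linalg.norm(z)
    p = prompt_embs / np.linalg.norm(prompt_embs, axis=-1, keepdims=True)
    sims = p @ z                                   # (C, P) cosine similarities
    top = np.sort(sims, axis=1)[:, -top_k:]        # k best prompts per class
    return top.mean(axis=1)

# Toy data (made up): 3 classes, 5 prompts each, 8-D embeddings.
rng = np.random.default_rng(0)
z = rng.normal(size=8)
prompt_embs = rng.normal(size=(3, 5, 8))
# Make three of class 1's prompts describe the sample well; the rest stay noisy.
prompt_embs[1, :3] = z + 0.05 * rng.normal(size=(3, 8))

scores = class_scores(z, prompt_embs)
predicted = int(scores.argmax())
```

Averaging only the top-k similarities lets a class be recognized from its few relevant descriptions even when the remaining prompts are uninformative for the sample at hand.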

4. Feature Representation and Partitioning

Novel feature construction strategies have addressed the challenge of mapping skeleton sequences to discriminative spaces:

Adaptive Partitioning: Cross-attention or memory-based modules (PURLS, DynaPURLS, Neuron) partition spatio-temporal features according to semantic cues, allowing the model to focus on body regions and temporal intervals described as important for a given class.
Micro-prototype Evolution: Neuron (Chen et al., 2024) incrementally grows sets of spatial and temporal micro-prototypes, regulated step-wise with context-aware side information, spatial compression, and temporal memory gates to facilitate gradual, fine-to-coarse alignment.
Frequency-based Enhancement: FS-VAE (Wu et al., 27 Jun 2025) enhances skeleton features in the frequency domain, adaptively scaling low- and high-frequency bands, enriching global and fine-grained semantic responses.
Dynamic Test-Time Adaptation: Skeleton-Cache (Zhu et al., 12 Dec 2025) employs a cache of structured descriptors (global, body part, temporal interval) extracted at test time, with LLM-derived class-specific weights for fusion, enabling rapid, training-free adaptation of any frozen model.

These strategies yield tight intra-class skeleton clusters and improved cross-modal correspondence, as evidenced by t-SNE and class-discriminative analyses.
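
To make the idea of part-level and temporal-level features concrete, here is a minimal sketch that pools a per-frame, per-joint feature tensor into one global vector, four body-part vectors, and three temporal-segment vectors. It uses fixed average pooling as a simple stand-in; the cited methods use learned, attention-based partitioning instead. The joint groupings are hypothetical and do not follow any particular dataset's joint ordering.

```python
import numpy as np

# Hypothetical joint-to-part grouping for a 25-joint skeleton (illustrative only).
BODY_PARTS = {
    "head": [0, 1, 2, 3],
    "arms": [4, 5, 6, 7, 8, 9, 10, 11],
    "torso": [12, 13, 14, 15, 16],
    "legs": [17, 18, 19, 20, 21, 22, 23, 24],
}

def partition_features(feats, parts=BODY_PARTS, num_segments=3):
    """Pool (L, J, C) features into global, body-part, and temporal vectors.

    Returns a dict mapping a descriptor name to a (C,) vector.
    """
    pooled = {"global": feats.mean(axis=(0, 1))}
    # Spatial partition: average over all frames and the joints of each part.
    for name, joints in parts.items():
        pooled[name] = feats[:, joints].mean(axis=(0, 1))
    # Temporal partition: split the frame axis into roughly equal segments.
    for i, segment in enumerate(np.array_split(feats, num_segments, axis=0)):
        pooled[f"segment_{i}"] = segment.mean(axis=(0, 1))
    return pooled

# Toy input: 30 frames, 25 joints, 4 feature channels.
feats = np.ones((30, 25, 4))
pooled = partition_features(feats)
```

Each pooled vector can then be matched against the corresponding global, part-level, or phase-level text description, which is the general pattern that the adaptive variants refine.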

5. Decision Functions and Classifier Design

Classifier architectures in SZAR have evolved beyond fixed similarity/dot-product schemes:

Prototype-based Matching: PGFA (Zhou et al., 1 Jul 2025) computes centroid prototypes for each unseen class from high-confidence test predictions, replacing static text anchors with empirical skeleton feature means, thereby correcting for alignment bias and domain shift.
Generative Classifiers and Gating: SynSE (Gupta et al., 2021), MSF (Li et al., 2023), and SA-DVAE (Li et al., 2024) synthesize latent features for unseen classes, training explicit classifiers (or using confidence-based gating) to route test instances to seen/unseen-class softmax heads.
Distributional and Flow-based Predictors: Flora (Chen et al., 12 Nov 2025) builds distribution-aware classifiers using flow fields between class conditionals (semantic/skeleton), predicting token-level velocity for fine boundary adaptation.
Diffusion-based Matching and Test-Time Parameter Refinement: TDSM (Do et al., 2024) applies triplet-augmented DDPM-style transformers for implicit alignment, while DynaPURLS (Zhu et al., 12 Dec 2025) updates semantic queries with test-time learned corrections.
Test-time Retrieval and Cache Fusion: Skeleton-Cache (Zhu et al., 12 Dec 2025) reformulates prediction as retrieval over a non-parametric cache of global, body-part, and temporal-part descriptors, fusing the results with LLM-guided, class-specific importance scores and no gradient updates.
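
The prototype-based idea, replacing text anchors with means of confidently predicted test features, can be sketched as follows. This is a simplified illustration under stated assumptions (cosine similarity, a per-class confidence quantile as the selection rule, a single refinement pass) rather than the PGFA procedure itself; the function names and the toy features are hypothetical.

```python
import numpy as np

def normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def prototype_reclassify(feats, text_anchors, confidence_quantile=0.5):
    """Re-classify test features against skeleton prototypes.

    feats:        (N, D) test skeleton features in the shared space
    text_anchors: (K, D) text embeddings of the K candidate classes
    """
    f, a = normalize(feats), normalize(text_anchors)
    sims = f @ a.T
    pseudo = sims.argmax(axis=1)   # initial pseudo-labels from text anchors
    conf = sims.max(axis=1)        # similarity of the winning class
    prototypes = a.copy()          # fall back to the text anchor if a class is empty
    for k in range(len(a)):
        mask = pseudo == k
        if mask.any():
            # Keep only the more confident half (by default) of class k's samples.
            threshold = np.quantile(conf[mask], confidence_quantile)
            keep = mask & (conf >= threshold)
            prototypes[k] = normalize(f[keep].mean(axis=0))
    return (f @ prototypes.T).argmax(axis=1), prototypes

# Toy 2-D example (made-up values) with two candidate classes.
text_anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
feats = np.array([
    [1.0, 0.1], [1.0, 0.2], [1.0, 0.9],
    [0.1, 1.0], [0.2, 1.0], [0.9, 1.0],
])
preds, prototypes = prototype_reclassify(feats, text_anchors)
```

Because the prototypes are computed from skeleton features rather than from text, they sit inside the test feature distribution, which is the intuition behind using them to reduce alignment bias.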

6. Experimental Protocols, Quantitative Benchmarks, and Ablations

Standard evaluation is performed on NTU-RGB+D 60/120 and PKU-MMD (and recently Kinetics-skeleton 200), under multiple splits (e.g., 55/5, 48/12 for NTU-60; 110/10, 96/24 for NTU-120). Metrics include Top-1 accuracy on unseen classes (ZSL) and harmonic mean H for GZSL.
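
For reference, the GZSL harmonic mean H combines seen-class and unseen-class accuracy so that a model cannot score well by favoring one side. A minimal sketch follows; the function name and the example accuracies are made up for illustration.

```python
def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean H of seen-class and unseen-class accuracy (same units)."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Hypothetical accuracies in percent: strong on seen classes, weaker on unseen ones.
h = harmonic_mean(80.0, 40.0)  # about 53.3, below the arithmetic mean of 60.0
```

The harmonic mean is dominated by the smaller of the two accuracies, so a model biased toward seen classes is penalized more heavily than it would be under a plain average.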

Selected top reported results (ZSL Top-1 accuracy, %):

Method	NTU-60 (55/5)	NTU-60 (48/12)	NTU-120 (110/10)	NTU-120 (96/24)	PKU-MMD
SMIE (Zhou et al., 2023)	77.98	40.18	65.74	45.30	-
InfoCPL (Xu et al., 2024)	85.91	53.32	74.81	60.05	85.15
PURLS (Zhu et al., 2024)	79.23	40.99	71.95	52.01	-
DynaPURLS (Zhu et al., 12 Dec 2025)	88.52	71.80	89.06	69.11	78.26
PGFA (Zhou et al., 1 Jul 2025)	80.26	55.99	79.99	59.42	87.80
FS-VAE (Wu et al., 27 Jun 2025)	86.9	57.2	74.4	62.5	-
SA-DVAE (Li et al., 2024)	82.37	41.38	68.77	46.12	-
TDSM (Do et al., 2024)	86.49	56.03	74.15	65.06	-
STAR (Chen et al., 2024)	81.4	45.10	63.30	44.30	-
Flora (Chen et al., 12 Nov 2025)	86.3–88.6	65.3	80.7–71.2	66.4	71.6

Ablation studies consistently show that multi-level/part-aware semantics, adaptive alignment, dynamic refinement, and semantic fusion are each responsible for large performance gains (e.g., DynaPURLS reports +9–17% from dynamic refinement, InfoCPL +17–20% from branch ensemble, PGFA +10% from prototype alignment).

7. Outlook and Open Directions

The field has converged on multi-granularity semantics, attention-based feature partitioning, generative/contrastive alignment objectives, and dynamic test-time adaptation as pillars for SZAR performance. However, several challenges remain:

Skeleton-Only Limitation: Actions differing only by manipulated objects or subtle context may remain indistinguishable in pure skeleton space; future work targets multi-modal fusion (RGB, depth, audio).
Prompt and Semantic Quality: Automated description generation via LLMs is central, yet may still yield ambiguous or noisy supervision, especially for highly granular actions; integrating feedback-driven or context-aware prompt tuning is an open problem.
Resource Scaling: Emerging methods such as Skeleton-Cache and memory banks offer efficient adaptation, but scalability to truly open-vocabulary settings or low-shot domains depends on continued advances in non-parametric retrieval, hashing, and distributional matching.
Fine-tuning and Self-Supervision: End-to-end fine-tuning, self-supervised skeleton pretraining, and transformer-based architectures are promising for further closing the transfer gap.
Temporal Symbolization and LLM Reasoning: LLM-based reasoning on compressed skeleton representations (e.g., SUGAR (Ye et al., 13 Nov 2025)) with temporal projection is an incipient research line, enabling flexible symbolic abstraction and open-ended description of actions in the zero-shot regime.

Skeleton-based Zero-Shot Action Recognition thus continues to serve both as a touchstone for multimodal, transfer learning methodologies and as a concrete target for robust, scalable action understanding in embodied AI systems.