
Zero-shot Skeleton-based Action Recognition

Updated 17 December 2025
  • Zero-shot skeleton-based action recognition is a method that leverages semantic descriptions to recognize human actions from unseen skeleton data by bridging the gap between visual features and language embeddings.
  • Recent approaches use mutual information maximization, generative models, and adaptive partitioning to align and disentangle rich multi-scale skeleton representations with semantic class prototypes.
  • Empirical studies on benchmarks like NTU-60 highlight significant performance gains through innovations in multi-granularity semantics, dynamic test-time adaptation, and flow-based decision frameworks.

Zero-shot skeleton-based action recognition (ZS-SAR) is the task of recognizing human actions represented by skeleton sequences, where some target action classes are never seen during training. Unlike supervised skeleton-based action recognition, ZS-SAR leverages semantic knowledge—most often in the form of language descriptions or class prototypes—to transfer knowledge from seen to unseen classes. This involves constructing a bridge between the visual space of skeleton features and the semantic space of action categories. The field has advanced through innovations in visual-semantic alignment, semantic prompt engineering, adaptive feature partitioning, generative cross-modal modeling, feature disentanglement, dynamic test-time adaptation, and flow-based decision frameworks.

1. Formal Problem Definition

ZS-SAR considers a labeled training set of skeleton sequences drawn from a subset of actions (seen classes $\mathcal{Y}_s$). Each sample consists of a sequence $x \in \mathbb{R}^{T \times J \times C}$, where $T$ is the number of frames, $J$ the number of joints, and $C$ the number of coordinate channels, together with a semantic label. At inference, the system must label sequences from disjoint, unseen classes $\mathcal{Y}_u$, for which no training samples are available. Each class $y$ (seen or unseen) is associated with a class name or richer textual description, which is mapped to a semantic embedding $a$ via a pretrained text encoder, e.g., Sentence-BERT or CLIP.

The ZS-SAR goal is to learn a recognition function $f: \mathcal{X} \to \mathcal{Y}_u$ using only labeled skeletons from $\mathcal{Y}_s$, such that for a test skeleton $x_{\text{test}}$,

$$\hat{y} = \arg\max_{y \in \mathcal{Y}_u} T\big(v_{x_{\text{test}}},\, a(y)\big),$$

where $v_{x_{\text{test}}}$ is the encoded visual feature and $T$ is the scoring/alignment function between skeleton and semantic features. In the generalized setting (GZSL), test samples may come from both $\mathcal{Y}_s$ and $\mathcal{Y}_u$.
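
In its simplest instantiation, $T$ is a cosine similarity between the encoded skeleton feature and precomputed class-text embeddings. The following is a minimal sketch of this decision rule, assuming a hypothetical `skeleton_encoder` and precomputed `text_embeddings`; it illustrates the formulation above rather than any specific paper's implementation.

```python
import torch.nn.functional as F

def zero_shot_classify(x, skeleton_encoder, text_embeddings, class_names):
    # x: one skeleton sequence of shape (T, J, C)
    # text_embeddings: (|Y_u|, d) semantic embeddings a(y) of the unseen classes
    v = skeleton_encoder(x.unsqueeze(0))            # (1, d) visual feature v_x
    sims = F.cosine_similarity(v, text_embeddings)  # T(v, a(y)) for every unseen y
    return class_names[sims.argmax().item()]
```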

2. Visual-Semantic Alignment Methodologies

ZS-SAR methodologies focus on aligning skeleton features with semantic prototypes under severe data distribution shifts. Early works relied on mapping skeleton features and class label embeddings to a shared space using projection (e.g., DeViSE) or relation networks (Jasani et al., 2019). Subsequent approaches recognize the need for improved alignment due to domain gap and distributional bias.

Mutual Information Maximization

The SMIE approach directly maximizes the mutual information (MI) between visual and semantic embeddings, using a Jensen–Shannon divergence–based lower bound to encourage distributional alignment (Zhou et al., 2023). The scoring network $T(v, a)$ is trained with global MI and an incremental loss that leverages temporal information, enforcing that additional video frames increase the MI. This joint objective is implemented as

$$L = -m + \lambda \max\big(0,\, \beta - (m - \hat{m})\big),$$

where $m$ and $\hat{m}$ are MI estimates for the full and partial sequences, respectively.
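
A minimal sketch of this objective, assuming the standard Jensen–Shannon MI lower-bound estimator, is given below; the scoring network `T_net`, the negative-sampling scheme, and the hyperparameters `lam` and `beta` are illustrative assumptions rather than SMIE's exact implementation.

```python
import torch
import torch.nn.functional as F

def jsd_mi_lower_bound(scores_pos, scores_neg):
    # Jensen-Shannon MI lower bound: E_pos[-softplus(-T)] - E_neg[softplus(T)]
    return (-F.softplus(-scores_pos)).mean() - F.softplus(scores_neg).mean()

def smie_style_loss(T_net, v_full, v_partial, a_pos, a_neg, lam=0.5, beta=0.1):
    # m / m_hat: MI estimates for the full and truncated (partial) sequences
    m = jsd_mi_lower_bound(T_net(v_full, a_pos), T_net(v_full, a_neg))
    m_hat = jsd_mi_lower_bound(T_net(v_partial, a_pos), T_net(v_partial, a_neg))
    # Global MI term plus a temporal margin: the full sequence should carry
    # at least beta more MI than the partial one.
    return -m + lam * torch.clamp(beta - (m - m_hat), min=0)
```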

Multi-Semantic, Part-, and Temporal-level Descriptions

Recent models leverage richer semantic representations. PURLS and DynaPURLS introduce part-aware and temporal-interval-aware alignment (Zhu et al., 19 Jun 2024, Zhu et al., 12 Dec 2025). Each skeleton class is annotated with a global description, body-part motion descriptions (e.g., head, hands, torso, legs), and temporal-phase descriptions (start, middle, end), generated via LLM prompts and encoded with CLIP. These local and global features are aligned using adaptive attention, producing a multi-scale correspondence between skeleton joints/motions and hierarchical semantics.
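
The core mechanism can be sketched as cross-attention in which part- and phase-description embeddings act as queries over per-joint skeleton features; the module below is a simplified illustration with assumed dimensions, not the exact PURLS/DynaPURLS architecture.

```python
import torch.nn as nn

class PartAwareAlignment(nn.Module):
    """Cross-attention with textual part/phase descriptions as queries over
    per-joint skeleton features; all dimensions are illustrative."""
    def __init__(self, d_vis=256, d_txt=512, d=256, n_heads=4):
        super().__init__()
        self.to_q = nn.Linear(d_txt, d)   # local descriptions -> attention queries
        self.to_kv = nn.Linear(d_vis, d)  # joint features -> keys/values
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, joint_feats, desc_emb):
        # joint_feats: (B, T*J, d_vis) spatio-temporal joint features
        # desc_emb:    (B, P, d_txt) embeddings of P part/phase descriptions
        q = self.to_q(desc_emb)
        kv = self.to_kv(joint_feats)
        part_vis, _ = self.attn(q, kv, kv)  # (B, P, d) description-conditioned features
        return part_vis  # each row is scored against its corresponding description
```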

Generative and Disentangled Models

Generative frameworks (e.g., MSF, SA-DVAE, FS-VAE) utilize variational autoencoders (VAEs) to bring skeleton and semantic features into a shared latent space (Li et al., 2023, Li et al., 18 Jul 2024, Wu et al., 27 Jun 2025). Disentanglement, as implemented in SA-DVAE, splits skeleton encodings into semantic-relevant and irrelevant parts, improving cross-modal alignment. FS-VAE also explicitly enriches skeleton features in frequency space, highlighting high-frequency (fine-grained motions) and low-frequency (global coordination) components.
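
The disentanglement step can be sketched as a VAE encoder with two latent heads, only one of which is aligned to the class semantics; the layer sizes and choices below are illustrative assumptions in the spirit of SA-DVAE, not its exact architecture.

```python
import torch
import torch.nn as nn

class DisentangledSkeletonEncoder(nn.Module):
    """Splits a skeleton feature into a semantic-relevant latent z_rel (aligned
    with the text embedding) and a semantic-irrelevant latent z_irr."""
    def __init__(self, d_in=256, d_hidden=256, d_rel=128, d_irr=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mu_rel = nn.Linear(d_hidden, d_rel)
        self.logvar_rel = nn.Linear(d_hidden, d_rel)
        self.mu_irr = nn.Linear(d_hidden, d_irr)
        self.logvar_irr = nn.Linear(d_hidden, d_irr)

    @staticmethod
    def reparameterize(mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, v):
        h = self.backbone(v)
        z_rel = self.reparameterize(self.mu_rel(h), self.logvar_rel(h))
        z_irr = self.reparameterize(self.mu_irr(h), self.logvar_irr(h))
        return z_rel, z_irr  # only z_rel participates in cross-modal alignment
```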

Diffusion and Flow Frameworks

TDSM reframes the alignment as a conditional denoising diffusion process, in which reverse-diffusion fuses skeleton and text embeddings. A triplet diffusion loss encourages correct skeleton-text alignment while pushing apart incorrect matches (Do et al., 16 Nov 2024). Flora transitions from point-to-point to point-to-region alignment via neighbor-aware semantic aggregation and further introduces a “flow classifier” that learns a velocity field transporting semantic embeddings to skeleton representations, supporting fine-grained token-level classifiers and open-form generalization (Chen et al., 12 Nov 2025).
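
A flow classifier of this kind can be sketched with generic flow matching: a velocity field is trained to transport semantic embeddings toward matched skeleton features along linear interpolation paths, and at test time the transported class embeddings are compared against the encoded skeleton. The sketch below is a textbook flow-matching illustration under assumed dimensions, not Flora's exact formulation.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Velocity field v_theta(z, t) over the shared embedding space."""
    def __init__(self, d=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d + 1, 512), nn.SiLU(), nn.Linear(512, d))

    def forward(self, z, t):
        # z: (B, d) point on the transport path; t: (B, 1) time in [0, 1]
        return self.net(torch.cat([z, t], dim=-1))

def flow_matching_loss(field, a, v):
    # a: class semantic embeddings (B, d); v: matched skeleton features (B, d)
    t = torch.rand(a.size(0), 1)
    z_t = (1 - t) * a + t * v   # linear interpolation path from text to skeleton
    target = v - a              # constant target velocity along that path
    return ((field(z_t, t) - target) ** 2).mean()

def transport(field, a, steps=10):
    # Euler integration: move class embeddings into the skeleton feature space;
    # a test skeleton is then assigned to the nearest transported class embedding.
    z, dt = a, 1.0 / steps
    for i in range(steps):
        t = torch.full((z.size(0), 1), i * dt)
        z = z + dt * field(z, t)
    return z
```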

3. Semantic Prompt Engineering and Multi-Granularity Representations

The design of class semantic prototypes plays a critical role. Early works used only class names; subsequent methods exploit external resources and prompt engineering. LLMs such as GPT-3/4 are prompted to produce:

  • Multiple auxiliary descriptions per class, including action definitions, part- and motion-specific descriptions, and context sentences.
  • Codebook-style descriptor pools, where each class is associated with a rich set of generated descriptors that are selectively aggregated via attention mechanisms (InfoCPL) (Xu et al., 2 Jun 2024).

Multi-level alignment and selective feature ensemble techniques (SFE, MLA) form branches that align visual features with semantic cues at different granularities. These mechanisms augment inter-class separation by leveraging the diversity of LLM outputs.
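
Attention-based selection over such a descriptor pool can be sketched as below; the similarity-softmax form and the `temperature` value are illustrative assumptions in the spirit of InfoCPL's codebook aggregation, not its exact mechanism.

```python
import torch.nn.functional as F

def aggregate_descriptors(v, descriptor_emb, temperature=0.07):
    # v: (B, d) normalized visual features
    # descriptor_emb: (K, d) normalized embeddings of K generated descriptions
    attn = F.softmax(v @ descriptor_emb.T / temperature, dim=-1)  # (B, K) weights
    return attn @ descriptor_emb  # (B, d) instance-adaptive semantic prototype
```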

4. Adaptive Partitioning, Dynamic Refinement, and Test-Time Adaptation

Fine-grained, local visual-semantic correspondences enhance ZS-SAR transferability. Adaptive partitioning modules group joints and frames semantically, attending to body parts and temporal intervals as guided by LLM-driven textual descriptions (Zhu et al., 19 Jun 2024, Zhu et al., 12 Dec 2025). Key innovations include:

  • Dynamic refinement modules (e.g., in DynaPURLS), which adapt the semantic prototypes during test-time inference using class-balanced memory banks and lightweight learnable projections, mitigating semantic-visual domain shift on-the-fly (Zhu et al., 12 Dec 2025).
  • Training-free adaptation using non-parametric caches (Skeleton-Cache), in which representative global and local descriptors of test samples are stored and retrieved at inference, with descriptor-wise predictions fused using LLM-derived class-specific weights, all without updating model parameters (Zhu et al., 12 Dec 2025); a minimal cache sketch follows this list.
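
The retrieval-and-fusion step of such a cache can be sketched in a Tip-Adapter style, where cached descriptors of confident test samples refine the zero-shot text scores without any parameter updates; the fusion rule and the `alpha`/`beta` hyperparameters below are illustrative assumptions, not Skeleton-Cache's exact design.

```python
import torch

def cache_adjusted_logits(v, cache_keys, cache_labels, text_logits, alpha=1.0, beta=5.0):
    # v:            (B, d) normalized descriptors of test samples
    # cache_keys:   (N, d) normalized descriptors stored in the cache
    # cache_labels: (N, C) soft/one-hot labels of the cached samples
    # text_logits:  (B, C) zero-shot scores against the class text embeddings
    affinity = torch.exp(-beta * (1.0 - v @ cache_keys.T))  # retrieval weights
    return text_logits + alpha * affinity @ cache_labels    # fused predictions
```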

5. Generative, Flow-based, and Prototype-guided Decision Mechanisms

ZS-SAR classifiers range from embedding similarity computations to generative models and distribution-aware decision rules:

  • VAE- and diffusion-based classifiers: Cross-modal VAEs support feature alignment and synthesis of unseen skeleton prototypes (Li et al., 2023, Li et al., 18 Jul 2024, Wu et al., 27 Jun 2025, Chen et al., 12 Nov 2025, Do et al., 16 Nov 2024). Diffusion-based methods enable reverse denoising processes that fuse skeleton and text features for alignment, regularized with triplet or contrastive diffusion losses.
  • Prototype-guided alignment: PGFA estimates unseen class centers (prototypes) from confidently pseudo-labeled test skeletons, correcting for the textual-visual misalignment caused by static text-only class anchors (Zhou et al., 1 Jul 2025). This approach is especially effective in transductive settings; a minimal prototype-estimation sketch follows this list.
  • Flow-based open-form classifiers: Flora models distributional transport of semantic to skeleton latents via a velocity ODE, enabling fine-grained, token-level decision boundaries robust to semantic imprecision and domain shift (Chen et al., 12 Nov 2025).
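
Prototype estimation from confident pseudo-labels can be sketched as follows; the confidence threshold `tau` and the hard-assignment rule are illustrative assumptions in the spirit of PGFA's transductive correction, not its exact procedure.

```python
import torch.nn.functional as F

def estimate_unseen_prototypes(v_test, text_emb, tau=0.9):
    # v_test:   (N, d) normalized features of unlabeled test skeletons
    # text_emb: (C, d) normalized text anchors of the unseen classes
    probs = F.softmax(v_test @ text_emb.T, dim=-1)
    conf, pseudo = probs.max(dim=-1)
    protos = text_emb.clone()
    for c in range(text_emb.size(0)):
        mask = (pseudo == c) & (conf > tau)
        if mask.any():  # replace the static text anchor with a visual prototype
            protos[c] = F.normalize(v_test[mask].mean(dim=0), dim=0)
    return protos  # used for final nearest-prototype classification
```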

6. Experimental Results and Empirical Insights

ZS-SAR progress is tracked by Top-1 accuracy on unseen classes (ZSL) and harmonic-mean scores of seen and unseen accuracy under GZSL. Benchmarks include NTU-RGB+D 60/120 and PKU-MMD, with various seen/unseen splits; selected state-of-the-art results on the NTU-60 splits are collected in the table at the end of this article.

7. Interpretability, Limitations, and Open Challenges

Current ZS-SAR research has illuminated several issues:

  • Fine-grained alignment is critical for separating visually or semantically similar actions; coarse, global-only mappings are insufficient (Chen et al., 11 Apr 2024, Zhu et al., 12 Dec 2025).
  • Domain shift between seen and unseen class distributions persists, motivating dynamic adaptation frameworks and prototype-guided corrections (Zhu et al., 12 Dec 2025, Zhou et al., 1 Jul 2025).
  • Interpretability is improved via explicit part/temporal alignment, prototype evolution, and velocity flow transitions, which reveal class- and instance-level alignment behavior (Chen et al., 18 Nov 2024, Chen et al., 11 Apr 2024, Chen et al., 12 Nov 2025).
  • Limitations remain: reliance on LLM prompt quality, sensitivity to hyperparameters, challenges in streaming or incremental settings, and restricted generalization to in-the-wild skeletons or multiple modalities.

Potential future directions include end-to-end multimodal modeling, unsupervised or continual ZS-SAR learning, adaptive partitioning per class/instance, enhanced uncertainty estimation for dynamic adaptation, and cross-modal large-scale pretraining for skeleton and language representations.


Table: Selected State-of-the-Art Results on NTU-60

| Method    | 55/5 Split (%) | 48/12 Split (%) | Source                    |
|-----------|----------------|-----------------|---------------------------|
| SynSE     | 33.30          | 28.96           | (Zhou et al., 2023)       |
| SMIE      | 77.98          | 40.18           | (Zhou et al., 2023)       |
| PURLS     | 79.23          | 40.99           | (Zhu et al., 19 Jun 2024) |
| STAR      | 81.40          | 45.10           | (Chen et al., 11 Apr 2024)|
| SA-DVAE   | 84.20          | —               | (Li et al., 18 Jul 2024)  |
| TDSM      | 88.88          | —               | (Do et al., 16 Nov 2024)  |
| DynaPURLS | 88.52          | 71.80           | (Zhu et al., 12 Dec 2025) |
| Flora     | 85.60          | —               | (Chen et al., 12 Nov 2025)|
| PGFA      | 93.20          | —               | (Zhou et al., 1 Jul 2025) |

Numbers may vary across experimental settings (main/restricted splits, random/optimized seeds, encoder backbones); dashes indicate results not reported for that split.


Zero-shot skeleton-based action recognition has rapidly advanced through multi-granularity semantics, fine-grained visual partitioning, dynamic adaptation, and cross-modal generative and flow-based modeling, setting an evolving state-of-the-art on challenging human action benchmarks.
