Skeleton-Based Zero-Shot Action Recognition
- Skeleton-Based Zero-Shot Action Recognition (SZAR) is the task of recognizing unseen 3D human actions from skeleton data; Prototype-Guided Feature Alignment (PGFA) addresses it by combining semantic embeddings with prototype adaptation.
- The method employs a dual-encoder system with graph-based skeleton encoders and frozen text encoders, optimizing bidirectional cross-modal contrastive loss to enhance feature discrimination.
- Empirical findings show that prototype-guided alignment significantly boosts intra-class compactness and accuracy, though its reliance on batch-level access to test samples limits real-time applicability.
Skeleton-Based Zero-Shot Action Recognition (SZAR) involves learning to recognize human actions from 3D skeleton sequences for classes not present during training, using shared semantic information—typically in the form of textual descriptions or embeddings—to enable generalization. This problem occupies a critical intersection of action understanding, cross-modal transfer, and semantic representation, presenting unique challenges due to modality heterogeneity and severe sample asymmetry between skeleton data and class-level semantics.
1. Task Formulation and Fundamental Challenges
In SZAR, let $\mathcal{X}$ denote the domain of 3D skeleton sequences (each an array of joint coordinates over time) and $\mathcal{S}$ the space of semantic class descriptions. The action classes are partitioned into disjoint sets of seen classes $\mathcal{Y}_s$ and unseen classes $\mathcal{Y}_u$ (with $\mathcal{Y}_s \cap \mathcal{Y}_u = \emptyset$), with training set $\mathcal{D}_{tr} = \{(x_i, y_i)\}$ ($y_i$ from $\mathcal{Y}_s$) and test set $\mathcal{D}_{te} = \{x_j\}$ (labels from $\mathcal{Y}_u$). The objective is to construct a classifier $f: \mathcal{X} \to \mathcal{Y}_u$ solely by leveraging domain knowledge from $\mathcal{D}_{tr}$ and semantic embeddings for $\mathcal{Y}_s \cup \mathcal{Y}_u$.
SZAR is fundamentally challenged by:
- Insufficient Skeleton Discrimination: Skeleton encoders trained purely for seen-class classification (cross-entropy) often produce features with high intra-class variance and poor inter-class separability, which hinders semantic alignment.
- Semantic Alignment Bias: Alignment learned on does not necessarily transfer to due to distributional shifts; direct comparison with static unseen-class semantic embeddings often yields systematic misalignment and decision bias.
- Cross-Modal Gap: Semantics (class descriptions, contextual narratives) and skeleton signals are inherently heterogeneous, complicating the construction of a truly shared latent space.
2. Prototype-Guided Feature Alignment (PGFA): Architecture and Methodology
PGFA addresses SZAR by constructing an end-to-end skeleton-text alignment model, with the following core modules (Zhou et al., 1 Jul 2025):
- Skeleton Encoder $f_\theta$: graph-based networks such as ST-GCN or Shift-GCN, mapping $\mathcal{X} \to \mathbb{R}^{d_s}$.
- Text Encoder $g$: frozen Sentence-BERT, extracting an embedding $t_c \in \mathbb{R}^{d_t}$ from each class description.
- Linear Projection $W \in \mathbb{R}^{d_s \times d_t}$: projects skeleton features into the text space so that the two modalities are dimension-matched.
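The dual-encoder layout above can be sketched in NumPy. The dimensions, random stand-in features, and projection initialization are illustrative assumptions; in PGFA the skeleton features come from ST-GCN/Shift-GCN and the text features from a frozen Sentence-BERT:

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative dimensions (assumptions): skeleton feature dim, text feature dim,
# batch size, number of seen classes
d_s, d_t, B, C = 256, 384, 8, 5

# stand-ins for encoder outputs; the real model produces these with a graph-based
# skeleton encoder and a frozen Sentence-BERT text encoder
skel_feats = rng.standard_normal((B, d_s))
text_feats = rng.standard_normal((C, d_t))

# learnable linear projection that dimension-matches skeleton features to text space
W = 0.01 * rng.standard_normal((d_s, d_t))

proj = skel_feats @ W                                   # (B, d_t)
proj /= np.linalg.norm(proj, axis=1, keepdims=True)     # L2-normalize for cosine similarity
text = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

sims = proj @ text.T                                    # (B, C) cosine similarities
print(sims.shape)
```

With both sides L2-normalized, the inner product is exactly the cosine similarity used for alignment and pseudo-labeling.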
Training Pipeline:
- Input mini-batch: $\{(x_i, y_i)\}_{i=1}^{B}$ with $y_i$ drawn from the seen classes
- Compute skeleton features $z_i = f_\theta(x_i)$, then project $\tilde{z}_i = W z_i$
- Fix text features $t_c = g(s_c)$ for each class $c$ (text encoder frozen)
- Optimize a KL-divergence–based cross-modal contrastive loss with bidirectional supervision (details in §3)
- Parameters $(\theta, W)$ are updated by SGD
Test-Time Prototype Alignment:
- Extract test features $\tilde{z}_j = W f_\theta(x_j)$ for each test sample $x_j$
- Assign initial pseudo-label $\hat{y}_j = \arg\max_{c \in \mathcal{Y}_u} \cos(\tilde{z}_j, t_c)$
- Aggregate pseudo-labeled test samples into per-class sets $\mathcal{D}_c$, filtering out low-confidence assignments by an entropy threshold
- Compute each class prototype $p_c$ as the mean of the normalized features in $\mathcal{D}_c$
- Predict by $\hat{y} = \arg\max_{c \in \mathcal{Y}_u} \cos(\tilde{z}, p_c)$
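The test-time procedure can be sketched with NumPy on stand-in features. The temperature, the entropy threshold, and the fallback to the text anchor for classes with no confident samples are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

# stand-ins: projected, L2-normalized test features and unseen-class text anchors
B, C, d = 64, 4, 16
feats = rng.standard_normal((B, d))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
anchors = rng.standard_normal((C, d))
anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)

tau = 0.07                                   # temperature (assumed value)
logits = feats @ anchors.T / tau
probs = np.exp(logits - logits.max(1, keepdims=True))
probs /= probs.sum(1, keepdims=True)

pseudo = probs.argmax(1)                     # initial pseudo-labels from static text anchors
keep = entropy(probs) < 0.5 * np.log(C)      # entropy filter (threshold is an assumption)

# class prototypes: mean of normalized features over confident pseudo-labeled samples;
# falling back to the text anchor for empty classes is an assumption of this sketch
protos = anchors.copy()
for c in range(C):
    mask = keep & (pseudo == c)
    if mask.any():
        m = feats[mask].mean(0)
        protos[c] = m / np.linalg.norm(m)

final = (feats @ protos.T).argmax(1)         # prototype-based prediction
print(final.shape)
```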
3. Bidirectional Cross-Modal Contrastive Learning
Instead of standard InfoNCE, PGFA's contrastive loss is a bidirectional KL-divergence formulation that accounts for multiple instances sharing the same class in a batch. For a batch of size $B$, the similarity matrix is $S_{ij} = \tilde{z}_i^\top t_{y_j} / \tau$ (with learnable temperature $\tau$). Softmax is computed both row- and column-wise to respect the skeleton-to-text and text-to-skeleton directions respectively:

$$p^{x \to t}_{ij} = \frac{\exp(S_{ij})}{\sum_k \exp(S_{ik})}, \qquad p^{t \to x}_{ij} = \frac{\exp(S_{ij})}{\sum_k \exp(S_{kj})}$$

The ground-truth distributions $q^{x \to t}$ and $q^{t \to x}$ assign positive mass uniformly to all same-label batch entries. The loss is:

$$\mathcal{L} = \frac{1}{B} \left[ \mathrm{KL}\!\left(q^{x \to t} \,\middle\|\, p^{x \to t}\right) + \mathrm{KL}\!\left(q^{t \to x} \,\middle\|\, p^{t \to x}\right) \right]$$
This enables robust alignment in presence of label duplication and improves feature discrimination.
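A minimal NumPy sketch of this bidirectional KL loss, assuming uniform ground-truth mass over same-label entries and a fixed temperature (in PGFA the temperature is learnable):

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_kl_loss(z, t, labels, tau=0.07):
    """z: (B, d) L2-normalized skeleton features; t: (B, d) text embeddings, where
    row i is the embedding of sample i's class; labels: (B,) integer class ids."""
    S = z @ t.T / tau                     # similarity matrix with temperature
    p_xt = softmax(S, axis=1)             # skeleton -> text (row-wise)
    p_tx = softmax(S, axis=0)             # text -> skeleton (column-wise)

    # ground truth: positive mass spread uniformly over all same-label batch entries
    same = (labels[:, None] == labels[None, :]).astype(float)
    q_xt = same / same.sum(axis=1, keepdims=True)
    q_tx = same / same.sum(axis=0, keepdims=True)

    eps = 1e-12
    kl_xt = np.sum(q_xt * (np.log(q_xt + eps) - np.log(p_xt + eps)))
    kl_tx = np.sum(q_tx * (np.log(q_tx + eps) - np.log(p_tx + eps)))
    return (kl_xt + kl_tx) / len(z)

# usage: batch of 4 with a duplicated class (labels 0, 0, 1, 2)
rng = np.random.default_rng(0)
class_emb = rng.standard_normal((3, 8))
class_emb /= np.linalg.norm(class_emb, axis=1, keepdims=True)
labels = np.array([0, 0, 1, 2])
z = rng.standard_normal((4, 8))
z /= np.linalg.norm(z, axis=1, keepdims=True)
loss = bidirectional_kl_loss(z, class_emb[labels], labels)
print(loss)
```

Note how the duplicated label splits the target mass across both matching entries, which standard one-hot InfoNCE targets would mishandle.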
4. Prototype-Guided Adaptation and Theoretical Guarantees
To address the misalignment of static semantic text anchors on the unseen classes $\mathcal{Y}_u$, PGFA proposes updating the class text prototypes to reflect the true geometric centers of inferred skeleton features for each unseen class. The mechanism is:

$$p_c = \frac{\sum_{j \in \mathcal{D}_c} \bar{z}_j}{\left\| \sum_{j \in \mathcal{D}_c} \bar{z}_j \right\|}, \qquad \bar{z}_j = \frac{\tilde{z}_j}{\|\tilde{z}_j\|}$$

This is theoretically justified under the assumption that normalized skeleton features for unseen class $c$ are von Mises–Fisher distributed with mean direction $\mu_c$, implying that the empirical prototype $p_c$ converges to $\mu_c$ as $|\mathcal{D}_c| \to \infty$. As such, classifying by $\hat{y} = \arg\max_c \cos(\bar{z}, p_c)$ is asymptotically optimal for the class-conditional distribution.
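The convergence claim is easy to check numerically. The sketch below uses Gaussian-perturbed unit directions as a cheap stand-in for vMF sampling (a simplifying assumption; the noise level is arbitrary), and shows the empirical prototype's cosine with the true mean direction approaching 1 as the sample count grows:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# true mean direction mu_c of the (approximate) vMF class distribution
mu = rng.standard_normal(d)
mu /= np.linalg.norm(mu)

def empirical_prototype(n, noise=0.5):
    # Gaussian-perturbed directions around mu, normalized onto the unit sphere
    # (stand-in for vMF sampling; noise level is an arbitrary choice)
    x = mu + noise * rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    m = x.mean(axis=0)
    return m / np.linalg.norm(m)

for n in (10, 100, 10000):
    print(n, float(empirical_prototype(n) @ mu))
```

By symmetry of the perturbation about the $\mu$ axis, the perpendicular components average out, so the normalized empirical mean aligns with $\mu$ at rate governed by $1/\sqrt{n}$.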
Entropy-based filtering ensures that only high-confidence samples contribute to prototype estimation, further mitigating noisy attribution.
5. Training Paradigms, Implementation, and Empirical Findings
PGFA contrasts three skeletal feature learning schemes:
- Pretrain skeleton encoder with CE, freeze, and align with a separate alignment stage.
- Pretrain with CE, then fine-tune under contrastive loss.
- Full end-to-end cross-modal contrastive training (PGFA paradigm).
Empirically, full end-to-end contrastive learning achieves substantially higher intra-class compactness (quantified by Fisher Discrimination Ratio and silhouette scores) and test accuracy, due to consistently aligned skeletal and semantic features.
Performance gains are robust across major evaluation protocols:
- NTU-60 (55/5): 93.17% (PGFA) vs. 70.21% (prior SMIE)
- NTU-120 (110/10): 71.38% vs. 58.85%
- PKU-MMD I (46/5): 87.80% vs. 69.26%
Action description granularity also significantly influences results: using complete descriptions or skeleton-focused text yields up to 15% improvement over class names alone.
Ablation studies indicate that prototype-guided adaptation provides roughly a 10% absolute accuracy gain over static semantic matching, and that the method is robust to the choice of entropy threshold within reasonable ranges.
6. Limitations, Open Issues, and Future Directions
A major limitation of the PGFA paradigm is its reliance on access to all test samples (or at least a substantial batch) before class prototypes can be computed, restricting applicability in real-time or online inference scenarios. Current prototype estimation is non-incremental; handling samples that arrive one at a time, without batch context, remains unresolved.
Future research is directed at developing incremental or streaming prototype banks, enabling prototypes to be updated online with each sample. There is also scope for strengthening robustness against semantic ambiguity, adapting to evolving class vocabularies, and integrating dynamic context or part-level motion cues.
7. Context and Comparative Position
PGFA is situated at the frontier of cross-modal alignment methods for SZAR, substantially outperforming mutual information maximization (SMIE), generative VAE-based alignment (MSF, SynSE), and earlier joint embedding techniques (Zhou et al., 1 Jul 2025, Li et al., 2023, Gupta et al., 2021). It makes explicit the critical role of prototype adaptation and end-to-end contrastive feature learning. The broad trend in the field is migration toward multi-granular semantic description, fine-grained alignment, and dynamic or adaptive prototype estimation—a direction reinforced by the success of the prototype-guided paradigm.
PGFA's advances have altered benchmarks for SZAR, establishing new baselines for both accuracy and robustness to domain shift. It crystallizes current understanding of how class-level skeleton distributions and semantic prototypes should interact in the zero-shot regime.