Skeleton-Based Zero-Shot Action Recognition

Updated 19 December 2025
  • Skeleton-Based Zero-Shot Action Recognition (SZAR) is a framework that uses semantic embeddings and prototype adaptation to recognize unseen 3D human actions from skeleton data.
  • The method employs a dual-encoder system with graph-based skeleton encoders and frozen text encoders, optimizing bidirectional cross-modal contrastive loss to enhance feature discrimination.
  • Empirical findings show that prototype-guided alignment significantly boosts intra-class compactness and accuracy, though its reliance on batch processing limits real-time applicability.

Skeleton-Based Zero-Shot Action Recognition (SZAR) involves learning to recognize human actions from 3D skeleton sequences for classes not present during training, using shared semantic information—typically in the form of textual descriptions or embeddings—to enable generalization. This problem occupies a critical intersection of action understanding, cross-modal transfer, and semantic representation, presenting unique challenges due to modality heterogeneity and severe sample asymmetry between skeleton data and class-level semantics.

1. Task Formulation and Fundamental Challenges

In SZAR, let $\mathcal{X}$ denote the domain of 3D skeleton sequences (each an array of joint coordinates over time) and $\mathcal{T}$ the space of semantic class descriptions. The action classes are partitioned into disjoint sets of seen classes $\mathcal{C}^s$ and unseen classes $\mathcal{C}^u$, with training set $\mathcal{D}^s = \{(x_i^s, t_i^s)\}_{i=1}^{N^s}$ (each $x_i^s$ drawn from $\mathcal{C}^s$) and test set $\mathcal{D}^u = \{x_j^u\}_{j=1}^{N^u}$ (each $x_j^u$ drawn from $\mathcal{C}^u$). The objective is to construct a classifier $f: \mathcal{X} \to \mathcal{C}^u$ solely by leveraging domain knowledge from $\mathcal{D}^s$ and semantic embeddings for $\mathcal{C}^u$.
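A minimal sketch of the resulting zero-shot decision rule — compare a projected skeleton feature against the unseen-class semantic embeddings and take the best cosine match. The function name and array shapes are illustrative, not the paper's interface:

```python
import numpy as np

def zero_shot_classify(v, unseen_text_emb):
    """Pick the unseen class whose semantic embedding is most similar.

    v:               (D,) projected skeleton feature of one test sequence.
    unseen_text_emb: (K, D) semantic embeddings, one row per unseen class.
    Returns the index of the best-matching unseen class.
    """
    v = v / np.linalg.norm(v)
    t = unseen_text_emb / np.linalg.norm(unseen_text_emb, axis=1, keepdims=True)
    return int(np.argmax(t @ v))
```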

SZAR is fundamentally challenged by:

  • Insufficient Skeleton Discrimination: Skeleton encoders trained purely for seen-class classification (cross-entropy) often produce features with high intra-class variance and poor inter-class separability, which hinders semantic alignment.
  • Semantic Alignment Bias: Alignment learned on $\mathcal{C}^s$ does not necessarily transfer to $\mathcal{C}^u$ due to distributional shifts; direct comparison with static unseen-class semantic embeddings often yields systematic misalignment and decision bias.
  • Cross-Modal Gap: Semantics (class descriptions, contextual narratives) and skeleton signals are inherently heterogeneous, complicating the construction of a truly shared latent space.

2. Prototype-Guided Feature Alignment (PGFA): Architecture and Methodology

PGFA addresses SZAR by constructing an end-to-end skeleton-text alignment model, with the following core modules (Zhou et al., 1 Jul 2025):

  • Skeleton Encoder $E_x$: a graph-based network such as ST-GCN or Shift-GCN, mapping $x \mapsto h_x \in \mathbb{R}^{d_x}$.
  • Text Encoder $E_t$: a frozen Sentence-BERT extracting $h_t \in \mathbb{R}^{d_t}$ from class descriptions.
  • Linear Projection $\psi$: $h_x \mapsto v_x \in \mathbb{R}^{D}$ (where $D = 768$); text features $v_t$ are dimension-matched.
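The dual-encoder plumbing can be checked with stand-in components. The real model uses ST-GCN/Shift-GCN for $E_x$ and a frozen Sentence-BERT for $E_t$; here random linear maps and features (all names and dimensions besides $D = 768$ are illustrative) verify only shapes and normalization:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 768          # shared embedding dimension (Sentence-BERT width)
d_x = 256        # hypothetical skeleton-feature width
T_frames, J = 30, 25

# Stand-ins for E_x and psi: random linear maps used only to check the
# cross-modal plumbing; a graph encoder would replace W_enc in practice.
W_enc = rng.standard_normal((T_frames * J * 3, d_x)) * 0.01
W_psi = rng.standard_normal((d_x, D)) * 0.01

x = rng.standard_normal((4, T_frames, J, 3))   # 4 skeleton sequences
h = x.reshape(len(x), -1) @ W_enc              # h_x = E_x(x)
v = h @ W_psi                                  # v_x = psi(h_x), in R^768
u = rng.standard_normal((4, D))                # stand-in for frozen E_t(t)

# L2-normalize both modalities before computing cosine similarities s_ij.
v = v / np.linalg.norm(v, axis=1, keepdims=True)
u = u / np.linalg.norm(u, axis=1, keepdims=True)
s = v @ u.T                                    # (4, 4) similarity matrix
```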

Training Pipeline:

  1. Input a mini-batch $\{x_i^s, t_i^s\}_{i=1}^{b}$.
  2. Compute skeleton features $h_i^s = E_x(x_i^s)$, then project $v_i^s = \psi(h_i^s)$.
  3. Fix the text features $u_i^s = E_t(t_i^s)$.
  4. Optimize the KL-divergence–based cross-modal contrastive loss $\mathcal{L}_{KL}$ with bidirectional supervision (details in §3).
  5. Update the parameters of $E_x$ and $\psi$ by SGD.

Test-Time Prototype Alignment:

  1. Extract test features $v_j^u = \psi(E_x(x_j^u))$.
  2. Assign an initial pseudo-label $\hat{y}_j = \arg\max_k \mathrm{sim}(v_j^u, u^{u,k})$.
  3. Aggregate the $v_j^u$ over pseudo-labeled test samples, filtering out low-confidence assignments by an entropy threshold.
  4. Compute each class prototype $p^k$ as the mean of the normalized features in $S^k$.
  5. Predict by $\tilde{y}_j = \arg\max_k \mathrm{sim}(v_j^u, p^k)$.
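The five test-time steps can be sketched as follows. The function name, the temperature `tau`, and the exact form of the entropy filter `alpha` are illustrative assumptions, not the paper's interface:

```python
import numpy as np

def prototype_align(V, text_emb, tau=0.1, alpha=0.5):
    """Sketch of test-time prototype alignment (steps 1-5 above).

    V:        (N, D) projected test features (step 1, assumed precomputed).
    text_emb: (K, D) unseen-class semantic embeddings u^{u,k}.
    tau:      softmax temperature (illustrative choice).
    alpha:    entropy threshold; samples whose pseudo-label entropy
              exceeds it are dropped from prototype estimation (an
              assumed form of the confidence filter).
    """
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    T = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    sim = V @ T.T                                   # cosine similarities
    logits = sim / tau
    logits -= logits.max(axis=1, keepdims=True)     # stable softmax
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)

    pseudo = sim.argmax(axis=1)                     # step 2: pseudo-labels
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)  # step 3: confidence filter
    keep = entropy < alpha

    protos = T.copy()                               # fall back to u^{u,k} if S^k is empty
    for k in range(T.shape[0]):
        members = V[keep & (pseudo == k)]
        if len(members):
            protos[k] = members.mean(axis=0)        # step 4: prototype p^k
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)

    return (V @ protos.T).argmax(axis=1)            # step 5: final labels
```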

3. Bidirectional Cross-Modal Contrastive Learning

Instead of standard InfoNCE, PGFA's contrastive loss is a bidirectional KL-divergence formulation that accounts for multiple instances sharing the same class in a batch. For a batch $\{v_i^s, u_j^s\}$, the similarity matrix is $s_{ij} = \cos(v_i^s, u_j^s)/\tau$ (with learnable temperature $\tau$). Softmax is computed both row- and column-wise to respect the skeleton-to-text and text-to-skeleton directions respectively:

$$p_i^{x\to t}(j) = \frac{\exp(s_{ij})}{\sum_{\ell=1}^{b} \exp(s_{i\ell})}, \qquad p_j^{t\to x}(i) = \frac{\exp(s_{ij})}{\sum_{\ell=1}^{b} \exp(s_{\ell j})}$$

The ground-truth distributions $m_i^{x\to t}$ and $m_i^{t\to x}$ assign positive mass to all same-label batch entries. The loss is:

$$\mathcal{L}_{KL} = \frac{1}{2} \sum_{i=1}^{b} \Big[ \mathrm{KL}\big(p_i^{x\to t} \,\Vert\, m_i^{x\to t}\big) + \mathrm{KL}\big(p_i^{t\to x} \,\Vert\, m_i^{t\to x}\big) \Big]$$

This enables robust alignment in the presence of label duplication and improves feature discrimination.
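The loss above can be sketched in NumPy. Two assumptions are made here: the ground-truth mass is spread uniformly over same-label entries, and the temperature is fixed rather than learnable:

```python
import numpy as np

def kl_contrastive_loss(v, u, labels, tau=0.07):
    """Sketch of the bidirectional KL contrastive loss L_KL.

    v, u:   (b, D) projected skeleton features and text features.
    labels: (b,) class ids; duplicated classes share positive mass,
            assumed spread uniformly over same-label batch entries.
    tau:    temperature, fixed here rather than learnable.
    """
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    s = (v @ u.T) / tau                       # s_ij = cos(v_i, u_j) / tau

    def softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    p_xt = softmax(s, axis=1)                 # row-wise: skeleton -> text
    p_tx = softmax(s, axis=0).T               # column-wise: text -> skeleton

    # Ground-truth distributions: uniform over same-label entries.
    m = (labels[:, None] == labels[None, :]).astype(float)
    m /= m.sum(axis=1, keepdims=True)

    def kl(p, q, eps=1e-12):                  # sum_i KL(p_i || q_i)
        return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

    return 0.5 * (kl(p_xt, m) + kl(p_tx, m))
```

With well-aligned features the loss is near zero; mismatched pairings drive it up sharply, which is the gradient signal that pulls each skeleton feature toward its class text embedding.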

4. Prototype-Guided Adaptation and Theoretical Guarantees

To address the misalignment of static semantic text anchors on $\mathcal{C}^u$, PGFA updates the class text prototypes to reflect the true geometric centers of the inferred skeleton features for each unseen class. The mechanism is:

$$S^k = \left\{ v_j^u / \|v_j^u\| \;\middle|\; \hat{y}_j = k \right\}, \qquad p^k = \begin{cases} \dfrac{1}{|S^k|} \displaystyle\sum_{v \in S^k} v, & |S^k| > 0 \\ u^{u,k}, & |S^k| = 0 \end{cases}$$

This is theoretically justified under the assumption that normalized skeleton features for unseen class $k$ follow a von Mises–Fisher distribution with mean direction $\mu_k$, implying that the empirical prototype $p^k$ converges to $\mu_k$ as $|S^k| \to \infty$. Classifying by $\arg\max_k \cos(v, p^k)$ is therefore asymptotically optimal for the class-conditional distribution.
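A quick numerical check of this convergence claim. Gaussian perturbations projected back to the unit sphere serve as a stand-in for true vMF sampling (the noise model and function name are illustrative); the point is only that the normalized empirical mean concentrates on $\mu_k$ as the sample count grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_prototype(mu, noise, n):
    """Normalized mean of n unit features scattered around direction mu.

    Gaussian perturbation + renormalization approximates sampling
    directions concentrated around mu, enough to observe the empirical
    prototype's direction converging to mu.
    """
    x = mu + noise * rng.standard_normal((n, mu.size))
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    p = x.mean(axis=0)
    return p / np.linalg.norm(p)

mu = np.array([1.0, 0.0, 0.0])
p_small = empirical_prototype(mu, 0.3, 10)      # noisy estimate of mu
p_large = empirical_prototype(mu, 0.3, 10000)   # cos(p_large, mu) approaches 1
```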

Entropy-based filtering ensures that only high-confidence samples contribute to prototype estimation, further mitigating noisy attribution.

5. Training Paradigms, Implementation, and Empirical Findings

PGFA contrasts three skeletal feature learning schemes:

  • Pretrain the skeleton encoder with cross-entropy (CE), freeze it, and learn cross-modal alignment in a separate stage.
  • Pretrain with CE, then fine-tune under contrastive loss.
  • Full end-to-end cross-modal contrastive training (PGFA paradigm).

Empirically, full end-to-end contrastive learning achieves substantially higher intra-class compactness (quantified by Fisher Discrimination Ratio and silhouette scores) and test accuracy, due to consistently aligned skeletal and semantic features.

Performance gains are robust across major evaluation protocols:

  • NTU-60 (55/5): 93.17% (PGFA) vs. 70.21% (prior SMIE)
  • NTU-120 (110/10): 71.38% vs. 58.85%
  • PKU-MMD I (46/5): 87.80% vs. 69.26%

Action description granularity also significantly influences results: using complete descriptions or skeleton-focused text yields up to 15% improvement over class names alone.

Ablation studies indicate that prototype-guided adaptation provides roughly a 10% absolute accuracy gain over static semantic matching, and the method is robust to the entropy threshold $\alpha$ within reasonable ranges.

6. Limitations, Open Issues, and Future Directions

A major limitation of the PGFA paradigm is its reliance on access to all test samples (or at least a substantial batch) prior to computation of class prototypes, restricting applicability in real-time or online inference scenarios. Current prototyping is non-incremental; handling single-stream data without batch context remains unresolved.

Future research is directed at developing incremental or streaming prototype banks, enabling prototypes to be updated online with each sample. There is also scope for strengthening robustness against semantic ambiguity, adapting to evolving class vocabularies, and integrating dynamic context or part-level motion cues.

7. Context and Comparative Position

PGFA is situated at the frontier of cross-modal alignment methods for SZAR, substantially outperforming mutual information maximization (SMIE), generative VAE-based alignment (MSF, SynSE), and earlier joint embedding techniques (Zhou et al., 1 Jul 2025, Li et al., 2023, Gupta et al., 2021). It makes explicit the critical role of prototype adaptation and end-to-end contrastive feature learning. The broad trend in the field is migration toward multi-granular semantic description, fine-grained alignment, and dynamic or adaptive prototype estimation—a direction reinforced by the success of the prototype-guided paradigm.

PGFA's advances have altered benchmarks for SZAR, establishing new baselines for both accuracy and robustness to domain shift. It crystallizes current understanding of how class-level skeleton distributions and semantic prototypes should interact in the zero-shot regime.
