
Generalized Zero-Shot Action Recognition

Updated 9 February 2026
  • Generalized Zero-Shot Action Recognition is a paradigm in video action recognition that uses auxiliary semantic information to classify actions without prior visual examples.
  • Key methodologies include compatibility learning, generative models (GANs/VAEs), and clustering techniques that align visual features with semantic embeddings.
  • Evaluation protocols focus on calibration challenges and harmonic mean accuracy to balance performance between seen and unseen classes.

Generalized Zero-Shot Action Recognition (GZSL) is a paradigm in video action recognition where the objective is to recognize both previously observed (seen) and completely novel (unseen) action categories at test time, even though no visual examples of the unseen categories were available during training. Unlike classical supervised learning, which requires labeled examples from every class, GZSL enables recognition of instances from action classes using only auxiliary semantic information (such as attributes, word embeddings, or textual descriptions) that relate unseen to seen classes. GZSL extends the classical zero-shot setting—which restricts predictions to strictly unseen classes—by allowing any test instance to belong to either seen or unseen categories, introducing a fundamental calibration challenge and a strong bias toward seen classes.

1. Problem Formulation and Semantic Representation

GZSL operates on a training set $\mathcal{S} = \{(x_i, y_i, e(y_i))\}$, where $x_i \in \mathbb{R}^d$ is a feature vector representing a video, $y_i$ is its action class in the seen set $\mathcal{Y}^s$, and $e(y_i) \in \mathbb{R}^m$ is a semantic embedding (from attributes or language). At test time, instances $x$ can belong either to $\mathcal{Y}^s$ or to a disjoint unseen set $\mathcal{Y}^u$, whose classes are defined only by their semantic representations $e(u)$, with no visual samples (Liu et al., 2017, Gowda et al., 2021, Mandal et al., 2019). The unified label space is thus $\mathcal{Y}^s \cup \mathcal{Y}^u$.
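
The prediction rule implied by this formulation can be sketched as a nearest-class search in the semantic space over the unified label space. A minimal illustration (function and variable names are ours, not from any cited method; a real system would first project video features into $\mathbb{R}^m$ with a learned mapping):

```python
import numpy as np

def gzsl_predict(x_emb, class_embs):
    """Assign a video's projected embedding to the nearest class embedding
    (cosine similarity) over the union of seen and unseen classes.

    x_emb      : (m,) visual feature already projected into semantic space
    class_embs : dict mapping class name -> (m,) semantic embedding e(y),
                 covering both seen and unseen labels
    """
    names = list(class_embs)
    E = np.stack([class_embs[n] for n in names])              # (C, m)
    sims = (E @ x_emb) / (
        np.linalg.norm(E, axis=1) * np.linalg.norm(x_emb) + 1e-8
    )
    return names[int(np.argmax(sims))]
```

Because seen classes tend to score higher under such a rule, GZSL methods add calibration or gating on top of it, as discussed below.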

Semantic embeddings are central: models rely on their transferability to link unseen actions to the visual groundings of seen ones. Common sources include manually defined attributes, distributed word embeddings such as word2vec applied to class names, and textual descriptions of actions.

2. Core Methodological Approaches

GZSL methods for action recognition can be organized around several key model archetypes:

(a) Compatibility Learning and Semantic Alignment

Early approaches—including ConSE, SynC, and LatEm—focus on learning visual-semantic compatibility either via direct mapping into the semantic space (ConSE: convex combination of the semantic vectors of the top seen classes) or through synthesized classifiers (SynC: phantom-anchored ensembles; LatEm: piecewise linear models) (Liu et al., 2017). All rely on a shared semantic embedding space and nearest-neighbor or classifier-based prediction across $\mathcal{Y}^s \cup \mathcal{Y}^u$.
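
The ConSE-style convex combination mentioned above is simple enough to sketch directly: the semantic embedding of a test video is a confidence-weighted mixture of the embeddings of its top-$k$ most probable seen classes (a simplified rendering, assuming softmax scores from a seen-class classifier are already available):

```python
import numpy as np

def conse_embedding(probs, seen_embs, top_k=5):
    """ConSE-style convex combination: mix the semantic vectors of the
    top-k most probable seen classes, weighted by classifier confidence.

    probs     : (S,) softmax scores over seen classes
    seen_embs : (S, m) semantic embeddings of the seen classes
    """
    top = np.argsort(probs)[::-1][:top_k]   # indices of top-k seen classes
    w = probs[top] / probs[top].sum()       # renormalize weights over top-k
    return (w[:, None] * seen_embs[top]).sum(axis=0)
```

The resulting vector is then matched against unseen (or all) class embeddings by nearest-neighbor search.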

(b) Generative and Probabilistic Modeling

Generative approaches construct explicit distributions over visual features for each class, using attribute-informed parameterizations:

  • Gaussian Mixture Models: Each class $c$ is modeled as $p(x \mid c) = \mathcal{N}(x \mid \mu_c, \Sigma_c)$, with $\mu_c$ and (optionally) $\Sigma_c$ specified as (possibly nonlinear) functions of $e(c)$ (Mishra et al., 2018). Parameters are learned on seen classes, then extrapolated to unseen classes via their semantic embeddings. Transductive variants employ EM adaptation to calibrate unseen distributions using unlabeled test data, reducing domain shift (Mishra et al., 2018).
  • Conditional Generative Adversarial Networks (GANs): Synthesize visual features for unseen classes conditioned on $e(u)$ using a conditional WGAN-GP with auxiliary cycle-consistency and cosine-embedding losses (Mandal et al., 2019). This allows training standalone classifiers for $\mathcal{Y}^u$, as well as supporting out-of-distribution detection techniques.
  • Variational Autoencoders (VAEs): Learn multimodal alignment (e.g., skeleton action features and PoS-tagged class embeddings), with intra-modal VAE losses and cross-modal reconstruction constraints, to support compositional generalization in skeleton-based recognition (Gupta et al., 2021).
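
The common thread in these generative approaches is mapping semantic embeddings to class-conditional feature distributions, then sampling pseudo features for unseen classes. A deliberately minimal sketch (linear mean regression with spherical noise; the cited methods use learned nonlinear generators, and all names here are illustrative):

```python
import numpy as np

def fit_mean_regressor(sem, mu, lam=1e-2):
    """Ridge-regress seen-class feature means onto semantic embeddings,
    so that mu_c is approximated by e(c) @ W.

    sem : (S, m) seen-class semantic embeddings
    mu  : (S, d) seen-class mean visual features
    """
    m = sem.shape[1]
    W = np.linalg.solve(sem.T @ sem + lam * np.eye(m), sem.T @ mu)
    return W  # (m, d)

def synthesize_features(W, e_u, n=100, sigma=0.1, rng=None):
    """Sample pseudo visual features for an unseen class from
    N(e_u @ W, sigma^2 I); these can train a classifier for unseen labels."""
    rng = rng or np.random.default_rng(0)
    mean = e_u @ W
    return mean + sigma * rng.standard_normal((n, mean.shape[0]))
```

Once synthesized, unseen-class features are treated like ordinary labeled data, turning GZSL into standard supervised classification over $\mathcal{Y}^s \cup \mathcal{Y}^u$.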

(c) Centroid and Clustering Methods

Centroid-based methods, such as CLASTER, create robust action representations by combining raw visual-semantic embeddings with cluster centroids computed via K-means, then refined using reinforcement learning (Gowda et al., 2021). Centroids serve as prototypes or anchors, regularizing the representation to improve generalization to outlier (unseen) test examples.
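
The anchoring idea can be illustrated in a few lines: augment each feature with its nearest K-means centroid so that downstream classifiers see both the instance and its prototype. This is a simplified sketch of centroid-based regularization, not CLASTER's full RL-refined pipeline:

```python
import numpy as np

def centroid_augment(feats, centroids):
    """Concatenate each feature with its nearest centroid, so the
    representation carries both instance-level and prototype-level cues.

    feats     : (N, d) visual-semantic embeddings
    centroids : (K, d) cluster centers, e.g. from K-means
    """
    # squared Euclidean distance from every feature to every centroid
    d2 = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    nearest = centroids[d2.argmin(axis=1)]        # (N, d)
    return np.concatenate([feats, nearest], axis=1)  # (N, 2d)
```

In CLASTER the centroids themselves are further refined with reinforcement learning rather than kept fixed as here.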

(d) Out-of-Distribution (OOD) Detection and Calibration

Explicit OOD detection modules—typically thresholding the entropy of the posterior over seen classes—allow models to route test samples to specialized classifiers: a standard softmax for seen classes and a separate classifier (or nearest prototype) for unseen classes (Mandal et al., 2019, Gowda et al., 2021). Gating or bias-detection components further calibrate class assignment and mitigate the pronounced bias toward seen categories.
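
The entropy-based gate reduces to a one-line decision: a peaked seen-class posterior suggests a seen sample, while a flat (high-entropy) posterior suggests the sample is out of distribution and should go to the unseen-class branch. A minimal sketch, with the threshold assumed to be tuned on held-out data:

```python
import numpy as np

def entropy_gate(seen_probs, threshold):
    """Route a test sample by the Shannon entropy of its posterior over
    seen classes: low entropy -> seen branch, high entropy -> unseen branch.

    seen_probs : (S,) softmax posterior over seen classes
    threshold  : entropy cut-off, tuned on validation data
    """
    p = np.clip(seen_probs, 1e-12, 1.0)   # avoid log(0)
    h = -(p * np.log(p)).sum()
    return "unseen" if h > threshold else "seen"
```

The maximum possible entropy grows as $\log S$, which is one reason entropy-based gates scale better with class count than naive binary detectors.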

3. Evaluation Protocols, Datasets, and Metrics

The evaluation of GZSL for action recognition is anchored in rigorous protocols:

  • Dataset splits: Established video benchmarks (UCF101, HMDB51, Olympic Sports, NTU-60, NTU-120) are divided into disjoint sets ($\mathcal{Y}^s$, $\mathcal{Y}^u$), typically 50/50 or, for skeleton-based datasets, customized (e.g., 110 seen / 10 unseen for NTU-120) (Gowda et al., 2021, Gupta et al., 2021).
  • Features: Strong baselines use CNN features (C3D, I3D, ResNet-50 average pooling, 4s-ShiftGCN for skeletons), with dimensionality reduction as appropriate (Liu et al., 2017, Gupta et al., 2021).
  • Metrics: per-class seen accuracy $A_s$, per-class unseen accuracy $A_u$, and their harmonic mean $H = \frac{2 A_s A_u}{A_s + A_u}$, which stays low unless performance is balanced across both partitions.

The protocol mandates reporting over multiple random splits to ensure statistical significance. Additional measures like cluster purity and t-SNE visualization are employed for in-depth analysis (Gowda et al., 2021).
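The harmonic mean used throughout these protocols is trivial to compute but worth stating precisely, since it is what the tables below report:

```python
def harmonic_mean_accuracy(acc_seen, acc_unseen):
    """GZSL harmonic mean H = 2 * As * Au / (As + Au).

    H collapses toward the smaller of the two accuracies, so a model
    that ignores unseen classes scores near zero regardless of As.
    """
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```

For example, a model with 80% seen accuracy but 0% unseen accuracy scores $H = 0$, which is exactly the seen-class-bias failure mode the metric is designed to expose.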

4. Empirical Results and Comparative Performance

Table: Representative GZSL harmonic mean $H$ (%) results (mean over random splits)

Dataset   Method    Manual Attr.   word2vec   Reference
Olympic   OD        66.2           53.1       (Mandal et al., 2019)
Olympic   CLASTER   68.8           58.1       (Gowda et al., 2021)
Olympic   GGM       52.4           42.2       (Mishra et al., 2018)
UCF101    OD        49.4           37.3       (Mandal et al., 2019)
UCF101    CLASTER   50.9           42.1       (Gowda et al., 2021)
UCF101    GGM       23.7           17.5       (Mishra et al., 2018)
HMDB51    OD        —              36.1       (Mandal et al., 2019)
HMDB51    CLASTER   —              42.4       (Gowda et al., 2021)
HMDB51    GGM       —              20.1       (Mishra et al., 2018)

Key empirical findings:

  • Bias toward seen classes: Standard ZSL methods collapse in the GZSL setting (unseen-class accuracy $\rightarrow 0$ at default calibration) (Liu et al., 2017).
  • Robust clustering and RL-based centroids: CLASTER yields consistent gains ($H$ up to 68.8% on Olympic Sports with attributes) over both generative and OOD-detection baselines (Gowda et al., 2021).
  • Entropy-based OOD detection: Improves unseen-class accuracy without sacrificing seen-class performance, outperforming WGAN-only and non-calibrated baselines (+7.0% on Olympic, +4.9% UCF101 vs. f-CLSWGAN) (Mandal et al., 2019).
  • Fine-grained compositionality: Syntactic decomposition (SynSE) achieves state-of-the-art ZSL and competitive GZSL performance on skeleton-action datasets, with harmonic mean $H = 59.02\%$ on NTU-60 (55/5 split) (Gupta et al., 2021).
  • Transductive adaptation: GGM and EM-based domain adaptation provide further benefit when unlabeled unseen data is available (Mishra et al., 2018).

5. Model Ablations, Limitations, and Interpretive Insights

Ablation studies and analysis have identified several drivers of GZSL performance:

  • Visual-semantic alignment: Integrating both visual and semantic cues yields 12–15% accuracy gains over visual-only methods (Gowda et al., 2021).
  • Clustering and RL: Regularizing with centroids refined by REINFORCE updates increases unseen-class accuracy by up to 10% and raises cluster purity from 0.77 (K-means) to 0.89 (Gowda et al., 2021).
  • Generative feature synthesis: Conditional GANs with auxiliary losses bridge the semantic gap and reduce seen-class bias (Mandal et al., 2019).
  • OOD calibration: Entropy-based detectors substantially outperform naïve binary gating as the number of classes increases (Mandal et al., 2019).
  • Syntactic guidance: Decomposing labels into PoS components allows fine-grained mapping between skeleton motion and action semantics, supporting compositional generalization (Gupta et al., 2021).
  • Limitations: Reliance on frozen backbones in many pipelines precludes joint optimization; increased unseen-class cardinality leads to accuracy drop; semantic embedding quality directly limits generalization (Gupta et al., 2021, Mishra et al., 2018).
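
The cluster purity figure cited in the ablations is a standard diagnostic and can be computed directly: each cluster is credited with its most frequent ground-truth label, and the credited counts are averaged over all samples. A short reference implementation (names are ours):

```python
import numpy as np

def cluster_purity(cluster_ids, labels):
    """Purity = (1/N) * sum over clusters of the count of that cluster's
    most frequent ground-truth label. 1.0 means every cluster is pure.

    cluster_ids : (N,) cluster assignment per sample
    labels      : (N,) ground-truth class per sample
    """
    cluster_ids = np.asarray(cluster_ids)
    labels = np.asarray(labels)
    total = 0
    for c in np.unique(cluster_ids):
        members = labels[cluster_ids == c]
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()                # majority label count
    return total / len(labels)
```

On this scale, the reported improvement from 0.77 to 0.89 means the RL-refined centroids leave markedly fewer mislabeled samples inside each cluster.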

6. Datasets, Benchmarks, and Impact

GZSL for action recognition has spurred dataset construction and novel evaluation protocols:

  • Conventional datasets: Olympic Sports, UCF101, HMDB51, ActivityNet (web-scale).
  • Skeleton-based: NTU-60, NTU-120 (Gupta et al., 2021).
  • Surveillance-focused: Surv5H, a 500-video set of public-safety actions, evaluates real-world transfer from web video to practical scenarios (Liu et al., 2017).

The release of Surv5H and new skeleton splits has facilitated standardized comparison and highlighted the difficulty of bridging domain gaps in real settings. Studies confirm that classical ZSL protocols do not adequately reflect the practical challenge, reinforcing the necessity for GZSL benchmarks and calibration-aware models.

7. Open Problems and Future Directions

Several persistent challenges delineate the research landscape in GZSL for action recognition:

  • Semantic embedding robustness: Advances in extraction and alignment, including sentence-level and context-aware LLMs, are needed to augment transfer (Gowda et al., 2021, Gupta et al., 2021).
  • Joint optimization of visual and semantic spaces: End-to-end approaches that can fine-tune backbone feature extractors in tandem with semantic bridges remain largely untapped, especially for skeleton modalities (Gupta et al., 2021).
  • Unified calibration mechanisms: Current gating and OOD approaches rely on held-out data or heuristics; confidence-aware latent priors and holistic calibration strategies are active areas (Gupta et al., 2021, Mandal et al., 2019).
  • Complex generative capacity: The effectiveness of generative models is limited by their ability to capture intra-class variability and complex visual dynamics; deep VAEs and multi-modal GANs may address these limitations (Mishra et al., 2018).
  • Scaling to large label spaces: As the number of unseen categories grows, performance degrades unless attribute mining and regularization improve.

Significant progress has been made, but the GZSL setting remains fundamentally challenging due to its union of generalization, calibration, and semantic grounding requirements across video modalities (Gowda et al., 2021, Gupta et al., 2021, Mandal et al., 2019, Mishra et al., 2018, Liu et al., 2017).
