Zero-shot Action Classification

Updated 12 March 2026

Zero-shot action classification is a method that recognizes video actions without labeled examples by mapping visual features to shared semantic embeddings.
It leverages compatibility-based, generative, and structured semantic approaches to effectively transfer knowledge between seen and unseen classes.
Rigorous evaluation protocols ensure fairness by enforcing non-overlap between training, pretraining, and unseen classes to prevent semantic leakage.

Zero-shot action classification addresses the problem of recognizing video action categories for which no labeled examples are available during training. The core principle is knowledge transfer between "seen" classes (labeled during training) and "unseen" classes (completely absent during training) via a shared semantic space, typically instantiated through distributed representations of action labels such as word, sentence, or attribute embeddings. The field has produced a diverse algorithmic landscape, encompassing compatibility-based models, generative frameworks, structured semantic augmentation, and transductive realignment strategies. Because most methods rely heavily on pretrained video features, rigorous evaluation protocols are now a focus, especially those that address pretraining leakage and realistic test conditions for "true" zero-shot. The following sections survey the prevailing paradigm, methodology, core technical advances, evaluation frameworks, and open challenges in zero-shot action classification.

1. Formalization and Core Principles

Zero-shot learning (ZSL) for action classification is defined over a set of classes partitioned into seen ($Y_s$) and unseen ($Y_u$) categories, with $Y_s \cap Y_u = \emptyset$ (Gowda et al., 2021). Each video $x \in \mathbb{R}^p$ is mapped to a visual feature $\theta(x)$, while each class label $y$ is mapped to a semantic embedding $a(y) \in \mathbb{R}^d$ (word2vec, sen2vec, Sentence-BERT, etc.). The model learns a compatibility function $F(x, y)$, often an inner product (cosine similarity when both vectors are normalized) or a bilinear form with a learned matrix $W$:

$F(x, y) = \langle \theta(x), a(y) \rangle \quad \text{or} \quad F(x, y; W) = \theta(x)^\top W a(y)$

At test time, the prediction is $\hat{y} = \arg\max_{y \in Y_u} F(x, y)$. In generalized zero-shot learning (GZSL), predictions are drawn from $Y_s \cup Y_u$.
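
The scoring and prediction rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not any cited method's implementation: the function names are hypothetical, the visual features $\theta(x)$ and class embeddings $a(y)$ are assumed to be precomputed arrays, and $W$ is assumed to have been learned elsewhere.

```python
import numpy as np

def l2_normalize(m, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return m / (np.linalg.norm(m, axis=-1, keepdims=True) + eps)

def cosine_compatibility(visual, class_emb):
    """F(x, y) = <theta(x), a(y)> on unit-normalized vectors.

    visual: (n_videos, d), class_emb: (n_classes, d) -> (n_videos, n_classes).
    """
    return l2_normalize(visual) @ l2_normalize(class_emb).T

def bilinear_compatibility(visual, class_emb, W):
    """F(x, y; W) = theta(x)^T W a(y), with W of shape (p, d)."""
    return visual @ W @ class_emb.T

def predict(scores, candidate_ids):
    """Return the argmax class id, restricted to the candidate label set."""
    candidate_ids = np.asarray(candidate_ids)
    return candidate_ids[np.argmax(scores[:, candidate_ids], axis=1)]

# Toy data: classes 0 is "seen", classes 1 and 2 are "unseen" (assumed split).
theta = np.array([[1.0, 0.0], [0.0, 1.0]])
a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = cosine_compatibility(theta, a)
zsl_pred = predict(scores, [1, 2])       # ZSL: argmax over Y_u only
gzsl_pred = predict(scores, [0, 1, 2])   # GZSL: argmax over Y_s ∪ Y_u
```

The only difference between the ZSL and GZSL calls is the candidate set passed to `predict`, which mirrors the change in the argmax domain.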

A crucial requirement, particularly with deep visual backbones pre-trained on large-scale action datasets (e.g., Kinetics-400), is that $Y_u \cap Y_p = \emptyset$, where $Y_p$ denotes classes seen during pre-training (Gowda et al., 2021). Violating this condition inflates zero-shot performance through semantic or visual label leakage.
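
A first sanity check for the condition $Y_u \cap Y_p = \emptyset$ is a normalized name comparison between the unseen labels and the pre-training labels. The sketch below is an assumption-laden illustration (the helper names are hypothetical); exact name matching catches only literal overlap, not synonyms or visually near-identical classes.

```python
import re

def normalize_label(name):
    """Lowercase and drop non-alphanumerics so 'PlayingGuitar' matches 'playing guitar'."""
    return re.sub(r"[^a-z0-9]+", "", name.lower())

def leaked_classes(unseen, pretraining):
    """Return unseen class names whose normalized form appears in the pre-training label set."""
    pre = {normalize_label(p) for p in pretraining}
    return sorted(u for u in unseen if normalize_label(u) in pre)

# Hypothetical label lists for illustration only.
leaks = leaked_classes(["PlayingGuitar", "Archery"], ["playing guitar", "surfing"])
```

An empty result is necessary but not sufficient for a leak-free split, since semantically equivalent classes can carry different names.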

2. Methodological Advances and Model Taxonomy

2.1 Compatibility-based and Generative Models

Compatibility-based methods dominate. They map visual features into the semantic space via regression or similarity learning (e.g., CLASTER; Gowda et al., 2021), or project both modalities into a joint or intermediate space (e.g., CCA, Sammon mapping). Generative models synthesize unseen-class visual features from semantic embeddings using cGANs or VAEs, mitigating sample imbalance and enabling standard discriminative training across seen and unseen classes (Sun et al., 2021, Gowda et al., 2023, Gowda et al., 2024). The two-stage pipeline of "GAN for Vision, KG for Relation", which first synthesizes features with a WGAN-GP and then classifies with a knowledge-graph-based GCN, exemplifies this paradigm (Sun et al., 2021).
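
The data flow of the generative paradigm (semantic embedding in, synthetic visual features out, then ordinary classifier training) can be illustrated without a GAN. The sketch below deliberately swaps the cGAN/VAE generator for a much simpler stand-in: a ridge regression from class embeddings to seen-class feature means, plus isotropic Gaussian noise. It is not the method of any cited paper, and all function names and parameters are hypothetical.

```python
import numpy as np

def fit_semantic_to_visual(sem_seen, mean_feats_seen, ridge=1e-2):
    """Ridge regression W mapping class embeddings (n, d) to class feature means (n, p)."""
    d = sem_seen.shape[1]
    gram = sem_seen.T @ sem_seen + ridge * np.eye(d)
    return np.linalg.solve(gram, sem_seen.T @ mean_feats_seen)

def synthesize_features(sem_unseen, W, n_per_class, noise_std, rng):
    """Sample pseudo-features around the predicted mean of each unseen class."""
    means = sem_unseen @ W
    feats, labels = [], []
    for k, mu in enumerate(means):
        noise = noise_std * rng.standard_normal((n_per_class, mu.shape[0]))
        feats.append(mu + noise)
        labels.append(np.full(n_per_class, k))
    return np.vstack(feats), np.concatenate(labels)

# Toy usage with random stand-ins for embeddings and feature means.
rng = np.random.default_rng(0)
sem_seen = rng.standard_normal((6, 4))
mean_feats = rng.standard_normal((6, 3))
W = fit_semantic_to_visual(sem_seen, mean_feats)
sem_unseen = rng.standard_normal((2, 4))
syn_x, syn_y = synthesize_features(sem_unseen, W, n_per_class=5, noise_std=0.1, rng=rng)
```

The synthetic pairs `(syn_x, syn_y)` can then be combined with real seen-class features to train any standard classifier, which is the step that a learned generator such as a WGAN-GP serves in the cited pipelines.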

2.2 Structured and Rich Semantic Representation

Research converges on the inadequacy of simple class-name embeddings. Richer semantic augmentation includes:

Elaborative Descriptions: Human-annotated, definition-style sentences for each action class; BERT encodings outperform classic word2vec (Chen et al., 2021).
Narrative "Stories": Multi-sentence, stepwise action descriptions mined from WikiHow and curated sources; encoded via Sentence-BERT, they enable feature generation via WGAN and outperform traditional word embeddings across all protocols (Gowda et al., 2023).
Multi-Semantic Fusion: Concatenation of label, action, and motion descriptions for skeleton-based ZSL; aligned via bidirectional VAEs (Li et al., 2023).

Integrating such structured semantics into mapping and generative models substantially narrows the semantic gap and boosts transfer. For example, using Stories embeddings improves zero-shot accuracy on UCF-101 from 49.7% to 73.4% with CLIP features (Gowda et al., 2023).

2.3 Spatial and Object-driven Embedding

Object-centric reasoning is foundational to several frameworks. Spatial-aware embeddings leverage object detection, actor–object spatial priors, and global scene context to construct action representations (Mettes et al., 2017). SBERT-based object–action affinities, fused with video captions, yield state-of-the-art results on Kinetics-400 (Estevam et al., 2022). Such approaches excel particularly where objects serve as strong proxies for actions.

2.4 Temporal and Compositional Signatures

Temporal logic is captured in models like DASZL, encoding action classes as dynamic finite-state acceptors over attribute sequences. This facilitates compositional and interpretable specification of complex, fine-grained activities (Kim et al., 2019).
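
To make the idea of a finite-state acceptor over attribute sequences concrete, the following is a simplified sketch. It assumes each frame has already been reduced to a set of detected attribute names, and it accepts a video if the required attribute stages occur in order. The attribute names and the `make_acceptor` helper are hypothetical and do not reproduce the DASZL implementation.

```python
def make_acceptor(stages):
    """Build an acceptor from an ordered list of attribute sets (one per state)."""
    def accepts(frames):
        state = 0
        for attrs in frames:
            # Advance one state when the current stage's attributes are all present.
            if state < len(stages) and stages[state] <= attrs:
                state += 1
        return state == len(stages)
    return accepts

# Hypothetical signature for a "pick up object" action.
pick_up = make_acceptor([
    {"hand_near_object"},
    {"hand_holding_object"},
    {"object_lifted"},
])

frames = [
    {"hand_near_object"},
    {"hand_near_object", "hand_holding_object"},
    {"hand_holding_object", "object_lifted"},
]
in_order = pick_up(frames)            # True: stages occur in sequence
reversed_order = pick_up(frames[::-1])  # False: same attributes, wrong order
```

Because the signature is written by composing attributes rather than learned from labeled videos of the action, a new class can be specified without any training examples, provided the attribute detectors exist.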

Reinforcement learning over clustered joint visual-semantic spaces (CLASTER) further regularizes representation, improving generalized ZSL (Gowda et al., 2021).

2.5 Vision-Language Prompting

Semantic prompting of frozen vision-language models (e.g., CLIP) using structured, multi-level prompts (intent, motion, object interaction) can rival parameterized temporal adaptation, showing that careful language engineering alone can achieve strong zero-shot transfer (Iqbal et al., 9 Mar 2026).
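
A rough sketch of multi-level prompting follows. The prompt templates are invented for illustration and are not those of the cited work, and `encode` is a placeholder for any text encoder (for example, the text tower of a CLIP-style model) supplied by the caller; no real model API is assumed.

```python
import numpy as np

def build_prompts(action, intent, motion, interaction):
    """Compose one prompt per semantic level (hypothetical templates)."""
    return [
        f"a video of a person {action}",
        f"the person intends to {intent}",
        f"the motion involves {motion}",
        f"the person interacts with {interaction}",
    ]

def class_prototype(prompts, encode):
    """Average the unit-normalized prompt embeddings into one unit-length class vector."""
    embs = np.stack([np.asarray(encode(p), dtype=float) for p in prompts])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Toy encoder standing in for a real text model (assumption for the demo).
def toy_encode(text):
    return [len(text), 1.0]

prompts = build_prompts("playing guitar", "make music",
                        "strumming with one hand", "a guitar")
proto = class_prototype(prompts, toy_encode)
```

The resulting prototype plays the role of the class embedding $a(y)$ in a cosine compatibility score against frozen video features.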

3. Evaluation Protocols and Datasets

The TruZe split (Gowda et al., 2021) sets a new standard by ensuring $Y_u \cap Y_p = \emptyset$, preventing leakage from backbone pre-training. Visual and semantic similarities between ZSL labels and backbone training labels are explicitly measured to construct disjoint splits. This yields substantially harder benchmarks: on UCF101 and HMDB51, switching to TruZe causes a drop of up to 8.9% in unseen accuracy for ZSL and up to 9.4% in Acc_U for GZSL.
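
The semantic half of such a split construction can be approximated by thresholding the maximum cosine similarity between each candidate class embedding and the pre-training class embeddings. The sketch below is only an assumption about how one might do this: the threshold value and function names are hypothetical, the visual-similarity check is omitted, and it does not reproduce the actual TruZe procedure.

```python
import numpy as np

def max_similarity_to_pretraining(cand_emb, pre_emb):
    """For each candidate class, the highest cosine similarity to any pre-training class."""
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    p = pre_emb / np.linalg.norm(pre_emb, axis=1, keepdims=True)
    return (c @ p.T).max(axis=1)

def select_unseen(names, cand_emb, pre_emb, threshold=0.8):
    """Keep only candidates that stay below the similarity threshold (hypothetical value)."""
    sims = max_similarity_to_pretraining(cand_emb, pre_emb)
    return [n for n, s in zip(names, sims) if s < threshold]

# Toy embeddings: class "a" nearly coincides with a pre-training class, "b" does not.
cand = np.array([[1.0, 0.0], [0.0, 1.0]])
pre = np.array([[1.0, 0.1]])
kept = select_unseen(["a", "b"], cand, pre)
```

In practice the threshold would need to be chosen and validated per embedding model, and borderline classes reviewed by hand.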

Common datasets include UCF101 (101 actions), HMDB51 (51), Olympic Sports (16), Kinetics-400/600/700, ActivityNet, and specialized egocentric or skeleton-based corpora (e.g., EPIC-KITCHENS with structured verb–noun splits (Scott et al., 2020), NTU for skeleton). Recent protocols emphasize:

Mean per-class accuracy to counter class imbalance
Generalized zero-shot metrics: Acc_U (unseen), Acc_S (seen), and harmonic mean $H = 2\cdot\text{Acc}_U\cdot\text{Acc}_S/(\text{Acc}_U+\text{Acc}_S)$
Area Under Seen–Unseen Curve (AUSUC) for calibration sensitivity (Liu et al., 2017)
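
The first two metrics above are straightforward to compute from predictions. The sketch below assumes integer class ids and hypothetical function names; it averages accuracy per class before averaging over classes, then combines the seen and unseen values with the harmonic mean $H$.

```python
import numpy as np

def mean_per_class_accuracy(y_true, y_pred, classes):
    """Average of per-class accuracies over the given classes that occur in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes if np.any(y_true == c)]
    return float(np.mean(accs))

def harmonic_mean(acc_u, acc_s):
    """H = 2 * Acc_U * Acc_S / (Acc_U + Acc_S), defined as 0 when both are 0."""
    if acc_u + acc_s == 0:
        return 0.0
    return 2.0 * acc_u * acc_s / (acc_u + acc_s)

def gzsl_report(y_true, y_pred, seen, unseen):
    """Return (Acc_S, Acc_U, H) for a GZSL prediction run."""
    acc_s = mean_per_class_accuracy(y_true, y_pred, seen)
    acc_u = mean_per_class_accuracy(y_true, y_pred, unseen)
    return acc_s, acc_u, harmonic_mean(acc_u, acc_s)

# Toy run: classes 0 and 1 are seen, class 2 is unseen.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
acc_s, acc_u, h = gzsl_report(y_true, y_pred, seen=[0, 1], unseen=[2])
# acc_s = (0.5 + 1.0) / 2 = 0.75, acc_u = 2/3, h = 1.0 / (0.75 + 2/3) ≈ 0.706
```

The harmonic mean penalizes models that score well on seen classes while neglecting unseen ones, which an arithmetic mean would hide.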

4. Empirical Trends and Key Results

Generative replay and continual learning (GIL) yield gains of +6.2% (UCF-101) and +5.5% (HMDB51) in top-1 ZSL accuracy over strong baselines, with GZSL harmonic-mean gains of up to +19.7% (Gowda et al., 2024).
Narrative-augmented feature synthesis (Stories): Models trained with Stories embeddings surpass even few-shot baselines, e.g., 42.1% top-1 on Kinetics ZSL (0-shot) exceeds 1-shot (31.8%) and approaches 2-shot (45.0%) (Gowda et al., 2023).
Spatial-aware embeddings and object-action fusion provide consistent gains, especially in scenes where object cues are dominant (Mettes et al., 2017, Mettes, 2022, Estevam et al., 2022).
Metric learning and joint video-text embedding significantly improve zero-shot generalization in open-set and compositional action recognition, elevating 5-way zero-shot accuracy by 6–10 percentage points (Scott et al., 2020).
Prototype transport via hyperspherical optimal transport corrects semantic–visual bias, realigns unreachable prototypes, and increases coverage of unseen labels (previously, up to 23% were never selected) (Mettes, 2022).
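
The intuition behind transductive prototype realignment can be sketched with generic entropic optimal transport (Sinkhorn iterations). This is a simplified stand-in under stated assumptions, not the hyperspherical formulation of the cited work: uniform marginals, a cosine-distance cost, and a linear interpolation followed by renormalization in place of a proper movement on the sphere. The names `reg` and `step` are hypothetical parameters.

```python
import numpy as np

def unit(m):
    """Project each row onto the unit sphere."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropic OT plan between uniform row and column marginals."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def transport_prototypes(prototypes, test_feats, reg=0.1, step=0.5):
    """Move each class prototype toward the test features it is matched to."""
    P, X = unit(prototypes), unit(test_feats)
    cost = 1.0 - P @ X.T                      # cosine distance in [0, 2]
    plan = sinkhorn(cost, reg)
    # Barycentric projection: plan-weighted mean of test features per prototype.
    targets = (plan / plan.sum(axis=1, keepdims=True)) @ X
    return unit((1.0 - step) * P + step * targets)

# Toy usage with random stand-ins for prototypes and unlabeled test features.
rng = np.random.default_rng(0)
protos = rng.standard_normal((3, 8))
feats = rng.standard_normal((20, 8))
moved = transport_prototypes(protos, feats)
```

Because the uniform row marginal forces every prototype to receive mass from the test set, prototypes that would otherwise never win the argmax are pulled toward regions where test videos actually lie.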

The following table summarizes key ZSL performance metrics reported for several leading models on representative splits:

Method	Dataset / Protocol	ZSL Top-1 (%)	GZSL Harmonic (%)	Notes
GIL	UCF-101 (Stories)	79.4	77.5	Generative, continual learning
ER-ZSAR	Kinetics ZSAR	42.1	n/a	Human-edited definitions
CLASTER	UCF-101 (S)	50.2	48.3	RL-based clustering
FGGA (GAN+KG)	HMDB51	31.2	36.4	Feature synthesis + GCN
SDR+CLIP (Stories)	UCF-101 (Xu)	73.4	59.7	Story narrative + WGAN
SP-CLIP	UCF-101 ZSL	80.4	n/a	Semantic prompting

5. Challenges, Limitations, and Recommendations

Pre-training overlap: Backbone pretraining on datasets with overlapping classes falsely boosts zero-shot metrics; enforcing $Y_u \cap Y_p = \emptyset$ is required for faithful evaluation (Gowda et al., 2021).
Semantic gap: Word vectors alone are coarse for compound or fine-grained actions. Elaborative, narrative, and multi-modal augmentation are necessary (Chen et al., 2021, Gowda et al., 2023, Li et al., 2023).
Hubness and bias: High-dimensional mapping can yield "hub" classes; techniques such as metric learning, clustering, and prototype transport mitigate this (Scott et al., 2020, Gowda et al., 2021, Mettes, 2022).
Failure modes: Poor detector coverage for rare objects, word2vec semantic ambiguity, and lack of explicit temporal modeling are persistent sources of error (Mettes et al., 2017).
Protocol recommendations: Always enforce non-overlapping seen/unseen/pre-training splits, report both ZSL and GZSL metrics, publicly release split definitions, and compare under identical protocols (Gowda et al., 2021).

6. Extensions and Future Directions

Emerging directions include:

Continual and few-shot learning integration: Merging generative replay (e.g., GIL (Gowda et al., 2024)) with zero/few-shot transfer.
LLM fusion: Incorporating large language and narrative models (Stories, BERT) for compositional and context-rich semantics (Gowda et al., 2023, Chen et al., 2021).
Semantic prompting: Structured multi-level prompting for CLIP-style models enables strong zero-shot classification without any additional adaptation (Iqbal et al., 9 Mar 2026).
Open-set and domain adaptation: Explicitly modeling open-world uncertainty and domain gaps using semantic gating, attribute composition, or adversarial domain adaptation.
Temporal logic and compositional description: Encoding dynamic signatures and attribute transitions as finite-state acceptors or learned logic machines (Kim et al., 2019).
Transductive adaptation: Techniques such as prototype transport (UPT), which align unseen class semantics to the empirical test distribution, unlock classification for otherwise unreachable classes (Mettes, 2022).

7. Summary Table of Principal Approaches

Category	Representative Methods	Key Features
Compatibility-based	Latem, SynC, CLASTER	Bilinear/similarity mapping; joint clustering
Generative feature synth	GAN for Vision, GIL, SDR	cWGAN, WGAN-GP/feature replay, semantically conditioned
Structured semantic	Stories, ER, MSF	BERT, sentence/Story embeddings, narrative fusion
Object-centric	SBERT-fusion, S-Aware Obj	Actor/object detection, spatial priors, SBERT paraphrases
RL and clustering	CLASTER	Joint RL-optimized centroids, regularized cluster space
Temporal logic	DASZL	Dynamic action signatures, FSM composition
Prompting (ViL models)	SP-CLIP	Structured semantic prompts for CLIP
Transductive alignment	UPT	Prototype realignment via OT; unlocks unreachable labels

Zero-shot action classification now comprises a suite of methods rigorously evaluated under semantically honest protocols, with best practices converging on rich structured semantics, balanced generative and compatibility models, and strict control for semantic leakage from pre-training. Advances in language modeling, continual learning, and transductive realignment are actively narrowing the gap to supervised recognition and enabling robust transfer to unseen actions (Gowda et al., 2021, Gowda et al., 2023, Gowda et al., 2024, Chen et al., 2021, Mettes, 2022, Gowda et al., 2021, Estevam et al., 2022, Kim et al., 2019, Li et al., 2023, Iqbal et al., 9 Mar 2026, Sun et al., 2021).