Zero-Shot Action Classification

Updated 5 December 2025
  • Zero-shot action classification is defined as assigning previously unseen action labels to videos using high-level semantic cues like attribute vectors, language descriptions, or embeddings.
  • Key methodologies include projection-based alignment, generative feature synthesis, multi-modal fusion, and graph-based approaches that bridge visual and semantic spaces.
  • Evaluation protocols such as ZSL and GZSL employ strict, leakage-free splits to benchmark performance across standard video datasets.

Zero-shot action classification is the problem of assigning action labels to videos (or related modalities, such as skeleton sequences or egocentric first-person videos) that belong to classes for which no visual examples have been presented during training. The key characteristic of this regime is the exclusive reliance on a “bridge”: high-level semantic side information—attributes, language descriptions, object sets, or embeddings—that connects seen and unseen classes. Zero-shot action classification is motivated by the combinatorial scale of possible action labels, the infeasibility of curating exhaustive video datasets for all actions, and the need for open-set recognition in real-world settings (Estevam et al., 2019).

1. Formal Problem Setup and Evaluation Protocols

Let $\mathcal{C}^{s}$ denote the set of seen classes and $\mathcal{C}^{u}$ the set of unseen classes, with $\mathcal{C}^{s} \cap \mathcal{C}^{u} = \emptyset$. Training data consists of pairs $(v_i, y_i)$ with $y_i \in \mathcal{C}^{s}$ and $v_i$ a video. Given semantic side information $\psi(c)$ for each $c \in \mathcal{C}^{s} \cup \mathcal{C}^{u}$ (word embeddings, attribute vectors, sentences, etc.), the goal is to assign correct labels from $\mathcal{C}^{u}$ to test videos $v_j$ such that $y_j \in \mathcal{C}^{u}$.
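
To make this notation concrete, here is a minimal Python sketch of the setup, assuming illustrative class names and random 300-dimensional placeholder vectors in place of real semantic embeddings:

```python
import numpy as np

seen_classes = ["archery", "bowling", "drumming"]      # C^s: classes with training videos
unseen_classes = ["fencing", "juggling"]               # C^u: classes with no training videos
assert set(seen_classes).isdisjoint(unseen_classes)    # C^s ∩ C^u = ∅

# psi(c): one semantic vector per class (word/sentence embedding, attribute vector, ...).
# Random vectors stand in for real side information here.
rng = np.random.default_rng(0)
psi = {c: rng.normal(size=300) for c in seen_classes + unseen_classes}

# Training pairs (v_i, y_i) only carry labels from C^s; test videos carry labels
# from C^u (ZSL) or from C^s ∪ C^u (GZSL).
```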

Zero-shot action classification is evaluated in two main regimes:

  • Zero-Shot Learning (ZSL): Test videos are restricted to unseen classes $\mathcal{C}^{u}$, i.e., $f_\mathrm{ZSL}: \mathcal{V} \to \mathcal{C}^{u}$.
  • Generalized Zero-Shot Learning (GZSL): The label space at test time is $\mathcal{C}^{s} \cup \mathcal{C}^{u}$, i.e., $f_\mathrm{GZSL}: \mathcal{V} \to \mathcal{C}^{s} \cup \mathcal{C}^{u}$, evaluating accuracy on both seen and unseen classes together with their harmonic mean $H = 2\,\frac{Acc_{s} \cdot Acc_{u}}{Acc_{s} + Acc_{u}}$ (Liu et al., 2017); a minimal metric sketch follows this list.
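
A minimal sketch of these metrics, assuming ground-truth and predicted labels stored in NumPy arrays, is:

```python
import numpy as np

def mean_per_class_accuracy(y_true, y_pred, classes):
    """Macro-averaged accuracy over the given subset of classes."""
    accs = [(y_pred[y_true == c] == c).mean() for c in classes if (y_true == c).any()]
    return float(np.mean(accs))

def gzsl_harmonic_mean(y_true, y_pred, seen, unseen):
    """H = 2 * Acc_s * Acc_u / (Acc_s + Acc_u), the standard GZSL summary."""
    acc_s = mean_per_class_accuracy(y_true, y_pred, seen)
    acc_u = mean_per_class_accuracy(y_true, y_pred, unseen)
    return 2 * acc_s * acc_u / (acc_s + acc_u + 1e-12), acc_s, acc_u

y_true = np.array(["run", "swim", "swim", "jump"])
y_pred = np.array(["run", "swim", "run", "jump"])
H, acc_s, acc_u = gzsl_harmonic_mean(y_true, y_pred, seen=["run", "jump"], unseen=["swim"])
print(round(H, 3), acc_s, acc_u)   # 0.667 1.0 0.5
```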

Recent research emphasizes the need for standardized, pretraining-leakage-free splits (e.g., “TruZe”) that guarantee no overlap between unseen classes and pretraining categories (Gowda et al., 2021).

2. Taxonomy of Models and Core Methodological Paradigms

Zero-shot action classification methods can be broadly grouped as follows (Estevam et al., 2019):

  • Compatibility or Projection-based Methods: Learn a mapping $f: \mathbb{R}^d \to \mathbb{R}^k$ aligning visual features $\phi(v)$ and semantic prototypes $\psi(c)$. Examples include mean-squared-error regression and bilinear compatibility $F(\phi(v), \psi(c)) = \phi(v)^\top W \psi(c)$, where $W$ is learned on seen classes. Inference assigns the class maximizing the similarity between projected visual features and semantic prototypes (Zellers et al., 2017, Liu et al., 2017); see the sketch after this list.
  • Intermediate-space or Multi-modal Embedding Methods: Both visual and semantic features are embedded into a joint space (via, e.g., CCA, deep encoders, cross-modal contrastive objectives), facilitating domain adaptation and improved generalization.
  • Generative Feature Synthesis: Train a conditional generative model (GAN, VAE) to synthesize visual features of unseen classes conditioned on their semantic description, then train a classifier on these features (Sun et al., 2021, Gowda et al., 2023). Wasserstein GANs with semantic conditioning and cycle consistency are widely used, and the synthetic samples aim to mitigate bias toward seen classes evident in embedding-based approaches.
  • Graph-based and Knowledge-Graph Methods: Construct a graph over action and object classes, exploiting edges from external knowledge bases and propagating attribute or word-embedding information with graph convolutional or attention mechanisms (Sun et al., 2021).
  • Compositional or Attribute-State Models: Model fine-grained temporal signatures of actions as compositions of atomic attribute detectors (objects, poses, relations) and first-principles grammars, allowing for on-the-fly action detector composition without end-to-end supervised training (Kim et al., 2019).
  • Clustering and Reinforcement Optimized Methods: Cluster joint visual-semantic representations and refine centroids via reinforcement learning to increase robustness to domain shift between seen and unseen classes (Gowda et al., 2021).
  • Multi-Modal and Descriptor Fusion Models: Fuse multiple semantic channels (object detectors, video captions, dictionary sentences) with strong text encoders (e.g., SBERT, CLIP), improving discriminative power and seen-unseen generalization (Estevam et al., 2022, Li et al., 2023, Zhou et al., 22 Jan 2024).
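
As a concrete instance of the projection-based family (see the first bullet above), the sketch below fits a linear map from visual features to the semantic space by ridge regression on seen classes and then scores test videos against unseen-class prototypes with a bilinear compatibility. All features and prototypes are synthetic stand-ins; a real system would use, e.g., I3D features and word or sentence embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_sem, n_train = 512, 300, 1000

psi_seen = rng.normal(size=(10, d_sem))     # prototypes psi(c) of 10 seen classes
psi_unseen = rng.normal(size=(5, d_sem))    # prototypes of 5 unseen classes
y_train = rng.integers(0, 10, size=n_train) # seen-class labels
X_train = rng.normal(size=(n_train, d_vis)) # visual features phi(v) of training videos

# Ridge regression onto semantic targets: W = argmin ||X W - Psi[y]||^2 + lam ||W||^2.
lam = 1.0
T = psi_seen[y_train]
W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d_vis), X_train.T @ T)

def predict_zsl(x, prototypes):
    """Pick the class whose prototype maximizes F(v, c) = (phi(v) W) . psi(c)."""
    scores = (x @ W) @ prototypes.T
    return int(np.argmax(scores))

x_test = rng.normal(size=d_vis)
print(predict_zsl(x_test, psi_unseen))      # index into the unseen-class list
```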

3. Semantic Information: Role and Representation

The transferability of zero-shot action recognition critically depends on the richness and alignment of the semantic representations used.

  • Manual Attributes: Early work used human-defined binary or scalar attributes over verbs or action categories (Zellers et al., 2017).
  • Word Embeddings: Distributed embeddings (word2vec, GloVe) of action labels or compound object phrases provide scalable, but often sparse, semantic spaces (Mettes et al., 2017). However, they can conflate homonyms and lack sensitivity to fine-grained or compositional distinctions.
  • Sentence Embeddings and Natural Descriptions: Recent advances leverage detailed action descriptions harvested from knowledge bases (WikiHow, WordNet), video captions, or multi-sentence narratives. Embeddings from SBERT or large transformer models (CLIP, BERT) produce rich class prototypes that capture objects, scenes, typical action sequences, and affordance context (Gowda et al., 2023, Zhou et al., 22 Jan 2024, Li et al., 2023). Empirical results show that fusing multi-sentence stories with data-driven feature synthesis can yield large increases (up to 6–20% absolute) in top-1 classification accuracy compared to word2vec or manual attribute baselines (Gowda et al., 2023); a prototype-building sketch follows this list.
  • Object and Scene Affinity: A line of work quantifies the affinity between detected objects in video and action categories by semantic similarity (e.g., word embedding dot products) and by modeling spatial and temporal priors over actor-object relations, often using hand-crafted or estimated priors from large-scale datasets (Mettes et al., 2017, Estevam et al., 2022).
  • Graph/Knowledge Structures: Incorporating ConceptNet relations or dynamic attention-based adjacency matrices allows context-aware propagation of semantic and visual signals across related classes and objects (Sun et al., 2021).
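
As an illustration of the sentence-embedding route referenced above, the sketch below encodes short class descriptions with the sentence-transformers library to obtain class prototypes. The model name and the toy descriptions are assumptions for illustration; the cited works use far richer multi-sentence descriptions and often fuse several semantic channels.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

descriptions = {
    "archery": "A person draws a bow and releases an arrow toward a distant target.",
    "juggling": "A performer repeatedly tosses and catches several objects in the air.",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")       # any sentence encoder could be used
prototypes = {c: encoder.encode(text) for c, text in descriptions.items()}

# These vectors can replace word2vec label embeddings as psi(c); with multiple
# sentences per class, each is encoded separately and the embeddings mean-pooled.
print(np.stack(list(prototypes.values())).shape)        # (2, 384) for this encoder
```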

4. Core Architectural and Algorithmic Techniques

Most approaches involve three key architectural blocks:

  1. Visual Feature Extractors: 2D/3D CNNs (C3D, I3D, TSM, R(2+1)D), sometimes augmented by RNNs for temporal aggregation, are the standard backbone for extracting compact video-level or snippet-level representations.
  2. Semantic Encoders: Mapping from class label, object, or textual descriptor to a semantic embedding—implemented using pretrained word2vec, SBERT, BERT, or CLIP encoders.
  3. Alignment Module: A function (linear, nonlinear, contrastive, or generative) that links or maps between the visual and semantic modalities, trained with various losses (a cross-modal contrastive sketch follows this list):
    • Cross-entropy (for classifiers over generated or fused features)
    • Regression or ranking losses (for projection-based models)
    • Adversarial, cycle-consistency, or mutual information maximization (for GAN/feature synthesis)
    • Cluster assignment/diversity objectives or RL-based centroid updates (Gowda et al., 2021)
    • Gating mechanisms to balance seen/unseen predictions in GZSL (Li et al., 2023)
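
One of the losses listed above can be made concrete with a short PyTorch sketch: a symmetric, InfoNCE-style contrastive objective between projected video embeddings and the semantic embeddings of their ground-truth classes. Random tensors stand in for backbone and text-encoder outputs; the temperature value is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, d) projected features of matching video/class pairs."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) cosine-similarity logits
    targets = torch.arange(v.size(0), device=v.device)  # matching pairs lie on the diagonal
    # Symmetric loss: align video -> text and text -> video.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```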

Inference Pipelines: For generative approaches, synthetic features for each unseen class are produced by the trained generator and used to train a discriminative classifier. For other approaches, video features are projected and scored against all unseen (ZSL) or all seen and unseen (GZSL) semantic prototypes, and the highest-scoring class is selected (Liu et al., 2017).
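
The generative branch of this pipeline can be sketched as follows. The generator here is an untrained placeholder standing in for a semantically conditioned generator (trained adversarially in the cited feature-synthesis works); the point is only the data flow from semantic prototypes to synthetic features to a classifier over unseen classes.

```python
import torch
import torch.nn as nn

d_sem, d_noise, d_feat, per_class = 300, 64, 512, 200

# G(z, psi(c)) -> synthetic visual feature; in practice trained as a conditional WGAN.
generator = nn.Sequential(
    nn.Linear(d_sem + d_noise, 1024), nn.LeakyReLU(0.2), nn.Linear(1024, d_feat)
)
psi_unseen = torch.randn(5, d_sem)          # placeholder prototypes for 5 unseen classes

# 1) Synthesize labeled features conditioned on each unseen-class prototype.
feats, labels = [], []
with torch.no_grad():
    for c, proto in enumerate(psi_unseen):
        z = torch.randn(per_class, d_noise)
        cond = proto.unsqueeze(0).repeat(per_class, 1)
        feats.append(generator(torch.cat([z, cond], dim=1)))
        labels.append(torch.full((per_class,), c, dtype=torch.long))
feats, labels = torch.cat(feats), torch.cat(labels)

# 2) Fit a discriminative classifier on the synthetic features.
clf = nn.Linear(d_feat, len(psi_unseen))
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    nn.functional.cross_entropy(clf(feats), labels).backward()
    opt.step()

# 3) At test time, real (unseen-class) video features are classified directly by clf.
```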

Composite and Compositional Models: Temporal evolution of attributes or object–actor interactions is modeled by logic-based finite-state machines or weighted finite-state transducers, enabling fine-grained discrimination of actions with similar static attribute sets, notably for complex actions in security or surgical scenarios (Kim et al., 2019).
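
A toy sketch of this compositional idea: per-frame attribute detections (represented as sets of fired attribute names, all invented for illustration) are matched against an ordered finite-state signature, so a new action detector can be composed from existing attribute detectors without any video-level training.

```python
from typing import List, Set

def matches_signature(frames: List[Set[str]], signature: List[Set[str]]) -> bool:
    """Advance one state whenever a frame contains all attributes required by the
    current state; the action fires only if every state is visited in order."""
    state = 0
    for attrs in frames:
        if state < len(signature) and signature[state] <= attrs:
            state += 1
    return state == len(signature)

# Hypothetical signature for an "open door" action: approach, grasp, then swing.
open_door = [{"person_near_door"}, {"hand_on_handle"}, {"door_moving"}]

frame_attrs = [
    {"person_near_door"},
    {"person_near_door", "hand_on_handle"},
    {"door_moving"},
]
print(matches_signature(frame_attrs, open_door))  # True
```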

5. Datasets, Experimental Protocols, and Benchmarking

Benchmark datasets include:

| Dataset | Clips | Classes | Type | Notable Properties |
|---|---|---|---|---|
| HMDB51 | 7,000 | 51 | 3–10 s, trimmed | Movie/YouTube |
| UCF101 | 13,320 | 101 | 2–10 s, trimmed | Sports/YouTube |
| Olympic Sports | 800 | 16 | 2–10 s, trimmed | Sports in the wild |
| Kinetics-400/600 | 242,658+ | 400–600 | 10 s, trimmed | Web-scale, wide coverage |
| ActivityNet | 20k+ | >200 | Untrimmed, variable length | Daily activities |
| Surv5H | 500 | 9 | Surveillance | Public safety |
| ActionHub | 3.6M desc. | 1,211 | Video + web descriptions | Text-rich, noisy, diverse |

Key protocol standards:

  • Seen/unseen splits: Random (e.g., 50/51, 25/26), TruZe (removes classes overlapping with pretraining), contextually separated by action/fine-grained noun (Gowda et al., 2021, Scott et al., 2020).
  • Class-level supervision only on $\mathcal{C}^{s}$, with $\mathcal{C}^{u}$ represented solely by side information. No video features from $\mathcal{C}^{u}$ are used to train any part of the model.
  • Metrics: Mean per-class accuracy (on unseen); in GZSL, harmonic mean of seen/unseen; sometimes area under seen-unseen accuracy curves (AUSUC) when calibration is possible (Liu et al., 2017).

Impact of pretraining leakage: Overlaps between Kinetics pretraining labels and “unseen” test classes can inflate ZSL numbers by up to 9% absolute; careful class splits (TruZe) are now required to avoid such leakage (Gowda et al., 2021).
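
A minimal sketch of screening for such leakage is given below. It only catches exact matches after simple name normalization; the TruZe protocol additionally removes semantically and visually overlapping classes, which requires checks beyond this snippet. The label lists are tiny invented excerpts.

```python
def normalize(label: str) -> str:
    """Lowercase and strip non-alphanumeric characters for name comparison."""
    return "".join(ch for ch in label.lower() if ch.isalnum())

def leaked_classes(unseen_candidates, pretraining_labels):
    """Return proposed unseen classes whose names collide with pretraining labels."""
    pretrained = {normalize(l) for l in pretraining_labels}
    return [c for c in unseen_candidates if normalize(c) in pretrained]

unseen_candidates = ["Archery", "Playing Guitar", "Juggling Balls"]
pretraining_labels = ["archery", "playing guitar", "riding a bike"]   # illustrative excerpt
print(leaked_classes(unseen_candidates, pretraining_labels))          # ['Archery', 'Playing Guitar']
```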

6. Empirical Results and Analysis

Zero-shot action classification has seen significant performance improvements as the following technical developments have matured:

  • Simple projection and compatibility models (linear, SVR, bilinear) attain roughly 15–25% top-1 accuracy on UCF101 and HMDB51 in standard ZSL (Estevam et al., 2019).
  • Multi-modal or object-based models (Mettes et al., 2017) that incorporate spatial priors and object-score fusion achieve up to 51.2% on UCF101 (20/81 split).
  • Generative feature synthesis with semantic-rich stories boosts UCF101 (50 unseen) accuracy from ≈25–40% (GAN baselines) up to 62.9–73.4% (I3D/CLIP backbones with SDR+Stories) (Gowda et al., 2023).
  • Fusion of object and sentence-based descriptors via SBERT yields 53.1% (UCF101, 0/50 split) and 60.5% (TruZe 34 test classes) for object+sentence fusion, outperforming object- or sentence-only channels (Estevam et al., 2022).
  • Clustering/centroid-based approaches with RL optimization lead to further gains (e.g., CLASTER: 46.7–52.7% on UCF101, word2vec/sentence2vec/combined embeddings) (Gowda et al., 2021).
  • Rich textual datasets (ActionHub, Stories) and multi-modal semantic fusion (video label+caption aggregation, cycle consistency) push Kinetics-ZSAR top-1 up by 6.2% over prior benchmarks (Zhou et al., 22 Jan 2024).

Tabulated excerpt of comparative results (UCF101 ZSL, 50/51 or similar splits):

| Method | UCF101 Top-1 ZSL (%) | UCF101 TruZe (%) |
|---|---|---|
| Baseline (SVR) | ~15–25 | – |
| Object+Spatial (Mettes et al., 2017) | 51.2 (20/81 split) | – |
| CEWGAN (Sun et al., 2021) | 26.9 ± 2.8 | – |
| FGGA (Sun et al., 2021) | 28.3 ± 1.8 | – |
| SDR+I3D (Gowda et al., 2023) | 62.9 | – |
| SDR+CLIP (Gowda et al., 2023) | 73.4 | – |
| SBERT Fusion (Estevam et al., 2022) | 53.1 | 60.5 |
| CLASTER (Gowda et al., 2021) | 52.7 ± 2.2 | 45.2 |
| CoCo+ActionHub (Zhou et al., 22 Jan 2024) | 51.2 ± 2.9 | – |

Ablations repeatedly show that narrative, multi-modal, and compositional signals are crucial; gains of 6–20% over w2v baselines are common when using sentence embeddings or graph/fusion models. Performance degrades by 4–9% absolute under strict TruZe splits, confirming that leakage-free protocols are essential for fair benchmarking (Gowda et al., 2021).

7. Open Challenges and Research Frontiers

Major challenges and directions include (Estevam et al., 2019, Gowda et al., 2021, Zhou et al., 22 Jan 2024):

  • Semantic gap and compositionality: Distributed word embeddings or short class names fail to encode necessary distinctions between nuanced actions or similar subclasses; multi-sentence and object-enriched descriptions are far superior.
  • Domain shift and hubness: The domain gap between seen and unseen classes induces bias toward seen-class predictions in GZSL; bias calibration, hubness correction, and feature distribution matching are ongoing areas of research (Mettes, 2022).
  • Generative and continual learning: Generative iterative replay (synthesizing past and novel class features for continual exposure) bridges continual learning with zero-shot action recognition, boosting GZSL performance (Gowda et al., 14 Oct 2024).
  • Fine-grained temporal modeling: Most approaches aggregate entire video; compositional dynamic signatures, temporal grammars, and segment-level inference enable more robust detection of actions differing in structure over time (Kim et al., 2019).
  • Evaluation standards: Random splits are prone to leakage; reports must use fixed, pretraining-leakage-free splits, and should include both ZSL and GZSL metrics for full transparency (Gowda et al., 2021).
  • Data scarcity and long-tail classes: Performance on rare actions with few or noisy descriptions remains the main bottleneck in open-world recognition.

A plausible implication is that progress in zero-shot action recognition will increasingly rely on large-scale, description-rich datasets, richer fusion of heterogeneous side information, and advances in open-set generalization, continual lifelong learning, and calibration under data shift.

References

  • “Zero-Shot Action Recognition in Videos: A Survey” (Estevam et al., 2019)
  • “Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions” (Mettes et al., 2017)
  • “GAN for Vision, KG for Relation: a Two-stage Deep Network for Zero-shot Action Recognition” (Sun et al., 2021)
  • “Telling Stories for Common Sense Zero-Shot Action Recognition” (Gowda et al., 2023)
  • “Multi-Semantic Fusion Model for Generalized Zero-Shot Skeleton-Based Action Recognition” (Li et al., 2023)
  • “Global Semantic Descriptors for Zero-Shot Action Recognition” (Estevam et al., 2022)
  • “A New Split for Evaluating True Zero-Shot Action Recognition” (Gowda et al., 2021)
  • “CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition” (Gowda et al., 2021)
  • “ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition” (Zhou et al., 22 Jan 2024)
  • “DASZL: Dynamic Action Signatures for Zero-shot Learning” (Kim et al., 2019)
  • “Generalized Zero-Shot Learning for Action Recognition with Web-Scale Video Data” (Liu et al., 2017)
  • “Universal Prototype Transport for Zero-Shot Action Recognition and Localization” (Mettes, 2022)
  • “Continual Learning Improves Zero-Shot Action Recognition” (Gowda et al., 14 Oct 2024)
  • “Zero-Shot Activity Recognition with Verb Attribute Induction” (Zellers et al., 2017)
