Zero-Shot Human Activity Recognition
- Zero-Shot HAR is a framework that uses semantic and attribute spaces derived from language and multimodal sources to infer labels for unseen activities.
- The approach aligns input signals like video or sensor data with text embeddings or verb attributes, allowing transfer learning without class-specific training samples.
- By enhancing scalability in areas such as surveillance, healthcare, and HCI, zero-shot HAR offers practical solutions for recognizing emergent human actions.
Zero-shot Human Activity Recognition (HAR) is a domain within activity recognition focused on inferring previously unseen human actions or activities by leveraging auxiliary information rather than explicit class-wise training samples. The fundamental goal is to utilize semantic, linguistic, attribute-based, or multimodal side information to infer the label of an activity for which no labeled training data exist, with extensive application across video, wearable sensor, and multimodal IoT contexts.
1. Conceptual Foundations
Zero-shot HAR is predicated on the transfer of knowledge from seen to unseen activity categories by projecting both input signals (e.g., video, IMU, sensor streams) and candidate activity descriptions into a shared semantic or attribute space. This enables the classification of unseen activities as long as semantic or relational representations (e.g., verb attributes, word embeddings, textual descriptions) exist for these classes. The approach contrasts with fully supervised HAR, which requires labeled data for every target class, and addresses challenges of limited annotation, unseen action generalization, and cross-domain deployment.
Formally, let $x \in \mathcal{X}$ denote the observable activity data, $\mathcal{Y}_s$ the set of seen (training) classes, $\mathcal{Y}_u$ the set of unseen (test-only) classes with $\mathcal{Y}_s \cap \mathcal{Y}_u = \emptyset$, and $\phi: \mathcal{Y}_s \cup \mathcal{Y}_u \to \mathbb{R}^d$ a mapping from activity label to side information (attributes, embeddings, etc.). The recognition task is to learn $f: \mathcal{X} \to \mathcal{Y}_u$ while training only on pairs $\{(x_i, y_i) : y_i \in \mathcal{Y}_s\}$.
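A minimal sketch of this setup in Python, assuming the projection of an input into the semantic space has already been learned; the function names and toy attribute vectors below are illustrative, not drawn from any cited paper:

```python
import numpy as np

def zero_shot_classify(f_x, unseen_side_info):
    """Assign an unseen-class label by nearest neighbor in the semantic space.

    f_x: (d,) projection of one test example into the semantic space.
    unseen_side_info: dict mapping unseen label -> (d,) side-information vector phi(y).
    """
    labels = list(unseen_side_info)
    S = np.stack([unseen_side_info[y] for y in labels])  # (|Y_u|, d)
    # Cosine similarity between the projected input and each class vector.
    sims = S @ f_x / (np.linalg.norm(S, axis=1) * np.linalg.norm(f_x) + 1e-8)
    return labels[int(np.argmax(sims))]

# Toy usage: two unseen activities described by 4-d attribute vectors.
phi = {"jogging": np.array([1.0, 0.9, 0.1, 0.0]),
       "sleeping": np.array([0.0, 0.1, 1.0, 0.9])}
print(zero_shot_classify(np.array([0.8, 1.0, 0.0, 0.1]), phi))  # -> "jogging"
```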
2. Attribute and Semantic Space Construction
A central thread is the representation of activities in a semantic or attribute space derived from language or multimodal sources.
Verb Attribute Modeling
"Zero-Shot Activity Recognition with Verb Attribute Induction" (Zellers et al., 2017) formulates each activity verb as a vector of semantic properties—including aspect, duration, motion dynamics, social context, transitivity, effect on object, and body involvement. These attributes are predicted from dictionary definitions or word embeddings:
- For each verb $v$, the linguistic input $x_v$ (definition or embedding) is encoded by a feature extractor $g(\cdot)$, and a nonlinear map predicts attribute distributions: $\hat{a}_{v,k} = \sigma_k(W_k\, g(x_v) + b_k)$, where $k$ indexes attributes, $W_k, b_k$ are learnable parameters, and $\sigma_k$ is a sigmoid (for binary attributes) or a softmax (for multi-class attributes); a minimal sketch of such a prediction head appears after this list.
- Attributes are mined automatically from language, eliminating manual annotation for unseen classes and establishing a linguistic bridge to visual data.
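A minimal sketch of an attribute-induction head in PyTorch, assuming a precomputed encoding of each verb's definition; layer sizes and attribute groupings are illustrative assumptions, not those of Zellers et al. (2017):

```python
import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    """Predicts binary and multi-class verb attributes from an encoded definition."""

    def __init__(self, enc_dim=300, n_binary=12, multiclass_sizes=(3, 4)):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(enc_dim, 256), nn.ReLU())
        # One sigmoid-activated logit per binary attribute (e.g., "is social").
        self.binary = nn.Linear(256, n_binary)
        # One softmax group per multi-class attribute (e.g., duration classes).
        self.multi = nn.ModuleList([nn.Linear(256, k) for k in multiclass_sizes])

    def forward(self, x_v):
        h = self.trunk(x_v)
        binary_probs = torch.sigmoid(self.binary(h))
        multi_probs = [torch.softmax(head(h), dim=-1) for head in self.multi]
        return binary_probs, multi_probs

head = AttributeHead()
probs_bin, probs_multi = head(torch.randn(1, 300))  # one encoded verb definition
```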
Distributed Word Embeddings
Word embeddings (e.g., GloVe, Word2Vec) serve as alternative semantic representations that capture syntactic and semantic relationships among activity labels. Models either directly transfer these vectors or fuse them with attribute predictions for shared representation spaces.
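A sketch of the fusion variant, assuming pre-loaded label embeddings and predicted attribute vectors; the concatenate-and-normalize scheme here is one simple option, not a specific published recipe:

```python
import numpy as np

def fuse_semantics(word_vec, attr_vec):
    """Fuse a label's word embedding with its predicted attribute vector.

    Each part is L2-normalized so neither modality dominates the shared
    representation; the normalized parts are then concatenated.
    """
    w = word_vec / (np.linalg.norm(word_vec) + 1e-8)
    a = attr_vec / (np.linalg.norm(attr_vec) + 1e-8)
    return np.concatenate([w, a])

# Toy usage: a 300-d GloVe vector and a 24-d attribute prediction.
fused = fuse_semantics(np.random.randn(300), np.random.rand(24))
print(fused.shape)  # (324,)
```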
Multimodal and Joint Embeddings
More recent works (e.g., Ornek, 2020; Zhou et al., 2023) construct a joint embedding space by aligning video or sensor representations with text or attribute vectors, typically via autoencoders, contrastive objectives, or cross-modal alignment losses; a contrastive sketch follows.
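A sketch of a standard symmetric InfoNCE objective for cross-modal alignment, assuming paired sensor and text embeddings; the dimensions and temperature are illustrative:

```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(sensor_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired sensor and text embeddings.

    sensor_emb, text_emb: (B, d) batches where row i of each is a matched pair.
    Matched pairs are pulled together; other rows in the batch act as negatives.
    """
    s = F.normalize(sensor_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature        # (B, B) cosine similarities
    targets = torch.arange(s.size(0))     # the diagonal holds matched pairs
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = cross_modal_info_nce(torch.randn(8, 128), torch.randn(8, 128))
```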
3. Modeling Paradigms and Learning Frameworks
Zero-shot HAR frameworks span a range of architectural and algorithmic paradigms:
| Approach | Input Modality | Semantic Transfer Mechanism |
|---|---|---|
| Attribute Induction | Image/video | Verb attribute prediction from language (Zellers et al., 2017) |
| Joint Embedding/Ranking | Video | Visual-semantic ranking in latent space (Wang et al., 2017) |
| Set Prediction | Wearable sensors | Set-valued deep output with cardinality estimation (Varamin et al., 2018) |
| Dynamic Signatures | Video | Temporal attribute state machines (WFST) (Kim et al., 2019) |
| Cross-modal Contrastive | Video/sensor/text | Joint alignment of sensor and language feature spaces (Zhou et al., 2023, Cheshmi et al., 17 Jul 2025) |
A representative workflow (a code sketch follows this list):
- Extract domain features via deep CNNs (images), LSTM/Transformer (sequential sensor), or GNNs (multi-sensor graphs).
- Map features to semantic or latent spaces via attribute classifiers, text embedding matching, or cross-modal alignment.
- For a test example belonging to an unseen class, predict its attributes or embedding and compute similarity with the semantic vectors of unseen activity categories.
- Output class is assigned based on nearest neighbor or softmax in the semantic space.
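A compact sketch of this workflow with a placeholder sequential encoder; the architecture, dimensions, and names are illustrative assumptions, not any cited paper's exact model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroShotHAR(nn.Module):
    """Encode a sensor window, project it to the semantic space, and score classes."""

    def __init__(self, in_channels=6, sem_dim=300):
        super().__init__()
        # Stand-in sequential encoder for sensor windows of shape (B, T, in_channels).
        self.encoder = nn.GRU(in_channels, 128, batch_first=True)
        self.project = nn.Linear(128, sem_dim)  # map features into the semantic space

    def forward(self, x, class_vectors):
        _, h = self.encoder(x)                        # final hidden state (1, B, 128)
        z = F.normalize(self.project(h[-1]), dim=-1)  # (B, sem_dim)
        c = F.normalize(class_vectors, dim=-1)        # (C, sem_dim)
        return z @ c.T                                # cosine similarity scores (B, C)

model = ZeroShotHAR()
scores = model(torch.randn(4, 100, 6), torch.randn(10, 300))
pred = scores.argmax(dim=-1)  # nearest unseen class per example
```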
Importantly, joint optimization of the visual/sensor and semantic mappings with ranking losses or contrastive objectives is key to effective zero-shot transfer (Wang et al., 2017, Zhou et al., 2023, Cheshmi et al., 17 Jul 2025); a margin-ranking sketch follows.
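A sketch of a standard visual-semantic margin ranking loss, assuming projected inputs and class vectors as above; the margin value is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def visual_semantic_ranking_loss(z, class_vecs, y, margin=0.2):
    """Hinge ranking loss: the true class vector should outscore every wrong
    class vector by at least `margin` for each projected example.

    z: (B, d) projected inputs; class_vecs: (C, d); y: (B,) integer labels.
    """
    sims = F.normalize(z, dim=-1) @ F.normalize(class_vecs, dim=-1).T  # (B, C)
    pos = sims.gather(1, y.unsqueeze(1))                               # (B, 1)
    hinge = (margin - pos + sims).clamp(min=0)                         # (B, C)
    mask = torch.ones_like(hinge).scatter(1, y.unsqueeze(1), 0.0)      # drop true class
    return (hinge * mask).mean()

loss = visual_semantic_ranking_loss(torch.randn(4, 300), torch.randn(10, 300),
                                    torch.tensor([0, 3, 7, 2]))
```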
4. Evaluation Protocols and Empirical Findings
Zero-shot HAR research relies on careful evaluation protocols reflecting the absence of labeled samples for unseen classes:
- Dataset splits are constructed so that no visual or sensor examples of unseen classes ($\mathcal{Y}_u$) appear in training or validation. Protocols include instance-first-split (IFS) and label-first-split (LFS) (Wang et al., 2017).
- Standard metrics: top-1/top-5 accuracy (for multi-class), mean average precision (mAP), F1, instance- and label-centric scores (for multi-label), and exact match ratio (for set outputs) (Varamin et al., 2018); a top-k accuracy sketch appears after this list.
- Baselines: direct attribute prediction (DAP), word embedding transfer (DeVISE), and their hybrids.
- On the imSitu dataset (96-class ZSL), ensemble attribute + embedding models achieve ~18.15% top-1 and 42.17% top-5 accuracy (Zellers et al., 2017). In multi-label scenarios, fusions of ranking loss-driven models outperform DAP, ConSE, Fast0Tag, and other ZSL baselines (Wang et al., 2017).
- Attribute prediction from language alone can match or surpass manual “gold” attributes on certain classes (Zellers et al., 2017). Joint models tend to perform best, reflecting complementary linguistic signals.
- Realistic splits and careful consideration of co-occurring labels are critical for meaningful zero-shot evaluation.
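A minimal sketch of the top-k accuracy computation used throughout these protocols; the score matrix and labels below are synthetic placeholders:

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of examples whose true label is among the k highest-scoring classes.

    scores: (N, C) similarity or probability matrix; labels: (N,) integer labels.
    """
    topk = np.argsort(-scores, axis=1)[:, :k]       # indices of the k best classes
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

scores = np.random.rand(100, 96)                    # e.g., 96 unseen classes
labels = np.random.randint(0, 96, size=100)
print(top_k_accuracy(scores, labels, k=1), top_k_accuracy(scores, labels, k=5))
```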
5. Challenges, Limitations, and Ongoing Research Directions
Several challenges are highlighted in the literature:
- Attribute Sparsity and Fidelity: With only 24–40 attributes, representational capacity is tightly constrained (Zellers et al., 2017). Dynamic action signatures offer a richer temporal characterization (Kim et al., 2019).
- Contextual Variability: Lexical definitions and word embeddings may not fully capture the nuances of activities with context-dependent semantics or diverse physical instantiations.
- Hubness Phenomenon: Embedding-based models can over-predict common or high-frequency verbs ("hubness"), reducing specificity in unseen-class predictions; a simple similarity-rescaling sketch follows this list.
- Temporal Granularity: Most early work focused on image-based or static attribute models; newer research emphasizes temporal structure and dynamic attribute evolution (Kim et al., 2019), joint segmentation (Kim et al., 2019), and set-output models (Varamin et al., 2018).
- Limited Modalities: Many approaches focus on visual data; sensor-based and multimodal models (video, IMU, LiDAR, IoT) are increasingly important for broad applicability.
- Evaluation Leakage: Data split strategies can inadvertently allow information from unseen actions to bias model selection or tuning (Wang et al., 2017).
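One widely used hubness mitigation from outside the cited papers is cross-domain similarity local scaling (CSLS; Conneau et al., 2018), which discounts candidate classes that sit close to many queries. A simplified one-sided sketch, assuming a precomputed cosine similarity matrix:

```python
import numpy as np

def csls_scores(sims, k=10):
    """One-sided CSLS rescaling of query-to-class cosine similarities.

    Subtracts each class's average similarity to its k nearest queries,
    penalizing 'hub' classes that are close to everything.
    sims: (N_queries, C) cosine similarity matrix.
    (Full CSLS also subtracts a symmetric query-side term.)
    """
    k = min(k, sims.shape[0])
    class_hubness = np.sort(sims, axis=0)[-k:].mean(axis=0)  # (C,)
    return 2 * sims - class_hubness[None, :]

sims = np.random.rand(50, 20)            # 50 test queries, 20 unseen classes
preds = csls_scores(sims).argmax(axis=1)
```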
Future research directions explicitly proposed in the literature include:
- Expanding and diversifying attribute and semantic spaces.
- End-to-end data-driven attribute learning.
- Integrating temporal models (video, sequential sensor streams).
- Cross-modal and multi-modal alignment (e.g., video-text, IMU-video, IoT-language).
- Counteracting embedding space hubness and enhancing domain generalization.
6. Applications and Impact
Zero-shot HAR approaches extend the reach of activity recognition to domains and deployment scenarios where labeled training data for every possible action is unattainable. Key areas of impact include:
- Surveillance and Security: Unseen or emergent behaviors can be inferred with semantic attribute models.
- Human–Computer Interaction and Robotics: Enables agents to interpret or respond to unfamiliar human actions using linguistic knowledge.
- Healthcare and Assisted Living: Facilitates continuous monitoring with the ability to adapt to novel patient behaviors.
- Video Analytics and Content Retrieval: Unlabeled actions present in massive databases can be indexed or retrieved by virtue of semantic similarity.
The paradigm improves the scalability, adaptability, and transparency of HAR systems, with empirical evidence of competitive accuracy and robustness relative to supervised and few-shot learning baselines (Zellers et al., 2017, Wang et al., 2017, Varamin et al., 2018, Kim et al., 2019).
7. Notable Research and Comparative Table
The following table summarizes representative zero-shot HAR approaches and their core mechanisms:
| Reference | Input Modality | Semantic Representation | Zero-Shot Mechanism |
|---|---|---|---|
| (Zellers et al., 2017) | Image (imSitu) | Verb attributes, GloVe | Attribute induction, fusion |
| (Wang et al., 2017) | Video (Breakfast, Charades) | Word2Vec, semantic network | Joint latent ranking embedding |
| (Varamin et al., 2018) | Wearable sensors | N/A (set labels) | Set prediction, autoencoder |
| (Kim et al., 2019) | Video, attribute detectors | Dynamic action signatures (temporal) | Composition via WFST |
| (Ornek, 2020) | Video, text | Multimodal latent autoencoders | Joint latent space, contrastive |
| (Zhou et al., 2023) | Video, LiDAR, mmWave, text | Contrastive language alignment | Unified semantic feature space |
| (Cheshmi et al., 17 Jul 2025) | IMU, video | Cross-modal contrastive | Prototype alignment, OOD generalization |
In summary, zero-shot human activity recognition leverages transferable semantic structures (attribute-based, linguistic, or multimodal) to circumvent the need for labeled data for every possible action, advancing the robustness and reach of HAR technologies in contemporary machine perception.