Zero-Shot Human Activity Recognition

Updated 19 March 2026

Zero-shot HAR is a technique that uses auxiliary semantic and language-based information to recognize human activities never seen during training.
It overcomes the need for comprehensive labeled datasets by employing joint embedding spaces, cross-modal representations, and retrieval-augmented reasoning.
Recent models report improved generalization and more transparent predictions across several sensor modalities and real-world environments.

Zero-shot Human Activity Recognition (HAR) is the paradigm in which systems must recognize human activities that were not encountered during training, using only auxiliary domain knowledge (such as language or semantic attributes) for transfer. This task addresses the core limitation of standard supervised HAR: the dependence on exhaustive labeled data for every possible activity, sensor modality, or environment. Recent work reports zero-shot HAR across sensor modalities (e.g., IMU, video, radar, LiDAR, smart-home event sensors), leveraging joint embedding spaces, cross-modal representations, retrieval-augmented reasoning, and the interpretability of large language models (LLMs). These advances improve HAR generalization to new subjects, activities, or domains, and make predictions more transparent.

1. Zero-Shot Human Activity Recognition: Formalization and Motivation

Zero-shot HAR extends the zero-shot learning (ZSL) framework, originally studied for classification and vision tasks, to scenarios where no labeled data for certain target activities ("unseen classes") are available at training time. Formally, consider an activity label set $\mathcal{Y} = \mathcal{Y}_\text{seen} \cup \mathcal{Y}_\text{unseen}$, with $\mathcal{Y}_\text{seen} \cap \mathcal{Y}_\text{unseen} = \emptyset$. The objective is to learn a function $f: X \to \mathcal{Y}$ such that, for an input $x \in X$ (e.g., a raw IMU sequence or a video clip), the prediction $f(x)$ can fall in $\mathcal{Y}_\text{unseen}$ even though the labeled training data contain no instances of any class in $\mathcal{Y}_\text{unseen}$ (Silva et al., 25 Jun 2025, Wang et al., 2017, Cheshmi et al., 17 Jul 2025).
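
One common way to instantiate $f$ is a compatibility-based decision rule. The formula below is a sketch under stated assumptions: the sensor encoder $\phi$, the class-description encoder $\psi$, and the similarity $s$ are notation introduced here for illustration and are not taken from the cited papers.

$$
\hat{y} = f(x) = \arg\max_{y \in \mathcal{Y}_\text{unseen}} s\big(\phi(x), \psi(y)\big), \qquad s(u, v) = \frac{u^\top v}{\lVert u \rVert \, \lVert v \rVert}.
$$

Under this reading, $\phi$ is fit only on pairs $(x, y)$ with $y \in \mathcal{Y}_\text{seen}$, while $\psi$ is defined for every label in $\mathcal{Y}$ because it depends only on auxiliary information such as a word embedding or a text description.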

Zero-shot ability is essential for HAR deployment in fields such as healthcare, smart environments, and mobile sensing. Here, unseen user behaviors, new sensor configurations, or shifting environmental affordances can generate activities absent from pre-collected HAR datasets. The primary enablers for zero-shot HAR are (1) cross-modal/semantic transfer (e.g., from word embeddings, LLMs, or video primitives), and (2) robust evaluation on carefully partitioned datasets—ensuring that test activities (or subjects) are absent from any model adaptation or training.

2. Shared Semantic Embedding Spaces

Central to zero-shot HAR is the construction of a semantic embedding space that permits transfer between observed and unobserved classes. Several architectures have been reported:

Semantic Alignment via Word Embeddings or Descriptors: Models such as SEZ-HARN and Joint Latent Ranking Embedding map sensor inputs and activity labels into a shared latent space, leveraging word embeddings (e.g., GloVe, Word2Vec) or text descriptors as semantic anchors. Feature similarity (typically dot-product or cosine) enables direct comparison, so that even when an activity is unseen, its embedding can serve as a query prototype (Wang et al., 2017, Silva et al., 25 Jun 2025).
Vision-Language Alignment and Video Semantics: Approaches such as video-based autoencoders (Ornek, 2020), TENT (Zhou et al., 2023), and cross-modal pretraining (Cheshmi et al., 17 Jul 2025) fuse features from videos and textual descriptors. For instance, TENT uses CLIP-style encoders for both sensor data and language, training with a contrastive InfoNCE objective, supplementing class labels with short descriptions and learned soft prompts to enhance generalization.
Natural Language Summarization and Embedding: In smart-home settings, raw sensor events are heuristically summarized into human-readable sentences, which, together with one-sentence activity descriptors, are embedded with a pretrained SBERT model. Cosine similarity between the sensor-summary and activity-descriptor embeddings directly yields zero-shot predictions (Dhekane et al., 29 Jul 2025).
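
The similarity-based prediction step shared by these approaches can be sketched in a few lines. This is a minimal illustration, not any cited model's implementation: the function names and the toy 3-dimensional vectors are hypothetical stand-ins for the outputs of a trained sensor encoder and a text or word-embedding encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def zero_shot_predict(sensor_embedding, class_prototypes):
    """Score a sensor embedding against every class prototype; return (best_label, scores)."""
    scores = {
        label: cosine(sensor_embedding, prototype)
        for label, prototype in class_prototypes.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical semantic anchors for two classes that were never seen in training.
unseen_prototypes = {
    "cycling": [0.9, 0.1, 0.0],
    "rope_jumping": [0.0, 0.2, 0.9],
}
# Hypothetical embedding of a new sensor window.
query = [0.1, 0.1, 0.8]

label, scores = zero_shot_predict(query, unseen_prototypes)
print(label, scores)
```

Because the prototypes come from auxiliary semantics rather than labeled sensor data, adding a new activity only requires adding one more entry to the prototype dictionary.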

Table 1: Shared Semantic Embedding Paradigms

Approach	Modality	Semantic Anchor
Joint Latent Embedding	Video	Word2Vec, attributes
SEZ-HARN	IMU, video	I3D video features
SBERT Summary	Event sensors	Sentential descriptors
TENT	Video, LiDAR, radar	CLIP text embeddings, descriptions, soft prompts
PatchTST+VideoMAE	IMU, video	Joint cross-modal space

The diversity of representations (from deep feature embeddings to structured language) demonstrates the generality of the semantic transfer principle in zero-shot HAR.
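
For the contrastive alignment used by CLIP-style methods, a symmetric InfoNCE loss over a batch of paired sensor and text embeddings can be sketched as follows. This is a pure-Python illustration under assumptions: the function names are hypothetical, the temperature of 0.07 is a commonly used starting value rather than a setting reported by the cited work, and real systems would compute this with a tensor library and backpropagate through the encoders.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def _diagonal_cross_entropy(logits):
    """Mean of -log softmax(row_i)[i]: row i should score highest at column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(sensor_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair i of sensor_embs matches pair i of text_embs."""
    s = [_normalize(v) for v in sensor_embs]
    t = [_normalize(v) for v in text_embs]
    # Cosine similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        for si in s
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    sensor_to_text = _diagonal_cross_entropy(logits)
    text_to_sensor = _diagonal_cross_entropy(logits_transposed)
    return 0.5 * (sensor_to_text + text_to_sensor)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

The loss is near zero when each sensor embedding is closest to its own description and grows when pairs are mismatched, which is the pressure that pulls the two modalities into a shared space.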

3. Algorithms and Architectural Paradigms

Zero-shot HAR models fall into several principal design families:

Joint-Embedding Models: These architectures map both sensor input and class descriptors to a shared latent space, enforcing proximity of semantically related pairs via alignment or ranking losses (e.g., cosine similarity, cross-modal ranking). Zero-shot classification operates as nearest-neighbor search or scoring in this space (Ornek, 2020, Wang et al., 2017, Silva et al., 25 Jun 2025, Zhou et al., 2023, Cheshmi et al., 17 Jul 2025).
LLM-Prompted Classification: Emerging works demonstrate that frozen LLMs (GPT-4, Gemini) can be prompted with raw or preprocessed sensory input (e.g., IMU time series directly serialized as numeric tuples) plus candidate activities. Contextual prompting—particularly with "role play" and "step-by-step analysis" cues—enables robust zero-shot activity assignment without gradient updates (Ji et al., 2024).
Retrieval-Augmented and Agent-Based Reasoning: Frameworks such as ZARA maintain a feature-level knowledge base for all class pairs and dynamically retrieve prototypical examples via a frozen time-series embedder. A hierarchical agent pipeline built on an LLM then performs feature selection, evidence pruning, and chain-of-thought justification, achieving state-of-the-art zero-shot accuracy and interpretability (Li et al., 6 Aug 2025).
Language Modeling via Embedding: Instead of online prompting, sensor events and activities are converted into natural language representations and embedded using a static sentence-BERT model, allowing for deterministic and privacy-preserving zero-shot recognition (Dhekane et al., 29 Jul 2025).
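
To make the LLM-prompted family concrete, the sketch below assembles a prompt from serialized IMU samples, a candidate list, a role-play cue, and a step-by-step cue. The wording is hypothetical and is not the prompt used in the cited work; `build_har_prompt` and its parameters are names introduced here. Sending the prompt to a model is deliberately left out, since that depends on the provider's API.

```python
def build_har_prompt(imu_samples, candidates, sampling_hz):
    """Serialize accelerometer tuples and candidate activities into one prompt string.

    imu_samples: list of (ax, ay, az) tuples.
    candidates:  list of activity names the model must choose from.
    sampling_hz: sampling rate, included so the model can reason about timing.
    """
    series = ", ".join(
        f"({ax:.2f}, {ay:.2f}, {az:.2f})" for ax, ay, az in imu_samples
    )
    options = ", ".join(candidates)
    return (
        # Role-play cue.
        "You are an expert in analyzing wearable IMU data.\n"
        f"The following accelerometer readings (x, y, z) were sampled at {sampling_hz} Hz:\n"
        f"{series}\n"
        f"Candidate activities: {options}.\n"
        # Step-by-step cue, followed by a constrained answer format.
        "Analyze the signal step by step, then answer with exactly one candidate activity."
    )


samples = [(0.01, -0.02, 9.81), (0.03, 0.00, 9.79)]
prompt = build_har_prompt(samples, ["walking", "sitting"], sampling_hz=10)
print(prompt)
```

Constraining the answer to the candidate list makes the response easier to parse, which matters given that free-form LLM output can be verbose.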

4. Evaluation Protocols and Benchmarks

Proper zero-shot HAR assessment requires strict subject/activity partitioning and diverse metrics:

Partitioning: Common protocols hold out a subset of activities as "unseen" (class-wise split), or exclude certain users/participants (subject-wise split). For example, HARGPT used test-unseen splits—20% of users never seen during training—for both Capture24 and HHAR datasets (Ji et al., 2024). SEZ-HARN performs k-fold class-wise splits, reserving 20–30% of classes as $C_u$ (Silva et al., 25 Jun 2025). Multi-label settings further restrict training and evaluation overlap by label or by instance (Wang et al., 2017).
Metrics: Typical evaluation metrics include per-class/macro F1, balanced accuracy, mean average precision (mAP), Top-N retrieval accuracy, recall@K, and mean reciprocal rank (MRR). Macro-F1 and per-class accuracy guard against class imbalance (Li et al., 6 Aug 2025, Cheshmi et al., 17 Jul 2025, Ji et al., 2024, Wang et al., 2017).
Datasets: Zero-shot HAR spans modalities and domains:
- Video: Kinetics (Ornek, 2020), Charades, Breakfast (Wang et al., 2017)
- IMU: PAMAP2, DaLiAc, UTD-MHAD, MHEALTH (Silva et al., 25 Jun 2025), HHAR, Capture24 (Ji et al., 2024)
- Multisensor/smart-home: CASAS, MARBLE, ARAS (Dhekane et al., 29 Jul 2025)
- IoT: MM-Fi (video, LiDAR, radar) (Zhou et al., 2023)
- OOD: MMEA, Parkinson’s Disease IMU (Cheshmi et al., 17 Jul 2025)
- Time-series: Opportunity, UCI-HAR, WISDM, DSADS (Li et al., 6 Aug 2025)
- Wireless CSI: covered by Diaz et al. (2023); specific datasets are not detailed here
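
The class-wise and subject-wise partitioning described above can be sketched with a single helper. This is an illustrative assumption about how such a split might be coded, not the protocol of any cited paper: `holdout_split`, the sample dictionary layout, and the fraction values are hypothetical.

```python
import random


def holdout_split(samples, key, held_out_fraction=0.25, seed=0):
    """Hold out whole groups (classes or subjects) so they never appear in training.

    samples: list of dicts describing one labeled window each.
    key:     function mapping a sample to its group (e.g., its label or its subject).
    Returns (train, test, held_out_groups).
    """
    groups = sorted({key(s) for s in samples})
    if len(groups) < 2:
        raise ValueError("need at least two groups to hold one out")
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(groups)
    n_held = min(len(groups) - 1, max(1, round(held_out_fraction * len(groups))))
    held_out = set(groups[:n_held])
    train = [s for s in samples if key(s) not in held_out]
    test = [s for s in samples if key(s) in held_out]
    return train, test, held_out


samples = [
    {"x": [0.1], "label": "walk", "subject": "s1"},
    {"x": [0.2], "label": "run", "subject": "s1"},
    {"x": [0.3], "label": "sit", "subject": "s2"},
    {"x": [0.4], "label": "cycle", "subject": "s2"},
    {"x": [0.5], "label": "walk", "subject": "s3"},
]

# Class-wise split: one of the four activities becomes "unseen".
train, test, unseen = holdout_split(samples, key=lambda s: s["label"])

# Subject-wise split: one of the three participants is held out.
train_s, test_s, unseen_subjects = holdout_split(
    samples, key=lambda s: s["subject"], held_out_fraction=0.34
)
```

Splitting by group rather than by sample is what prevents leakage: a random per-sample split would place windows of the same activity or person on both sides.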

5. Quantitative Results and Empirical Comparisons

Zero-shot HAR methods report gains over classical and unimodal baselines, and several of the results below are backed by ablation studies.

LLM-Prompted Zero-Shot HAR: HARGPT (GPT-4, Chain-of-Thought prompting) achieved macro F1 of 0.795 (Capture24, 4-class) and 0.790 (HHAR, 2-class) on unseen-subject splits, surpassing conventional DCNNs, LIMU-LSTM, RF, and SVMs. Direct-output (no CoT) prompts led to much lower F1, confirming the criticality of prompt engineering (Ji et al., 2024).
Self-Explainable Models: SEZ-HARN matched or exceeded the strongest black-box ZS-HAR models on DaLiAc, UTD-MHAD, and MHEALTH, and was within 3% of the best model on PAMAP2. Explanation modules (skeleton video generation) incurred no meaningful accuracy degradation, validated both quantitatively (DTW, DFD) and via human studies (>80% correct super-class identification) (Silva et al., 25 Jun 2025).
Retrieval- and Agent-Based Approaches: ZARA yielded macro F1 up to 92.5% (Opportunity) and 81.4% on average across eight benchmarks, a 2.53x improvement over the best baselines (e.g., UniMTS). Ablations showed that each module (retrieval, prior knowledge, evidence pruning) contributed 10–20 percentage points to accuracy (Li et al., 6 Aug 2025).
Cross-Modal Pretraining: Self-supervised IMU-video alignment with PatchTST and VideoMAE improved zero-shot balanced accuracy to 33.9% on MMEA and 24.1% for PD, outperforming IMU2CLIP and vision-language variants (Cheshmi et al., 17 Jul 2025). Similar cross-modal alignment drives TENT’s advances in video, LiDAR, and radar-based HAR (absolute gains of 12–23.5% in top-1 accuracy over vision-language baselines) (Zhou et al., 2023).
Language Modeling via Embedding: The Sentence-BERT smart-home pipeline delivered accuracy up to 81% (MARBLE) and 71% (CASAS Aruba), on par with or better than prompt-based LLM solutions, with privacy and reliability advantages (Dhekane et al., 29 Jul 2025).
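
Since the results above mix macro F1, balanced accuracy, and plain accuracy, it helps to see how the first two differ from plain accuracy on imbalanced data. The sketch below is a minimal pure-Python version; it averages over the classes present in `y_true`, which is an assumption made here and may differ from the conventions used in the cited evaluations. The toy labels are invented for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes that occur in y_true."""
    f1_scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes that occur in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


# Imbalanced toy case: a model that always predicts the majority class.
y_true = ["walk"] * 8 + ["fall"] * 2
y_pred = ["walk"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

In this toy case plain accuracy is 0.8 even though the minority class is never detected, while balanced accuracy drops to 0.5 and macro F1 to about 0.44.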

6. Explainability, Limitations, and Practical Considerations

Explainability is increasingly fundamental in zero-shot HAR:

Self-Explanation via Skeleton Generation: SEZ-HARN uniquely provides skeleton video traces for each prediction, with reconstruction losses enforcing kinematic fidelity to auxiliary video semantics. Human studies and DTW/DFD metrics confirmed the interpretability and realism of generated explanations (Silva et al., 25 Jun 2025).
Structured Agent Rationales: ZARA’s agent pipeline yields natural language rationales citing discriminative features and retrieved evidence, judged faithful in human evaluation (Li et al., 6 Aug 2025).
Language Embedding Transparency: Fixed SBERT encoders and handcrafted descriptors produce deterministic, inspectable decision rules without reliance on external APIs or variable black-box LLM outputs (Dhekane et al., 29 Jul 2025).
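
To illustrate why the language-embedding route is inspectable, the sketch below turns a window of smart-home events into one sentence using a fixed rule. The event fields, the sentence template, and `summarize_events` are hypothetical; the heuristics of the cited pipeline are not reproduced here. The subsequent step (embedding the sentence and the activity descriptors with a pretrained sentence encoder and comparing them by cosine similarity) is omitted because it requires a model download.

```python
from collections import Counter


def summarize_events(events):
    """Summarize time-ordered sensor events as a single human-readable sentence.

    events: list of dicts with keys "time" (HH:MM), "sensor", and "location".
    """
    if not events:
        return "No sensor events were recorded."
    # Count firings per (sensor type, location); most frequent first.
    counts = Counter((e["sensor"], e["location"]) for e in events)
    parts = [
        f"the {sensor} sensor in the {location} fired {n} time{'s' if n != 1 else ''}"
        for (sensor, location), n in counts.most_common()
    ]
    start, end = events[0]["time"], events[-1]["time"]
    return f"Between {start} and {end}, " + " and ".join(parts) + "."


events = [
    {"time": "07:05", "sensor": "motion", "location": "kitchen"},
    {"time": "07:08", "sensor": "motion", "location": "kitchen"},
    {"time": "07:12", "sensor": "door", "location": "hallway"},
    {"time": "07:20", "sensor": "motion", "location": "kitchen"},
]
summary = summarize_events(events)
print(summary)
```

The same events always yield the same sentence, so a disputed prediction can be traced back to the exact text that was embedded.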

Limitations highlighted across recent literature include:

Brittle or verbose LLM responses (particularly under variant prompt styles) (Ji et al., 2024).
Manual burden in crafting activity descriptors and aggregating summaries (Dhekane et al., 29 Jul 2025).
Dependence of cross-modal or retrieval-based models on embedding quality and coverage—rare or complex activities may yield ambiguous, less reliable predictions (Li et al., 6 Aug 2025, Cheshmi et al., 17 Jul 2025).
Real-time performance constraints, especially for agent-based or multi-stage retrieval models (Li et al., 6 Aug 2025).

7. Open Problems and Future Research Directions

The field continues to evolve along several axes:

Benchmark and Protocol Standardization: There is an ongoing need for widely adopted zero-shot HAR benchmarks, strict split protocols, and unified metrics to foster fair comparison and reproducibility (Ji et al., 2024, Dhekane et al., 29 Jul 2025).
Multimodal and Few-shot Extensions: Future work is aimed at fusing additional sensor modalities (e.g., audio, barometer, EMG, WiFi CSI) and combining zero-shot with few-shot strategies—using embedded class prototypes from sparse labeled instances for improved discrimination (Cheshmi et al., 17 Jul 2025, Dhekane et al., 29 Jul 2025).
Automated Descriptor Generation: Semi-automatic or privacy-preserving LLM-assisted generation of sensor summaries and activity descriptors can mitigate manual overhead and increase portability (Dhekane et al., 29 Jul 2025).
Adaptive and Plug-and-play Systems: Retrieval-based and agent-oriented frameworks are being extended for dynamic knowledge base updating, online learning, and flexible sensor sets (Li et al., 6 Aug 2025).
Transparent Reasoning and Trust: Combining visual, textual, and tabular explanation modalities (e.g., skeleton video + natural language + attention overlays) is advocated to further enhance user trust and regulatory compliance in critical environments (Silva et al., 25 Jun 2025, Li et al., 6 Aug 2025).
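
One simple way to picture the zero-shot plus few-shot combination mentioned above is to blend a class's text-derived anchor with the mean embedding of a handful of labeled examples. The sketch below is an assumption made for illustration, not a method from the cited papers: `blend_prototype`, the convex-combination form, and the default weight are all hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def blend_prototype(text_embedding, few_shot_embeddings, weight=0.5):
    """Combine a semantic anchor with the mean of a few labeled sensor embeddings.

    weight is the share given to the few-shot mean; with no labeled examples
    the function falls back to the pure zero-shot anchor.
    """
    if not few_shot_embeddings:
        return list(text_embedding)
    shot_mean = mean_vector(few_shot_embeddings)
    return [
        (1.0 - weight) * t + weight * m
        for t, m in zip(text_embedding, shot_mean)
    ]


# Toy 2-d example: a text anchor and two labeled sensor embeddings.
prototype = blend_prototype([1.0, 0.0], [[0.0, 1.0], [0.0, 3.0]], weight=0.5)
zero_shot_only = blend_prototype([1.0, 0.0], [])
print(prototype, zero_shot_only)
```

The blended prototype can then be scored against new sensor embeddings in the same way as a purely semantic one, so the zero-shot path keeps working for classes that have no labeled examples at all.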

Zero-shot HAR remains a rapidly developing area, unifying advances in transfer learning, multimodal representation, language modeling, and human-centered AI, with strong empirical evidence for both performance and interpretability over an expanding range of sensor platforms and settings.