ActivityNarrated: An Open-Ended Narrative Paradigm for Wearable Human Activity Understanding

Published 1 Apr 2026 in cs.LG | (2604.00767v1)

Abstract: Wearable HAR has improved steadily, but most progress still relies on closed-set classification, which limits real-world use. In practice, human activity is open-ended, unscripted, personalized, and often compositional, unfolding as narratives rather than instances of fixed classes. We argue that addressing this gap does not require simply scaling datasets or models. It requires a fundamental shift in how wearable HAR is formulated, supervised, and evaluated. This work shows how to model open-ended activity narratives by aligning wearable sensor data with natural-language descriptions in an open-vocabulary setting. Our framework has three core components. First, we introduce a naturalistic data collection and annotation pipeline that combines multi-position wearable sensing with free-form, time-aligned narrative descriptions of ongoing behavior, allowing activity semantics to emerge without a predefined vocabulary. Second, we define a retrieval-based evaluation framework that measures semantic alignment between sensor data and language, enabling principled evaluation without fixed classes while also subsuming closed-set classification as a special case. Third, we present a language-conditioned learning architecture that supports sensor-to-text inference over variable-length sensor streams and heterogeneous sensor placements. Experiments show that models trained with fixed-label objectives degrade sharply under real-world variability, while open-vocabulary sensor-language alignment yields robust and semantically grounded representations. Once this alignment is learned, closed-set activity recognition becomes a simple downstream task. Under cross-participant evaluation, our method achieves 65.3% Macro-F1, compared with 31-34% for strong closed-set HAR baselines. These results establish open-ended narrative modeling as a practical and effective foundation for real-world wearable HAR.

Authors (7)

Summary

The paper introduces an open-ended narrative paradigm for wearable HAR that maps variable-length sensor streams to natural language descriptions, challenging fixed-taxonomy methods.
The methodology leverages spectral tokenization and a Q-Former module, achieving 38.5% R@1 in retrieval and 65.3% Macro-F1 in closed-set classification.
The framework demonstrates robust performance across cross-subject, cross-position, and missing-sensor scenarios, highlighting its potential for real-world applications.

Open-Vocabulary Narratives for Wearable Human Activity Recognition with ActivityNarrated

Paradigmatic Shift in HAR: From Closed-Set Labels to Open-Ended Narratives

Prevailing pipelines in wearable human activity recognition (HAR) operate under a closed-set classification paradigm: fixed temporal windows, canonical sensor placements, and predefined activity vocabularies dominate both benchmarks and modeling practice. This paradigm yields high benchmark accuracy but fails under real-world variability such as sensor placement shift, missing sensors, and semantically complex, long-tail behaviors. Critically, scaling datasets or model parameter counts does not resolve the limitations inherent in a closed taxonomy (Figure 1).

Figure 1: Conceptual comparison between traditional closed-set HAR and the open-vocabulary narrative paradigm; open-vocabulary HAR subsumes closed-set classification as a downstream task by grounding sensor signals in a language embedding space.

"ActivityNarrated: An Open-Ended Narrative Paradigm for Wearable Human Activity Understanding" (2604.00767) directly challenges the closed-set paradigm. It introduces a framework where activities are modeled as open-ended narratives, with activity semantics discovered through the alignment of multi-position wearable IMU sensor data and natural language descriptions.

This approach reframes recognition: activity understanding is no longer the selection of a class index but the retrieval or generation of narrative descriptions that are temporally aligned with the ongoing sensor evidence. The system explicitly models the real-world properties of human behavior, which is unscripted, compositional, long-tailed, and personalized.

The ActivityNarrated Dataset: Naturalistic, Multi-Position, and Language-Grounded

To support rigorous evaluation of this paradigm, ActivityNarrated is collected with three pillars:

Unscripted, Open-Ended Behavior: 22 participants naturally interact in a controlled environment (Figure 2) with seven activity hotspots facilitating diverse, concurrent activities without imposed order or segmentation.
Figure 2: Room-scale testbed containing seven activity hotspots with multi-view camera coverage to capture unconstrained, multitask behaviors.
Dense Multi-Position IMU Sensing: Each participant is instrumented with up to 15 IMU sensors on various body locations (Figure 3), capturing both the diversity and heterogeneity of real-world deployments.
Figure 3: Platform for multi-position IMU sensing, highlighting extensive body coverage and supporting explicit test of sensor placement invariance.
Rich, Layered Annotation: Three sources of semantic supervision are paired with each activity segment: free-form participant narrations, VLM-derived weak labels, and expert post-hoc annotations (both in natural language and mapped to a 23-class taxonomy).

This design induces highly variable sensor configurations, segment durations, and narrative forms, producing a challenging regime for both open-vocabulary and closed-set HAR tasks.

Open-Vocabulary Benchmarking: Metrics and Tasks

Evaluation in this paradigm does not treat accuracy and Macro-F1 as the only targets. Instead, benchmarking covers three axes:

Sensor–Language Retrieval: The core metric is R@K (Recall at K) for retrieval given a segment and a candidate pool of descriptions, supplemented with MRR (Mean Reciprocal Rank) and nDCG for ranking quality amid description paraphrases and semantic granularity differences.
Discretization Metrics: The fidelity of the learned IMU representations is assessed with time- and spectral-domain $\ell_1$ reconstruction error, and the stability of token usage across placements and subjects is assessed with Jensen-Shannon (JS) divergence.
Downstream Closed-Set Classification: As a diagnostic, closed-set recognition metrics are reported by mapping generated narratives to fixed taxonomies.
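
The retrieval metrics above can be illustrated with a minimal Python sketch. It assumes that each sensor segment comes with a list of similarity scores against a candidate pool of descriptions, plus a set of indices marking which candidates count as correct. The function names, the hit-rate reading of R@K (a query is a hit if any relevant description appears in the top K), and the graded-gain form of nDCG are illustrative assumptions, not the paper's evaluation code.

```python
import math

def rank_of_first_relevant(scores, relevant):
    """1-based rank of the best-scoring relevant candidate, or None if absent."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, idx in enumerate(order, start=1):
        if idx in relevant:
            return rank
    return None

def recall_at_k(all_scores, all_relevant, k):
    """Fraction of queries with at least one relevant candidate in the top k."""
    hits = 0
    for scores, relevant in zip(all_scores, all_relevant):
        rank = rank_of_first_relevant(scores, relevant)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(all_scores)

def mean_reciprocal_rank(all_scores, all_relevant):
    """Average of 1/rank of the first relevant candidate (0 if none is found)."""
    total = 0.0
    for scores, relevant in zip(all_scores, all_relevant):
        rank = rank_of_first_relevant(scores, relevant)
        if rank is not None:
            total += 1.0 / rank
    return total / len(all_scores)

def ndcg_at_k(scores, gains, k):
    """nDCG for one query; gains[i] is the graded relevance of candidate i."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    dcg = sum(gains[i] / math.log2(pos + 2) for pos, i in enumerate(order))
    ideal = sorted(gains, reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: 3 sensor segments scored against a pool of 3 descriptions.
toy_scores = [[0.9, 0.2, 0.4], [0.1, 0.3, 0.8], [0.5, 0.6, 0.1]]
toy_relevant = [{0}, {1}, {0}]
print("R@1:", recall_at_k(toy_scores, toy_relevant, 1))
print("R@2:", recall_at_k(toy_scores, toy_relevant, 2))
print("MRR:", mean_reciprocal_rank(toy_scores, toy_relevant))
```

Graded gains in `ndcg_at_k` are one way to give partial credit to paraphrases or coarser descriptions of the same segment, which is the situation the nDCG metric is meant to handle here.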

The ActNarrator Architecture: Discrete Motion Lexicon Meets LLMs

The proposed system, ActNarrator, consists of two principal modules (Figures 4 and 5):

Spectral VQ-VAE Discretization: IMU streams are split into overlapping windows processed with multiple encoders: time-domain, STFT, and wavelet (Figure 4). The resulting fused latent is quantized via a codebook (typically $K=128$ entries) to yield robust, compositional motion tokens.
Figure 4: Spectral VQ-VAE encodes multi-view IMU streams into discrete tokens, facilitating robust motion primitive extraction suitable for cross-modal alignment.
Sensor-to-LLM Generation: Variable-length, multi-position token sequences are embedded and aggregated via a Q-Former, with a textual header encoding sensor configuration and segment duration (Figure 5). The fused prompt is input to a frozen LLM (e.g., Qwen 7B), which generates the open-vocabulary narrative conditioned on the sensor evidence.
Figure 5: Sensor-to-LLM generation pipeline: Q-Former aggregates arbitrary sensor tokens and formats them for LLM-driven narrative generation.
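
The discretization step can be pictured as two operations: slicing a stream into overlapping windows, and assigning each encoded window to its nearest codebook entry. The sketch below is a toy, pure-Python illustration of those two operations only. The window size, hop, 2-D latents, and 4-entry codebook are hypothetical stand-ins (the paper reports $K=128$), and the learned time-domain, STFT, and wavelet encoders are omitted entirely.

```python
def sliding_windows(samples, size, hop):
    """Split a sequence into overlapping windows of `size` samples, stepping by `hop`."""
    return [samples[s:s + size] for s in range(0, len(samples) - size + 1, hop)]

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry
    (squared Euclidean distance), i.e. the vector-quantization lookup."""
    tokens = []
    for z in latents:
        dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, code)) for code in codebook]
        tokens.append(min(range(len(codebook)), key=dists.__getitem__))
    return tokens

# Hypothetical 4-entry codebook and three already-encoded window latents.
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]
toy_tokens = quantize(toy_latents, toy_codebook)
print(toy_tokens)  # [0, 3, 2]

# Overlapping windows over a toy 10-sample stream (size 4, hop 2).
toy_windows = sliding_windows(list(range(10)), size=4, hop=2)
print(len(toy_windows), toy_windows[-1])  # 4 [6, 7, 8, 9]
```

In a trained VQ-VAE the codebook is learned jointly with the encoders; only the nearest-neighbor lookup shown here is what turns a continuous latent into a discrete motion token.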

Token-based augmentations (Figure 6) are used during training to improve compositionality and robustness to annotation and sensor irregularities.

Figure 6: Token-level augmentations for increased variability and transferability of motion primitives in the learned lexicon.

Micro-activity decomposition emerges naturally as recurring token patterns, reflecting shared motion primitives across activities and participants (Figure 7).

Figure 7: Micro-activity decomposition — similar activities yield related token subsequences, highlighting compositionality in real-world activity streams.

Empirical Results: Robust Sensor–Language Alignment and HAR Generalization

Open-vocabulary Retrieval

Compared to recent sensor–language alignment baselines (IMU2CLIP, OVHAR, PaLM-E), ActNarrator achieves substantial improvements. With Spectral VQ-VAE tokens and Q-Former, R@1 rises to 38.5% in the cross-subject (XS) setting and 34.0% in the cross-position (XP) setting, compared to the strongest baseline at ~23%. The improvement persists across R@5 and MRR. Missing-sensor experiments show graceful degradation, with models retaining >30% R@1 with just one or two sensors.

Closed-Set Classification

When evaluated for 23-class activity recognition, ActNarrator's cross-subject Macro-F1 is 65.3%, nearly doubling DeepConvLSTM and TinyHAR baselines (31–34%). Under cross-position shift (XP), classical pipelines degrade to <20% Macro-F1, while ActNarrator remains at >50%. Notably, the classification head is derived from LLM narrative outputs rather than direct classifier training.
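
As a rough illustration of how a generated narrative can be scored as a closed-set prediction, the sketch below maps a narrative to the class whose text prompt shares the most words with it, then computes Macro-F1. The word-overlap (Jaccard) mapping, the class prompts, and all function names are hypothetical choices made for this example; the source does not specify how narratives are mapped to the 23-class taxonomy.

```python
import re

def words(text):
    """Lowercased set of alphabetic words in a string."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two sets (0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def map_to_class(narrative, class_prompts):
    """Pick the class whose prompt has the highest word overlap with the narrative.
    This overlap rule is an assumed stand-in for the paper's mapping."""
    narrative_words = words(narrative)
    return max(class_prompts,
               key=lambda name: jaccard(narrative_words, words(class_prompts[name])))

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical class prompts and a generated narrative.
toy_prompts = {
    "cooking": "cooking food at the stove",
    "walking": "walking around the room",
}
toy_prediction = map_to_class("She is stirring food at the stove", toy_prompts)
print(toy_prediction)  # cooking

toy_true = ["walk", "walk", "cook", "sit"]
toy_pred = ["walk", "cook", "cook", "sit"]
print(macro_f1(toy_true, toy_pred, ["walk", "cook", "sit"]))
```

Because Macro-F1 averages per-class scores without weighting by frequency, rare classes count as much as common ones, which matters for long-tail activity distributions.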

Tokenization Ablations

A dictionary size of $K=128$ with 2-second windows provides the optimal trade-off in representation fidelity and transfer, as evidenced by minimal time/spectral $\ell_1$ error and low JS divergence. Discretization yields reusable, placement-invariant primitives necessary for robust sensor–language alignment.
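
The two quantities named here can be made concrete with a small sketch: a mean $\ell_1$ reconstruction error between a signal and its reconstruction, and a JS divergence between the token-usage histograms of two conditions (for example, two sensor placements). The function names and toy inputs are assumptions for illustration, and the base-2 logarithm is chosen so that the divergence lies in $[0, 1]$; the paper's exact normalization is not stated in this summary.

```python
import math
from collections import Counter

def l1_error(signal, reconstruction):
    """Mean absolute difference between a signal and its reconstruction."""
    return sum(abs(a - b) for a, b in zip(signal, reconstruction)) / len(signal)

def token_histogram(tokens, k):
    """Normalized usage frequency of each of the k codebook entries."""
    counts = Counter(tokens)
    return [counts.get(i, 0) / len(tokens) for i in range(k)]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, 0 for identical distributions,
    1 (in bits) for distributions with disjoint support."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy token streams from two hypothetical placements, codebook size 4.
wrist_hist = token_histogram([0, 0, 1, 2, 2, 2], k=4)
ankle_hist = token_histogram([0, 1, 1, 2, 2, 3], k=4)
print("JS divergence:", js_divergence(wrist_hist, ankle_hist))
print("l1 error:", l1_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))  # 0.5
```

A low JS divergence between histograms from different placements or subjects indicates that the same codebook entries are being used in similar proportions, which is the sense of token stability described above.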

Figure 8: Open-vocabulary outputs for segments using all sensors versus minimal input; LLM generation remains semantically aligned, demonstrating graceful degradation under missing-sensor regimes.

Theoretical and Practical Implications

Theoretical Repercussions

Closed-set Labeling Becomes a Downstream Special Case: Once sensor streams are grounded in open-vocabulary narratives, fixed-class HAR is reduced to constraining the natural language output space. The learned representation is inherently compositional and robust to label set expansion, semantic drift, and taxonomy shift.
Discretized, Multi-View Motion Tokens Enable Cross-Modal Generalization: Robust sensor–language alignment is enabled by quantized spectral tokens, not by simple scaling of LLM capacity or classifier width.

Practical Consequences

Deployment Suitability: The framework demonstrates graceful degradation under missing sensors and variable placements, aligning with practical requirements for longitudinal health monitoring, context-aware interfaces, and unscripted field deployments.
Supervision Efficiency and Extensibility: Expert-annotated soft labels yield the most stable alignment, but narrative-based pipelines can, in principle, leverage user-driven or partially automated annotation expansion without redefinition of the model objective.
Model Size vs. Inductive Bias: LLMs in the 7–8B parameter range yield best results; further scaling is not efficient without enhanced cross-modal alignment structures.

Future Directions

Dataset Scaling and Personalization: The open-vocabulary paradigm inherently supports expansion to larger or more demographically diverse datasets, continual learning of personalized or rare activity narratives, and inter-dataset semantic transfer.
Multi-Modal Integration: Extending the pipeline to handle vision, audio, and physiological streams as equivalently tokenized modalities could facilitate richer, context-aware narratives.
Evaluation Beyond Label Recovery: Future benchmarks must move towards multi-label, compositional, concurrent, and temporally overlapping activity narratives, exploiting the flexibility of language supervision.

Conclusion

ActivityNarrated (2604.00767) operationalizes a departure from closed-set wearable HAR towards a semantically open, robust, and extensible narrative framework. By unifying multi-position discrete motion tokenization and LLM-based generation, it outperforms the reported baselines on both open-vocabulary retrieval and closed-set classification, especially under sensor and annotation variability. This approach subsumes classical classification as a downstream task and offers a scalable path towards adaptive, real-world activity understanding.
