Language alignment of sensor embeddings without prompt or caption overfitting

Establish methods to align sensor embeddings from sensor-based Human Activity Recognition (HAR) models with natural-language representations while avoiding overfitting to prompt templates or captions, enabling semantic grounding that generalizes across tasks and datasets.

Background

The survey highlights that language grounding is crucial for interpretability and zero-/few-shot transfer in HAR foundation models. However, aligning continuous sensor embeddings to a discrete language space often leads to overfitting on the specific prompt templates or the limited set of captions seen during training, which reduces generalization to unseen tasks and datasets. The authors call for alignment approaches that provide semantic grounding without sacrificing robustness.
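One common mitigation in CLIP-style alignment pipelines is prompt ensembling: instead of tying each activity class to a single caption, the text embedding is averaged over many prompt templates, so the sensor encoder is aligned to a class-level semantic direction rather than to any one phrasing. The sketch below illustrates this idea under stated assumptions; it is not the survey's method, and `toy_embed`, the template strings, and the embedding dimension are all hypothetical stand-ins for a real text encoder.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere for cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def ensemble_text_embedding(embed_fn, templates, label):
    """Average text embeddings over many prompt templates (prompt ensembling).

    Averaging before normalizing yields one class direction that is less
    sensitive to any single template's wording.
    """
    vecs = np.stack([embed_fn(t.format(label)) for t in templates])
    return l2_normalize(vecs.mean(axis=0))

def zero_shot_logits(sensor_emb, class_embs, temperature=0.07):
    """Cosine-similarity logits between one sensor embedding and class embeddings."""
    s = l2_normalize(sensor_emb)
    return (s @ class_embs.T) / temperature

# Hypothetical deterministic text encoder standing in for a real language model.
def toy_embed(text, dim=64):
    rng = np.random.default_rng(sum(map(ord, text)))
    return rng.normal(size=dim)

templates = ["a person {}", "accelerometer signal of someone {}"]
classes = ["walking", "running"]
class_embs = np.stack(
    [ensemble_text_embedding(toy_embed, templates, c) for c in classes]
)
logits = zero_shot_logits(toy_embed("walking sample"), class_embs)
```

In a full pipeline, the same ensembled class embeddings would serve as targets for a contrastive (InfoNCE-style) loss during training and as the zero-shot classifier at test time.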

References

Key open questions include how to encode routines and transitions in a form that remains stable across users and devices, and how to align sensor embeddings with language without overfitting to prompts or captions.

Foundation Models Defining A New Era In Sensor-based Human Activity Recognition: A Survey And Outlook (arXiv:2604.02711, Bian et al., 3 Apr 2026), Subsection 6.2, Representation Limits: Temporal Hierarchy and Semantic Abstraction