Weakly Supervised IMU-TAL

Updated 9 February 2026

WS-IMU-TAL is a weakly supervised framework that localizes both action categories and temporal boundaries in untrimmed inertial data using a MIL-based approach.
It adapts techniques from audio, image, and video domains to overcome sparse, sequence-level annotations by employing specialized pooling and proposal strategies.
Empirical evaluations highlight domain-specific strengths and limitations, prompting advances in signal-driven proposal generation and multi-scale temporal reasoning.

Weakly Supervised IMU-based Temporal Action Localization (WS-IMU-TAL) addresses the challenge of predicting both action categories and their temporal boundaries in untrimmed inertial data streams using only sequence-level (bag-level) annotations. By removing the need for dense, frame-level boundary labels, WS-IMU-TAL follows the weak-supervision paradigms established in the audio, image, and video domains, with the aim of making large-scale, fine-grained behavior analysis from IMU data tractable (Li et al., 2 Feb 2026).

1. Formal Problem Formulation and Evaluation Metrics

WS-IMU-TAL is cast as a Multiple-Instance Learning (MIL) problem. Given an untrimmed IMU sequence $X\in\mathbb{R}^{S\times T}$ (where $S$ is the number of sensor channels and $T$ the number of time steps), a set of action classes $\mathcal{C}=\{1,\dots,C\}$, and a multi-hot bag-level label $\mathbf{y}\in\{0,1\}^C$, the objective is to predict, at inference, a set of temporally localized action instances $\hat{\mathcal{Y}}=\{(\hat{c}_i, \hat{s}_i, \hat{e}_i)\}_{i=1}^{\hat{N}}$, where $[\hat{s}_i, \hat{e}_i]\subseteq[1,T]$ and $\hat{c}_i\in\mathcal{C}$. Training is limited to sequence-level labels without any boundary information.

A canonical MIL pipeline decomposes $X$ into $L$ candidate segments $\{x_\ell\}_{\ell=1}^L$, applies a shared instance classifier to produce per-segment scores $p_{\ell, c}$, and aggregates them with a permutation-invariant operator (e.g., max-pool, average-pool, or attention-pool) into bag-level predictions $P_c = \mathrm{Agg}\{p_{\ell, c}\}_{\ell=1}^{L}$. The basic MIL objective applies a multi-label cross-entropy to the aggregated predictions. More sophisticated objectives, such as CDur's duration-robust formulation, combine linear-softmax pooling with regularization to mitigate fragmentation and enforce plausible action durations.
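The aggregation step can be illustrated with a minimal sketch in plain Python. It is not the benchmark's implementation: the function names, the toy scores, and the attention logits are assumptions made for illustration, and linear-softmax pooling is written in the form commonly used in the sound event detection literature (each score weighted by itself).

```python
import math

def max_pool(scores):
    # Bag score is the single most confident segment.
    return max(scores)

def average_pool(scores):
    # Every segment contributes equally.
    return sum(scores) / len(scores)

def linear_softmax_pool(scores, eps=1e-8):
    # Each score is weighted by itself: sum(p^2) / sum(p).
    return sum(p * p for p in scores) / (sum(scores) + eps)

def attention_pool(scores, attention_logits):
    # Softmax-normalized attention weights over segments (logits are hypothetical inputs).
    m = max(attention_logits)
    exps = [math.exp(a - m) for a in attention_logits]
    z = sum(exps)
    return sum(e / z * p for e, p in zip(exps, scores))

def bag_loss(bag_probs, bag_labels, eps=1e-7):
    # Multi-label binary cross-entropy, averaged over classes.
    total = 0.0
    for p, y in zip(bag_probs, bag_labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(bag_probs)

# Toy per-segment scores p[l][c] for L = 4 segments and C = 2 classes (made-up values).
segment_scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.1], [0.2, 0.3]]
per_class = list(zip(*segment_scores))  # per_class[c] holds the scores of class c
bag_probs = [linear_softmax_pool(s) for s in per_class]
loss = bag_loss(bag_probs, [1, 0])  # bag label: class 0 present, class 1 absent
```

For non-negative scores, linear-softmax pooling lies between average and max pooling, which is one intuition for why it tolerates longer activations than pure max pooling.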

At inference, segment scores are thresholded and temporal non-maximum suppression (NMS) is applied to produce the final temporally localized segments.
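A minimal sketch of this inference step, assuming per-time-step scores for a single class and half-open segments `(start, end)`. The helper names, the use of the mean score as segment confidence, and the threshold values are illustrative assumptions, not details taken from the benchmark.

```python
def scores_to_segments(scores, threshold):
    """Group consecutive steps with score >= threshold into (start, end, score).

    Segments are half-open [start, end); the segment score is the mean of the run.
    """
    segments, start = [], None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t
        elif s < threshold and start is not None:
            run = scores[start:t]
            segments.append((start, t, sum(run) / len(run)))
            start = None
    if start is not None:  # close a run that reaches the end of the sequence
        run = scores[start:]
        segments.append((start, len(scores), sum(run) / len(run)))
    return segments

def temporal_iou(a, b):
    # Intersection over union of two 1D intervals given as (start, end, ...).
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, iou_threshold):
    # Greedy NMS: visit segments by descending score and keep a segment
    # only if it does not overlap too strongly with any segment already kept.
    kept = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(temporal_iou(seg, k) <= iou_threshold for k in kept):
            kept.append(seg)
    return kept

frame_scores = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1]
candidates = scores_to_segments(frame_scores, threshold=0.5)
detections = temporal_nms(candidates, iou_threshold=0.5)
```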

Evaluation is performed at two levels:

Frame-level labeling: Per time-step predictions are compared with ground truth using class-wise precision, recall, F1, and disaggregation of time-alignment errors (deletion, underfill, fragmentation, insertion, overfill, merge).
Segment-level detection: Detected segments are matched if their temporal Intersection over Union (tIoU) with ground truth exceeds a threshold $\tau$; mean Average Precision (mAP) is reported across thresholds $\tau\in\{0.3,0.4,0.5,0.6,0.7\}$.
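The segment-level metric can be sketched as follows. This is an illustrative approximation rather than the benchmark's evaluation code: it assumes one-to-one greedy matching in descending score order, treats tIoU equal to the threshold as a match, and uses a non-interpolated average precision; all function names are hypothetical.

```python
def temporal_iou(pred, gt):
    # Intersection over union of two 1D intervals (start, end).
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(predictions, ground_truth, tiou_threshold):
    """AP for one class.

    predictions: list of (start, end, score); ground_truth: list of (start, end).
    """
    if not ground_truth:
        return 0.0
    matched = set()
    tp_flags = []
    for start, end, _ in sorted(predictions, key=lambda p: p[2], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue  # each ground-truth instance can be matched only once
            iou = temporal_iou((start, end), gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        is_tp = best_idx is not None and best_iou >= tiou_threshold
        if is_tp:
            matched.add(best_idx)
        tp_flags.append(is_tp)
    # Non-interpolated AP: precision at each true-positive rank, averaged over all GT.
    tp, precision_sum = 0, 0.0
    for rank, is_tp in enumerate(tp_flags, start=1):
        if is_tp:
            tp += 1
            precision_sum += tp / rank
    return precision_sum / len(ground_truth)

def mean_ap(preds_by_class, gts_by_class, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    # Average AP over classes at each threshold, then average over thresholds.
    per_threshold = []
    for tau in thresholds:
        aps = [average_precision(preds_by_class.get(c, []), gts, tau)
               for c, gts in gts_by_class.items()]
        per_threshold.append(sum(aps) / len(aps))
    return sum(per_threshold) / len(per_threshold)

gts = [(0, 10), (20, 30)]
preds = [(0, 10, 0.9), (50, 60, 0.8), (21, 30, 0.7)]
ap_at_half = average_precision(preds, gts, 0.5)
```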

2. Methodological Approaches: Adaptations from Other Domains

WS-IMU-TAL methods are primarily adaptations from three established weakly supervised localization paradigms: audio weakly supervised sound event detection (WSSED), image weakly supervised object detection (WSOD), and video weakly supervised temporal action localization (WSTAL). The methods differ mainly in how they define instances, how they aggregate scores, and which auxiliary modules they add.

Audio-Derived (WSSED) Approaches:

DCASE Baseline: Employs a 1D CRNN with Bi-GRU layers for temporal modeling; attention weights serve both MIL pooling and localization saliency. IMU axes are analogous to audio channels.
CDur: Introduces a duration regularizer and linear-softmax pooling to suppress over-fragmentation and promote sustained activations.

Image-Derived (WSOD) Approaches:

WSDDN: Dual-stream network where a 1D proposal generator samples temporal segments. Parallel heads compute classification and localization weights; their product is matched to the bag label.
OICR: Refines instance assignments by pseudo-labeling high-scoring proposals and iteratively improving temporal precision.
PCL: Clusters proposals by feature similarity, generating cluster-wise pseudo-GT to stabilize MIL training under noisy boundary conditions.
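The two-stream scoring used by WSDDN can be sketched as below, following the standard formulation from the object detection literature: the classification stream is normalized over classes, the localization stream over proposals, and the summed product gives the bag-level score. How the IMU adaptation produces its logits is not specified here, so the inputs and function names are assumptions.

```python
import math

def softmax(values):
    # Numerically stable softmax over a list of floats.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def wsddn_bag_scores(cls_logits, loc_logits):
    """Both inputs have shape [num_proposals][num_classes].

    Returns (proposal_scores, bag_scores).
    """
    num_props, num_classes = len(cls_logits), len(cls_logits[0])
    # Classification stream: softmax over classes, per proposal.
    cls_probs = [softmax(row) for row in cls_logits]
    # Localization stream: softmax over proposals, per class.
    loc_weights = [softmax([loc_logits[r][c] for r in range(num_props)])
                   for c in range(num_classes)]
    # Element-wise product gives a per-proposal, per-class score.
    proposal_scores = [[cls_probs[r][c] * loc_weights[c][r]
                        for c in range(num_classes)]
                       for r in range(num_props)]
    # Summing over proposals yields bag-level scores in (0, 1).
    bag_scores = [sum(proposal_scores[r][c] for r in range(num_props))
                  for c in range(num_classes)]
    return proposal_scores, bag_scores

# Two temporal proposals, two classes (made-up logits).
proposal_scores, bag_scores = wsddn_bag_scores(
    cls_logits=[[5.0, 0.0], [0.0, 5.0]],
    loc_logits=[[5.0, 0.0], [0.0, 5.0]],
)
```

The bag scores are compared with the multi-hot bag label during training, while the per-proposal scores serve as localization evidence at inference.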

Video-Derived (WSTAL) Approaches:

CoLA: Uses snippet-level contrastive loss to encourage intra-action coherence in feature space, with top-k or attention pooling for MIL.
RSKP: Applies memory bank-based pseudo-labeling and temporal affinity propagation, diffusing initial predictions to fill gaps and minimize fragmentation.

3. Benchmarked Datasets, Experimental Protocol, and Ablations

WS-IMUBench evaluates these methods on seven public IMU-based human activity datasets:

Dataset	Subjects	Classes	Axes	Median Duration (s)	Scenario
SBHAR	30	12	3	12.76	Locomotion transitions
Opportunity	4	17	113	2.60	Daily-life actions
WetLab	22	8	3	11.23	Laboratory tasks
Hang-Time	24	5	3	1.80	Sports
RWHAR	15	8	21	621.16	Long-duration activities
WEAR	18	18	12	43.12	Multi-sensor activities
XRFV2	16	30	36	7.00	Diverse events

Protocols include leave-one-subject-out (LOSO) cross-validation for most datasets; XRFV2 uses both LOSO and official in-domain splits. Encoders are pretrained on fixed-length, overlapping clips and are then either fine-tuned or frozen during the weak supervision phase. Models are evaluated in both full_input (end-to-end sequence) and window_input (segmented windows merged by NMS) modes, and each method is run with three random seeds for reproducibility.
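The two data-handling steps in this protocol, LOSO splitting and overlapping fixed-length windowing, can be sketched as below. The window length, overlap ratio, and subject identifiers are placeholder assumptions; the actual values used per dataset are not stated here.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one split per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

def sliding_windows(num_steps, window, overlap):
    """Return half-open (start, end) index pairs of fixed-length windows.

    overlap is a fraction in [0, 1); trailing samples that do not fill
    a complete window are dropped.
    """
    if num_steps < window:
        return []
    stride = max(1, int(round(window * (1.0 - overlap))))
    return [(s, s + window) for s in range(0, num_steps - window + 1, stride)]

# One recording per list entry, tagged with a hypothetical subject identifier.
subjects = ["s1", "s1", "s2", "s3"]
splits = list(loso_splits(subjects))
windows = sliding_windows(num_steps=10, window=4, overlap=0.5)
```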

4. Empirical Findings and Failure Modes

Quantitative analysis reveals strong modality-dependent transfer effects. Temporal-domain (audio/video-derived) methods consistently surpass image-derived proposal approaches in stability and localization accuracy on IMU data.

On datasets with long actions and high sensor dimensionality (RWHAR, WEAR), audio/video-derived models (DCASE: up to 65.07% mAP; RSKP: up to 63.29% mAP) nearly close the gap to fully supervised methods (TriDet: 70.85%, 75.37% mAP).
On datasets with short actions and low-dimensional sensing (Opportunity, SBHAR, Hang-Time), all weakly supervised methods have mAP below 10%; supervised models far exceed these baselines (e.g., 94% mAP on SBHAR).
Image-derived WSOD adaptations (WSDDN, OICR, PCL) underperform (always below 20% mAP), limited by the inadequacy of random proposal generation and lack of content-aware temporal segmentation.

Failures are predominantly attributable to:

Short actions (<5 s): yield low snippet-level confidence, high fragmentation, and poor recall.
Temporal ambiguity and fragmentation: MIL pooling rewards only peak activations, missing action boundaries (high underfill, fragmentation).
Proposal quality: Content-agnostic random sampling in WSOD domains restricts recall, regardless of classification accuracy.

Ablations across input modes (full_input versus window_input) indicate that the DCASE and CDur architectures benefit from global context, whereas the image-derived approaches (WSDDN, OICR, PCL) degrade on longer sequences due to signal dilution. RSKP is robust in both full and windowed inference, which is attributed to its affinity propagation.

5. Principles of Transferability and Effectiveness

Three key research questions guided empirical insights:

RQ1—Transferability: Core MIL pooling and snippet-level scoring transfer across domains, provided sequential inductive bias is retained. Temporal methods aligned with IMU stream structure outperform image proposal variants lacking 1D temporal awareness.
RQ2—Effectiveness: Weak supervision is competitive (within 10–20 percentage points of full supervision) on datasets with long, unambiguous actions and high-dimensional sensors, as MIL pooling with attention or affinity propagation enables accurate boundary learning.
RQ3—Failure Modes: Weak supervision fundamentally struggles with short, dynamic, ambiguous actions and where proposal granularity is mismatched to action scale.

This suggests that future improvements require domain-specific handling of proposal generation, temporal boundaries, and sequential coherence in feature space.

6. Recommendations and Future Directions

The recommended directions toward scalable WS-IMU-TAL include:

IMU-Specific Proposal Generation: Replace content-agnostic, random windowing with signal-driven segmentation (e.g., unsupervised change-point detection, energy-based edge detection) to improve temporal proposal relevance.
Boundary-Aware Objectives: Introduce auxiliary predictors for onset/offset detection; leverage masked autoencoding with boundary-preserving masking.
Advanced Temporal Reasoning: Incorporate multi-scale temporal transformers, hierarchical temporal convolutional networks, and memory-augmented modules for joint fine/coarse motion modeling and iterative refinement.
Unified IMU Foundation Models: Develop multi-dataset self-supervised IMU backbones, harmonize sensor metadata/coordinate frames, and deploy lightweight domain adapters to support cross-device generalization.
Hybrid Supervision: Combine weak and semi-supervised paradigms, leveraging sparse boundary annotations, timestamp anchors, or point-level clicks to mitigate underfill and fragmentation.
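As an illustration of the first recommendation, the sketch below generates proposals from short-time signal energy instead of random windows. It is a hypothetical example, not a method evaluated in the benchmark: it assumes gravity-removed (or mean-removed) samples so that rest periods have near-zero energy, and the threshold rule (a multiple of the mean energy) is an arbitrary choice.

```python
def short_time_energy(samples, window):
    """samples: list of per-time-step tuples of channel values.

    Returns the trailing moving average of the squared magnitude.
    """
    power = [sum(v * v for v in sample) for sample in samples]
    energy, running = [], 0.0
    for t, p in enumerate(power):
        running += p
        if t >= window:
            running -= power[t - window]  # drop the sample leaving the window
        energy.append(running / min(t + 1, window))
    return energy

def energy_proposals(samples, window, ratio=1.0, min_length=1):
    """Return half-open (start, end) spans whose energy exceeds ratio * mean energy."""
    energy = short_time_energy(samples, window)
    threshold = ratio * sum(energy) / len(energy)
    proposals, start = [], None
    # A sentinel below any threshold closes a run that reaches the sequence end.
    for t, e in enumerate(energy + [float("-inf")]):
        if e > threshold and start is None:
            start = t
        elif e <= threshold and start is not None:
            if t - start >= min_length:
                proposals.append((start, t))
            start = None
    return proposals

# Toy 3-axis signal: rest, a burst of motion, rest (made-up values).
signal = [(0.0, 0.0, 0.0)] * 5 + [(1.0, 1.0, 1.0)] * 5 + [(0.0, 0.0, 0.0)] * 5
proposals = energy_proposals(signal, window=1)
```

Proposals produced this way could replace the random temporal windows fed to proposal-based methods such as WSDDN, OICR, or PCL; whether that closes the reported gap is an open question.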

The establishment of WS-IMUBench—with its standardized datasets, protocols, and diagnostic tools—demonstrates that weakly supervised IMU-TAL is feasible, while also clarifying the challenges that remain before practical deployment at scale is realized (Li et al., 2 Feb 2026).


References (1)

WS-IMUBench: Can Weakly Supervised Methods from Audio, Image, and Video Be Adapted for IMU-based Temporal Action Localization? (2026)
