
PRIMUS: Pretraining IMU Encoders with Multimodal Self-Supervision (2411.15127v2)

Published 22 Nov 2024 in cs.LG

Abstract: Sensing human motions through Inertial Measurement Units (IMUs) embedded in personal devices has enabled significant applications in health and wellness. Labeled IMU data is scarce; however, unlabeled or weakly labeled IMU data can be used to model human motions. For video or text modalities, the "pretrain and adapt" approach uses large volumes of unlabeled or weakly labeled data to build a strong feature extractor, followed by adaptation to specific tasks using limited labeled data. However, pretraining methods are poorly understood for IMU data, and pipelines are rarely evaluated on out-of-domain tasks. We propose PRIMUS: a method for PRetraining IMU encoderS that uses a novel pretraining objective, empirically validated through downstream performance on both in-domain and out-of-domain datasets. The PRIMUS objective effectively enhances downstream performance by combining self-supervision, multimodal alignment, and nearest-neighbor supervision. With fewer than 500 labeled samples per class, PRIMUS improves test accuracy by up to 15% compared to state-of-the-art baselines. To benefit the broader community, we have open-sourced our code at github.com/nokia-bell-labs/pretrained-imu-encoders.

Summary

  • The paper presents PRIMUS, a novel pretraining method that fuses self-supervised, multimodal, and nearest-neighbor losses to produce transferable IMU representations, improving few-shot test accuracy by up to 15%.
  • The method strategically combines self-supervision for noise invariance, multimodal alignment with text and video, and nearest-neighbor loss to diversify supervision.
  • Ablation studies confirm the critical role of each component, especially the multimodal loss, demonstrating efficient pretraining with limited labeled and aligned data.

This paper introduces PRIMUS, a novel method for pretraining IMU encoders using a multi-objective representation learning strategy that combines self-supervised (SS), multimodal (MM), and nearest-neighbor (NN) losses. The goal is to learn transferable representations from IMU data that can be adapted to specific tasks with limited labeled data, addressing both the scarcity of labeled IMU data and the under-exploration of pretraining in the IMU domain.

PRIMUS combines three key objectives: a self-supervision loss ($\mathcal{L}_{SS}$) that enforces invariance to noise through data augmentation; a multimodal loss ($\mathcal{L}_{MM}$) that aligns IMU representations with corresponding text and video representations using pretrained video and text encoders; and a nearest-neighbor loss ($\mathcal{L}_{NN}$) that uses the closest examples in representation space as positive pairs to increase the diversity of supervision. The IMU encoder architecture, adopted from IMU2CLIP, consists of convolutional, group-normalization, max-pooling, and GRU layers.
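To make the objective concrete, the following is a minimal PyTorch sketch of a PRIMUS-style pretraining step. The layer widths, kernel sizes, the InfoNCE formulation, and the memory-bank nearest-neighbor lookup are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IMUEncoder(nn.Module):
    """Conv -> GroupNorm -> MaxPool -> GRU, following the IMU2CLIP-style
    backbone described in the paper (layer sizes are assumptions)."""
    def __init__(self, in_channels=6, hidden=128, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=5, padding=2),
            nn.GroupNorm(num_groups=4, num_channels=hidden),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
        )
        self.gru = nn.GRU(hidden, embed_dim, batch_first=True)

    def forward(self, x):                     # x: (batch, channels, time)
        h = self.conv(x)                      # (batch, hidden, time // 2)
        _, h_n = self.gru(h.transpose(1, 2))  # GRU over the time axis
        return F.normalize(h_n[-1], dim=-1)   # unit-norm embedding per window

def info_nce(anchors, positives, temperature=0.07):
    """Standard InfoNCE: row i of `positives` is the positive for anchor i."""
    logits = anchors @ positives.t() / temperature
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)

def primus_loss(encoder, imu, imu_aug, video_emb, text_emb, bank,
                weights=(1.0, 1.0, 1.0)):
    """One pretraining step combining the three PRIMUS objectives.
    `video_emb` / `text_emb` are unit-norm outputs of frozen pretrained
    encoders aligned with the IMU batch; `bank` is a memory bank of past
    IMU embeddings used for nearest-neighbor positives (an assumption
    about how L_NN is realized)."""
    z = encoder(imu)
    z_aug = encoder(imu_aug)                               # augmented view
    l_ss = info_nce(z, z_aug)                              # L_SS: noise invariance
    l_mm = info_nce(z, video_emb) + info_nce(z, text_emb)  # L_MM: multimodal alignment
    nn_idx = (z @ bank.t()).argmax(dim=1)                  # nearest stored embedding
    l_nn = info_nce(z, bank[nn_idx])                       # L_NN: neighbor positives
    return weights[0] * l_ss + weights[1] * l_mm + weights[2] * l_nn
```

Equal loss weights are used here for simplicity; how the three terms are balanced in the paper is a detail best taken from the released code.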

The effectiveness of PRIMUS is evaluated on both in-domain and out-of-domain classification tasks under few-shot learning. PRIMUS significantly enhances downstream performance compared to state-of-the-art multimodal and self-supervised training methods, achieving up to a 15% improvement in test accuracy with fewer than 500 labeled samples per class. Ablation studies confirm the importance of each component of the PRIMUS objective, with $\mathcal{L}_{MM}$ being particularly critical. The paper also examines pretraining data efficiency, showing that PRIMUS achieves comparable performance with significantly less aligned video and text data. The code and pretrained IMU encoders are open-sourced at github.com/nokia-bell-labs/pretrained-imu-encoders.
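For concreteness, a sketch of a few-shot adaptation step follows: the pretrained encoder is frozen and a classifier is fit on a small labeled support set. The linear-probe head and hyperparameters are illustrative assumptions; the paper's exact adaptation setup may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def few_shot_probe(encoder, support_x, support_y, num_classes,
                   epochs=100, lr=1e-2):
    """Fit a linear classifier on frozen features from a few labeled
    windows per class; the pretrained encoder is not updated."""
    encoder.eval()
    with torch.no_grad():
        feats = encoder(support_x)            # frozen IMU embeddings
    probe = nn.Linear(feats.size(1), num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(probe(feats), support_y)
        loss.backward()
        opt.step()
    return probe
```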
