C3T: Cross-modal Transfer Through Time for Human Action Recognition (2407.16803v2)

Published 23 Jul 2024 in cs.CV, cs.AI, cs.HC, cs.LG, and eess.SP

Abstract: In order to unlock the potential of diverse sensors, we investigate a method to transfer knowledge between modalities using the structure of a unified multimodal representation space for Human Action Recognition (HAR). We formalize and explore an understudied cross-modal transfer setting we term Unsupervised Modality Adaptation (UMA), where the modality used in testing is not used in supervised training, i.e., no labeled instances of the test modality are available during training. We develop three methods to perform UMA: Student-Teacher (ST), Contrastive Alignment (CA), and Cross-modal Transfer Through Time (C3T). Our extensive experiments on various camera+IMU datasets compare these methods to each other in the UMA setting, and to their empirical upper bound in the supervised setting. The results indicate that C3T is the most robust and highest-performing method, outperforming the alternatives by a margin of at least 8%, and it approaches supervised-setting performance even in the presence of temporal noise. This method introduces a novel mechanism for aligning signals across time-varying latent vectors, extracted from the receptive field of temporal convolutions. Our findings suggest that C3T has significant potential for developing generalizable models for time-series sensor data, opening new avenues for multi-modal learning in various applications.
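
To make the UMA setup concrete, below is a minimal PyTorch sketch of how the abstract's ingredients could fit together: two temporal-convolution encoders produce a latent vector per time step, a contrastive loss aligns the time-synchronized camera and IMU latents (the C3T-style mechanism over the temporal receptive field), the classifier is trained on the labeled modality only, and the never-labeled test modality reaches that classifier through the shared space. Everything here (encoder depth, latent size, InfoNCE over flattened sample-time pairs, mean-pooling, the names TemporalEncoder and alignment_loss) is an illustrative assumption, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalEncoder(nn.Module):
    """1D temporal-conv encoder that keeps one latent vector per time step."""

    def __init__(self, in_channels: int, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> (batch, time, dim)
        return self.net(x).transpose(1, 2)


def alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE over time steps: latents from the same sample and
    time index are positives; all other (sample, time) latents are negatives."""
    b, t, d = z_a.shape
    a = F.normalize(z_a.reshape(b * t, d), dim=-1)
    c = F.normalize(z_b.reshape(b * t, d), dim=-1)
    logits = a @ c.t() / tau
    target = torch.arange(b * t, device=logits.device)
    return 0.5 * (F.cross_entropy(logits, target)
                  + F.cross_entropy(logits.t(), target))


if __name__ == "__main__":
    num_actions = 27                        # e.g., UTD-MHAD has 27 actions
    video_enc = TemporalEncoder(512)        # assumes per-frame visual features
    imu_enc = TemporalEncoder(6)            # 3-axis accel + 3-axis gyro
    classifier = nn.Linear(128, num_actions)

    video = torch.randn(8, 512, 64)         # dummy time-synchronized pair
    imu = torch.randn(8, 6, 64)
    labels = torch.randint(0, num_actions, (8,))  # labels for video only

    z_v, z_i = video_enc(video), imu_enc(imu)
    loss = alignment_loss(z_v, z_i)                                  # align modalities
    loss = loss + F.cross_entropy(classifier(z_v.mean(1)), labels)   # supervise on video
    loss.backward()

    # Test time (UMA): classify the never-labeled IMU modality through the
    # shared latent space and the video-trained classifier.
    with torch.no_grad():
        pred = classifier(imu_enc(imu).mean(1)).argmax(-1)
```

In this reading, the per-time-step alignment is what distinguishes a C3T-style approach from plain contrastive alignment over whole clips: positives are matched at each temporal position, so the shared space preserves temporal structure rather than a single clip-level summary.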

Authors (3)
  1. Abhi Kamboj (6 papers)
  2. Anh Duy Nguyen (6 papers)
  3. Minh Do (13 papers)