A Survey of IMU Based Cross-Modal Transfer Learning in Human Activity Recognition (2403.15444v1)
Abstract: Despite living in a multi-sensory world, most AI models are limited to textual and visual understanding of human motion and behavior. In fact, full situational awareness of human motion could best be achieved through a combination of sensors. In this survey, we investigate how knowledge can be transferred and utilized among modalities for Human Activity/Action Recognition (HAR), i.e., cross-modal transfer learning. We motivate the potential of IMU data and its applicability to cross-modal learning, as well as the importance of studying the HAR problem. We categorize HAR-related tasks by time scale and abstractness and compare various types of multimodal HAR datasets. We also distinguish and expound on many related but inconsistently used terms in the literature, such as transfer learning, domain adaptation, representation learning, sensor fusion, and multimodal learning, and describe how cross-modal learning fits within these concepts. We then review the literature on IMU-based cross-modal transfer for HAR. The two main approaches to cross-modal transfer are instance-based transfer, where instances of one modality are mapped to another (i.e., knowledge is transferred in the input space), and feature-based transfer, where the model relates the modalities in an intermediate latent space (i.e., knowledge is transferred in the feature space). Finally, we discuss future research directions and applications in cross-modal HAR.
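To make the distinction between the two transfer styles concrete, below is a minimal, self-contained PyTorch sketch. It is not taken from the survey: the encoder architectures, dimensions, and the use of an InfoNCE-style contrastive loss for feature-space alignment (and a simple regression head for input-space translation) are illustrative assumptions, with random tensors standing in for paired IMU/video data.

```python
# Hypothetical sketch of cross-modal transfer for HAR (not the survey's method).
import torch
import torch.nn as nn
import torch.nn.functional as F


class IMUEncoder(nn.Module):
    """Toy 1D-CNN over a (batch, channels=6, time) accel+gyro window."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(6, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, out_dim),
        )

    def forward(self, x):
        return self.net(x)


# Feature-based transfer: align IMU and video features in a shared latent space
# via an InfoNCE loss over paired windows (CLIP-style alignment).
def infonce(z_a, z_b, temperature=0.07):
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0))           # positives on the diagonal
    return F.cross_entropy(logits, targets)


# Instance-based transfer: map one modality into the other's input space,
# e.g. regress "virtual" IMU windows from precomputed video features.
class VideoToIMU(nn.Module):
    def __init__(self, video_dim=512, imu_channels=6, window=100):
        super().__init__()
        self.imu_channels, self.window = imu_channels, window
        self.decoder = nn.Linear(video_dim, imu_channels * window)

    def forward(self, video_feat):
        out = self.decoder(video_feat)
        return out.view(-1, self.imu_channels, self.window)


if __name__ == "__main__":
    imu = torch.randn(8, 6, 100)        # 8 paired IMU windows (stand-in data)
    video_feat = torch.randn(8, 512)    # 8 paired precomputed video features

    # Feature-based: both modalities meet in a 128-d latent space.
    loss_feat = infonce(IMUEncoder()(imu), nn.Linear(512, 128)(video_feat))

    # Instance-based: video features are translated into synthetic IMU windows,
    # which could then train an ordinary IMU-only HAR classifier.
    virtual_imu = VideoToIMU()(video_feat)
    loss_inst = F.mse_loss(virtual_imu, imu)
    print(loss_feat.item(), loss_inst.item())
```

In a real pipeline, the contrastive branch would typically pair IMU windows with features from a pretrained video model, whereas the instance-based branch would generate virtual IMU data to augment scarce wearable-sensor training sets.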
Authors: Abhi Kamboj, Minh Do