A Survey on Multimodal Wearable Sensor-based Human Action Recognition (2404.15349v1)

Published 14 Apr 2024 in eess.SP, cs.LG, and cs.MM

Abstract: The combination of increased life expectancy and falling birth rates is producing an aging population. Wearable Sensor-based Human Activity Recognition (WSHAR) has emerged as a promising assistive technology to support the daily lives of older individuals, unlocking vast potential for human-centric applications. However, recent surveys in WSHAR have been limited, focusing either solely on deep learning approaches or on a single sensor modality. In real life, humans interact with the world in a multi-sensory way, with diverse information sources intricately processed and interpreted by a complex, unified sensing system. To give machines similar intelligence, multimodal machine learning, which merges data from various sources, has become a popular research area with recent advancements. In this study, we present a comprehensive survey, from a novel perspective, on how to leverage multimodal learning in the WSHAR domain, intended for both newcomers and experienced researchers. We begin by presenting recent sensor modalities as well as deep learning approaches in HAR. Subsequently, we explore the techniques used in current multimodal systems for WSHAR, covering both inter-multimodal systems, which combine sensor modalities from visual and non-visual sources, and intra-multimodal systems, which use only non-visual modalities. We then focus on multimodal learning approaches that have been applied to address existing challenges in WSHAR, making a particular effort to connect the multimodal literature from other domains, such as computer vision and natural language processing, to the WSHAR area. Finally, we identify the corresponding challenges and potential research directions in WSHAR for further improvement.
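
To make the fusion idea in the abstract concrete, the sketch below (not taken from the paper) shows a minimal intra-multimodal, late-fusion classifier in PyTorch: two hypothetical wearable streams, a 3-axis accelerometer window and a 3-axis gyroscope window, are encoded separately and their features are concatenated before classification. All names, window sizes, and the six-class output are illustrative assumptions rather than the survey's own method.

```python
# Illustrative sketch (not from the paper): a minimal intra-multimodal
# late-fusion classifier for wearable-sensor HAR. Two hypothetical streams
# (accelerometer and gyroscope windows) are encoded by separate 1D CNNs and
# their features are concatenated before the final classification layer.
import torch
import torch.nn as nn


class SensorEncoder(nn.Module):
    """Encodes one sensor stream of shape (batch, channels, time)."""

    def __init__(self, in_channels: int, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(32, feat_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # global average pooling over time
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # (batch, feat_dim)


class LateFusionHAR(nn.Module):
    """Concatenates per-modality features and predicts the activity class."""

    def __init__(self, num_classes: int = 6, feat_dim: int = 64):
        super().__init__()
        self.acc_encoder = SensorEncoder(in_channels=3, feat_dim=feat_dim)
        self.gyr_encoder = SensorEncoder(in_channels=3, feat_dim=feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, acc: torch.Tensor, gyr: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.acc_encoder(acc), self.gyr_encoder(gyr)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    # Hypothetical batch: 8 windows of 128 samples from a 3-axis acc and gyro.
    model = LateFusionHAR(num_classes=6)
    acc = torch.randn(8, 3, 128)
    gyr = torch.randn(8, 3, 128)
    logits = model(acc, gyr)
    print(logits.shape)  # torch.Size([8, 6])
```

Early (input-level) or intermediate fusion variants discussed in the survey would instead merge the streams before or inside the encoders; the late-fusion layout here is simply the easiest to sketch.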
