
MMTSA: Multimodal Temporal Segment Attention Network for Efficient Human Activity Recognition (2210.09222v2)

Published 14 Oct 2022 in cs.CV and cs.LG

Abstract: Multimodal sensors provide complementary information to develop accurate machine-learning methods for human activity recognition (HAR), but they introduce significantly higher computational load, which reduces efficiency. This paper proposes the Multimodal Temporal Segment Attention Network (MMTSA), an efficient multimodal neural architecture for HAR using an RGB camera and inertial measurement units (IMUs). MMTSA first transforms IMU sensor data into a temporal- and structure-preserving gray-scale image using the Gramian Angular Field (GAF), representing the inherent properties of human activities. MMTSA then applies a multimodal sparse sampling method to reduce data redundancy. Lastly, MMTSA adopts an inter-segment attention module for efficient multimodal fusion. Using three well-established public datasets, we evaluated MMTSA's effectiveness and efficiency in HAR. Results show that our method achieves a performance improvement of 11.13% in cross-subject F1-score on the MMAct dataset over the previous state-of-the-art (SOTA) methods. The ablation study and analysis demonstrate MMTSA's effectiveness in fusing multimodal data for accurate HAR. The efficiency evaluation on an edge device showed that MMTSA achieved significantly better accuracy, lower computational load, and lower inference latency than SOTA methods.
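The GAF step mentioned in the abstract can be illustrated compactly. Below is a minimal Python sketch of the standard Gramian Angular Summation Field encoding (Wang and Oates, 2015), assuming per-axis encoding of a fixed-length IMU window; the function name, window size, and per-channel stacking are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np

def gramian_angular_field(series: np.ndarray) -> np.ndarray:
    """Encode a 1-D time series as a Gramian Angular Summation Field image."""
    # Rescale the series to [-1, 1] so that arccos is well defined.
    lo, hi = series.min(), series.max()
    x = (2.0 * series - hi - lo) / (hi - lo + 1e-8)
    x = np.clip(x, -1.0, 1.0)
    # Polar encoding: each sample becomes an angle.
    phi = np.arccos(x)
    # GASF entry (i, j) = cos(phi_i + phi_j), preserving temporal correlations.
    return np.cos(phi[:, None] + phi[None, :])

# Toy example: one 3-axis accelerometer window -> per-axis gray-scale GAF images.
window = np.random.randn(128, 3)   # 128 samples, 3 IMU channels (synthetic data)
gaf = np.stack([gramian_angular_field(window[:, c]) for c in range(3)], axis=0)
print(gaf.shape)                   # (3, 128, 128)
```

Each channel of `gaf` is a T x T gray-scale image whose (i, j) entry encodes the angular correlation between time steps i and j, which is what allows a 2-D image backbone to consume IMU data alongside sampled RGB frames.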
