A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection (2405.08204v1)

Published 13 May 2024 in cs.CV

Abstract: This paper presents a novel spatiotemporal transformer network that introduces several original components for detecting actions in untrimmed videos. First, the multi-feature selective semantic attention model calculates correlations between spatial and motion features to properly model spatiotemporal interactions among different action semantics. Second, the motion-aware network encodes the locations of action semantics in video frames using a motion-aware 2D positional encoding algorithm. This motion-aware mechanism memorizes the dynamic spatiotemporal variations in action frames that current methods cannot exploit. Third, the sequence-based temporal attention model captures the heterogeneous temporal dependencies in action frames. In contrast to standard temporal attention used in natural language processing, which primarily finds similarities between linguistic words, the proposed sequence-based temporal attention is designed to determine both the differences and similarities between video frames that jointly define the meaning of actions. The proposed approach outperforms state-of-the-art solutions on four spatiotemporal action datasets: AVA 2.2, AVA 2.1, UCF101-24, and EPIC-Kitchens.
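
The abstract describes these components only at a high level, so the sketches below are purely illustrative. First, a minimal sketch of how a temporal attention layer could weigh both the differences and similarities between frame features, in the spirit of the sequence-based temporal attention described above. The module name, the learnable gate, and the distance-based difference term are assumptions made for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityDifferenceAttention(nn.Module):
    """Illustrative sketch only (not the paper's implementation): temporal
    attention whose logits mix a similarity cue (scaled dot product) with a
    difference cue (pairwise L2 distance), so frames that differ strongly
    from one another can also receive attention mass."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Learnable gate balancing similarity vs. difference cues (assumed).
        self.alpha = nn.Parameter(torch.tensor(0.0))
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim) frame-level features
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Similarity logits: high where frames look alike.
        sim = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        # Difference logits: high where frames differ (pairwise L2 distance).
        diff = torch.cdist(q, k, p=2) * self.scale
        gate = torch.sigmoid(self.alpha)  # 0.5 at initialization
        attn = F.softmax(gate * sim + (1.0 - gate) * diff, dim=-1)
        return torch.matmul(attn, v)

# Example: 2 clips of 16 frames with 256-d features
frames = torch.randn(2, 16, 256)
out = SimilarityDifferenceAttention(dim=256)(frames)  # (2, 16, 256)
```

The gate lets training decide how much each cue matters; initializing it at zero starts the two cues at equal weight. Likewise, one plausible reading of a motion-aware 2D positional encoding is a standard 2D sinusoidal encoding amplified where optical-flow magnitude is large. The modulation scheme below is an assumed design choice, not the paper's algorithm:

```python
import math
import torch

def motion_aware_pos_encoding(flow: torch.Tensor, dim: int) -> torch.Tensor:
    """Illustrative sketch only (the modulation scheme is assumed): a standard
    2D sinusoidal positional encoding scaled up where optical-flow magnitude
    is large, so the positions of moving content carry a stronger signal.

    flow: (H, W, 2) optical-flow field; dim: channel count, divisible by 4.
    Returns an (H, W, dim) encoding.
    """
    h, w, _ = flow.shape
    d = dim // 4
    freqs = torch.exp(-torch.arange(d, dtype=torch.float32) * math.log(10000.0) / d)
    ys = torch.arange(h, dtype=torch.float32)[:, None] * freqs  # (H, d)
    xs = torch.arange(w, dtype=torch.float32)[:, None] * freqs  # (W, d)
    pe_y = torch.cat([ys.sin(), ys.cos()], dim=-1)              # (H, 2d)
    pe_x = torch.cat([xs.sin(), xs.cos()], dim=-1)              # (W, 2d)
    pe = torch.cat([pe_y[:, None, :].expand(h, w, 2 * d),
                    pe_x[None, :, :].expand(h, w, 2 * d)], dim=-1)
    mag = flow.norm(dim=-1, keepdim=True)
    mag = mag / (mag.max() + 1e-6)       # normalize magnitude to [0, 1]
    return pe * (1.0 + mag)              # boost encoding at moving locations

# Example: encode a 32x32 flow field into 64 channels
pe = motion_aware_pos_encoding(torch.randn(32, 32, 2), dim=64)  # (32, 32, 64)
```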

Authors (3)
  1. Matthew Korban (4 papers)
  2. Peter Youngs (2 papers)
  3. Scott T. Acton (18 papers)
Citations (2)
