Multimodal Action Quality Assessment (2402.09444v3)

Published 31 Jan 2024 in eess.SP, cs.AI, and cs.CV

Abstract: Action quality assessment (AQA) aims to evaluate how well an action is performed. Previous works model AQA using visual information alone, ignoring audio. We argue that although AQA depends heavily on visual information, audio provides useful complementary cues for improving score regression accuracy, especially in sports with background music such as figure skating and rhythmic gymnastics. To leverage multimodal information for AQA, i.e., RGB, optical flow, and audio, we propose a Progressive Adaptive Multimodal Fusion Network (PAMFN) that separately models modality-specific information and mixed-modality information. The model consists of three modality-specific branches that independently explore modality-specific information and a mixed-modality branch that progressively aggregates the information from the modality-specific branches. To bridge the modality-specific branches and the mixed-modality branch, three novel modules are proposed. First, a Modality-specific Feature Decoder module selectively transfers modality-specific information to the mixed-modality branch. Second, when exploring the interaction between modality-specific information, we argue that a fixed multimodal fusion policy may yield suboptimal results because it ignores the potential diversity across different parts of an action. Therefore, an Adaptive Fusion Module is proposed to learn adaptive multimodal fusion policies for different parts of an action. This module consists of several FusionNets that explore different multimodal fusion strategies and a PolicyNet that decides which FusionNets are enabled. Third, a Cross-modal Feature Decoder module transfers the cross-modal features generated by the Adaptive Fusion Module to the mixed-modality branch.
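
To make the fusion mechanism concrete, below is a minimal PyTorch sketch in the spirit of the Adaptive Fusion Module described above: several FusionNets propose different ways of mixing RGB, optical-flow, and audio features, and a PolicyNet decides which FusionNet is enabled for each temporal segment via a Gumbel-softmax gate. All module names, layer sizes, and the hard one-FusionNet-per-segment gating are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions throughout): K candidate FusionNets mix RGB,
# optical-flow, and audio features; a PolicyNet picks one per temporal segment
# using a hard Gumbel-softmax so the discrete choice stays differentiable.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionNet(nn.Module):
    """One candidate fusion strategy: concatenate the three modalities, then project."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, rgb, flow, audio):
        return self.proj(torch.cat([rgb, flow, audio], dim=-1))


class AdaptiveFusion(nn.Module):
    """Gates K FusionNets with a PolicyNet (hypothetical sizes and gating)."""
    def __init__(self, dim: int, num_fusion_nets: int = 4, tau: float = 1.0):
        super().__init__()
        self.fusion_nets = nn.ModuleList([FusionNet(dim) for _ in range(num_fusion_nets)])
        self.policy_net = nn.Linear(3 * dim, num_fusion_nets)
        self.tau = tau

    def forward(self, rgb, flow, audio):
        # Per-segment policy logits from the concatenated modality features: (B, T, K).
        logits = self.policy_net(torch.cat([rgb, flow, audio], dim=-1))
        gates = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        # Candidate fused features from every FusionNet: (B, T, K, D).
        candidates = torch.stack([f(rgb, flow, audio) for f in self.fusion_nets], dim=-2)
        # Keep only the selected FusionNet's output for each segment: (B, T, D).
        return (gates.unsqueeze(-1) * candidates).sum(dim=-2)


if __name__ == "__main__":
    B, T, D = 2, 8, 256  # batch size, temporal segments, feature dim (all assumed)
    rgb, flow, audio = torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, D)
    fused = AdaptiveFusion(D)(rgb, flow, audio)
    print(fused.shape)  # torch.Size([2, 8, 256])
```

The hard Gumbel-softmax keeps the discrete "which FusionNet is enabled" decision differentiable during training; the paper's PolicyNet may instead enable several FusionNets at once, so this one-hot selection is a simplification.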
