Uncertainty-Driven Action Quality Assessment (2207.14513v2)
Abstract: Automatic action quality assessment (AQA) has attracted increasing attention due to its wide applications. However, most existing AQA methods employ deterministic models to predict a single final score for each action, overlooking the subjectivity and diversity among expert judges during the scoring process. In this paper, we propose a novel probabilistic model, named Uncertainty-Driven AQA (UD-AQA), to capture and exploit the diversity among multiple judge scores. Specifically, we design a Conditional Variational Auto-Encoder (CVAE)-based module to encode the uncertainty in expert assessment, where multiple judge scores can be produced by repeatedly sampling latent features from the learned latent space. To further utilize the uncertainty, we estimate the uncertainty of each prediction and use it to re-weight the AQA regression loss, effectively reducing the influence of uncertain samples during training. Moreover, we design an uncertainty-guided training strategy that dynamically adjusts the learning order of the samples from low uncertainty to high uncertainty. Experiments show that our proposed method achieves competitive results on three benchmarks: the Olympic-event datasets MTL-AQA and FineDiving, and the surgical-skill dataset JIGSAWS.
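The three ideas in the abstract can be sketched in a few lines: reparameterized sampling from a CVAE latent space to produce diverse judge scores, an uncertainty-weighted regression loss, and a low-to-high-uncertainty curriculum ordering. The sketch below is a minimal illustration, not the paper's implementation: the linear `decoder`, the specific Kendall-and-Gal-style loss form, and all numeric values are assumptions for demonstration.

```python
import math
import random

random.seed(0)

def sample_scores(mu, logvar, decoder, n_samples=5):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1).
    Each latent sample decodes to one plausible judge score, so repeated
    sampling yields a set of diverse scores for the same action."""
    sigma = [math.exp(0.5 * lv) for lv in logvar]
    scores = []
    for _ in range(n_samples):
        z = [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        scores.append(decoder(z))
    return scores

def uncertainty_weighted_loss(preds, targets, logvars):
    """Aleatoric-style re-weighting (an assumed loss form, not the paper's):
    exp(-logvar) down-weights the squared error of uncertain samples, while
    the +logvar term keeps the model from declaring everything uncertain."""
    terms = [math.exp(-lv) * (p - t) ** 2 + lv
             for p, t, lv in zip(preds, targets, logvars)]
    return sum(terms) / len(terms)

# Toy decoder: a fixed linear map from latent vector to a scalar score
# (a hypothetical stand-in for the paper's learned decoder network).
weights = [0.5, -1.0, 0.3, 0.8]
decoder = lambda z: sum(wi * zi for wi, zi in zip(weights, z))

# Multiple sampled scores for one action model judge diversity.
scores = sample_scores(mu=[0.0] * 4, logvar=[-1.0] * 4, decoder=decoder)

# Uncertainty-guided curriculum (sketch): before each epoch, order training
# samples from low to high estimated uncertainty.
sample_logvars = [0.3, -0.5, 1.2, 0.0]
curriculum_order = sorted(range(len(sample_logvars)),
                          key=lambda i: sample_logvars[i])
```

Note how the `+logvar` term makes the loss trade-off explicit: a sample with high predicted uncertainty contributes a smaller squared-error term but pays a direct penalty, so the uncertainty estimate stays calibrated rather than collapsing to infinity.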