Domain-Guided Masked Autoencoders for Unique Player Identification (2403.11328v1)
Abstract: Unique player identification is a fundamental module in vision-driven sports analytics. Identifying players from broadcast videos can aid various downstream tasks such as player assessment, in-game analysis, and broadcast production. However, automatic detection of jersey numbers using deep features is challenging primarily due to: a) motion blur, b) low-resolution video feeds, and c) occlusions. With their recent success in various vision tasks, masked autoencoders (MAEs) have emerged as a superior alternative to conventional feature extractors. However, most MAEs simply zero out image patches at random, or focus on where to mask rather than how to mask. Motivated by human vision, we devise a novel domain-guided masking policy for MAEs, termed d-MAE, to facilitate robust feature extraction in the presence of motion blur for player identification. We further introduce a new spatio-temporal network leveraging our novel d-MAE for unique player identification. We conduct experiments on three large-scale sports datasets: a curated baseball dataset, the SoccerNet dataset, and an in-house ice hockey dataset. We preprocess the datasets using an upgraded keyframe identification (KfID) module that focuses on frames containing jersey numbers. Additionally, we propose a keyframe-fusion technique to augment keyframes while preserving spatial and temporal context. Our spatio-temporal network showcases significant improvements, surpassing the current state of the art by 8.58%, 4.29%, and 1.20% in test-set accuracy, respectively. Rigorous ablations highlight the effectiveness of our domain-guided masking approach and the refined KfID module, which yield performance gains of 1.48% and 1.84%, respectively, over the original architectures.
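The abstract does not spell out the exact d-MAE masking policy, but the core idea of "how to mask" guided by domain knowledge can be illustrated with a toy sketch. The function below scores each image patch by gradient energy (a hypothetical proxy for jersey-number saliency, not the paper's actual criterion) and samples patches to mask with probability inversely related to that score, so that informative, high-texture patches tend to survive masking. All names and the scoring heuristic here are illustrative assumptions.

```python
import numpy as np

def domain_guided_mask(image, patch=4, mask_ratio=0.75, rng=None):
    """Toy domain-guided masking policy (illustrative, not the paper's d-MAE).

    Scores each patch by summed gradient energy, then samples patches to
    mask with probability inversely proportional to the score, so patches
    likely to contain text/edges (e.g. jersey numbers) are kept more often.

    `image` is a 2-D grayscale array whose sides are multiples of `patch`.
    Returns a boolean mask over the patch grid (True = masked).
    """
    rng = np.random.default_rng(rng)
    H, W = image.shape
    gy, gx = np.gradient(image.astype(float))
    energy = np.abs(gx) + np.abs(gy)

    # Sum gradient energy inside each (patch x patch) cell of the grid.
    ph, pw = H // patch, W // patch
    scores = energy.reshape(ph, patch, pw, patch).sum(axis=(1, 3)).ravel()

    # Mask the requested fraction of patches, biased toward low-energy ones.
    n_mask = int(mask_ratio * scores.size)
    probs = 1.0 / (scores + 1e-6)
    probs /= probs.sum()
    masked = rng.choice(scores.size, size=n_mask, replace=False, p=probs)

    mask = np.zeros(scores.size, dtype=bool)
    mask[masked] = True
    return mask.reshape(ph, pw)
```

A random-masking baseline corresponds to uniform `probs`; the domain guidance here is simply the non-uniform sampling distribution derived from image content.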
- Bavesh Balaji
- Jerrin Bright
- Sirisha Rambhatla
- Yuhao Chen
- Alexander Wong
- John Zelek
- David A. Clausi