Unifying Global and Local Scene Entities Modelling for Precise Action Spotting (2404.09951v1)

Published 15 Apr 2024 in cs.CV

Abstract: Sports videos pose complex challenges, including cluttered backgrounds, camera angle changes, small action-representing objects, and imbalanced action class distributions. Existing methods for detecting actions in sports videos rely heavily on global features, using a backbone network as a black box that encodes the entire spatial frame. These approaches tend to overlook the nuances of the scene and struggle to detect actions that occupy only a small portion of the frame; in particular, they have difficulty with action classes involving small objects, such as balls or yellow/red cards in soccer, which occupy only a fraction of the screen. To address these challenges, we introduce a novel approach that analyzes and models scene entities using an adaptive attention mechanism. Specifically, our model disentangles the scene content into a global environment feature and a local relevant scene entities feature. To extract environmental features efficiently while capturing temporal information at low computational cost, we propose a 2D backbone network with a time-shift mechanism. To accurately capture the relevant scene entities, we employ a Vision-Language Model in conjunction with the adaptive attention mechanism. Our model has demonstrated outstanding performance, securing 1st place in the SoccerNet-v2 Action Spotting, FineDiving, and FineGym challenges with substantial improvements of 1.6, 2.0, and 1.3 points in avg-mAP over the runner-up methods. Furthermore, our approach offers interpretability, in contrast to other deep learning models, which are often designed as black boxes. Our code and models are released at: https://github.com/Fsoft-AIC/unifying-global-local-feature.
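The abstract names two key components: a time-shift mechanism that lets a cheap 2D backbone mix temporal context across frames, and an adaptive attention mechanism that fuses the global environment feature with local scene-entity features (e.g. region embeddings from a Vision-Language Model). The authors' actual implementation is in the linked repository; the following is only a minimal PyTorch sketch of how those two pieces typically look. All names (`temporal_shift`, `AdaptiveEntityAttention`, `shift_div`) and tensor shapes are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """TSM-style time shift (illustrative): move a slice of channels one
    step forward/backward in time so a 2D backbone sees temporal context.
    x: (batch, time, channels, height, width)
    """
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels unchanged
    return out

class AdaptiveEntityAttention(nn.Module):
    """Hypothetical cross-attention: the global environment feature queries
    the local scene-entity features and keeps only the relevant ones."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_feat: torch.Tensor, entity_feats: torch.Tensor):
        # global_feat: (batch, 1, dim); entity_feats: (batch, n_entities, dim)
        attended, weights = self.attn(global_feat, entity_feats, entity_feats)
        fused = self.norm(global_feat + attended)  # residual fusion of global + local
        return fused, weights                      # weights expose which entities mattered

# Toy usage with random features:
frames = torch.randn(2, 16, 64, 32, 32)   # (B, T, C, H, W) backbone feature maps
shifted = temporal_shift(frames)           # same shape, temporally mixed channels

fuse = AdaptiveEntityAttention(dim=256)
g = torch.randn(2, 1, 256)                 # global environment feature
e = torch.randn(2, 10, 256)                # 10 entity embeddings, e.g. from a VLM
fused, w = fuse(g, e)                      # w: per-entity attention weights
```

The returned attention weights indicate which scene entities the model attended to for each prediction, which is one plausible route to the interpretability the abstract claims over black-box baselines.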

Authors (4)
  1. Kim Hoang Tran (4 papers)
  2. Phuc Vuong Do (1 paper)
  3. Ngoc Quoc Ly (4 papers)
  4. Ngan Le (84 papers)
Citations (2)
