Knowledge Guided Entity-aware Video Captioning and A Basketball Benchmark (2401.13888v2)

Published 25 Jan 2024 in cs.CV

Abstract: Despite the recent emergence of video captioning models, generating text descriptions with specific entity names and fine-grained actions remains far from solved, even though this capability has valuable applications such as basketball live text broadcast. In this paper, a new multimodal knowledge-graph-supported basketball benchmark for video captioning is proposed. Specifically, we construct a multimodal basketball game knowledge graph (KG_NBA_2022) to provide additional knowledge beyond the videos. Then, a multimodal basketball game video captioning dataset (VC_NBA_2022), containing 9 types of fine-grained shooting events and knowledge (i.e., images and names) of 286 players, is constructed on the basis of KG_NBA_2022. We develop a knowledge-guided entity-aware video captioning network (KEANet) in encoder-decoder form, built on a candidate player list, for basketball live text broadcast. The temporal contextual information in the video is encoded by a bi-directional GRU (Bi-GRU) module, and an entity-aware module is designed to model the relationships among the players and highlight the key players. Extensive experiments on multiple sports benchmarks demonstrate that KEANet effectively leverages extra knowledge and outperforms advanced video captioning models. The proposed dataset and corresponding code will be made publicly available soon.
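
The abstract names two architectural pieces: a Bi-GRU encoder for temporal video context and an entity-aware module that attends over a candidate player list to highlight key players. The paper's implementation is not reproduced here; the following is a minimal PyTorch sketch of those two ideas, where all module names, dimensions, and the specific attention formulation are illustrative assumptions.

```python
# Minimal sketch of the two components the abstract describes: a Bi-GRU
# encoder for temporal context and an entity-aware attention over a
# candidate-player list. Dimensions and the attention formulation are
# assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn


class BiGRUEncoder(nn.Module):
    """Encodes per-frame video features with a bi-directional GRU."""

    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim)
        # returns: (batch, num_frames, 2 * hidden_dim)
        out, _ = self.gru(frames)
        return out


class EntityAwareAttention(nn.Module):
    """Scores candidate players against the pooled video context and
    returns an entity summary weighted toward the key players."""

    def __init__(self, video_dim: int = 1024, entity_dim: int = 512):
        super().__init__()
        self.query = nn.Linear(video_dim, entity_dim)

    def forward(self, video_feats: torch.Tensor,
                player_embs: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, num_frames, video_dim)
        # player_embs: (batch, num_players, entity_dim),
        #   e.g. embeddings of player images/names from the knowledge graph
        q = self.query(video_feats.mean(dim=1, keepdim=True))   # (batch, 1, entity_dim)
        scores = torch.softmax(q @ player_embs.transpose(1, 2), dim=-1)
        return scores @ player_embs                              # (batch, 1, entity_dim)


# Usage with random tensors standing in for frame features and
# candidate-player embeddings; the fused outputs would feed a caption decoder.
encoder = BiGRUEncoder()
attention = EntityAwareAttention()
video = encoder(torch.randn(2, 16, 2048))         # 16 frames per clip
entities = attention(video, torch.randn(2, 12, 512))  # 12 candidate players
print(video.shape, entities.shape)
```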

