
LLM4VG: Large Language Models Evaluation for Video Grounding (2312.14206v3)

Published 21 Dec 2023 in cs.CV

Abstract: Recently, researchers have attempted to investigate the capability of LLMs in handling videos and have proposed several video LLM models. However, the ability of LLMs to handle video grounding (VG), an important time-related video task that requires the model to precisely locate the start and end timestamps of temporal moments in videos matching a given textual query, remains unclear and unexplored in the literature. To fill this gap, we propose the LLM4VG benchmark, which systematically evaluates the performance of different LLMs on video grounding tasks. Based on LLM4VG, we design extensive experiments to examine two groups of video LLM models on video grounding: (i) video LLMs trained on text-video pairs (denoted as VidLLMs), and (ii) LLMs combined with pretrained visual description models such as video/image captioning models. We propose prompt methods to integrate the VG instruction with descriptions from different kinds of generators, including caption-based generators for direct visual description and VQA-based generators for information enhancement. We also provide comprehensive comparisons of various VidLLMs and explore the influence of different choices of visual models, LLMs, and prompt designs. Our experimental evaluations lead to two conclusions: (i) existing VidLLMs are still far from achieving satisfactory video grounding performance, and more time-related video tasks should be included to further fine-tune these models; and (ii) the combination of LLMs and visual models shows preliminary video grounding ability, with considerable potential for improvement from more reliable models and further guidance via prompt instructions.
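
To make the second evaluated setup concrete, below is a minimal Python sketch of a caption-then-prompt grounding pipeline of the kind the abstract describes: per-second captions stand in for the visual description model, an off-the-shelf LLM is prompted with the VG instruction and the query, and the predicted interval is scored with temporal IoU. The function names, prompt wording, and the `llm` callable are illustrative assumptions rather than the paper's exact implementation; only the temporal IoU formula is the standard video grounding metric.

```python
import re

def ground_query_with_captions(frame_captions, query, llm):
    """Ask an LLM for the (start, end) seconds of a query, given
    per-second captions produced by a visual description model.
    `llm` is assumed to be any callable mapping a prompt string to a
    completion string (e.g. a thin wrapper around an API client)."""
    # Turn the captions into a second-by-second textual description.
    description = "\n".join(
        f"Second {t}: {cap}" for t, cap in enumerate(frame_captions)
    )
    prompt = (
        "Below is a second-by-second description of a video.\n"
        f"{description}\n\n"
        f'Query: "{query}"\n'
        "During which seconds does the query occur? "
        "Answer strictly in the form: start-end."
    )
    # Parse the first "start-end" pair from the LLM's free-form answer.
    match = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", llm(prompt))
    return (float(match.group(1)), float(match.group(2))) if match else None

def temporal_iou(pred, gold):
    """Temporal intersection-over-union between two (start, end) intervals,
    the standard metric for scoring video grounding predictions."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0
```

For example, `temporal_iou((2.0, 7.0), (3.0, 8.0))` returns about 0.667; benchmarks of this kind typically report the fraction of queries whose predicted interval exceeds an IoU threshold such as 0.5 or 0.7.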

Authors (10)
  1. Wei Feng (208 papers)
  2. Xin Wang (1306 papers)
  3. Hong Chen (230 papers)
  4. Zeyang Zhang (28 papers)
  5. Zihan Song (4 papers)
  6. Yuwei Zhou (9 papers)
  7. Wenwu Zhu (104 papers)
  8. Houlun Chen (4 papers)
  9. Yuekui Yang (10 papers)
  10. Haiyang Wu (11 papers)