AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering (2311.14906v2)

Published 25 Nov 2023 in cs.CV

Abstract: We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering. The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video-questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation. 2) AutoEval-Video contains newly collected videos that cover over 40 distinct themes. To efficiently evaluate responses to the open-ended questions, we employ an LLM-based evaluation approach, but instead of merely providing a reference answer, we annotate unique evaluation rules for every single instance (video-question pair). To maximize the robustness of these rules, we develop a novel adversarial annotation mechanism. By using instance-specific rules as prompts, GPT-4, as an automatic evaluator, can achieve a stable evaluation accuracy of around 97.0%, comparable to the 94.9% - 97.5% accuracy of a human evaluator. Furthermore, we assess the performance of eight large vision-language models on AutoEval-Video. Among them, GPT-4V(ision) significantly outperforms other models, achieving an accuracy of 32.2%. However, there is still substantial room for improvement compared to human accuracy of 72.8%. By conducting an extensive case study, we uncover several drawbacks of GPT-4V, such as limited temporal and dynamic comprehension, and overly general responses. Code is available at https://github.com/Xiuyuan-Chen/AutoEval-Video.
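
The central mechanism described in the abstract is rule-conditioned automatic evaluation: GPT-4 acts as the judge and is prompted with the evaluation rules annotated for each individual video-question pair. The sketch below is a minimal illustration of that idea against the OpenAI chat completions API; the prompt wording, rule text, and example instance are hypothetical and are not the authors' actual annotation or prompt format (see the linked repository for those).

```python
# Minimal sketch of rule-based automatic evaluation with GPT-4 as the judge.
# The prompt wording and rule text below are illustrative, not taken from
# AutoEval-Video; the official prompts and per-instance rules live at
# https://github.com/Xiuyuan-Chen/AutoEval-Video.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_response(question: str, rules: str, model_response: str) -> bool:
    """Ask GPT-4 whether a candidate answer satisfies the instance-specific rules."""
    prompt = (
        "You are grading an answer to an open-ended video question.\n"
        f"Question: {question}\n"
        f"Evaluation rules for this instance:\n{rules}\n"
        f"Candidate answer: {model_response}\n"
        "Reply with exactly 'correct' or 'incorrect'."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic judging
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = reply.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")


# Hypothetical instance: the question, rule, and answer are made up for illustration.
is_correct = judge_response(
    question="What does the person do after opening the fridge?",
    rules="The answer is correct only if it mentions taking out a bottle of milk.",
    model_response="They grab a bottle of milk and close the door.",
)
print("correct" if is_correct else "incorrect")
```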
