AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering (2311.14906v2)
Abstract: We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering. The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation. 2) AutoEval-Video contains newly collected videos that cover over 40 distinct themes. To efficiently evaluate responses to the open-ended questions, we employ an LLM-based evaluation approach, but instead of merely providing a reference answer, we annotate unique evaluation rules for every single instance (video-question pair). To maximize the robustness of these rules, we develop a novel adversarial annotation mechanism. Using the instance-specific rules as the prompt, GPT-4, acting as an automatic evaluator, achieves a stable evaluation accuracy of around 97.0%, comparable to the 94.9%-97.5% accuracy of a human evaluator. Furthermore, we assess the performance of eight large vision-language models on AutoEval-Video. Among them, GPT-4V(ision) significantly outperforms the other models, achieving an accuracy of 32.2%. However, there is still substantial room for improvement compared to human accuracy of 72.8%. Through an extensive case study, we uncover several drawbacks of GPT-4V, such as limited temporal and dynamic comprehension and overly general responses. Code is available at https://github.com/Xiuyuan-Chen/AutoEval-Video.
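The abstract describes prompting GPT-4 with instance-specific evaluation rules so that it can act as an automatic judge of open-ended answers. Below is a minimal sketch of what such an evaluation call might look like; the prompt wording, the one-word verdict format, and the `judge_response` helper are illustrative assumptions, not AutoEval-Video's actual evaluation pipeline (see the linked repository for that).

```python
# Minimal sketch of GPT-4 as a rule-based evaluator.
# Assumptions: prompt wording and verdict format are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_response(question: str, rules: str, model_answer: str) -> str:
    """Grade one candidate answer against the instance-specific rules."""
    prompt = (
        "You are an evaluator for open-ended video question answering.\n"
        f"Question: {question}\n"
        f"Evaluation rules for this instance:\n{rules}\n"
        f"Candidate answer: {model_answer}\n"
        "Judge strictly by the rules above. Reply with exactly one word: "
        "'correct' or 'incorrect'."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging for stable accuracy
    )
    return resp.choices[0].message.content.strip().lower()


# Hypothetical instance, for illustration only.
verdict = judge_response(
    question="What does the person do after opening the fridge?",
    rules="Mark correct only if the answer mentions taking out a bottle of milk.",
    model_answer="They grab a bottle of milk and close the door.",
)
print(verdict)  # expected: 'correct'
```

In this sketch, each video-question pair would carry its own `rules` string, which is what distinguishes the approach from judging against a single reference answer.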