TempCompass: Do Video LLMs Really Understand Videos? (2403.00476v3)
Abstract: Recently, there has been a surge of interest in Video Large Language Models (Video LLMs). However, existing benchmarks fail to provide comprehensive feedback on the temporal perception ability of Video LLMs. On the one hand, most of them cannot distinguish between different temporal aspects (e.g., speed, direction) and thus cannot reflect the nuanced performance on these specific aspects. On the other hand, they are limited in the diversity of task formats (e.g., only multi-choice QA), which hinders the understanding of how temporal perception performance may vary across different types of tasks. Motivated by these two problems, we propose the TempCompass benchmark, which introduces a diversity of temporal aspects and task formats. To collect high-quality test data, we devise two novel strategies: (1) in video collection, we construct conflicting videos that share the same static content but differ in a specific temporal aspect, which prevents Video LLMs from exploiting single-frame bias or language priors; (2) to collect task instructions, we propose a paradigm in which humans first annotate meta-information for a video and an LLM then generates the instruction. We also design an LLM-based approach to automatically and accurately evaluate the responses from Video LLMs. Based on TempCompass, we comprehensively evaluate 8 state-of-the-art (SOTA) Video LLMs and 3 Image LLMs, and reveal the disconcerting fact that these models exhibit notably poor temporal perception ability. Our data will be available at https://github.com/llyx97/TempCompass.
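The "conflicting videos" idea can be illustrated with a minimal sketch. Assuming OpenCV is available, the snippet below reverses a clip's frame order to produce a pair that contains exactly the same frames (same static content) but differs in the temporal aspect of direction. The function name and file paths are illustrative and not the authors' actual data pipeline.

```python
# Minimal sketch (not the authors' pipeline): build a "conflicting" video pair that
# shares static content but differs in playback direction, by reversing frame order.
import cv2

def reverse_video(src_path: str, dst_path: str) -> None:
    """Write a copy of the video at src_path with its frames in reverse order."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)

    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    if not frames:
        raise ValueError(f"No frames decoded from {src_path}")

    height, width = frames[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(dst_path, fourcc, fps, (width, height))
    for frame in reversed(frames):
        writer.write(frame)
    writer.release()

if __name__ == "__main__":
    # The original and reversed clips contain identical frames, so single-frame bias
    # or language priors cannot separate them; only genuine temporal perception can.
    reverse_video("original.mp4", "reversed.mp4")
```

The LLM-based evaluation can likewise be sketched as an LLM-as-judge call: given a question, the ground-truth option, and a Video LLM's free-form response, a judge model decides whether the response is correct. The prompt wording, model name, and `judge` helper below are assumptions for illustration, not the paper's exact protocol.

```python
# Minimal sketch of LLM-based answer evaluation (assumed prompt and model name).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are evaluating a model's answer to a multi-choice question about a video.\n"
    "Question: {question}\n"
    "Ground-truth option: {answer}\n"
    "Model response: {response}\n"
    "Reply with exactly one word: 'correct' or 'incorrect'."
)

def judge(question: str, answer: str, response: str) -> bool:
    """Return True if the judge LLM deems the response correct."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer=answer, response=response)}],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")
```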