TempCompass: Do Video LLMs Really Understand Videos? (2403.00476v3)

Published 1 Mar 2024 in cs.CV

Abstract: Recently, there is a surge in interest surrounding video LLMs (Video LLMs). However, existing benchmarks fail to provide a comprehensive feedback on the temporal perception ability of Video LLMs. On the one hand, most of them are unable to distinguish between different temporal aspects (e.g., speed, direction) and thus cannot reflect the nuanced performance on these specific aspects. On the other hand, they are limited in the diversity of task formats (e.g., only multi-choice QA), which hinders the understanding of how temporal perception performance may vary across different types of tasks. Motivated by these two problems, we propose the TempCompass benchmark, which introduces a diversity of temporal aspects and task formats. To collect high-quality test data, we devise two novel strategies: (1) In video collection, we construct conflicting videos that share the same static content but differ in a specific temporal aspect, which prevents Video LLMs from leveraging single-frame bias or language priors. (2) To collect the task instructions, we propose a paradigm where humans first annotate meta-information for a video and then an LLM generates the instruction. We also design an LLM-based approach to automatically and accurately evaluate the responses from Video LLMs. Based on TempCompass, we comprehensively evaluate 8 state-of-the-art (SOTA) Video LLMs and 3 Image LLMs, and reveal the discerning fact that these models exhibit notably poor temporal perception ability. Our data will be available at https://github.com/llyx97/TempCompass.

Evaluating the Temporal Perception of Video LLMs with TempCompass

Introduction to TempCompass

In the expanding domain of LLMs with video understanding capabilities, popularly known as Video LLMs, the newly introduced TempCompass benchmark emerges as a comprehensive framework for evaluating these models' temporal perception abilities. Differentiating itself from existing benchmarks, TempCompass assesses Video LLMs' capacity to understand diverse temporal aspects such as action, speed, direction, attribute change, and event order through a variety of task formats including Multi-Choice Question-Answering (QA), Yes/No QA, Caption Matching, and Caption Generation. The benchmark is meticulously designed to challenge models beyond single-frame biases and language priors, thus providing a more holistic view of a model's video understanding capabilities.
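
To make this structure concrete, the sketch below shows how a single TempCompass-style test item covering one temporal aspect in all four task formats might be organized. The field names and example values are hypothetical illustrations, not the benchmark's actual data schema.

```python
# Hypothetical sketch of one TempCompass-style test item.
# Field names and values are illustrative, not the benchmark's actual schema.
test_item = {
    "video_id": "example_0001",
    "temporal_aspect": "direction",  # e.g., action, speed, direction, attribute change, event order
    "multi_choice_qa": {
        "question": "In which direction is the camera panning?",
        "options": ["A. Left to right", "B. Right to left", "C. Upward", "D. Downward"],
        "answer": "A",
    },
    "yes_no_qa": {
        "question": "Is the camera panning from left to right?",
        "answer": "yes",
    },
    "caption_matching": {
        "question": "Which caption matches the video?",
        "options": ["Caption A: The camera pans from left to right.",
                    "Caption B: The camera pans from right to left."],
        "answer": "Caption A",
    },
    "caption_generation": {
        "instruction": "Describe the camera motion in the video.",
        "reference": "The camera pans from left to right.",
    },
}
```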

Addressing the Gap in Video LLM Evaluation

The TempCompass benchmark was developed in response to certain limitations observed in previous approaches to evaluating Video LLMs:

  • Limited Temporal Aspect Differentiation: Prior benchmarks often conflated different temporal dynamics, hindering a nuanced evaluation of models' understanding of specific temporal properties.
  • Constrained Task Format Variety: Most benchmarks primarily utilized multi-choice QA formats, overlooking the potential insights offered by diverse evaluation methods.

TempCompass aims to fill these gaps by incorporating a range of temporal aspects and task formats, facilitating a detailed assessment of Video LLMs. This diversity not only challenges models across multiple dimensions of temporal understanding but also enables a richer analysis of their performance.

Benchmark Creation and Methodology

The creation of TempCompass involved several innovative strategies:

  • Conflicting Video Pairs/Triplets: To counteract reliance on single-frame bias and language priors, TempCompass includes videos that share identical static content but differ along a specific temporal aspect (e.g., speed or direction). This design ensures that accurate task completion requires genuine temporal understanding of the video.
  • Hybrid Data Collection Approach: Combining human annotation with LLM-generated content, TempCompass balances efficiency and quality in its dataset. Humans first annotate meta-information for each video, and an LLM then generates the task instructions, with human oversight ensuring relevance and clarity (a sketch of this paradigm follows the list).
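
The annotate-then-generate paradigm can be pictured roughly as follows. This is a minimal sketch under the assumption that human-written meta-information is serialized into a prompt for a chat-style LLM; the `call_llm` placeholder, the prompt wording, and the meta-information fields are all invented for illustration rather than taken from the paper's actual pipeline.

```python
import json

def build_instruction_prompt(meta_info: dict, task_format: str) -> str:
    """Compose a prompt asking an LLM to turn human-annotated meta-information
    about a video into a task instruction. Wording is illustrative only."""
    return (
        "You are given meta-information about a video:\n"
        f"{json.dumps(meta_info, indent=2)}\n\n"
        f"Write a {task_format} question that tests the temporal aspect "
        f"'{meta_info['temporal_aspect']}', together with the correct answer. "
        "The question must not be answerable from a single frame alone."
    )

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a chat-completion API call; not a real client."""
    raise NotImplementedError

# Invented example of human-annotated meta-information.
meta_info = {
    "temporal_aspect": "speed",
    "static_content": "a person riding a bicycle down a street",
    "temporal_description": "the action is played back at twice the normal speed",
}

prompt = build_instruction_prompt(meta_info, task_format="multi-choice QA")
# instruction = call_llm(prompt)  # the output would then be reviewed by human annotators
```

Keeping humans in the loop only for meta-information annotation and final review is what lets the benchmark scale instruction collection without sacrificing quality.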

Another cornerstone of TempCompass is its automatic evaluation methodology, which uses an LLM to assess the responses of Video LLMs. This approach, especially relevant for free-form response formats such as caption generation, showcases the potential of automated, LLM-based evaluation in AI research.
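
A minimal sketch of such LLM-based judging for free-form responses is shown below; the judging prompt, the `call_llm` placeholder, and the verdict parsing are assumptions for illustration, not the paper's exact evaluation prompt.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a chat-completion API call; not a real client."""
    raise NotImplementedError

def judge_response(question: str, ground_truth: str, model_response: str) -> bool:
    """Ask an LLM judge whether a Video LLM's free-form answer matches the
    ground truth. The prompt wording is an illustrative assumption."""
    prompt = (
        "You are evaluating a model's answer to a question about a video.\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {ground_truth}\n"
        f"Model answer: {model_response}\n"
        "Reply with exactly one word: correct or incorrect."
    )
    verdict = call_llm(prompt)
    return verdict.strip().lower().startswith("correct")

def accuracy(judgements: list[bool]) -> float:
    """Aggregate judged responses into a single accuracy score."""
    return sum(judgements) / len(judgements) if judgements else 0.0
```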

Insights from TempCompass Evaluation

Evaluating 8 state-of-the-art Video LLMs and 3 Image LLMs, TempCompass unveiled several key findings:

  • Underdeveloped Temporal Perception: Across the board, Video LLMs demonstrated limited ability to interpret temporal dynamics, in some aspects performing no better than Image LLMs that see only static frames.
  • Aspect- and Task-Specific Performance Variance: The benchmark highlighted not only how models' proficiency varies across temporal aspects but also how strongly the task format influences performance.

These results underline the essential need for further advancements in Video LLM technology, with a particular emphasis on improving temporal perception.

Future Directions and Limitations

While TempCompass marks a significant step towards better evaluation of Video LLMs, the research acknowledges inherent limitations. Notable concerns include the residual influence of single-frame bias and language priors despite efforts to mitigate them, and the difficulty of fully automating evaluation, particularly for caption generation tasks. Addressing these limitations will be crucial in refining the benchmark.

Conclusion

TempCompass introduces a rigorous and nuanced framework for evaluating the temporal perception abilities of Video LLMs. Its innovative design and evaluation strategies not only advance the state-of-the-art in AI benchmarks but also highlight critical areas for future research in video understanding. As Video LLMs continue to evolve, benchmarks like TempCompass will play an indispensable role in guiding their development towards more sophisticated levels of temporal and video comprehension.

Authors (9)
  1. Yuanxin Liu
  2. Shicheng Li
  3. Yi Liu
  4. Yuxiang Wang
  5. Shuhuai Ren
  6. Lei Li
  7. Sishuo Chen
  8. Xu Sun
  9. Lu Hou