MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (2311.17005v4)

Published 28 Nov 2023 in cs.CV

Abstract: With the rapid development of Multi-modal LLMs (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in the static image tasks, while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that, the existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.

Insights into MVBench: A Benchmark for Multi-Modal Video Understanding

The paper entitled "MVBench: A Comprehensive Multi-modal Video Understanding Benchmark" presents an in-depth exploration of the limitations of current multi-modal LLMs (MLLMs) and proposes an innovative benchmark designed to address existing deficiencies in video understanding tasks.

Overview of MVBench

The development of MVBench is driven by the realization that current diagnostic benchmarks fall short in evaluating the temporal comprehension capabilities of MLLMs. Whereas traditional benchmarks concentrate predominantly on static image-based tasks, MVBench transitions these tasks into a dynamic video context. This shift introduces 20 video tasks spanning a broad range of temporal skills, from perception to cognition. A distinctive static-to-dynamic method enables the systematic transformation of image tasks into video tasks, yielding challenges that cannot be solved from a single frame.
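To make the static-to-dynamic idea concrete, the sketch below pairs a few static image skills with temporal counterparts that cannot be answered from one frame. The pairings and names are illustrative assumptions for exposition, not MVBench's actual 20 task definitions.

```python
# Illustrative sketch of the static-to-dynamic task definition.
# The pairings below are assumptions for exposition, not MVBench's actual 20 tasks.

STATIC_TO_DYNAMIC = {
    # static (single-frame) skill     -> temporal skill that needs multiple frames
    "object existence":                 "moving object existence across frames",
    "object counting":                  "counting actions or moving objects over time",
    "attribute recognition":            "state change of an attribute over time",
    "scene recognition":                "scene transition between shots",
    "action recognition (single pose)": "action sequence and action prediction",
}

for static_skill, dynamic_task in STATIC_TO_DYNAMIC.items():
    print(f"{static_skill:34s} -> {dynamic_task}")
```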

Automatic QA Conversion and Evaluation Paradigm

A key aspect of the MVBench methodology is the automated conversion of existing public video annotations into a multiple-choice question-answering format. This automation minimizes manual intervention and grounds the evaluation in ground-truth video annotations, avoiding the biased scoring that can arise when LLMs act as judges. A carefully designed system prompt, paired with a simplified answer prompt, further constrains responses to a single option and keeps scoring objective and fair.
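The following is a minimal sketch of how such a conversion might look: a ground-truth annotation becomes a multiple-choice item whose correct answer is shuffled among distractors, with a system prompt and a short answer prompt attached. Field names and prompt wording are assumptions for illustration, not the benchmark's exact templates.

```python
import random

# Minimal sketch of converting a ground-truth video annotation into a
# multiple-choice QA item. Field names and prompt wording are assumptions,
# not MVBench's exact templates.

SYSTEM_PROMPT = (
    "Carefully watch the video and pay attention to the order of events. "
    "Then answer the question by choosing one of the given options."
)
ANSWER_PROMPT = "Best option: ("   # nudges the model to emit a single option letter

def build_mcqa(question: str, correct: str, distractors: list[str], seed: int = 0) -> dict:
    """Shuffle the ground-truth answer among distractors and record the correct letter."""
    rng = random.Random(seed)
    options = distractors + [correct]
    rng.shuffle(options)
    letters = "ABCD"[: len(options)]
    return {
        "system": SYSTEM_PROMPT,
        "question": question,
        "options": [f"({l}) {o}" for l, o in zip(letters, options)],
        "answer": letters[options.index(correct)],
        "answer_prompt": ANSWER_PROMPT,
    }

item = build_mcqa(
    question="What happened after the person picked up the cup?",
    correct="They put it on the table.",
    distractors=["They threw it away.", "They washed it.", "They drank from it."],
)
print(item["options"], "->", item["answer"])
```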

Video MLLM Baseline: VideoChat2

Given the observed inadequacies of existing MLLMs, particularly in temporal understanding, the paper introduces VideoChat2 as a stronger baseline. The model is built with progressive multi-modal training on diverse instruction-tuning data, and its architecture connects a vision encoder to the LLM through a streamlined QFormer that compresses video tokens before they reach the language model. On MVBench, VideoChat2 surpasses the leading models by over 15%.
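The sketch below illustrates the general shape of such a pipeline in PyTorch: a set of learnable queries cross-attends to frame tokens from a vision encoder, and a linear projection maps the compressed queries into the LLM's embedding space. All dimensions and module choices are placeholders, not the actual VideoChat2 configuration.

```python
import torch
import torch.nn as nn

# Conceptual sketch of a VideoChat2-style bridge: frame tokens from a vision
# encoder are compressed by a small query transformer ("QFormer") and projected
# into the LLM's embedding space. Dimensions and module choices are placeholders.

class SimpleQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=768, num_layers=2, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, visual_tokens):                  # (B, T*P, dim) frame patch tokens
        q = self.queries.expand(visual_tokens.size(0), -1, -1)
        for layer in self.layers:                      # queries cross-attend to visual tokens
            q = layer(q, visual_tokens)
        return q                                       # (B, num_queries, dim)

class VideoToLLMBridge(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=4096):
        super().__init__()
        self.qformer = SimpleQFormer(dim=vis_dim)
        self.proj = nn.Linear(vis_dim, llm_dim)        # aligns query tokens with LLM embeddings

    def forward(self, visual_tokens):
        return self.proj(self.qformer(visual_tokens))  # (B, num_queries, llm_dim)

bridge = VideoToLLMBridge()
fake_frames = torch.randn(2, 8 * 196, 768)             # 8 frames x 196 patch tokens each
llm_inputs = bridge(fake_frames)
print(llm_inputs.shape)                                 # torch.Size([2, 32, 4096])
```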

Results and Implications

The evaluations conducted on MVBench reveal crucial insights into the current state of video MLLMs: many otherwise strong models lag considerably on tasks that require temporal reasoning. VideoChat2 narrows this gap with a substantial leap in performance, particularly on action, object, scene, pose, and attribute tasks, although challenges remain on position, count, and character-related tasks.
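For reference, scoring such a multiple-choice benchmark reduces to exact matching of the predicted option letter against the ground truth, aggregated per task and averaged across tasks. A minimal sketch follows; the record fields are assumptions for illustration.

```python
from collections import defaultdict

# Sketch of scoring multiple-choice predictions per task and averaging over tasks.
# Record fields are assumptions; MVBench reports accuracy per task plus an
# overall average across its 20 tasks.

def score(records):
    """records: iterable of dicts with 'task', 'prediction', 'answer' (option letters)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["task"]] += 1
        hits[r["task"]] += int(r["prediction"].strip("() ").upper() == r["answer"])
    per_task = {t: hits[t] / totals[t] for t in totals}
    return per_task, sum(per_task.values()) / len(per_task)

per_task, avg = score([
    {"task": "Action Sequence", "prediction": "(B)", "answer": "B"},
    {"task": "Action Sequence", "prediction": "(C)", "answer": "A"},
    {"task": "Moving Count",    "prediction": "(D)", "answer": "D"},
])
print(per_task, f"avg={avg:.2f}")
```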

The findings carry substantial implications for future MLLM development. Attention should be directed toward strengthening grounding and reasoning capabilities and toward deeper integration of multi-modal data. The results also suggest that incorporating additional modalities, such as depth and audio, could further augment video comprehension.

Future Directions

The framework provided by MVBench paves the way for comprehensive evaluation and development of MLLMs capable of nuanced temporal understanding. There remains a vast potential for innovation in video-based AI models, including refining data annotations and extending the range of evaluation strategies. As research continues, MVBench will likely play an essential role in the progression toward more sophisticated, generalized video understanding models.

Overall, the paper provides a foundational contribution to the field of multi-modal AI by realigning evaluation benchmarks with the dynamic realities of video content. As AI continues to evolve, benchmarks like MVBench will be critical in guiding the design and training of next-generation video understanding models.

Authors (12)
  1. Yali Wang (78 papers)
  2. Yinan He (34 papers)
  3. Yizhuo Li (21 papers)
  4. Yi Wang (1038 papers)
  5. Yi Liu (543 papers)
  6. Zun Wang (42 papers)
  7. Jilan Xu (32 papers)
  8. Guo Chen (107 papers)
  9. Ping Luo (340 papers)
  10. Limin Wang (221 papers)
  11. Yu Qiao (563 papers)
  12. KunChang Li (43 papers)
Citations (198)