Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences (2401.10529v2)

Published 19 Jan 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Multimodal LLMs (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of cooccurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at https://github.com/umd-huang-lab/Mementos.

Evaluation of Multimodal LLM Reasoning with the Mementos Benchmark

The paper presents "Mementos," a comprehensive benchmark designed to assess the capabilities of Multimodal LLMs (MLLMs) in reasoning over image sequences. Existing benchmarks mainly evaluate reasoning over a single static image and rarely test whether models can track time-varying object behaviors or events in real-world scenarios, which limits our understanding of how well MLLMs reason about change over time. Mementos addresses this gap with 4,761 diverse image sequences of varying lengths sourced from domains such as daily life, robotics, and comics.
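
To make the evaluation setup concrete, the sketch below shows one way a Mementos-style sample (an ordered image sequence plus a ground-truth description of its dynamics) could be represented and turned into prompts for an MLLM. The field names and prompt wording are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical representation of a Mementos-style sample; field names are
# assumptions for illustration, not the benchmark's actual data format.
from dataclasses import dataclass
from typing import List

@dataclass
class SequenceSample:
    sequence_id: str
    domain: str                # e.g. "daily_life", "robotics", "comics"
    image_paths: List[str]     # ordered frames of the image sequence
    gt_description: str        # human-written description of the dynamics

def iter_prompts(samples: List[SequenceSample]):
    """Yield (sample, prompt) pairs asking an MLLM to describe the sequence."""
    for s in samples:
        prompt = (
            f"The following {len(s.image_paths)} images form a sequence. "
            "Describe what happens over time, focusing on the objects and "
            "their behaviors."
        )
        yield s, prompt
```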

The paper's contribution is twofold. First, it introduces a novel benchmark for sequential image reasoning, emphasizing the assessment of MLLMs' ability to interpret dynamic contexts and sequential visual information. Second, it proposes a GPT-4-assisted evaluation method that quantifies two kinds of hallucination in MLLM outputs: object hallucinations, where a model mentions objects that are not present in the sequence, and behavioral hallucinations, where it invents or misdescribes the actions those objects perform.
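
A minimal sketch of the scoring stage of such an evaluation is given below: object and behavior keywords are assumed to have already been extracted from the model output and the ground-truth description (for example, by prompting GPT-4), and the overlap is scored with precision, recall, and F1. The extraction step and the exact matching rules in the paper are more involved; everything here is an illustrative assumption.

```python
# Keyword-matching stage of a GPT-4-assisted evaluation (sketch).
# The keyword sets below are assumed to have been extracted beforehand.
from typing import Set, Tuple

def f1(pred: Set[str], ref: Set[str]) -> Tuple[float, float, float]:
    """Precision/recall/F1 of predicted keywords against reference keywords."""
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    tp = len(pred & ref)
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Example: keywords from a model description vs. the ground truth.
pred_objects = {"dog", "ball", "frisbee"}    # "frisbee" is a hallucinated object
ref_objects = {"dog", "ball", "owner"}
pred_behaviors = {"run", "catch"}
ref_behaviors = {"run", "fetch", "return"}

print("object P/R/F1:   ", f1(pred_objects, ref_objects))
print("behavior P/R/F1: ", f1(pred_behaviors, ref_behaviors))
```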

Key Findings

The paper evaluates nine recent MLLMs, including GPT-4V and Gemini, on the Mementos benchmark. The results reveal that these models struggle to accurately describe dynamic information in image sequences and frequently hallucinate. In particular, the paper identifies three primary factors contributing to their reasoning failures:

  1. Correlation between Object and Behavioral Hallucinations: MLLMs often produce errors due to incorrect object identification, which cascades into behavior misinterpretations.
  2. Impact of Co-Occurring Behaviors: Behaviors that frequently occur together can cause models to infer nonexistent behaviors in a sequence, pointing to pattern-driven rather than context-driven reasoning (a sketch of how this effect could be probed follows this list).
  3. Compounding Impact of Behavioral Hallucinations: Initial misinterpretations can accumulate, causing subsequent frames or events to be inaccurately described, exacerbated by the temporal nature of sequences.
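
As referenced in item 2, one way to probe the co-occurring-behavior effect is to check how often a hallucinated behavior is a frequent co-occurrence partner of a behavior the model described correctly. The sketch below does this over assumed ground-truth behavior keyword sets; it is an analysis idea under stated assumptions, not the paper's procedure.

```python
# Probe for the co-occurring-behavior effect (illustrative sketch).
from collections import Counter
from itertools import combinations
from typing import Iterable, Set

def cooccurrence_counts(ref_behavior_sets: Iterable[Set[str]]) -> Counter:
    """Count how often two behaviors appear together in ground-truth descriptions."""
    counts = Counter()
    for behaviors in ref_behavior_sets:
        for a, b in combinations(sorted(behaviors), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def hallucination_cooccurs(hallucinated: Set[str], correct: Set[str],
                           counts: Counter, min_count: int = 5) -> bool:
    """True if some hallucinated behavior is a frequent co-occurrence partner
    (in the ground truth) of a behavior the model described correctly."""
    return any(counts[(h, c)] >= min_count
               for h in hallucinated for c in correct)
```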

Implications and Future Developments

This work highlights significant practical and theoretical challenges in developing MLLMs with robust reasoning capabilities across sequences of images. Practically, the findings suggest the need for more refined approaches in training MLLMs to better handle dynamic, context-rich, and sequential data without falling into common pitfalls of hallucination. Theoretically, these results prompt a reevaluation of how current MLLMs understand temporal sequencing and logical connections, signaling a need for improved architectural designs that account for such complexities.

Future research could expand the Mementos benchmark by introducing more varied data, such as first-person navigation experiences or sequential medical datasets, potentially increasing the benchmark's complexity and relevance. Additionally, refining the evaluation process beyond keyword matching to consider deeper semantic understanding could lead to advancements in assessing MLLM capabilities. Furthermore, targeted strategies focusing on reducing hallucinations and enhancing reasoning abilities could significantly benefit both the development and application of MLLMs in diverse fields like robotics and interactive systems.
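
As one illustration of moving beyond exact keyword matching, the sketch below scores keyword matches with sentence-embedding similarity so that near-synonyms (e.g., "fetch" vs. "retrieve") can count as the same behavior. The embedding model and the similarity threshold are arbitrary illustrative choices, not part of the paper.

```python
# Embedding-based keyword matching as an alternative to exact matching (sketch).
# Model name and threshold are illustrative assumptions.
from typing import List
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_match(pred: List[str], ref: List[str], threshold: float = 0.6) -> int:
    """Count predicted keywords whose best cosine similarity to any
    reference keyword exceeds the threshold."""
    if not pred or not ref:
        return 0
    sims = util.cos_sim(model.encode(pred), model.encode(ref))
    return int((sims.max(dim=1).values >= threshold).sum())

matched = semantic_match(["fetch", "jump"], ["retrieve", "run"])
print(f"semantically matched keywords: {matched}")
```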

In summary, this paper presents a critical advancement in evaluating MLLMs and offers insightful directions for enhancing the reasoning capabilities of future AI models.

Authors (12)
  1. Xiyao Wang (26 papers)
  2. Yuhang Zhou (52 papers)
  3. Xiaoyu Liu (138 papers)
  4. Hongjin Lu (3 papers)
  5. Yuancheng Xu (17 papers)
  6. Feihong He (11 papers)
  7. Jaehong Yoon (43 papers)
  8. Taixi Lu (3 papers)
  9. Gedas Bertasius (55 papers)
  10. Mohit Bansal (304 papers)
  11. Huaxiu Yao (103 papers)
  12. Furong Huang (150 papers)
Citations (48)