MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models (2410.08182v1)

Published 10 Oct 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Existing multimodal retrieval benchmarks primarily focus on evaluating whether models can retrieve and utilize external textual knowledge for question answering. However, there are scenarios where retrieving visual information is either more beneficial or easier to access than textual data. In this paper, we introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench, in which we systematically identify and categorize scenarios where visually augmented knowledge is better than textual knowledge, for instance, more images from varying viewpoints. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. With MRAG-Bench, we conduct an evaluation of 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs exhibit greater improvements when augmented with images compared to textual knowledge, confirming that MRAG-Bench is vision-centric. Additionally, we conduct extensive analysis with MRAG-Bench, which offers valuable insights into retrieval-augmented LVLMs. Notably, the top-performing model, GPT-4o, faces challenges in effectively leveraging retrieved knowledge, achieving only a 5.82% improvement with ground-truth information, in contrast to a 33.16% improvement observed in human participants. These findings highlight the importance of MRAG-Bench in encouraging the community to enhance LVLMs' ability to utilize retrieved visual knowledge more effectively.

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

The paper "MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models" introduces a new benchmark designed to evaluate the capabilities of large vision-LLMs (LVLMs) in utilizing visually augmented knowledge. The benchmark, MRAG-Bench, aims to address scenarios where visual information retrieval is more pertinent than textual data, thereby providing a systematic assessment of vision-centric knowledge retrieval.

Core Contributions

MRAG-Bench focuses on scenarios where retrieving visual knowledge is either more useful or easier to obtain than retrieving text. The benchmark consists of 16,130 images and 1,353 multiple-choice questions spanning nine distinct real-world scenarios, grouped into two main aspects (a schematic example of a single benchmark item follows the list below):

  1. Perspective Understanding: Evaluates model performance when visual entities are presented from different viewpoints or when only partial views are available. It covers scenarios such as Angle, Partial, Scope, and Occlusion.
  2. Transformative Understanding: Assesses the model's ability to reason about visual transformations, such as biological changes or physical deformations. It covers scenarios such as Temporal, Deformation, Incomplete, and Biological.
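
A minimal sketch of what a single benchmark item might look like, purely for illustration: the field names below are assumptions, not the official MRAG-Bench schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MRAGBenchItem:
    """Hypothetical layout of one MRAG-Bench example (field names are illustrative)."""
    question: str                 # the multiple-choice question about the query image
    choices: List[str]            # answer options (e.g. four choices, A-D)
    answer: str                   # ground-truth choice label, e.g. "B"
    aspect: str                   # "perspective" or "transformative"
    scenario: str                 # one of the nine scenarios, e.g. "Angle" or "Occlusion"
    query_image: str              # path or URL of the image the question asks about
    gt_images: List[str] = field(default_factory=list)        # ground-truth retrieval images
    retrieved_images: List[str] = field(default_factory=list) # images returned by a retriever
```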

MRAG-Bench is the first benchmark to concentrate on vision-centric multimodal retrieval-augmented generation (RAG), emphasizing the retrieval of visual knowledge rather than text. It offers new insight into how LVLMs can leverage externally sourced visual data to improve their reasoning and generation capabilities.
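
To make the vision-centric RAG setup concrete, here is a rough sketch of how retrieved images might be interleaved with a multiple-choice question before being sent to an LVLM. The "content parts" message layout and the function signature are illustrative conventions, not taken from the paper.

```python
from typing import Dict, List


def build_multimodal_rag_prompt(question: str,
                                choices: List[str],
                                query_image_url: str,
                                retrieved_image_urls: List[str]) -> List[Dict]:
    """Assemble an interleaved image+text prompt for a retrieval-augmented LVLM.

    The message format below follows a common "content parts" convention;
    the actual format depends on the model API being used.
    """
    options = "\n".join(f"{label}. {choice}" for label, choice in zip("ABCD", choices))
    content: List[Dict] = [
        {"type": "text", "text": "Reference images retrieved for this question:"}
    ]
    # Place the retrieved visual knowledge before the query image so the model
    # can use it as context when answering.
    for url in retrieved_image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    content.append({"type": "text", "text": "Question image:"})
    content.append({"type": "image_url", "image_url": {"url": query_image_url}})
    content.append({"type": "text",
                    "text": f"{question}\n{options}\nAnswer with a single letter."})
    return [{"role": "user", "content": content}]
```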

Experimental Findings

The authors evaluated 14 LVLMs, ten open-source and four proprietary. The evaluation highlights several key findings (a sketch of how such accuracy improvements might be computed follows the list):

  • All models demonstrated improved performance when augmented with visually retrieved knowledge compared to textual retrieval.
  • The top-performing model, GPT-4o, achieved only a 5.82% performance improvement when supplemented with ground-truth knowledge, compared to a 33.16% improvement observed in human evaluations.
  • Open-source models struggled to effectively differentiate between high-quality and noisy retrieved examples, whereas proprietary models exhibited an emerging ability to discern usable visual knowledge from noise.
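
The improvement figures above compare a model's accuracy without retrieval to its accuracy with retrieved (or ground-truth) knowledge. A minimal sketch of that bookkeeping, assuming per-question correctness flags and an absolute percentage-point delta (the paper may define the gain differently):

```python
from typing import List


def accuracy(correct_flags: List[bool]) -> float:
    """Multiple-choice accuracy in percent."""
    return 100.0 * sum(correct_flags) / len(correct_flags)


def rag_improvement(base_flags: List[bool], rag_flags: List[bool]) -> float:
    """Accuracy gain (percentage points) from adding retrieved visual knowledge."""
    assert len(base_flags) == len(rag_flags), "runs must cover the same questions"
    return accuracy(rag_flags) - accuracy(base_flags)


# Toy example: 3 of 5 correct without retrieval, 4 of 5 with it -> +20.0 points.
print(rag_improvement([True, False, True, False, True],
                      [True, True, True, False, True]))
```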

Analysis and Insights

The analysis revealed several critical findings:

  • Visual knowledge provided greater performance enhancements for LVLMs than textual knowledge, suggesting an inherent advantage of visual RAG in scenarios defined by MRAG-Bench.
  • Model performance correlated positively with the accuracy of the retrieval mechanism, underscoring the importance of effective multimodal retrievers.
  • Although the benchmark provides an average of 20.4 ground-truth examples, performance generally plateaued at around 10 visual examples, indicating room for optimization in multimodal integration strategies (see the retrieval sketch after this list).
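
A minimal sketch of the kind of top-k image retrieval and k-sweep implied by these findings. It assumes image embeddings (e.g. from CLIP) have already been computed; the function name and the toy data are illustrative, not the paper's code.

```python
import numpy as np


def topk_image_indices(query_emb: np.ndarray, corpus_embs: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k corpus images most similar to the query (cosine similarity)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]


# Sweeping k (e.g. 1, 5, 10, 20) and re-running the multiple-choice evaluation at each
# value is one way to observe the plateau around 10 retrieved images noted above.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 512))   # stand-in for precomputed image embeddings
query = rng.normal(size=512)
for k in (1, 5, 10, 20):
    print(k, topk_image_indices(query, corpus, k))
```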

Theoretical and Practical Implications

The findings underscore the need for LVLMs that can handle complex vision-centric tasks by effectively integrating externally retrieved visual data. Theoretically, this motivates architectural designs that support stronger contextual reasoning over retrieved images; practically, it matters for deploying these models in real-world applications where visual content is abundant and easily accessible.

Future Directions

The paper opens several avenues for future research:

  • Refinement of LVLM architectures to better utilize diverse and noisy visual datasets.
  • Exploration of adaptive strategies for determining the optimal quantity of visual examples needed for effective knowledge integration.
  • Expansion into broader multimodal contexts, incorporating not just images but potentially video or 3D graphics to further enrich model interactions.

In conclusion, MRAG-Bench is a significant contribution to the field of AI, presenting rigorous evaluation metrics for understanding and improving the integration of visual knowledge in LVLMs. The benchmark encourages further exploration in retrieval-augmented multimodal reasoning, ultimately advancing the capabilities of AI systems in handling vision-intensive tasks.

Authors (7)
  1. Wenbo Hu
  2. Jia-Chen Gu
  3. Zi-Yi Dou
  4. Mohsen Fayyaz
  5. Pan Lu
  6. Kai-Wei Chang
  7. Nanyun Peng