SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM (2403.04735v1)

Published 7 Mar 2024 in cs.CV

Abstract: Vision-extended LLMs (VLLMs) have made significant strides in Visual Question Answering (VQA). Despite these advancements, VLLMs still encounter substantial difficulties in handling queries involving long-tail entities, with a tendency to produce erroneous or hallucinated responses. In this work, we introduce a novel evaluative benchmark named SnapNTell, specifically tailored for entity-centric VQA. This task aims to test the models' capabilities in identifying entities and providing detailed, entity-specific knowledge. We have developed the SnapNTell Dataset, distinct from traditional VQA datasets: (1) it encompasses a wide range of categorized entities, each represented by images and explicitly named in the answers; (2) it features QA pairs that require extensive knowledge for accurate responses. The dataset is organized into 22 major categories, containing 7,568 unique entities in total. For each entity, we curated 10 illustrative images and crafted 10 knowledge-intensive QA pairs. To address this novel task, we devised a scalable, efficient, and transparent retrieval-augmented multimodal LLM. Our approach markedly outperforms existing methods on the SnapNTell dataset, achieving a 66.5% improvement in the BLEURT score. We will soon make the dataset and the source code publicly accessible.
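
The abstract does not spell out the retrieval pipeline, but an entity-centric, retrieval-augmented VQA system of the kind described can be sketched as: embed the query image with a CLIP-style encoder, retrieve the nearest entity entries from a vector index of entity knowledge, and condition a generator LLM on the retrieved snippets. The sketch below is illustrative only; the entity store, model choices, prompt format, and helper functions are assumptions, not the authors' implementation.

```python
# Illustrative sketch of a retrieval-augmented, entity-centric VQA pipeline.
# NOT the SnapNTell implementation: the knowledge store, model choices, and
# prompt format are assumptions made for demonstration purposes.
import faiss
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical entity knowledge store: entity name + a short knowledge snippet.
entity_knowledge = [
    ("Golden Gate Bridge", "Suspension bridge in San Francisco, opened in 1937."),
    ("Tower Bridge", "Bascule and suspension bridge in London, completed in 1894."),
    ("Sydney Harbour Bridge", "Steel through arch bridge in Sydney, opened in 1932."),
]

def embed_texts(texts):
    """L2-normalized CLIP text embeddings (inner product == cosine similarity)."""
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    feats = model.get_text_features(**inputs).detach().numpy().astype("float32")
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def embed_image(image):
    """L2-normalized CLIP image embedding."""
    inputs = processor(images=image, return_tensors="pt")
    feats = model.get_image_features(**inputs).detach().numpy().astype("float32")
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

# Build a flat inner-product index over the entity snippets.
corpus = [f"{name}: {snippet}" for name, snippet in entity_knowledge]
embeddings = embed_texts(corpus)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

def retrieve_entities(image, k=2):
    """Return the k entity entries whose text embeddings best match the image."""
    scores, ids = index.search(embed_image(image), k)
    return [(corpus[i], float(s)) for i, s in zip(ids[0], scores[0])]

def build_prompt(question, retrieved):
    """Assemble a grounded prompt for a downstream (multimodal) generator LLM."""
    context = "\n".join(f"- {entry}" for entry, _ in retrieved)
    return (
        "Use the retrieved entity knowledge to answer.\n"
        f"Retrieved entries:\n{context}\n"
        f"Question: {question}\n"
        "Answer with the entity name and a factual detail."
    )

if __name__ == "__main__":
    image = Image.open("query.jpg")            # query image (path is a placeholder)
    retrieved = retrieve_entities(image, k=2)  # image-to-text entity retrieval step
    prompt = build_prompt("When was this bridge opened?", retrieved)
    print(prompt)                              # would be passed to the generator LLM
```

In the paper's setting, the knowledge store would cover the dataset's 7,568 entities rather than a handful of hand-written entries, and the final answer would be generated by a multimodal LLM conditioned on both the image and the retrieved text, with generated answers scored against references using metrics such as BLEURT, as in the abstract's reported improvement.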

