How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning? (2404.12866v2)

Published 19 Apr 2024 in cs.CL, cs.AI, and cs.CV

Abstract: The increase in parameter size of multimodal LLMs (MLLMs) introduces significant capabilities, particularly in-context learning, where MLLMs enhance task performance without updating pre-trained parameters. This effectiveness, however, hinges on the appropriate selection of in-context examples, a process currently biased towards visual data and overlooking textual information. Moreover, supervised retrievers for MLLMs, crucial for optimal in-context example selection, remain underexplored. Our study offers an in-depth evaluation of how textual information affects the unsupervised selection of in-context examples in multimodal contexts, uncovering a notable sensitivity of retriever performance to the employed modalities. In response, we introduce MSIER, a novel supervised MLLM retriever that employs a neural network to select examples that improve the efficiency of multimodal in-context learning. We validate this approach through extensive testing across three distinct tasks, demonstrating the method's effectiveness. Additionally, we investigate the influence of modalities on the training of our supervised retrieval method and pinpoint the factors contributing to our model's success. This exploration paves the way for future advancements, highlighting the potential for refined in-context learning in MLLMs through the strategic use of multimodal data.
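The unsupervised setting the paper evaluates ranks candidate demonstrations by embedding similarity, varying which modalities drive the ranking. Below is a minimal sketch of that idea, assuming CLIP as the encoder; the checkpoint name, the cosine-similarity scoring, and the linear modality-mixing weight `alpha` are illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch of unsupervised multimodal in-context example retrieval.
# Assumptions (not from the paper): CLIP as the encoder, cosine similarity,
# and a linear mix of image- and text-side scores controlled by `alpha`.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(images, texts):
    """Return L2-normalized CLIP image and text embeddings."""
    inputs = processor(text=texts, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return img, txt

def retrieve(query_image, query_text, pool_images, pool_texts, k=4, alpha=0.5):
    """Rank candidate demonstrations by a weighted mix of image and text
    cosine similarity; alpha=1.0 is image-only, alpha=0.0 is text-only."""
    q_img, q_txt = embed([query_image], [query_text])
    p_img, p_txt = embed(pool_images, pool_texts)
    score = alpha * (q_img @ p_img.T) + (1.0 - alpha) * (q_txt @ p_txt.T)
    return score.squeeze(0).topk(k).indices.tolist()

# Toy usage: blank images stand in for a real demonstration pool.
pool_imgs = [Image.new("RGB", (224, 224)) for _ in range(8)]
pool_txts = [f"caption of candidate example {i}" for i in range(8)]
query_img = Image.new("RGB", (224, 224))
print(retrieve(query_img, "a dog catching a frisbee", pool_imgs, pool_txts, k=2))
```

Sweeping `alpha` is one way to probe the paper's central question of how much textual information should drive retrieval; MSIER instead learns the ranking with supervision, which this unsupervised sketch does not attempt.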
