
Visual In-Context Learning for Large Vision-Language Models (2402.11574v1)

Published 18 Feb 2024 in cs.CV and cs.CL

Abstract: In Large Vision-Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities. To overcome these challenges, we introduce a novel Visual In-Context Learning (VICL) method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition. Our approach retrieves images via a "Retrieval & Rerank" paradigm, summarizes images with task intent and task-specific visual parsing, and composes language-based demonstrations that reduce token count and alleviate the cross-modal interaction problem. Experimental evaluations on five visual reasoning datasets demonstrate the effectiveness of our method. Moreover, our extensive experiments leverage information flow analysis to elucidate the effectiveness of our method, and investigate the impact of demonstration length and position for LVLMs. The use of in-context unlearning further shows promise in resetting specific model knowledge without retraining.
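The pipeline the abstract outlines — retrieve candidate demonstrations, rerank them, then compose a text-only prompt from intent-oriented image summaries — can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the embedding vectors, `rerank_score` field, and prompt template are all hypothetical stand-ins (in practice the retriever would use a vision encoder and the reranker a learned cross-encoder).

```python
# Hedged sketch of a VICL-style pipeline. All names and data structures here
# are illustrative assumptions, not the paper's actual code.
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_and_rerank(query_vec, demo_bank, k=4, top=2):
    """Two-stage 'Retrieval & Rerank': coarse top-k by embedding similarity,
    then a fine-grained rescoring of those candidates."""
    # Stage 1 (retrieve): top-k demonstrations by cosine similarity.
    candidates = sorted(demo_bank,
                        key=lambda d: cosine(query_vec, d["vec"]),
                        reverse=True)[:k]
    # Stage 2 (rerank): placeholder score; a cross-encoder reranker
    # would produce this in a real system.
    return sorted(candidates,
                  key=lambda d: d["rerank_score"],
                  reverse=True)[:top]

def compose_prompt(query_summary, demos):
    """Intent-oriented composition: each demonstration image is replaced by
    its text summary, so the LVLM consumes language tokens instead of
    image tokens for the in-context examples."""
    lines = [f"Image summary: {d['summary']}\nAnswer: {d['label']}"
             for d in demos]
    lines.append(f"Image summary: {query_summary}\nAnswer:")
    return "\n\n".join(lines)
```

The key design point the sketch mirrors is that only the final query needs visual input; retrieved demonstrations enter the context as compact text, which is what shortens the prompt and sidesteps cross-modal interaction between demonstration images.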

