
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning (2305.06500v2)

Published 11 May 2023 in cs.CV and cs.LG

Abstract: Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.


Summary

  • The paper demonstrates that vision-language instruction tuning on 13 held-in datasets (drawn from 26 transformed datasets) significantly improves zero-shot performance on held-out vision-language tasks.
  • It adds an instruction-aware Q-Former on top of BLIP-2's frozen image encoder and frozen LLM, extracting visual features tailored to each instruction and task.
  • Ablation studies show that both instruction-aware feature extraction and data balancing are key to enhancing visual reasoning and multi-turn conversations.

InstructBLIP: Vision-Language Models Enhanced by Instruction Tuning

Introduction

The paper "InstructBLIP: Towards General-purpose Vision-LLMs with Instruction Tuning" addresses the complexity and diversity inherent in vision-language tasks through a novel framework leveraging instruction tuning. Vision-LLMs must handle inputs that are both textual and visual, making them significantly more challenging than traditional NLP tasks. This paper presents InstructBLIP, a model that utilizes instruction tuning to improve zero-shot generalization in vision-language tasks and demonstrates improved performance over previous models.

Vision-Language Instruction Tuning Methodology

The central innovation of InstructBLIP is its approach to vision-language instruction tuning, which trains the model on data reformatted as natural-language instructions. The authors gathered 26 publicly available datasets covering a wide array of tasks and capabilities, transformed them into an instruction-tuning format, and held out 13 of them for zero-shot evaluation (Figure 1).

Figure 1: Tasks and their corresponding datasets used for vision-language instruction tuning. The held-in datasets are indicated by yellow and the held-out datasets by white.
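To make this transformation concrete, the sketch below shows how a raw VQA-style record might be mapped into the instruction format by sampling one of several hand-written templates per task, as described in the paper. The template strings, field names, and example record here are illustrative stand-ins, not the exact templates used by the authors.

```python
import random

# Illustrative instruction templates for a VQA-style task; InstructBLIP uses
# a set of hand-crafted templates per task and samples one at training time.
VQA_TEMPLATES = [
    "<Image> Question: {question} Short answer:",
    "<Image> Given the image, answer the following question. {question}",
    "<Image> {question} A short answer to the question is:",
]

def to_instruction_format(sample: dict) -> dict:
    """Convert a raw (image, question, answer) record into an
    instruction-tuning example by sampling a template at random."""
    template = random.choice(VQA_TEMPLATES)
    return {
        "image": sample["image"],  # image path or preprocessed tensor
        "text_input": template.format(question=sample["question"]),
        "text_output": sample["answer"],
    }

# Example usage with a toy record.
example = {"image": "coco/0001.jpg", "question": "What color is the bus?", "answer": "red"}
print(to_instruction_format(example)["text_input"])
```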

The architecture of InstructBLIP builds on a pretrained BLIP-2 model, which consists of an image encoder, an LLM, and a Query Transformer (Q-Former); during instruction tuning, the image encoder and LLM remain frozen and only the Q-Former is updated. The key addition is making the Q-Former instruction-aware: it extracts visual features tailored to the specific instruction provided, improving the model's ability to follow varied instructions across tasks (Figure 2).

Figure 2: Model architecture of InstructBLIP. The Q-Former extracts instruction-aware visual features from the output embeddings of the frozen image encoder, and feeds the visual features as soft prompt input to the frozen LLM.
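To illustrate the data flow, here is a highly simplified, PyTorch-style sketch of the instruction-aware path: learnable query tokens are concatenated with the embedded instruction inside the Q-Former, attend to the frozen image encoder's output, and the resulting query embeddings are projected and prepended as a soft prompt to the frozen LLM's input. This is a conceptual sketch only; module choices, dimensions, and attention details stand in for the actual LAVIS implementation.

```python
import torch
import torch.nn as nn

class InstructionAwareQFormerSketch(nn.Module):
    """Conceptual stand-in for InstructBLIP's instruction-aware Q-Former path.
    Assumes image features have already been projected to the Q-Former width."""

    def __init__(self, num_queries=32, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.zeros(1, num_queries, qformer_dim))
        # Generic transformer as a stand-in for the BERT-style Q-Former.
        self.qformer = nn.Transformer(d_model=qformer_dim, nhead=12, batch_first=True)
        self.llm_proj = nn.Linear(qformer_dim, llm_dim)  # maps queries into LLM space

    def forward(self, image_embeds, instruction_embeds, llm_text_embeds):
        batch = image_embeds.size(0)
        queries = self.query_tokens.expand(batch, -1, -1)
        # Instruction tokens interact with the queries via self-attention,
        # so the extracted visual features depend on the instruction.
        qformer_in = torch.cat([queries, instruction_embeds], dim=1)
        qformer_out = self.qformer(src=image_embeds, tgt=qformer_in)
        query_out = qformer_out[:, : queries.size(1), :]   # keep query positions only
        soft_prompt = self.llm_proj(query_out)              # visual soft prompt
        # Prepend the soft prompt to the (frozen) LLM's token embeddings.
        return torch.cat([soft_prompt, llm_text_embeds], dim=1)

# Toy usage with random tensors in place of real encoder/LLM embeddings.
m = InstructionAwareQFormerSketch()
img = torch.randn(2, 257, 768)    # [batch, image patches, q-former dim]
instr = torch.randn(2, 16, 768)   # embedded instruction tokens (Q-Former side)
txt = torch.randn(2, 16, 4096)    # LLM-side embeddings of the instruction text
llm_input = m(img, instr, txt)    # shape [2, 32 + 16, 4096]
```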

Experimental Results and Analysis

The InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, outperforming BLIP-2 and the larger Flamingo models on tasks ranging from video QA to visual reasoning. When finetuned on individual downstream tasks, they also set new state-of-the-art results, such as 90.7% accuracy on ScienceQA questions with image contexts. These results underscore the efficacy of instruction tuning for improving the versatility and accuracy of vision-language models (Figure 3).

Figure 3: A few qualitative examples generated by our InstructBLIP Vicuna model. Here, a range of its diverse capabilities are demonstrated, including complex visual scene understanding and reasoning, knowledge-grounded image description, multi-turn visual conversation, etc.
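Readers who want to reproduce this kind of qualitative output can run the released checkpoints directly. The minimal inference sketch below uses the Hugging Face transformers port of InstructBLIP; it assumes a recent transformers version that includes the InstructBlip classes, the public Salesforce/instructblip-vicuna-7b checkpoint, and a local test image named example.jpg.

```python
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the public InstructBLIP (Vicuna-7B) checkpoint from the Hugging Face Hub.
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b"
).to(device)

image = Image.open("example.jpg").convert("RGB")   # any local test image
prompt = "What is unusual about this image?"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=128)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())
```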

An ablation study was conducted to assess the contributions of instruction-aware visual feature extraction and the data-balancing strategy. The findings confirm that both components contribute substantially to performance, especially on datasets requiring intricate visual reasoning.
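The data-balancing strategy samples training datasets with probability proportional to the square root of their sizes, so that very large datasets do not dominate the training mixture. A small sketch of that weighting is shown below; the dataset names and sizes are hypothetical placeholders, not the paper's actual numbers.

```python
import math

# Hypothetical held-in dataset sizes (number of training examples).
dataset_sizes = {
    "captioning_a": 560_000,
    "vqa_b": 440_000,
    "ocr_vqa_c": 1_000_000,
    "dialogue_d": 150_000,
}

# Sample each dataset with probability proportional to sqrt(size).
weights = {name: math.sqrt(n) for name, n in dataset_sizes.items()}
total = sum(weights.values())
sampling_probs = {name: w / total for name, w in weights.items()}

for name, p in sampling_probs.items():
    print(f"{name}: {p:.3f}")
```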

Instruction Tuning Versus Multitask Learning

The comparison between instruction tuning and multitask learning highlights a crucial insight: while multitask learning performs comparably on seen tasks, instruction tuning is clearly superior at generalizing to unseen tasks. This disparity indicates that the instruction format itself, not merely exposure to multiple datasets, drives the gain in a model's adaptability and generalization (Figure 4).

Figure 4: Comparison of instruction tuning and multitask training based on BLIP-2 FlanT5-XL.

Practical Implications and Future Directions

InstructBLIP's architecture and methodology imply potential for wide application in domains that require robust vision-language integration, from automated captioning to complex visual reasoning in real-time settings. The instruction-aware approach could also extend to dynamic, context-aware AI systems for even broader task generalization.

As research progresses, an exciting area of future development lies in further refining instruction tuning within multimodal settings, possibly integrating more sophisticated feedback mechanisms and broader datasets that encapsulate highly diverse contexts.

Conclusion

InstructBLIP represents a notable advancement in vision-language models, demonstrating the power of instruction tuning for handling the complexity and diversity of these tasks. By open-sourcing their models, the authors provide a valuable resource for future research, encouraging the development of general-purpose multimodal AI systems with enhanced understanding and versatility.
