Visual Instruction Tuning

(arXiv:2304.08485)
Published Apr 17, 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
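
The abstract describes the core design only at a high level: a vision encoder is connected to an LLM so that instruction tuning can be done end to end on image-text conversations. The snippet below is a minimal sketch of that connection, assuming a frozen CLIP-style vision backbone, a HuggingFace-style decoder LLM, and a single linear projection; the module names, dimensions, and interfaces are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a LLaVA-style model: frozen vision encoder -> trainable
# projection -> visual tokens prepended to text embeddings fed to the LLM.
# Names and dims (e.g. 1024 for a CLIP ViT-L/14 feature, 4096 for a 7B LLM)
# are assumptions for illustration only.

import torch
import torch.nn as nn

class LLaVAStyleModel(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a CLIP ViT, kept frozen
        self.llm = llm                         # assumed HuggingFace-style decoder LLM
        # Trainable projection that maps visual features into the LLM's
        # word-embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, image, input_ids):
        with torch.no_grad():                  # vision backbone stays frozen
            patch_feats = self.vision_encoder(image)     # (B, N_patches, vision_dim)
        visual_tokens = self.projector(patch_feats)       # (B, N_patches, llm_dim)
        text_embeds = self.llm.get_input_embeddings()(input_ids)
        # Prepend the projected visual tokens to the text token embeddings and
        # run the language model over the combined sequence.
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

In the paper, training proceeds in two stages: first only the projection is trained on image-text pairs to align visual features with the LLM's embedding space, and then the projection and the LLM are fine-tuned together on the GPT-4-generated visual instruction-following data.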

