LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

(2306.17107)
Published Jun 29, 2023 in cs.CV and cs.CL

Abstract

Instruction tuning unlocks the superior capability of Large Language Models (LLMs) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses to image-based instructions. However, visual instruction-tuned models still struggle to comprehend textual details within images. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. We then prompt text-only GPT-4 with the recognized texts and image captions to generate 16K conversations, each containing question-answer pairs about a text-rich image. By combining our collected data with previous multimodal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to a 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction skills (e.g., reasoning, writing, and elaboration) with humans based on the latest real-world online content that combines text and images. We make our code, data, and models publicly available at https://llavar.github.io/.
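To make the described data-collection pipeline concrete, below is a minimal sketch of its two stages: running OCR on a text-rich image, then prompting text-only GPT-4 with the recognized text and the image caption to produce question-answer pairs. The specific tools and prompt here are assumptions for illustration (pytesseract as the OCR engine, the OpenAI chat API, and a hypothetical prompt wording); the paper's actual OCR tooling, prompts, and filtering steps are not detailed in the abstract.

```python
# Sketch of the LLaVAR-style data collection pipeline described above.
# Assumptions (not from the paper): pytesseract for OCR, the OpenAI chat API
# for text-only GPT-4, and a simplified prompt. The real pipeline runs over
# ~422K LAION images and produces ~16K conversations after filtering.
import json
from PIL import Image
import pytesseract              # any publicly available OCR tool would do
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment


def ocr_image(path: str) -> str:
    """Run OCR on a text-rich image and return the recognized text."""
    return pytesseract.image_to_string(Image.open(path)).strip()


def generate_conversation(ocr_text: str, caption: str) -> str:
    """Ask text-only GPT-4 to write QA pairs grounded in OCR text and caption."""
    prompt = (
        "You are given OCR results and a caption for an image that contains text.\n"
        f"OCR: {ocr_text}\n"
        f"Caption: {caption}\n"
        "Write a short conversation of question-answer pairs about the image, "
        "answering as if you can see the image directly."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    # One example record; the full pipeline iterates over many LAION images.
    record = {"image": "poster.jpg", "caption": "A movie poster with a release date."}
    text = ocr_image(record["image"])
    conversation = generate_conversation(text, record["caption"])
    print(json.dumps({"image": record["image"], "conversation": conversation}, indent=2))
```

The generated conversations can then be merged with existing multimodal instruction-following data (e.g., LLaVA's) for visual instruction tuning.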

