LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

(2306.17107)
Published Jun 29, 2023 in cs.CV and cs.CL

Abstract

Instruction tuning unlocks the superior capability of Large Language Models (LLMs) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers, etc.). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. We then prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction (e.g., reasoning, writing, and elaboration) skills with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available at https://llavar.github.io/.
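The abstract outlines a two-stage data-collection recipe: run off-the-shelf OCR over text-rich LAION images, then prompt text-only GPT-4 with the recognized text and the image caption to produce instruction-following conversations. The sketch below illustrates that idea only; the OCR backend (pytesseract), the prompt wording, the file names, and the OpenAI client usage are illustrative assumptions, not the authors' released pipeline.

```python
# Minimal sketch of the pipeline described in the abstract:
# (1) OCR a text-rich image, (2) ask a text-only LLM to turn the OCR output
# and the image caption into question-answer pairs about the image.
# pytesseract, the prompt wording, and the GPT-4 call are assumptions made
# for illustration; they are not the authors' released code.
from PIL import Image
import pytesseract
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ocr_image(path: str) -> str:
    """Recognize text in an image with a publicly available OCR tool."""
    return pytesseract.image_to_string(Image.open(path)).strip()


def generate_conversation(ocr_text: str, caption: str) -> str:
    """Prompt a text-only model to write QA pairs grounded in the OCR text."""
    prompt = (
        "You are given the caption of an image and the text recognized in it by OCR.\n"
        f"Caption: {caption}\n"
        f"OCR text: {ocr_text}\n"
        "Write a short conversation of question-answer pairs about the image, "
        "as if the answerer can see the image but the asker cannot."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    text = ocr_image("movie_poster.jpg")  # hypothetical text-rich image
    print(generate_conversation(text, "a movie poster for a sci-fi film"))
```

Collecting such conversations for many text-rich images and mixing them with existing multi-modal instruction-following data is the kind of training mixture the abstract attributes to LLaVAR.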

