Instruction tuning unlocks the superior capability of LLMs (LLM) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers, etc.). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Moreover, we prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction (e.g., reasoning, writing, and elaboration) skills with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available at https://llavar.github.io/.
We're not able to analyze this paper right now due to high demand.
Please check back later (sorry!).
Sign up for a free account or log in to generate a summary of this paper:
We ran into a problem analyzing this paper.
Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. https://lmsys.org/blog/2023-03-30-vicuna/.
Openflamingo, March 2023. https://doi.org/10.5281/zenodo.7733589.
Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages