LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding (2306.17107v2)
Abstract: Instruction tuning unlocks the superior capability of Large Language Models (LLMs) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses to image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers, etc.). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Moreover, we prompt text-only GPT-4 with the recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multimodal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to a 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction skills (e.g., reasoning, writing, and elaboration) with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available at https://llavar.github.io/.
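The abstract describes a two-step data-collection pipeline: run OCR on text-rich LAION images, then prompt text-only GPT-4 with the recognized text plus the image caption to generate instruction-following conversations. Below is a minimal sketch of that pipeline. The specific OCR tool (pytesseract here), the prompt wording, and helper names such as `build_prompt` are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch: OCR a text-rich image, then ask text-only GPT-4 to write a
# conversation grounded in the caption and recognized text.
# Assumptions (not from the paper): pytesseract as the OCR tool, the OpenAI
# chat API for GPT-4, and an illustrative prompt template.
from PIL import Image
import pytesseract
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ocr_image(path: str) -> str:
    """Run OCR on an image file and return the recognized text."""
    return pytesseract.image_to_string(Image.open(path)).strip()


def build_prompt(caption: str, ocr_text: str) -> str:
    """Illustrative prompt: GPT-4 sees only text (caption + OCR), never pixels."""
    return (
        "You are given a description of an image and the text visible in it.\n"
        f"Caption: {caption}\n"
        f"Recognized text: {ocr_text}\n"
        "Write a multi-turn conversation between a human asking questions that "
        "require reading the text in the image and an assistant answering them."
    )


def generate_conversation(caption: str, image_path: str) -> str:
    """Generate one instruction-following conversation for a text-rich image."""
    prompt = build_prompt(caption, ocr_image(image_path))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(generate_conversation("A movie poster with bold title text", "poster.jpg"))
```

In the paper's setting, such generated conversations are combined with existing multimodal instruction-following data before fine-tuning the vision-language model.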
- Training language models to follow instructions with human feedback, 2022.
- Scaling instruction-finetuned language models, 2022.
- Visual instruction tuning, 2023a.
- Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a.
- Chunyuan Li. Large multimodal models: Notes on CVPR 2023 tutorial. arXiv preprint arXiv:2306.14895, 2023.
- An image is worth 16x16 words: Transformers for image recognition at scale, 2020.
- Learning transferable visual models from natural language supervision, 2021.
- Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021.
- Microsoft COCO: Common objects in context, 2015.
- On the hidden mystery of OCR in large multimodal models, 2023b.
- LLaMA-Adapter V2: Parameter-efficient visual instruction model, 2023.
- OpenAI. GPT-4 technical report, 2023.
- LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
- Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks, 2022a.
- Self-Instruct: Aligning language models with self-generated instructions, 2022b.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Baize: An open-source chat model with parameter-efficient tuning on self-chat data, 2023.
- LLaMA: Open and efficient foundation language models, 2023.
- G-Eval: NLG evaluation using GPT-4 with better human alignment, 2023c.
- Large language models are not fair evaluators, 2023.
- AlpacaFarm: A simulation framework for methods that learn from human feedback, 2023.
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023a.
- Video-ChatGPT: Towards detailed video understanding via large vision and language models, 2023.
- AudioGPT: Understanding and generating speech, music, sound, and talking head. arXiv preprint arXiv:2304.12995, 2023.
- SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities, 2023b.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models, 2023.
- LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention, 2023c.
- mPLUG-Owl: Modularization empowers large language models with multimodality, 2023.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
- OCR-VQA: Visual question answering by reading text in images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 947–952. IEEE, 2019.
- Cream: Visually-situated natural language understanding with contrastive reading model and frozen large language models, 2023.
- DiT: Self-supervised pre-training for document image transformer. In Proceedings of the 30th ACM International Conference on Multimedia, October 2022. doi: 10.1145/3503161.3547911.
- Evaluation of deep convolutional nets for document image classification and retrieval, 2015.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023b.
- OpenFlamingo, March 2023. URL https://doi.org/10.5281/zenodo.7733589.
- Language models are few-shot learners, 2020.
- PromptCap: Prompt-guided task-aware image captioning, 2022.
- Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022.
- Multimodal chain-of-thought reasoning in language models, 2023d.
- Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint arXiv:2304.09842, 2023.
- CLIPPO: Image-and-language understanding from pixels only, 2022.
- ICDAR 2019 competition on scene text visual question answering. In 2019 International Conference on Document Analysis and Recognition (ICDAR), September 2019. doi: 10.1109/ICDAR.2019.00251.
- Towards VQA models that can read. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. doi: 10.1109/CVPR.2019.00851.
- DocVQA: A dataset for VQA on document images, 2020.
- DoReMi: Optimizing data mixtures speeds up language model pretraining, 2023.
- Visual genome: Connecting language and vision using crowdsourced dense image annotations, 2016.
- ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.
- OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2017.
- Judging a book by its cover, 2016.