LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding (2306.17107v2)
Abstract: Instruction tuning unlocks the superior capability of large language models (LLMs) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters and book covers). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Moreover, we prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction skills (e.g., reasoning, writing, and elaboration) with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available at https://llavar.github.io/.
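The data-collection pipeline described in the abstract has two stages: run OCR over text-rich LAION images, then prompt text-only GPT-4 with the recognized text and the image caption to generate question-answer conversations. The snippet below is a minimal illustrative sketch of that flow, not the authors' released code; the choice of pytesseract as the OCR tool, the openai Python client, and the prompt wording are all assumptions made for this example.

```python
# Sketch of the two-stage data-collection flow: OCR a text-rich image, then ask
# text-only GPT-4 to write QA pairs grounded in the OCR text and the caption.
# Library and prompt choices are illustrative assumptions, not the paper's code.

import json
import pytesseract                 # assumed off-the-shelf OCR tool; any public OCR engine works
from PIL import Image
from openai import OpenAI          # assumed OpenAI Python client (v1-style API)

client = OpenAI()                  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are shown OCR results and a caption for an image you cannot see.\n"
    "OCR text: {ocr}\nCaption: {caption}\n"
    "Write a short conversation of question-answer pairs about the textual "
    "content of the image, as if between a user and an assistant."
)

def ocr_image(path: str) -> str:
    """Recognize the text in a (text-rich) image."""
    return pytesseract.image_to_string(Image.open(path)).strip()

def generate_conversation(ocr_text: str, caption: str) -> str:
    """Prompt text-only GPT-4 with OCR text and a caption to get QA pairs."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(ocr=ocr_text, caption=caption)}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Toy run over a single (image, caption) pair; the paper scales this idea to
    # 422K OCR'd images and 16K GPT-4-generated conversations.
    sample = {"image": "poster.jpg", "caption": "A movie poster with a title and tagline."}
    conversation = generate_conversation(ocr_image(sample["image"]), sample["caption"])
    print(json.dumps({"image": sample["image"], "conversation": conversation}, indent=2))
```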
- Training language models to follow instructions with human feedback, 2022.
- Scaling instruction-finetuned language models, 2022.
- Visual instruction tuning, 2023a.
- Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a.
- Chunyuan Li. Large multimodal models: Notes on CVPR 2023 tutorial. arXiv preprint arXiv:2306.14895, 2023.
- An image is worth 16x16 words: Transformers for image recognition at scale, 2020.
- Learning transferable visual models from natural language supervision, 2021.
- Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021.
- Microsoft COCO: Common objects in context, 2015.
- On the hidden mystery of OCR in large multimodal models, 2023b.
- LLaMA-Adapter V2: Parameter-efficient visual instruction model, 2023.
- OpenAI. GPT-4 technical report, 2023.
- LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
- Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks, 2022a.
- Self-Instruct: Aligning language models with self-generated instructions, 2022b.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Baize: An open-source chat model with parameter-efficient tuning on self-chat data, 2023.
- LLaMA: Open and efficient foundation language models, 2023.
- G-Eval: NLG evaluation using GPT-4 with better human alignment, 2023c.
- Large language models are not fair evaluators, 2023.
- AlpacaFarm: A simulation framework for methods that learn from human feedback, 2023.
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023a.
- Video-ChatGPT: Towards detailed video understanding via large vision and language models, 2023.
- AudioGPT: Understanding and generating speech, music, sound, and talking head. arXiv preprint arXiv:2304.12995, 2023.
- SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities, 2023b.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models, 2023.
- LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention, 2023c.
- mPLUG-Owl: Modularization empowers large language models with multimodality, 2023.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
- OCR-VQA: Visual question answering by reading text in images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 947–952. IEEE, 2019.
- Cream: Visually-situated natural language understanding with contrastive reading model and frozen large language models, 2023.
- DiT: Self-supervised pre-training for document image transformer. In Proceedings of the 30th ACM International Conference on Multimedia, Oct 2022. doi: 10.1145/3503161.3547911. URL http://dx.doi.org/10.1145/3503161.3547911.
- Evaluation of deep convolutional nets for document image classification and retrieval, 2015.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023b.
- OpenFlamingo, March 2023. URL https://doi.org/10.5281/zenodo.7733589.
- Language models are few-shot learners, 2020.
- PromptCap: Prompt-guided task-aware image captioning, 2022.
- Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022.
- Multimodal chain-of-thought reasoning in language models, 2023d.
- Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint arXiv:2304.09842, 2023.
- CLIPPO: Image-and-language understanding from pixels only, 2022.
- ICDAR 2019 competition on scene text visual question answering. In 2019 International Conference on Document Analysis and Recognition (ICDAR), Sep 2019. doi: 10.1109/icdar.2019.00251. URL http://dx.doi.org/10.1109/ICDAR.2019.00251.
- Towards VQA models that can read. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019. doi: 10.1109/cvpr.2019.00851. URL http://dx.doi.org/10.1109/CVPR.2019.00851.
- DocVQA: A dataset for VQA on document images, 2020.
- DoReMi: Optimizing data mixtures speeds up language model pretraining, 2023.
- Visual genome: Connecting language and vision using crowdsourced dense image annotations, 2016.
- ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.
- OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2017.
- Judging a book by its cover, 2016.