Enhanced Visual Instruction Tuning for Text-Rich Image Understanding: An Overview of LLaVAR
The paper "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding" presents advancements in the field of visual instruction tuning by addressing the limitations of current models in comprehending textual elements within images. Instruction tuning has significantly improved the utility of LLMs in human interaction tasks. These models, when augmented with visual encoders, possess the potential for comprehensive human-agent interaction based on visual inputs. However, their efficacy diminishes when tasked with dissecting and understanding the textual intricacies present within images - an area crucial for enriched visual comprehension.
Methodology and Data Collection
This paper introduces LLaVAR, a model designed to augment the visual instruction tuning capability of its predecessors by focusing on text-rich images. The authors enhance the current visual instruction tuning pipeline by collecting a substantial dataset from the LAION repository, composed of 422K text-rich images such as movie posters and book covers. Textual data is then extracted from these images using Optical Character Recognition (OCR) tools. This dataset addresses a key limitation of existing models, which are trained predominantly on natural images that lack embedded textual information.
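As an illustration of this collection step, the sketch below pairs each image with its OCR output. The choice of pytesseract, the folder name, and the record format are assumptions for the example rather than the paper's actual pipeline.

```python
# Minimal sketch of the OCR-collection step, assuming pytesseract as the OCR
# tool (the paper's exact OCR pipeline may differ) and a local folder of
# LAION-derived text-rich images.
from pathlib import Path

import pytesseract
from PIL import Image


def extract_ocr_text(image_path: Path) -> str:
    """Run OCR on a single image and return the recognized text."""
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img).strip()


def build_ocr_records(image_dir: Path) -> list[dict]:
    """Pair each image with its OCR output, skipping images with no detected text."""
    records = []
    for path in sorted(image_dir.glob("*.jpg")):
        text = extract_ocr_text(path)
        if text:  # keep only images where the OCR tool found something
            records.append({"image": path.name, "ocr_text": text})
    return records


if __name__ == "__main__":
    records = build_ocr_records(Path("laion_text_rich_images"))
    print(f"Collected OCR text for {len(records)} images")
```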
Moreover, the authors employ GPT-4 to process the gathered OCR data and image captions, generating 16K question-answer conversational pairs tailored for text-rich images. These pairs provide high-quality, instruction-following examples crucial for the model training phases.
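The following sketch shows how such a generation step could be wired up. The prompt wording, the helper names, and the use of the openai v1-style client are illustrative assumptions, not the paper's exact template or tooling.

```python
# Illustrative sketch of the GPT-4 generation step: the prompt wording here is
# a paraphrase, not the paper's exact template, and the openai v1-style client
# is an assumption about tooling.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT_TEMPLATE = """You are given the caption of an image and the text found in it by OCR.
Caption: {caption}
OCR text: {ocr_text}
Write a short question-and-answer conversation about the textual content of the image.
Answers must be grounded in the OCR text."""


def generate_conversation(caption: str, ocr_text: str) -> str:
    """Ask GPT-4 for an instruction-following QA conversation about one image."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(
            caption=caption, ocr_text=ocr_text)}],
        temperature=0.7,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(generate_conversation(
        caption="A movie poster with a title and a release date.",
        ocr_text="COMING SOON  ONLY IN THEATERS  MARCH 1",
    ))
```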
Model Architecture and Training Process
LLaVAR pairs a CLIP-ViT visual encoder with a Vicuna language decoder. The model undergoes a two-stage training process:
- Pre-training Stage: Visual features are aligned with the language decoder through a trainable projection matrix. Both the newly collected noisy data and existing pre-training datasets are used for this feature alignment, while the decoder itself is not fine-tuned.
- Fine-tuning Stage: The model is trained on the high-quality instruction-following pairs; both the feature projection matrix and the decoder are updated to refine question-answering capabilities (a schematic of the two stages appears after this list).
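The following PyTorch sketch makes the stage-wise parameter freezing concrete. The module names (vision_encoder, projection, decoder) are placeholders standing in for CLIP-ViT, the projection matrix, and Vicuna, not the authors' actual code.

```python
# A schematic of the two training stages, assuming a generic PyTorch setup:
# `vision_encoder`, `projection`, and `decoder` are placeholder modules.
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = trainable


def configure_pretraining(vision_encoder: nn.Module,
                          projection: nn.Linear,
                          decoder: nn.Module) -> None:
    """Stage 1: only the projection matrix is updated for feature alignment."""
    set_trainable(vision_encoder, False)
    set_trainable(decoder, False)
    set_trainable(projection, True)


def configure_finetuning(vision_encoder: nn.Module,
                         projection: nn.Linear,
                         decoder: nn.Module) -> None:
    """Stage 2: the projection matrix and the language decoder are both updated."""
    set_trainable(vision_encoder, False)
    set_trainable(projection, True)
    set_trainable(decoder, True)
```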
Results and Implications
LLaVAR demonstrates substantial improvement over the original LLaVA and other models across four text-based Visual Question Answering (VQA) datasets, namely ST-VQA, OCR-VQA, TextVQA, and DocVQA, with accuracy improvements of up to 20%. A higher-resolution variant, which raises the visual encoder's input resolution from 224×224 to 336×336, further underscores the model's proficiency in capturing and interpreting small textual details, a task traditionally challenging for standard models due to resolution constraints.
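Reported figures for open-ended text VQA are often computed with a simple containment criterion: a response counts as correct if a ground-truth answer string appears in it. The sketch below illustrates that style of metric; it reflects a common evaluation convention and is not necessarily the paper's exact scoring protocol.

```python
# A simple containment-based accuracy metric of the kind commonly used for
# open-ended text VQA evaluation; an illustration, not necessarily the paper's
# exact scoring protocol.
def vqa_accuracy(predictions: list[str], answers: list[list[str]]) -> float:
    """A prediction is correct if any ground-truth answer appears in it (case-insensitive)."""
    correct = 0
    for pred, gold_answers in zip(predictions, answers):
        pred_lower = pred.lower()
        if any(ans.lower() in pred_lower for ans in gold_answers):
            correct += 1
    return correct / len(predictions) if predictions else 0.0


if __name__ == "__main__":
    preds = ["The poster says the movie opens on March 1.", "I cannot read the text."]
    golds = [["march 1"], ["1984"]]
    print(f"accuracy = {vqa_accuracy(preds, golds):.2f}")  # 0.50
```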
The strength of LLaVAR lies in its strategic data augmentation and instruction-following training, which improve robustness through OCR-derived training data and high-resolution inputs. This improvement matters in real-world applications where the combination of textual and visual information processing plays a pivotal role, spanning areas from navigating digital content to autonomous vehicle systems that rely on traffic signs.
Future Directions
The progression reflected in LLaVAR paves the way for exploring even higher resolutions and more sophisticated visual encoders. Future work might benefit from expanding the datasets further or applying domain reweighting strategies to maximize data utility. Improving computational efficiency in both high-resolution and multimodal settings also remains an important direction for research.
Conclusion
By augmenting the text recognition capabilities of visual instruction models through substantial datasets and targeted model adaptations, LLaVAR represents a meaningful stride in visual language processing. The paper not only highlights current limitations and avenues for advancement but also offers a blueprint for future AI models with improved visual-textual comprehension across a variety of computational domains.