Introduction
The field of AI has recently been dominated by research on Large Vision-Language Models (LVLMs), which are potent tools for tasks at the intersection of Computer Vision (CV) and NLP. However, the extensive parameter counts of these models often exceed the capabilities of standard consumer GPUs, limiting their accessibility. Vary-toy, a condensed version of the Vary model, emerges as a solution to this issue, retaining the core capabilities of its predecessor while introducing a more general and improved vision vocabulary.
Challenges and Innovations
CLIP-based vision vocabularies, on which most LVLMs rely, have shown remarkable proficiency, yet they become a bottleneck when encoding dense visual signals such as document-level text. Addressing this, Vary-toy takes an object-detection-driven approach to curating its new vision vocabulary: negative natural-image samples are replaced with positive object-detection samples so that the vocabulary network's capacity is not wasted. By integrating object location data into the training targets (a sketch of one possible serialization follows), Vary-toy encodes visual information corresponding to natural objects, enhancing its versatility.
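To make the idea concrete, the snippet below is a minimal, hypothetical sketch of how detection annotations could be serialized into autoregressive text targets for the vocabulary network. The prompt template, coordinate grid, and field names are illustrative assumptions, not the exact format used by Vary-toy.

```python
# Hypothetical sketch: turn object-detection annotations into a text target,
# so positive natural-image samples (rather than purely negative ones) can
# supervise the new vision vocabulary. Format is an assumption for illustration.

def boxes_to_text(annotations, image_w, image_h, max_objects=30):
    """Serialize boxes as '<category>[x1,y1,x2,y2]' with coordinates
    normalized to a 0-1000 integer grid (a common convention)."""
    pieces = []
    for obj in annotations[:max_objects]:
        x1, y1, x2, y2 = obj["bbox"]  # absolute pixel coordinates
        norm = [
            int(1000 * x1 / image_w),
            int(1000 * y1 / image_h),
            int(1000 * x2 / image_w),
            int(1000 * y2 / image_h),
        ]
        pieces.append(f'{obj["category"]}[{",".join(map(str, norm))}]')
    return "; ".join(pieces)

sample = [
    {"category": "dog", "bbox": (48.0, 120.0, 310.0, 440.0)},
    {"category": "frisbee", "bbox": (355.0, 90.0, 420.0, 150.0)},
]
print(boxes_to_text(sample, image_w=640, image_h=480))
# dog[75,250,484,916]; frisbee[554,187,656,312]
```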
Methodology and Data
Vary-toy's pipeline follows a two-step approach: generating an improved vision vocabulary with a structure called Vary-tiny+, then merging it with the existing CLIP-based vocabulary inside a 1.8B-parameter LLM (a sketch of one possible fusion scheme follows). The training data encompass PDF image-text pairs, natural images with object-detection annotations, and several other multimodal datasets. The system undergoes multi-task pre-training followed by Supervised Fine-Tuning (SFT) on the LLaVA-80K dataset, ensuring robust training without excessive computational cost.
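The following is a minimal PyTorch-style sketch of how a new vision vocabulary could be fused with a frozen CLIP vocabulary: each encoder yields its own token sequence, both are projected to the LLM's hidden size, and the result is concatenated with the text embeddings. Module names and dimensions are illustrative assumptions, not Vary-toy's exact implementation.

```python
# Minimal sketch (assumed dimensions): fuse tokens from a new vision
# vocabulary with tokens from a frozen CLIP vocabulary before the LLM.
import torch
import torch.nn as nn

class DualVocabularyFusion(nn.Module):
    def __init__(self, new_dim=1024, clip_dim=1024, llm_dim=2048):
        super().__init__()
        self.proj_new = nn.Linear(new_dim, llm_dim)    # new (Vary-tiny+-style) vocabulary
        self.proj_clip = nn.Linear(clip_dim, llm_dim)  # existing CLIP vocabulary

    def forward(self, new_tokens, clip_tokens, text_embeds):
        # new_tokens:  (B, N1, new_dim)   features from the new vision encoder
        # clip_tokens: (B, N2, clip_dim)  features from the frozen CLIP encoder
        # text_embeds: (B, T, llm_dim)    embedded text prompt tokens
        vision = torch.cat(
            [self.proj_new(new_tokens), self.proj_clip(clip_tokens)], dim=1
        )
        # Prepend the fused vision tokens to the text sequence for the LLM.
        return torch.cat([vision, text_embeds], dim=1)

fusion = DualVocabularyFusion()
out = fusion(
    torch.randn(1, 256, 1024),  # e.g. 256 tokens from the new vocabulary
    torch.randn(1, 256, 1024),  # e.g. 256 tokens from CLIP
    torch.randn(1, 32, 2048),   # 32 text tokens at the LLM's hidden size
)
print(out.shape)  # torch.Size([1, 544, 2048])
```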
Experimental Results
Vary-toy's performance on benchmarks such as DocVQA, ChartQA, RefCOCO, and MMVet confirms its competitive edge, with accuracies close to, and in some cases exceeding, those of larger models. Notably, its results on RefCOCO val and MMVet demonstrate strong natural-object perception and general proficiency, respectively. The architecture successfully balances retaining text-perception ability with broadening visual understanding, suggesting further areas of exploration and potential developments for Vary-toy.
Conclusion
The development of Vary-toy underscores the potential of small-size LVLMs not only in research but also as practical tools for individuals or institutions with limited computational resources. Through the creation of a more comprehensive vision vocabulary, Vary-toy stands as a testament to advancements in the efficient use of LLMs for multimodal tasks. It holds promise for further contributions to the LVLM domain and may inspire adaptations that render these powerful models more universally accessible.