Small Language Model Meets with Reinforced Vision Vocabulary (2401.12503v1)

Published 23 Jan 2024 in cs.CV

Abstract: Playing Large Vision Language Models (LVLMs) in 2023 is trendy among the AI community. However, the relatively large number of parameters (more than 7B) of popular LVLMs makes it difficult to train and deploy on consumer GPUs, discouraging many researchers with limited resources. Imagine how cool it would be to experience all the features of current LVLMs on an old GTX1080ti (our only game card). Accordingly, we present Vary-toy in this report, a small-size Vary along with Qwen-1.8B as the base "large" language model. In Vary-toy, we introduce an improved vision vocabulary, allowing the model to not only possess all features of Vary but also gather more generality. Specifically, we replace negative samples of natural images with positive sample data driven by object detection in the procedure of generating vision vocabulary, more sufficiently utilizing the capacity of the vocabulary network and enabling it to efficiently encode visual information corresponding to natural objects. For experiments, Vary-toy can achieve 65.6% ANLS on DocVQA, 59.1% accuracy on ChartQA, 88.1% accuracy on RefCOCO, and 29% on MMVet. The code will be publicly available on the homepage.

Introduction

The field of AI has recently been dominated by research on Large Vision Language Models (LVLMs), which are potent tools for tasks at the intersection of Computer Vision (CV) and NLP. However, the extensive parameter counts of these models often exceed the capabilities of standard consumer GPUs, limiting their accessibility. Vary-toy, a condensed version of the Vary model, addresses this issue: it retains all features of its predecessor while introducing a more general, improved vision vocabulary.

Challenges and Innovations

The CLIP-style vision vocabulary used by most LVLMs has shown remarkable proficiency, yet it becomes a bottleneck when dense visual signals (for example, document text) must be encoded efficiently. Addressing this, Vary-toy takes an object-detection-driven approach to curating its new vision vocabulary: the negative natural-image samples used in the original Vary are replaced with positive, detection-based samples so that the vocabulary network's capacity is used more fully. By learning to encode object locations, the network efficiently represents visual information about natural objects, broadening the model's versatility.
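To make the detection-driven idea concrete, here is a minimal sketch of one way detection annotations can become text targets for the vocabulary network's decoder. The function name, box serialization, and 0-999 quantization range are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative sketch: serializing object-detection annotations into text
# targets, giving the vision vocabulary network "positive" natural-image
# supervision. The box format and 0-999 quantization are assumptions; the
# paper's exact serialization may differ.

def boxes_to_text(objects, img_w, img_h, bins=1000):
    """objects: list of {"label": str, "box": (x1, y1, x2, y2)} in pixels."""
    parts = []
    for obj in objects:
        x1, y1, x2, y2 = obj["box"]
        # quantize pixel coordinates into a fixed integer range [0, bins)
        q = lambda v, size: min(bins - 1, int(v * bins / size))
        coords = (q(x1, img_w), q(y1, img_h), q(x2, img_w), q(y2, img_h))
        parts.append(f'{obj["label"]} [{",".join(f"{c:03d}" for c in coords)}]')
    return "; ".join(parts)

# Example on a hypothetical annotation (640x480 image):
sample = [{"label": "dog", "box": (48, 120, 300, 420)},
          {"label": "frisbee", "box": (310, 80, 380, 150)}]
print(boxes_to_text(sample, img_w=640, img_h=480))
# -> "dog [075,250,468,875]; frisbee [484,166,593,312]"
```

Pairs of images and serialized boxes like these can then stand in for the negative natural-image samples when training the vocabulary network's autoregressive decoder.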

Methodology and Data

Vary-toy's pipeline couples two components: an improved vision vocabulary generated by a structure called Vary-tiny+, which is merged with an existing CLIP-based vocabulary and fed into a 1.8B LLM (Qwen-1.8B). The training data span PDF documents, natural images with object-detection annotations, and several other multimodal datasets. The model is first pre-trained across multiple task domains and then undergoes Supervised Fine-Tuning (SFT) on the LLaVA-80K dataset, keeping training robust without overloading computational resources.
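As a rough illustration of how two vision branches can feed one small LLM, the PyTorch sketch below concatenates features from the new vocabulary network and from CLIP channel-wise and projects them into the language model's embedding space. The dimensions, the shared token count, and the concatenate-then-project fusion are assumptions for illustration, not Vary-toy's released code.

```python
# Minimal PyTorch sketch of fusing two vision vocabularies before the LLM.
# Dimensions and the channel-wise concatenation are illustrative assumptions.
import torch
import torch.nn as nn

class DualVocabularyProjector(nn.Module):
    def __init__(self, vary_dim=1024, clip_dim=1024, llm_dim=2048):
        super().__init__()
        # one linear layer maps the concatenated features into the LLM space
        self.proj = nn.Linear(vary_dim + clip_dim, llm_dim)

    def forward(self, vary_tokens, clip_tokens):
        # vary_tokens: (B, N, vary_dim) from the reinforced vision vocabulary
        # clip_tokens: (B, N, clip_dim) from the CLIP branch (same N assumed)
        fused = torch.cat([vary_tokens, clip_tokens], dim=-1)
        return self.proj(fused)  # (B, N, llm_dim), prepended to text embeddings

# Dummy example: batch of 2, 256 image tokens per branch
proj = DualVocabularyProjector()
image_embeds = proj(torch.randn(2, 256, 1024), torch.randn(2, 256, 1024))
print(image_embeds.shape)  # torch.Size([2, 256, 2048])
```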

Experimental Results

Vary-toy's performance on benchmarks such as DocVQA (65.6% ANLS), ChartQA (59.1% accuracy), RefCOCO (88.1% accuracy), and MMVet (29%) confirms its competitive edge, approaching or exceeding the accuracy of considerably larger models. Its results on RefCOCO val and MMVet in particular demonstrate strong natural-object perception and general multimodal proficiency, respectively. The architecture balances retaining text-perception abilities with broadening visual understanding, which points to further avenues of exploration and development for Vary-toy.
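For reference, the DocVQA number above is reported in ANLS (Average Normalized Levenshtein Similarity). The sketch below follows the standard definition with a 0.5 threshold; it is a simplified reference implementation, not the official evaluation script.

```python
# Simplified ANLS (Average Normalized Levenshtein Similarity), the DocVQA
# metric. Follows the standard tau=0.5 definition; not the official script.

def levenshtein(a: str, b: str) -> int:
    """Edit distance via a rolling-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(predictions, references, tau=0.5):
    """predictions: list of answer strings; references: list of gold-answer lists."""
    scores = []
    for pred, golds in zip(predictions, references):
        p = pred.lower().strip()
        best = 0.0
        for gold in golds:
            g = gold.lower().strip()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / max(len(scores), 1)

print(anls(["$12.50"], [["$12.50", "12.50"]]))  # 1.0 (exact match)
```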

Conclusion

The development of Vary-toy underscores the potential of small-size LVLMs, not only for research but also as practical tools for individuals and institutions with limited computational resources. With its more comprehensive vision vocabulary, Vary-toy demonstrates that broad multimodal capability can be achieved with a modest LLM used efficiently. It holds promise for further contributions to the LVLM domain and may inspire adaptations that make these powerful models more broadly accessible.

Authors (9)
  1. Haoran Wei (55 papers)
  2. Lingyu Kong (13 papers)
  3. Jinyue Chen (5 papers)
  4. Liang Zhao (353 papers)
  5. Zheng Ge (60 papers)
  6. En Yu (31 papers)
  7. Jianjian Sun (23 papers)
  8. Chunrui Han (21 papers)
  9. Xiangyu Zhang (328 papers)
Citations (32)