- The paper introduces a novel black-box adaptation strategy using LLMs to enhance the zero-shot performance of VLMs for referring expression comprehension.
- It employs natural language prompts that encode object detections, allowing LLMs to perform spatial and semantic reasoning to select the best matching box.
- Experiments show significant gains, with Grounding-DINO's P@1 rising from 60.09 to 78.12 and Florence-2's from 68.28 to 77.94.
Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models
In an increasingly complex landscape of vision-and-language tasks, the paper "Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models" presents a novel approach to adapting vision-language models (VLMs) with LLMs in order to enhance their zero-shot capabilities. This adaptation method is particularly significant given limitations common to VLMs, such as weak spatial and semantic reasoning, especially in open-vocabulary tasks like Referring Expression Comprehension (REC).
Introduction
Vision-language models have demonstrated strong capabilities across various tasks, including image captioning, visual question answering (VQA), and text-image retrieval. However, VLMs often require fine-tuning on specific downstream tasks to achieve competitive performance, which requires white-box access to the model's architecture and weights. Fine-tuning also demands expertise in designing objectives and optimizing hyper-parameters for each specific task. This paper introduces a method that leverages LLMs to reason over VLM outputs in a black-box manner, enabling efficient, adaptable improvements without access to the VLM's internal structure.
Methodology
The presented method adapts VLMs for the REC task by passing their outputs to an LLM that handles spatial and semantic reasoning. Given a textual query, the LLM selects the best matching box from the object detections proposed by the VLM. The LLM is fine-tuned with a lightweight, efficient strategy based on LoRA, which adapts it without making any assumptions about the VLM's architecture.
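To make the black-box data flow concrete, the sketch below illustrates the selection step under assumed interfaces: `ask_llm` stands in for whatever LLM endpoint is used (a hosted API or the LoRA-tuned model described later), and the "Box k" answer format is an illustrative convention rather than the paper's exact protocol; the prompt itself is described in the next subsection.

```python
import re

def select_box(prompt, boxes, ask_llm):
    """Send a detection prompt to an LLM and map its reply back to a box.

    `boxes` are (x1, y1, x2, y2) tuples produced by any VLM; `ask_llm` is a
    text-in/text-out callable. The VLM itself is never touched, which is what
    makes the adaptation black-box.
    """
    reply = ask_llm(prompt)
    match = re.search(r"\d+", reply)            # expect a reply such as "Box 2"
    index = int(match.group()) if match else 0  # fall back to the first box
    return boxes[min(index, len(boxes) - 1)]

# Usage with a stand-in LLM that always answers "Box 1".
boxes = [(12, 40, 210, 380), (230, 35, 420, 390)]
print(select_box("...serialized detections and query...", boxes, lambda p: "Box 1"))
```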
Prompt Construction
The LLM prompt includes all detection outputs (box coordinates, labels, and optionally confidence scores) converted into natural language, enabling the LLM to reason over them. The prompt concludes with a query asking the LLM to identify the best matching box.
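As a concrete illustration, here is a minimal sketch of how detections might be serialized into such a prompt; the exact wording, field order, and handling of confidence scores are assumptions, not the paper's verbatim template.

```python
def build_prompt(query: str, detections: list[dict], include_scores: bool = True) -> str:
    """Serialize VLM detections into a natural-language prompt for the LLM.

    Each detection is assumed to be a dict with 'box' (x1, y1, x2, y2),
    'label', and optionally 'score'. The template is illustrative only.
    """
    lines = ["You are given object detections for an image."]
    for i, det in enumerate(detections):
        x1, y1, x2, y2 = det["box"]
        line = f"Box {i}: label '{det['label']}', coordinates ({x1}, {y1}, {x2}, {y2})"
        if include_scores and "score" in det:
            line += f", confidence {det['score']:.2f}"
        lines.append(line + ".")
    lines.append(f'Query: "{query}"')
    lines.append("Which box best matches the query? Answer with the box number.")
    return "\n".join(lines)

# Example usage with hypothetical detector outputs.
detections = [
    {"box": (12, 40, 210, 380), "label": "woman in red coat", "score": 0.81},
    {"box": (230, 35, 420, 390), "label": "man holding umbrella", "score": 0.77},
]
print(build_prompt("the person holding an umbrella", detections))
```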
Fine-Tuning
Fine-tuning uses next-token prediction with a cross-entropy loss computed only on the prompt's completion (the selected box), not on the prompt itself. This keeps the computational burden low: training can be performed on a single high-end GPU, making the method both efficient and accessible.
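Below is a minimal sketch of this training setup, assuming a Hugging Face-style causal LLM with the PEFT library for LoRA; the base model name, LoRA hyper-parameters, and prompt text are placeholders. The key detail is that prompt tokens receive a label of -100, so the cross-entropy loss is computed only over the completion.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder base LLM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Lightweight adaptation: only low-rank adapter weights are trained.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

def build_example(prompt: str, completion: str) -> dict:
    """Tokenize prompt + completion and mask prompt tokens out of the loss."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + completion_ids + [tokenizer.eos_token_id]
    # Labels of -100 are ignored by the cross-entropy loss, so only the
    # completion (the selected box) contributes to next-token prediction.
    labels = [-100] * len(prompt_ids) + completion_ids + [tokenizer.eos_token_id]
    return {"input_ids": torch.tensor([input_ids]),
            "labels": torch.tensor([labels])}

batch = build_example("...serialized detections and query...\nAnswer: ", "Box 1")
loss = model(**batch).loss  # standard next-token cross-entropy
loss.backward()
```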
Experiments and Results
The method is evaluated on the RefCOCOg dataset for REC, demonstrating substantial improvements in precision@1 (P@1); a sketch of how P@1 is conventionally scored follows the list below:
- With the proposed adaptation, Grounding-DINO's P@1 increased from 60.09 to 78.12 (GD(T) with Llama 3) on the validation set.
- Florence-2 (Flo2) saw a performance boost from 68.28 to 77.94 using the same adaptation strategy.
- Ensembling the outputs of multiple VLMs and reasoning over them with the LLM further improved performance, showcasing the method's versatility in leveraging multiple information sources.
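For reference, P@1 in REC is conventionally scored by checking whether the single predicted box overlaps the ground-truth box with an IoU of at least 0.5; the helper below is a generic sketch of that convention, not code from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_at_1(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose top-ranked box matches the ground truth."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Example: two queries, one correct top-1 prediction -> P@1 = 0.5
preds = [(10, 10, 100, 100), (0, 0, 50, 50)]
gts = [(12, 8, 98, 105), (200, 200, 300, 300)]
print(precision_at_1(preds, gts))
```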
Analysis
The method transfers well between different VLMs, highlighting its generalization capabilities. For example, a model fine-tuned on GDrec outputs and used at inference time on Flo2 outputs still showed significant performance gains, which matters when models are private or API call costs are prohibitive. Training dynamics revealed that even a small amount of training data (~30k samples) substantially boosts performance, underscoring the method's efficiency.
Implications and Future Work
Practical implications of this research include the ability to enhance VLM performance without extensive fine-tuning, which is particularly valuable for models released under proprietary licenses. The approach also opens the door to ensemble methods that leverage multiple models' strengths without per-model hyper-parameter tuning or re-training.
Theoretically, this work suggests that the fusion of VLM and LLM capabilities can lead to significant advancements in spatial and semantic reasoning within open-vocabulary tasks. Future work could extend this method to other complex vision-language tasks such as Described Object Detection, potentially spurring further research on using LLMs for a range of other specialized vision tasks.
Overall, the proposed approach represents a significant step forward in the efficient and effective adaptation of VLMs, underscoring the potential of LLMs to augment vision-language foundation models in a widely applicable, scalable manner.