- The paper introduces a novel black-box adaptation strategy using LLMs to enhance the zero-shot performance of VLMs for referring expression comprehension.
- It employs natural language prompts that encode object detections, allowing LLMs to perform spatial and semantic reasoning to select the best matching box.
- Experiments show significant gains, with Grounding-DINO's P@1 rising from 60.09 to 78.12 and Florence-2's from 68.28 to 77.94.
Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models
In an increasingly complex landscape of vision-and-language tasks, the paper "Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models" presents a novel approach to adapting vision-language models (VLMs) with LLMs in order to enhance their zero-shot capabilities. This adaptation method is particularly significant given limitations common to VLMs, such as weak spatial and semantic reasoning, especially in open-vocabulary tasks like Referring Expression Comprehension (REC).
Introduction
Vision-language models have demonstrated strong capabilities across various tasks, including image captioning, visual question answering (VQA), and text-image retrieval. However, VLMs often require fine-tuning on specific downstream tasks to achieve competitive performance, which requires white-box access to the model's architecture and weights. Fine-tuning also demands expertise in designing objectives and optimizing hyper-parameters for each specific task. This paper introduces a method that leverages LLMs to reason over VLM outputs in a black-box manner, enabling efficient, adaptable improvements without access to the VLM's internal structure.
Methodology
The presented method adapts VLMs for the REC task by passing their outputs to an LLM that handles spatial and semantic reasoning. Given a textual query, the LLM selects the best matching box from the object detections proposed by the VLM. The LLM is fine-tuned with a lightweight, efficient strategy based on LoRA, which adapts it without making any assumptions about the VLM's architecture.
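To make the black-box data flow concrete, the sketch below illustrates the selection step under assumed interfaces: `ask_llm` stands in for whatever LLM endpoint is used (a hosted API or the LoRA-tuned model described later), and the "Box k" answer format is an illustrative convention rather than the paper's exact protocol; the prompt itself is described in the next subsection.

```python
import re

def select_box(prompt, boxes, ask_llm):
    """Send a detection prompt to an LLM and map its reply back to a box.

    `boxes` are (x1, y1, x2, y2) tuples produced by any VLM; `ask_llm` is a
    text-in/text-out callable. The VLM itself is never touched, which is what
    makes the adaptation black-box.
    """
    reply = ask_llm(prompt)
    match = re.search(r"\d+", reply)            # expect a reply such as "Box 2"
    index = int(match.group()) if match else 0  # fall back to the first box
    return boxes[min(index, len(boxes) - 1)]

# Usage with a stand-in LLM that always answers "Box 1".
boxes = [(12, 40, 210, 380), (230, 35, 420, 390)]
print(select_box("...serialized detections and query...", boxes, lambda p: "Box 1"))
```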
Prompt Construction
The LLM prompt includes all detection outputs (box coordinates, labels, and optionally confidence scores) converted into natural language, enabling the LLM to reason over them. The prompt concludes with a query asking the LLM to identify the best matching box.
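As a concrete illustration, here is a minimal sketch of how detections might be serialized into such a prompt; the exact wording, field order, and handling of confidence scores are assumptions, not the paper's verbatim template.

```python
def build_prompt(query: str, detections: list[dict], include_scores: bool = True) -> str:
    """Serialize VLM detections into a natural-language prompt for the LLM.

    Each detection is assumed to be a dict with 'box' (x1, y1, x2, y2),
    'label', and optionally 'score'. The template is illustrative only.
    """
    lines = ["You are given object detections for an image."]
    for i, det in enumerate(detections):
        x1, y1, x2, y2 = det["box"]
        line = f"Box {i}: label '{det['label']}', coordinates ({x1}, {y1}, {x2}, {y2})"
        if include_scores and "score" in det:
            line += f", confidence {det['score']:.2f}"
        lines.append(line + ".")
    lines.append(f'Query: "{query}"')
    lines.append("Which box best matches the query? Answer with the box number.")
    return "\n".join(lines)

# Example usage with hypothetical detector outputs.
detections = [
    {"box": (12, 40, 210, 380), "label": "woman in red coat", "score": 0.81},
    {"box": (230, 35, 420, 390), "label": "man holding umbrella", "score": 0.77},
]
print(build_prompt("the person holding an umbrella", detections))
```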
Fine-Tuning
Fine-tuning uses next-token prediction with a cross-entropy loss computed only on the prompt's completion (the selected box), not on the prompt itself. This keeps the computational burden low: training can be performed on a single high-end GPU, making the method both efficient and accessible.
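Below is a minimal sketch of this training setup, assuming a Hugging Face-style causal LLM with the PEFT library for LoRA; the base model name, LoRA hyper-parameters, and prompt text are placeholders. The key detail is that prompt tokens receive a label of -100, so the cross-entropy loss is computed only over the completion.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder base LLM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Lightweight adaptation: only low-rank adapter weights are trained.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

def build_example(prompt: str, completion: str) -> dict:
    """Tokenize prompt + completion and mask prompt tokens out of the loss."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + completion_ids + [tokenizer.eos_token_id]
    # Labels of -100 are ignored by the cross-entropy loss, so only the
    # completion (the selected box) contributes to next-token prediction.
    labels = [-100] * len(prompt_ids) + completion_ids + [tokenizer.eos_token_id]
    return {"input_ids": torch.tensor([input_ids]),
            "labels": torch.tensor([labels])}

batch = build_example("...serialized detections and query...\nAnswer: ", "Box 1")
loss = model(**batch).loss  # standard next-token cross-entropy
loss.backward()
```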
Experiments and Results
The method is evaluated on the RefCOCOg dataset for REC, demonstrating substantial improvements in precision@1 (P@1); a sketch of how P@1 is conventionally scored follows the list below:
- With the proposed adaptation, Grounding-DINO's P@1 increased from 60.09 to 78.12 (GD(T) with Llama 3) on the validation set.
- Florence-2 (Flo2) saw a performance boost from 68.28 to 77.94 using the same adaptation strategy.
- Ensembling the outputs of multiple VLMs and reasoning over them with the LLM further improved performance, showcasing the method's versatility in leveraging multiple information sources.
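For reference, P@1 in REC is conventionally scored by checking whether the single predicted box overlaps the ground-truth box with an IoU of at least 0.5; the helper below is a generic sketch of that convention, not code from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_at_1(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose top-ranked box matches the ground truth."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Example: two queries, one correct top-1 prediction -> P@1 = 0.5
preds = [(10, 10, 100, 100), (0, 0, 50, 50)]
gts = [(12, 8, 98, 105), (200, 200, 300, 300)]
print(precision_at_1(preds, gts))
```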
Analysis
The method transfers well between different VLMs, highlighting its generalization capabilities. For example, a model fine-tuned on GDrec outputs and used at inference time on Flo2 outputs still showed significant performance gains, which matters when models are private or API call costs are prohibitive. Training dynamics revealed that even a small amount of training data (~30k samples) substantially boosts performance, underscoring the method's efficiency.
Implications and Future Work
Practical implications of this research include the ability to enhance VLM performance without extensive fine-tuning, which is particularly valuable for models released under proprietary licenses. The approach also opens the door to ensemble methods that leverage multiple models' strengths without per-model hyper-parameter tuning or re-training.
Theoretically, this work suggests that the fusion of VLM and LLM capabilities can lead to significant advancements in spatial and semantic reasoning within open-vocabulary tasks. Future work could extend this method to other complex vision-language tasks such as Described Object Detection, potentially spurring further research on using LLMs for a range of other specialized vision tasks.
Overall, the proposed approach represents a significant step forward in the efficient and effective adaptation of VLMs, underscoring the potential of LLMs to augment vision-language foundation models in a widely applicable, scalable manner.