To Trust Or Not To Trust Your Vision-Language Model's Prediction

Published 29 May 2025 in cs.CV, cs.AI, and cs.LG | (2505.23745v1)

Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in aligning visual and textual modalities, enabling a wide range of applications in multimodal understanding and generation. While they excel in zero-shot and transfer learning scenarios, VLMs remain susceptible to misclassification, often yielding confident yet incorrect predictions. This limitation poses a significant risk in safety-critical domains, where erroneous predictions can lead to severe consequences. In this work, we introduce TrustVLM, a training-free framework designed to address the critical challenge of estimating when a VLM's predictions can be trusted. Motivated by the observed modality gap in VLMs and the insight that certain concepts are more distinctly represented in the image embedding space, we propose a novel confidence-scoring function that leverages this space to improve misclassification detection. We rigorously evaluate our approach across 17 diverse datasets, employing 4 architectures and 2 VLMs, and demonstrate state-of-the-art performance, with improvements of up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95 compared to existing baselines. By improving the reliability of the model without requiring retraining, TrustVLM paves the way for safer deployment of VLMs in real-world applications. The code will be available at https://github.com/EPFL-IMOS/TrustVLM.

Authors (5)

Summary

Vision-Language Models (VLMs) integrate visual and textual modalities and show strong capabilities in multimodal understanding and generation. Despite their strength in zero-shot and transfer learning, VLMs are prone to misclassification, often producing confident yet incorrect predictions. This paper addresses misclassification detection, particularly in cases where an incorrect VLM output still appears visually and semantically plausible. It introduces TrustVLM, a training-free framework for estimating when a VLM's prediction can be trusted, which matters most in safety-critical domains such as autonomous driving, medical diagnostics, and surveillance.

TrustVLM Framework

TrustVLM improves confidence estimation for VLM predictions without retraining. The framework is motivated by the modality gap observed in VLMs and by the insight that certain concepts are more distinctly represented in the image embedding space. It proposes a confidence-scoring function that fuses the conventional image-to-text similarity with an image-to-image similarity computed by a pre-trained vision encoder.

Key components of TrustVLM include:

Visual Prototype Generation: Class-specific visual prototypes are extracted with an auxiliary vision encoder from N-shot samples and stored.
Zero-shot Classification: The VLM predicts class labels, yielding an initial confidence score based on image-to-text similarity.
Confidence Verification: A complementary confidence score is computed from the image-to-image similarity between test images and class prototypes.
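
The three components above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are hand-written stand-ins for encoder outputs, the prototype is taken to be the normalized mean of the N-shot embeddings, the image-to-image score is computed against the prototype of the predicted class, the two scores are fused by simple addition, and the temperature value and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def build_prototypes(support_feats, support_labels, num_classes):
    """Assumption: one prototype per class = normalized mean of its N-shot embeddings."""
    means = [support_feats[support_labels == c].mean(axis=0) for c in range(num_classes)]
    return l2_normalize(np.stack(means))

def zero_shot_predict(image_feats, text_feats, temperature=0.01):
    """Image-to-text step: predicted class and its softmax probability.
    The temperature is an assumed value, not taken from the source."""
    sims = l2_normalize(image_feats) @ l2_normalize(text_feats).T
    probs = softmax(sims / temperature)
    return probs.argmax(axis=1), probs.max(axis=1)

def verification_score(aux_feats, prototypes, predicted):
    """Image-to-image step: cosine similarity to the predicted class's prototype."""
    sims = l2_normalize(aux_feats) @ prototypes.T
    return sims[np.arange(len(predicted)), predicted]

# Toy stand-ins for encoder outputs (2 classes).
text_feats = np.array([[1.0, 0.0], [0.0, 1.0]])        # VLM text embeddings
vlm_image_feats = np.array([[0.9, 0.1], [0.6, 0.5]])   # VLM image embeddings
support_feats = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
                          [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]])  # auxiliary encoder, 2-shot
support_labels = np.array([0, 0, 1, 1])
aux_image_feats = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # test images, auxiliary encoder

prototypes = build_prototypes(support_feats, support_labels, num_classes=2)
preds, i2t_conf = zero_shot_predict(vlm_image_feats, text_feats)
i2i_conf = verification_score(aux_image_feats, prototypes, preds)

# Assumption: fuse the two scores by addition; the source only says they are fused.
trust = i2t_conf + i2i_conf
```

In this toy setup both test images are predicted as class 0 with near-identical image-to-text confidence, but the second image lies close to the class 1 prototype, so its image-to-image score, and therefore its fused score, is much lower. That is the kind of disagreement the verification step is meant to surface.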

TrustVLM can also be extended into TrustVLM*, which combines the image-to-text and image-to-image similarities to improve zero-shot classification accuracy. Fine-tuning the visual prototypes further improves downstream accuracy.
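
One way to picture the combined classifier is to add the class probabilities from the two similarity sources before taking the argmax. The sketch below is an assumption about how such a combination could look, not the rule used in the paper: the similarity matrices are toy values, and `combined_predict`, `weight`, and `temperature` are hypothetical names and settings.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def combined_predict(i2t_sims, i2i_sims, weight=1.0, temperature=0.01):
    """Assumed combination: sum of per-source class probabilities, then argmax.

    i2t_sims: (num_images, num_classes) image-to-text cosine similarities.
    i2i_sims: (num_images, num_classes) image-to-prototype cosine similarities.
    """
    p_text = softmax(i2t_sims / temperature)
    p_image = softmax(i2i_sims / temperature)
    return (p_text + weight * p_image).argmax(axis=1)

# Toy similarities for 2 images and 2 classes (illustrative values only).
i2t_sims = np.array([[0.30, 0.28], [0.25, 0.31]])
i2i_sims = np.array([[0.55, 0.60], [0.40, 0.80]])

zero_shot_preds = i2t_sims.argmax(axis=1)              # image-to-text only
combined_preds = combined_predict(i2t_sims, i2i_sims)  # both sources
```

For the first toy image the text similarities narrowly favor class 0 while the prototype similarities favor class 1 more strongly, so the combined prediction switches to class 1. The second image is class 1 under both sources and is unchanged.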

Experimental Validation

TrustVLM was evaluated on 17 datasets with 4 architectures and 2 VLMs. Compared with existing baselines, it improved the Area Under the Risk-Coverage curve (AURC) by up to 51.87%, the Area Under the Receiver Operating Characteristic curve (AUROC) by up to 9.14%, and the False Positive Rate at 95% True Positive Rate (FPR95) by up to 32.42%. The framework also improved zero-shot classification accuracy, with further gains from fine-tuning the visual prototypes.
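
For readers unfamiliar with the three metrics, the sketch below computes them from a confidence score and a correctness flag per prediction. It uses common definitions, with a correct prediction as the positive class, and toy inputs. The paper's exact evaluation protocol may differ, and the function names are hypothetical.

```python
import numpy as np

def aurc(confidence, correct):
    """Area under the risk-coverage curve (lower is better).
    Risk at coverage k/n is the error rate among the k most confident predictions."""
    order = np.argsort(-confidence, kind="stable")
    errors = 1.0 - correct[order].astype(float)
    risks = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    return risks.mean()

def auroc(confidence, correct):
    """Probability that a correct prediction gets a higher confidence than an
    incorrect one (ties count one half). Higher is better."""
    pos = confidence[correct]
    neg = confidence[~correct]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def fpr_at_95_tpr(confidence, correct):
    """Fraction of incorrect predictions accepted at the threshold that keeps
    at least 95% of the correct predictions. Lower is better."""
    pos_desc = np.sort(confidence[correct])[::-1]
    k = int(np.ceil(0.95 * len(pos_desc)))
    threshold = pos_desc[k - 1]
    return (confidence[~correct] >= threshold).mean()

# Toy data: six predictions with their confidence and correctness.
confidence = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
correct = np.array([True, True, False, True, False, False])

aurc_value = aurc(confidence, correct)
auroc_value = auroc(confidence, correct)
fpr95_value = fpr_at_95_tpr(confidence, correct)
```

On this toy input, AUROC is 8/9 because eight of the nine correct/incorrect pairs are ranked in the right order, and FPR95 is 1/3 because one of the three incorrect predictions sits above the threshold needed to retain all three correct ones.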

Implications and Future Prospects

TrustVLM supports more reliable deployment of VLMs in critical applications without model retraining. Its main contribution to misclassification detection is the use of the image embedding space, through class-specific visual prototypes, alongside the usual image-to-text similarity.

Looking ahead, the ideas behind TrustVLM could be adapted to other multimodal tasks, such as visual question answering and image retrieval. Reducing its dependence on clean, class-specific prototypes could also broaden the range of settings in which it applies.

In summary, TrustVLM offers a concrete, training-free way to improve the reliability of predictions from Vision-Language Models and supports safer use of AI in complex, real-world environments.
