- The paper introduces Unsolvable Problem Detection (UPD), a task for evaluating whether models abstain from answering when faced with unsolvable queries.
- It details the creation of UPD benchmarks in three settings (AAD, IASD, IVQD), derived from MMBench and totaling 2,095 questions after manual checks and data cleaning.
- Empirical results show that instruction tuning markedly improves UPD performance, especially for larger models such as LLaVA-Next-34B, whereas prompt engineering alone falls short.
Evaluating Trustworthiness of Vision LLMs through Unsolvable Problem Detection
Introduction to UPD
Recent studies have drawn attention to the trustworthiness and reliability of Vision LLMs (VLMs), owing to their propensity for producing incorrect or misleading information, commonly known as "hallucination." This paper introduces Unsolvable Problem Detection (UPD), a task designed to test a model's capacity to appropriately withhold an answer when presented with an unsolvable problem. UPD is explored through three distinct settings:
- Absent Answer Detection (AAD)
- Incompatible Answer Set Detection (IASD)
- Incompatible Visual Question Detection (IVQD)
These settings evaluate how well models handle queries that are unanswerable because of discrepancies or irrelevance among the provided image, question, and answer options.
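In practice, scoring a UPD item comes down to checking whether the model answers a solvable question correctly or withholds an answer on an unsolvable one. The following is a minimal sketch of such a check, assuming a simple keyword-based notion of abstention; the refusal phrases and the `upd_correct` helper are illustrative assumptions, not the paper's official scorer.

```python
# Minimal sketch of UPD-style scoring (illustrative, not the paper's scorer).
import re

# Hypothetical refusal phrases that count as "withholding an answer".
REFUSAL_PATTERNS = [
    r"none of the (above|options)",
    r"cannot be (answered|determined)",
    r"no (correct|appropriate) (option|answer)",
]

def is_abstention(response: str) -> bool:
    """Return True if the reply indicates the model withheld an answer."""
    text = response.lower()
    return any(re.search(pattern, text) for pattern in REFUSAL_PATTERNS)

def upd_correct(response: str, solvable: bool, ground_truth: str | None) -> bool:
    """Answer correctly if the question is solvable, abstain otherwise."""
    if solvable:
        return ground_truth is not None and ground_truth.lower() in response.lower()
    return is_abstention(response)
```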
Creating UPD Benchmarks
The UPD benchmarks are built on the existing MMBench, modified to include challenges that specifically probe the models' capabilities in each UPD setting. For AAD, the correct option is removed from the answer set so that none of the remaining choices is valid; for IASD, the answer set is replaced with options entirely irrelevant to the question; and for IVQD, the image and question are intentionally mismatched. After manual checks and data cleaning, the research establishes three distinct UPD benchmarks comprising a total of 2,095 questions.
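The automated part of this construction can be pictured as a few simple transformations of an MMBench-style item. The sketch below is an assumption about how such variants could be generated; the field names and the `unrelated_pool` / `unrelated_images` inputs are hypothetical, and the actual benchmark additionally relies on the manual checks and cleaning described above.

```python
# Illustrative derivation of the three UPD variants from an MMBench-style
# multiple-choice item (hypothetical field names: "options", "answer", "image").
import copy
import random

def make_aad(item: dict) -> dict:
    """AAD: drop the correct option so no listed answer is valid."""
    variant = copy.deepcopy(item)
    variant["options"] = [o for o in variant["options"] if o != variant["answer"]]
    variant["answer"] = None  # the expected behavior is now to abstain
    return variant

def make_iasd(item: dict, unrelated_pool: list[str]) -> dict:
    """IASD: replace the answer set with options irrelevant to the question."""
    variant = copy.deepcopy(item)
    variant["options"] = random.sample(unrelated_pool, k=len(variant["options"]))
    variant["answer"] = None
    return variant

def make_ivqd(item: dict, unrelated_images: list[str]) -> dict:
    """IVQD: pair the question with an image it does not match."""
    variant = copy.deepcopy(item)
    variant["image"] = random.choice(unrelated_images)
    variant["answer"] = None
    return variant
```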
Evaluation of VLMs on UPD
Several state-of-the-art VLMs, including LLaVA-1.5-13B, CogVLM-17B, and the more recent LLaVA-Next-34B, were evaluated on these benchmarks. The findings suggest that although models such as GPT-4V and LLaVA-Next-34B handle UPD problems better than their counterparts, their accuracy still varies considerably across UPD settings and ability dimensions, indicating substantial room for improvement.
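To compare models across settings, accuracy can be tracked separately on standard questions and on their unsolvable UPD counterparts, with a stricter pair-level score that requires both to be handled correctly. The sketch below shows one way to compute such per-setting scores; the record fields and the exact aggregation are assumptions for illustration and may differ from the paper's reported metrics.

```python
# Sketch of per-setting scoring over paired records, each describing one
# standard/UPD question pair (assumed fields: "setting", "standard_correct",
# "upd_correct").
from collections import defaultdict

def per_setting_scores(records: list[dict]) -> dict[str, dict[str, float]]:
    buckets = defaultdict(list)
    for record in records:
        buckets[record["setting"]].append(record)   # e.g. "AAD", "IASD", "IVQD"
    scores = {}
    for setting, items in buckets.items():
        n = len(items)
        scores[setting] = {
            "standard_acc": sum(r["standard_correct"] for r in items) / n,
            "upd_acc": sum(r["upd_correct"] for r in items) / n,
            # Pair-level score: both the standard answer and the abstention
            # on the UPD variant must be correct.
            "dual_acc": sum(r["standard_correct"] and r["upd_correct"]
                            for r in items) / n,
        }
    return scores
```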
Addressing UPD through Prompt Engineering and Instruction Tuning
The study explores prompt engineering strategies, such as adding an extra answer option or an explicit instruction to withhold an answer, as potential remedies for UPD. Their effectiveness varies across models and in many cases falls short of a comprehensive solution. Instruction tuning therefore emerges as the more promising approach, yielding marked improvements in UPD performance, especially with larger models like LLaVA-Next-34B.
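The two prompt strategies can be illustrated with simple templates: one appends an escape option to the choice list, the other prepends an instruction permitting the model to decline. The wording below is an assumption; the exact prompts used in the paper's experiments may differ.

```python
# Illustrative prompt templates for the two strategies (assumed wording).

def with_additional_option(question: str, options: list[str]) -> str:
    """Append an escape option such as "None of the above" to the choices."""
    letters = "ABCDEFGH"
    extended = options + ["None of the above"]
    lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(extended)]
    return f"{question}\n" + "\n".join(lines)

def with_additional_instruction(question: str, options: list[str]) -> str:
    """Prepend an instruction telling the model it may decline to answer."""
    letters = "ABCDEFGH"
    lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    instruction = ("If none of the options match the image and question, "
                   "reply that the question cannot be answered.")
    return f"{instruction}\n{question}\n" + "\n".join(lines)
```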
Future Directions for UPD Research
This research emphasizes that VLMs must not only provide accurate answers when possible but also recognize when an answer should be withheld. The exploration of UPD reveals significant challenges and opens avenues for future research toward more trustworthy and reliable VLMs. Areas of interest include expanding UPD to cover expert-level questions, exploring chain-of-thought reasoning as a potential solution, and developing model-agnostic, post-hoc methods for unsolvable problem detection.
Contribution and Impact
This paper's introduction of Unsolvable Problem Detection as a criterion for evaluating VLM trustworthiness, together with comprehensive UPD benchmarks and an analysis of VLM performance across the UPD settings, marks a significant step for the field. By highlighting where current VLMs fail on unsolvable problems and suggesting viable approaches for improvement, the research contributes to the ongoing development of more reliable and trustworthy VLM systems.