
Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning (2402.15610v2)

Published 23 Feb 2024 in cs.CL

Abstract: Selective prediction minimizes incorrect predictions from vision-language models (VLMs) by allowing them to abstain from answering when uncertain. However, when deploying a vision-language system with low tolerance for inaccurate predictions, selective prediction may be over-cautious and abstain too frequently, even on many correct predictions. We introduce ReCoVERR, an inference-time algorithm to reduce the over-abstention of a selective vision-language system without increasing the error rate of the system's predictions. When the VLM makes a low-confidence prediction, instead of abstaining, ReCoVERR tries to find relevant clues in the image that provide additional evidence for the prediction. ReCoVERR uses an LLM to pose related questions to the VLM and collects high-confidence evidence; if enough evidence confirms the prediction, the system answers instead of abstaining. ReCoVERR enables three VLMs (BLIP2, InstructBLIP, and LLaVA-1.5) to answer up to 20% more questions on the VQAv2 and A-OKVQA tasks without decreasing system accuracy, thus improving overall system reliability. Our code is available at https://github.com/tejas1995/ReCoVERR.


Summary

  • The paper introduces ReCoVERR, which reduces over-abstention by gathering additional reliable visual evidence to support low-confidence predictions.
  • Methodologically, it uses an auxiliary LLM to iteratively pose questions to the VLM and collect high-confidence visual evidence for a tentative answer, evaluated with VLMs such as BLIP2, InstructBLIP, and LLaVA-1.5.
  • The approach enables up to 20% more questions to be answered on the VQAv2 and A-OKVQA tasks without sacrificing accuracy, a significant practical improvement.

Enhancing Multimodal Reasoning through Evidential Support with ReCoVERR

Introduction to ReCoVERR

In multimodal reasoning with vision-language models (VLMs), the ability to balance prediction accuracy and confidence is crucial. The balance is especially hard for selective prediction, where the model must decide whether to answer a query at all, based on its confidence, in order to minimize incorrect answers. ReCoVERR (Reason by Collecting Visual Evidences that are Reliable and Relevant) offers an inference-time remedy for the over-cautious abstention of selective VLM systems, aiming to improve both their reliability and their utility.

The Challenge with Selective Prediction

Selective prediction lets a VLM abstain from answering when it is uncertain, but under stringent accuracy requirements such systems tend to abstain far too often, including on questions the model would have answered correctly. This over-abstention erodes the system's practical usefulness and motivates a method that reduces abstention without increasing the error rate.
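
To make the abstention mechanism concrete, the following is a minimal sketch of threshold-based selective prediction. The interface vlm.answer_with_confidence and the specific threshold value are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SelectiveAnswer:
    answer: Optional[str]   # None means the system abstained
    confidence: float

def selective_predict(vlm, image, question, threshold: float = 0.9) -> SelectiveAnswer:
    """Answer only when the VLM's confidence clears the threshold; otherwise abstain.

    `vlm.answer_with_confidence` is a hypothetical interface returning the
    predicted answer string and a confidence score in [0, 1].
    """
    answer, confidence = vlm.answer_with_confidence(image, question)
    if confidence >= threshold:
        return SelectiveAnswer(answer=answer, confidence=confidence)
    return SelectiveAnswer(answer=None, confidence=confidence)  # abstain
```

Raising the threshold lowers the risk of wrong answers but shrinks coverage; ReCoVERR targets exactly the low-confidence cases that fall below the threshold.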

Introducing ReCoVERR

ReCoVERR is an inference-time algorithm designed to curtail excessive abstention in a selective prediction system without lowering the accuracy of the answers it does give. When the VLM produces a low-confidence prediction, ReCoVERR solicits additional visual clues from the image: an auxiliary LLM generates pertinent follow-up questions for the VLM, and the system accumulates only reliable and relevant evidence that supports the initial prediction. Applied to BLIP2, InstructBLIP, and LLaVA-1.5 on the VQAv2 and A-OKVQA tasks, ReCoVERR allows up to 20% more questions to be answered without compromising accuracy, a notable improvement in system reliability.
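
The core loop can be sketched as follows. This is a simplified reconstruction of the description above; the interfaces llm.generate_probe_questions and llm.evidence_supports_answer, the thresholds, and the stopping criteria are assumptions made for illustration, not the paper's exact design.

```python
def recoverr_predict(vlm, llm, image, question,
                     answer_threshold: float = 0.9,
                     evidence_threshold: float = 0.8,
                     max_rounds: int = 3,
                     min_evidence: int = 2):
    """Evidence-gathering fallback for low-confidence VLM answers (illustrative sketch).

    `vlm.answer_with_confidence`, `llm.generate_probe_questions`, and
    `llm.evidence_supports_answer` are hypothetical interfaces standing in for
    the VLM querying, question-generation, and evidence-checking steps.
    Returns the answer string, or None to signal abstention.
    """
    answer, confidence = vlm.answer_with_confidence(image, question)
    if confidence >= answer_threshold:
        return answer                                  # confident enough to answer directly

    evidence = []                                      # list of (probe_question, reliable_fact)
    for _ in range(max_rounds):
        # Ask the LLM for follow-up questions about the image that are
        # relevant to verifying the tentative answer.
        probes = llm.generate_probe_questions(question, answer, evidence)
        for probe in probes:
            fact, fact_conf = vlm.answer_with_confidence(image, probe)
            if fact_conf >= evidence_threshold:        # keep only reliable clues
                evidence.append((probe, fact))
        # If the reliable clues collectively confirm the tentative answer,
        # commit to it instead of abstaining.
        if len(evidence) >= min_evidence and llm.evidence_supports_answer(
                question, answer, evidence):
            return answer
    return None                                        # still unsupported: abstain
```

The key design choice is asymmetric trust: the tentative answer may be low-confidence, but every clue allowed to support it must individually clear a high confidence bar.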

Theoretical and Practical Implications

ReCoVERR introduces two major advancements:

  • Enhanced Model Calibration: It capitalizes on well-calibrated confidence estimates from VLMs, enabling the selection of genuinely reliable visual evidence (a standard calibration recipe is sketched after this list).
  • Iterative Evidence Collection: Through iterative querying for evidence, ReCoVERR methodically bolsters an initial low-confidence prediction with a series of high-confidence visual supports, enriching the decision-making process.
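
Reliable evidence selection presupposes calibrated confidences. The sketch below shows one standard post-hoc recipe, temperature scaling fit on a held-out set; it is included only to illustrate what "well-calibrated" means in practice and is not necessarily the calibration procedure used in the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fit a single temperature on held-out (logits, labels) by minimizing NLL.

    logits: shape (n_examples, n_classes); labels: integer class ids, shape (n_examples,).
    """
    def nll(temperature: float) -> float:
        scaled = logits / temperature
        scaled = scaled - scaled.max(axis=1, keepdims=True)      # stabilize log-softmax
        log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return float(result.x)

def calibrated_confidence(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Max softmax probability after temperature scaling, used as the confidence score."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    probs = np.exp(scaled) / np.exp(scaled).sum(axis=1, keepdims=True)
    return probs.max(axis=1)
```

In a selective system, the calibrated max-probability then feeds directly into the abstention and evidence thresholds shown earlier.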

Future Horizons in AI

Building more reliable multimodal reasoning systems through methods like ReCoVERR points toward a future in which AI can engage in more nuanced and complex interactions with the world. The method is a significant step toward handling the selective prediction challenge, and it also sets the stage for AI systems capable of dynamic evidence gathering and reasoning. The methodological scaffolding established by ReCoVERR could inspire further work on augmenting AI decision-making, especially in scenarios demanding high reliability under uncertainty.

Conclusion

ReCoVERR judiciously moderates the tendency toward over-abstention in selective prediction settings, showing that balanced caution is achievable without diminishing a system's practical applicability. By enabling vision-language models to substantiate their predictions with concrete visual evidence, ReCoVERR not only increases confidence in the outputs of these multimodal systems but also moves the field closer to AI that interprets and interacts with the real world in a more informed and accurate way.