
Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning (2402.15610v2)

Published 23 Feb 2024 in cs.CL

Abstract: Selective prediction minimizes incorrect predictions from vision-language models (VLMs) by allowing them to abstain from answering when uncertain. However, when deploying a vision-language system with low tolerance for inaccurate predictions, selective prediction may be over-cautious and abstain too frequently, even on many correct predictions. We introduce ReCoVERR, an inference-time algorithm to reduce the over-abstention of a selective vision-language system without increasing the error rate of the system's predictions. When the VLM makes a low-confidence prediction, instead of abstaining, ReCoVERR tries to find relevant clues in the image that provide additional evidence for the prediction. ReCoVERR uses an LLM to pose related questions to the VLM and collects high-confidence evidence; if enough evidence confirms the prediction, the system answers instead of abstaining. ReCoVERR enables three VLMs (BLIP2, InstructBLIP, and LLaVA-1.5) to answer up to 20% more questions on the VQAv2 and A-OKVQA tasks without decreasing system accuracy, thus improving overall system reliability. Our code is available at https://github.com/tejas1995/ReCoVERR.


Summary

  • The paper introduces ReCoVERR, which reduces over-abstention by gathering additional reliable visual evidence to support low-confidence predictions.
  • Methodologically, it uses an auxiliary LLM to iteratively pose questions to the VLM and collect high-confidence visual evidence for a tentative answer, evaluated with VLMs such as BLIP2, InstructBLIP, and LLaVA-1.5.
  • The approach enables up to 20% more questions to be answered on the VQAv2 and A-OKVQA tasks without sacrificing accuracy, a significant practical improvement.

Enhancing Multimodal Reasoning through Evidential Support with ReCoVERR

Introduction to ReCoVERR

In multimodal reasoning with vision-language models (VLMs), the ability to balance prediction accuracy and confidence is crucial. The balance is especially hard for selective prediction, where the model must decide whether to answer a query at all, based on its confidence, in order to minimize incorrect answers. ReCoVERR (Reason by Collecting Visual Evidences that are Reliable and Relevant) offers an inference-time remedy for the over-cautious abstention of selective VLM systems, aiming to improve both their reliability and their utility.

The Challenge with Selective Prediction

Selective prediction lets a VLM abstain from answering when it is uncertain, but under stringent accuracy requirements such systems tend to abstain far too often, including on questions the model would have answered correctly. This over-abstention erodes the system's practical usefulness and motivates a method that reduces abstention without increasing the error rate.
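
To make the abstention mechanism concrete, the following is a minimal sketch of threshold-based selective prediction. The interface vlm.answer_with_confidence and the specific threshold value are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SelectiveAnswer:
    answer: Optional[str]   # None means the system abstained
    confidence: float

def selective_predict(vlm, image, question, threshold: float = 0.9) -> SelectiveAnswer:
    """Answer only when the VLM's confidence clears the threshold; otherwise abstain.

    `vlm.answer_with_confidence` is a hypothetical interface returning the
    predicted answer string and a confidence score in [0, 1].
    """
    answer, confidence = vlm.answer_with_confidence(image, question)
    if confidence >= threshold:
        return SelectiveAnswer(answer=answer, confidence=confidence)
    return SelectiveAnswer(answer=None, confidence=confidence)  # abstain
```

Raising the threshold lowers the risk of wrong answers but shrinks coverage; ReCoVERR targets exactly the low-confidence cases that fall below the threshold.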

Introducing ReCoVERR

ReCoVERR is an inference-time algorithm designed to curtail excessive abstention in a selective prediction system without lowering the accuracy of the answers it does give. When the VLM produces a low-confidence prediction, ReCoVERR solicits additional visual clues from the image: an auxiliary LLM generates pertinent follow-up questions for the VLM, and the system accumulates only reliable and relevant evidence that supports the initial prediction. Applied to BLIP2, InstructBLIP, and LLaVA-1.5 on the VQAv2 and A-OKVQA tasks, ReCoVERR allows up to 20% more questions to be answered without compromising accuracy, a notable improvement in system reliability.
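
The core loop can be sketched as follows. This is a simplified reconstruction of the description above; the interfaces llm.generate_probe_questions and llm.evidence_supports_answer, the thresholds, and the stopping criteria are assumptions made for illustration, not the paper's exact design.

```python
def recoverr_predict(vlm, llm, image, question,
                     answer_threshold: float = 0.9,
                     evidence_threshold: float = 0.8,
                     max_rounds: int = 3,
                     min_evidence: int = 2):
    """Evidence-gathering fallback for low-confidence VLM answers (illustrative sketch).

    `vlm.answer_with_confidence`, `llm.generate_probe_questions`, and
    `llm.evidence_supports_answer` are hypothetical interfaces standing in for
    the VLM querying, question-generation, and evidence-checking steps.
    Returns the answer string, or None to signal abstention.
    """
    answer, confidence = vlm.answer_with_confidence(image, question)
    if confidence >= answer_threshold:
        return answer                                  # confident enough to answer directly

    evidence = []                                      # list of (probe_question, reliable_fact)
    for _ in range(max_rounds):
        # Ask the LLM for follow-up questions about the image that are
        # relevant to verifying the tentative answer.
        probes = llm.generate_probe_questions(question, answer, evidence)
        for probe in probes:
            fact, fact_conf = vlm.answer_with_confidence(image, probe)
            if fact_conf >= evidence_threshold:        # keep only reliable clues
                evidence.append((probe, fact))
        # If the reliable clues collectively confirm the tentative answer,
        # commit to it instead of abstaining.
        if len(evidence) >= min_evidence and llm.evidence_supports_answer(
                question, answer, evidence):
            return answer
    return None                                        # still unsupported: abstain
```

The key design choice is asymmetric trust: the tentative answer may be low-confidence, but every clue allowed to support it must individually clear a high confidence bar.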

Theoretical and Practical Implications

ReCoVERR introduces two major advancements:

  • Enhanced Model Calibration: It capitalizes on well-calibrated confidence estimates from VLMs, enabling the selection of genuinely reliable visual evidence (a standard calibration recipe is sketched after this list).
  • Iterative Evidence Collection: Through iterative querying for evidence, ReCoVERR methodically bolsters an initial low-confidence prediction with a series of high-confidence visual supports, enriching the decision-making process.
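
Reliable evidence selection presupposes calibrated confidences. The sketch below shows one standard post-hoc recipe, temperature scaling fit on a held-out set; it is included only to illustrate what "well-calibrated" means in practice and is not necessarily the calibration procedure used in the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fit a single temperature on held-out (logits, labels) by minimizing NLL.

    logits: shape (n_examples, n_classes); labels: integer class ids, shape (n_examples,).
    """
    def nll(temperature: float) -> float:
        scaled = logits / temperature
        scaled = scaled - scaled.max(axis=1, keepdims=True)      # stabilize log-softmax
        log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return float(result.x)

def calibrated_confidence(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Max softmax probability after temperature scaling, used as the confidence score."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    probs = np.exp(scaled) / np.exp(scaled).sum(axis=1, keepdims=True)
    return probs.max(axis=1)
```

In a selective system, the calibrated max-probability then feeds directly into the abstention and evidence thresholds shown earlier.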

Future Horizons in AI

Building more reliable multimodal reasoning systems through methods like ReCoVERR points toward a future in which AI can engage in more nuanced and complex interactions with the world. The method is a significant step toward handling the selective prediction challenge, and it also sets the stage for AI systems capable of dynamic evidence gathering and reasoning. The methodological scaffolding established by ReCoVERR could inspire further work on augmenting AI decision-making, especially in scenarios demanding high reliability under uncertainty.

Conclusion

ReCoVERR judiciously moderates the tendency toward over-abstention in selective prediction settings, showing that balanced caution is achievable without diminishing a system's practical applicability. By enabling vision-language models to substantiate their predictions with concrete visual evidence, ReCoVERR not only increases confidence in the outputs of these multimodal systems but also moves the field closer to AI that interprets and interacts with the real world in a more informed and accurate way.