Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning (2402.15610v2)
Abstract: Selective prediction minimizes incorrect predictions from vision-language models (VLMs) by allowing them to abstain from answering when uncertain. However, when deploying a vision-language system with low tolerance for inaccurate predictions, selective prediction may be over-cautious and abstain too frequently, even on many correct predictions. We introduce ReCoVERR, an inference-time algorithm that reduces the over-abstention of a selective vision-language system without increasing the error rate of the system's predictions. When the VLM makes a low-confidence prediction, instead of abstaining, ReCoVERR tries to find relevant clues in the image that provide additional evidence for the prediction. ReCoVERR uses an LLM to pose related questions to the VLM and collects high-confidence pieces of evidence; if enough evidence confirms the prediction, the system answers instead of abstaining. ReCoVERR enables three VLMs (BLIP2, InstructBLIP, and LLaVA-1.5) to answer up to 20% more questions on the VQAv2 and A-OKVQA tasks without decreasing system accuracy, thus improving overall system reliability. Our code is available at https://github.com/tejas1995/ReCoVERR.
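The decision loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names (`vlm_answer`, `llm_followups`, `llm_entails`), the thresholds, and the simple evidence-counting rule are all assumptions made for illustration.

```python
# Hedged sketch of the ReCoVERR-style loop: if the VLM's confidence in its
# answer is low, ask LLM-generated follow-up questions about the image,
# keep only high-confidence answers as evidence, and answer (rather than
# abstain) only if enough evidence supports the original prediction.
# All names and thresholds below are illustrative assumptions.

from typing import Callable, List, Tuple


def recoverr_sketch(
    question: str,
    vlm_answer: Callable[[str], Tuple[str, float]],  # (answer, confidence)
    llm_followups: Callable[[str, str], List[str]],  # related questions to ask
    llm_entails: Callable[[str, str], bool],         # does evidence support answer?
    answer_threshold: float = 0.8,
    evidence_threshold: float = 0.9,
    min_evidence: int = 2,
) -> str:
    answer, conf = vlm_answer(question)
    if conf >= answer_threshold:
        return answer  # confident enough: answer directly

    # Low confidence: gather corroborating clues instead of abstaining outright.
    support = 0
    for followup in llm_followups(question, answer):
        evidence, ev_conf = vlm_answer(followup)
        if ev_conf >= evidence_threshold and llm_entails(
            f"{followup} {evidence}", answer
        ):
            support += 1

    if support >= min_evidence:
        return answer  # enough high-confidence evidence confirms the prediction
    return "[ABSTAIN]"  # otherwise fall back to abstaining
```

In practice the callables would wrap a real VLM (e.g. BLIP2) and an LLM; here they are left as parameters so the control flow itself is the focus.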