
Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering (2404.10193v1)

Published 16 Apr 2024 in cs.CV

Abstract: The goal of selective prediction is to allow a model to abstain when it may not be able to deliver a reliable prediction, which is important in safety-critical contexts. Existing approaches to selective prediction typically require access to the internals of a model, require retraining a model, or study only unimodal models. However, the most powerful models (e.g., GPT-4) are typically available only as black boxes with inaccessible internals, are not retrainable by end-users, and are frequently used for multimodal tasks. We study the possibility of selective prediction for vision-language models in a realistic, black-box setting. We propose using the principle of neighborhood consistency to identify unreliable responses from a black-box vision-language model in question answering tasks. We hypothesize that, given only a visual question and a model response, the consistency of the model's responses over the neighborhood of the visual question will indicate reliability. Directly sampling neighbors in feature space is impossible in a black-box setting; instead, we show that a smaller proxy model can be used to approximately sample from the neighborhood. We find that neighborhood consistency can identify model responses to visual questions that are likely unreliable, even in adversarial settings or settings that are out-of-distribution to the proxy model.
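To make the idea concrete, here is a minimal sketch of how neighborhood consistency could drive abstention. The callables `vlm_answer_fn` and `proxy_sample_fn` are hypothetical placeholders, not an API from the paper; the paper samples neighbors in the proxy model's feature space, which this sketch abstracts into a single neighbor-sampling function.

```python
def consistency_score(vlm_answer_fn, proxy_sample_fn, image, question,
                      original_answer, n_neighbors=5):
    """Score the reliability of a black-box VLM answer via neighborhood
    consistency.

    vlm_answer_fn(image, question) -> str:
        queries the black-box vision-language model (hypothetical callable).
    proxy_sample_fn(image, question, k) -> list[str]:
        uses a smaller proxy model to approximately sample k neighboring
        questions (hypothetical callable; the paper does this sampling in
        the proxy's feature space).
    """
    neighbors = proxy_sample_fn(image, question, n_neighbors)
    answers = [vlm_answer_fn(image, q) for q in neighbors]

    def norm(a):
        return a.strip().lower()

    # Fraction of neighboring questions whose answer agrees with the
    # original answer; low agreement suggests the response is unreliable.
    return sum(norm(a) == norm(original_answer) for a in answers) / len(answers)


def selective_answer(vlm_answer_fn, proxy_sample_fn, image, question,
                     threshold=0.6):
    """Answer only when the consistency score clears a threshold;
    otherwise abstain (return None). The threshold is illustrative."""
    answer = vlm_answer_fn(image, question)
    score = consistency_score(vlm_answer_fn, proxy_sample_fn, image,
                              question, answer)
    return answer if score >= threshold else None
```

Under a scheme like this, the risk-coverage trade-off is controlled by the threshold: raising it makes the system abstain more often but answer more reliably when it does respond.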
