Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models (2407.16221v2)

Published 23 Jul 2024 in cs.CL

Abstract: Abstention Ability (AA) is a critical aspect of LLM reliability, referring to an LLM's capability to withhold responses when uncertain or lacking a definitive answer, without compromising performance. Although previous studies have attempted to improve AA, they lack a standardised evaluation method and remain unsuitable for black-box models where token prediction probabilities are inaccessible. This makes comparative analysis challenging, especially for state-of-the-art closed-source commercial LLMs. This paper bridges this gap by introducing a black-box evaluation approach and a new dataset, Abstain-QA, crafted to rigorously assess AA across varied question types (answerable and unanswerable), domains (well-represented and under-represented), and task types (fact-centric and reasoning). We also propose a new confusion matrix, the "Answerable-Unanswerable Confusion Matrix" (AUCM), which serves as the basis for evaluating AA by offering a structured and precise approach for assessment. Finally, we explore the impact of three prompting strategies (Strict Prompting, Verbal Confidence Thresholding, and Chain-of-Thought (CoT)) on improving AA. Our results indicate that even powerful models like GPT-4 and Mixtral 8x22b encounter difficulties with abstention; however, strategic approaches such as Strict Prompting and CoT can enhance this capability.
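The AUCM is described only at a high level above. As a rough illustration of the idea, the sketch below tallies a 2x2 matrix of ground truth (answerable / unanswerable) against model behaviour (answered / abstained) and derives two abstention metrics from it. The cell layout, the `AUCM` class, the metric names, and the Strict Prompting wording are all assumptions made for illustration, not the paper's exact definitions.

```python
# Minimal sketch of an AUCM-style abstention evaluation in a black-box
# setting. Everything here is illustrative; the paper's actual matrix
# layout and metric definitions may differ.

from dataclasses import dataclass

# Illustrative Strict Prompting suffix (hypothetical wording, not the
# paper's prompt).
STRICT_SUFFIX = "If you are not sure of the answer, reply exactly 'I don't know'."

@dataclass
class AUCM:
    answerable_answered: int = 0     # attempted an answerable question
    answerable_abstained: int = 0    # over-abstention on an answerable question
    unanswerable_answered: int = 0   # answered the unanswerable (hallucination risk)
    unanswerable_abstained: int = 0  # correct abstention

    def update(self, is_answerable: bool, abstained: bool) -> None:
        """Record one question's outcome in the matrix."""
        if is_answerable:
            if abstained:
                self.answerable_abstained += 1
            else:
                self.answerable_answered += 1
        else:
            if abstained:
                self.unanswerable_abstained += 1
            else:
                self.unanswerable_answered += 1

    def abstention_recall(self) -> float:
        """Share of unanswerable questions on which the model abstained."""
        n = self.unanswerable_answered + self.unanswerable_abstained
        return self.unanswerable_abstained / n if n else 0.0

    def over_abstention_rate(self) -> float:
        """Share of answerable questions on which the model wrongly abstained."""
        n = self.answerable_answered + self.answerable_abstained
        return self.answerable_abstained / n if n else 0.0

# Example: one correct abstention and one attempted answerable question.
m = AUCM()
m.update(is_answerable=False, abstained=True)
m.update(is_answerable=True, abstained=False)
print(m.abstention_recall(), m.over_abstention_rate())  # 1.0 0.0
```

Because the tally needs only the final text of each response (classified as an answer or an abstention), this style of scoring works for closed-source models where token probabilities are unavailable, which is the setting the paper targets.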

Authors (4)
  1. Nishanth Madhusudhan (2 papers)
  2. Sathwik Tejaswi Madhusudhan (10 papers)
  3. Vikas Yadav (38 papers)
  4. Masoud Hashemi (12 papers)
Citations (3)
