Don't Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration (2402.00367v2)

Published 1 Feb 2024 in cs.CL

Abstract: Despite efforts to expand the knowledge of LLMs, knowledge gaps -- missing or outdated information in LLMs -- might always persist given the evolving nature of knowledge. In this work, we study approaches to identify LLM knowledge gaps and abstain from answering questions when knowledge gaps are present. We first adapt existing approaches to model calibration or adaptation through fine-tuning/prompting and analyze their ability to abstain from generating low-confidence outputs. Motivated by their failures in self-reflection and over-reliance on held-out sets, we propose two novel approaches that are based on model collaboration, i.e., LLMs probing other LLMs for knowledge gaps, either cooperatively or competitively. Extensive experiments with three LLMs on four QA tasks featuring diverse knowledge domains demonstrate that both cooperative and competitive approaches to unveiling LLM knowledge gaps achieve up to 19.3% improvements on abstain accuracy against the strongest baseline. Further analysis reveals that our proposed mechanisms could help identify failure cases in retrieval augmentation and pinpoint knowledge gaps in multi-hop reasoning.

Identifying LLM Knowledge Gaps via Multi-LLM Collaboration

The paper "Don't Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration" explores methods to identify knowledge gaps in LLMs and proposes abstaining from answering questions when knowledge limitations are detected.

Overview

The authors address a critical issue in the reliability of LLMs: their tendency to hallucinate when confronted with questions for which they lack sufficient knowledge. The ability to discern when to abstain from answering, rather than generating potentially erroneous responses, is posited as a capability that could improve both trust in and the utility of LLMs on knowledge-intensive tasks.

Approaches to LLM Abstention

The paper categorizes existing approaches to LLM abstention into calibration-based, training-based, prompting-based, and consistency-based methods, and proposes two novel multi-LLM collaboration strategies, termed Cooperate and Compete.

  • Calibration-Based Methods: These include using token probabilities and temperature scaling to determine abstention thresholds based on model confidence.
  • Training-Based Techniques: These involve linear probing of hidden layers, training external verifiers, or instruction tuning the model to integrate an abstain option.
  • Prompting-Based Strategies: Methods such as self-reflection prompts and requests for more information allow LLMs to judiciously decide when not to provide an answer.
  • Consistency Methods: Options such as adding a none-of-the-above (NOTA) choice and applying a self-consistency threshold leverage multiple generated answers to estimate whether the model should abstain (a minimal sketch follows this list).
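
The consistency family is simple enough to illustrate directly. Below is a minimal sketch of a self-consistency-threshold abstention check; it is not the paper's implementation, and `generate` stands in for any sampled LLM call (temperature > 0).

```python
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency_abstain(
    question: str,
    generate: Callable[[str], str],  # stand-in for a sampled LLM call: prompt -> answer
    n_samples: int = 10,
    threshold: float = 0.7,
) -> Tuple[str, bool]:
    """Sample several answers and abstain when no single answer is dominant enough."""
    answers: List[str] = [generate(question).strip().lower() for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    abstain = (count / n_samples) < threshold  # weak agreement -> likely knowledge gap
    return top_answer, abstain
```

The 0.7 threshold here is an arbitrary placeholder; in practice it would be tuned on a held-out set, which is exactly the dependence the collaboration-based methods aim to avoid.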

Multi-LLM Collaboration

The Cooperate and Compete approaches seek to harness multiple LLMs, providing a novel framework for abstention:

  • Cooperate: Utilizes multiple LLMs to provide varied feedback on an initial response, allowing a final 'judging' model to assimilate inputs and decide on abstention collectively. This method takes advantage of discrepancies in knowledge coverage among different models.
  • Compete: Challenges an LLM with conflicting information generated by other models and abstains if the LLM changes its initial answer, since being swayed signals uncertainty about the original response (both mechanisms are sketched in code below).
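
To make the two mechanisms concrete, the sketch below paraphrases them in code. The prompts, the yes/no judging convention, and the exact-match comparison are illustrative assumptions rather than the authors' actual prompts or decision rules; `LLM` is any prompt-to-text callable.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # stand-in for a model call: prompt -> text


def cooperate_abstain(question: str, answer: str,
                      reviewers: List[LLM], judge: LLM) -> bool:
    """Cooperate (sketch): other models critique the proposed answer,
    and a judge model reads their feedback and decides whether to abstain."""
    feedback = [
        reviewer(f"Question: {question}\nProposed answer: {answer}\n"
                 "Briefly state whether this answer is factually correct and why.")
        for reviewer in reviewers
    ]
    verdict = judge(
        f"Question: {question}\nProposed answer: {answer}\n"
        "Feedback from other models:\n" + "\n".join(feedback) +
        "\nGiven the feedback, is the proposed answer reliable? Reply yes or no."
    )
    return verdict.strip().lower().startswith("no")  # abstain if judged unreliable


def compete_abstain(question: str, answer: str,
                    target: LLM, challengers: List[LLM]) -> bool:
    """Compete (sketch): challengers supply conflicting claims; if the target
    model flips its answer under that pressure, treat it as a knowledge gap."""
    for challenger in challengers:
        conflicting_passage = challenger(
            f"Write a short passage arguing for a different answer to: {question}")
        revised = target(
            f"Question: {question}\nContext: {conflicting_passage}\n"
            "Answer the question concisely.")
        if revised.strip().lower() != answer.strip().lower():
            return True  # the model was swayed, so abstain
    return False
```

A design note: Cooperate leans on the discrepancies in knowledge coverage noted above, whereas Compete only requires the other models to produce plausible conflicting claims for the target model to react to.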

Experimental Evaluation

The authors conduct extensive experiments with three LLMs (Mistral-7B, LLaMA2-70B, and ChatGPT) on four QA tasks spanning diverse knowledge domains and reasoning challenges: MMLU, Knowledge Crosswords, HellaSwag, and propaganda detection.

  • Performance Metrics: The evaluation measures reliable accuracy (R-Acc), effective reliability (ER), and the abstain-specific metrics abstain accuracy (A-Acc) and abstain F1-score (A-F1); a sketch of these metrics follows this list.
  • Findings: The Cooperate and Compete approaches outperformed the baseline methods across models and datasets, with gains of up to 19.3% in abstain accuracy over the strongest baseline.
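
Since the headline numbers are abstain-focused, it helps to pin down what the metrics score. The sketch below assumes common definitions (R-Acc over answered questions only; ER rewarding correct answers, penalizing wrong ones, and counting abstentions as zero; A-Acc and A-F1 scoring the abstain decision itself); the paper's exact formulations may differ in details.

```python
from typing import Dict, List


def abstention_metrics(correct: List[bool], abstained: List[bool]) -> Dict[str, float]:
    """correct[i]: the model's answer to question i is right.
    abstained[i]: the abstain mechanism chose to withhold that answer."""
    n = max(len(correct), 1)
    answered = [(c, a) for c, a in zip(correct, abstained) if not a]
    r_acc = sum(c for c, _ in answered) / max(len(answered), 1)    # accuracy on answered Qs only
    er = sum(1 if c else -1 for c, _ in answered) / n              # abstentions contribute 0
    # The abstain decision is "right" when it matches "the answer would be wrong".
    a_acc = sum(a == (not c) for c, a in zip(correct, abstained)) / n
    tp = sum(a and not c for c, a in zip(correct, abstained))      # abstained on a wrong answer
    fp = sum(a and c for c, a in zip(correct, abstained))          # abstained unnecessarily
    fn = sum(not a and not c for c, a in zip(correct, abstained))  # answered and got it wrong
    precision, recall = tp / max(tp + fp, 1), tp / max(tp + fn, 1)
    a_f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"R-Acc": r_acc, "ER": er, "A-Acc": a_acc, "A-F1": a_f1}
```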

Implications and Future Directions

This paper introduces robust methodologies for enhancing LLM reliability by abstaining from answering when knowledge is insufficient. These strategies reveal potential both for practical applications in improving AI trustworthiness and for theoretical developments in understanding model uncertainty.

Looking forward, the research opens avenues for integrating collaborative and competitive dynamics into LLM pipelines at scale and exploring their application in safety-critical and highly dynamic knowledge contexts. Additionally, extending these principles to other domains, such as ethical AI frameworks, could be a promising direction.

In conclusion, the paper significantly contributes to methodological advancements in LLM abstention, demonstrating that a multi-LLM approach can outperform traditional single-model strategies in identifying and managing knowledge gaps.

Authors (6)
  1. Shangbin Feng
  2. Weijia Shi
  3. Yike Wang
  4. Wenxuan Ding
  5. Vidhisha Balachandran
  6. Yulia Tsvetkov