Can AI Assistants Know What They Don't Know?

Published 24 Jan 2024 in cs.CL and cs.AI (arXiv:2401.13275v2)

Abstract: Recently, AI assistants based on LLMs have shown surprising performance in many tasks, such as dialogue, solving math problems, writing code, and using tools. Although LLMs possess intensive world knowledge, they still make factual errors on knowledge-intensive tasks such as open-domain question answering. These untruthful responses from an AI assistant may pose significant risks in practical applications. We believe that an AI assistant's refusal to answer questions it does not know is a crucial method for reducing hallucinations and making the assistant truthful. Therefore, in this paper, we ask the question "Can AI assistants know what they don't know and express this through natural language?" To answer it, we construct a model-specific "I don't know" (Idk) dataset for an assistant, containing its known and unknown questions, based on existing open-domain question answering datasets. We then align the assistant with its corresponding Idk dataset and observe whether it refuses to answer its unknown questions after alignment. Experimental results show that, after alignment with an Idk dataset, the assistant refuses to answer most of its unknown questions, and for the questions it does attempt to answer, accuracy is significantly higher than before alignment.

Summary

  • The paper demonstrates that aligning AI assistants with an 'I don't know' dataset significantly improves their ability to recognize unknown information, achieving up to 78.96% success.
  • It employs a methodology combining Idk-Prompting, Supervised Fine-Tuning, and Preference-Aware Optimization to balance truthful admissions with responsive engagement.
  • The findings highlight a trade-off between conservativeness and responsiveness, with larger models exhibiting enhanced discernment, suggesting scalable benefits for real-world applications.

Can AI Assistants Know What They Don't Know?

This paper addresses the challenge of enabling AI assistants powered by LLMs to recognize their own knowledge boundaries. The research investigates whether AI assistants can acknowledge unknown information and refuse to answer questions beyond their knowledge scope.

Introduction

LLMs demonstrate extraordinary capabilities across diverse tasks but often produce untruthful responses, reflecting hallucinations and factual inaccuracies. The core question of the study is whether an AI assistant can discern and accurately express what it knows and does not know. To this end, the authors propose aligning AI assistants with a specialized "I don't know" (Idk) dataset, which categorizes questions into known and unknown based on the assistant's own performance (see Figure 1).

Figure 1: Knowledge quadrants of an AI assistant. "Unknowns" represent what the AI does not actually know.

Methodology

Dataset Construction

The Idk dataset is derived from existing open-domain question answering datasets such as TriviaQA. Each question is posed to the model several times; if the fraction of correct answers reaches the Ik threshold, the question is marked as "known", and otherwise as "unknown", in which case the training target becomes a refusal response. Because the labels reflect the model's own performance, the dataset is model-specific: it teaches each assistant where its own knowledge boundary lies (see Figure 2).

Figure 2: Knowledge quadrants of AI assistants on the Idk dataset (Ik threshold=1.0).
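To make the construction concrete, here is a minimal sketch of the labeling step, assuming a hypothetical `generate` sampling helper and crude containment-based grading; the paper's actual grading procedure and refusal phrasing may differ:

```python
# Minimal sketch of Idk dataset construction, not the authors' exact pipeline.
# `generate` is a hypothetical sampling helper for the assistant being aligned.

IK_THRESHOLD = 1.0  # fraction of sampled answers that must be correct for "known"
NUM_SAMPLES = 10    # how many answers to sample per question

REFUSAL = "I don't know the answer to this question."  # assumed refusal template

def is_correct(answer: str, references: list[str]) -> bool:
    """Crude containment-based grading against gold answers."""
    return any(ref.lower() in answer.lower() for ref in references)

def label_question(question: str, references: list[str], generate) -> dict:
    """Sample the assistant repeatedly and label the question known/unknown."""
    answers = [generate(question) for _ in range(NUM_SAMPLES)]
    accuracy = sum(is_correct(a, references) for a in answers) / NUM_SAMPLES
    known = accuracy >= IK_THRESHOLD
    # Known questions keep one correct sample as the target; unknown ones get a refusal.
    target = next(a for a in answers if is_correct(a, references)) if known else REFUSAL
    return {"question": question, "target": target, "known": known}
```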

Experimental Alignment

Three methods were employed to teach models to refuse unknown questions (a minimal sketch of the preference-pair step follows the list):

  • Idk-Prompting: instructing the assistant, directly in the prompt, to answer only when it is confident and to say "I don't know" otherwise.
  • Supervised Fine-Tuning (Idk-SFT): fine-tuning on the Idk dataset, with gold answers as targets for known questions and refusals for unknown ones.
  • Preference-Aware Optimization: building preference pairs from the Idk dataset and optimizing over them with methods such as DPO.
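As referenced above, here is a minimal sketch of the preference-pair step, under the assumption that refusals are preferred for unknown questions and genuine answers for known ones; the `model_answer` field and refusal phrasing are hypothetical, not taken from the paper's code:

```python
REFUSAL = "I don't know the answer to this question."  # assumed refusal template

def build_preference_pair(example: dict) -> dict:
    """Turn one labeled Idk example into a (chosen, rejected) pair for DPO."""
    if example["known"]:
        # For known questions, answering should beat refusing.
        chosen, rejected = example["target"], REFUSAL
    else:
        # For unknown questions, refusing should beat guessing.
        chosen, rejected = REFUSAL, example["model_answer"]
    return {"prompt": example["question"], "chosen": chosen, "rejected": rejected}
```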

Results

Aligning with the Idk dataset improves the assistant's ability to assess its knowledge boundaries and avoid hallucinations: assistants correctly identified up to 78.96% of their knowledge limitations, refusing to answer questions for which they genuinely lacked information. However, alignment via SFT tended to make the model over-conservative, refusing some questions it could in fact answer correctly (see Figure 3).

Figure 3: Variation in the proportions of Ik and Idk questions within the Idk datasets constructed based on different Ik thresholds.
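To make the evaluation concrete, here is one plausible scorer for the knowledge-quadrant view, under our reading that a response counts as truthful when the assistant correctly answers a known question or refuses an unknown one; the field names are hypothetical, not the authors' code:

```python
def truthful_rate(records: list[dict]) -> float:
    """Score knowledge-quadrant truthfulness.

    Each record carries three hypothetical fields: `known` (the question is
    labeled known in the assistant's Idk dataset), `refused` (the assistant
    declined to answer), and `correct` (the given answer matched the reference).
    """
    truthful = 0
    for r in records:
        if r["known"] and not r["refused"] and r["correct"]:
            truthful += 1  # knows what it knows: answered and got it right
        elif not r["known"] and r["refused"]:
            truthful += 1  # knows what it doesn't know: correctly refused
    return truthful / len(records)
```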

Preference-based methods such as DPO mitigated this over-conservativeness by encouraging the model to keep answering questions it does know, and they slightly improved generalization to out-of-distribution test sets such as Natural Questions and ALCUNA.
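For context, DPO itself optimizes the standard objective below; this is the general DPO loss from the literature, not something specific to this paper. Here $y_w$ is the chosen response (e.g., a refusal for an unknown question), $y_l$ the rejected one, $\pi_{\text{ref}}$ a frozen reference policy, and $\beta$ a temperature on the implicit reward:

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right) \right]
```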

Challenges and Trade-offs

Key challenges include the trade-off between conservativeness and responsiveness. A high Ik threshold improves truthfulness but can lead to unnecessary refusals of questions the model is equipped to answer accurately: with, say, ten sampled answers per question, a threshold of 1.0 labels a question "known" only if all ten samples are correct, so a question answered correctly nine times out of ten would still be trained toward refusal. Larger models exhibit better discernment, suggesting that scale is a promising avenue for refining AI assistant responses.

Conclusion

This study marks a pivotal step in advancing AI self-knowledge. By aligning AI models with tailored Idk datasets, the research enhances their ability to recognize their own knowledge boundaries, thereby improving truthfulness and reducing hallucinated outputs. Future work could explore fine-tuning the balance between responsiveness and conservativeness to optimize the real-world applicability of AI assistants.

In sum, equipping AI assistants with the ability to recognize their own unknowns strengthens their reliability and safety, making their interactions with users more truthful.
