Can AI Assistants Know What They Don't Know? (2401.13275v2)

Published 24 Jan 2024 in cs.CL and cs.AI

Abstract: Recently, AI assistants based on LLMs have shown surprising performance on many tasks, such as dialogue, solving math problems, writing code, and using tools. Although LLMs possess extensive world knowledge, they still make factual errors on knowledge-intensive tasks such as open-domain question answering. These untruthful responses can pose significant risks in practical applications. We believe that an AI assistant's refusal to answer questions it does not know is a crucial method for reducing hallucinations and making the assistant truthful. Therefore, in this paper we ask: "Can AI assistants know what they don't know and express this through natural language?" To answer this question, we construct a model-specific "I don't know" (Idk) dataset for an assistant, containing its known and unknown questions, based on existing open-domain question answering datasets. We then align the assistant with its corresponding Idk dataset and observe whether it can refuse to answer its unknown questions after alignment. Experimental results show that, after alignment with the Idk dataset, the assistant refuses to answer most of its unknown questions. For the questions it does attempt to answer, accuracy is significantly higher than before alignment.

Authors (10)
  1. Qinyuan Cheng (21 papers)
  2. Tianxiang Sun (35 papers)
  3. Xiangyang Liu (23 papers)
  4. Wenwei Zhang (77 papers)
  5. Zhangyue Yin (27 papers)
  6. Shimin Li (22 papers)
  7. Linyang Li (57 papers)
  8. Kai Chen (512 papers)
  9. Xipeng Qiu (257 papers)
  10. Zhengfu He (10 papers)
Citations (14)

Summary

Introduction to Aligned AI Assistants

AI assistants built on LLMs now handle a wide variety of tasks and have reached impressive performance levels. Despite these achievements, they remain prone to 'hallucinations', generating factually incorrect responses to certain queries. Such errors undermine the reliability of AI systems. To mitigate this, it is essential to study how an assistant can recognize the limits of its knowledge and decline to answer questions that fall beyond them.

Perceiving Knowledge Boundaries

Conceptually, an AI assistant's knowledge can be split into what it knows and what it does not know, and each part can be further split by whether the assistant is aware of that status. The resulting categories, 'Known Knowns', 'Known Unknowns', 'Unknown Knowns', and 'Unknown Unknowns', give a systematic picture of the assistant's self-knowledge. A balance between accurately answering questions within its knowledge (Ik-Ik) and openly acknowledging gaps in that knowledge (Ik-Idk) defines the assistant's truthfulness.

The paper focuses on aligning AI assistants with model-specific 'I don't know' (Idk) datasets, which capture both the known and unknown parts of the assistant's knowledge. A correct answer corresponds to 'Ik-Ik', while a correct refusal (declining to answer what it does not know) corresponds to 'Ik-Idk'. Both 'Unknown Unknowns' and 'Unknown Knowns' lead to untruthful generations and must therefore be converted into 'Known Unknowns' and 'Known Knowns' for the assistant to be truthful; the sketch below makes this categorization concrete.
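As a minimal illustration of this bookkeeping, the Python sketch below assigns each response to one of the four quadrants and computes a truthful rate as the fraction landing in the two truthful ones (Ik-Ik plus Ik-Idk). The class and function names are ours, not the paper's, and the sketch assumes question-level known/unknown labels from the Idk dataset plus a binary refusal signal.

```python
from enum import Enum


class Quadrant(Enum):
    IK_IK = "Ik-Ik"      # known question, answered (truthful)
    IK_IDK = "Ik-Idk"    # unknown question, correctly refused (truthful)
    IDK_IK = "Idk-Ik"    # known question, wrongly refused (over-conservative)
    IDK_IDK = "Idk-Idk"  # unknown question, answered anyway (hallucination risk)


def categorize(model_knows: bool, refused: bool) -> Quadrant:
    """Map one (question, response) pair onto the four-quadrant framework.

    `model_knows` comes from the Idk dataset labels (can this assistant answer
    the question reliably?); `refused` indicates whether the response was a
    refusal. Answering a known question is treated as a correct answer here,
    a simplification of the paper's evaluation.
    """
    if model_knows:
        return Quadrant.IDK_IK if refused else Quadrant.IK_IK
    return Quadrant.IK_IDK if refused else Quadrant.IDK_IDK


def truthful_rate(records: list[tuple[bool, bool]]) -> float:
    """Fraction of responses in the two truthful quadrants (Ik-Ik + Ik-Idk)."""
    quadrants = [categorize(knows, refused) for knows, refused in records]
    truthful = sum(q in (Quadrant.IK_IK, Quadrant.IK_IDK) for q in quadrants)
    return truthful / len(quadrants)


# Two known questions (one answered, one wrongly refused) and one unknown
# question that the assistant correctly declines: truthful rate is 2/3.
print(truthful_rate([(True, False), (True, True), (False, True)]))
```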

Aligning With Idk Datasets

The Idk dataset is built from an existing open-domain question-answering dataset: each question is posed to the assistant multiple times, and the question is labeled as known or unknown according to how often the assistant answers it correctly. Alignment is then performed via prompting, supervised fine-tuning (SFT), and preference-aware optimization techniques such as best-of-n sampling (BoN), proximal policy optimization (PPO), direct preference optimization (DPO), and hindsight instruction relabeling (HIR). Notably, compared with plain supervised fine-tuning, the preference-aware methods reduce the rate of known questions that go unanswered (Idk-Ik) while still making the assistant refuse queries outside its knowledge.
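Below is a minimal sketch of how such a model-specific Idk dataset might be assembled, assuming the assistant is sampled several times per question and a question counts as known when its sampled accuracy reaches a chosen Ik threshold. The names `model_answers_correctly` and `build_idk_dataset`, the refusal template, and the default threshold are illustrative placeholders rather than the paper's exact procedure.

```python
import random
from dataclasses import dataclass


# Hypothetical stand-in for querying the assistant; replace with a real
# generate-and-grade loop against your model and QA dataset.
def model_answers_correctly(question: str, reference: str) -> bool:
    return random.random() < 0.5


@dataclass
class IdkExample:
    question: str
    target: str  # either a grounded answer or a refusal


IDK_RESPONSE = "I don't know the answer to this question."


def build_idk_dataset(qa_pairs, num_samples: int = 10, ik_threshold: float = 1.0):
    """Label each question as known or unknown for *this specific* assistant.

    A question counts as known when its sampled accuracy reaches
    `ik_threshold`; otherwise the training target becomes a refusal.
    """
    dataset = []
    for question, reference in qa_pairs:
        accuracy = sum(
            model_answers_correctly(question, reference) for _ in range(num_samples)
        ) / num_samples
        target = reference if accuracy >= ik_threshold else IDK_RESPONSE
        dataset.append(IdkExample(question, target))
    return dataset
```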

Empirical Outcomes and Ablation Studies

Experiments show that, after alignment, AI assistants are markedly better at recognizing their knowledge boundaries: they refuse to answer unknown questions while achieving higher accuracy on the questions they do attempt. Interestingly, supervised fine-tuning alone produced an overly conservative model; preference-aware optimization, and direct preference optimization in particular, restored a more nuanced balance.
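One plausible way to construct the preference data that DPO-style methods consume is to derive pairs directly from the Idk labels, preferring a truthful response (a correct answer for a known question, a refusal for an unknown one) over its untruthful counterpart. The field names and pairing rule below are assumptions for illustration, not the paper's exact recipe.

```python
IDK_RESPONSE = "I don't know the answer to this question."


def make_preference_pairs(idk_examples):
    """Turn Idk-labeled examples into (prompt, chosen, rejected) triples for DPO.

    For known questions the correct answer is preferred over a refusal; for
    unknown questions the refusal is preferred over a (wrong) attempted answer.
    """
    pairs = []
    for ex in idk_examples:
        if ex["target"] == IDK_RESPONSE:  # unknown question
            chosen, rejected = IDK_RESPONSE, ex["model_attempt"]
        else:                             # known question
            chosen, rejected = ex["target"], IDK_RESPONSE
        pairs.append({"prompt": ex["question"], "chosen": chosen, "rejected": rejected})
    return pairs


example = {
    "question": "Who wrote 'Pride and Prejudice'?",
    "target": "Jane Austen",
    "model_attempt": "Charlotte Bronte",
}
print(make_preference_pairs([example]))
```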

Ablation studies highlight several factors that shape the assistant's ability to discern its knowledge boundaries. Larger models recognize what they know with higher precision. The data source used to build the Idk dataset also matters: model-specific Idk datasets prove crucial for good performance. Finally, the threshold that separates knowns from unknowns has a considerable effect, with higher thresholds pushing the assistant toward more truthful behavior, as the toy example below illustrates.
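A toy illustration of the threshold effect, using made-up per-question sampled accuracies: raising the Ik threshold shrinks the set labeled as known, which in turn makes the aligned assistant refuse more often.

```python
# Hypothetical per-question accuracies from sampling the assistant repeatedly.
sampled_accuracy = {"q1": 1.0, "q2": 0.8, "q3": 0.4, "q4": 0.0}

for threshold in (0.5, 0.8, 1.0):
    known = [q for q, acc in sampled_accuracy.items() if acc >= threshold]
    print(f"Ik threshold {threshold}: {len(known)} known, "
          f"{len(sampled_accuracy) - len(known)} unknown")
```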

Concluding Perspectives

Aligned AI assistants can learn to express the limits of their knowledge: they answer within their known territory and refuse to answer questions that fall outside it. These results matter for building AI systems that treat truthfulness as an intrinsic value, and thus for making such systems more reliable and trustworthy. The codebase and dataset released with the paper should help refine and apply these techniques to assistants in other domains.