Unfamiliar Finetuning Examples Control How Language Models Hallucinate (2403.05612v2)
Abstract: LLMs are known to hallucinate when faced with unfamiliar queries, but the underlying mechanisms that govern how models hallucinate are not yet fully understood. In this work, we find that unfamiliar examples in the models' finetuning data -- those that introduce concepts beyond the base model's scope of knowledge -- are crucial in shaping these errors. In particular, we find that an LLM's hallucinated predictions tend to mirror the responses associated with its unfamiliar finetuning examples. This suggests that by modifying how unfamiliar finetuning examples are supervised, we can influence a model's responses to unfamiliar queries (e.g., to say "I don't know"). We empirically validate this observation in a series of controlled experiments involving SFT, RL, and reward model finetuning on TriviaQA and MMLU. Our work further investigates RL finetuning strategies for improving the factuality of long-form model generations. We find that, while hallucinations from the reward model can significantly undermine the effectiveness of RL factuality finetuning, strategically controlling how reward models hallucinate can minimize these negative effects. Leveraging our previous observations on controlling hallucinations, we propose an approach for learning more reliable reward models, and show that they improve the efficacy of RL factuality finetuning in long-form biography and book/movie plot generation tasks.
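The abstract argues that the supervision attached to unfamiliar finetuning examples steers how a finetuned model responds to unfamiliar queries (for instance, it can be taught to abstain). Below is a minimal, hypothetical sketch of that idea, not the paper's exact pipeline: it assumes each SFT example carries a precomputed base-model confidence score, and it relabels low-confidence ("unfamiliar") examples with an abstention string before finetuning. The `QAExample` container, the confidence threshold, and the abstention text are all illustrative assumptions.

```python
# Sketch: relabel "unfamiliar" SFT examples with an abstention target.
# Confidence scores are assumed to come from the base model (e.g., answer
# likelihood or self-evaluation) and are passed in precomputed here; the
# threshold and abstention string are illustrative choices, not values
# taken from the paper.

from dataclasses import dataclass, replace
from typing import List

ABSTAIN = "I don't know."        # hypothetical abstention target
FAMILIARITY_THRESHOLD = 0.3      # hypothetical cutoff on base-model confidence


@dataclass
class QAExample:
    question: str
    answer: str
    base_model_confidence: float  # assumed precomputed, in [0, 1]


def relabel_unfamiliar(examples: List[QAExample]) -> List[QAExample]:
    """Replace answers of low-confidence (unfamiliar) examples with an
    abstention, so the finetuned model learns to abstain on unfamiliar
    queries instead of imitating answers it has no basis for."""
    return [
        replace(ex, answer=ABSTAIN)
        if ex.base_model_confidence < FAMILIARITY_THRESHOLD
        else ex
        for ex in examples
    ]


if __name__ == "__main__":
    data = [
        QAExample("Who wrote 'Pride and Prejudice'?", "Jane Austen", 0.95),
        QAExample("When was an obscure local council founded?", "1883", 0.05),
    ]
    for ex in relabel_unfamiliar(data):
        print(ex.question, "->", ex.answer)
```

The same relabeling logic could, under the abstract's framing, be applied to reward-model training data so that the reward model scores unfamiliar claims conservatively rather than hallucinating high rewards; that variant is not shown here.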