Unfamiliar Finetuning Examples Control How Language Models Hallucinate (2403.05612v2)

Published 8 Mar 2024 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs are known to hallucinate when faced with unfamiliar queries, but the underlying mechanisms that govern how models hallucinate are not yet fully understood. In this work, we find that unfamiliar examples in the models' finetuning data -- those that introduce concepts beyond the base model's scope of knowledge -- are crucial in shaping these errors. In particular, we find that an LLM's hallucinated predictions tend to mirror the responses associated with its unfamiliar finetuning examples. This suggests that by modifying how unfamiliar finetuning examples are supervised, we can influence a model's responses to unfamiliar queries (e.g., say "I don't know"). We empirically validate this observation in a series of controlled experiments involving SFT, RL, and reward model finetuning on TriviaQA and MMLU. Our work further investigates RL finetuning strategies for improving the factuality of long-form model generations. We find that, while hallucinations from the reward model can significantly undermine the effectiveness of RL factuality finetuning, strategically controlling how reward models hallucinate can minimize these negative effects. Leveraging our previous observations on controlling hallucinations, we propose an approach for learning more reliable reward models, and show that they improve the efficacy of RL factuality finetuning in long-form biography and book/movie plot generation tasks.
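The abstract's point about reward-model hallucinations can be made concrete: a reward model that confidently scores claims it cannot actually verify will be exploited during RL. Below is a minimal, hedged sketch of one way "controlling how reward models hallucinate" could look in practice, assuming a hypothetical familiarity check and model API (the paper's exact recipe may differ): unfamiliar reward-model training examples are supervised with a pessimistic target, so the learned reward model errs toward low scores on content it cannot verify.

```python
# Hypothetical sketch (assumed names and API, not the paper's released code):
# give a pessimistic training target to any claim the underlying base model is
# unfamiliar with, so the reward model's own hallucinations default to low
# reward rather than spuriously rewarding fabrications during RL finetuning.

PESSIMISTIC_SCORE = 0.0

def is_unfamiliar_claim(base_model, claim, n_samples=5):
    """Placeholder familiarity proxy: the base model's true/false judgments on
    the claim are inconsistent across samples (any proxy could be substituted)."""
    votes = [base_model.judge_true_false(claim) for _ in range(n_samples)]  # assumed API
    return len(set(votes)) > 1

def build_conservative_reward_targets(base_model, labeled_claims):
    """labeled_claims: iterable of (claim_text, factuality_label in [0, 1])."""
    targets = []
    for claim, label in labeled_claims:
        score = PESSIMISTIC_SCORE if is_unfamiliar_claim(base_model, claim) else label
        targets.append((claim, score))
    return targets
```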


Summary

  • The paper finds that an LLM's hallucinated answers to unfamiliar queries tend to mirror the responses supervised on its unfamiliar finetuning examples, so its errors default to a hedged prediction shaped by that supervision.
  • Controlled experiments with SFT, RL, and reward-model finetuning on multiple-choice (TriviaQA, MMLU) and long-form generation tasks show that changing how these examples are supervised steers responses to unfamiliar queries, e.g., toward abstaining.
  • Building on this, the paper proposes more reliable reward models whose hallucinations are deliberately controlled, improving RL factuality finetuning for long-form biography and plot generation.

Unfamiliar Finetuning Examples Determine LLM Hallucinations

Introduction

Hallucination, the tendency of LLMs to produce factually incorrect but plausible responses to unfamiliar queries, is a notable limitation of these models. This paper investigates the mechanics of how finetuned LLMs generate such responses and traces them to the handling of unfamiliar examples during finetuning. On unfamiliar inputs, model outputs tend to default toward a hedged prediction that is significantly shaped by the supervision provided for analogous unfamiliar examples in the finetuning phase. Crucially, this mechanism opens an avenue for mitigating hallucinations through the strategic adjustment of supervision for these critical examples.

Understanding LLM Hallucinations

At the heart of this exploration is the hypothesis that the pattern of LLM hallucinations, rather than being arbitrary, is significantly influenced by the distribution of responses associated with unfamiliar examples encountered during finetuning. The paper posits that as inputs drift into unfamiliar territory, the LLM's predictions gravitate toward a hedged prediction: essentially an educated guess that minimizes the aggregate loss across the unfamiliar finetuning examples.
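To make the hedged-prediction intuition concrete, consider a toy illustration (not code from the paper) under a standard cross-entropy objective: the single constant prediction that minimizes the average loss over a pool of unfamiliar examples is the empirical distribution of their supervised answers, so a model that cannot distinguish among such inputs falls back on that marginal guess.

```python
import numpy as np

# Toy illustration (not the paper's code): with one-hot labels and cross-entropy
# loss, the constant probability vector that minimizes the average loss over a
# pool of "unfamiliar" examples is their empirical label distribution.
labels = np.array([0, 0, 1, 3, 0, 2, 0, 1])  # supervised answers on unfamiliar inputs
num_classes = 4

counts = np.bincount(labels, minlength=num_classes)
hedged_prediction = counts / counts.sum()  # optimal constant (hedged) prediction

print(hedged_prediction)  # [0.5   0.25  0.125 0.125]
```

This is why supervising the unfamiliar examples with a particular response (a fixed guess, or an explicit "I don't know") shifts what the model falls back on when queried outside its knowledge.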

This behavior is empirically tested and confirmed in controlled experiments on multiple-choice question answering (TriviaQA, MMLU) and long-form generation tasks, including biographies and plot summaries. The analysis shows that by adjusting the supervision of the finetuning dataset's unfamiliar examples, the model can be steered toward more desirable behavior, for instance acknowledging its limits by preferring an “I don’t know” response in the face of uncertainty.
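As a sketch of how such supervision could be adjusted in practice (hypothetical helper names and a generic base_model.generate API, not the paper's code), one can flag finetuning examples the pretrained base model cannot answer and replace their targets with an abstention before SFT:

```python
# Sketch of the relabeling idea (assumed helper and model API, not the paper's
# code): mark a finetuning example "unfamiliar" if the pretrained base model
# cannot recover the reference answer, then supervise it with an abstention.

IDK_RESPONSE = "I don't know."

def is_unfamiliar(base_model, question, reference_answer, n_samples=5):
    """Heuristic familiarity check: sample the base model and see whether it
    ever produces the reference answer."""
    samples = [base_model.generate(question) for _ in range(n_samples)]
    return not any(reference_answer.lower() in s.lower() for s in samples)

def relabel_for_sft(base_model, qa_pairs):
    """Return SFT targets where unfamiliar examples are supervised with an
    'I don't know' response instead of an answer the base model cannot know."""
    relabeled = []
    for question, answer in qa_pairs:
        target = IDK_RESPONSE if is_unfamiliar(base_model, question, answer) else answer
        relabeled.append((question, target))
    return relabeled
```

Any other desired fallback behavior could be substituted for the abstention string; the point is that whatever is supervised on the unfamiliar examples becomes the model's default on unfamiliar queries.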

Implications and Future Directions

The implications of these findings are multifaceted. Practically, they provide a blueprint for improving the reliability and trustworthiness of LLMs in producing factual content, especially when venturing into domains not well-represented in their training material. Theoretically, this work contributes to a deeper understanding of the dynamics of LLM learning, particularly in how these models internalize and generalize from their training instances to unseen data.

Looking ahead, this paper's model of hallucinations in LLMs lays a foundation for further research. Open questions include the behavior of partially familiar inputs, those lying between the familiar and the entirely unknown. Additionally, while the paper focuses on models finetuned for specific tasks, extending these findings to more generally finetuned LLMs is a promising direction.

Conclusion

In summary, this paper establishes a critical link between the supervision of unfamiliar examples during LLM finetuning and the content these models hallucinate. Leveraging this insight offers new means of controlling LLM outputs under uncertainty and sharpens our theoretical understanding of how these models learn and generalize. The overarching goal is a future in which LLMs answer confidently within their knowledge and reliably signal when a query lies beyond it.