Unfamiliar Finetuning Examples Control How Language Models Hallucinate (2403.05612v2)
Abstract: LLMs are known to hallucinate when faced with unfamiliar queries, but the underlying mechanisms that govern how models hallucinate are not yet fully understood. In this work, we find that unfamiliar examples in the models' finetuning data -- those that introduce concepts beyond the base model's scope of knowledge -- are crucial in shaping these errors. In particular, we find that an LLM's hallucinated predictions tend to mirror the responses associated with its unfamiliar finetuning examples. This suggests that by modifying how unfamiliar finetuning examples are supervised, we can influence a model's responses to unfamiliar queries (e.g., to say "I don't know"). We empirically validate this observation in a series of controlled experiments involving SFT, RL, and reward model finetuning on TriviaQA and MMLU. Our work further investigates RL finetuning strategies for improving the factuality of long-form model generations. We find that, while hallucinations from the reward model can significantly undermine the effectiveness of RL factuality finetuning, strategically controlling how reward models hallucinate can minimize these negative effects. Leveraging our previous observations on controlling hallucinations, we propose an approach for learning more reliable reward models, and show that they improve the efficacy of RL factuality finetuning in long-form biography and book/movie plot generation tasks.
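The abstract argues that the supervision attached to unfamiliar finetuning examples steers how a finetuned model responds to unfamiliar queries (for instance, it can be taught to abstain). Below is a minimal, hypothetical sketch of that idea, not the paper's exact pipeline: it assumes each SFT example carries a precomputed base-model confidence score, and it relabels low-confidence ("unfamiliar") examples with an abstention string before finetuning. The `QAExample` container, the confidence threshold, and the abstention text are all illustrative assumptions.

```python
# Sketch: relabel "unfamiliar" SFT examples with an abstention target.
# Confidence scores are assumed to come from the base model (e.g., answer
# likelihood or self-evaluation) and are passed in precomputed here; the
# threshold and abstention string are illustrative choices, not values
# taken from the paper.

from dataclasses import dataclass, replace
from typing import List

ABSTAIN = "I don't know."        # hypothetical abstention target
FAMILIARITY_THRESHOLD = 0.3      # hypothetical cutoff on base-model confidence


@dataclass
class QAExample:
    question: str
    answer: str
    base_model_confidence: float  # assumed precomputed, in [0, 1]


def relabel_unfamiliar(examples: List[QAExample]) -> List[QAExample]:
    """Replace answers of low-confidence (unfamiliar) examples with an
    abstention, so the finetuned model learns to abstain on unfamiliar
    queries instead of imitating answers it has no basis for."""
    return [
        replace(ex, answer=ABSTAIN)
        if ex.base_model_confidence < FAMILIARITY_THRESHOLD
        else ex
        for ex in examples
    ]


if __name__ == "__main__":
    data = [
        QAExample("Who wrote 'Pride and Prejudice'?", "Jane Austen", 0.95),
        QAExample("When was an obscure local council founded?", "1883", 0.05),
    ]
    for ex in relabel_unfamiliar(data):
        print(ex.question, "->", ex.answer)
```

The same relabeling logic could, under the abstract's framing, be applied to reward-model training data so that the reward model scores unfamiliar claims conservatively rather than hallucinating high rewards; that variant is not shown here.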