Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback (2403.18349v3)
Abstract: LLMs often generate erroneous outputs, known as hallucinations, because they struggle to discern questions that lie beyond their knowledge scope. While hallucination has been a focal point of research, previous efforts have concentrated primarily on improving correctness without giving due consideration to rejection mechanisms. In this paper, we conduct a comprehensive examination of the role of rejection, introducing the notion of model reliability along with corresponding metrics. These metrics measure the model's ability to provide accurate responses while appropriately rejecting questions that exceed its knowledge boundary, thereby minimizing hallucinations. To improve the inherent reliability of LLMs, we present a novel alignment framework called Reinforcement Learning from Knowledge Feedback (RLKF). RLKF leverages knowledge feedback to dynamically determine the model's knowledge boundary and trains a reliable reward model that encourages refusal of out-of-knowledge questions. Experimental results on mathematical questions confirm that RLKF substantially improves LLM reliability.
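The abstract sketches two ingredients: a knowledge-boundary probe that decides which questions the model should refuse, and a reliability score that credits both correct answers and appropriate refusals. The sketch below is a minimal illustration of one way those ingredients could be realized, assuming a sampling-based boundary probe and a simple accuracy-plus-refusal score; the function names (`probe_knowledge_boundary`, `reliability`), the refusal token, and the sampling threshold are assumptions for this example, not the paper's released implementation or exact metric definitions.

```python
# Illustrative sketch only (not the authors' code). Names, the refusal token,
# and the sampling threshold are assumptions made for this example.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    question: str
    gold_answer: str

def probe_knowledge_boundary(generate: Callable[[str], str], ex: Example,
                             n_samples: int = 8, threshold: float = 0.5) -> bool:
    """Sample the model several times; if it answers correctly often enough,
    treat the question as inside its knowledge boundary (answerable)."""
    hits = sum(generate(ex.question).strip() == ex.gold_answer
               for _ in range(n_samples))
    return hits / n_samples >= threshold

def reliability(responses: List[str], examples: List[Example],
                answerable: List[bool], refusal_token: str = "[REFUSE]") -> float:
    """Count a response as reliable if it answers an in-boundary question
    correctly, or refuses an out-of-boundary question (an illustrative score,
    not necessarily the paper's exact metric)."""
    reliable = 0
    for resp, ex, ok in zip(responses, examples, answerable):
        if ok:
            reliable += int(resp.strip() == ex.gold_answer)
        else:
            reliable += int(resp == refusal_token)
    return reliable / len(examples)

# Toy usage with a stub "model" that only knows one fact.
if __name__ == "__main__":
    data = [Example("2+2", "4"), Example("17*23", "391")]
    stub = lambda q: "4" if q == "2+2" else "I think 400"
    flags = [probe_knowledge_boundary(stub, ex) for ex in data]   # [True, False]
    responses = ["4", "[REFUSE]"]
    print(reliability(responses, data, flags))                    # 1.0
```

Under these assumptions, the boundary labels produced by sampling could also serve as the knowledge feedback used to build preference data for the reward model (answer when inside the boundary, refuse when outside), which is the behavior RLKF is described as rewarding.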
Authors: Hongshen Xu, Zichen Zhu, Da Ma, Situo Zhang, Shuai Fan, Lu Chen, Kai Yu