GRATH: Gradual Self-Truthifying for Large Language Models (2401.12292v2)
Abstract: Truthfulness is paramount for LLMs as they are increasingly deployed in real-world applications. However, existing LLMs still struggle to generate truthful content, as evidenced by their modest performance on benchmarks like TruthfulQA. To address this issue, we propose GRAdual self-truTHifying (GRATH), a novel post-processing method for enhancing the truthfulness of LLMs. GRATH uses out-of-domain question prompts to generate pairwise truthfulness training data, each pair containing a question together with a correct and an incorrect answer, and then optimizes the model via direct preference optimization (DPO) to learn from the truthfulness difference between the answer pairs. GRATH iteratively refines the truthfulness data and updates the model, gradually improving model truthfulness in a self-supervised manner. Empirically, we evaluate GRATH on different 7B LLMs and compare against models of similar or larger size on benchmark datasets. Our results show that GRATH effectively improves LLMs' truthfulness without compromising other core capabilities. Notably, GRATH achieves state-of-the-art performance on TruthfulQA, with MC1 accuracy of 54.71% and MC2 accuracy of 69.10%, surpassing even 70B LLMs.
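Since the pairwise DPO update is the abstract's key ingredient, a minimal PyTorch sketch of that objective may help make it concrete. This is an illustrative rendering of the standard DPO loss (Rafailov et al., 2023), not the authors' implementation; the function name, the sequence-level log-probability inputs, and the default beta=0.1 are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F


def dpo_loss(pi_chosen_logps: torch.Tensor,
             pi_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (correct, incorrect) answer pairs.

    Each tensor holds the summed token log-probabilities of the correct
    ("chosen") or incorrect ("rejected") answer under the trainable
    policy (pi_*) or the frozen reference model (ref_*).
    """
    # Implicit rewards: how far the policy has moved away from the
    # reference model on each answer.
    chosen_logratio = pi_chosen_logps - ref_chosen_logps
    rejected_logratio = pi_rejected_logps - ref_rejected_logps
    # Maximize the margin between the correct and incorrect answers.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


if __name__ == "__main__":
    # Dummy sequence log-probabilities for a batch of two answer pairs.
    pi_chosen = torch.tensor([-12.3, -9.8])
    pi_rejected = torch.tensor([-15.9, -11.2])
    ref_chosen = torch.tensor([-13.0, -10.1])
    ref_rejected = torch.tensor([-15.0, -11.0])
    print(dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected))
```

What makes the procedure "gradual" and self-supervised, per the abstract, is the surrounding loop: the model itself generates the correct/incorrect answer pairs for out-of-domain questions, a DPO update is applied against those pairs, and the cycle of regenerating data and updating the model repeats.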
Authors: Weixin Chen, Bo Li, Dawn Song