GRATH: Gradual Self-Truthifying for Large Language Models (2401.12292v2)

Published 22 Jan 2024 in cs.CL and cs.AI

Abstract: Truthfulness is paramount for LLMs as they are increasingly deployed in real-world applications. However, existing LLMs still struggle with generating truthful content, as evidenced by their modest performance on benchmarks like TruthfulQA. To address this issue, we propose GRAdual self-truTHifying (GRATH), a novel post-processing method to enhance truthfulness of LLMs. GRATH utilizes out-of-domain question prompts to generate pairwise truthfulness training data with each pair containing a question and its correct and incorrect answers, and then optimizes the model via direct preference optimization (DPO) to learn from the truthfulness difference between answer pairs. GRATH iteratively refines truthfulness data and updates the model, leading to a gradual improvement in model truthfulness in a self-supervised manner. Empirically, we evaluate GRATH using different 7B-LLMs and compare with LLMs with similar or even larger sizes on benchmark datasets. Our results show that GRATH effectively improves LLMs' truthfulness without compromising other core capabilities. Notably, GRATH achieves state-of-the-art performance on TruthfulQA, with MC1 accuracy of 54.71% and MC2 accuracy of 69.10%, which even surpass those on 70B-LLMs.

Authors (3)
  1. Weixin Chen (10 papers)
  2. Bo Li (1108 papers)
  3. Dawn Song (229 papers)
Citations (4)

Summary

  • The paper introduces GRATH, a self-truthification framework that iteratively fine-tunes LLMs using direct preference optimization to enhance factual accuracy.
  • The methodology leverages pairwise training data to narrow domain gaps and expand distributional distances, achieving significant truthfulness improvements in just two DPO iterations.
  • Empirical results show GRATH achieving 54.71% MC1 and 69.10% MC2 accuracy on TruthfulQA, outperforming larger models like Llama2-Chat-70B by over 23%.

Introduction

LLMs play a pivotal role in a wide range of applications, and given this extensive use, the truthfulness of their responses has emerged as a crucial quality metric. Developing models that can discern and convey accurate information is particularly important in domains where the integrity of the output carries significant consequences. Recent benchmarks such as TruthfulQA have been established to appraise the veracity of model-generated content.

Gradual Self-Truthifying

Against this backdrop, a notable innovation comes in the form of GRAdual self-truTHifying (GRATH), a post-processing method aimed at incrementally refining the truthfulness of LLMs. Central to GRATH is the generation of pairwise truthfulness data by prompting the LLM itself, which drives a self-supervised fine-tuning process through direct preference optimization (DPO). The LLM is first prompted to generate a correct and an incorrect answer to each out-of-domain question, yielding training pairs that serve as the basis for successive rounds of DPO fine-tuning. This iterative cycle gradually amplifies the model's capacity to deliver truthful answers.
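
To make the procedure concrete, below is a minimal Python sketch of the GRATH loop as described above. It is illustrative only: `model.generate` and `dpo_finetune` are hypothetical stand-ins for LLM prompting and a DPO training step (not the authors' code or any specific library API), while `dpo_pair_loss` spells out the standard per-pair DPO objective.

```python
import math

def generate_answer_pairs(model, questions):
    """Prompt the model itself for a correct and an incorrect answer per question.

    `model.generate` is a hypothetical text-generation call, used only for illustration.
    """
    pairs = []
    for q in questions:
        correct = model.generate(f"Question: {q}\nGive the truthful answer:")
        incorrect = model.generate(f"Question: {q}\nGive a plausible but false answer:")
        pairs.append({"prompt": q, "chosen": correct, "rejected": incorrect})
    return pairs

def dpo_pair_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def grath(model, ood_questions, dpo_finetune, num_iterations=2):
    """Gradual self-truthification: regenerate pairwise data, then DPO-finetune, repeatedly.

    `dpo_finetune` is a hypothetical helper that minimizes `dpo_pair_loss` over the pairs.
    """
    for _ in range(num_iterations):
        pairs = generate_answer_pairs(model, ood_questions)  # self-generated supervision
        model = dpo_finetune(model, pairs)                    # preference-optimization step
    return model
```

The key design choice is that the model supervises itself: the same model that is being truthified produces the preference pairs consumed by the next round of DPO.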

Empirical Evaluation

GRATH is evaluated on several 7B LLMs and compared against models of similar or substantially larger size. Even against these larger baselines, GRATH demonstrates a clear improvement in truthfulness metrics: it achieves an MC1 accuracy of 54.71% and an MC2 accuracy of 69.10% on the TruthfulQA benchmark, surpassing Llama2-Chat-70B by over 23%. Moreover, this gain in truthfulness does not come at the expense of other capabilities such as reasoning and commonsense understanding, as validated on benchmarks like ARC, HellaSwag, and MMLU.
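
For reference, here is a hedged sketch of how the TruthfulQA multiple-choice metrics are commonly computed from per-choice log-likelihoods; the input data structure is hypothetical and chosen only for illustration.

```python
import math

def mc1_accuracy(examples):
    """MC1: a question counts as correct when the highest-likelihood choice is a true answer."""
    correct = 0
    for ex in examples:
        best = max(ex["choices"], key=lambda c: c["log_likelihood"])
        correct += int(best["is_true"])
    return correct / len(examples)

def mc2_score(examples):
    """MC2: average normalized probability mass assigned to the set of true answers."""
    total = 0.0
    for ex in examples:
        probs = [math.exp(c["log_likelihood"]) for c in ex["choices"]]
        true_mass = sum(p for p, c in zip(probs, ex["choices"]) if c["is_true"])
        total += true_mass / sum(probs)
    return total / len(examples)
```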

Towards a Better Understanding

GRATH's results have prompted an in-depth analysis of the mechanisms behind the improved truthfulness. Two factors are identified: the domain gap between training and testing data, and the distributional distance between correct and incorrect answers within training pairs. The analysis shows that narrowing the domain gap and widening this distributional distance both substantially benefit model truthfulness. The iterative optimization process is also efficient, typically requiring only two DPO iterations to reach state-of-the-art performance.
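
One simple way to probe the second factor is to measure how far apart the correct and incorrect answers sit in an embedding space. The sketch below uses cosine distance between mean sentence embeddings as an illustrative proxy; `embed` is a hypothetical embedding function, and this is not necessarily the exact measure used in the paper.

```python
import numpy as np

def answer_pair_distance(embed, pairs):
    """Cosine distance between the mean embeddings of correct vs. incorrect answers.

    `embed` maps a string to a fixed-size vector (hypothetical helper);
    `pairs` follows the {"chosen": ..., "rejected": ...} format used for the DPO data.
    """
    chosen = np.array([embed(p["chosen"]) for p in pairs])
    rejected = np.array([embed(p["rejected"]) for p in pairs])
    mu_c, mu_r = chosen.mean(axis=0), rejected.mean(axis=0)
    cos_sim = float(np.dot(mu_c, mu_r) / (np.linalg.norm(mu_c) * np.linalg.norm(mu_r)))
    return 1.0 - cos_sim
```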

Conclusion

GRATH offers an effective technique for improving the truthfulness of LLMs without the need for labor-intensive annotations. It is resilient to out-of-domain prompts and improves truthfulness with notable efficiency. With the flexibility to combine with other alignment techniques, GRATH sets the stage for further exploration of multi-attribute optimization for LLMs. As LLMs spread into increasingly critical application domains, tools like GRATH will be instrumental in keeping machine-generated content anchored to factual truth.
