Chain of Hindsight Aligns Language Models with Feedback (2302.02676v8)

Published 6 Feb 2023 in cs.LG and cs.CL

Abstract: Learning from human preferences is important for LLMs to match human needs and to align with human and social values. Prior works have achieved remarkable successes by learning from human feedback to understand and follow instructions. Nonetheless, these methods are either founded on hand-picked model generations that are favored by human annotators, rendering them inefficient in terms of data utilization and challenging to apply in general, or they depend on reinforcement learning, which often suffers from imperfect reward functions and relies on extremely challenging optimizations. In this work, we propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. Our idea is inspired by how humans learn from extensive feedback presented in the form of languages. We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model, allowing us to take advantage of the language comprehension capabilities of LLMs. We condition the model on a sequence of model generations paired with feedback. By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors. Applying our method to LLMs, we observed that Chain of Hindsight significantly surpasses previous methods in aligning LLMs with human preferences. We report significant improvements on summarization and dialogue benchmarks, with our approach markedly preferred in human evaluations.

Chain of Hindsight: Aligning LLMs with Feedback

The paper "Chain of Hindsight aligns LLMs with Feedback" introduces a novel approach for enhancing the alignment of LLMs with human preferences. This work is situated within the ongoing effort to make LLMs more attuned to human values by learning from human feedback, a crucial aspect for the models’ broader acceptance in society.

The primary innovation is Chain of Hindsight (CoH), an efficient technique for fine-tuning LLMs on human feedback expressed as natural-language comparisons. Unlike conventional methods that rely on curated positive examples or on reinforcement learning from human feedback (RLHF), which can be cumbersome and imperfect, CoH uses a straightforward supervised objective: the model is conditioned on sequences of its own generations paired with feedback. These sequences teach the model to compare outputs, correct errors, and improve upon the negative attributes of earlier generations.
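
To make the data construction concrete, the following Python sketch shows one plausible way a preference pair could be turned into a chain-of-hindsight training sequence. The template strings, the function name, and the "Good:"/"Bad:" indicators are illustrative assumptions; the paper's exact prompt formats may differ.

```python
# Illustrative sketch (not the paper's exact templates): turn a human
# preference pair into a single training sequence in which the worse
# generation precedes the better one, each tagged with feedback.

def build_coh_example(prompt: str, preferred: str, rejected: str) -> str:
    """Concatenate both generations with natural-language feedback so the
    model learns to produce the improved output conditioned on the worse one."""
    return f"{prompt}\nBad: {rejected}\nGood: {preferred}"

example = build_coh_example(
    prompt="Summarize the following article: <article text>",
    preferred="A faithful, concise summary covering the main points.",
    rejected="A vague summary that omits key details.",
)
print(example)
```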

The paper outlines the limitations of existing methods such as supervised fine-tuning (SFT) and RLHF. SFT trains only on positively rated outputs, so the model is never exposed to examples of errors being identified and corrected. RLHF uses feedback data more generally but demands careful reward-function design and a notoriously difficult optimization. CoH combines the strengths of both approaches without inheriting their respective drawbacks.

In practical terms, CoH converts all forms of feedback, positive or negative, into natural-language sequences, allowing LLMs to apply their language-comprehension abilities during fine-tuning. Each training example presents feedback alongside previous model generations, so the model learns through comparison to produce outputs that better satisfy the feedback.

The researchers demonstrate that CoH significantly outperforms SFT, Conditional SFT, SFT with unlikelihood loss, and state-of-the-art RLHF baselines on tasks involving summarization and dialogue, as evidenced by both human evaluation and automated metrics. Notably, CoH integrates feedback through natural language descriptions, enhancing the models' flexibility and scalability. For example, when conditioned on feedback indicators like 'Good' or 'Bad', models using CoH produce more accurate, coherent, and comprehensive summaries compared to baselines.
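
A rough sketch of how such conditioning could look at inference time is below; `coh_model` and `generate` are placeholders for a fine-tuned model and an autoregressive decoding routine, and the "Good:" indicator is an assumed template rather than the paper's confirmed format.

```python
# Hypothetical inference-time conditioning: append a positive feedback
# indicator so the fine-tuned model continues with a generation it learned
# to associate with good feedback.

def positive_conditioned_prompt(task_prompt: str) -> str:
    return f"{task_prompt}\nGood:"

prompt = positive_conditioned_prompt("Summarize the following article: <article text>")
# summary = generate(coh_model, prompt)  # placeholder decoding call
print(prompt)
```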

An important contribution of the paper is the scalability of CoH. Unlike RLHF, CoH keeps the same training objective as pretraining (next-token prediction), suggesting broad applicability and straightforward integration into existing LLM training pipelines. Furthermore, CoH incorporates both fine-grained language feedback and binary feedback without requiring a reinforcement signal, indicating its potential to significantly reduce the alignment tax typically associated with preference-trained models.
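
As a rough illustration of how the objective can stay identical to pretraining, the sketch below applies a standard next-token cross-entropy loss while masking out prompt and feedback tokens so that only model-output positions contribute. The masking convention and function signature are assumptions made for illustration, not a claim about the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def coh_loss(logits: torch.Tensor, input_ids: torch.Tensor,
             output_mask: torch.Tensor) -> torch.Tensor:
    """Next-token prediction loss restricted to model-output positions.

    logits:      (batch, seq_len, vocab) from the language model
    input_ids:   (batch, seq_len) token ids of the chain-of-hindsight sequence
    output_mask: (batch, seq_len) 1 where the token is a model output,
                 0 for prompt/feedback tokens (assumed convention)
    """
    shift_logits = logits[:, :-1, :]
    shift_targets = input_ids[:, 1:]
    shift_mask = output_mask[:, 1:].float()
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
        reduction="none",
    ).reshape(shift_targets.shape)
    # Average only over positions the model is supposed to learn to generate.
    return (per_token * shift_mask).sum() / shift_mask.sum().clamp(min=1.0)
```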

The implications of this work are substantial both theoretically and practically. Theoretically, CoH provides a robust framework for feedback integration, offering a methodological departure from reliance on hand-engineered reward functions. Practically, it points towards more resource-effective and scalable ways to align LLMs with human norms, potentially making these models more reliable for real-world applications.

Future developments could explore the integration of external feedback sources beyond human comparative feedback, such as technical evaluations and user-generated performance metrics, which could further bolster the efficacy and alignment potential of LLMs. In conclusion, the Chain of Hindsight presents a promising paradigm for aligning LLMs with human preferences, offering significant improvements and efficiencies over existing approaches. Its continued development and application could mark a substantial step forward in the responsible deployment of AI systems.

Authors (3)
  1. Hao Liu (497 papers)
  2. Carmelo Sferrazza (22 papers)
  3. Pieter Abbeel (372 papers)
Citations (109)