Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation (2401.07382v2)
Abstract: Reinforcement learning (RL) can align large language models (LLMs) with non-differentiable reward signals, such as human preferences. However, a major challenge arises from the sparsity of these reward signals: typically, there is only a single reward for an entire output. Such sparse rewards can lead to inefficient and unstable learning. To address this challenge, our paper introduces a novel framework that leverages the critique capability of LLMs to produce intermediate-step rewards during RL training. Our method couples a policy model with a critic LLM, which provides comprehensive feedback on each part of the output. This feedback is then translated into token- or span-level rewards that guide the RL training process. We investigate this approach under two settings: one where a smaller policy model is paired with a more powerful critic model, and another where a single LLM fulfills both roles. We evaluate our approach on three text generation tasks: sentiment control, LLM detoxification, and summarization. Experimental results show that incorporating such artificial intrinsic rewards significantly improves both the sample efficiency and overall performance of the policy model, as supported by both automatic and human evaluation.
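To make the reward-densification idea concrete, below is a minimal sketch (not the authors' released code) of how critique spans returned by a critic LLM could be folded into per-token rewards alongside the single terminal reward. All names here (`CritiqueSpan`, `build_token_rewards`, the `intrinsic_weight` scaling) are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal sketch: turning critic-LLM span feedback into dense per-token rewards
# that supplement the sparse terminal reward used in standard RLHF-style training.
# All names and the weighting scheme are assumptions for illustration.

from dataclasses import dataclass
from typing import List


@dataclass
class CritiqueSpan:
    """A span flagged by the critic LLM, in token indices [start, end)."""
    start: int
    end: int
    score: float  # e.g. -1.0 for a problematic span, +1.0 for a praised one


def build_token_rewards(
    num_tokens: int,
    terminal_reward: float,
    critique_spans: List[CritiqueSpan],
    intrinsic_weight: float = 0.5,
) -> List[float]:
    """Combine the sparse terminal reward with span-level intrinsic rewards.

    The holistic (extrinsic) reward is assigned to the final token only;
    every token inside a critiqued span additionally receives a small
    intrinsic reward derived from the critic's judgement.
    """
    rewards = [0.0] * num_tokens
    rewards[-1] = terminal_reward  # sparse extrinsic signal at sequence end

    for span in critique_spans:
        for t in range(span.start, min(span.end, num_tokens)):
            rewards[t] += intrinsic_weight * span.score  # dense intrinsic signal

    return rewards


if __name__ == "__main__":
    # Example: a 10-token output with overall reward 0.8, where the critic
    # flagged tokens 3-6 as problematic.
    spans = [CritiqueSpan(start=3, end=6, score=-1.0)]
    print(build_token_rewards(10, 0.8, spans))
    # -> [0.0, 0.0, 0.0, -0.5, -0.5, -0.5, 0.0, 0.0, 0.0, 0.8]
```

The resulting per-token reward vector can then be consumed by any policy-gradient method (e.g., PPO) in place of the usual terminal-only reward.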
Authors: Meng Cao, Lei Shu, Lei Yu, Yun Zhu, Nevan Wichers, Yinxiao Liu, Lei Meng