Leftover Lunch: Advantage-based Offline Reinforcement Learning for Language Models (2305.14718v5)
Abstract: Reinforcement Learning with Human Feedback (RLHF) is the most prominent method for language model (LM) alignment. However, RLHF is an unstable and data-hungry process that continually requires new high-quality LM-generated data for finetuning. We introduce Advantage-Leftover Lunch RL (A-LoL), a new class of offline policy gradient algorithms that enable RL training on any pre-existing data. By treating the entire LM output sequence as a single action, A-LoL can incorporate sequence-level classifiers or human-designed scoring functions as rewards. Then, using the LM's own value estimate, A-LoL trains only on positive-advantage (leftover) data points, making it resilient to noise. Overall, A-LoL is an easy-to-implement, sample-efficient, and stable LM training recipe. We demonstrate the effectiveness of A-LoL and its variants on four different language generation tasks, comparing against online RL (PPO) as well as recent preference-based (DPO, PRO) and reward-based (GOLD) offline RL baselines. On the commonly used RLHF benchmark, Helpful and Harmless Assistant (HHA), LMs trained with A-LoL methods achieve the highest diversity while also being rated safer and more helpful than the baselines by human evaluators. On the remaining three tasks, A-LoL optimizes multiple distinct reward functions even when trained on noisy or suboptimal data. Our experimental code is available at https://github.com/abaheti95/LoL-RL.
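To make the recipe concrete, below is a minimal PyTorch sketch of a sequence-level loss in the spirit of the abstract's description: one action per output sequence, a sequence-level advantage (reward minus the reference LM's value estimate), positive-advantage filtering, and an importance-weighted log-likelihood. The function name, tensor names, and the importance-weight clamp are illustrative assumptions, not the paper's exact implementation; see the released code linked above for the authoritative version.

```python
import torch

def a_lol_loss(policy_seq_logprob: torch.Tensor,
               ref_seq_logprob: torch.Tensor,
               reward: torch.Tensor,
               value_estimate: torch.Tensor,
               iw_clip: float = 2.0) -> torch.Tensor:
    """Sketch of an A-LoL-style offline policy gradient step.

    All tensors have shape [batch]. Each output sequence is treated as a
    single action, so *_seq_logprob is the sum of token log-probabilities
    over the whole response, and reward/value_estimate are sequence-level.
    """
    # Advantage of each training sequence: sequence-level reward minus
    # the reference LM's value estimate of the input.
    advantage = (reward - value_estimate).detach()
    # "Leftover" filtering: keep only positive-advantage data points,
    # which is what makes the method resilient to noisy training data.
    keep = advantage > 0
    if not keep.any():
        # No useful examples in this batch; return a zero that still
        # participates in the autograd graph.
        return policy_seq_logprob.sum() * 0.0
    # Per-sequence importance weight pi_theta(y|x) / pi_ref(y|x),
    # detached and clamped for stability (an assumed, common choice).
    iw = (policy_seq_logprob - ref_seq_logprob).detach().exp().clamp(max=iw_clip)
    # Advantage- and importance-weighted negative log-likelihood.
    loss = -(advantage * iw * policy_seq_logprob)
    return loss[keep].mean()

# Toy usage with random sequence-level statistics:
batch = 8
loss = a_lol_loss(policy_seq_logprob=torch.randn(batch, requires_grad=True),
                  ref_seq_logprob=torch.randn(batch),
                  reward=torch.rand(batch),
                  value_estimate=torch.rand(batch))
loss.backward()
```

Since the training data is fixed, the reference log-probabilities and value estimates can in principle be computed once ahead of time, which is consistent with the sample efficiency and stability the abstract claims; how exactly the paper schedules this is not specified here.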
- Just say no: Analyzing the stance of neural dialogue generation in offensive contexts. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 4846–4862, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.397. URL https://aclanthology.org/2021.emnlp-main.397.
- Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022a.
- Constitutional AI: Harmlessness from AI feedback, 2022b.
- Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp. 131–198, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/W16-2301. URL https://aclanthology.org/W16-2301.
- COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4762–4779, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1470. URL https://aclanthology.org/P19-1470.
- Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
- Open problems and fundamental limitations of reinforcement learning from human feedback, 2023.
- Deep RL with hierarchical action exploration for dialogue generation, 2023.
- Scaling instruction-finetuned language models, 2022. URL https://arxiv.org/abs/2210.11416.
- Did it happen? The pragmatic complexity of veridicality assessment. Computational Linguistics, 38(2):301–333, February 2012. URL https://doi.org/10.1162/COLI_a_00097.
- Off-Policy Actor-Critic. In International Conference on Machine Learning, Edinburgh, United Kingdom, June 2012. URL https://inria.hal.science/hal-00764021.
- QLoRA: Efficient finetuning of quantized LLMs, 2023.
- Wizard of wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1l73iRqKm.
- Measuring the carbon intensity of ai in cloud instances. In 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, pp. 1877–1894, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393522. doi: 10.1145/3531146.3533234. URL https://doi.org/10.1145/3531146.3533234.
- FaithDial: A faithful benchmark for information-seeking dialogue. Transactions of the Association for Computational Linguistics, 10:1473–1490, 2022a. doi: 10.1162/tacl_a_00529. URL https://aclanthology.org/2022.tacl-1.84.
- On the origin of hallucinations in conversational models: Is it the datasets or the models? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5271–5285, Seattle, United States, July 2022b. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.387. URL https://aclanthology.org/2022.naacl-main.387.
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned, 2022.
- Dialogue response ranking training with large-scale human feedback data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 386–395, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.28. URL https://aclanthology.org/2020.emnlp-main.28.
- Aligning language models with preferences through f-divergence minimization. In International Conference on Machine Learning (ICML), 2023. URL https://openreview.net/forum?id=ttga7UlrsE.
- Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 708–719, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1065. URL https://aclanthology.org/N18-1065.
- Efficient (soft) Q-learning for text generation with limited good data. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 6969–6991, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.518. URL https://aclanthology.org/2022.findings-emnlp.518.
- Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pp. 375–385, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445901. URL https://doi.org/10.1145/3442188.3445901.
- GPT-Critic: Offline reinforcement learning for end-to-end task-oriented dialogue systems. In International Conference on Learning Representations, 2022.
- Critic-guided decoding for controlled text generation, 2022.
- Should I run offline reinforcement learning or behavioral cloning? In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=AP1MKT37rJ.
- A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1014. URL https://aclanthology.org/N16-1014.
- Aligning generative language models with human values. In Findings of the Association for Computational Linguistics: NAACL 2022, pp. 241–252, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-naacl.18. URL https://aclanthology.org/2022.findings-naacl.18.
- Statistical rejection sampling improves preference optimization, 2023.
- Quark: Controllable text generation with reinforced unlearning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 27591–27609. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b125999bde7e80910cbdbd323087df8f-Paper-Conference.pdf.
- OpenAI. GPT-4 technical report, 2023.
- Training language models to follow instructions with human feedback. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=TG8KACxEON.
- Richard Yuanzhe Pang and He He. Text generation by learning from demonstrations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=RovX-uQ1Hua.
- Reward gaming in conditional text generation, 2023.
- Stabilizing rlhf through advantage model and selective rehearsal, 2023.
- Barbara Plank. The 'problem' of human label variation: On ground truth in data, modeling and evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, November 2022. URL http://arxiv.org/abs/2211.02570.
- Language models are unsupervised multitask learners. OpenAI technical report, 2019.
- Direct preference optimization: Your language model is secretly a reward model, 2023.
- Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=8aHzds2uUyB.
- Mark B. Ring. Child: A first step towards continual learning. Mach. Learn., 28(1):77–104, July 1997. ISSN 0885-6125. doi: 10.1023/A:1007331723572. URL https://doi.org/10.1023/A:1007331723572.
- High-dimensional continuous control using generalized advantage estimation. In Yoshua Bengio and Yann LeCun (eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1506.02438.
- Proximal policy optimization algorithms, 2017.
- What makes a good conversation? how controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1702–1723, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1170. URL https://aclanthology.org/N19-1170.
- Toward diverse text generation with inverse reinforcement learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, pp. 4361–4367. AAAI Press, 2018. ISBN 9780999241127.
- The curse of recursion: Training on generated data makes models forget, 2023.
- Defining and characterizing reward gaming. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=yb3HOXO3lX2.
- Offline RL for natural language generation with implicit language Q-learning. In International Conference on Learning Representations, 2023.
- Preference ranking optimization for human alignment, 2023a.
- Reward collapse in aligning large language models, 2023b.
- Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1355. URL https://aclanthology.org/P19-1355.
- LLaMA: Open and efficient foundation language models, 2023a.
- Llama 2: Open foundation and fine-tuned chat models, 2023b.
- Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
- CHAI: A CHatbot AI for task-oriented dialogue with offline reinforcement learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4471–4491, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.332. URL https://aclanthology.org/2022.naacl-main.332.
- TRL: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.
- Large language models are not fair evaluators, 2023.
- Critic regularized regression. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 7768–7778. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/588cb956d6bbe67078f29f8de420a13d-Paper.pdf.
- Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR.
- Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022b. URL https://openreview.net/forum?id=_VjQlMeSB_J.
- Generating sequences by learning to self-correct, 2022.
- Lilian Weng. Policy gradient algorithms. lilianweng.github.io, 2018. URL https://lilianweng.github.io/posts/2018-04-08-policy-gradient/.
- Symbolic knowledge distillation: from general language models to commonsense models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4602–4625, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.341. URL https://aclanthology.org/2022.naacl-main.341.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6.
- Fine-grained human feedback gives better rewards for language model training. arXiv preprint arXiv:2306.01693, 2023.
- RLCD: Reinforcement learning from contrast distillation for language model alignment, 2023.
- RRHF: Rank responses to align language models with human feedback without tears, 2023.
- DIALOGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 270–278, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-demos.30. URL https://aclanthology.org/2020.acl-demos.30.
- SLiC-HF: Sequence likelihood calibration with human feedback, 2023.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
- Fine-tuning language models with advantage-induced policy alignment, 2023.
Authors: Ashutosh Baheti, Ximing Lu, Faeze Brahman, Ronan Le Bras, Maarten Sap, Mark Riedl