
COPR: Continual Human Preference Learning via Optimal Policy Regularization (2402.14228v3)

Published 22 Feb 2024 in cs.LG and cs.AI

Abstract: Reinforcement Learning from Human Feedback (RLHF) is commonly used to improve the alignment of LLMs with human preferences. Given the evolving nature of human preferences, continual alignment is more crucial and practical than traditional static alignment. Nevertheless, making RLHF compatible with Continual Learning (CL) is challenging due to its complex process. Meanwhile, directly learning new human preferences may lead to Catastrophic Forgetting (CF) of historical preferences, resulting in unhelpful or harmful outputs. To overcome these challenges, we propose the Continual Optimal Policy Regularization (COPR) method, which draws inspiration from optimal policy theory. COPR utilizes a sampling distribution as a demonstration and regularization constraints for CL. It adopts the Lagrangian Duality (LD) method to dynamically regularize the current policy based on the historically optimal policy, which prevents CF and avoids over-emphasizing unbalanced objectives. We also provide a formal proof of the learnability of COPR. The experimental results show that COPR outperforms strong CL baselines on our proposed benchmark in terms of reward-based evaluation, GPT-4 evaluation, and human assessment. Furthermore, we validate the robustness of COPR under various CL settings, including different backbones, replay memory sizes, and learning orders.
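
To make the mechanism described in the abstract concrete, the sketch below illustrates one plausible reading of Lagrangian-duality-based policy regularization: the new-task preference loss is combined with a KL constraint that keeps the current policy close to a historically optimal policy, and the multiplier is updated by dual ascent. All function names, shapes, loss forms, and hyperparameters here are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch (assumed form) of Lagrangian-style regularization toward a
# historically optimal policy, loosely following the COPR abstract.
# Names, shapes, and the exact loss are illustrative assumptions.

import torch
import torch.nn.functional as F

def copr_style_loss(curr_logits, hist_logits, new_pref_loss,
                    lagrange_lambda, kl_budget=0.05):
    """Combine the new-task preference loss with a KL term that penalizes
    drifting away from the historically optimal policy."""
    curr_logp = F.log_softmax(curr_logits, dim=-1)
    hist_logp = F.log_softmax(hist_logits, dim=-1).detach()
    # KL(hist || curr), averaged over the batch.
    kl = torch.sum(hist_logp.exp() * (hist_logp - curr_logp), dim=-1).mean()
    # Lagrangian: primal objective plus lambda-weighted constraint violation.
    loss = new_pref_loss + lagrange_lambda * (kl - kl_budget)
    return loss, kl

def update_lambda(lagrange_lambda, kl, kl_budget=0.05, dual_lr=0.1):
    """Dual ascent: raise lambda when the KL constraint is violated,
    lower it (but keep it non-negative) when there is slack."""
    return max(0.0, lagrange_lambda + dual_lr * (kl.item() - kl_budget))

Under this reading, the multiplier adapts dynamically: when the current policy starts to forget historical preferences (constraint violation grows), the regularization strength increases, which is one way to avoid over-emphasizing either the old or the new objective.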
