COPR: Continual Learning Human Preference through Optimal Policy Regularization (2310.15694v5)
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a commonly employed method for improving pre-trained Large Language Models (LLMs), enhancing their ability to conform to human preferences. However, current RLHF-based LLMs require full retraining whenever novel queries or feedback are introduced, which is challenging because human preferences can vary across domains and tasks. Retraining is impractical in many real-world settings due to the significant time and computational resources required, as well as concerns about data privacy. To address this limitation, we propose Continual Optimal Policy Regularization (COPR), which computes the distribution of the optimal policy while bypassing the partition function and then regularizes the current policy toward the historically optimal distributions to mitigate Catastrophic Forgetting (CF). COPR involves a single learning phase and does not require complex reinforcement learning. Importantly, like RLHF, it can learn from unlabeled data by maintaining a scoring module, similar to a reward model, making it flexible for continual learning without additional human feedback. Our experimental results show that COPR outperforms strong Continual Learning (CL) baselines in consistently aligning with human preferences on incremental tasks and domains.
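To make the abstract's description concrete, below is a minimal sketch (not the authors' code) of the COPR idea under the standard RLHF assumption that the optimal policy satisfies pi*(y|x) ∝ pi_ref(y|x) · exp(r(x, y) / beta). Restricting attention to K sampled responses per prompt lets the target distribution be normalized with a softmax over the samples, bypassing the intractable partition function over all sequences; the regularization toward stored "historical" targets illustrates the anti-forgetting term. Tensor shapes, the beta value, the replay buffer of past targets, and the function names are illustrative assumptions.

```python
# Hedged sketch of a COPR-style objective, assuming pi*(y|x) ∝ pi_ref(y|x) * exp(r/beta).
import torch
import torch.nn.functional as F

def sampled_optimal_policy(ref_logp: torch.Tensor, reward: torch.Tensor, beta: float = 1.0):
    """Optimal distribution over K sampled responses; no partition function needed.

    ref_logp: [B, K] log pi_ref(y_k | x) for K sampled responses per prompt.
    reward:   [B, K] scores from the maintained scoring module (reward-model-like).
    """
    return F.softmax(ref_logp + reward / beta, dim=-1)

def copr_loss(cur_logp, ref_logp, reward, past_logp=None, past_target=None,
              beta: float = 1.0, reg_weight: float = 1.0):
    """Fit the current policy to the sampled optimal distribution and regularize it
    toward historically optimal distributions to mitigate catastrophic forgetting."""
    target = sampled_optimal_policy(ref_logp, reward, beta)                  # [B, K]
    fit = F.kl_div(F.log_softmax(cur_logp, dim=-1), target, reduction="batchmean")
    if past_logp is None:                                                    # first task: no regularizer
        return fit
    # past_target: optimal distributions stored when earlier tasks were learned;
    # past_logp:   current policy's log-probs on those stored responses.
    reg = F.kl_div(F.log_softmax(past_logp, dim=-1), past_target, reduction="batchmean")
    return fit + reg_weight * reg

# Toy usage with random tensors standing in for real model log-probs and rewards.
B, K = 2, 4
cur_logp = torch.randn(B, K, requires_grad=True)
loss = copr_loss(cur_logp, torch.randn(B, K), torch.randn(B, K),
                 past_logp=cur_logp, past_target=F.softmax(torch.randn(B, K), dim=-1))
loss.backward()
```

In this sketch the single supervised-style loss replaces the PPO loop: the policy is trained only by distribution matching, which is consistent with the abstract's claim of a single learning phase without complex reinforcement learning.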