
Language Model Alignment with Elastic Reset (2312.07551v1)

Published 6 Dec 2023 in cs.CL

Abstract: Finetuning LLMs with reinforcement learning (RL), e.g. from human feedback (HF), is a prominent method for alignment. But optimizing against a reward model can improve on reward while degrading performance in other areas, a phenomenon known as reward hacking, alignment tax, or language drift. First, we argue that commonly-used test metrics are insufficient and instead measure how different algorithms trade off between reward and drift. The standard method modifies the reward with a Kullback-Leibler (KL) penalty between the online and initial model. We propose Elastic Reset, a new algorithm that achieves higher reward with less drift without explicitly modifying the training objective. We periodically reset the online model to an exponentially moving average (EMA) of itself, then reset the EMA model to the initial model. Through the use of an EMA, our model recovers quickly after resets and achieves higher reward with less drift in the same number of steps. We demonstrate that fine-tuning LLMs with Elastic Reset leads to state-of-the-art performance on a small-scale pivot-translation benchmark, outperforms all baselines in a medium-scale RLHF-like IMDB mock sentiment task, and leads to a more performant and more aligned technical QA chatbot with LLaMA-7B. Code available at github.com/mnoukhov/elastic-reset.
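The two-stage reset described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: parameters are represented as flat lists of floats, and the function names (`ema_update`, `elastic_reset`) and the `decay` value are illustrative assumptions.

```python
import copy

def ema_update(ema_params, online_params, decay=0.995):
    """Move EMA parameters a small step toward the online parameters."""
    for i, (e, o) in enumerate(zip(ema_params, online_params)):
        ema_params[i] = decay * e + (1.0 - decay) * o

def elastic_reset(online_params, ema_params, init_params):
    """Elastic Reset: set online <- EMA, then EMA <- initial model.

    The online model resumes from a slow-moving average of itself
    (fast recovery), while the EMA anchor is pulled back to the
    initial model (limits cumulative drift).
    """
    online_params[:] = copy.deepcopy(ema_params)
    ema_params[:] = copy.deepcopy(init_params)
```

In a training loop one would call `ema_update` every step and `elastic_reset` every fixed number of steps; the reset interval is a hyperparameter of the method.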

Authors (4)
  1. Michael Noukhovitch
  2. Samuel Lavoie
  3. Florian Strub
  4. Aaron Courville