
Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales (2405.17618v3)

Published 27 May 2024 in cs.LG and cs.AI

Abstract: Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) can introduce additional difficulty. Differing preferences can complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. To enhance training robustness, RL has adopted techniques from supervised learning, such as ensembles and layer normalization. In this work, we improve the stability of RL training by adapting the reverse cross entropy (RCE) loss from supervised learning on noisy data to define a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments in discrete action tasks (Atari games) and continuous action space tasks (MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO), with and without added noise, with especially notable performance gains for SPPO across different hyperparameters. Furthermore, we validate the benefits of the symmetric RL loss when using SPPO for LLMs through improved performance in RLHF tasks, such as IMDB positive sentiment and TL;DR summarization.
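The abstract names the supervised-learning building block (reverse cross entropy, RCE) that the symmetric RL loss adapts, but not the RL-side formulation itself. As a point of reference, below is a minimal sketch of a symmetric cross entropy in the style of the noisy-label literature: a weighted sum of the forward and reverse cross entropy terms. The function name, the weights `alpha`/`beta`, and the clipping floor are illustrative assumptions, not values taken from this paper.

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha=1.0, beta=1.0, clip_min=1e-4):
    """Illustrative symmetric loss: alpha * CE(labels, preds) + beta * RCE(preds, labels).

    logits: (batch, num_classes) raw model outputs.
    targets: (batch,) integer class labels.
    alpha, beta, clip_min are illustrative choices, not values from the paper.
    """
    # Standard (forward) cross entropy between labels and predictions.
    ce = F.cross_entropy(logits, targets)

    # Reverse cross entropy: swap the roles of predictions and labels.
    # One-hot labels contain zeros, so clamp before taking the log to stay finite.
    pred = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    one_hot = one_hot.clamp(min=clip_min)
    rce = -(pred * one_hot.log()).sum(dim=-1).mean()

    return alpha * ce + beta * rce
```

In the paper's setting, an analogous reverse term is combined with the A2C and PPO objectives to obtain SA2C and SPPO; the sketch above shows only the supervised building block being adapted.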
