Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation (2403.13578v1)

Published 20 Mar 2024 in cs.CL and cs.LG

Abstract: In this paper, we study the problem of multi-reward reinforcement learning to jointly optimize for multiple text qualities in natural language generation. We focus on the task of counselor reflection generation, where we optimize generators to simultaneously improve the fluency, coherence, and reflection quality of generated counselor responses. We introduce two novel bandit methods, DynaOpt and C-DynaOpt, which rely on the broad strategy of combining rewards into a single value and optimizing them simultaneously. Specifically, we employ non-contextual and contextual multi-armed bandits to dynamically adjust multiple reward weights during training. Through automatic and manual evaluations, we show that our proposed techniques, DynaOpt and C-DynaOpt, outperform existing naive and bandit baselines, showcasing their potential for enhancing LLMs.
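
The core mechanism the abstract describes, using a multi-armed bandit to dynamically adjust the weights that scalarize the fluency, coherence, and reflection rewards, can be sketched as follows. This is a minimal illustrative sketch assuming an Exp3-style non-contextual bandit and a simple "did overall quality improve" feedback signal; the class name, reward names, and update rule are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

# Illustrative sketch of DynaOpt-style dynamic reward weighting.
# Each bandit arm corresponds to one text quality; the arm probabilities
# double as the weights used to combine per-quality rewards into a single
# scalar reward for the RL update. All names here are assumed, not the
# paper's actual identifiers.

REWARD_NAMES = ["fluency", "coherence", "reflection"]  # one arm per reward

class Exp3RewardWeighter:
    def __init__(self, n_arms: int, gamma: float = 0.1):
        self.n_arms = n_arms
        self.gamma = gamma                   # exploration rate
        self.log_weights = np.zeros(n_arms)  # Exp3 arm weights (log-space)

    def probabilities(self) -> np.ndarray:
        # Exp3 mixes the softmax of the weights with uniform exploration.
        w = np.exp(self.log_weights - self.log_weights.max())
        return (1 - self.gamma) * w / w.sum() + self.gamma / self.n_arms

    def combine(self, rewards: np.ndarray) -> float:
        # Scalarize the per-quality rewards with the current arm probabilities.
        return float(self.probabilities() @ rewards)

    def update(self, arm: int, feedback: float) -> None:
        # Exp3 update: boost arms whose emphasis coincided with better feedback.
        probs = self.probabilities()
        self.log_weights[arm] += self.gamma * (feedback / probs[arm]) / self.n_arms

# Toy usage: pretend each training step yields per-quality rewards for a
# generated reflection, and use the change in their mean as bandit feedback.
bandit = Exp3RewardWeighter(n_arms=len(REWARD_NAMES))
prev_mean = 0.0
for step in range(5):
    rewards = np.random.rand(len(REWARD_NAMES))   # stand-in for reward models
    scalar_reward = bandit.combine(rewards)        # reward passed to the RL step
    arm = np.random.choice(len(REWARD_NAMES), p=bandit.probabilities())
    feedback = rewards.mean() - prev_mean          # did overall quality improve?
    bandit.update(arm, max(feedback, 0.0))
    prev_mean = rewards.mean()
    print(f"step {step}: weights={bandit.probabilities().round(2)}, reward={scalar_reward:.3f}")
```

In this reading, the contextual variant (C-DynaOpt) would condition the arm choice on features of the current training state rather than using a fixed, context-free distribution; that extension is not shown here.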
