Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning (2407.15762v2)

Published 22 Jul 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Reward-based finetuning is crucial for aligning language policies with intended behaviors (e.g., creativity and safety). A key challenge is to develop steerable LLMs that trade off multiple (conflicting) objectives in a flexible and efficient manner. This paper presents Conditional Language Policy (CLP), a general framework for finetuning LLMs on multiple objectives. Building on techniques from multi-task training and parameter-efficient finetuning, CLP learns steerable models that effectively trade off conflicting objectives at inference time. Notably, this does not require training or maintaining multiple models to achieve different trade-offs between the objectives. Through extensive experiments and ablations on two summarization datasets, we show that CLP learns steerable LLMs that outperform and Pareto-dominate the existing approaches for multi-objective finetuning.
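
To make the general recipe concrete, below is a minimal, self-contained sketch (not the authors' code) of the core idea the abstract describes: a single policy with parameter-efficient components that are mixed according to a reward-weighting w, trained by sampling w and maximizing the correspondingly scalarized reward, and steered at inference time simply by choosing w. All names here (ConditionalLoRALinear, reward_a, reward_b) and the REINFORCE-style update are illustrative assumptions standing in for the paper's actual architecture and RL finetuning procedure.

```python
import torch
import torch.nn as nn


class ConditionalLoRALinear(nn.Module):
    """Linear layer whose two low-rank adapters are mixed by a trade-off weight w in [0, 1]."""

    def __init__(self, d_in: int, d_out: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        # One low-rank adapter per objective (two objectives here); LoRA-style init.
        self.down = nn.ParameterList([nn.Parameter(torch.randn(rank, d_in) * 0.01) for _ in range(2)])
        self.up = nn.ParameterList([nn.Parameter(torch.zeros(d_out, rank)) for _ in range(2)])

    def forward(self, x: torch.Tensor, w: float) -> torch.Tensor:
        # Interpolate the adapters: w favors objective 0, (1 - w) favors objective 1.
        delta = w * (self.up[0] @ self.down[0]) + (1.0 - w) * (self.up[1] @ self.down[1])
        return self.base(x) + x @ delta.T


# Toy "policy head" over a small vocabulary, conditioned on w.
vocab_size, d_model = 16, 32
policy = ConditionalLoRALinear(d_model, vocab_size)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)


# Stand-in rewards for two conflicting objectives (purely illustrative).
def reward_a(tokens):  # e.g. "conciseness"
    return (tokens < vocab_size // 2).float()


def reward_b(tokens):  # e.g. "informativeness"
    return (tokens >= vocab_size // 2).float()


for step in range(200):
    w = torch.rand(()).item()            # sample a trade-off weight for this update
    h = torch.randn(8, d_model)          # toy context representations
    dist = torch.distributions.Categorical(logits=policy(h, w))
    tokens = dist.sample()
    # Scalarize the rewards with the sampled weighting; REINFORCE-style update.
    reward = w * reward_a(tokens) + (1.0 - w) * reward_b(tokens)
    loss = -(dist.log_prob(tokens) * (reward - reward.mean())).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At inference time, the user picks w to steer the trade-off; no retraining or
# per-trade-off model is needed.
```

The property this sketch is meant to illustrate is the one highlighted in the abstract: every trade-off shares one set of weights, so reaching a different point on the Pareto front is a matter of changing the conditioning weight rather than training or storing another model.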
