SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF (2310.05344v1)

Published 9 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Model alignment with human preferences is an essential step in making LLMs helpful and consistent with human values. It typically consists of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) stages. However, RLHF faces inherent limitations stemming from a complex training setup and its tendency to align the model with implicit values that end users cannot control at run-time. Moreover, reward models in the RLHF stage commonly rely on single-dimensional feedback as opposed to explicit, multifaceted signals that indicate attributes such as helpfulness, humor, and toxicity. To address these limitations, we propose SteerLM, a supervised fine-tuning method that empowers end-users to control responses during inference. SteerLM conditions responses to conform to an explicitly defined multi-dimensional set of attributes, thereby empowering a steerable AI capable of generating helpful and high-quality responses while maintaining customizability. Experiments show that SteerLM trained on open source datasets generates responses that are preferred by human and automatic evaluators to many state-of-the-art baselines trained with RLHF while being much easier to train. Try SteerLM at https://huggingface.co/nvidia/SteerLM-llama2-13B
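
To make the attribute-conditioning idea concrete, here is a minimal, hypothetical Python sketch of how an attribute-conditioned SFT example might be assembled. The tags, attribute names, and 0-9 score scale are illustrative assumptions rather than the released model's actual template; the point is only that the attribute signal is an explicit part of the input, so a user can change it at inference time without retraining.

```python
# Hypothetical sketch of attribute-conditioned SFT data formatting.
# The tags (<prompt>, <attributes>, <response>), attribute names, and the
# 0-9 score scale are illustrative assumptions, not SteerLM's exact format.

def format_steer_example(prompt: str, response: str, attributes: dict) -> str:
    """Linearize attribute labels and condition the response on them."""
    # Sort keys so the attribute string has a deterministic order.
    attr_str = ",".join(f"{name}:{score}" for name, score in sorted(attributes.items()))
    return (
        f"<prompt>{prompt}</prompt>"
        f"<attributes>{attr_str}</attributes>"
        f"<response>{response}</response>"
    )

# Training: examples like this are fed to ordinary supervised fine-tuning,
# with the standard next-token loss applied over the response span.
train_text = format_steer_example(
    prompt="Explain RLHF in one short paragraph.",
    response="RLHF fine-tunes a language model against a learned reward model...",
    attributes={"helpfulness": 9, "humor": 0, "toxicity": 0},
)

# Inference: the user picks the attribute values they want (e.g. maximum
# helpfulness, zero toxicity) and the model completes the response span.
inference_prefix = (
    f"<prompt>Explain RLHF in one short paragraph.</prompt>"
    f"<attributes>helpfulness:9,humor:0,toxicity:0</attributes>"
    f"<response>"
)

print(train_text)
print(inference_prefix)
```

As the abstract describes, the design choice this sketch illustrates is that preferences are expressed as explicit, multi-dimensional attributes in the input rather than as an implicit, single-dimensional reward signal, which is what makes the model steerable at run time.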

Authors (5)
  1. Yi Dong (46 papers)
  2. Zhilin Wang (38 papers)
  3. Makesh Narsimhan Sreedhar (14 papers)
  4. Xianchao Wu (16 papers)
  5. Oleksii Kuchaiev (31 papers)
Citations (48)