SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF (2310.05344v1)
Abstract: Model alignment with human preferences is an essential step in making LLMs helpful and consistent with human values. It typically consists of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) stages. However, RLHF faces inherent limitations stemming from a complex training setup and its tendency to align the model with implicit values that end users cannot control at run-time. Moreover, reward models in the RLHF stage commonly rely on single-dimensional feedback as opposed to explicit, multifaceted signals that indicate attributes such as helpfulness, humor, and toxicity. To address these limitations, we propose SteerLM, a supervised fine-tuning method that empowers end users to control responses during inference. SteerLM conditions responses to conform to an explicitly defined, multi-dimensional set of attributes, thereby enabling a steerable AI capable of generating helpful and high-quality responses while remaining customizable. Experiments show that SteerLM trained on open-source datasets generates responses that are preferred by human and automatic evaluators over many state-of-the-art baselines trained with RLHF, while being much easier to train. Try SteerLM at https://huggingface.co/nvidia/SteerLM-llama2-13B
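The core idea, attribute-conditioned SFT, can be sketched briefly: training examples are annotated with explicit attribute values, and the user supplies desired values at inference time to steer the response. The snippet below is a minimal illustrative sketch only; the prompt template, attribute names, and rating scales shown here are hypothetical approximations, and the actual format is defined by the paper and the released NeMo/Hugging Face model.

```python
# Illustrative sketch of attribute-conditioned SFT (hypothetical format).
# The real SteerLM template, attribute set, and value scales are defined by
# the authors; this only conveys the conditioning mechanism.

def format_example(prompt: str, response: str, attributes: dict) -> str:
    """Serialize one training example with an explicit attribute string,
    so the model learns to condition its response on those values."""
    attr_str = ",".join(f"{name}:{value}" for name, value in attributes.items())
    return (
        f"<prompt>{prompt}</prompt>\n"
        f"<attributes>{attr_str}</attributes>\n"
        f"<response>{response}</response>"
    )

# Training-time example (attribute values would come from human annotation
# or from an attribute-prediction model):
train_text = format_example(
    prompt="Explain RLHF in one paragraph.",
    response="RLHF fine-tunes a language model against a learned reward model ...",
    attributes={"helpfulness": 4, "humor": 0, "toxicity": 0},
)

# Inference-time steering: the user picks attribute values and the model
# completes the <response> section.
steering_prefix = (
    "<prompt>Explain RLHF in one paragraph.</prompt>\n"
    "<attributes>helpfulness:4,humor:2,toxicity:0</attributes>\n"
    "<response>"
)
```

Because steering happens purely through this conditioning text, no reward model or RL loop is required during training, which is the basis for the abstract's claim that SteerLM is much easier to train than RLHF.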
Authors:
- Yi Dong
- Zhilin Wang
- Makesh Narsimhan Sreedhar
- Xianchao Wu
- Oleksii Kuchaiev