Aligning Language Models with Offline Learning from Human Feedback (2308.12050v2)
Abstract: Learning from human preferences is crucial for language models (LMs) to effectively cater to human needs and societal values. Previous research has made notable progress by leveraging human feedback to teach models to follow instructions. However, these approaches rely primarily on online learning techniques such as Proximal Policy Optimization (PPO), which have proven unstable and challenging to tune for LLMs. Moreover, PPO requires a complex distributed system implementation, which hinders the efficiency of large-scale distributed training. In this study, we propose an offline learning-from-human-feedback framework that aligns LMs without interacting with the environment. Specifically, we explore filtering alignment (FA), reward-weighted regression (RWR), and conditional alignment (CA) to align LLMs with human preferences. By employing a loss function similar to supervised fine-tuning, our methods yield more stable training than PPO, require only a simple machine learning system (MLSys), and use far fewer (around 9%) computing resources. Experimental results demonstrate that conditional alignment outperforms the other offline alignment methods and is comparable to PPO.
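All three offline methods share an SFT-style training loop over a fixed, reward-annotated dataset. The sketch below is a minimal PyTorch illustration, not the paper's exact recipe: the function name, the exp(reward / beta) weighting, and the hyperparameters are assumptions. It shows a reward-weighted cross-entropy loss in the spirit of RWR; FA corresponds to the special case of uniform weights over samples kept above a reward threshold, and CA instead conditions the prompt on a reward or quality token and trains with the plain SFT loss.

```python
# Hedged sketch (not the paper's exact implementation): an SFT-style,
# reward-weighted cross-entropy loss in the spirit of reward-weighted
# regression (RWR). All names and the exp(reward / beta) weighting are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def offline_alignment_loss(logits, labels, rewards, beta=1.0, ignore_index=-100):
    """
    logits:  (B, T, V) next-token logits from the LM
    labels:  (B, T)    target token ids, with ignore_index on prompt/padding positions
    rewards: (B,)      scalar reward per response (e.g., from a reward model)
    """
    # Per-token cross-entropy, kept unreduced so it can be weighted per sequence.
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels,
        ignore_index=ignore_index, reduction="none",
    )  # (B, T)
    mask = (labels != ignore_index).float()
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)  # (B,)

    # RWR-style weights: higher-reward responses contribute more to the gradient.
    # Filtering alignment (FA) is the special case of weight 1 for samples above a
    # reward threshold and 0 otherwise; conditional alignment (CA) instead prepends
    # a reward/quality token to the prompt and uses uniform weights.
    weights = torch.softmax(rewards / beta, dim=0)
    return (weights * per_seq).sum()
```

Because this objective is just a weighted supervised loss over offline data, it trains with a standard data-parallel setup and needs no rollout workers or critic network, which is where the stability and resource advantages over PPO claimed in the abstract come from.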
- Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- Free Dolly: Introducing the world's first truly open instruction-tuned LLM.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
- Where did my optimum go?: An empirical analysis of gradient descent optimization in policy gradient methods. arXiv preprint arXiv:1810.02525.
- Rethinking the implementation tricks and monotonicity constraint in cooperative multi-agent reinforcement learning. arXiv preprint arXiv:2102.03479.
- The 37 implementation details of proximal policy optimization. In ICLR Blog Track. https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/.
- Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133.
- Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- kipply. 2022. Transformer inference arithmetic. https://kipp.ly/transformer-inference-arithmetic/.
- OpenAssistant Conversations – democratizing large language model alignment. arXiv preprint arXiv:2304.07327.
- Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. IEEE Transactions on Parallel and Distributed Systems, 31(1):94–110.
- NVIDIA. 2023. NeMo-Aligner: Scalable toolkit for efficient model alignment. https://github.com/NVIDIA/NeMo-Aligner.
- OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt.
- OpenAI. 2023. GPT-4 technical report.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Jan Peters and Stefan Schaal. 2007. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pages 745–750.
- ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
- High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Noam Shazeer. 2020. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.
- Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.
- Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
- RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.
- Han Vanholder. 2016. Efficient inference with TensorRT. In GPU Technology Conference, volume 1.
- Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
- DeepSpeed-Chat: Easy, fast and affordable RLHF training of ChatGPT-like models at all scales.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
Authors: Jian Hu, Li Tao, June Yang, Chandler Zhou