An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training (2312.11819v3)

Published 19 Dec 2023 in cs.LG, cs.AI, and cs.CL

Abstract: Recently, LLMs such as ChatGPT and InstructGPT have made a significant impact in the AI world. Many works have attempted to reproduce InstructGPT's complex training pipeline, namely Reinforcement Learning from Human Feedback (RLHF). However, mainstream distributed RLHF training methods typically adopt a fixed model placement strategy, referred to as the Co-located strategy. This strategy treats the four interdependent models involved in RLHF as a single entity, distributes them across all devices, and applies parallelism techniques designed for a single model, regardless of the workload heterogeneity of each model. As a result, it exacerbates the generation bottleneck in RLHF training and degrades overall training efficiency. To address these issues, we propose a flexible model placement framework that offers two general and agile placement strategies. The Interleaving strategy reduces the memory redundancy and communication costs of RLHF training by placing models without data dependencies on mutually exclusive devices with careful orchestration. The Disaggregated strategy improves training throughput by separating the training and inference runtimes of the RLHF pipeline using additional shadow models. Furthermore, our framework provides a simple user interface and guidelines for easily and flexibly configuring these strategies in various training scenarios. Our experiments show that our strategies achieve improvements of up to 11x over current state-of-the-art (SOTA) approaches. The results highlight the effectiveness and adaptability of our methods in accelerating distributed RLHF training.
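
To make the placement strategies concrete, here is a minimal, hypothetical Python sketch of how the Co-located, Interleaving, and Disaggregated strategies might map the four RLHF models onto device ranks. None of these names come from the paper's actual API, and the device-split ratios are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical sketch contrasting the three placement strategies from the
# abstract. All class/function names are illustrative, not the paper's API.
from dataclasses import dataclass

# The four interdependent RLHF models: actor, critic, reward, reference.
RLHF_MODELS = ["actor", "critic", "reward", "reference"]

@dataclass
class Placement:
    """Maps each model to the set of device ranks it occupies."""
    name: str
    assignment: dict

def co_located(n_devices: int) -> Placement:
    # Baseline: every model is sharded across all devices, so each device
    # must hold shards of all four models regardless of workload.
    all_devices = set(range(n_devices))
    return Placement("co-located", {m: all_devices for m in RLHF_MODELS})

def interleaving(n_devices: int) -> Placement:
    # Models without data dependencies share no devices: e.g. the frozen
    # reward/reference models occupy one pool, actor/critic the other,
    # reducing per-device memory redundancy and communication.
    half = n_devices // 2
    pool_a, pool_b = set(range(half)), set(range(half, n_devices))
    return Placement("interleaving", {
        "actor": pool_a, "critic": pool_a,
        "reward": pool_b, "reference": pool_b,
    })

def disaggregated(n_devices: int) -> Placement:
    # Training and generation run on disjoint pools; a "shadow" replica of
    # the actor serves generation while the trainable actor stays on the
    # training pool (2:1 split is an arbitrary illustrative choice).
    split = (2 * n_devices) // 3
    train, gen = set(range(split)), set(range(split, n_devices))
    assignment = {m: train for m in RLHF_MODELS}
    assignment["actor_shadow"] = gen  # inference-only replica
    return Placement("disaggregated", assignment)

if __name__ == "__main__":
    for strategy in (co_located, interleaving, disaggregated):
        p = strategy(8)
        print(p.name)
        for model, devices in p.assignment.items():
            print(f"  {model:12s} -> ranks {sorted(devices)}")
```

Running the sketch on 8 ranks prints each strategy's model-to-device mapping, making it easy to see why Interleaving frees per-device memory and why Disaggregated decouples generation from training at the cost of an extra actor copy.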

Authors (9)
  1. Youshao Xiao
  2. Weichang Wu
  3. Zhenglei Zhou
  4. Fagui Mao
  5. Shangchun Zhao
  6. Lin Ju
  7. Lei Liang
  8. Xiaolu Zhang
  9. Jun Zhou
Citations (3)