Balancing Enhancement, Harmlessness, and General Capabilities: Enhancing Conversational LLMs with Direct RLHF (2403.02513v1)

Published 4 Mar 2024 in cs.CL

Abstract: Recent work on conversational LLMs has revealed a concerning trend: many new base LLMs lose part of their foundational knowledge and capabilities after Supervised Fine-Tuning (SFT), a process that often manifests as forgetting or a general decline in the base model's abilities. Moreover, fine-tuned models can struggle to align with user preferences and inadvertently produce more toxic outputs when specifically prompted. To overcome these challenges, we bypass SFT entirely and directly apply Harmless Reinforcement Learning from Human Feedback (RLHF). Our method not only preserves the base model's general capabilities but also significantly enhances its conversational abilities, while notably reducing the generation of toxic outputs. This approach has important implications for fields that demand nuanced understanding and generation of responses, such as customer service. We applied the methodology to Mistral, the most popular base model, to create Mistral-Plus. Validation across 11 general tasks shows that Mistral-Plus outperforms similarly sized open-source base models and their corresponding instruct versions. Importantly, the conversational abilities of Mistral-Plus improved significantly, indicating a substantial advance over traditional SFT models in both safety and user-preference alignment.
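
In practical terms, the approach described in the abstract amounts to running reward-model-guided PPO directly on the pretrained base model, with no SFT stage in between. The following is a minimal sketch only, assuming Hugging Face TRL's PPOTrainer API (circa early 2024); the checkpoint name, hyperparameters, and the constant placeholder for the harmlessness reward score are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of "direct RLHF": PPO on the *base* model, no SFT stage.
# Assumes Hugging Face TRL's classic PPOTrainer API (circa early 2024).
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(
    model_name="mistralai/Mistral-7B-v0.1",  # assumed base checkpoint
    learning_rate=1e-5,                      # assumed hyperparameters
    batch_size=1,
    mini_batch_size=1,
)

# The policy is initialised from the raw pretrained weights; a frozen copy acts
# as the KL reference, which is what keeps the base model's general abilities intact.
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One PPO step on a single prompt.
query = tokenizer.encode("How do I politely decline a meeting?", return_tensors="pt")[0]
response = ppo_trainer.generate(query, max_new_tokens=64, return_prompt=False)[0]

# The paper's harmlessness/helpfulness reward model would score the response here;
# a constant stands in for that score in this sketch.
reward = [torch.tensor(1.0)]
stats = ppo_trainer.step([query], [response], reward)
```

The design choice this mirrors is that the KL reference is the pretrained base model itself, so the only optimization pressure comes from the human-preference reward rather than from imitating SFT demonstrations.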
