
LIONs: An Empirically Optimized Approach to Align Language Models (2407.06542v2)

Published 9 Jul 2024 in cs.CL

Abstract: Alignment is a crucial step to enhance the instruction-following and conversational abilities of LLMs. Despite many recent works proposing new algorithms, datasets, and training pipelines, there is a lack of comprehensive studies measuring the impact of various design choices throughout the whole training process. We first conduct a rigorous analysis over a three-stage training pipeline consisting of supervised fine-tuning, offline preference learning, and online preference learning. We find that techniques such as sequence packing, loss masking in SFT, increasing the preference dataset size in DPO, and online DPO training can significantly improve the performance of LLMs. We then train from Gemma-2b-base and Llama-3-8b-base, and find that our best models exceed the performance of the official instruct models tuned with closed-source data and algorithms. Our code and models can be found at \url{https://github.com/Columbia-NLP-Lab/LionAlignment}.
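The offline preference-learning stage of the pipeline described above uses DPO. As a rough illustration only (not the paper's exact implementation), the sketch below shows the standard DPO objective, assuming summed per-response log-probabilities under the policy and a frozen reference model have already been computed; all names are illustrative.

```python
# Minimal sketch of the DPO loss used in offline preference learning.
# Assumes each input is a (batch,) tensor of summed log-probs for the
# chosen/rejected response under the policy or the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward: beta * (log pi_theta - log pi_ref) per response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin between chosen and rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities.
batch = (torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(dpo_loss(*batch).item())
```

In the online variant studied in the paper, the preference pairs are generated from the current policy and labeled on the fly rather than drawn from a fixed offline dataset; the loss itself takes the same form.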

Authors (4)
  1. Xiao Yu (66 papers)
  2. Qingyang Wu (29 papers)
  3. Yu Li (377 papers)
  4. Zhou Yu (206 papers)
Citations (1)