Are You Sure? Rank Them Again: Repeated Ranking For Better Preference Datasets (2405.18952v2)

Published 29 May 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Training LLMs with Reinforcement Learning from AI Feedback (RLAIF) aligns model outputs more closely with human preferences. This involves an evaluator model ranking multiple candidate responses to user prompts. However, the rankings from popular evaluator models such as GPT-4 can be inconsistent. We propose the Repeat Ranking method - where we evaluate the same responses multiple times and train only on those responses which are consistently ranked. Using 2,714 prompts in 62 languages, we generated responses from 7 top multilingual LLMs and had GPT-4 rank them five times each. Evaluating on MT-Bench chat benchmarks in six languages, our method outperformed the standard practice of training on all available prompts. Our work highlights the quality versus quantity trade-off in RLAIF dataset generation and offers a stackable strategy for enhancing dataset and thus model quality.

Repeat Ranking in RLAIF for Training Multilingual LLMs

Introduction

Reinforcement Learning from AI Feedback (RLAIF) has emerged as a pivotal methodology for aligning LLMs with human preferences. Because an evaluator model is used to rank multiple candidate responses to each user prompt, the consistency of those rankings becomes crucial. This paper introduces the Repeat Ranking method, in which model responses are evaluated multiple times and only the responses ranked consistently across evaluations are used for training, under the hypothesis that this yields better downstream performance.
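As a rough illustration of the procedure, the sketch below collects repeated rankings for each prompt. It is not the authors' implementation; `rank_with_judge` is a hypothetical placeholder for a single call to the evaluator model (e.g., GPT-4) that returns one ranking of the candidate responses.

```python
# Minimal sketch of the Repeat Ranking collection loop (illustrative only).
# `rank_with_judge` is a hypothetical placeholder for one evaluator-model call
# that returns a ranking (list of ranks) over the candidate responses.
from typing import Callable, Dict, List

def collect_repeated_rankings(
    prompts_to_responses: Dict[str, List[str]],
    rank_with_judge: Callable[[str, List[str]], List[int]],
    n_repeats: int = 5,
) -> Dict[str, List[List[int]]]:
    """For each prompt, ask the judge to rank the same responses n_repeats times."""
    repeated: Dict[str, List[List[int]]] = {}
    for prompt, responses in prompts_to_responses.items():
        repeated[prompt] = [
            rank_with_judge(prompt, responses) for _ in range(n_repeats)
        ]
    return repeated
```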

Methodology

The paper follows a comprehensive approach to creating a multilingual preference dataset and applying the Repeat Ranking method. Initially, 7 state-of-the-art multilingual LLMs generated responses to 2,714 prompts across 62 languages. GPT-4 ranked these responses five times each, and the consistency of each prompt's five rankings was measured with Kendall's W (coefficient of concordance). Training sets were then created from the top 25%, 50%, 75%, and 100% of prompts, ordered by ranking consistency.
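A minimal sketch of this consistency filter follows, assuming each prompt's five rankings are stored as a 5×7 array of ranks (five repeated rankings over the seven candidate responses). It applies the standard Kendall's W formula without tie correction and is an illustration, not the authors' code; `top_consistent` corresponds to keeping the 25%, 50%, or 75% of prompts with the highest W.

```python
# Consistency filter sketch: score each prompt by Kendall's W across its
# repeated rankings, then keep the most consistently ranked fraction.
import numpy as np

def kendalls_w(rankings: np.ndarray) -> float:
    """rankings: (m, n) array of ranks; m repeated rankings over n responses."""
    m, n = rankings.shape
    rank_sums = rankings.sum(axis=0)                 # total rank per response
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()  # sum of squared deviations
    return 12.0 * s / (m ** 2 * (n ** 3 - n))        # no tie correction

def top_consistent(prompt_rankings: dict, fraction: float) -> list:
    """Keep the `fraction` of prompts with the highest Kendall's W."""
    scored = sorted(prompt_rankings.items(),
                    key=lambda kv: kendalls_w(np.asarray(kv[1])),
                    reverse=True)
    k = int(len(scored) * fraction)
    return [prompt for prompt, _ in scored[:k]]

# Example: five identical rankings of seven responses give W = 1.0
example = np.array([[1, 2, 3, 4, 5, 6, 7]] * 5)
assert kendalls_w(example) == 1.0
```

W equals 1 when the repeated rankings agree exactly and approaches 0 when they are essentially uncorrelated, which is what makes it a natural consistency score here.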

The models trained using these subsets were compared against each other and against a baseline model in multilingual chat benchmarks. Additionally, the efficacy of the Repeat Ranking method was contrasted with training models on randomly selected subsets, as well as training solely on responses from the best and worst performing models.
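Before training, the filtered prompts have to be converted into preference pairs. The pairing rule below, which takes the response with the best average rank across the five evaluations as "chosen" and the one with the worst average rank as "rejected", is an illustrative assumption rather than the authors' exact construction.

```python
# Build (prompt, chosen, rejected) rows from the consistently ranked prompts.
# Pairing rule (assumed for illustration): best vs. worst average rank.
import numpy as np

def build_preference_pairs(prompts_to_responses, repeated_rankings, keep_prompts):
    """repeated_rankings[prompt] is an (m, n) list of m rankings over n responses;
    a lower rank means a better response."""
    rows = []
    for prompt in keep_prompts:
        responses = prompts_to_responses[prompt]
        mean_rank = np.asarray(repeated_rankings[prompt], dtype=float).mean(axis=0)
        rows.append({
            "prompt": prompt,
            "chosen": responses[int(mean_rank.argmin())],    # best average rank
            "rejected": responses[int(mean_rank.argmax())],  # worst average rank
        })
    return rows
```

Rows in this (prompt, chosen, rejected) format match the schema commonly expected by preference-optimization trainers such as TRL's ORPOTrainer.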

Results

The key findings are:

  • Models trained only on the most consistently ranked responses (top 25%, 50%, and 75%) outperformed the model trained on all data across multiple languages tested in the MT-Bench chat benchmark.
  • Suzume-ORPO-50 (trained on the top 50% most consistent data) matched or exceeded the Suzume-ORPO-100 model in 5 out of 6 languages evaluated, demonstrating the effectiveness of training on less data of higher quality.
  • The best ORPO-trained model also exceeded GPT-3.5 Turbo in 4 out of 6 languages evaluated.

However, the paper also noted a trade-off on next-token prediction tasks: the ORPO-trained model using all data scored lower on the Belebele benchmark. Models trained on a consistent subset (particularly Suzume-ORPO-75 and Suzume-ORPO-25) fared better, achieving comparable or superior scores to the base model in several languages.

Discussion

The results emphasize the substantial improvement in chat capabilities that the Repeat Ranking method can deliver. They suggest that focusing on the consistency of training data evaluations can be more beneficial than maximizing the quantity of data. Selecting only the most consistently ranked responses improves evaluation scores and can also reduce the computational cost of training.

Furthermore, this work highlights the critical balance between data quality and quantity within RLAIF dataset construction, presenting a robust strategy for dataset refinement. Applying these methodologies can markedly improve existing datasets' quality, potentially bolstering the training of LLMs to achieve higher performance levels.

Future Directions

The implications of this research are far-reaching, suggesting multiple avenues for future exploration:

  1. Expanding Beyond RLAIF: Similar methodologies could be employed within Reinforcement Learning from Human Feedback (RLHF) datasets, investigating whether consistent human evaluations could similarly enhance performance.
  2. Diverse Evaluator Models: Incorporating evaluations from multiple high-performing LLMs (e.g., Claude 3, Gemini 1.5 Pro) might yield more comprehensive and unbiased response evaluations.
  3. Multi-turn Conversations: Extending datasets to include multi-turn conversations could provide a richer training ground for LLMs, enhancing their interactive capabilities.
  4. Task Difficulty Stratification: Filtering prompts based on task difficulty could optimize training resource allocation, potentially improving LLM performance in challenging tasks.
  5. Tool-Assisted Evaluations: Leveraging tools or agents to augment the evaluation capabilities (e.g., fact-checking using search tools, mathematical validation using calculators) could further refine preference datasets' quality.

Conclusion

The Repeat Ranking method presents a significant advancement in RLAIF, demonstrating that training on consistently ranked responses leads to better performance in multilingual chat tasks. This research underscores the importance of evaluation consistency in preference dataset creation, offering a strategic approach to enhance the quality and efficacy of LLM training. As future work continues to refine and expand upon these findings, the development of more proficient multilingual LLMs remains an exciting and promising frontier.

Authors (1)
  1. Peter Devine