Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs (2410.18451v1)
Abstract: In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs -- significantly smaller than existing datasets. Using this curated dataset, we develop the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.
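The abstract describes curating preference pairs and training reward models on them; it does not spell out the training objective here. As a point of reference, the sketch below shows the standard Bradley-Terry pairwise loss commonly used to train reward models on chosen/rejected pairs. It is a minimal illustration, not the paper's implementation: the `reward_model` interface and the commented usage are assumptions for the example.

```python
import torch
import torch.nn.functional as F


def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry objective: push the scalar reward of the
    chosen response above that of the rejected response for each pair."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Illustrative usage (hypothetical `reward_model` scoring prompt/response pairs):
# chosen_scores = reward_model(prompts, chosen_responses)      # shape: (batch,)
# rejected_scores = reward_model(prompts, rejected_responses)  # shape: (batch,)
# loss = bradley_terry_loss(chosen_scores, rejected_scores)
# loss.backward()
```

Under this objective, a curated 80K-pair collection is simply a set of (prompt, chosen, rejected) triples; the paper's contribution lies in how those pairs are selected and filtered rather than in the loss itself.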