Reward Modeling with Weak Supervision for Language Models (2410.20869v1)

Published 28 Oct 2024 in cs.CL

Abstract: Recent advancements in LLMs have led to their increased application across various tasks, with reinforcement learning from human feedback (RLHF) being a crucial part of their training to align responses with user intentions. In the RLHF process, a reward model is trained on response preferences determined by human labelers or AI systems and is then used to refine the LLM through reinforcement learning. This work introduces weak supervision as a strategy to extend RLHF datasets and enhance reward model performance. Weak supervision employs noisy or imprecise data labeling, reducing reliance on expensive manually labeled data. By analyzing RLHF datasets to identify heuristics that correlate with response preference, we wrote simple labeling functions and then calibrated a label model to weakly annotate unlabeled data. Our evaluation shows that while weak supervision significantly benefits smaller datasets by improving reward model performance, its effectiveness decreases with larger, originally labeled datasets. Additionally, using an LLM to generate responses and then weakly label them offers a promising method for extending preference data.
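
To make the pipeline in the abstract concrete, the sketch below illustrates the general weak-supervision pattern: hand-written labeling functions vote on which of two candidate responses is preferred, and a label model aggregates those noisy votes into probabilistic preference labels. This is a minimal sketch, assuming the Snorkel library and a pandas DataFrame with hypothetical columns response_a and response_b; the heuristics shown (response length, refusal phrasing) are illustrative placeholders, not the labeling functions used in the paper.

```python
# Minimal weak-supervision sketch (assumes the `snorkel` and `pandas` packages).
# Column names `response_a` / `response_b` and the heuristics are hypothetical.
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

PREFER_A, PREFER_B, ABSTAIN = 0, 1, -1


@labeling_function()
def lf_length(row):
    # Illustrative heuristic: weakly prefer the longer response; abstain when close.
    diff = len(row.response_a) - len(row.response_b)
    if abs(diff) < 50:
        return ABSTAIN
    return PREFER_A if diff > 0 else PREFER_B


@labeling_function()
def lf_refusal(row):
    # Illustrative heuristic: weakly prefer the response that is not a flat refusal.
    a_refuses = row.response_a.lower().startswith("i cannot")
    b_refuses = row.response_b.lower().startswith("i cannot")
    if a_refuses == b_refuses:
        return ABSTAIN
    return PREFER_B if a_refuses else PREFER_A


# Unlabeled response pairs (toy examples standing in for LLM-generated data).
unlabeled = pd.DataFrame({
    "response_a": [
        "Sure, here is a step-by-step explanation of how the algorithm works ...",
        "I cannot help with that.",
        "Short answer.",
    ],
    "response_b": [
        "I cannot help with that.",
        "Of course! First install the package, then configure it as follows ...",
        "Brief reply.",
    ],
})

# 1) Apply the labeling functions -> (num_examples x num_LFs) matrix of noisy votes.
applier = PandasLFApplier(lfs=[lf_length, lf_refusal])
L = applier.apply(df=unlabeled)

# 2) Calibrate a label model over the votes to estimate labeling-function
#    accuracies and produce probabilistic "A vs. B" preference labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=200, seed=0)
weak_preferences = label_model.predict_proba(L)
print(weak_preferences)
```

In the setup the abstract describes, preference labels produced this way extend the originally labeled dataset used to train the reward model, which, per the authors' evaluation, helps most when that original dataset is small.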

