TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback (2407.16574v2)
Abstract: Reinforcement Learning from Human Feedback (RLHF) leverages human preference data to train LLMs to align more closely with human preferences. These preference data, however, are labeled at the sequence level, creating a mismatch between the sequence-level preference labels and the tokens that the LLM generates autoregressively. Although several recent approaches have tried to provide token-level (i.e., dense) rewards for each individual token, these typically rely on predefined discrete reward values (e.g., positive: +1, negative: -1, neutral: 0) and fail to account for the varying degrees of preference inherent to each token. To address this limitation, we introduce TLCR (Token-Level Continuous Reward) for RLHF, which trains a discriminator to distinguish positive from negative tokens and uses the discriminator's confidence to assign a continuous, context-aware reward to each token. Extensive experiments show that our proposed TLCR leads to consistent performance improvements over previous sequence-level or token-level discrete rewards on open-ended generation benchmarks.
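To make the core idea concrete, below is a minimal sketch of how a token-level discriminator's confidence can be turned into continuous per-token rewards. The discriminator architecture, the linear mapping of confidence to the [-1, 1] range, and all names and shapes here (`TokenDiscriminator`, `hidden_dim`, the random stand-in features) are illustrative assumptions for exposition, not the paper's exact design.

```python
# Sketch: continuous token-level rewards from a discriminator's confidence.
# Assumption: a binary "positive vs. negative token" head over contextual
# token features, with confidence rescaled to a reward in [-1, 1].
import torch
import torch.nn as nn


class TokenDiscriminator(nn.Module):
    """Hypothetical per-token positive/negative discriminator."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Binary head: logit > 0 means the token is judged "positive" in context.
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) contextual token features.
        p_positive = torch.sigmoid(self.head(hidden_states)).squeeze(-1)
        # Map confidence in [0, 1] to a continuous reward in [-1, 1]:
        # confidently positive -> near +1, confidently negative -> near -1,
        # uncertain tokens -> near 0 (rather than a fixed discrete label).
        return 2.0 * p_positive - 1.0


if __name__ == "__main__":
    torch.manual_seed(0)
    disc = TokenDiscriminator(hidden_dim=16)
    hidden = torch.randn(2, 5, 16)   # stand-in for encoder outputs per token
    token_rewards = disc(hidden)     # (2, 5) continuous per-token rewards
    print(token_rewards)
```

In an RLHF loop, such per-token rewards would replace (or supplement) a single sequence-level score when computing the policy-gradient objective; the exact way TLCR combines them with PPO follows the paper, not this sketch.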