- The paper introduces Attention Based Credit (ABC), which redistributes the single sequence-level scalar reward into token-level feedback by leveraging the attention weights of the transformer-based reward model.
- The method is theoretically validated as equivalent to potential-based reward shaping, ensuring that optimal policies remain unaffected.
- Empirical evaluations on sentiment generation, summarization, and dialogue show more stable training, faster convergence, and, in some cases, better local optima.
Dense Reward for Free in Reinforcement Learning from Human Feedback
The paper "Dense Reward for Free in Reinforcement Learning from Human Feedback" by Alex J. Chan, Hao Sun, Samuel Holt, and Mihaela van der Schaar addresses a core challenge in the application of Reinforcement Learning from Human Feedback (RLHF) to LLMs. In RLHF, a critical issue is the sparsity of rewards; LLMs receive a single reward signal at the end of a sequence rather than immediate feedback at each action or token generation step. This often complicates the optimization process, making it challenging for RL algorithms to efficiently and effectively learn.
Main Contributions
- Leveraging Attention for Reward Redistribution: The authors propose using the attention weights of the transformer-based reward model to distribute the scalar reward along the completion. This method, termed Attention Based Credit (ABC), densifies the reward signal by assigning a share of the final reward to each token according to how much the reward model attended to it when making its prediction. Because those attention weights are produced anyway when the reward model scores the completion, the denser feedback comes at essentially no extra computational cost (a minimal sketch of the idea follows this list).
- Theoretical Validation: The paper shows that ABC is equivalent to potential-based reward shaping, so reshaping the reward in this way does not change the set of optimal policies; an agent optimizing the reshaped reward can still discover policies that maximize the original reward criteria (the standard shaping identity is reproduced after this list).
- Empirical Evaluation: The authors validate the method on diverse tasks, including sentiment generation, summarization, and dialogue generation. The experiments indicate that ABC stabilizes training, accelerates convergence, and can lead to better local optima.
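As a concrete reading of the ABC idea, here is a minimal PyTorch sketch. It assumes the attention weights have already been extracted from the reward model (for example, the attention paid by the position the reward head reads from to each completion token, averaged over heads); the function name, the layer/head choice, and the mixing coefficient `beta` are illustrative assumptions rather than the authors' exact implementation.

```python
import torch

def abc_redistribute(scalar_reward: float,
                     attn_weights: torch.Tensor,
                     beta: float = 0.5) -> torch.Tensor:
    """Turn a sequence-level scalar reward into per-token rewards using
    attention weights taken from the reward model (ABC-style sketch).

    attn_weights: non-negative attention paid to each completion token,
        shape [seq_len]; which layer/heads to read it from is a design choice.
    beta: interpolates between the attention-shaped dense reward (beta=1)
        and the original sparse terminal reward (beta=0).
    """
    # Normalize so the attention shares over the completion sum to 1.
    shares = attn_weights / attn_weights.sum()

    # Dense part: each token gets a slice of the reward proportional to
    # how much the reward model attended to it.
    per_token = beta * shares * scalar_reward

    # Sparse part: the remaining (1 - beta) of the reward stays on the
    # final token, as in standard RLHF.
    per_token[-1] += (1.0 - beta) * scalar_reward
    return per_token


# Example with hypothetical attention weights: the per-token rewards sum
# back to the original scalar reward of 2.0.
attn = torch.tensor([0.05, 0.40, 0.15, 0.40])
dense_reward = abc_redistribute(2.0, attn, beta=0.5)
```

Because `beta` only interpolates between the original sparse reward and its attention-weighted redistribution, the total reward handed to the RL algorithm per completion is unchanged.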
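For reference, the policy-invariance guarantee the paper appeals to is the classic potential-based shaping result of Ng et al. (1999): for any potential function Φ defined over states, adding the difference of potentials to the reward leaves the set of optimal policies unchanged, because the added terms telescope along any trajectory. The notation below is the standard one and not necessarily the paper's.

```latex
% Potential-based reward shaping (Ng et al., 1999): for any potential
% function \Phi over states and discount factor \gamma, the shaped reward
% \tilde{r} admits the same optimal policies as the original reward r.
\tilde{r}(s_t, a_t, s_{t+1}) = r(s_t, a_t, s_{t+1}) + \gamma\,\Phi(s_{t+1}) - \Phi(s_t)
```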
Implications and Future Work
Practical Implications:
Refining the reward signal in this way could significantly improve the performance and scalability of RLHF, particularly for LLMs, where feedback sparsity is a bottleneck. By redistributing the reward at the token level, practitioners can fine-tune LLMs more stably and efficiently with little additional computational overhead, improving fine-tuning across a variety of task-specific models.
Theoretical Implications:
The research highlights the importance of attention mechanisms not just as a feature of model architectures but as a potential tool for improving training methodologies in RLHF. This insight invites further exploration of multi-head attention's role beyond token prediction, possibly extending to other aspects of model interpretability and optimization.
Future Directions:
This work lays a foundation for further investigation into how attention-based methods can enhance RL from both theoretical and applied perspectives. Future research could examine different transformer architectures and attention-head configurations to determine how each affects reward redistribution, and extending the framework to non-language domains could demonstrate its utility in broader RL applications.
In summary, the paper presents a compelling advance: dense, token-level reward signals for RLHF on LLMs, obtained essentially for free, and a step toward making the interplay between human feedback and reinforcement learning yield more capable and efficiently trained models.