
LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits (2410.01735v1)

Published 2 Oct 2024 in cs.CL and cs.LG

Abstract: Reward Models (RMs) play a crucial role in aligning LLMs with human preferences, enhancing their performance by ranking outputs during inference or iterative training. However, the degree to which an RM generalizes to new tasks is often not known a priori (e.g. some RMs may excel at scoring creative writing vs. math reasoning). Therefore, using only one fixed RM while training LLMs can be suboptimal. Moreover, optimizing LLMs with multiple RMs simultaneously can be prohibitively computationally-intensive and challenging due to conflicting signals from different RMs, potentially degrading performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which iteratively trains LLMs using multiple RMs, selecting and utilizing the most well-suited RM for each instance to rank outputs and generate preference data, framed as a multi-armed bandit problem. Our results on commonsense and math reasoning tasks demonstrate that LASeR can boost iterative LLM optimization by optimizing for multiple RMs, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over training with ensemble RM scores while also showing superior training efficiency (e.g., a 2x speedup). Moreover, on WildChat, a benchmark of instruction-following prompts, we find that using Llama-3-8B LASeR leads to a 71.45% AlpacaEval win rate over sequentially optimizing multiple RMs. Extending to long-context generation tasks, we find that on Llama-3-8B, LASeR achieves an average improvement of 2.64 F1 and 2.42 F1 on single- and multi-document QA over random RM selection when used with best-of-n sampling. LASeR is robust to noisy rewards and generalizes to multiple settings. Finally, LASeR's RM selection changes depending on the underlying task or instance and we verify the presence of conflicting preferences from multiple RMs that can be mitigated using LASeR.

Summary

  • The paper introduces the LASeR method, which adaptively selects multiple reward models to better align LLMs with diverse human preferences.
  • It utilizes a LinUCB-based multi-armed bandit framework to dynamically choose the optimal reward model for each training instance.
  • Empirical results show that LASeR improves training efficiency and task accuracy compared to fixed or ensemble reward model approaches.

Overview of LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits

The paper, "LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits," introduces an innovative approach to optimizing LLMs by adaptively selecting relevant reward models (RMs) during training. The motivation behind this work stems from the realization that using a single fixed RM to align LLMs with human preferences can be suboptimal, given the diverse nature of tasks and the RMs themselves.

Key Contributions

  1. Multi-Reward Model Utilization: The work presents the LASeR method that leverages multiple RMs, each potentially better suited to different tasks. These RMs can vary greatly, with some excelling in domains such as creative writing and others in mathematical reasoning.
  2. Multi-Armed Bandit (MAB) Framework: The paper frames the selection of RMs as a multi-armed bandit problem, allowing the model to dynamically choose the best-suited RM based on the context and past interactions. Specifically, LinUCB, a contextual bandit algorithm, is employed to facilitate this process (a minimal sketch of this selection loop follows this list).
  3. Empirical Validation: The method was tested across several domains, covering commonsense and mathematical reasoning, open-ended instruction following, and long-context QA. LASeR improved Llama-3-8B's absolute average accuracy across three reasoning datasets by 2.67% over training with ensemble RM scores, achieved a 71.45% AlpacaEval win rate on WildChat over sequentially optimizing multiple RMs, and gained 2.64 and 2.42 F1 on single- and multi-document QA over random RM selection with best-of-n sampling.
  4. Training Efficiency and Robustness: LASeR demonstrated superior training efficiency, reducing training time while effectively handling noisy or conflicting signals from multiple RMs. Importantly, it allows adaptive selection based on individual instances, proving more efficient than sequential or random RM selection.
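
To make the selection mechanism concrete, here is a minimal LinUCB sketch for per-instance RM selection. The class name, context features, and feedback signal are illustrative assumptions rather than the paper's exact implementation; in LASeR the context would be an embedding of the training instance and the bandit feedback would reflect how much the LLM improves after training on preference data ranked by the chosen RM.

```python
import numpy as np

class LinUCBSelector:
    """Minimal LinUCB sketch for per-instance reward-model (RM) selection.

    Hypothetical illustration: names, dimensions, and the feedback signal
    are assumptions, not the paper's exact implementation.
    """

    def __init__(self, num_rms: int, context_dim: int, alpha: float = 1.0):
        self.alpha = alpha
        # One ridge-regression state (A, b) per RM "arm".
        self.A = [np.eye(context_dim) for _ in range(num_rms)]
        self.b = [np.zeros(context_dim) for _ in range(num_rms)]

    def select(self, x: np.ndarray) -> int:
        """Pick the RM whose upper confidence bound is highest for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                      # estimated arm parameters
            ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(ucb)
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        """Update the chosen arm with the observed feedback signal."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


# Usage sketch: embed the instance, pick an RM, rank candidate generations with it,
# then feed the observed training signal back into the bandit.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    selector = LinUCBSelector(num_rms=3, context_dim=8)
    for _ in range(100):
        x = rng.normal(size=8)              # stand-in for an instance embedding
        arm = selector.select(x)
        reward = rng.normal(loc=arm * 0.1)  # placeholder downstream feedback
        selector.update(arm, x, reward)
```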

Practical Implications

The ability to dynamically select the most applicable RM for each task or instance represents a significant advancement in LLM training. By addressing the shortcomings of fixed RM selection, LASeR enhances both the diversity and generalization capabilities of LLMs. This approach is particularly beneficial in heterogeneous domains, where user queries or prompts can vary significantly in nature and complexity.

Theoretical Implications

The principal theoretical contribution lies in the application of MABs to RM selection, offering an adaptive framework that balances exploration and exploitation. RMs are selected not only on the basis of past success but also with an exploration bonus for arms whose value is still uncertain, which helps mitigate the well-known difficulties of multi-objective optimization, such as conflicting reward signals.
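
For reference, the generic LinUCB arm score that embodies this exploration/exploitation trade-off can be written as follows; the notation here is the standard LinUCB formulation, assumed rather than copied from the paper.

```latex
% Generic LinUCB score for RM arm a given instance context x_t (assumed notation):
\[
  \mathrm{UCB}_a(x_t) = \hat{\theta}_a^{\top} x_t + \alpha \sqrt{x_t^{\top} A_a^{-1} x_t},
  \qquad \hat{\theta}_a = A_a^{-1} b_a .
\]
% The first term exploits the arm's estimated value for this context; the second is a
% confidence width that favors exploring RMs with little data for similar contexts.
```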

Future Directions

Future advancements could explore integrating additional types of RMs or refining bandit algorithms further. Investigating the use of LASeR in conjunction with LLMs as self-assessing judges might present an avenue for minimizing reliance on human feedback datasets. Furthermore, as the size and capabilities of RMs expand, ensuring computational scalability will be a crucial focus.

Conclusion

The research presents LASeR as an effective method for enhancing LLM training through adaptive RM selection. By combining multiple RMs and employing MAB strategies, this approach aligns LLM optimization more closely with the nuanced and varied nature of human preferences, setting the stage for AI systems that can navigate diverse tasks and domains more flexibly.
