- The paper introduces a framework that decouples response-level rewards into estimated token-level rewards for efficient LLM alignment.
- It demonstrates that optimizing around 30% of key tokens significantly enhances performance on benchmarks such as Arena-Hard and AlpacaEval 2.0.
- It validates a weak-to-strong supervision strategy, showing that small oracle models can effectively guide stronger policy models in diverse data scenarios.
Selective Preference Optimization via Token-Level Reward Function Estimation
Selective Preference Optimization (SePO) is proposed as an approach to addressing the inefficiency and complexity of existing token-level alignment methods for LLMs. The method rests on the premise that not all tokens in a dataset contribute equally to alignment, and it therefore optimizes only a selected subset of key tokens.
The SePO framework consists of three primary stages: oracle modeling with Direct Preference Optimization (DPO), token-level reward estimation, and contrastive optimization of the target policy model.
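A minimal sketch of the second stage, token-level reward estimation, is given below. It assumes, in the spirit of DPO-style implicit rewards, that a token's reward can be read off as the scaled log-probability ratio between the DPO-trained oracle and its reference model; the function name, tensor shapes, and `beta` scaling are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def token_level_rewards(oracle_logits: torch.Tensor,
                        ref_logits: torch.Tensor,
                        input_ids: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Estimate per-token rewards as the scaled log-probability ratio between
    the DPO-trained oracle model and the frozen reference model.

    oracle_logits, ref_logits: [batch, seq_len, vocab] logits for the same input.
    input_ids:                 [batch, seq_len] token ids of prompt + response.
    Returns:                   [batch, seq_len - 1] reward for each predicted token.
    """
    # Log-probabilities each model assigns to the tokens that actually follow.
    oracle_logp = F.log_softmax(oracle_logits[:, :-1], dim=-1)
    ref_logp = F.log_softmax(ref_logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)

    oracle_token_logp = oracle_logp.gather(-1, targets).squeeze(-1)
    ref_token_logp = ref_logp.gather(-1, targets).squeeze(-1)

    # DPO's implicit response-level reward, decomposed to individual tokens.
    return beta * (oracle_token_logp - ref_token_logp)
```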
Key Contributions
- Efficient Oracle Model Training: SePO trains an oracle model via DPO on a moderately sized preference dataset. This decouples the response-level reward into token-level rewards (as sketched above) without requiring extensive fine-grained token annotation. The oracle model's role is to estimate the token-level reward function that guides token selection.
- Selective Token Optimization: The token-level reward function derived from the oracle model is used to score every token in the larger target dataset. Only a specified fraction of tokens is then selected for supervision: those with the highest rewards in chosen responses and those with the lowest rewards in rejected responses (see the sketch after this list). By restricting optimization to these key tokens, SePO remains cost-efficient and avoids the noise and redundancy of full-sequence supervision.
- Weak-to-Strong Generalization: A significant benefit of SePO is its applicability to weak-to-strong generalization, where a weak oracle model supervises a larger and more capable policy model. The paradigm also extends to out-of-distribution data, where selective supervision mitigates over-optimization while still extracting useful supervisory signals.
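Continuing the sketch above, the following hedged example shows how key tokens might be selected and how a DPO-style contrastive loss could then be restricted to them. The `select_ratio` default, the masking scheme, and the sigmoid loss form are assumptions chosen for illustration, not the paper's verbatim objective.

```python
import torch
import torch.nn.functional as F


def select_key_tokens(rewards: torch.Tensor,
                      response_mask: torch.Tensor,
                      select_ratio: float = 0.3,
                      chosen: bool = True) -> torch.Tensor:
    """Keep the highest-reward fraction of tokens in a chosen response,
    or the lowest-reward fraction in a rejected response.

    rewards:       [batch, seq_len] oracle-estimated token rewards.
    response_mask: [batch, seq_len] 1 for response tokens, 0 for prompt/padding.
    Returns:       [batch, seq_len] binary selection mask.
    """
    # Push non-response positions to the losing end of the ranking.
    fill = float("-inf") if chosen else float("inf")
    scores = rewards.masked_fill(response_mask == 0, fill)

    # Number of tokens to keep per sequence (at least one).
    n_valid = response_mask.sum(dim=-1, keepdim=True)
    k = (select_ratio * n_valid).long().clamp(min=1)

    # Rank tokens within each sequence; keep the top-k by reward for chosen
    # responses, or the bottom-k for rejected responses.
    order = scores.argsort(dim=-1, descending=chosen)
    ranks = order.argsort(dim=-1)
    return ((ranks < k) & (response_mask == 1)).float()


def sepo_style_loss(policy_logratio_chosen: torch.Tensor,
                    policy_logratio_rejected: torch.Tensor,
                    mask_chosen: torch.Tensor,
                    mask_rejected: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """DPO-style contrastive objective computed only over selected tokens.

    policy_logratio_*: [batch, seq_len] per-token log(pi_theta / pi_ref).
    mask_*:            [batch, seq_len] selection masks from select_key_tokens.
    """
    chosen_score = (policy_logratio_chosen * mask_chosen).sum(dim=-1)
    rejected_score = (policy_logratio_rejected * mask_rejected).sum(dim=-1)
    return -F.logsigmoid(beta * (chosen_score - rejected_score)).mean()
```

In this reading, the policy only receives gradient signal through the selected key tokens, which is what keeps the optimization both cheap and robust to noisy, low-value tokens.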
Experimental Validation
The efficacy of SePO is validated through extensive experiments on multiple public benchmarks: AlpacaEval 2.0, MT-Bench, and Arena-Hard. The experiments involved several model families, including LLaMA and Pythia, with various configurations of policy models and oracle models.
Results and Analysis
- Comparative Performance: SePO consistently outperforms baseline methods such as DPO, IPO, and SimPO while optimizing only 30% of the tokens in the target dataset, achieving notable gains in win rate on Arena-Hard and LC win rate on AlpacaEval 2.0.
- Token Selection Proportion Impact: Increasing the token selection ratio improves performance, but the marginal gains diminish beyond a point; performance generally peaked at a selection ratio of around 30%.
- Data Scale and Oracle Training: The results also highlight the importance of training-data scale for oracle modeling. Oracle models trained on larger data proportions guide the target model more effectively, with a notable improvement when around 70% of the target dataset is used.
- Weak-to-Strong Supervision: In the weak-to-strong setting, even a small oracle model substantially improved strong policy models. For example, a Pythia-410M oracle effectively guided Pythia-SFT-6.9B, reaching LC win rates that surpassed baselines optimized on all tokens.
Implications and Future Directions
SePO offers a nuanced approach to preference optimization by focusing on selective training, which is not only cost-efficient but also improves the stability and performance of LLMs. The implications extend to improving the alignment process for stronger models like LLaMA2-Chat-13B and providing a robust framework for integrating out-of-distribution data effectively.
Future developments in this field could explore the scalability of SePO to even larger model configurations, such as LLaMA2-Chat-70B, and test its adaptability across different model families and vocabularies. Additionally, the framework could be refined to improve weak oracle model performance, providing a more reliable estimate of token-level rewards.
Conclusion
Selective Preference Optimization (SePO) addresses crucial challenges in the alignment of LLMs by leveraging efficient token-level reward estimation and selective optimization strategies. This method significantly improves upon existing alignment processes in terms of efficiency, effectiveness, and scalability, offering valuable insights and practical applications in enhancing the capabilities of LLMs. By advancing weak-to-strong generalization techniques, SePO paves the way for future research to further explore and extend these improvements in the broader context of AI model training and alignment.