TI-DPO: Token-Importance Guided DPO
- TI-DPO is a class of methods that refine Direct Preference Optimization by emphasizing influential tokens for improved model alignment.
- The approach integrates token-level discriminative weighting to adaptively modulate gradient updates and address issues like reward dilution and length bias.
- By focusing on the tokens most impactful for preference outcomes, TI-DPO offers a practical solution to challenges in sequence-level preference modeling.
Token-Importance Guided Direct Preference Optimization (TI-DPO) is a class of methods designed to enhance preference alignment for large language models (LLMs) by integrating token-level discriminative weighting into the standard Direct Preference Optimization (DPO) paradigm. Unlike conventional DPO, which aggregates the log-likelihood ratio uniformly over all tokens in a sequence, TI-DPO uses fine-grained token-level signals to emphasize critical tokens and adaptively modulate gradient updates. These approaches target the shortcomings of sequence-level preference modeling, such as reward dilution, length bias, and weak discrimination at salient tokens, by focusing learning on the tokens that most affect preference outcomes.
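To make the contrast between uniform and token-weighted aggregation concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the token-weighted objective is a weighted sum of per-token log-likelihood ratios placed inside the usual DPO logistic loss. The function names (`sequence_margin`, `dpo_loss`), the example log-probabilities, and the weight values are all hypothetical, and the way TI-DPO actually derives its importance weights is not modeled here.

```python
import math

def sequence_margin(policy_logps, ref_logps, weights=None):
    """Aggregate per-token log-likelihood ratios for one completion.

    With weights=None every token counts equally (standard DPO).
    Passing weights gives an assumed token-weighted variant.
    """
    if weights is None:
        weights = [1.0] * len(policy_logps)
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))

def dpo_loss(chosen, rejected, beta=0.1):
    """Logistic preference loss: -log(sigmoid(beta * margin difference)).

    `chosen` and `rejected` are tuples of (policy_logps, ref_logps) or
    (policy_logps, ref_logps, weights).
    """
    logits = beta * (sequence_margin(*chosen) - sequence_margin(*rejected))
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))

# Hypothetical per-token log-probabilities under the policy and the reference.
chosen_policy, chosen_ref = [-1.0, -0.5, -2.0], [-1.2, -0.6, -2.5]
rejected_policy, rejected_ref = [-1.0, -1.5], [-0.9, -1.0]

# Hypothetical importance weights that emphasize the last token of each completion.
chosen_w, rejected_w = [0.5, 0.5, 2.0], [0.5, 1.5]

uniform_loss = dpo_loss((chosen_policy, chosen_ref),
                        (rejected_policy, rejected_ref))
weighted_loss = dpo_loss((chosen_policy, chosen_ref, chosen_w),
                         (rejected_policy, rejected_ref, rejected_w))
```

With uniform weights the function reduces to the standard sequence-level DPO margin. In this example, placing larger weights on the tokens whose log-ratios differ most widens the margin between the preferred and rejected completions, so the weighted loss comes out smaller than the uniform one for the same pair.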
1. Theoretical Foundations and Motivation
Traditional DPO optimizes a policy to maximize the likelihood of preferred complet