- The paper introduces SparsePO, a token-level preference optimization method that uses sparse token masks to learn token importance and improve alignment.
- It employs two masking strategies, a Model Activation-based Mask and a Learnable Sparse Mask, to weight each token's contribution to the KL divergence and reward terms of the training objective.
- Experiments across sentiment control, dialogue, summarization, and code generation demonstrate that SparsePO outperforms existing preference optimization methods at aligning model outputs with human preferences.
SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks
The paper introduces a novel approach to preference optimization in LLMs, termed SparsePO, which leverages sparse token masks to control preference alignment at the token level. SparsePO builds on the insight that not all tokens contribute equally to human preferences, and applies learned token-level weights to the Kullback-Leibler (KL) divergence and reward contributions during training.
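To make the mechanism concrete, the following is a minimal sketch, not the authors' implementation, of how per-token masks might enter a DPO/TDPO-style objective. The function name, argument layout, and the exact way the masked KL correction combines with the masked reward margin are illustrative assumptions.

```python
import torch.nn.functional as F


def sparse_token_po_loss(logp_pi, logp_ref, kl_ref_pi, m_reward, m_kl, beta=0.1):
    """Schematic token-level preference loss with sparse masks (illustrative only).

    All arguments are dicts with keys "chosen" and "rejected":
      logp_pi, logp_ref : per-token log-probs under the policy / reference model,
                          shape (batch, seq_len), padding positions zeroed out.
      kl_ref_pi         : per-position KL(reference || policy) over the vocabulary,
                          shape (batch, seq_len).
      m_reward, m_kl    : per-token masks in [0, 1] weighting each token's
                          contribution to the reward and KL terms.
    """

    def seq_reward(which):
        # Masked implicit token reward: beta * m_t * log(pi / ref), summed over t.
        return beta * (m_reward[which] * (logp_pi[which] - logp_ref[which])).sum(-1)

    def seq_kl(which):
        # Masked sequential KL between reference and policy, summed over t.
        return beta * (m_kl[which] * kl_ref_pi[which]).sum(-1)

    # Preference margin on masked rewards, corrected by the masked KL difference
    # (mirroring the structure of TDPO-style token-level objectives).
    margin = (seq_reward("chosen") - seq_reward("rejected")) - (
        seq_kl("rejected") - seq_kl("chosen")
    )
    return -F.logsigmoid(margin).mean()
```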
Key Contributions
- Token-Level Preference Optimization:
- The paper observes that traditional preference optimization (PO), in particular Direct Preference Optimization (DPO), applies its training signal uniformly across all tokens of a sequence. SparsePO diverges from this by weighting token-level contributions, with the sparsity pattern of token importance learned automatically during training.
- Masking Strategies:
- The paper proposes two main strategies for computing the masks (a rough sketch of both appears after this list):
- Model Activation-based Mask (MaPO): Utilizes internal activations of the reference model to derive token-level weightings.
- Learnable Sparse Mask (SparsePO): Incorporates learnable parameters to compute mask values during training, allowing the model to dynamically adjust which tokens are pivotal for preference alignment.
- Extensive Experimental Analysis:
- The effectiveness of SparsePO is demonstrated across several domains, including sentiment control, dialogue, summarization, and text-to-code generation. SparsePO consistently learns token weights that reflect the demands of each task, outperforming existing PO methods such as sequence-level DPO and token-level TDPO.
- Quantitative Results:
- Notable findings include improved or comparable reward distributions while permitting greater KL divergence from the reference model, indicating that human preferences can be satisfied without over-constraining the policy on tokens that matter little for alignment.
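As a rough illustration of the two masking strategies, the sketch below derives an activation-based mask from reference-model hidden states and a learnable mask from a small projection head. The specific statistics, normalization, and gating chosen here (mean absolute activation, min-max scaling, a ReLU-then-tanh gate) are assumptions made for illustration and may differ from the paper's exact formulation.

```python
import torch
import torch.nn as nn


def activation_based_mask(hidden_states):
    """MaPO-style mask (sketch): weight tokens by the magnitude of the reference
    model's hidden activations, min-max normalized into [0, 1] per sequence.

    hidden_states: (batch, seq_len, hidden_dim) from the frozen reference model.
    """
    scores = hidden_states.abs().mean(dim=-1)            # (batch, seq_len)
    scores = scores - scores.amin(dim=-1, keepdim=True)
    return scores / (scores.amax(dim=-1, keepdim=True) + 1e-8)


class LearnableSparseMask(nn.Module):
    """SparsePO-style mask (sketch): a small learnable head mapping hidden states
    to per-token weights, gated so that many weights can be exactly zero."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states):
        # ReLU drives non-positive projections to exactly 0 (sparsity);
        # tanh squashes the surviving weights into (0, 1).
        gate = torch.relu(self.proj(hidden_states).squeeze(-1))
        return torch.tanh(gate)
```

Either mask could then supply the per-token weights on the reward and KL terms in the objective sketched earlier; whether those two masks are shared or computed independently is itself a design choice.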
Implications and Future Directions
The paper suggests significant implications for designing LLMs that align more closely with nuanced human preferences. Sparse token masking introduces a level of granularity in preference alignment that facilitates more flexible and diverse response generation. This methodological advance could lead to more robust applications in areas requiring fine-tuned language generation, such as conversational agents and content moderation systems.
Speculations on Future Developments
- The integration of SparsePO with larger and more complex LLMs could enhance their ability to model detailed preference signals.
- Future work could explore the combination of SparsePO with other preference optimization techniques for a hybrid approach to model alignment.
- Moreover, investigating the application of SparsePO in multi-modal models might unearth further potential for cross-domain preference alignment.
Conclusion
SparsePO stands as a promising advance in preference optimization, introducing a structured, token-specific approach to model alignment. By focusing on token-level contributions, the method strikes a meaningful balance between reward maximization and KL regularization, ultimately promoting more refined and human-aligned LLM behavior.