Mitigating Attention Hacking in Preference-Based Reward Modeling via Interaction Distillation (2508.02618v1)

Published 4 Aug 2025 in cs.CL

Abstract: The reward model (RM), as the core component of reinforcement learning from human feedback (RLHF) for LLMs, is responsible for providing reward signals to generated responses. However, mainstream preference modeling in RMs is inadequate in terms of token-level interaction, making its judgment signals vulnerable to being hacked by misallocated attention to context. This stems from two fundamental limitations: (1) Current preference modeling employs decoder-only architectures, where the unidirectional causal attention mechanism leads to forward-decaying intra-sequence attention within the prompt-response sequence. (2) The independent Siamese-encoding paradigm results in the absence of token-level inter-sequence attention between chosen and rejected sequences. To address this "attention hacking", we propose "Interaction Distillation", a novel training framework for more adequate preference modeling through attention-level optimization. The method introduces an interaction-based natural language understanding model as the teacher to provide sophisticated token interaction patterns via comprehensive attention, and guides the preference model to simulate the teacher model's interaction pattern through an attentional alignment objective. In extensive experiments, interaction distillation provides more stable and generalizable reward signals than state-of-the-art RM optimization methods that target data noise, highlighting that attention hacking constitutes a more fundamental limitation in RMs.
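To make the contrast in the abstract concrete, the sketch below builds a causal (unidirectional) attention map and a full (bidirectional) one from the same toy scores, then measures the gap between them with a mean-squared-error term. This is an illustration only, under loud assumptions: the abstract does not specify the form of the attentional alignment objective, so the MSE loss, the shared random scores, and the names `attention_map` and `alignment_loss` are all hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(scores, causal):
    """Row-wise softmax of a square score matrix.

    With causal=True, token i may only attend to tokens 0..i, as in a
    decoder-only model; later positions receive exactly zero weight.
    """
    n = len(scores)
    rows = []
    for i in range(n):
        if causal:
            visible = softmax(scores[i][: i + 1])
            rows.append(visible + [0.0] * (n - i - 1))
        else:
            rows.append(softmax(scores[i]))
    return rows


def alignment_loss(student, teacher):
    """Mean squared difference between two attention maps.

    ASSUMPTION: MSE is a stand-in; the paper's actual objective may differ.
    """
    n = len(student)
    total = sum(
        (s - t) ** 2
        for s_row, t_row in zip(student, teacher)
        for s, t in zip(s_row, t_row)
    )
    return total / (n * n)


random.seed(0)
n = 6  # toy sequence length
scores = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]

student = attention_map(scores, causal=True)   # decoder-only style
teacher = attention_map(scores, causal=False)  # "comprehensive" attention
loss = alignment_loss(student, teacher)
```

In the causal map every entry above the diagonal is zero, so early tokens can never attend to later ones; the loss is strictly positive here because the teacher spreads weight over positions the student cannot see. In an actual training setup the student's attention would come from a trainable model, and a term of this kind would be added to the preference loss.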

Authors (7)
