- The paper introduces a novel Causal Reward Model (CRM) that integrates causal inference to counteract reward hacking in RLHF-trained large language models.
- It employs Maximum Mean Discrepancy (MMD) regularization to enforce counterfactual invariance, ensuring the reward model remains unaffected by spurious factors like response length.
- Experimental results on multiple benchmarks demonstrate that both the unconditional and conditional CRM variants effectively reduce sycophancy, length, concept, and discrimination biases while preserving overall utility.
This paper introduces a novel Causal Reward Model (CRM) framework to address the problem of reward hacking in LLMs trained using Reinforcement Learning from Human Feedback (RLHF). Reward hacking occurs when LLMs exploit spurious correlations in the human preference data used for training reward models, leading to undesirable biases like length bias, sycophancy, concept bias, and discrimination, rather than genuinely aligning with true human preferences.
The core idea is to integrate causal inference principles into the reward modeling process to mitigate these spurious correlations. The authors propose enforcing counterfactual invariance, meaning the reward prediction should remain consistent even when irrelevant variables (spurious factors like response length) are changed.
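Concretely, counterfactual invariance can be written as follows. This is a minimal formalization consistent with the summary above; the counterfactual notation $T(z)$ (the prompt-response pair with the spurious factor set to $z$) and the reward-model symbol $r_\theta$ are illustrative, not necessarily the paper's exact notation:

$$
r_\theta\big(T(z)\big) = r_\theta\big(T(z')\big) \quad \text{for all spurious-factor values } z, z',
$$

i.e., intervening only on the spurious factor (e.g., padding a response to be longer without changing its content) should leave the predicted reward unchanged.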
Methodology:
- Causal Framework: The paper models the relationships between spurious factors ($Z$), prompt-response pairs ($T$), true rewards ($R$), and human preference labels ($L$) using a causal diagram (Figure 1). It decomposes the prompt-response pair $T$ into latent components: $T_{Z,\perp}$ (independent of the spurious factor $Z$), $T_{L,\perp}$ (which does not cause the label $L$), and $T_{Z \wedge L}$ (influenced by both $Z$ and $L$). An ideal, debiased reward model should depend only on $T_{Z,\perp}$.
- Independence Condition: Since counterfactual data for directly learning $T_{Z,\perp}$ is hard to obtain, the paper leverages an observable signature of the causal graph: a counterfactually invariant reward model $f(T)$ must be independent of the spurious factor $Z$, i.e., $f(T) \perp Z$ (Equation \ref{eqn:independence_condition}).
- MMD Regularization: To enforce this independence condition during training, the paper uses Maximum Mean Discrepancy (MMD) as a regularizer. MMD measures the discrepancy between the distributions of the reward model's latent representations $f(T)$ conditioned on different values (or bins) of the spurious factor $Z$. The final loss function combines the standard RLHF reward-model loss with this MMD regularization term.
- Variants: The paper proposes two variants: an unconditional CRM, where the MMD penalty is applied across all data, and a conditional CRM, where it is applied separately to chosen and rejected responses (see the sketch after this list).
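To make the objective concrete, below is a minimal PyTorch-style sketch of an MMD-regularized reward loss of the form $\mathcal{L} = \mathcal{L}_{\text{pref}} + \lambda \cdot \mathrm{MMD}^2$. It is an illustration under stated assumptions, not the authors' implementation: the RBF kernel and bandwidth, the discretization of the spurious factor into `z_bins`, the use of scalar reward outputs as the representation $f(T)$, and the weight `lambda_mmd` are choices made here for brevity; only the overall structure (a Bradley-Terry preference loss plus an MMD penalty pushing $f(T)$ toward independence from $Z$) follows the method described above.

```python
import torch
import torch.nn.functional as F


def mmd_rbf(x, y, bandwidth=1.0):
    """Biased estimator of the squared MMD between two samples (RBF kernel).

    x, y: tensors of shape (n, d) and (m, d) holding reward-model outputs
    (or latent representations) for two bins of the spurious factor Z.
    """
    def kernel(a, b):
        sq_dists = torch.cdist(a, b, p=2).pow(2)      # pairwise squared distances
        return torch.exp(-sq_dists / (2 * bandwidth ** 2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()


def crm_loss(reward_model, chosen, rejected, z_bins, lambda_mmd=1.0):
    """Pairwise RLHF reward loss plus an MMD penalty encouraging f(T) ⊥ Z.

    chosen, rejected: batched encodings of chosen / rejected prompt-response
        pairs (whatever the reward model consumes); reward_model is assumed
        to return one scalar reward per example, shape (B,).
    z_bins: integer bin index of the spurious factor Z per example, shape (B,).
    """
    r_chosen = reward_model(chosen)        # (B,)
    r_rejected = reward_model(rejected)    # (B,)

    # Standard Bradley-Terry preference loss used for RLHF reward models.
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Unconditional variant: pool all responses and compare the reward
    # distributions across bins of the spurious factor Z.
    reps = torch.cat([r_chosen, r_rejected], dim=0).unsqueeze(-1)  # (2B, 1)
    bins = torch.cat([z_bins, z_bins], dim=0)                      # (2B,)

    mmd = reps.new_zeros(())
    unique_bins = bins.unique()
    for i in range(len(unique_bins)):
        for j in range(i + 1, len(unique_bins)):
            mmd = mmd + mmd_rbf(reps[bins == unique_bins[i]],
                                reps[bins == unique_bins[j]])

    return pref_loss + lambda_mmd * mmd
```

The conditional variant would compute the same MMD penalty twice, once within the chosen responses and once within the rejected responses, and add both terms to the preference loss.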
Experiments and Results:
The proposed CRM was evaluated on mitigating four types of bias using Llama-3 8B:
- Sycophantic Bias: On a semi-synthetic dataset in which agreeing with the user ("Yes, you are right.") was spuriously correlated with correctness, CRM (especially the conditional variant) significantly reduced the model's tendency toward such sycophantic agreement (Table \ref{table:synco}).
- Length Bias: On the Alpaca dataset, CRM outperformed vanilla RLHF with length penalties in terms of win rate against the SFT model and showed a better trade-off on the Pareto front. It also reduced the correlation between response length and reward rank (Figure \ref{fig:length-bias-main}).
- Concept Bias: Using Yelp, IMDB, and Amazon review datasets modified to induce concept bias, CRM consistently reduced the Bias@C metric compared to the vanilla reward model, demonstrating its ability to mitigate reliance on spurious concept-sentiment correlations. Conditional CRM often achieved the lowest bias, while unconditional CRM showed strong utility (accuracy) (Table \ref{tab:concept_bias_results}).
- Discrimination Bias: Using a filtered subset of the Anthropic HH-RLHF dataset and the Discrim-Eval benchmark, CRM significantly reduced both explicit and implicit discrimination based on age, race, and gender compared to vanilla RLHF, particularly for implicit bias. The unconditional CRM showed the best overall bias reduction with minimal impact on general utility (Table \ref{tab:discrimination_scores}, Figure \ref{fig:discrim_coeff}).
Conclusion:
The paper demonstrates that incorporating causal regularization via MMD to enforce counterfactual invariance is an effective and practical method for mitigating various biases stemming from spurious correlations in RLHF reward models. The proposed CRM acts as a drop-in enhancement to existing RLHF pipelines, improving the fairness, reliability, and trustworthiness of LLMs without adding significant complexity or sacrificing general utility.