- The paper introduces a novel Causal Reward Model (CRM) that integrates causal inference to counteract reward hacking in RLHF-trained large language models.
- It employs Maximum Mean Discrepancy (MMD) regularization to enforce counterfactual invariance, ensuring the reward model remains unaffected by spurious factors like response length.
- Experimental results on multiple benchmarks demonstrate that both the unconditional and conditional CRM variants effectively reduce sycophancy, length, concept, and discrimination biases while preserving overall utility.
This paper introduces a novel Causal Reward Model (CRM) framework to address the problem of reward hacking in LLMs trained using Reinforcement Learning from Human Feedback (RLHF). Reward hacking occurs when LLMs exploit spurious correlations in the human preference data used for training reward models, leading to undesirable biases like length bias, sycophancy, concept bias, and discrimination, rather than genuinely aligning with true human preferences.
The core idea is to integrate causal inference principles into the reward modeling process to mitigate these spurious correlations. The authors propose enforcing counterfactual invariance, meaning the reward prediction should remain consistent even when irrelevant variables (spurious factors like response length) are changed.
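Concretely, counterfactual invariance can be written as follows. This is a minimal formalization consistent with the summary above; the counterfactual notation $T(z)$ (the prompt-response pair with the spurious factor set to $z$) and the reward-model symbol $r_\theta$ are illustrative, not necessarily the paper's exact notation:

$$
r_\theta\big(T(z)\big) = r_\theta\big(T(z')\big) \quad \text{for all spurious-factor values } z, z',
$$

i.e., intervening only on the spurious factor (e.g., padding a response to be longer without changing its content) should leave the predicted reward unchanged.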
Methodology:
- Causal Framework: The paper models the relationships between spurious factors ($Z$), prompt-response pairs ($T$), true rewards ($R$), and human preference labels ($L$) using a causal diagram (Figure 1). It decomposes the prompt-response pair $T$ into latent components: $T_{Z,\perp}$ (independent of the spurious factor $Z$), $T_{L,\perp}$ (which does not cause the label $L$), and $T_{Z \wedge L}$ (influenced by both $Z$ and $L$). An ideal, debiased reward model should depend only on $T_{Z,\perp}$.
- Independence Condition: Since counterfactual data for directly learning $T_{Z,\perp}$ is hard to obtain, the paper leverages an observable signature of the causal graph: a counterfactually invariant reward model $f(T)$ must be independent of the spurious factor $Z$, i.e., $f(T) \perp Z$ (Equation \ref{eqn:independence_condition}).
- MMD Regularization: To enforce this independence condition during training, the paper uses Maximum Mean Discrepancy (MMD) as a regularizer. MMD measures the discrepancy between the distributions of the reward model's latent representations $f(T)$ conditioned on different values (or bins) of the spurious factor $Z$. The final loss function combines the standard RLHF reward-model loss with this MMD regularization term.
- Variants: The paper proposes two variants: an unconditional CRM, where the MMD penalty is applied across all data, and a conditional CRM, where it is applied separately to chosen and rejected responses (see the sketch after this list).
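To make the objective concrete, below is a minimal PyTorch-style sketch of an MMD-regularized reward loss of the form $\mathcal{L} = \mathcal{L}_{\text{pref}} + \lambda \cdot \mathrm{MMD}^2$. It is an illustration under stated assumptions, not the authors' implementation: the RBF kernel and bandwidth, the discretization of the spurious factor into `z_bins`, the use of scalar reward outputs as the representation $f(T)$, and the weight `lambda_mmd` are choices made here for brevity; only the overall structure (a Bradley-Terry preference loss plus an MMD penalty pushing $f(T)$ toward independence from $Z$) follows the method described above.

```python
import torch
import torch.nn.functional as F


def mmd_rbf(x, y, bandwidth=1.0):
    """Biased estimator of the squared MMD between two samples (RBF kernel).

    x, y: tensors of shape (n, d) and (m, d) holding reward-model outputs
    (or latent representations) for two bins of the spurious factor Z.
    """
    def kernel(a, b):
        sq_dists = torch.cdist(a, b, p=2).pow(2)      # pairwise squared distances
        return torch.exp(-sq_dists / (2 * bandwidth ** 2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()


def crm_loss(reward_model, chosen, rejected, z_bins, lambda_mmd=1.0):
    """Pairwise RLHF reward loss plus an MMD penalty encouraging f(T) ⊥ Z.

    chosen, rejected: batched encodings of chosen / rejected prompt-response
        pairs (whatever the reward model consumes); reward_model is assumed
        to return one scalar reward per example, shape (B,).
    z_bins: integer bin index of the spurious factor Z per example, shape (B,).
    """
    r_chosen = reward_model(chosen)        # (B,)
    r_rejected = reward_model(rejected)    # (B,)

    # Standard Bradley-Terry preference loss used for RLHF reward models.
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Unconditional variant: pool all responses and compare the reward
    # distributions across bins of the spurious factor Z.
    reps = torch.cat([r_chosen, r_rejected], dim=0).unsqueeze(-1)  # (2B, 1)
    bins = torch.cat([z_bins, z_bins], dim=0)                      # (2B,)

    mmd = reps.new_zeros(())
    unique_bins = bins.unique()
    for i in range(len(unique_bins)):
        for j in range(i + 1, len(unique_bins)):
            mmd = mmd + mmd_rbf(reps[bins == unique_bins[i]],
                                reps[bins == unique_bins[j]])

    return pref_loss + lambda_mmd * mmd
```

The conditional variant would compute the same MMD penalty twice, once within the chosen responses and once within the rejected responses, and add both terms to the preference loss.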
Experiments and Results:
The proposed CRM was evaluated on mitigating four types of bias using Llama-3 8B:
- Sycophantic Bias: On a semi-synthetic dataset in which agreeing with the user ("Yes, you are right.") was spuriously correlated with correctness, CRM (especially the conditional variant) significantly reduced the model's tendency toward such sycophantic agreement (Table \ref{table:synco}).
- Length Bias: On the Alpaca dataset, CRM outperformed vanilla RLHF with length penalties in terms of win rate against the SFT model and showed a better trade-off on the Pareto front. It also reduced the correlation between response length and reward rank (Figure \ref{fig:length-bias-main}).
- Concept Bias: Using Yelp, IMDB, and Amazon review datasets modified to induce concept bias, CRM consistently reduced the Bias@C metric compared to the vanilla reward model, demonstrating its ability to mitigate reliance on spurious concept-sentiment correlations. Conditional CRM often achieved the lowest bias, while unconditional CRM showed strong utility (accuracy) (Table \ref{tab:concept_bias_results}).
- Discrimination Bias: Using a filtered subset of the Anthropic HH-RLHF dataset and the Discrim-Eval benchmark, CRM significantly reduced both explicit and implicit discrimination based on age, race, and gender compared to vanilla RLHF, particularly for implicit bias. The unconditional CRM showed the best overall bias reduction with minimal impact on general utility (Table \ref{tab:discrimination_scores}, Figure \ref{fig:discrim_coeff}).
Conclusion:
The paper demonstrates that incorporating causal regularization via MMD to enforce counterfactual invariance is an effective and practical method for mitigating various biases stemming from spurious correlations in RLHF reward models. The proposed CRM acts as a drop-in enhancement to existing RLHF pipelines, improving the fairness, reliability, and trustworthiness of LLMs without adding significant complexity or sacrificing general utility.