- The paper introduces Crome, a framework that trains reward models via targeted causal and neutral data augmentations to separate true quality signals from spurious correlations.
- It employs oracle LLMs to generate counterfactual pairs that improve RM sensitivity to causal attributes while enforcing invariance to irrelevant features.
- Experiments demonstrate significant gains in accuracy, safety, and robustness on benchmarks like RewardBench and reWordBench compared to baseline models.
This paper introduces Crome (Causally Robust Reward Modeling), a framework designed to mitigate reward hacking in LLMs by training more robust reward models (RMs). Reward hacking occurs when RMs, which are central to aligning LLMs via reinforcement learning from human feedback (RLHF), learn to assign high scores based on superficial or spurious attributes (e.g., length, formatting) that correlate with, but do not cause, true response quality. This leads to brittle RMs and misaligned LLM policies.
Crome addresses this by grounding RM training in an explicit causal model that distinguishes between:
- Causal Attributes (C(A)): Fundamental quality dimensions like factuality or relevance that genuinely determine an answer's quality.
- Spurious Attributes (SP(A)): Other features like length or specific formatting that are merely correlated with preferences in training data.
The core idea is to train RMs to be sensitive to causal attributes and invariant to spurious ones, even if the specific spurious attributes are unknown. Crome achieves this through targeted synthetic data augmentations generated by an oracle LLM (e.g., Gemini 2.0 Flash):
- Causal Augmentations: These are pairs of responses that differ along specific, identified causal attributes. For an original answer, the LLM generates "upgraded" or "degraded" versions by intervening only on a single causal attribute (e.g., improving factuality, reducing clarity). These pairs are labeled with preferences (e.g., upgraded ≻ original) to teach the RM sensitivity to individual causal dimensions.
- Neutral Augmentations: These are tie-labeled pairs designed to enforce invariance to spurious attributes. The primary strategy is Irrelevant Query Neutrals (IQN): an existing pair of answers (original or causally augmented) is re-paired with a new, unrelated query. In this new context the original causal distinctions become irrelevant, so any remaining differences between the answers are primarily spurious. Labeling these pairs as ties teaches the RM to disregard such spurious variations when no true causal signal exists for the current query. Other neutral strategies, such as Causally Aligned Neutrals (CAN) and Paraphrase Neutrals (PARA), are also explored. (A generation sketch for both augmentation types follows this list.)
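To make the data-generation step concrete, here is a minimal Python sketch of how such pairs could be constructed with an oracle LLM. The prompt wording, the `generate` callable, and the label encoding (1.0 for a preference, 0.5 for a tie) are illustrative assumptions, not the paper's exact prompts or API.

```python
# Hedged sketch: constructing Crome-style augmentation pairs with an oracle LLM.
from typing import Callable, List, Tuple
import random

# (query, answer_a, answer_b, label); label 1.0 means answer_a is preferred, 0.5 means tie.
Pair = Tuple[str, str, str, float]

def causal_augmentations(query: str, answer: str, attributes: List[str],
                         generate: Callable[[str], str]) -> List[Pair]:
    """Create preference pairs that differ along a single causal attribute."""
    pairs: List[Pair] = []
    for attr in attributes:
        upgraded = generate(
            f"Rewrite the answer to the query below, improving ONLY its {attr}; "
            f"keep everything else unchanged.\nQuery: {query}\nAnswer: {answer}")
        degraded = generate(
            f"Rewrite the answer to the query below, degrading ONLY its {attr}; "
            f"keep everything else unchanged.\nQuery: {query}\nAnswer: {answer}")
        pairs.append((query, upgraded, answer, 1.0))   # upgraded ≻ original
        pairs.append((query, answer, degraded, 1.0))   # original ≻ degraded
    return pairs

def irrelevant_query_neutral(pair: Pair, unrelated_queries: List[str]) -> Pair:
    """IQN: reuse an existing answer pair under an unrelated query and label it a tie."""
    _, ans_a, ans_b, _ = pair
    new_query = random.choice(unrelated_queries)
    return (new_query, ans_a, ans_b, 0.5)              # tie: no causal signal for this query
```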
Methodology:
The Crome pipeline involves:
- Attribute Identification: An oracle LLM identifies principal causal attributes relevant to the task (e.g., accuracy, completeness, clarity).
- Counterfactual Generation: The LLM generates Causal and Neutral augmentation pairs based on these attributes.
- Data Filtering: The augmented data is filtered to retain pairs where a baseline RM is uncertain or incorrect, focusing training on informative examples.
- Robust RM Training: The RM is trained on a combined dataset (original preferences plus filtered augmentations) using a composite loss: a standard preference loss for causal sensitivity, and a neutral tie loss that pushes reward differences toward zero on tie-labeled pairs to enforce spurious invariance (a loss sketch follows this list).
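A minimal PyTorch sketch of such a composite loss is shown below; the squared-error form of the tie term and the `lambda_tie` weighting are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch: Bradley-Terry preference loss on causal/original pairs plus a tie loss
# that drives the reward gap toward zero on neutral (tie-labeled) pairs.
import torch
import torch.nn.functional as F

def composite_rm_loss(r_chosen: torch.Tensor,      # rewards for preferred answers
                      r_rejected: torch.Tensor,    # rewards for dispreferred answers
                      r_neutral_a: torch.Tensor,   # rewards for one side of tie pairs
                      r_neutral_b: torch.Tensor,   # rewards for the other side
                      lambda_tie: float = 1.0) -> torch.Tensor:
    # Preference loss: -log sigmoid(r_chosen - r_rejected), standard Bradley-Terry term.
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Neutral tie loss: penalize any reward gap on tie-labeled pairs (assumed squared error).
    tie_loss = ((r_neutral_a - r_neutral_b) ** 2).mean()
    return pref_loss + lambda_tie * tie_loss
```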
Theoretical Analysis (Informal):
The paper provides an informal theoretical argument suggesting that under idealized assumptions (boolean attributes, quadratic reward, perfect counterfactuals), ℓ1-constrained regression on causally augmented data can recover true causal reward coefficients with an error that depends primarily on the number of causal attributes and samples, and only weakly on the (potentially large) number of spurious attributes.
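A toy scikit-learn simulation can illustrate the flavor of this claim (with a linear rather than quadratic reward, and arbitrary dimensions and regularization strength, all chosen for this sketch rather than taken from the paper): when only a handful of causal attributes carry signal and hundreds of spurious boolean attributes carry none, an ℓ1-penalized fit still recovers the causal coefficients from relatively few samples.

```python
# Toy illustration only, not the paper's analysis: l1-penalized regression with
# many irrelevant (spurious) boolean features and few causal ones.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_causal, n_spurious = 200, 5, 500

# Boolean attribute matrix; in counterfactual data the reward depends only on causal attributes.
X_causal = rng.integers(0, 2, size=(n_samples, n_causal)).astype(float)
X_spurious = rng.integers(0, 2, size=(n_samples, n_spurious)).astype(float)
X = np.hstack([X_causal, X_spurious])

true_w = np.concatenate([rng.uniform(1.0, 2.0, n_causal), np.zeros(n_spurious)])
y = X @ true_w + 0.1 * rng.standard_normal(n_samples)

model = Lasso(alpha=0.05).fit(X, y)
err = np.linalg.norm(model.coef_[:n_causal] - true_w[:n_causal])
print(f"recovery error on causal coefficients: {err:.3f}")  # small despite 500 spurious features
```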
Experiments and Results:
Crome was evaluated against Vanilla RMs and RRM (Robust Reward Model training) using base LLMs like Gemma-2-9B-IT, Qwen2.5-7B, and Gemma-2-2B.
- RewardBench Performance: Crome significantly outperformed baselines, improving average accuracy by up to 5.4% (PairPM setting with Gemma-2-9B-IT). Gains were particularly notable in Safety (up to +13.18%) and Reasoning (up to +7.19%).
- Robustness on reWordBench: Crome demonstrated superior robustness against meaning-preserving transformations on reWordBench, achieving higher accuracy and a smaller drop in performance compared to RewardBench scores. For example, with Gemma-2-9B-IT (PairPM), Crome showed an aggregate accuracy gain of up to 9.1% on reWordBench.
- Best-of-N (BoN) Alignment: Crome-trained RMs led to consistent improvements in BoN selection (sketched after this list) across various N values on RewardBench, on the safety-focused WildGuardTest (lower Attack Success Rates without significantly increasing refusals on benign prompts), and on the reasoning-focused GSM8k.
- Neutral Augmentation Ablations: Experiments showed that neutral augmentations are crucial for robustness. IQN generally performed best on RewardBench, while CAN showed strong results on reWordBench.
- Oracle LLM Choice: Crome demonstrated robustness to the choice of oracle LLM, showing significant improvements even when using a weaker open-weights model like Gemma-2-27B-IT for augmentation generation.
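For reference, Best-of-N selection with a reward model amounts to sampling N candidate responses and keeping the one the RM scores highest; the sketch below uses placeholder callables (`sample_response`, `reward_model`) rather than any specific API from the paper.

```python
# Hedged sketch of Best-of-N (BoN) selection with a trained reward model.
from typing import Callable, List

def best_of_n(prompt: str, n: int,
              sample_response: Callable[[str], str],
              reward_model: Callable[[str, str], float]) -> str:
    candidates: List[str] = [sample_response(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]   # keep the RM-preferred candidate
```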
Key Contributions:
- A spurious-unaware causal framework that requires interventions only on LLM-identified causal rubrics.
- Targeted counterfactual augmentations (Causal and Neutral) to disentangle causal and spurious attributes without explicit knowledge of the latter.
- State-of-the-art RM robustness demonstrated on benchmarks like RewardBench and reWordBench.
- Improved Best-of-N selection performance across chat, reasoning, and safety tasks.
The paper concludes that Crome's causally-informed data augmentation strategy effectively mitigates reward hacking, leading to more robust and aligned RMs. Future work includes applying these causal principles to synthetic data generation for base model training.