- The paper introduces Crome, a framework that trains reward models via targeted causal and neutral data augmentations to separate true quality signals from spurious correlations.
- It employs oracle LLMs to generate counterfactual pairs that improve RM sensitivity to causal attributes while enforcing invariance to irrelevant features.
- Experiments demonstrate significant gains in accuracy, safety, and robustness on benchmarks like RewardBench and reWordBench compared to baseline models.
This paper introduces Crome (Causally Robust Reward Modeling), a framework designed to mitigate reward hacking in LLMs by training more robust reward models (RMs). Reward hacking occurs when RMs, which are central to aligning LLMs via reinforcement learning from human feedback (RLHF), learn to assign high scores based on superficial or spurious attributes (e.g., length, formatting) that correlate with, but do not cause, true response quality. This leads to brittle RMs and misaligned LLM policies.
Crome addresses this by grounding RM training in an explicit causal model that distinguishes between:
- Causal Attributes (C(A)): Fundamental quality dimensions like factuality or relevance that genuinely determine an answer's quality.
- Spurious Attributes (SP(A)): Other features like length or specific formatting that are merely correlated with preferences in training data.
The core idea is to train RMs to be sensitive to causal attributes and invariant to spurious ones, even if the specific spurious attributes are unknown. Crome achieves this through targeted synthetic data augmentations generated by an oracle LLM (e.g., Gemini 2.0 Flash):
- Causal Augmentations: These are pairs of responses that differ along specific, identified causal attributes. For an original answer, the LLM generates "upgraded" or "degraded" versions by intervening only on a single causal attribute (e.g., improving factuality, reducing clarity). These pairs are labeled with preferences (e.g., upgraded ≻ original) to teach the RM sensitivity to individual causal dimensions.
- Neutral Augmentations: These are tie-labeled pairs designed to enforce invariance to spurious attributes. The primary strategy is Irrelevant Query Neutrals (IQN): an existing pair of answers (original or causally augmented) is re-paired with a new, unrelated query. In this new context the original causal distinctions become irrelevant, so any remaining differences between the answers are primarily spurious. Labeling these pairs as ties teaches the RM to disregard such spurious variations when no true causal signal exists for the current query. Other neutral strategies, such as Causally Aligned Neutrals (CAN) and Paraphrase Neutrals (PARA), are also explored. (A generation sketch for both augmentation types follows this list.)
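To make the data-generation step concrete, here is a minimal Python sketch of how such pairs could be constructed with an oracle LLM. The prompt wording, the `generate` callable, and the label encoding (1.0 for a preference, 0.5 for a tie) are illustrative assumptions, not the paper's exact prompts or API.

```python
# Hedged sketch: constructing Crome-style augmentation pairs with an oracle LLM.
from typing import Callable, List, Tuple
import random

# (query, answer_a, answer_b, label); label 1.0 means answer_a is preferred, 0.5 means tie.
Pair = Tuple[str, str, str, float]

def causal_augmentations(query: str, answer: str, attributes: List[str],
                         generate: Callable[[str], str]) -> List[Pair]:
    """Create preference pairs that differ along a single causal attribute."""
    pairs: List[Pair] = []
    for attr in attributes:
        upgraded = generate(
            f"Rewrite the answer to the query below, improving ONLY its {attr}; "
            f"keep everything else unchanged.\nQuery: {query}\nAnswer: {answer}")
        degraded = generate(
            f"Rewrite the answer to the query below, degrading ONLY its {attr}; "
            f"keep everything else unchanged.\nQuery: {query}\nAnswer: {answer}")
        pairs.append((query, upgraded, answer, 1.0))   # upgraded ≻ original
        pairs.append((query, answer, degraded, 1.0))   # original ≻ degraded
    return pairs

def irrelevant_query_neutral(pair: Pair, unrelated_queries: List[str]) -> Pair:
    """IQN: reuse an existing answer pair under an unrelated query and label it a tie."""
    _, ans_a, ans_b, _ = pair
    new_query = random.choice(unrelated_queries)
    return (new_query, ans_a, ans_b, 0.5)              # tie: no causal signal for this query
```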
Methodology:
The Crome pipeline involves:
- Attribute Identification: An oracle LLM identifies principal causal attributes relevant to the task (e.g., accuracy, completeness, clarity).
- Counterfactual Generation: The LLM generates Causal and Neutral augmentation pairs based on these attributes.
- Data Filtering: The augmented data is filtered to retain pairs where a baseline RM is uncertain or incorrect, focusing training on informative examples.
- Robust RM Training: The RM is trained on a combined dataset (original preferences plus filtered augmentations) using a composite loss: a standard preference loss for causal sensitivity, and a neutral tie loss that pushes reward differences toward zero on tie-labeled pairs to enforce spurious invariance (a loss sketch follows this list).
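A minimal PyTorch sketch of such a composite loss is shown below; the squared-error form of the tie term and the `lambda_tie` weighting are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch: Bradley-Terry preference loss on causal/original pairs plus a tie loss
# that drives the reward gap toward zero on neutral (tie-labeled) pairs.
import torch
import torch.nn.functional as F

def composite_rm_loss(r_chosen: torch.Tensor,      # rewards for preferred answers
                      r_rejected: torch.Tensor,    # rewards for dispreferred answers
                      r_neutral_a: torch.Tensor,   # rewards for one side of tie pairs
                      r_neutral_b: torch.Tensor,   # rewards for the other side
                      lambda_tie: float = 1.0) -> torch.Tensor:
    # Preference loss: -log sigmoid(r_chosen - r_rejected), standard Bradley-Terry term.
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Neutral tie loss: penalize any reward gap on tie-labeled pairs (assumed squared error).
    tie_loss = ((r_neutral_a - r_neutral_b) ** 2).mean()
    return pref_loss + lambda_tie * tie_loss
```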
Theoretical Analysis (Informal):
The paper provides an informal theoretical argument suggesting that under idealized assumptions (boolean attributes, quadratic reward, perfect counterfactuals), ℓ1-constrained regression on causally augmented data can recover true causal reward coefficients with an error that depends primarily on the number of causal attributes and samples, and only weakly on the (potentially large) number of spurious attributes.
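A toy scikit-learn simulation can illustrate the flavor of this claim (with a linear rather than quadratic reward, and arbitrary dimensions and regularization strength, all chosen for this sketch rather than taken from the paper): when only a handful of causal attributes carry signal and hundreds of spurious boolean attributes carry none, an ℓ1-penalized fit still recovers the causal coefficients from relatively few samples.

```python
# Toy illustration only, not the paper's analysis: l1-penalized regression with
# many irrelevant (spurious) boolean features and few causal ones.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_causal, n_spurious = 200, 5, 500

# Boolean attribute matrix; in counterfactual data the reward depends only on causal attributes.
X_causal = rng.integers(0, 2, size=(n_samples, n_causal)).astype(float)
X_spurious = rng.integers(0, 2, size=(n_samples, n_spurious)).astype(float)
X = np.hstack([X_causal, X_spurious])

true_w = np.concatenate([rng.uniform(1.0, 2.0, n_causal), np.zeros(n_spurious)])
y = X @ true_w + 0.1 * rng.standard_normal(n_samples)

model = Lasso(alpha=0.05).fit(X, y)
err = np.linalg.norm(model.coef_[:n_causal] - true_w[:n_causal])
print(f"recovery error on causal coefficients: {err:.3f}")  # small despite 500 spurious features
```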
Experiments and Results:
Crome was evaluated against Vanilla RMs and RRM (Robust Reward Model training) using base LLMs like Gemma-2-9B-IT, Qwen2.5-7B, and Gemma-2-2B.
- RewardBench Performance: Crome significantly outperformed baselines, improving average accuracy by up to 5.4% (PairPM setting with Gemma-2-9B-IT). Gains were particularly notable in Safety (up to +13.18%) and Reasoning (up to +7.19%).
- Robustness on reWordBench: Crome demonstrated superior robustness against meaning-preserving transformations on reWordBench, achieving higher accuracy and a smaller drop in performance compared to RewardBench scores. For example, with Gemma-2-9B-IT (PairPM), Crome showed an aggregate accuracy gain of up to 9.1% on reWordBench.
- Best-of-N (BoN) Alignment: Crome-trained RMs led to consistent improvements in BoN selection (sketched after this list) across various N values on RewardBench, on the safety-focused WildGuardTest (lower Attack Success Rates without significantly increasing refusals on benign prompts), and on the reasoning-focused GSM8k.
- Neutral Augmentation Ablations: Experiments showed that neutral augmentations are crucial for robustness. IQN generally performed best on RewardBench, while CAN showed strong results on reWordBench.
- Oracle LLM Choice: Crome demonstrated robustness to the choice of oracle LLM, showing significant improvements even when using a weaker open-weights model like Gemma-2-27B-IT for augmentation generation.
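For reference, Best-of-N selection with a reward model amounts to sampling N candidate responses and keeping the one the RM scores highest; the sketch below uses placeholder callables (`sample_response`, `reward_model`) rather than any specific API from the paper.

```python
# Hedged sketch of Best-of-N (BoN) selection with a trained reward model.
from typing import Callable, List

def best_of_n(prompt: str, n: int,
              sample_response: Callable[[str], str],
              reward_model: Callable[[str, str], float]) -> str:
    candidates: List[str] = [sample_response(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]   # keep the RM-preferred candidate
```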
Key Contributions:
- A spurious-unaware causal framework that requires interventions only on LLM-identified causal rubrics.
- Targeted counterfactual augmentations (Causal and Neutral) to disentangle causal and spurious attributes without explicit knowledge of the latter.
- State-of-the-art RM robustness demonstrated on benchmarks like RewardBench and reWordBench.
- Improved Best-of-N selection performance across chat, reasoning, and safety tasks.
The paper concludes that Crome's causally-informed data augmentation strategy effectively mitigates reward hacking, leading to more robust and aligned RMs. Future work includes applying these causal principles to synthetic data generation for base model training.