CREAM: Consistency Regularized Self-Rewarding

Updated 11 June 2026

The paper presents a novel scheme that integrates consistency regularization into self-rewarding loops to mitigate bias and instability in reward signaling.
Its methodology employs soft-labeling, temporal rank consistency, and multi-model reward aggregation to reliably align pseudo-preference data with human intent.
Empirical results show improved model consistency (e.g., Llama-3 gains from ~0.73 to ~0.92) and enhanced downstream accuracy compared to traditional self-rewarding methods.

Consistency Regularized Self-Rewarding (CREAM) refers to a class of self-supervised alignment techniques for LLMs characterized by the integration of explicit consistency regularization into the self-rewarding training loop. These frameworks aim to mitigate reward bias and instability caused by self-labeled or self-generated rewards, thereby producing more reliable pseudo-preference data and ultimately improving alignment with human intent—without external human annotation.

CREAM is formalized as (a) a consistency-regularized iterative preference optimization framework ("CREAM: Consistency Regularized Self-Rewarding LLMs" (Wang et al., 2024)), (b) a multi-model consistency enforcement paradigm ("Self-Consistent Internal Rewards" (Zhou et al., 13 Feb 2025)), or (c) in its generative variant, a reinforcement learning setup in which reward signals are filtered and modulated for temporal and semantic coherence ("ConsistRM" (Liang et al., 8 Apr 2026)). In the reported experiments, CREAM variants achieve higher reward-model consistency, downstream accuracy, and robustness than vanilla self-rewarding or majority-vote imitation schemes.

1. Generalized Self-Rewarding Framework

CREAM typically operates on the iterative preference fine-tuning paradigm. Given an initial LLM $\pi_{\theta}$ , a reference model $\pi_{\mathrm{ref}}$ (often a snapshot of the model or its base version), a supervised fine-tuning (SFT) set $D_S$ , and a large unlabeled set $D_U$ , the process alternates:

Inference Phase: For each prompt $x \in D_U$ , generate multiple candidate responses $y_i \sim \pi_{\theta}$ .
Self-Labeling: Rank or pairwise compare responses using the model's own scoring proxies (log-likelihood, learned critic, or LLM-as-a-Judge).
Preference Optimization: Construct preference labels $z(y, y', x) \in \{0,1\}$ (or soft labels in $[0,1]$) from the model's own judgments, and fine-tune further via direct preference optimization (DPO) or similar objectives:

$L_{\mathrm{DPO}}(\theta; y, y', x, z) = -z \log \sigma(s(y, y'; \theta)) - (1-z) \log \sigma(-s(y, y'; \theta))$

where $s(y, y'; \theta) = \log \frac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)} - \log \frac{\pi_\theta(y'|x)}{\pi_{\mathrm{ref}}(y'|x)}$ .
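The loss above can be evaluated directly from four sequence log-probabilities. The following is a minimal sketch in plain Python, not the authors' implementation; the function names are chosen here for illustration, and the usual DPO temperature $\beta$ is left out because the formula above does not include it.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_margin(logp_y: float, logp_ref_y: float,
                    logp_y2: float, logp_ref_y2: float) -> float:
    """s(y, y'; theta): difference of policy-vs-reference log-ratios."""
    return (logp_y - logp_ref_y) - (logp_y2 - logp_ref_y2)


def dpo_loss(s: float, z: float) -> float:
    """-z * log sigma(s) - (1 - z) * log sigma(-s) for a label z in [0, 1]."""
    return -z * log_sigmoid(s) - (1.0 - z) * log_sigmoid(-s)


# Hypothetical log-probabilities for a preferred response y and a rejected y'.
s = implicit_margin(logp_y=-1.0, logp_ref_y=-2.0, logp_y2=-3.0, logp_ref_y2=-2.5)
loss_if_y_preferred = dpo_loss(s, z=1.0)
loss_if_y2_preferred = dpo_loss(s, z=0.0)
```

With a positive margin `s`, the loss is smaller when the label says `y` is preferred (`z = 1`) than when it says the opposite, which is the behavior a hard-label self-rewarding loop relies on.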

Vanilla methods, such as SRLM (self-rewarding LLMs), apply hard zero/one labeling to each pair, resulting in brittle and sometimes noisy training targets, especially as bias accumulates over several self-training rounds (Wang et al., 2024).

2. Consistency Regularization Principles

CREAM introduces a consistency-incentivized augmentation to the above loop to address the inherent unreliability in self-generated preference data. The main mechanisms are:

Label Softening: Replace hard binary preference assignments with soft or probabilistic labels, reflecting the degree of agreement and uncertainty among the model's own scores (Wang et al., 2024).
Temporal Rank Consistency: Enforce agreement in the relative ranking or preference ordering of candidate responses across LLM iterations, usually measured via Kendall’s $\tau$ or other rank metrics.
Multi-Model Agreement: Compute preferences only on cases where multiple internal reward estimators—e.g., DPO-based implicit RM (IRM) and generative RM (GRM, LLM-as-a-Judge)—concur on the outcome, filtering ambiguous data (Zhou et al., 13 Feb 2025).
Explicit Consistency Penalty: Add loss terms penalizing incoherence or overconfidence between multiple internal reward proxies, e.g., KL divergence plus entropy regularization between the output probabilities of different reward heads.

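Soft-labeling is mathematically formalized as adding to the optimization objective:

$\lambda \, L_{\mathrm{Reg}}(\theta; y, y', x) = -\lambda \left[ \log \sigma(s(y, y'; \theta)) + \log \sigma(-s(y, y'; \theta)) \right]$

where $\lambda > 0$ is the regularization weight. In expectation, this term pushes ambiguous pairs toward a uniform preference distribution (thus discouraging forced high-confidence guessing) (Wang et al., 2024). Alternatively, mixture weighting is used:

$L(\theta) = C \, L_{\mathrm{DPO}}(\theta; y, y', x, z) + (1 - C) \, L_{\mathrm{DPO}}(\theta; y, y', x, 1 - z)$

where $C$ sets the balance between original and reversed orderings depending on measured consistency.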
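One way to read the mixture weighting is as a convex combination of the DPO loss on the original label and on the flipped label. The sketch below assumes a weight `c` in [0, 1] (a name and range chosen here for illustration) and checks numerically that the mixture equals a single DPO loss with the soft label `c * z + (1 - c) * (1 - z)`, which follows because the DPO loss is linear in `z`.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(s: float, z: float) -> float:
    """-z * log sigma(s) - (1 - z) * log sigma(-s)."""
    return -z * log_sigmoid(s) - (1.0 - z) * log_sigmoid(-s)


def mixture_loss(s: float, z: float, c: float) -> float:
    """c * L_DPO(z) + (1 - c) * L_DPO(1 - z): original vs. reversed ordering."""
    return c * dpo_loss(s, z) + (1.0 - c) * dpo_loss(s, 1.0 - z)


def soft_label(z: float, c: float) -> float:
    """Soft label implied by mixing the original and the reversed hard label."""
    return c * z + (1.0 - c) * (1.0 - z)


# The mixture and the soft-label form agree up to floating-point error.
gap = abs(mixture_loss(0.8, 1.0, 0.7) - dpo_loss(0.8, soft_label(1.0, 0.7)))
```

At `c = 0.5` the soft label is 0.5 for either hard label, so the loss is minimized at a zero margin: the pair carries no preference signal. At `c = 1` the hard-label DPO loss is recovered.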

3. Implementation Variants and Algorithmic Realizations

3.1. CREAM (Direct Regularization in DPO)

The canonical CREAM implementation uses soft-labeling DPO with an adaptive weight proportional to the inter-iteration ranking consistency (measured via Kendall’s $\tau$ or Spearman's $\rho$ between current and previous model scores on the same candidate set). This dynamic mixture smoothly interpolates between trusting the model's preference and flattening it when consistency is low (Wang et al., 2024).

The iterative update procedure is:

Fine-tune on $D_S$ to obtain the starting policy $\pi_{\theta_1}$.
For each unlabeled prompt in $D_U$:
- Sample $N$ responses with $\pi_{\theta_t}$.
- Compute score rankings with the current model $\pi_{\theta_t}$ and the previous model $\pi_{\theta_{t-1}}$.
- Calculate Kendall’s $\tau$ for each prompt and average over prompts to get consistency $C$.
- Build DPO datasets for both (top-1, bottom-1) and (reversed).
- Update via $C \, L_{\mathrm{DPO}}(\theta; y, y', x, z) + (1 - C) \, L_{\mathrm{DPO}}(\theta; y, y', x, 1 - z)$.
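The consistency and pair-building steps of this procedure can be sketched without any model code. The following is an illustrative sketch only: the function names are hypothetical, ties are counted as neither concordant nor discordant (Kendall's tau-a), and rescaling the averaged tau from [-1, 1] to [0, 1] so that it can serve as a mixture weight is an assumption made here, not something the text specifies.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a between two score lists over the same candidates."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equally long score lists with >= 2 items")
    total = 0
    for i, j in combinations(range(n), 2):
        prod = (a[i] - a[j]) * (b[i] - b[j])
        if prod > 0:
            total += 1   # concordant pair
        elif prod < 0:
            total -= 1   # discordant pair
    return total / (n * (n - 1) / 2)


def consistency_weight(curr_scores, prev_scores):
    """Average per-prompt tau, rescaled to [0, 1] (rescaling is an assumption)."""
    taus = [kendall_tau(c, p) for c, p in zip(curr_scores, prev_scores)]
    return (sum(taus) / len(taus) + 1.0) / 2.0


def build_pairs(responses, scores):
    """Return the (top-1, bottom-1) pair and its reversed counterpart."""
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return (responses[best], responses[worst]), (responses[worst], responses[best])


# Two prompts: the models agree fully on the first and disagree fully on the second.
c = consistency_weight([[1, 2, 3], [1, 2, 3]], [[1, 2, 3], [3, 2, 1]])
pair, reversed_pair = build_pairs(["a", "b", "c"], [0.2, 0.9, 0.1])
```

The resulting weight `c` would then scale the loss on `pair`, with `1 - c` scaling the loss on `reversed_pair`.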

3.2. Self-Consistent Internal Rewards (SCIR)

The SCIR variant (also referred to as CREAM in (Zhou et al., 13 Feb 2025)) further augments the architecture with multiple internal reward modules (IRM, GRM). Only response pairs on which the reward heads agree about the preferred response are used for training. A per-pair consistency penalty drives these heads toward mutual confidence and agreement. It is built from the reward probabilities produced by the IRM and the GRM, a confidence threshold, and an entropy term, combining a KL divergence between the two heads' output probabilities with entropy regularization.
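The agreement filter can be illustrated with two preference probabilities per pair. The sketch below is a hedged illustration, not the SCIR implementation: `p_irm` and `p_grm` are hypothetical names for the probability that each head prefers the first response, the symmetric use of the threshold is an assumption, and the Bernoulli KL divergence is shown only as one possible ingredient of a consistency penalty.

```python
import math


def agreed_label(p_irm: float, p_grm: float, threshold: float = 0.6):
    """Return 1 if both heads confidently prefer y, 0 if both prefer y', else None."""
    if p_irm >= threshold and p_grm >= threshold:
        return 1
    if p_irm <= 1.0 - threshold and p_grm <= 1.0 - threshold:
        return 0
    return None  # heads disagree or at least one is not confident


def bernoulli_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """KL(Bern(p) || Bern(q)), with clipping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def filter_pairs(pairs):
    """Keep only pairs with an agreed label; pairs are (pair_id, p_irm, p_grm)."""
    kept = []
    for pair_id, p_irm, p_grm in pairs:
        label = agreed_label(p_irm, p_grm)
        if label is not None:
            kept.append((pair_id, label))
    return kept


kept = filter_pairs([("a", 0.9, 0.8), ("b", 0.9, 0.2), ("c", 0.1, 0.3), ("d", 0.55, 0.9)])
```

Here pair `b` is dropped because the heads disagree, and pair `d` is dropped because one head falls below the confidence threshold.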

3.3. ConsistRM: Consistency Regularized Self-Rewarding for Generative Reward Models

The ConsistRM approach targets generative reward models (GRMs), introducing two reward components:

Consistency-Aware Answer Reward (CAAR): Combines current rollouts’ votes (“online” state) with a memory buffer of past pseudo-labels to achieve temporal smoothing; the consensus is ternary, and only confident (“agree”) cases yield supervision.
Consistency-Aware Critique Reward (CACR): Measures semantic similarity among generated free-form critiques. Only those critiques that are both consensus-aligned and reside in dense high-similarity clusters receive additional bonus.
The final reward for each rollout is the sum of the answer and critique rewards, subject to formatting constraints and a penalty for invalid outputs.

Reward hacking is mitigated by skipping supervision on ambiguous pairs, limiting critique bonus magnitudes, and applying KL-regularization to control policy drift (Liang et al., 8 Apr 2026).
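The answer-reward logic, including the rule of skipping supervision on ambiguous pairs, can be sketched as a ternary consensus between the current votes and a memory of past pseudo-labels. Everything specific in this sketch is an assumption made for illustration: the strict-majority rule, the state names, the handling of an empty memory, and the 0/1 reward values are not taken from the ConsistRM paper.

```python
from collections import Counter


def majority(votes):
    """Return the strict-majority label among votes, or None if there is none."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count * 2 > len(votes) else None


def consensus(online_votes, memory):
    """Ternary consensus: ('agree', label), ('disagree', None) or ('uncertain', None)."""
    online = majority(online_votes)
    # Assumption: with an empty memory, rely on the online votes alone.
    past = majority(memory) if memory else online
    if online is None or past is None:
        return ("uncertain", None)
    if online == past:
        return ("agree", online)
    return ("disagree", None)


def answer_rewards(online_votes, memory):
    """Reward rollouts matching the consensus label; no supervision otherwise."""
    state, label = consensus(online_votes, memory)
    if state != "agree":
        return [0.0] * len(online_votes)
    return [1.0 if vote == label else 0.0 for vote in online_votes]


# "A"/"B" denote which response of a pair each rollout judged to be better.
rewards_agree = answer_rewards(["A", "A", "B"], memory=["A", "A"])
rewards_conflict = answer_rewards(["A", "A", "B"], memory=["B", "B"])
```

When the memory contradicts the current majority, every rollout receives zero reward, so the pair contributes no supervision in that step.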

4. Empirical Performance and Analysis

CREAM and its instantiations have demonstrated significant consistency and accuracy improvements over standard self-rewarding and external RM-based baselines:

CREAM (Llama-3 8B, Llama-2 7B) shows sustained gains in exact-match accuracy across ARC-E, ARC-C, OBQA, SIQA, and GSM8K, outperforming SRLM in all rounds. Reward consistency ($C$) climbs sharply with CREAM, from ~0.73 to ~0.92, while SRLM saturates at ~0.46 (Wang et al., 2024).
SCIR (Mistral-7B, Mistral-7B-Instruct) increases the fraction of consistent preference pairs from ~50–55% (SRLM) to over 90% by iteration 3; length-controlled win-rate versus GPT-4 improves by +10–14% absolute on AlpacaEval 2.0 (Zhou et al., 13 Feb 2025).
ConsistRM (Qwen3-8B) achieves a +2.6% accuracy improvement over vanilla Reinforcement Fine-Tuning (RFT), and strengthens output stability, position bias robustness, and critique length efficiency (e.g., verbosity control from 1,924 to 1,717 tokens on RewardBench) (Liang et al., 8 Apr 2026).

Ablation studies systematically confirm that without regularization, dynamic selection, or multi-judge aggregation, models degrade in both performance and output stability.

Variant	Key Regularization	Model Consistency ↑	Downstream Accuracy ↑
CREAM (Wang et al., 2024)	Iteration-rank, soft DPO	High (~0.92)	Best
SRLM	None	Low (~0.46)	Lower
SCIR (Zhou et al., 13 Feb 2025)	Multi-RM, penalty & filter	High (>90%)	Best
ConsistRM (Liang et al., 8 Apr 2026)	Temporal & semantic	High (position, etc.)	Best

5. Broader Impact and Limitations

CREAM frameworks demonstrably enable scalable, high-quality LLM alignment without the need for external labeled preference or reward data. Their key contributions include mitigating overconfident/brittle reward labeling, reducing susceptibility to reward hacking, and enhancing not only the alignment but also the temporal and cross-critique stability of the resulting models. This advances the feasibility of large-scale, unsupervised RLHF.

Identified limitations include the granularity of semantic consistency enforcement (critique clustering is applied at whole-utterance level, not sub-span or process-level), meaning process-level reward modeling remains an open avenue. Additionally, case studies indicate that model size and strength of the baseline reward model affect the degree to which consistency regularization helps, with Llama-3-class models gaining most.

A plausible implication is that further scaling of this approach, with enhanced fine-grained segmentation and reference adaptation, could close residual gaps with fully human-supervised RLHF or oracle RM-based methods.

6. Extensions and Future Directions

Extensions of the CREAM principle are already manifest in advanced self-rewarding RL frameworks. For instance, trajectory-consistency and volatility-based rewards (e.g., CoVo (Zhang et al., 10 Jun 2025)) generalize the consistency principle to reasoning trajectories, providing strong empirical results on LLM reasoning benchmarks. Similarly, generative self-rewarding (as in ConsistRM (Liang et al., 8 Apr 2026)) is compatible with process-level RLHF and can be extended to finer-grained critique segmentation or more sophisticated clustering.

Future work may explore dynamic hierarchical reward aggregation, adaptive reference model choice, and integrating cross-modal or multi-agent agreement schemes to further stabilize reward signals and generalization.

7. References

"CREAM: Consistency Regularized Self-Rewarding LLMs" (Wang et al., 2024)
"Self-Consistency of the Internal Reward Models Improves Self-Rewarding LLMs" (Zhou et al., 13 Feb 2025)
"ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training" (Liang et al., 8 Apr 2026)
"Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning" (Zhang et al., 10 Jun 2025)