
DICE: Self-Alignment with DPO Implicit Rewards

Updated 14 April 2026
  • The paper introduces DICE, a methodology that uses a model’s own implicit DPO reward to self-improve without costly human feedback.
  • DICE integrates self-evaluation prompts, margin-based self-review, and on-policy bootstrapping to refine response quality and mitigate overfitting.
  • Empirical results show that DICE boosts alignment performance and cost-effectiveness over traditional DPO and RLHF methods across various benchmarks.

DICE (Self-Alignment with DPO Implicit Rewards) is a methodology for aligning LLMs that leverages the implicit reward signal derived from Direct Preference Optimization (DPO), enabling efficient, high-quality self-improvement without additional human feedback or external reward modeling. By systematically using the policy's own internal knowledge to evaluate and refine responses, DICE supports mechanisms for preference data augmentation, overfitting mitigation, and reward calibration, advancing beyond both RLHF and plain DPO in performance, cost, and practicality (Yu et al., 2024, Ko et al., 2024, Chen et al., 2024).

1. Conceptual Foundations and Motivation

Alignment of LLMs with human preferences is commonly addressed by reinforcement learning from human feedback (RLHF). Traditional RLHF first constructs an explicit reward model $r_\phi(x, y)$ from human preference pairs and then optimizes a policy $\pi_\theta$ by maximizing the reward subject to KL regularization with a reference policy. Direct Preference Optimization (DPO) simplifies this process by using the policy itself to define an implicit reward, thus obviating the need for a separate reward model and its associated training cost. After DPO training, the implicit reward is defined as

$$r_\theta(x, y) = \beta \left[ \log \pi_\theta(y|x) - \log \pi_{\rm ref}(y|x) \right].$$
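For concreteness, this implicit reward can be computed from nothing more than the summed token log-probabilities of a response under the policy and the frozen reference model. The sketch below is illustrative; the function names and example values are assumptions, not taken from the papers.

```python
def sequence_logprob(token_logprobs):
    """Summed per-token log-probabilities of one response under one model."""
    return sum(token_logprobs)

def implicit_reward(policy_token_logprobs, ref_token_logprobs, beta=0.1):
    """r_theta(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (sequence_logprob(policy_token_logprobs)
                   - sequence_logprob(ref_token_logprobs))

# Toy per-token log-probs of one response under the policy and the reference:
r = implicit_reward([-1.2, -0.4, -0.9], [-1.5, -0.6, -1.1], beta=0.1)
```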

However, vanilla DPO treats all preference tuples equally and disregards the magnitude of preference gaps between responses, which can lead to sub-optimal or poorly calibrated updates. DICE addresses this limitation by sourcing relative quality information directly from the model itself—using either self-evaluation prompts, implicit margins, or reward shaping methods—to inform more granular and targeted alignment updates (Yu et al., 2024, Ko et al., 2024, Chen et al., 2024).

2. Key Methodological Components

DICE augments DPO with intrinsic self-assessment, carried out via several complementary techniques:

  • Self-Refinement via Self-Evaluation Prompts (Quality-Aware DICE): A fixed, high-usefulness prompt is prepended to evaluation queries so the model can rate responses relative to an ideal baseline. The refinement (displacement) function is given by:

$$\Delta_\pi(y^-, y^+; x) = \beta \log \left( \frac{\pi(y^+ | p \oplus x)\, \pi_{\rm ref}(y^- | p \oplus x)}{\pi(y^- | p \oplus x)\, \pi_{\rm ref}(y^+ | p \oplus x)} \right),$$

where $p \oplus x$ denotes the self-evaluation prompt $p$ prepended to the query $x$.

This function is integrated into the DPO loss, scaled by a hyperparameter $\lambda$, and treated as a fixed target via stop-gradient (Yu et al., 2024).

  • Margin-Based Self-Reviewing (Self-Reviewing DICE): For any policy, the implicit reward margin between two responses is computed as

$$\Delta r(x, y^+, y^-) = \frac{1}{\beta} \left[ r_\theta(x, y^+) - r_\theta(x, y^-) \right]$$

to select high-margin preference pairs for further training, thus filtering out noisy or ambiguous data (Ko et al., 2024).

  • On-Policy Bootstrapping and Experience Replay: Instead of relying solely on offline, possibly stale datasets, DICE samples new responses from the current model, ranks them via implicit reward, and constructs new preference pairs. These are mixed with original human-annotated pairs in a specified ratio to preserve data diversity and avoid catastrophic forgetting (Chen et al., 2024).
  • Length-Regularized Reward Shaping: To prevent the model from exploiting verbosity as a proxy for quality, DICE applies a regularization term to penalize response length:

$$r_{\rm LR}(x, y; \alpha) = r_\theta(x, y) - \alpha |y|$$

where $\alpha$ is tuned to minimize the average length gap between winners and losers (Chen et al., 2024). Minimal sketches of these self-assessment signals are given after this list.
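As one illustration, the displacement function reduces to differences of summed log-probabilities computed with the self-evaluation prompt prepended. The sketch below assumes precomputed sequence log-probabilities; the function signature and example values are hypothetical.

```python
def displacement(lp_pi_pos, lp_pi_neg, lp_ref_pos, lp_ref_neg, beta=0.1):
    """Delta_pi(y-, y+; x): implicit reward gap between y+ and y-, where each
    lp_* is the summed log-probability of the response under the policy (pi)
    or reference model (ref) given the self-evaluation prompt p prepended to x.
    Equivalent to beta * log[(pi(y+|p+x) * ref(y-|p+x)) / (pi(y-|p+x) * ref(y+|p+x))]."""
    return beta * ((lp_pi_pos - lp_pi_neg) - (lp_ref_pos - lp_ref_neg))

# Hypothetical log-probabilities measured with the evaluation prompt prepended:
delta = displacement(lp_pi_pos=-12.3, lp_pi_neg=-15.8,
                     lp_ref_pos=-13.1, lp_ref_neg=-14.9)
```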
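Similarly, margin-based pair selection and length-regularized scoring of on-policy samples can be sketched as follows. The Candidate container, the whitespace token count, and the min_margin threshold are illustrative assumptions rather than details from the papers.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    lp_policy: float  # summed log-prob of the response under the current policy
    lp_ref: float     # summed log-prob under the frozen reference model

def implicit_reward(c: Candidate, beta: float = 0.1) -> float:
    """r_theta(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (c.lp_policy - c.lp_ref)

def length_regularized_reward(c: Candidate, alpha: float, beta: float = 0.1) -> float:
    """r_LR(x, y; alpha) = r_theta(x, y) - alpha * |y|; |y| is approximated here
    by a whitespace token count purely for illustration."""
    return implicit_reward(c, beta) - alpha * len(c.text.split())

def build_pair(candidates, alpha=0.01, beta=0.1, min_margin=0.0):
    """Rank sampled candidates by length-regularized implicit reward; keep the
    (winner, loser) pair only if the implicit reward margin is large enough."""
    ranked = sorted(candidates,
                    key=lambda c: length_regularized_reward(c, alpha, beta),
                    reverse=True)
    winner, loser = ranked[0], ranked[-1]
    margin = (implicit_reward(winner, beta) - implicit_reward(loser, beta)) / beta
    return (winner, loser) if margin >= min_margin else None
```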

3. Formal Objectives and Theoretical Underpinnings

DICE modifies the DPO (and variant IPO) objectives to incorporate the additional intrinsic knowledge derived from the policy:

  • Self-Refined DPO (Sr-DPO):

$$\mathcal{L}_{\rm Sr\text{-}DPO}(\theta) = -\,\mathbb{E}_{(x, y^+, y^-)}\left[\log \sigma\!\left(\beta \log \frac{\pi(y^+|x)}{\pi_{\rm ref}(y^+|x)} - \beta \log \frac{\pi(y^-|x)}{\pi_{\rm ref}(y^-|x)} - \lambda\, \perp\!\left[\Delta_\pi(y^-, y^+; x)\right]\right)\right]$$

where $\perp[\cdot]$ denotes the stop-gradient operator (Yu et al., 2024); a minimal implementation sketch is given after this list.

  • Self-Refined IPO (Sr-IPO): the IPO objective is modified analogously, with the same stop-gradient displacement term $\lambda \perp[\Delta_\pi(y^-, y^+; x)]$ subtracted inside IPO's squared-error loss (Yu et al., 2024).

  • Margin-based DICE Loss: the DPO objective itself is retained, but each training batch is restricted to preference pairs with large implicit reward margins $\Delta r(x, y^+, y^-)$, concentrating updates on confidently ordered pairs (Ko et al., 2024).
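A minimal PyTorch-style sketch of the Sr-DPO objective is given below, with the stop-gradient implemented via .detach(). The tensor names, batching convention, and toy values are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def sr_dpo_loss(lp_pi_pos, lp_pi_neg, lp_ref_pos, lp_ref_neg,
                lp_pi_pos_p, lp_pi_neg_p, lp_ref_pos_p, lp_ref_neg_p,
                beta=0.1, lam=0.1):
    """Sr-DPO loss for a batch of (x, y+, y-) triples.

    lp_*   : summed response log-probs under the policy / reference given x
    lp_*_p : the same quantities with the self-evaluation prompt p prepended
    The displacement term is detached, implementing the stop-gradient term."""
    margin = beta * (lp_pi_pos - lp_ref_pos) - beta * (lp_pi_neg - lp_ref_neg)
    delta = beta * ((lp_pi_pos_p - lp_pi_neg_p) - (lp_ref_pos_p - lp_ref_neg_p))
    return -F.logsigmoid(margin - lam * delta.detach()).mean()

# Toy batch of two triples (all log-prob values are made up for illustration):
args = [torch.tensor(v) for v in
        ([-10., -9.], [-14., -12.], [-11., -10.], [-13., -12.],
         [-9., -8.],  [-15., -13.], [-10., -9.],  [-14., -12.])]
loss = sr_dpo_loss(*args)
```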

Theoretical properties guarantee that, under alignment and prompt-consistency assumptions, the displacement function faithfully tracks true reward differences, and margin-based selection reduces both empirical risk variance and the impact of spurious correlations (Yu et al., 2024, Ko et al., 2024).

4. Algorithmic Workflow

The typical DICE algorithm iteratively refines a policy via the following steps (Chen et al., 2024, Ko et al., 2024):

  1. On-Policy Data Generation: For each prompt, sample multiple candidate responses from the current policy, score via length-regularized implicit reward, and create new preference pairs as (winner, loser). Highest-margin pairs are preferred.
  2. Sample Selection and Replay: Combine a chosen proportion of the original human-labeled (offline) pairs with the bootstrapped on-policy pairs to create a mixed dataset.
  3. Loss Minimization: Fine-tune the policy on the mixed dataset using the modified DPO objective appropriate for the DICE variant.
  4. Reference Policy Update: After each round, optionally update $\pi_{\rm ref}$ to the latest policy.
  5. (Optionally) Ensemble Margins: For stability, average implicit reward signals across recent policy checkpoints rather than relying on the latest checkpoint alone.
  6. Repeat: Empirical results indicate 1–2 DICE rounds deliver most of the benefit.

A simplified pseudocode sketch of one DICE-DPO round, consistent with the workflow above, is as follows (Chen et al., 2024, Ko et al., 2024); the helper functions (sample_from, seq_logprob, num_tokens, mix_datasets, dpo_finetune) are illustrative placeholders rather than the authors' implementation:
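```python
def dice_dpo_round(policy, ref, prompts, offline_pairs,
                   n_samples=8, alpha=0.01, beta=0.1, offline_ratio=0.5):
    """One round of DICE-DPO (pseudocode-level sketch)."""
    onpolicy_pairs = []
    for x in prompts:
        # 1. On-policy generation: sample candidate responses from the current policy.
        candidates = sample_from(policy, x, n=n_samples)
        # 2. Score each candidate with the length-regularized implicit reward.
        def score(y):
            r = beta * (seq_logprob(policy, x, y) - seq_logprob(ref, x, y))
            return r - alpha * num_tokens(y)
        ranked = sorted(candidates, key=score, reverse=True)
        # 3. Highest- and lowest-scoring responses form the new preference pair.
        onpolicy_pairs.append((x, ranked[0], ranked[-1]))
    # 4. Experience replay: mix bootstrapped pairs with offline human-labeled pairs.
    dataset = mix_datasets(onpolicy_pairs, offline_pairs, offline_ratio)
    # 5. Fine-tune with the DPO-style objective; the reference may then be refreshed.
    policy = dpo_finetune(policy, ref, dataset, beta=beta)
    return policy
```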

5. Empirical Evaluation and Outcomes

DICE methodologies yield consistent and significant improvements over both vanilla DPO and related direct alignment approaches across a variety of tasks, model sizes, and benchmarks:

  • Alignment Benchmarks:
    • On AlpacaEval 2.0 with Zephyr-7B, DICE-DPO improved length-controlled win rate by 8.02% and on Llama3-8B by 9.35% over base DPO (Chen et al., 2024).
    • MT-Bench: Sr-DPO outperformed DPO with win/tie/lose = 45.6%/33.8%/20.6%; Sr-IPO vs IPO = 38.8%/25.0%/26.2% (Yu et al., 2024).
    • Vicuna-Bench: Sr-DPO win/tie/lose = 63.8%/13.8%/22.5%; Sr-IPO vs IPO = 60.0%/8.8%/31.3%.
    • Open-LLM Leaderboard (average acc): Sr-DPO (64.48%) > DPO (63.04%) > Zephyr-7B-SFT (58.14%) (Yu et al., 2024).
  • Loss Calibration: Sr-DPO and margin-based DICE show higher Pearson correlation of their learned rewards with external (GPT-4) scoring of answer quality, indicating greater semantic alignment.
  • Computational Cost: DICE incurs approximately 20–25% higher GPU hours compared to vanilla DPO/IPO due to the need for double forward-passes or increased candidate sampling (Yu et al., 2024).
  • Robustness: Margin-based sample selection and experience replay reduce overfitting to spurious features (e.g., verbosity), supporting more stable performance across diverse distributions (Ko et al., 2024, Chen et al., 2024).
  • Ablations: Mixing on-policy and offline pairs (e.g., 0.5 on Zephyr, 0.1 on Llama3) gives the best trade-off between data freshness and stability, while appropriate length regularization (choice of $\alpha$) minimizes length bias.
Model/Setting               LC-WR (%)   WR (%)   Reference
Zephyr-7B DICE Iter 2       20.71       20.16    (Chen et al., 2024)
Llama3-8B DICE Iter 2       27.55       30.99    (Chen et al., 2024)
Zephyr-7B-SFT (baseline)    12.69       10.71    (Chen et al., 2024)

Sr-DPO additionally attains 64.48% average accuracy on the Open-LLM Leaderboard (Yu et al., 2024).

6. Limitations and Open Challenges

  • Reliance on Model’s Own Implicit Reward: If the initial DPO-aligned model is poor, reward hacking or reward collapse can occur, especially in iterated bootstrapping (Chen et al., 2024).
  • Compute Overhead: Double forward-passes or candidate ranking increase resource requirements by ~20–25% over vanilla DPO implementations (Yu et al., 2024, Ko et al., 2024).
  • Domain and Prompt Sensitivity: The method assumes meaningful self-evaluation is possible and that the selected prompt (for self-refinement) generalizes, which may not hold universally.
  • Data Diversity and Edge Cases: Hard thresholding on margin may reduce diversity and underrepresent ambiguous or rare preference types (Ko et al., 2024).
  • Lack of Online Integration: Current published DICE pipelines are primarily offline; online variants and adaptation to continually evolving tasks remain unexplored.

Future directions include adaptive (percentile-based) margin thresholds, diversity-aware sampling, lightweight verification, multi-objective reward vectorization, and formal analysis of iterative bootstrapping dynamics (Ko et al., 2024, Chen et al., 2024).

7. Relationship to Broader Direct Alignment Paradigms

DICE is compatible with any contrastive direct alignment algorithm, including DPO, IPO, Sequence Likelihood Calibration (SLiC-HF), SimPO, and their variants (Ko et al., 2024, Chen et al., 2024). It yields consistent alignment improvements across all tested architectures and loss families, demonstrating its general efficacy and modularity. The technique preserves the key simplicity and stability properties of direct alignment methods, but substantially boosts their capacity for self-calibration and reward discrimination—without reliance on external human or LLM annotation.

In summary, DICE marks a significant advance in the direct alignment of LLMs, providing principled, cost-efficient, and empirically validated mechanisms for model self-improvement and robust, fine-grained reward sensitivity (Yu et al., 2024, Ko et al., 2024, Chen et al., 2024).
