CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

Published 19 May 2026 in cs.LG, cs.CL, and cs.CV | (2605.19436v1)

Abstract: When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at https://github.com/ahmedheakl/CEPO.


Authors (7)

Summary

The paper introduces CEPO, replacing single-reference token credit with a contrastive evidence ratio that compares correct and rejected outputs.
It reports gains on five multimodal mathematical reasoning benchmarks, improving average accuracy over GRPO by 2.26 and 3.13 percentage points at 2B and 4B scale, respectively.
CEPO ensures leakage-free gradients and efficient credit assignment by repurposing rejected rollouts, introducing minimal additional computational overhead.

CEPO: Contrastive Evidence Policy Optimization for RLVR Self-Distillation

Motivation and Background

Recent advances in reinforcement learning with verifiable rewards (RLVR) have established it as a principal approach for fine-tuning LLMs to perform complex reasoning tasks. Core RLVR frameworks, such as Group Relative Policy Optimization (GRPO), operate by assigning uniform sequence-level credit to all tokens in a trajectory labeled as correct, with the drawback that the credit assignment mechanism is unable to differentiate between semantically decisive reasoning steps and non-informative filler tokens. This limitation is particularly acute in settings such as mathematical reasoning, where sparse, critical inference steps drive task success.
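The uniform credit described above can be illustrated with a minimal sketch. It assumes the usual group-relative standardization of verifier rewards within one prompt's rollout group; the function names and the toy rollouts are hypothetical and are not taken from the paper's code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize verifier rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantages, rollouts):
    """Give every token in a rollout that rollout's single sequence-level advantage."""
    return [[adv] * len(tokens) for adv, tokens in zip(advantages, rollouts)]

# Toy group: two rollouts for the same prompt (illustrative data only).
rollouts = [
    ["So", "x", "=", "4", "."],   # verified correct
    ["So", "x", "=", "5", "."],   # verified wrong
]
rewards = [1.0, 0.0]

advs = group_relative_advantages(rewards)
token_credit = broadcast_to_tokens(advs, rollouts)
```

In this toy group the filler token "So" and the decisive token "4" end up with exactly the same credit, which is the limitation the section describes.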

Token-level credit assignment methods, whether based on costly Monte Carlo re-simulation or on auxiliary process reward models, offer finer granularity but at the expense of significant computational overhead or additional supervisory signals. A parallel thread, self-distillation in RLVR (e.g., On-Policy Self-Distillation Policy Optimization, OPSD; Self-Distilled RLVR, RLSD), aims to provide dense feedback without auxiliary models by conditioning the policy on ground-truth responses. However, distribution-matching variants such as OPSD suffer from information leakage: conditioning on the privileged reference answer introduces correlations between the input and that answer into the gradients, and performance degrades as training proceeds.

RLSD addresses leakage structurally by restricting the teacher signal to the sampled token, under a stop-gradient operation, while using an evidence ratio between teacher and student probabilities. Yet, RLSD's signal remains limited by base-rate fluency effects, asymmetric negative feedback, and an inability to distinguish true reasoning from filler when both appear surprising relative to the model's current policy.

The CEPO Method

CEPO (Contrastive Evidence Policy Optimization) introduces a contrastive, token-level evidence ratio which replaces the single reference used in RLSD with a pairwise comparison between correct-answer and wrong-answer teachers. For each token, CEPO calculates the ratio of the probability assigned by the model to the token, conditioned on the correct output, to the probability assigned conditioned on a rejected (incorrect) output, drawn from the current batch's rollouts. This contrastive delta serves as a measure of how much a token is favored by the correct rationale and, simultaneously, disfavored by incorrect rationales.

Notably, the wrong-answer teacher is constructed using rejected samples from within the training batch, introducing no additional sampling overhead. The mathematical construction ensures that, at positions where both the correct- and wrong-answer teachers agree (typically filler), the contrastive signal collapses to neutrality, whereas it is selectively sharp at decisive, discriminative points.
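A small numerical sketch of the contrastive signal follows. It assumes the signal is taken as a per-token log ratio of the sampled token's probability under the two teachers; the probabilities are made-up illustrative values and `contrastive_delta` is a hypothetical helper, not the paper's implementation.

```python
import math

def contrastive_delta(logp_pos, logp_neg):
    """Per-token log evidence ratio:
    log p(token | correct reference) - log p(token | rejected reference)."""
    return [lp - ln for lp, ln in zip(logp_pos, logp_neg)]

tokens = ["So", "x", "=", "4", "."]
# Probability of each sampled token under the two teachers (illustrative numbers).
p_pos = [0.90, 0.80, 0.95, 0.70, 0.99]   # conditioned on the correct answer
p_neg = [0.90, 0.80, 0.95, 0.05, 0.99]   # conditioned on a rejected answer

delta = contrastive_delta(
    [math.log(p) for p in p_pos],
    [math.log(p) for p in p_neg],
)

for tok, d in zip(tokens, delta):
    print(f"{tok!r}: {d:+.3f}")
```

Where the two teachers agree (every position except "4" here) the delta is exactly zero, and it is large only at the token that separates the correct answer from the rejected one.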

CEPO's advantages are underpinned by three theoretical properties. The first two are structural guarantees inherited from RLSD; the third relates CEPO back to RLSD:

Direction anchoring: Token-level update directions remain sign-aligned with the verifier, preventing undesirable flips due to privileged information.
Leakage-free gradients: No vocabulary-wide, r-conditioned summations are present, circumventing information leakage.
RLSD containment: When the negative teacher coincides with the student prior, CEPO exactly recovers RLSD as a limiting case.
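Two of these properties can be checked on toy numbers. The sketch below assumes that the evidence enters the update as a strictly positive multiplicative weight on the sequence-level advantage, treated as a constant during backpropagation; that modulation form, and all function names, are assumptions made for illustration rather than the paper's exact objective.

```python
import math

def rlsd_delta(logp_pos, logp_student):
    """Single-reference evidence: teacher (correct answer) versus the student itself."""
    return [p - s for p, s in zip(logp_pos, logp_student)]

def cepo_delta(logp_pos, logp_neg):
    """Contrastive evidence: correct-answer teacher versus wrong-answer teacher."""
    return [p - n for p, n in zip(logp_pos, logp_neg)]

def modulated_advantage(seq_adv, deltas):
    """Scale the verifier's advantage by exp(delta), which is always positive,
    so the sign set by the verifier cannot flip (assumed modulation form)."""
    return [seq_adv * math.exp(d) for d in deltas]

logp_pos = [-0.1, -0.4, -2.0]       # illustrative log-probs under the correct-answer teacher
logp_neg = [-0.1, -1.5, -1.0]       # illustrative log-probs under the wrong-answer teacher
logp_student = [-0.2, -0.9, -1.3]   # illustrative log-probs under the unconditioned student

deltas = cepo_delta(logp_pos, logp_neg)
positive_case = modulated_advantage(+1.0, deltas)
negative_case = modulated_advantage(-1.0, deltas)

# Containment: using the student as the negative teacher gives back the RLSD evidence.
recovered = cepo_delta(logp_pos, logp_student)
```

Because the weight is positive at every position, each token's update keeps the verifier's sign, and substituting the student for the negative teacher reproduces the single-reference evidence term by term.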

CEPO further admits a Bayesian interpretation: the contrastive delta quantifies the differential posterior update for the correct versus wrong rationale, rigorously justifying its use for pinpointed credit assignment.
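One way to make this reading concrete is the following identity, which is just Bayes' rule applied to each teacher. The notation is assumed here rather than taken from the paper: y_t is the sampled token, y_{<t} its prefix, r^+ and r^- the correct and rejected references, and conditioning on the prompt is left implicit.

```latex
% Bayes' rule for either reference r:
%   p(y_t \mid y_{<t}, r) = \frac{p(r \mid y_{\le t})\, p(y_t \mid y_{<t})}{p(r \mid y_{<t})}
% Taking the ratio for r^+ and r^- cancels the student term p(y_t \mid y_{<t}):
\log \frac{p(y_t \mid y_{<t}, r^{+})}{p(y_t \mid y_{<t}, r^{-})}
  = \log \frac{p(r^{+} \mid y_{\le t})}{p(r^{-} \mid y_{\le t})}
  - \log \frac{p(r^{+} \mid y_{<t})}{p(r^{-} \mid y_{<t})}
```

Under this identity the log evidence ratio at a token equals the change in posterior log-odds of the correct versus the rejected reference caused by emitting that token. It is zero when the token shifts both posteriors equally, which matches the neutrality at filler positions described above.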

Empirical Results

CEPO was evaluated at two model scales (Qwen3-VL-2B and Qwen3-VL-4B) across five multimodal mathematical reasoning benchmarks: DynaMath, Logic Vista, Math Vision-mini, MMMU, and WeMath. All methods were trained under identical training budget, rollout structure, and optimizer settings.

Strong numerical results highlight CEPO's effectiveness in remedying the credit assignment bottleneck:

On Qwen3-VL-2B, CEPO achieved 43.43% average accuracy, representing a +2.26 percentage point gain over GRPO and a significant margin over RLSD and distribution-matching self-distillation baselines (OPSD, SDPO).
Qwen3-VL-4B saw average accuracy raised to 60.56% (+3.13 points over GRPO).
On Logic Vista and Math Vision-mini—benchmarks emphasizing multi-step reasoning—CEPO's improvements over GRPO exceeded 6% on the 4B scale.
Baselines OPSD and SDPO performed worse than the untrained initialization, empirically corroborating the degradation predicted by information leakage theory.
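The reported average gains follow directly from the accuracies quoted in the abstract. The short check below recomputes them as absolute differences in percentage points; the dictionary layout is only an illustrative convenience.

```python
# Average accuracies (%) as reported in the abstract.
reported = {
    "2B": {"CEPO": 43.43, "GRPO": 41.17},
    "4B": {"CEPO": 60.56, "GRPO": 57.43},
}

# Absolute gain in percentage points (not a relative percentage).
gains = {
    scale: round(acc["CEPO"] - acc["GRPO"], 2)
    for scale, acc in reported.items()
}

print(gains)
```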

Qualitative analyses, including token-level heatmaps and ablation studies, demonstrate that CEPO's contrastive signal reliably sharpens credit on semantically central tokens (e.g., steps involving algebraic manipulation or reasoning), while broadly maintaining neutrality on filler.

Analysis and Practical Implications

CEPO's structural change, contrastive per-token evidence modulation without information leakage, fits into existing RLVR pipelines at only marginal computational overhead relative to RLSD. Its construction requires no auxiliary networks, does not alter the core actor-critic architecture, and reuses rejected rollouts that are already available in the batch.
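How a wrong-answer reference might be pulled from rollouts that are already in the batch can be sketched as follows. The `Rollout` record, the binary reward convention, and the "most frequent wrong answer" selection rule are all assumptions for illustration; the source only states that rejected rollouts are reused.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rollout:
    answer: str     # final answer extracted from the rollout (hypothetical field)
    reward: float   # verifier outcome: 1.0 if correct, 0.0 if rejected

def pick_negative_reference(group: list[Rollout]) -> Optional[str]:
    """Return the most frequent rejected answer in a rollout group,
    or None when the group contains no rejected rollout."""
    wrong = [r.answer for r in group if r.reward == 0.0]
    if not wrong:
        return None
    return Counter(wrong).most_common(1)[0][0]

# One prompt's rollout group (illustrative data only).
group = [Rollout("4", 1.0), Rollout("5", 0.0), Rollout("5", 0.0), Rollout("-4", 0.0)]
neg = pick_negative_reference(group)

all_correct = [Rollout("4", 1.0), Rollout("4", 1.0)]
no_neg = pick_negative_reference(all_correct)
```

No extra sampling is involved: the reference comes from answers the policy already produced. What to do when a group has no rejected rollout is a design choice this sketch leaves to the caller by returning `None`.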

Ablation studies indicate the optimal teacher configuration uses the current on-policy actor as both teacher and student, with negative references derived from the answer field of rejected rollouts. These design choices further reduce memory and engineering burden. CEPO's hyperparameter sensitivity is moderate: reasonable evidence clip values and warmup scheduling are recommended, and neither materially alters its performance.
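The two hyperparameters mentioned here, an evidence clip and a warmup schedule, could be wired up as in the sketch below. The symmetric clip on the log ratio, the linear ramp, and the parameter names `clip_value` and `warmup_steps` are assumptions chosen for illustration; the source does not specify their exact form or recommended values.

```python
def warmup_scale(step, warmup_steps):
    """Linear ramp from 0 to 1 over the first warmup_steps training steps."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

def clipped_evidence(delta, clip_value, step, warmup_steps):
    """Clip the per-token log evidence ratio to [-clip_value, clip_value],
    then scale it by the warmup factor so early updates stay close to
    plain sequence-level credit."""
    clipped = max(-clip_value, min(clip_value, delta))
    return warmup_scale(step, warmup_steps) * clipped

# Illustrative values only.
at_start = clipped_evidence(5.0, clip_value=2.0, step=0, warmup_steps=100)
midway = clipped_evidence(-0.3, clip_value=2.0, step=50, warmup_steps=100)
after_warmup = clipped_evidence(5.0, clip_value=2.0, step=100, warmup_steps=100)
```

With these toy settings the evidence contributes nothing at step 0, half of its clipped value midway through warmup, and is capped at the clip value afterwards, which bounds how far any single token's credit can move.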

From a theoretical standpoint, CEPO achieves discriminative sharpness: it amplifies credit at precisely those points where the model's output diverges between correct and incorrect rationales, embodying a principled approach to credit assignment in long, complex reasoning chains.

Future Directions

The empirical and theoretical properties of CEPO position it as a key candidate for advancing RLVR credit assignment, particularly in domains characterized by long-horizon, sparse-reward tasks such as code generation, stepwise scientific reasoning, and agentic planning. Natural extensions include scaling CEPO further to larger models, adapting it to text-only settings, and integrating it into model-based RL or RL-from-human-feedback protocols.

Beyond practical improvements, CEPO offers a methodological blueprint for developing structurally safe, interpretable, and computationally efficient self-distillation mechanisms in the context of LLM training.

Conclusion

CEPO advances the state of the art in RLVR self-distillation by exploiting contrastive evidence across correct and incorrect rationales at the token level. It inherits the structural guarantees of RLSD while introducing superior signal quality precisely at semantically decisive positions, yielding consistent and significant improvements over baseline and previously published self-distillation approaches. Theoretical analysis, empirical validation, and analysis of information leakage collectively establish CEPO as a principled and effective solution for credit-aware token-level RLVR training (2605.19436).
