Contrastive Likelihood Reward (CLR)

Updated 4 July 2026

Contrastive Likelihood Reward (CLR) is a training motif that uses paired likelihood differences to rank outputs, aligning model scores with quality metrics.
It is applied across domains such as abstractive summarization, retrieval-augmented generation, reinforcement learning, and controllable text generation.
CLR-style objectives optimize the log-likelihood gap between positive and negative candidates, so that the model's likelihood ordering tracks the quality or evidence signal that defines the contrast.

Contrastive Likelihood Reward (CLR) denotes a family of training signals in which optimization is driven by a contrast over likelihood or log-likelihood, rather than by an absolute scalar score alone. In the literature covered here, the term appears explicitly in abstractive summarization and retrieval-augmented generation, while closely related formulations appear under other names in reinforcement learning with verifiable rewards, reinforcement learning from human feedback, controllable text generation, and multiagent exploration. The common design move is to prefer outputs whose model-assigned likelihood is higher under a desired condition, ranking, or evidential context than under a competing alternative (Chern et al., 2023, Tan et al., 2 Feb 2026, Zhang et al., 13 May 2026).

1. Terminology and scope

CLR is not a single standardized algorithm. The label is used explicitly in some papers, used implicitly in others, and in several cases the underlying mechanism is described as contrastive, likelihood-based, or reward-like without adopting the exact term “Contrastive Likelihood Reward.”

Paper	Exact designation	Core contrast
"Improving Factuality of Abstractive Summarization via Contrastive Reward Learning" (Chern et al., 2023)	Contrastive Likelihood Reward (CLR)	Higher-ranked vs lower-ranked candidate summaries
"CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models" (Tan et al., 2 Feb 2026)	Contrastive Likelihood Reward (CLR)	Full-context likelihood vs leave-one-out evidence
"Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective" (Zhang et al., 13 May 2026)	ConSPO	Positive rollouts vs negative distractors
"Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards" (Shen et al., 2024)	contrastive reward	Current reward vs offline prompt baseline
"Click: Controllable Text Generation with Sequence Likelihood Contrastive Learning" (Zheng et al., 2023)	sequence likelihood contrastive learning	Positive vs negative continuations
"Counterfactual Conditional Likelihood Rewards for Multiagent Exploration" (Aydeniz et al., 12 Feb 2026)	Counterfactual Conditional Likelihood (CCL) rewards	Actual vs counterfactual observation

In abstractive summarization, the concrete instantiation is explicitly called Contrastive Likelihood Reward and is defined through a pairwise contrastive objective over candidate summaries ranked by a reward or quality metric (Chern et al., 2023). In retrieval-augmented generation, CLR is the central reward in an RL framework and is defined as the log-likelihood gap between responses conditioned on prompts with and without supporting evidence (Tan et al., 2 Feb 2026). In RLVR, the term does not appear verbatim, but the paper states that what is being called CLR maps directly to ConSPO’s contrastive sequence-level policy optimization, because ConSPO replaces GRPO’s clipped-ratio surrogate with a contrastive objective built on sequence log-probabilities (Zhang et al., 13 May 2026). In RLHF, the paper primarily uses the term contrastive reward and defines it by subtracting an offline baseline reward from the current reward (Shen et al., 2024). In multiagent exploration, the method is named CCL rather than CLR, and the paper explicitly frames the relation as conceptual rather than terminological (Aydeniz et al., 12 Feb 2026).

This suggests that CLR is best understood as a methodological motif: likelihood-aligned contrastive training or reward shaping, instantiated differently across domains.

2. Recurrent objective structure

A recurring pattern across CLR-style methods is that the optimized quantity is not a raw score in isolation, but a difference, margin, or ratio between two likelihood-related terms.

In summarization, the score assigned to a candidate summary is the length-normalized log-likelihood

$f(S)=\frac{1}{L}\sum_{t=1}^{L}\log p_{g_\theta}(s_t\mid D,s_{<t}),$

and the contrastive loss penalizes the model when a lower-quality summary receives higher likelihood than a higher-quality one: $L_{\text{ctr}}=\sum_i\sum_{j>i}\max\bigl(0,\ f(S_j)-f(S_i)+\lambda_{ij}\bigr), \qquad \lambda_{ij}=(j-i)\lambda.$ The combined fine-tuning objective is

$L_{\text{com}}=L_{\text{mle}}+\gamma L_{\text{ctr}}.$

The ranking is induced by a factuality or quality metric such as BARTScore, DAE, or ROUGE, so the model is trained to align its own likelihood ordering with the metric ordering (Chern et al., 2023).
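The ranking loss above can be sketched in a few lines. This is a minimal illustration, assuming token log-probabilities are already available as plain Python lists and that candidates are sorted from best to worst by the chosen metric. The function names are hypothetical and not taken from the paper.

```python
def length_normalized_score(token_logprobs):
    """f(S): mean token log-probability of one candidate summary."""
    return sum(token_logprobs) / len(token_logprobs)


def ranking_contrastive_loss(scores, margin):
    """Pairwise hinge loss over candidates sorted best (index 0) to worst.

    scores[i] is f(S_i). A pair (i, j) with j > i is penalized when the
    lower-ranked candidate j is not at least (j - i) * margin below i.
    """
    loss = 0.0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            loss += max(0.0, scores[j] - scores[i] + (j - i) * margin)
    return loss


def combined_loss(mle_loss, scores, margin, gamma):
    """L_com = L_mle + gamma * L_ctr, with mle_loss computed elsewhere."""
    return mle_loss + gamma * ranking_contrastive_loss(scores, margin)
```

For example, `ranking_contrastive_loss([-0.5, -0.9, -0.6], margin=0.1)` is nonzero because the third-ranked candidate scores above the second-ranked one, while a well-separated ordering such as `[-0.1, -1.0, -2.0]` incurs no loss.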

In RLVR, ConSPO uses the length-normalized sequence log-probability

$s_\theta(o,q)=\frac1{|o|}\sum_{t=1}^{|o|}\log \pi_\theta(o_t\mid q,o_{<t}),$

and then applies a group-wise InfoNCE-style loss over positives and negatives from the same rollout group: $\widehat{\mathcal J}_{\rm NCE}(q)=\frac1{N^+}\sum_{i=1}^{N^+}\tau\log \frac{\exp(s_\theta(o_i^+,q)/\tau)} {\exp(s_\theta(o_i^+,q)/\tau)+\sum_{j=1}^{N^-}\exp(s_\theta(o_j^-,q)/\tau)}.$ The paper’s framing is that likelihood-aligned scoring and contrast-sensitive credit assignment replace GRPO’s clipped-ratio surrogate and uniform within-group coefficients (Zhang et al., 13 May 2026).
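As a rough illustration of the group-wise objective, the sketch below evaluates $\widehat{\mathcal J}_{\rm NCE}$ for one prompt from precomputed rollout scores. It assumes scores are plain floats and omits everything else in ConSPO (margin, sampling, optimizer). The function names are illustrative only.

```python
import math


def sequence_score(token_logprobs):
    """s_theta(o, q): length-normalized sequence log-probability."""
    return sum(token_logprobs) / len(token_logprobs)


def info_nce_objective(pos_scores, neg_scores, tau=1.0):
    """Group-wise InfoNCE-style objective (to be maximized).

    Each positive is contrasted against all negatives of the same group.
    The log-sum-exp is computed with a max shift for numerical stability.
    """
    total = 0.0
    for s_pos in pos_scores:
        logits = [s_pos / tau] + [s / tau for s in neg_scores]
        shift = max(logits)
        log_z = shift + math.log(sum(math.exp(l - shift) for l in logits))
        total += tau * (s_pos / tau - log_z)
    return total / len(pos_scores)
```

The objective is never positive, and it approaches zero as every positive rollout outscores the negatives of its group by a wide margin relative to `tau`.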

In RAG, CLR is an evidential contribution: $\zeta(y)=S(y\mid D)-S^{-}(y\mid D),$ where

$S(y\mid D)=\sum_{t=1}^{T}\log P(y_t\mid y_{<t},q,D)$

and

$S^{-}(y\mid D)=\min_{d_i\in D^+}\sum_{t=1}^{T}\log P(y_t\mid y_{<t},q,D\setminus\{d_i\}).$

The operational reward is

$R_{\mathrm{CLR}}(y)=\frac{\zeta(y)\cdot \mathbb{I}(\zeta(y)>\tau)}{\sqrt{T}}.$

Here the contrast is between the same response with supporting evidence present and with the most critical supporting document removed (Tan et al., 2 Feb 2026).
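The reward computation can be sketched as follows, assuming the token log-probabilities of the same response have already been scored once under the full context and once per removed supporting document. The function name and argument layout are assumptions made for illustration.

```python
import math


def clr_reward(full_logprobs, loo_logprobs_by_doc, threshold):
    """Thresholded, length-normalized evidential contribution.

    full_logprobs: token log-probs of y given q and the full document set D.
    loo_logprobs_by_doc: one list of token log-probs of y per supporting
        document, each scored with that single document removed from D.
    threshold: the cutoff tau; contributions at or below it yield zero.
    """
    num_tokens = len(full_logprobs)
    s_full = sum(full_logprobs)
    # Leave-one-out score: the removal that hurts the likelihood most.
    s_minus = min(sum(lp) for lp in loo_logprobs_by_doc)
    zeta = s_full - s_minus
    if zeta <= threshold:
        return 0.0
    return zeta / math.sqrt(num_tokens)
```

A response whose likelihood barely changes when any supporting document is removed gets a contribution near zero and is filtered out by the threshold, which is the intended signal that the answer does not rely on the evidence.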

In RLHF, the contrastive reward used in PPO is

$r_{x,y}^{\rm RL}:=r_{x,y}-g\left(\{r^{\text{base}}_{x,y_j}\}_{j=1}^{k}\right),$

with the aggregation function $g$ chosen as the mean. The baseline rewards come from responses sampled offline from the SFT model and scored by the reward model. The method thereby converts PPO into a relative-improvement objective over a prompt-dependent baseline (Shen et al., 2024).

In multiagent exploration, CCL defines a conditional likelihood contrast between an agent's actual observation and a counterfactual observation, both conditioned on teammates' observations, where the counterfactual observation is the agent's previous observation (Aydeniz et al., 12 Feb 2026). In controllable generation, Click uses a max-margin contrastive loss on whole-sequence likelihood,

$L_{\text{CL}}=\max\bigl(0,\ m+\log P_\theta(\hat y^{-}\mid x)-\log P_\theta(\hat y^{+}\mid x)\bigr),$

for a positive continuation $\hat y^{+}$ and a negative continuation $\hat y^{-}$ of the same prompt $x$ with margin $m$, with ordinary LM loss retained as a regularizer (Zheng et al., 2023).

3. RLVR and the ConSPO interpretation of CLR

In reinforcement learning with verifiable rewards, the immediate context for CLR is the contrastive reinterpretation of GRPO in "Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective" (Zhang et al., 13 May 2026). The paper begins from RLVR, where rollouts are generated by the current policy, checked by an external verifier, and assigned binary or outcome rewards. GRPO is treated as a representative RLVR algorithm that estimates group-relative advantages from those rewards and updates the policy without a learned critic.

The central claim is that GRPO admits an equivalent discriminative reformulation as a weighted positive-negative score difference. Under binary rewards $r_i\in\{0,1\}$, the group-relative objective reduces to a pass-rate-dependent weight applied to the difference between positive-rollout and negative-rollout scores. This reformulation shows that GRPO already increases the scores of verified positive rollouts and decreases the scores of negative rollouts, but does so with rollout scores defined as averages of clipped token-level importance sampling ratios rather than generation likelihoods.

The paper identifies two structural limitations. The first is likelihood-misaligned scoring. GRPO's rollout score for a positive rollout is an average of min-clipped token-level importance ratios,

$s^{+}_{\rm GRPO}(o,q)=\frac1{|o|}\sum_{t=1}^{|o|}\min\bigl(\rho_t,\ 1+\epsilon\bigr), \qquad \rho_t=\frac{\pi_\theta(o_t\mid q,o_{<t})}{\pi_{\theta_{\rm old}}(o_t\mid q,o_{<t})},$

where $\epsilon$ is the clipping range, with an analogous max-clipped form for negatives. These are optimization surrogates rather than the likelihoods used in autoregressive generation. The second limitation is score-insensitive credit assignment. In the empirical group objective, the derivatives with respect to positive and negative rollout scores depend only on the group pass rate, not on within-group score gaps. Every positive receives the same coefficient, and every negative receives the same coefficient.
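The pass-rate weighting can be checked numerically. The sketch below assumes the standard GRPO group normalization (reward minus group mean, divided by the population standard deviation) and treats rollout scores as plain numbers. Under those assumptions the advantage-weighted group objective equals $\sqrt{p(1-p)}$ times the gap between the mean positive score and the mean negative score, where $p$ is the group pass rate. The function names are illustrative and not taken from the paper.

```python
import math


def grpo_group_objective(scores, rewards):
    """(1/G) * sum_i A_i * s_i with A_i = (r_i - mean(r)) / std(r)."""
    group_size = len(scores)
    mean_r = sum(rewards) / group_size
    var_r = sum((r - mean_r) ** 2 for r in rewards) / group_size
    if var_r == 0.0:
        return 0.0  # all-correct or all-wrong group carries no signal
    std_r = math.sqrt(var_r)
    return sum((r - mean_r) / std_r * s
               for r, s in zip(rewards, scores)) / group_size


def contrastive_form(scores, rewards):
    """sqrt(p(1-p)) * (mean positive score - mean negative score)."""
    pos = [s for s, r in zip(scores, rewards) if r == 1]
    neg = [s for s, r in zip(scores, rewards) if r == 0]
    if not pos or not neg:
        return 0.0
    pass_rate = len(pos) / len(scores)
    weight = math.sqrt(pass_rate * (1.0 - pass_rate))
    return weight * (sum(pos) / len(pos) - sum(neg) / len(neg))
```

Because the weight depends only on the pass rate, two positives with very different scores receive the same coefficient in this form, which is the score-insensitivity the paper criticizes.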

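ConSPO addresses both limitations. It replaces GRPO's clipped ratio scores with length-normalized sequence log-probabilities and uses a group-wise InfoNCE-style objective in which each positive rollout is contrasted against negative distractors from the same group. The gradient analysis is central to the method. Write $Z_i=\exp(s_\theta(o_i^+,q)/\tau)+\sum_{j=1}^{N^-}\exp(s_\theta(o_j^-,q)/\tau)$ for the denominator of the $i$-th term of the objective $\widehat{\mathcal J}_{\rm NCE}$ from Section 2. Differentiating that objective with respect to a positive score gives

$\frac{\partial \widehat{\mathcal J}_{\rm NCE}}{\partial s_\theta(o_i^+,q)}=\frac1{N^+}\Bigl(1-\frac{\exp(s_\theta(o_i^+,q)/\tau)}{Z_i}\Bigr),$

so poorly separated positives receive larger upward pressure. For negatives,

$\frac{\partial \widehat{\mathcal J}_{\rm NCE}}{\partial s_\theta(o_j^-,q)}=-\frac1{N^+}\sum_{i=1}^{N^+}\frac{\exp(s_\theta(o_j^-,q)/\tau)}{Z_i},$

and the relative suppressive signal between two negatives $o_j^-$ and $o_k^-$ satisfies

$\frac{\partial \widehat{\mathcal J}_{\rm NCE}/\partial s_\theta(o_j^-,q)}{\partial \widehat{\mathcal J}_{\rm NCE}/\partial s_\theta(o_k^-,q)}=\exp\Bigl(\frac{s_\theta(o_j^-,q)-s_\theta(o_k^-,q)}{\tau}\Bigr).$

High-scoring incorrect rollouts are therefore penalized more strongly than low-scoring incorrect rollouts.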
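The score-sensitive coefficients can be made concrete with a small sketch. It computes the derivative of the InfoNCE-style objective from Section 2 with respect to each rollout score, assuming no margin term and plain float scores. The function name is hypothetical.

```python
import math


def nce_score_gradients(pos_scores, neg_scores, tau=1.0):
    """Gradients of the group-wise InfoNCE objective w.r.t. rollout scores.

    Returns (grad_pos, grad_neg). Each positive gets (1 - p_i) / N+, where
    p_i is its softmax probability against the group's negatives. Each
    negative accumulates minus its softmax probability over all positives.
    """
    n_pos = len(pos_scores)
    grad_pos = []
    grad_neg = [0.0] * len(neg_scores)
    for s_pos in pos_scores:
        logits = [s_pos / tau] + [s / tau for s in neg_scores]
        shift = max(logits)  # max shift keeps exp() numerically stable
        weights = [math.exp(l - shift) for l in logits]
        z = sum(weights)
        probs = [w / z for w in weights]
        grad_pos.append((1.0 - probs[0]) / n_pos)
        for j, p in enumerate(probs[1:]):
            grad_neg[j] -= p / n_pos
    return grad_pos, grad_neg
```

With `pos_scores=[-0.2, -1.0]` and `neg_scores=[-0.4, -1.4]`, the lower-scoring positive receives the larger upward coefficient and the higher-scoring negative receives the larger downward one, in contrast to the uniform coefficients of the GRPO form.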

ConSPO also adds a curriculum-scheduled margin to move from coarse positive-negative ordering in early training toward stronger separation later. Empirically, the reported gains are consistent across backbone models, parameter scales, and datasets. On DeepSeek-R1-Distill-Qwen-1.5B, ConSPO reaches a 44.4 average over seven math reasoning benchmarks, beating GRPO by 4.4 points and the strongest baseline by 1.6 points. On DeepSeek-R1-Distill-Qwen-7B, it achieves a 57.5 average, improving over the best baseline by 2.1 points. On DeepSeek-R1-Distill-Llama-8B, it reaches a 51.8 average, ahead by 1.9 points. The paper also reports best results for Qwen3-4B-Base at 34.0 average and for DAPO-Math-17k at 44.1 average. Ablations show that removing the contrastive objective causes the biggest drop, that replacing likelihood scores with clipped ratio scores hurts performance, and that removing or fixing the margin also reduces results (Zhang et al., 13 May 2026).

4. Contrastive rewards in RLHF

In RLHF, CLR-like methods appear as reward shaping rather than as a redesign of the optimizer. "Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards" introduces a prompt-dependent contrastive reward that subtracts an offline baseline reward from the current reward model score during PPO (Shen et al., 2024).

The method has two stages. In the offline sampling stage, for each prompt $x$ in the RL dataset, the SFT model samples $k$ baseline responses $y_1,\dots,y_k$, forming a baseline dataset of prompt-response pairs. Each baseline response is scored by the reward model to obtain $r^{\text{base}}_{x,y_j}$. In the PPO stage, the reward is replaced by

$r_{x,y}^{\rm RL}:=r_{x,y}-g\left(\{r^{\text{base}}_{x,y_j}\}_{j=1}^{k}\right),$

with $g$ chosen as the mean. The implementation then rescales the contrastive reward and uses the rescaled value as the PPO reward.
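The two stages can be sketched as below. This is an illustration under stated assumptions: `sample_response` stands in for sampling from the SFT model and `score` for the reward model, both passed in as plain callables, and the rescaling step is left out because its exact form is not given here. None of these names come from the paper.

```python
from statistics import mean


def build_baseline_table(prompts, sample_response, score, k):
    """Offline stage: score k sampled baseline responses per prompt.

    sample_response(x) and score(x, y) are hypothetical stand-ins for the
    SFT sampler and the reward model.
    """
    return {x: [score(x, sample_response(x)) for _ in range(k)]
            for x in prompts}


def contrastive_reward(x, y, score, baseline_table, aggregate=mean):
    """PPO stage: current reward minus the aggregated offline baseline."""
    return score(x, y) - aggregate(baseline_table[x])
```

Because the table is built once before PPO starts, the baseline costs no extra sampling during training; only the subtraction happens online.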

Conceptually, the change is from maximizing absolute reward to maximizing improvement over the prompt-specific performance of the initial SFT policy. The paper argues that this baseline subtraction penalizes reward uncertainty, improves robustness, encourages improvement over baselines, calibrates according to task difficulty, and reduces variance in PPO. Under a conditional independence assumption, the authors derive an expression for the expected contrastive reward in terms of quantities that encode reward-model inconsistency and the ideal, perfect reward. According to this analysis, the expected contrastive reward is smaller when the reward model is noisy, and inherently difficult or ambiguous prompts are downweighted.

The experimental setting uses Anthropic/HH-RLHF, OpenAI/Summary, and PKU/Safety Alignment, with Llama 7B as the base model for the main experiments and Mistral-7B-Instruct on MT-Bench and RED-EVAL. Baselines are SFT, PPO with KL regularization, and DPO. Evaluation uses UltraRM-13B, PairRM, GPT-4 pairwise judging, human-assisted evaluation with majority vote, MT-Bench, and RED-EVAL. The paper reports that the contrastive-reward method outperforms SFT, DPO, and PPO across the evaluated settings, reaches an MT-Bench score of 6.90 for Mistral-7B-CR, and achieves the lowest average attack success rate on RED-EVAL. Increasing the number of offline baseline samples $k$ generally improves performance, and the gains are stronger on harder prompts identified by low offline reward (Shen et al., 2024).

5. Summarization and sequence-level control

In abstractive summarization, CLR is explicitly formulated as a contrastive ranking objective over candidate summaries for the same document. "Improving Factuality of Abstractive Summarization via Contrastive Reward Learning" defines the setting as follows: a seq2seq model $g_\theta$ generates a summary $S$ for a document $D$, standard MLE uses the token-level negative log-likelihood of the reference summary $S^*$,

$L_{\text{mle}}=-\sum_{t=1}^{|S^*|}\log p_{g_\theta}(s^*_t\mid D,s^*_{<t}),$

and CLR supplements this with pairwise preferences induced by a reward or quality metric (Chern et al., 2023).

Candidate summaries are generated by diverse beam search from a pretrained summarization model, ranked by a metric, and contrasted so that summaries with higher metric values are assigned higher likelihood. The paper uses BARTScore and DAE as factuality-oriented reward functions and also considers ROUGE as a baseline quality metric. The final objective is

$L_{\text{com}}=L_{\text{mle}}+\gamma L_{\text{ctr}},$

where $\gamma$ weights the contrastive term. The method does not use policy gradients or an explicit reward model; rather, it aligns the model's likelihood ordering with a metric-derived ranking over sampled candidates.

The empirical emphasis is factuality. On CNN/DailyMail, human factuality improves from 0.76 for CRL-COM (R) to 0.99 for both CRL-COM (B) and CRL-COM (D). On XSUM, factuality improves from 0.38 for CRL-COM (R) to 0.51 for CRL-COM (B) and 0.50 for CRL-COM (D). The paper explicitly notes a trade-off with ROUGE, while reporting that coherence and relevance remain competitive. It also states that only two news datasets were studied and suggests future comparison between RL-based reward learning and contrastive reward learning (Chern et al., 2023).

A related but terminologically distinct line appears in controllable text generation. "Click: Controllable Text Generation with Sequence Likelihood Contrastive Learning" trains a model so that, for the same prompt, positive continuations receive higher sequence likelihood than negative continuations by a margin, while ordinary LM loss preserves fluency (Zheng et al., 2023). For a prompt $x$ with a positive continuation $\hat y^{+}$ and a negative continuation $\hat y^{-}$, the core loss is a hinge on the sequence log-likelihood gap with margin $m$,

$L_{\text{CL}}=\max\bigl(0,\ m+\log P_\theta(\hat y^{-}\mid x)-\log P_\theta(\hat y^{+}\mid x)\bigr),$

and the total objective is

$L=L_{\text{LM}}+\alpha L_{\text{CL}},$

where $\alpha$ weights the contrastive term.

The paper emphasizes that there is no separate reward model and no architecture change; the "reward" interpretation is implicit in the likelihood margin. A major contribution is the likelihood ranking-based pairing strategy, which matches each negative sample with the highest-likelihood positive sample whose likelihood is still lower than the negative's. Across detoxification, sentiment steering, and repetition reduction, Click reports strong performance, including toxicity probability 0.084 on BAD, sentiment control gains such as 85.78% positive continuations on negative prompts in positive steering, and results on the Rep-2, Rep-3, Div, and MAUVE metrics for repetition reduction on WikiText-103 (Zheng et al., 2023).
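The pairing rule described above can be sketched as follows, assuming sequence log-likelihoods for the positive and negative continuations of one prompt are already computed. How negatives with no lower-likelihood positive are handled is not stated here, so this sketch simply leaves them unpaired. That choice, and the function name, are assumptions.

```python
def pair_by_likelihood_ranking(pos_loglik, neg_loglik):
    """Pair each negative with the closest positive ranked below it.

    For every negative, pick the highest-likelihood positive whose
    likelihood is still lower than the negative's. Returns a list of
    (negative_index, positive_index) tuples. Negatives without such a
    positive are skipped (an assumption of this sketch).
    """
    pairs = []
    for n_idx, n_ll in enumerate(neg_loglik):
        below = [(p_ll, p_idx)
                 for p_idx, p_ll in enumerate(pos_loglik)
                 if p_ll < n_ll]
        if below:
            _, best_pos = max(below)
            pairs.append((n_idx, best_pos))
    return pairs
```

Each selected pair already violates the desired ordering, since the negative is more likely than the positive, so the margin loss on it is active. Choosing the closest such positive targets the smallest gap that still needs correcting.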

6. Retrieval-augmented generation and multiagent exploration

In retrieval-augmented generation, CLR is an explicit RL reward for context faithfulness. "CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models" argues that external rewards such as correctness reward and citation reward do not measure faithfulness well, can misjudge similar answers, and provide only sparse end-of-sequence feedback, while purely internal rewards can cause hallucination accumulation or model collapse without objective evidence-based feedback (Tan et al., 2 Feb 2026).

The proposed solution is an internal-external hybrid reward centered on CLR. The response likelihood under full retrieved context is contrasted with a leave-one-out score that removes the supporting document whose removal hurts the likelihood most. The resulting evidential contribution $\zeta(y)$ is thresholded and length-normalized: $R_{\mathrm{CLR}}(y)=\frac{\zeta(y)\cdot \mathbb{I}(\zeta(y)>\tau)}{\sqrt{T}}.$ When combined with correctness, the paper uses batch-normalized CLR and multiplicative gating rather than additive blending. The RL optimizer is GRPO, but the appendix states that the KL divergence term is omitted in practice because it conflicts with CLR and caused training collapse in their experiments.

The training pipeline consists of supervised fine-tuning followed by RL on grouped rollouts. For SFT, the paper uses Qwen3-8B-Base and Qwen3-30B-A3B-Base with HotpotQA and MuSiQue data, selected by pass@8 criteria, for 74,109 samples total. RL uses about 10,000 additional samples, and for CLR training the paper keeps examples where the standard deviation of log-probabilities exceeds 10. Evaluation is reported on RagQALeaderboard and PRGB, using 30 retrieved documents for RagQALeaderboard and 10 for PRGB. On Qwen3-8B, the reported averages are SFT 72.0, three other training configurations at 77.0, 77.2, and 79.0, CLR 79.7, and Hybrid 80.4. On PRGB, the corresponding scores are SFT 43.0, the same three configurations at 58.2, 63.2, and 74.8, CLR 75.6, and Hybrid 74.8. The paper also reports a reference-reliance measure, with 0.559 for the final model versus 0.429 for Qwen3-30B-A3B-Instruct-2507 (Tan et al., 2 Feb 2026).

A conceptually related but distinct development appears in multiagent exploration. "Counterfactual Conditional Likelihood Rewards for Multiagent Exploration" does not define CLR by name; it introduces Counterfactual Conditional Likelihood rewards and explicitly frames the relationship as conceptual rather than terminological (Aydeniz et al., 12 Feb 2026). CCL compares the conditional likelihood of an agent's actual observation against that of a counterfactual observation, both given teammates' observations. The method uses a fixed random encoder, k-nearest-neighbor density estimation, a Softplus transform, and a clamp cap of 5.0. The intended effect is to isolate each agent's unique contribution to coordinated exploration, reduce redundancy, and target coordinated regions of the state space rather than uniform local coverage. Experiments use MAPPO under centralized training with decentralized execution in multi-rover exploration and particle environments, and the reported result is that CCL consistently outperforms local entropy maximization, especially in tightly coordinated sparse-reward tasks (Aydeniz et al., 12 Feb 2026).

7. Distinctions, limitations, and common misconceptions

A common misconception is that CLR denotes one canonical procedure. The literature instead shows multiple non-equivalent implementations. In summarization, CLR is a supervised contrastive ranking loss over metric-ranked candidates. In RAG, it is an RL reward based on leave-one-out evidential contribution. In RLVR, the CLR interpretation refers to ConSPO’s contrastive likelihood-aligned policy optimization. In RLHF, the closest analogue is a reward baseline subtraction term rather than a likelihood gap defined directly on model outputs (Chern et al., 2023, Tan et al., 2 Feb 2026, Zhang et al., 13 May 2026, Shen et al., 2024).

A second misconception is that CLR is always reinforcement learning. That is not the case. The summarization formulation explicitly notes that, although it is called reward learning, the optimization is not policy gradient or expected-reward RL; it is a supervised contrastive pairwise ranking loss. Click is even further from explicit reward modeling: it uses a likelihood-margin objective with no learned reward model and no expected-reward objective (Chern et al., 2023, Zheng et al., 2023).

A third distinction concerns what counts as the “negative” side of the contrast. Depending on the paper, the negative may be a lower-ranked candidate summary, a negative continuation, a rollout distractor from the same group, an offline baseline reward from the SFT policy, the same response with critical evidence removed, or an agent’s own previous observation used as a counterfactual. This variation matters because it determines what behavior the contrast actually regularizes (Zhang et al., 13 May 2026, Shen et al., 2024, Tan et al., 2 Feb 2026, Aydeniz et al., 12 Feb 2026).

The limitations are likewise domain-specific. In summarization, factuality gains can come at the cost of lower ROUGE, and only two news datasets were studied. In controllable generation, the method depends on label functions or classifiers that can be biased, and stronger control can slightly harm fluency or perplexity. In RAG, CLR adds computational overhead because it requires token-level likelihoods and leave-one-out scoring, and it can over-prioritize contextual faithfulness when retrieval is wrong, producing faithfully wrong responses or discouraging correct answers that contradict bad retrieval. In RLVR, the critique of GRPO centers on likelihood misalignment and score-insensitive credit assignment, which ConSPO is designed to repair through sequence log-probabilities and relative within-group contrast (Chern et al., 2023, Zheng et al., 2023, Tan et al., 2 Feb 2026, Zhang et al., 13 May 2026).

A plausible implication is that CLR-style methods are most coherent when the optimized contrast is aligned with the model behavior that matters at inference time. This is explicit in ConSPO’s replacement of clipped-ratio surrogate scores with generation likelihoods and in CTRL-RAG’s use of evidence-conditioned likelihood gaps as a faithfulness signal. Across domains, the unifying thesis is not merely “reward the good output,” but “optimize the model so that the good output is more likely than its relevant alternative under the scoring regime that matters for the task” (Zhang et al., 13 May 2026, Tan et al., 2 Feb 2026).