
Response-Conditioned Bradley–Terry (Rc-BT)

Updated 12 February 2026
  • Rc-BT is a model that extends Bradley–Terry by explicitly incorporating response length, separating semantic quality from inherent length bias.
  • It leverages an augmented dataset with explicit length constraints to penalize non-compliant responses and reward adherence within RLHF frameworks.
  • Experimental evaluations of Rc-RM and Rc-DPO demonstrate significant gains in both semantic and length accuracy compared to traditional baselines.

The Response-conditioned Bradley–Terry (Rc-BT) model extends the classical Bradley–Terry framework for pairwise preference learning by explicitly disentangling semantic quality from response length in reward modeling. Rc-BT leverages an augmented comparison set that trains models not only on human semantic preferences but also on rewarding compliance with, and penalizing violations of, explicit response-length instructions. It provides a principled mechanism for mitigating length bias in reward modeling for reinforcement learning from human feedback (RLHF) and for improving adherence to externally specified length constraints in LLMs (Cai et al., 2 Feb 2025).

1. Mathematical Formulation

The standard Bradley–Terry (BT) model encodes the preference probability for a triplet $(x, y_w, y_l)$ as

$$p^*(y_w \succ y_l \mid x) = \frac{\exp(r^*(x, y_w))}{\exp(r^*(x, y_w)) + \exp(r^*(x, y_l))}$$

where $r^*(x, y)$ is the (possibly unknown) reward function. A parametric reward model $r_\phi(x, y)$ is optimized via

$$\mathcal{L}_{BT}(\phi; D_{rm}) = -\mathbb{E}_{(x, y_w, y_l)\sim D_{rm}} \left[ \log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right) \right]$$

with $\sigma(u) = 1/(1 + \exp(-u))$.
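The BT loss above reduces to a logistic loss on reward margins. A minimal PyTorch sketch (assuming the reward model's scalar scores for each response are already computed):

```python
import torch
import torch.nn.functional as F

def bt_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry loss: -E[log sigma(r(x, y_w) - r(x, y_l))].

    r_w, r_l: reward-model scores for the chosen and rejected responses,
    shape (batch,). logsigmoid is used for numerical stability.
    """
    return -F.logsigmoid(r_w - r_l).mean()
```

When both responses score equally the loss is $\log 2 \approx 0.693$; it decreases as the chosen response's margin over the rejected one grows.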

Rc-BT introduces, for each triplet $(x, y_w, y_l)$, two new "response-conditioned" comparisons:

  • $(x \succ x_l^1 \mid y_w)$, where $x_l^1$ is $x$ augmented with a length constraint that $y_w$ violates.
  • $(x_l^2 \succ x \mid y_l)$, where $x_l^2$ is $x$ augmented with a length constraint that $y_l$ satisfies.

For any fixed $y$,

$$p^*(x \succ x_l^1 \mid y) = \frac{\exp(r^*(x, y))}{\exp(r^*(x, y)) + \exp(r^*(x_l^1, y))}$$

$$p^*(x_l^2 \succ x \mid y) = \frac{\exp(r^*(x_l^2, y))}{\exp(r^*(x_l^2, y)) + \exp(r^*(x, y))}$$

These comparisons populate the set $D_{Rc}$, yielding the Rc-BT loss

$$\mathcal{L}_{Rc}(\phi) = -\mathbb{E}_{(x, x_l^1, y_w)\sim D_{Rc}} \left[ \log \sigma\left(r_\phi(x, y_w) - r_\phi(x_l^1, y_w)\right) \right] - \lambda\,\mathbb{E}_{(x_l^2, x, y_l)\sim D_{Rc}} \left[ \log \sigma\left(r_\phi(x_l^2, y_l) - r_\phi(x, y_l)\right) \right]$$

where $\lambda$ (typically $1$) balances the two terms.
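The Rc-BT loss can be sketched the same way, as two BT-style terms over the response-conditioned scores (a minimal sketch; score computation for each prompt/response pairing is assumed done elsewhere):

```python
import torch
import torch.nn.functional as F

def rc_bt_loss(r_x_yw: torch.Tensor,
               r_xl1_yw: torch.Tensor,
               r_xl2_yl: torch.Tensor,
               r_x_yl: torch.Tensor,
               lam: float = 1.0) -> torch.Tensor:
    """Response-conditioned BT loss over D_Rc.

    Chosen side:   the original prompt x should beat xl1 (a constraint
                   y_w violates), conditioned on y_w.
    Rejected side: xl2 (a constraint y_l satisfies) should beat x,
                   conditioned on y_l. lam weights the second term.
    """
    chosen = -F.logsigmoid(r_x_yw - r_xl1_yw).mean()
    rejected = -F.logsigmoid(r_xl2_yl - r_x_yl).mean()
    return chosen + lam * rejected
```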

2. Integration of Response Length in Preference Modeling

In the standard BT formulation, semantic quality, style, length, and other confounding factors are implicitly conflated in the reward model via observed preferences. Length bias arises when longer or shorter responses systematically affect reward, even when length was not explicitly specified.

Rc-BT introduces explicit response-conditioned comparisons, making length an explicitly modeled variable: the model must penalize responses that violate newly imposed length constraints and reward those that satisfy them, regardless of semantic content. This forces $r_\phi(x, y)$ to encode the actual ability to follow length constraints, rather than correlating with length incidentally.

A plausible implication is that Rc-BT yields a reward model that more robustly generalizes across prompts and constraints, mitigating spurious exploitation of superficial attributes like response length.

3. Augmented Dataset Construction

Construction of the response-conditioned comparison set $D_{Rc}$ proceeds deterministically from any human-annotated preference set $D_{rm} = \{(x, y_w, y_l)\}$ by:

  • For each $(x, y_w)$ ("chosen-side"), sampling a numeric length constraint $\text{word\_num}$ such that $|y_w| > \text{word\_num}$ (violated). Form $x_l^1 = $ "Answer the following instruction using {word_num} words or less. " $+\, x$ and record $(x, x_l^1, y_w)$.
  • For each $(x, y_l)$ ("rejected-side"), sampling $\text{word\_num}$ such that $|y_l| < \text{word\_num}$ (satisfied). Form $x_l^2$ with the same template and record $(x_l^2, x, y_l)$.

Additional “or more” constraints are mixed in similarly. No further human annotation is required. This dataset enables learning fine-grained compliance with both length-constrained and unconstrained settings.
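The construction above can be sketched in a few lines. This is a hedged illustration: whitespace tokenization for word counts and the sampling range for the satisfied constraint (`len_l + 1` to `len_l + 50`) are assumptions, not the paper's exact recipe:

```python
import random

TEMPLATE = "Answer the following instruction using {n} words or less. "

def make_rc_pairs(x: str, y_w: str, y_l: str,
                  rng: random.Random = random.Random(0)):
    """Build the two response-conditioned comparisons for one triplet.

    Returns ((x, xl1, y_w), (xl2, x, y_l)), where xl1 carries a
    constraint y_w violates and xl2 a constraint y_l satisfies.
    """
    len_w, len_l = len(y_w.split()), len(y_l.split())
    # Constraint the chosen response violates: word_num < |y_w|.
    n1 = rng.randint(1, max(1, len_w - 1))
    xl1 = TEMPLATE.format(n=n1) + x
    # Constraint the rejected response satisfies: word_num > |y_l|.
    n2 = rng.randint(len_l + 1, len_l + 50)
    xl2 = TEMPLATE.format(n=n2) + x
    return (x, xl1, y_w), (xl2, x, y_l)
```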

4. Training Objectives: Rc-RM and Rc-DPO

Reward Model (Rc-RM):

The Rc-BT reward model is initialized from the SFT-fine-tuned LLM and trained on $D_{rm} \cup D_{Rc}$ using Adam (lr $= 10^{-5}$, batch size $64$, $5$ epochs), minimizing

$$\mathcal{L}(\phi) = \mathcal{L}_{BT}(\phi; D_{rm}) + \mathcal{L}_{Rc}(\phi; D_{Rc})$$

This produces an $r_\phi(x, y)$ that fits both pairwise semantic preferences and explicit length instructions.
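A minimal training-step sketch of the combined objective, assuming precomputed pooled features per (prompt, response) pair; `TinyRewardModel` is a hypothetical stand-in for the SFT-initialized LLM's reward head:

```python
import torch
import torch.nn.functional as F

class TinyRewardModel(torch.nn.Module):
    """Stand-in for the SFT-initialized LLM with a scalar reward head."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, dim) pooled features -> (batch,) scalar scores
        return self.head(feats).squeeze(-1)

def train_step(model, opt, rm_batch, rc_batch, lam: float = 1.0) -> float:
    """One optimizer step on L(phi) = L_BT(D_rm) + L_Rc(D_Rc)."""
    f_w, f_l = rm_batch                            # features for (x, y_w), (x, y_l)
    f_x_yw, f_xl1_yw, f_xl2_yl, f_x_yl = rc_batch  # response-conditioned features
    l_bt = -F.logsigmoid(model(f_w) - model(f_l)).mean()
    l_rc = (-F.logsigmoid(model(f_x_yw) - model(f_xl1_yw)).mean()
            - lam * F.logsigmoid(model(f_xl2_yl) - model(f_x_yl)).mean())
    loss = l_bt + l_rc
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```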

Direct Preference Optimization (Rc-DPO):

Rc-DPO extends the DPO objective. Given reference policy $\pi_{ref}$ and parametric policy $\pi_\theta$, the reward is reparameterized as $r_\phi(x, y) \approx \beta \log\left[\pi_\theta(y \mid x)/\pi_{ref}(y \mid x)\right] + \text{const}$, with

$$\mathcal{L}^{Rc}_{DPO}(\theta) = -\mathbb{E}_{(x, x_l^1, y_w)\sim D_{Rc}} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_w \mid x_l^1)}{\pi_{ref}(y_w \mid x_l^1)}\right)\right] - \mathbb{E}_{(x_l^2, x, y_l)\sim D_{Rc}} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_l \mid x_l^2)}{\pi_{ref}(y_l \mid x_l^2)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right]$$

$\pi_\theta$ is updated via gradient descent (lr $= 10^{-6}$) using both $D_{rm}$ and $D_{Rc}$.
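Given per-sequence log-probabilities $\log \pi(y \mid x)$ for the four prompt/response pairings, the Rc-DPO loss reduces to logistic losses on implicit-reward margins. A minimal sketch (the dict keys are illustrative names, not the paper's notation):

```python
import torch
import torch.nn.functional as F

def rc_dpo_loss(logp_theta: dict, logp_ref: dict,
                beta: float = 0.1) -> torch.Tensor:
    """Rc-DPO loss from per-sequence log-probabilities log pi(y | x).

    Both dicts hold (batch,) tensors keyed by the four pairings:
    'x_yw', 'xl1_yw' (chosen-side) and 'xl2_yl', 'x_yl' (rejected-side).
    """
    def margin(a: str, b: str) -> torch.Tensor:
        # beta * [(log pi_theta - log pi_ref)(a) - (log pi_theta - log pi_ref)(b)]
        return beta * ((logp_theta[a] - logp_ref[a])
                       - (logp_theta[b] - logp_ref[b]))

    chosen = -F.logsigmoid(margin("x_yw", "xl1_yw")).mean()
    rejected = -F.logsigmoid(margin("xl2_yl", "x_yl")).mean()
    return chosen + rejected
```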

Rc-RM is a reward model that can be plugged into standard RLHF pipelines (e.g., PPO); Rc-DPO enables direct policy learning without reinforcement learning, achieving both semantic quality and strict length adherence (Cai et al., 2 Feb 2025).

5. Experimental Evaluation and Metrics

Rc-BT was evaluated on OpenAssistant (with splits $D_{sft}$, $D_{rm}$, $D_{eval}$) and on an out-of-distribution dataset, HH-RLHF. Test splits for evaluation included $D_{eval}^q$ (length-balanced, for semantic accuracy) and $D_{eval}^l$ (exactly one response strictly length-compliant).

Key metrics:

  • Quality Eval Acc (accuracy on the semantic test set $D_{eval}^q$)
  • Length Eval Acc (accuracy on the length-only test set $D_{eval}^l$)
  • Quality Win Ratio and Length Win Ratio in DPO evaluation (measured using GPT-4o on AlpacaEval and AlpacaEval-LI-plus-less/more)

Summary of Results

| Setting     | Rc-RM                 | Rc-DPO                 | Baselines / Others          |
|-------------|-----------------------|------------------------|-----------------------------|
| Quality Acc | +10–17 pts vs. base   | 45–64% Win Ratio       | 30–50% (baseline/LIFT-plus) |
| Length Acc  | 84–92%                | 82–100% ("or less")    | 0% (LIFT-plus)              |
| Out-of-dist | +5–6% Qual. Acc gain  | High Length Acc        |                             |

Rc-RM flattens the reward-score-versus-length slope relative to baselines and matches the theoretically predicted reward differences for length constraints. DPO evaluation further demonstrates that Rc-DPO retains both semantic quality and high instruction adherence, outperforming LIFT-plus, ODIN, and R-DPO (Cai et al., 2 Feb 2025).

6. Theoretical Insights and Ablation Studies

Ablation studies show that training with only "chosen-side" or only "rejected-side" response-conditioned comparisons collapses performance, with Quality Acc reverting to baseline and Length Acc to approximately $50\%$. The full disentanglement effect only emerges when both types of comparisons are present. Varying the mix ratio confirms a performance sweet spot when both sides are balanced. Rc-RM maintains performance gains in out-of-distribution tests.

This suggests that length-sensitivity in preference modeling cannot be robustly induced by only penalizing infractions or rewarding adherence; both are required for the model to truly separate superficial length from underlying semantic quality.

7. Significance and Implications

Rc-BT and its derivatives Rc-RM and Rc-DPO systematically transform implicit sources of bias (e.g., length bias) into explicit length-conditional discrimination tasks. This enables reward models and policies to outperform standard methods in both averting undesirable bias and enforcing user-specified constraints. Adoption of this approach is substantiated by significant gains in both in-distribution and out-of-distribution evaluations, demonstrating robustness and generalizability across multiple model backbones and evaluation datasets (Cai et al., 2 Feb 2025).
