Response-Conditioned Bradley–Terry (Rc-BT)
- Rc-BT is a model that extends Bradley–Terry by explicitly incorporating response length, separating semantic quality from inherent length bias.
- It leverages an augmented dataset with explicit length constraints to penalize non-compliant responses and reward adherence within RLHF frameworks.
- Experimental evaluations of Rc-RM and Rc-DPO demonstrate significant gains in both semantic and length accuracy compared to traditional baselines.
The Response-conditioned Bradley–Terry (Rc-BT) model extends the classical Bradley–Terry framework for pairwise preference learning by explicitly disentangling semantic quality from response length in reward modeling. Rc-BT leverages an augmented comparison set that forces models not only to learn from human semantic preferences but also to follow or penalize explicit response-length instructions. It provides a principled mechanism to mitigate length bias in reward modeling for reinforcement learning from human feedback (RLHF) and to improve adherence to externally specified length constraints in LLMs (Cai et al., 2 Feb 2025).
1. Mathematical Formulation
The standard Bradley–Terry (BT) model encodes the preference probability for a triplet $(x, y_w, y_l)$ as

$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big),$$

where $r$ is the (possibly unknown) reward function. A parametric reward model $r_\phi$ is optimized via

$$\mathcal{L}_{\mathrm{BT}}(r_\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big],$$

with $\sigma(z) = 1/(1 + e^{-z})$.
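The per-example BT negative log-likelihood can be written down directly. A minimal sketch in plain Python, with scalar rewards standing in for the parametric model $r_\phi$ (function names are illustrative, not from the paper):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def bt_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of one (x, y_w, y_l) triplet under the BT model:
    -log sigma(r(x, y_w) - r(x, y_l))."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))
```

The loss is minimized by widening the reward margin between the chosen and rejected response; equal rewards give the chance-level loss $\log 2$.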
Rc-BT introduces, for each triplet $(x, y_w, y_l)$, two new “response-conditioned” comparisons:
- $(x, y_w) \succ (x_{l1}, y_w)$, where $x_{l1}$ is $x$ augmented with a length constraint that $y_w$ violates.
- $(x_{l2}, y_l) \succ (x, y_l)$, where $x_{l2}$ is $x$ augmented with a length constraint that $y_l$ satisfies.

For any fixed response $y$, the preference between two prompts is modeled as

$$P\big((x_1, y) \succ (x_2, y)\big) = \sigma\big(r(x_1, y) - r(x_2, y)\big).$$

These comparisons populate the augmented set $\mathcal{D}_{rc}$, yielding the Rc-BT loss

$$\mathcal{L}_{\mathrm{Rc\text{-}BT}}(r_\phi) = -\mathbb{E}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x_{l1}, y_w)\big)\big] - \alpha\,\mathbb{E}\big[\log \sigma\big(r_\phi(x_{l2}, y_l) - r_\phi(x, y_l)\big)\big],$$

where $\alpha$ (typically $1$) balances the two loss terms.
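The two response-conditioned terms can be sketched per augmented example. This is a minimal illustration assuming a reward callable `r(prompt, response)`; the function and argument names are ours:

```python
import math

def rc_bt_loss(r, x, x_l1, x_l2, y_w, y_l, alpha=1.0):
    """Rc-BT loss for one augmented example.
    r(prompt, response) -> scalar reward. x_l1 is x plus a length constraint
    that y_w violates; x_l2 is x plus a constraint that y_l satisfies."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    # Chosen side: (x, y_w) should score above (x_l1, y_w).
    chosen_term = -math.log(sig(r(x, y_w) - r(x_l1, y_w)))
    # Rejected side: (x_l2, y_l) should score above (x, y_l).
    rejected_term = -math.log(sig(r(x_l2, y_l) - r(x, y_l)))
    return chosen_term + alpha * rejected_term
```

Setting `alpha=0` recovers training on chosen-side comparisons only, which the ablations in Section 6 show is insufficient on its own.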
2. Integration of Response Length in Preference Modeling
In the standard BT formulation, semantic quality, style, length, and other confounding factors are implicitly conflated in the reward model via observed preferences. Length bias arises when longer or shorter responses systematically affect reward, even when length was not explicitly specified.
Rc-BT introduces explicit response-conditioned comparisons, making length a modeled variable: the reward model must penalize responses that do not adhere to newly imposed length constraints, and reward those that satisfy them, regardless of semantic content. This forces $r_\phi$ to encode an actual ability to follow length constraints, instead of correlating with length incidentally.
A plausible implication is that Rc-BT yields a reward model that more robustly generalizes across prompts and constraints, mitigating spurious exploitation of superficial attributes like response length.
3. Augmented Dataset Construction
Construction of the response-conditioned comparison set $\mathcal{D}_{rc}$ proceeds deterministically from any human-annotated preference set $\mathcal{D}$ by:
- For each $(x, y_w)$ (“chosen side”), sampling a numeric length constraint $\text{word\_num}$ such that $|y_w| > \text{word\_num}$ (violated). Create $x_{l1} =$ `"Answer the following instruction using {word_num} words or less. "` $+\ x$ and record the comparison $(x, y_w) \succ (x_{l1}, y_w)$.
- For each $(x, y_l)$ (“rejected side”), sampling $\text{word\_num}$ such that $|y_l| < \text{word\_num}$ (satisfied). Create $x_{l2} =$ `"Answer the following instruction using {word_num} words or less. "` $+\ x$ and record $(x_{l2}, y_l) \succ (x, y_l)$.

Additional “or more” constraints are mixed in similarly. No further human annotation is required. This dataset enables learning fine-grained compliance in both length-constrained and unconstrained settings.
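The augmentation above can be sketched as follows. The prompt template is taken from the construction described here; the whitespace-based word count and the sampling bounds are our illustrative assumptions, not the paper's exact procedure:

```python
import random

TEMPLATE = "Answer the following instruction using {word_num} words or less. "

def augment(example, rng=None):
    """Build the two response-conditioned comparisons from one
    (prompt, chosen, rejected) preference example."""
    rng = rng or random.Random(0)
    x, y_w, y_l = example["prompt"], example["chosen"], example["rejected"]
    # Chosen side: sample a constraint that y_w violates (|y_w| > word_num).
    n_w = len(y_w.split())
    x_l1 = TEMPLATE.format(word_num=rng.randint(1, max(1, n_w - 1))) + x
    # Rejected side: sample a constraint that y_l satisfies (|y_l| < word_num).
    n_l = len(y_l.split())
    x_l2 = TEMPLATE.format(word_num=rng.randint(n_l + 1, n_l + 50)) + x
    return [
        {"chosen": (x, y_w), "rejected": (x_l1, y_w)},    # (x, y_w) > (x_l1, y_w)
        {"chosen": (x_l2, y_l), "rejected": (x, y_l)},    # (x_l2, y_l) > (x, y_l)
    ]
```

Each original triplet thus yields two extra pairwise comparisons at no annotation cost, since the labels follow mechanically from the sampled constraint.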
4. Training Objectives: Rc-RM and Rc-DPO
Reward Model (Rc-RM):
The Rc-BT reward model $r_\phi$ is initialized from the SFT-fine-tuned LLM and trained on $\mathcal{D}_{rc}$ with Adam for $5$ epochs, minimizing the Rc-BT loss above. This produces a model fitting both pairwise semantic preferences and explicit length instructions.
Direct Preference Optimization (Rc-DPO):
Rc-DPO extends the DPO objective. Given a reference policy $\pi_{\mathrm{ref}}$ and parametric policy $\pi_\theta$, the reward is approximated implicitly as

$$r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},$$

and the Rc-DPO loss is obtained by substituting $r_\theta$ into $\mathcal{L}_{\mathrm{Rc\text{-}BT}}$. $\pi_\theta$ is updated via gradient descent using both $\mathcal{D}$ and $\mathcal{D}_{rc}$.
Rc-RM is a reward model that can be plugged into standard RLHF pipelines (e.g., PPO); Rc-DPO enables direct policy learning without reinforcement learning, achieving both semantic quality and strict length following (Cai et al., 2 Feb 2025).
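Combining the DPO implicit reward with the Rc-BT comparisons gives the per-example Rc-DPO loss. A minimal sketch, assuming a log-probability callable `lp(model, prompt, response)` (a placeholder we introduce here for summed response log-probabilities under the policy or reference model):

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO implicit reward: r_theta(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)),
    computed from summed log-probabilities of y under each model."""
    return beta * (logp_policy - logp_ref)

def rc_dpo_loss(lp, x, x_l1, x_l2, y_w, y_l, beta=0.1, alpha=1.0):
    """Rc-DPO loss for one augmented example.
    lp("policy" | "ref", prompt, response) -> summed log-prob (assumed interface)."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    r = lambda p, y: implicit_reward(lp("policy", p, y), lp("ref", p, y), beta)
    # Same two comparison terms as Rc-BT, with the implicit reward substituted in.
    chosen_term = -math.log(sig(r(x, y_w) - r(x_l1, y_w)))
    rejected_term = -math.log(sig(r(x_l2, y_l) - r(x, y_l)))
    return chosen_term + alpha * rejected_term
```

As in standard DPO, no explicit reward model is trained; the policy's own likelihood ratios against the frozen reference play that role.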
5. Experimental Evaluation and Metrics
Rc-BT was evaluated on OpenAssistant (with training and held-out evaluation splits) and an out-of-distribution split of HH-RLHF. Test splits included a length-balanced set for semantic accuracy and a set in which exactly one response is strictly length-compliant.
Key metrics:
- Quality Eval Acc (accuracy on the semantic test set)
- Length Eval Acc (accuracy on the length-only test set)
- Quality Win Ratio and Length Win Ratio in DPO evaluation (measured using GPT-4o on AlpacaEval and AlpacaEval-LI-plus-less/more)
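Both accuracy metrics reduce to the same pairwise check: does the reward model score the preferred (prompt, response) pair above the dispreferred one? A minimal sketch (the evaluation-set schema here is our assumption):

```python
def pairwise_accuracy(reward_fn, eval_set):
    """Fraction of evaluation pairs where the reward model ranks the
    preferred (prompt, response) above the dispreferred one, as in
    Quality Eval Acc / Length Eval Acc."""
    hits = sum(
        reward_fn(*ex["chosen"]) > reward_fn(*ex["rejected"])
        for ex in eval_set
    )
    return hits / len(eval_set)
```

On the length-only split, a length-biased reward model scores well or poorly depending on whether the compliant response happens to be longer, which is exactly what this metric is designed to expose.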
Summary of Results
| Metric | Rc-RM | Rc-DPO | Baselines / Others |
|---|---|---|---|
| Quality Acc | +10–17 pts vs. base | 45–64% Win Ratio | 30–50% (baseline/LIFT-plus) |
| Length Acc | 84–92% | 82–100% (“or less”) | 0% (LIFT-plus) |
| Out-of-dist | +5–6% Qual. Acc gain | High Length Acc | — |
Rc-RM flattens the reward-score-vs-length slope relative to baselines and matches the theoretically predicted reward differences for length constraints. DPO evaluation further demonstrates that Rc-DPO retains both semantic quality and high instruction adherence, outperforming LIFT-plus, ODIN, and R-DPO (Cai et al., 2 Feb 2025).
6. Theoretical Insights and Ablation Studies
Ablation studies show that training with only “chosen-side” or only “rejected-side” response-conditioned comparisons collapses performance, with Quality Acc reverting to baseline and Length Acc collapsing. The full disentanglement effect emerges only when both types of comparisons are present. Varying the mix ratio confirms a performance sweet spot when both sides are balanced. Rc-RM maintains its performance gains in out-of-distribution tests.
This suggests that length-sensitivity in preference modeling cannot be robustly induced by only penalizing infractions or rewarding adherence; both are required for the model to truly separate superficial length from underlying semantic quality.
7. Significance and Implications
Rc-BT and its derivatives Rc-RM and Rc-DPO systematically transform implicit sources of bias (e.g., length bias) into explicit length-conditional discrimination tasks. This enables reward models and policies to outperform standard methods in both averting undesirable bias and enforcing user-specified constraints. Adoption of this approach is substantiated by significant gains in both in-distribution and out-of-distribution evaluations, demonstrating robustness and generalizability across multiple model backbones and evaluation datasets (Cai et al., 2 Feb 2025).