Response-Conditioned Bradley-Terry Model

Updated 2 June 2026

The paper introduces the Rc-BT model, extending the classical Bradley-Terry framework by conditioning on responses and mitigating confounding factors like length bias.
It details a rigorous model formulation, analyzing sample complexity via score margin and connectivity, and offering exact parameterization under specific conditions.
The model leverages ordinal feedback and response-conditioned DPO objectives, demonstrating improved reward modeling accuracy and constraint adherence in empirical benchmarks.

The Response-Conditioned Bradley–Terry (Rc-BT) model is a generalization of the classical Bradley–Terry framework, developed to model, recover, and interpret human (or annotator) preferences in settings characterized by context-response comparisons. It represents the probability of one candidate being preferred over another, conditioned on the response and often additional structural constraints (e.g. length constraints or ordinal labels), thereby addressing key limitations in pairwise preference learning such as model misspecification, confounding factors, and coarse feedback representation (Pukdee et al., 10 Feb 2026, Cai et al., 2 Feb 2025, Liu et al., 2024).

1. Foundations and Model Definition

The Rc-BT model operates on triplets $(x, y^+, y^-)$, where $x$ denotes the context (e.g., a prompt) and $y^+$, $y^-$ are candidate responses with $y^+$ preferred over $y^-$ under $x$. The joint data distribution $P(x, y^+, y^-)$ induces a conditional preference distribution (CPRD) defined as

$\omega_P(y \succ y' \mid x) = P(y \succ y' \mid x, \{y, y'\}) = \frac{P(x, y, y')}{P(x, y, y') + P(x, y', y)}.$
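As a small illustration of the CPRD definition, the sketch below estimates $\omega_P(y \succ y' \mid x)$ from counts of ordered triplets, using empirical counts in place of $P$. The function name `cprd` and the toy counts are hypothetical and not taken from the cited papers.

```python
from collections import Counter

def cprd(counts, x, y, y_prime):
    """Estimate omega(y > y' | x) from counts of ordered triplets (x, winner, loser)."""
    wins = counts[(x, y, y_prime)]      # times y was preferred over y' under x
    losses = counts[(x, y_prime, y)]    # times y' was preferred over y under x
    total = wins + losses
    if total == 0:
        raise ValueError("this pair was never compared under this context")
    return wins / total

# Toy data (hypothetical): under prompt "q", "a" beat "b" three times and lost once.
counts = Counter({("q", "a", "b"): 3, ("q", "b", "a"): 1})
p_ab = cprd(counts, "q", "a", "b")
p_ba = cprd(counts, "q", "b", "a")
```

The two directions of a pair always sum to one, which is what makes the CPRD a distribution over the two possible orderings.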

Rc-BT posits a real-valued score function $r : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ such that

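$\omega_P(y \succ y' \mid x) = \sigma\big(r(x, y) - r(x, y')\big) = \frac{\exp r(x, y)}{\exp r(x, y) + \exp r(x, y')},$

where $\sigma$ is the logistic function. Rc-BT is typically fit by minimizing the discriminative negative log-likelihood: $\min_{r} \; -\mathbb{E}_{(x, y^+, y^-) \sim P}\big[\log \sigma\big(r(x, y^+) - r(x, y^-)\big)\big]$ (Pukdee et al., 10 Feb 2026).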
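A minimal sketch of this fitting objective, assuming the standard Bradley–Terry logistic link $\sigma(r(x, y) - r(x, y'))$. The score table and the helper names (`bt_prob`, `bt_nll`) are hypothetical; a real reward model would replace the lookup with a learned function.

```python
import math

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def bt_prob(r, x, y, y_prime):
    """Bradley-Terry probability that y is preferred over y' under context x."""
    return sigmoid(r(x, y) - r(x, y_prime))

def bt_nll(r, triplets):
    """Average negative log-likelihood over triplets (x, y_plus, y_minus)."""
    return -sum(math.log(bt_prob(r, x, yp, ym)) for x, yp, ym in triplets) / len(triplets)

# Toy score table (hypothetical values).
scores = {("q", "a"): 1.0, ("q", "b"): 0.0}
r = lambda x, y: scores[(x, y)]

# "a" is preferred three times, "b" once.
data = [("q", "a", "b")] * 3 + [("q", "b", "a")]
loss = bt_nll(r, data)
```

Only score differences enter the loss, so adding any function of $x$ alone to $r$ leaves the fit unchanged.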

2. Representability and Identifiability

Rc-BT exactly parameterizes the CPRD if and only if $P$ factors in a particular way. Specifically, $\omega_P$ is of Rc-BT form exactly when there exists a strictly positive function $g$ such that, for all relevant $(x, y, y')$: $\frac{P(x, y, y')}{P(x, y', y)} = \frac{g(x, y)}{g(x, y')}.$ In that case, $r(x, y) = \log g(x, y)$ (Pukdee et al., 10 Feb 2026).

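Moreover, Rc-BT is exact under the "positive–negative conditional independence" (CI) assumption: $P(y^+, y^- \mid x) = P_{+}(y^+ \mid x)\, P_{-}(y^- \mid x),$ where $P_{+}$ and $P_{-}$ denote the conditional marginals of the preferred and rejected responses. Under positive–negative CI, for almost all $x$: $r(x, y) = \log \frac{P_{+}(y \mid x)}{P_{-}(y \mid x)}$, up to an additive term that depends only on $x$. If CI fails, Rc-BT recovers the projection of the true CPRD onto the BT family, obtained by minimizing the KL divergence between the empirical CPRD and the BT parameterization (Pukdee et al., 10 Feb 2026).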
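The exactness claim can be checked numerically on a toy example. The sketch below assumes that CI means the joint over (preferred, rejected) responses given $x$ factorizes into a marginal over preferred responses times a marginal over rejected responses, and that the score is the log-ratio of those marginals. The distributions `p_plus` and `p_minus` are made-up values for a single fixed context.

```python
import math
from itertools import permutations

# Hypothetical marginals for one fixed context x.
p_plus = {"a": 0.6, "b": 0.3, "c": 0.1}    # marginal of preferred responses
p_minus = {"a": 0.2, "b": 0.3, "c": 0.5}   # marginal of rejected responses

def joint(y_pos, y_neg):
    # CI assumption: the joint factorizes. Any normalization over distinct
    # pairs cancels in the CPRD ratio, so it is omitted here.
    return p_plus[y_pos] * p_minus[y_neg]

def cprd(y, y_prime):
    return joint(y, y_prime) / (joint(y, y_prime) + joint(y_prime, y))

def score(y):
    # Candidate Bradley-Terry score: log-ratio of the two marginals.
    return math.log(p_plus[y] / p_minus[y])

def bt(y, y_prime):
    return 1.0 / (1.0 + math.exp(-(score(y) - score(y_prime))))

# Largest disagreement between the CPRD and the BT model over all ordered pairs.
max_gap = max(abs(cprd(y, yp) - bt(y, yp)) for y, yp in permutations(p_plus, 2))
```

In this toy case `max_gap` is zero up to floating-point error, consistent with the BT parameterization being exact when the joint factorizes.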

3. Ordinal Feedback and Response Conditioning

Rc-BT generalizes naturally to ordinal feedback, allowing a feedback label $z$ in an ordered set $\mathcal{Z} \subseteq [0, 1]$, such as a 3-level or a 5-level scale. The "marginal unbiasedness" assumption postulates that the expected annotator feedback $\mathbb{E}[z \mid x, y, y']$ matches the (latent) preference probability, facilitating use of soft labels and reducing estimation variance: $\mathbb{E}[z \mid x, y, y'] = \omega_P(y \succ y' \mid x).$ Probabilities over the ordinal levels are interpolated so that this expectation constraint holds, and the Rc-BT loss aligns with unbiased cross-entropy or hinge-loss formulations (Liu et al., 2024).
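To make the role of soft labels concrete, the sketch below shows a cross-entropy loss with a soft label $z \in [0, 1]$ and checks that, because the loss is linear in $z$, averaging it over an annotator's label distribution equals the loss evaluated at the mean label. The 3-level scale $\{0, 0.5, 1\}$, the label probabilities, and the function names are illustrative assumptions rather than values from Liu et al.

```python
import math

def softplus(t):
    # Stable log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def soft_label_ce(z, delta):
    """Cross-entropy between soft label z and sigma(delta),
    where delta = r(x, y) - r(x, y') is the score difference."""
    log_p = -softplus(-delta)   # log sigma(delta)
    log_q = -softplus(delta)    # log (1 - sigma(delta))
    return -(z * log_p + (1.0 - z) * log_q)

# Hypothetical 3-level scale and an annotator distribution with mean 0.7.
levels = [0.0, 0.5, 1.0]
label_probs = [0.1, 0.4, 0.5]
expected_label = sum(z * q for z, q in zip(levels, label_probs))

delta = 0.8  # arbitrary score difference
avg_loss = sum(q * soft_label_ce(z, delta) for z, q in zip(levels, label_probs))
loss_at_mean = soft_label_ce(expected_label, delta)
```

Under marginal unbiasedness the mean label equals the preference probability, so training on ordinal labels targets the same population objective as training on the oracle soft label.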

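Direct Preference Optimization (DPO) objectives additionally extend naturally to the Rc-BT setting via

$\mathcal{L}(\pi_\theta) = -\mathbb{E}\big[z \log \sigma(\Delta_\theta) + (1 - z) \log \sigma(-\Delta_\theta)\big], \quad \Delta_\theta = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} - \beta \log \frac{\pi_\theta(y' \mid x)}{\pi_{\mathrm{ref}}(y' \mid x)},$

where $z \in [0, 1]$ is the ordinal feedback label for $y$ over $y'$, $\sigma$ is the logistic function, and $\Delta_\theta$ plays the role of an appropriately defined margin (Liu et al., 2024).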
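One plausible reading of this extension, sketched below, replaces the hard label in the standard DPO logistic loss with a soft ordinal label $z$. This is an assumption about the form of the objective, not a transcription from Liu et al.; the function name `ordinal_dpo_loss` and the log-probability inputs are hypothetical.

```python
import math

def softplus(t):
    # Stable log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def ordinal_dpo_loss(z, logp_1, logp_2, ref_logp_1, ref_logp_2, beta=0.1):
    """Soft-label DPO loss for one pair (y1, y2).

    z is the feedback label for "y1 preferred over y2" in [0, 1];
    logp_* are sequence log-probabilities under the policy,
    ref_logp_* under the frozen reference policy.
    """
    # Implicit reward margin between the two responses.
    margin = beta * ((logp_1 - ref_logp_1) - (logp_2 - ref_logp_2))
    # -z * log sigma(margin) - (1 - z) * log sigma(-margin)
    return z * softplus(-margin) + (1.0 - z) * softplus(margin)

# With z = 1 this reduces to the usual hard-label DPO loss.
hard = ordinal_dpo_loss(1.0, -1.0, -3.0, -2.0, -2.0, beta=1.0)
# When the policy equals the reference, the margin is zero for any z.
neutral = ordinal_dpo_loss(0.5, -1.0, -2.0, -1.0, -2.0)
```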

4. Response-Conditioned Modeling for Confounder Disentanglement

Rc-BT admits augmentation to address confounders such as length bias in reward modeling for LLMs. The construction generates response-conditioned preference pairs that explicitly disentangle content quality from compliance with structural constraints (e.g., length). For each preferred response, a "too-short" length constraint is imposed; for each rejected response, a "long-enough" constraint is imposed, forming two dataset partitions:

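$(x, x_\ell^{1}, y^+) \in \mathcal{D}_{\mathrm{Rc}}^{+}$ indicates a preference for $y^+$ under the original prompt $x$, relative to a prompt $x_\ell^{1}$ with a forbidding length constraint.
$(x_\ell^{2}, x, y^-) \in \mathcal{D}_{\mathrm{Rc}}^{-}$ prefers $y^-$ when coupled with a prompt $x_\ell^{2}$ carrying a permissive constraint.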

The reward model $r_\theta$ is then fit by minimizing the response-conditioned negative log-likelihood over both partitions, enforcing orthogonal learning of semantic quality and compliance with explicit instructions (Cai et al., 2 Feb 2025).
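The construction above can be sketched as follows. The constraint wording ("Answer in at most N words."), the word-count rule, and the function names are hypothetical choices for illustration; Cai et al. may use different templates and thresholds. Each tuple is ordered as (preferred prompt, dispreferred prompt, response), so the comparison is between prompts for a fixed response.

```python
import math

def build_rc_pairs(prompt, chosen, rejected):
    """Build one tuple per partition: (preferred prompt, dispreferred prompt, response)."""
    n_chosen = len(chosen.split())
    n_rejected = len(rejected.split())
    # Forbidding constraint: a word limit the chosen response violates.
    forbidding = f"{prompt}\nAnswer in at most {max(n_chosen - 1, 0)} words."
    # Permissive constraint: a word limit the rejected response satisfies.
    permissive = f"{prompt}\nAnswer in at most {n_rejected} words."
    chosen_part = (prompt, forbidding, chosen)      # original prompt preferred
    rejected_part = (permissive, prompt, rejected)  # constrained prompt preferred
    return chosen_part, rejected_part

def rc_nll(reward, pairs):
    """Average -log sigma(reward(x_win, y) - reward(x_lose, y)) over the pairs."""
    total = 0.0
    for x_win, x_lose, y in pairs:
        delta = reward(x_win, y) - reward(x_lose, y)
        total += max(-delta, 0.0) + math.log1p(math.exp(-abs(delta)))  # softplus(-delta)
    return total / len(pairs)

chosen_part, rejected_part = build_rc_pairs(
    "Explain caching.", "It stores results for reuse", "It helps")
# A constant reward cannot separate the prompts, so the loss is log 2.
baseline = rc_nll(lambda x, y: 0.0, [chosen_part, rejected_part])
```

Because the response is held fixed within each tuple, the only signal the reward model can use to lower this loss is whether the response complies with the instruction in the prompt.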

This results in reward models and policy optimization (Rc-DPO) objectives that confer substantially higher adherence to constraints (e.g., length control) and higher semantic evaluation accuracy compared to baseline BT or RM approaches.

5. Sample Complexity: Margin and Connectivity

Statistical analysis reveals two key data-dependent factors governing Rc-BT sample complexity:

Margin: Measures score separation for correct orderings. Larger margins increase error tolerance and improve accuracy.
Connectivity degree: Quantifies how well frequently compared pairs relate to variance in test distributions. It is calculated as the ratio of expected squared margin differences to variance under a hypothesis class.

The finite-sample estimation error is bounded in terms of these two quantities, with the corresponding test accuracy scaling with the margin and the connectivity degree (Pukdee et al., 10 Feb 2026).
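As a rough illustration of the margin notion only, the sketch below computes the smallest score gap between preferred and rejected responses over a set of triplets. This is one plausible reading; the definition in Pukdee et al. may differ (for example, it may be normalized or taken in expectation), and the connectivity degree is not sketched here because its exact form is not given above. The score table and the name `min_margin` are hypothetical.

```python
def min_margin(score, triplets):
    """Smallest gap score(x, y_plus) - score(x, y_minus) over the triplets."""
    return min(score(x, y_pos) - score(x, y_neg) for x, y_pos, y_neg in triplets)

# Toy score table (hypothetical values).
toy_scores = {("q", "a"): 2.0, ("q", "b"): 0.5, ("q", "c"): -1.0}
score = lambda x, y: toy_scores[(x, y)]

# Each triplet is (context, preferred, rejected); the gaps are 1.5, 1.5 and 3.0.
triplets = [("q", "a", "b"), ("q", "b", "c"), ("q", "a", "c")]
gamma = min_margin(score, triplets)
```

A negative value would indicate at least one pair that the score function orders incorrectly.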

6. Empirical Results and Practical Considerations

Empirical benchmarks demonstrate Rc-BT’s advantages across multiple axes:

Reward modeling: Rc-BT achieves 10–16% higher quality accuracy and 30–35% better length adherence on models such as Qwen2, Llama, and Gemma, relative to standard RMs (Cai et al., 2 Feb 2025, Liu et al., 2024).
Mitigating confounders: Rc-RM eliminates the monotonic score-length correlation typical of length-biased models.
Ordinal feedback: 5-level and 3-level Rc-BT models exhibit lower cross-entropy loss and higher out-of-domain accuracy than binary BT models, consistent with Rademacher complexity reductions (Liu et al., 2024).
Policy learning: Rc-DPO policies consistently show higher win rates and constraint adherence on both standard and length-focused benchmarks.

Feedback Type	In-Dist. Acc.	OOD Acc.	CE Loss
Binary (0,1)	93.29–94.01%	76.67–86.97%	0.5709–0.5736
5-level	93.71–93.72%	81.00–85.84%	0.5704–0.5714
3-level	93.59–93.81%	80.16–85.80%	0.5704–0.5715
Oracle soft	93.82–94.01%	81.93–86.97%	0.5698–0.5711

Fine-grained annotations and moderate mixing of "tie" labels have been found to further increase accuracy and smooth learning (Liu et al., 2024).

7. Model Misspecification and Limitations

When data violate the positive–negative CI assumption, Rc-BT does not recover the true CPRD but projects it via KL minimization onto the BT-representable set: $r^\star \in \arg\min_{r} \mathbb{E}\big[\mathrm{KL}\big(\omega_P(\cdot \mid x) \,\|\, \omega_r(\cdot \mid x)\big)\big],$ where $\omega_r(y \succ y' \mid x) = \sigma\big(r(x, y) - r(x, y')\big)$, $\sigma$ is the logistic function, and the expectation is over compared pairs. Experimentally, manipulating margin and connectivity yields predictable effects: increasing the minimum margin sharply boosts accuracy in small-sample regimes; extremal negative sampling (too easy or too hard) reduces connectivity and degrades accuracy; and optimizing negative sampling to maximize connectivity restores accuracy when connectivity is the bottleneck (Pukdee et al., 10 Feb 2026).
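The projection behaviour can be seen on a toy CPRD that no Bradley–Terry model can represent: a cyclic preference in which a beats b, b beats c, and c beats a, each with probability 0.8. The sketch below fits scores by gradient descent on the cross-entropy, which differs from the KL divergence only by a constant that does not depend on the scores. Equal weighting of the three pairs, the step size, and the function name `project_onto_bt` are assumptions made for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def project_onto_bt(target, init, steps=500, lr=0.5):
    """Gradient descent on the summed cross-entropy between target[(y, y')]
    and sigma(s[y] - s[y']), with every pair weighted equally."""
    s = dict(init)
    for _ in range(steps):
        grad = {y: 0.0 for y in s}
        for (y, y_prime), p in target.items():
            g = sigmoid(s[y] - s[y_prime]) - p   # d(cross-entropy) / d(s[y])
            grad[y] += g
            grad[y_prime] -= g
        for y in s:
            s[y] -= lr * grad[y]
    return s

# Cyclic target: not representable by any single score per response.
cyclic = {("a", "b"): 0.8, ("b", "c"): 0.8, ("c", "a"): 0.8}
scores = project_onto_bt(cyclic, init={"a": 1.0, "b": 0.0, "c": -1.0})
fitted = {pair: sigmoid(scores[pair[0]] - scores[pair[1]]) for pair in cyclic}
```

In this symmetric case the fitted scores collapse to a common value, so every fitted pairwise probability is 0.5: the closest BT model discards the cycle entirely rather than recovering the true 0.8 preferences.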

References

"What Does Preference Learning Recover from Pairwise Comparison Data?" (Pukdee et al., 10 Feb 2026)
"Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling" (Cai et al., 2 Feb 2025)
"Reward Modeling with Ordinal Feedback: Wisdom of the Crowd" (Liu et al., 2024)
