
Contextual Dueling Bandits: Theory & Algorithms

Updated 22 December 2025
  • Contextual dueling bandits are a framework that uses contextual features and noisy pairwise feedback to guide sequential decision-making.
  • Algorithms typically posit linear or generalized linear (GLM) utility models and use strategies such as UCB or Thompson sampling to estimate latent utilities and drive exploration.
  • The approach ensures rigorous regret minimization and scalability, proving effective in recommendation systems, preference learning, and reinforcement learning from human feedback.

Contextual dueling bandits constitute a general framework for sequential decision-making tasks that require incorporating side information (context) and relying solely on noisy, pairwise preference (duel) feedback rather than absolute numeric rewards. This class of algorithms provides rigorous regret minimization guarantees and practical scalability for domains such as recommendation systems, information retrieval, preference learning, and reinforcement learning from human feedback—settings where eliciting graded or scalar rewards is infeasible or unreliable, but comparative feedback (which of two options is preferred) is tractable and robust.

1. Formal Problem Setting and Models

In contextual dueling bandits, at each round $t = 1, \ldots, T$, the learner observes a context $x_t$ (from some context space $\mathcal{X}$) and faces an action set $\mathcal{A}_t$ (possibly large or infinite). The learner selects two distinct arms $a_t, b_t \in \mathcal{A}_t$ and observes a binary preference outcome $o_t \in \{0,1\}$ indicating which arm is preferred. The pairwise preference probability is governed by a latent utility function model, commonly linear or generalized linear:

  • Linear model: $u(x, a) = \phi(x, a)^\top \theta^*$, where $\phi$ is a feature map and $\theta^*$ is an unknown parameter vector.
  • Generalized linear model (GLM): The preference probability is

$$p_t = \mu\left(u(x_t, a_t) - u(x_t, b_t)\right),$$

where $\mu: \mathbb{R} \to (0,1)$ is a monotonic link function, e.g., the Bradley-Terry-Luce (BTL), logistic, or probit models (Bengs et al., 2022, Di et al., 2023, Dudík et al., 2015).

Feedback is always pairwise: only the sign of $u(x_t, a_t) - u(x_t, b_t)$, possibly corrupted by stochastic or adversarial noise, is observed. Strong, average, and Borda regret and related notions provide the theoretical performance metrics.
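To make the feedback model concrete, the following minimal sketch draws a duel outcome from a linear-utility logistic/BTL model. The dimensions, feature map, and function names are illustrative assumptions, not taken from any cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                               # feature dimension (illustrative)
theta_star = rng.normal(size=d)     # unknown utility parameter

def phi(x, a):
    """Joint context-arm feature map; an elementwise product is used purely for illustration."""
    return x * a

def duel(x, a, b):
    """Sample o ~ Bernoulli(mu(u(x, a) - u(x, b))) with a logistic link mu."""
    diff = phi(x, a) @ theta_star - phi(x, b) @ theta_star
    p = 1.0 / (1.0 + np.exp(-diff))   # Bradley-Terry / logistic preference probability
    return int(rng.random() < p)      # 1 means arm a is preferred over arm b

x = rng.normal(size=d)                            # context for this round
a, b = rng.normal(size=d), rng.normal(size=d)     # two candidate arms
print(duel(x, a, b))
```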

2. Regret Notions and Minimax Lower Bounds

Major regret formulations include:

  • Average regret: $r_t^{a} = u(x_t, a_t^*) - \tfrac{1}{2}\left[u(x_t, a_t) + u(x_t, b_t)\right]$, where $a_t^*$ maximizes $u(x_t, \cdot)$ over $\mathcal{A}_t$.
  • Weak regret: $r_t^{w} = u(x_t, a_t^*) - \max\{u(x_t, a_t), u(x_t, b_t)\}$; the strong variant replaces the max with a min, charging the worse of the two chosen arms.
  • Borda regret: Well-defined even for general nontransitive preference models; $R_T = \sum_{t=1}^T \left[2B(i^*) - B(i_t) - B(j_t)\right]$, where $B(i)$ is the Borda score of arm $i$, i.e., its probability of winning a duel against a uniformly random opponent (Wu et al., 2023). A toy computation of these quantities appears below.
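The sketch below computes the per-round average, weak, and Borda quantities for a made-up utility vector and a nontransitive preference matrix; all numbers are illustrative.

```python
import numpy as np

def round_regrets(utilities, a, b):
    """Per-round average and weak regret given the true utilities of all arms for this context."""
    best = utilities.max()
    avg_regret = best - 0.5 * (utilities[a] + utilities[b])
    weak_regret = best - max(utilities[a], utilities[b])
    return avg_regret, weak_regret

def borda_scores(P):
    """Borda score of each arm: average win probability against a uniformly random opponent."""
    K = P.shape[0]
    return (P.sum(axis=1) - 0.5) / (K - 1)   # drop the self-comparison P[i, i] = 0.5

print(round_regrets(np.array([1.0, 0.7, 0.4]), a=1, b=2))   # (0.45, 0.3)

# Nontransitive 3-arm example: P[i, j] = probability that arm i beats arm j.
P = np.array([[0.5, 0.7, 0.4],
              [0.3, 0.5, 0.8],
              [0.6, 0.2, 0.5]])
B = borda_scores(P)
print(B, 2 * B.max() - B[0] - B[2])          # Borda scores and the Borda regret of dueling arms 0, 2
```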

Fundamental lower bounds for contextual dueling bandits with a $d$-dimensional feature map are as follows:

  • Stochastic GLM: Any algorithm must incur at least $\Omega(d\sqrt{T})$ cumulative regret in the average/weak sense; this is tight for linear and GLM models under standard regularity conditions (Bengs et al., 2022, Di et al., 2023).
  • Best-response regret: For $K$ arms, the minimax rate is of order $\sqrt{KT \cdot \mathrm{Reg}_{\mathcal{F}}(T)}$, where $\mathrm{Reg}_{\mathcal{F}}(T)$ is the regret of an online regression oracle over the function class $\mathcal{F}$; matching algorithms resolve prior statistical-computational gaps (Saha et al., 2021).
  • Borda regret (nontransitive): Lower bound of $\Omega(d^{2/3} T^{2/3})$ for general stochastic GLM dueling (Wu et al., 2023).

3. Algorithmic Principles

3.1 Linear and GLM Algorithms

The dominant algorithmic designs follow an optimism-in-the-face-of-uncertainty principle built on uncertainty estimates for parameter learning, via UCB (upper confidence bound) rules, posterior (Thompson) sampling, or frequentist design-based methods; a simplified sketch of this template appears after the list below.

  • CoLSTIM employs a GLM-MLE estimator for $\theta^*$ and selects arms by random utility perturbations and confidence-based exploration in linear stochastic transitivity models, attaining $O(\sqrt{dT})$ average regret (Bengs et al., 2022).
  • Variance-Aware Contextual Dueling Bandit (VACDB) achieves refined regret $O\big(d\sqrt{\sum_{t=1}^{T} \sigma_t^2} + d\big)$, with $\sigma_t^2$ the pairwise variance, by adapting the SupLinUCB/GLM upper-confidence principle to the observed variance in pairwise comparisons (Di et al., 2023).
  • Best-response regret minimization reduces dueling to square-loss online regression with an efficient oracle, achieving optimal $O(\sqrt{KT \,\operatorname{Reg}_{\mathcal{F}}(T)})$ best-response regret (Saha et al., 2021).
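A minimal sketch of this MLE-plus-optimism template, assuming a logistic link and a finite set of candidate arm features per round. The estimator, confidence radius `beta`, and the "greedy arm plus optimistic challenger" selection rule are simplified placeholders, not the exact procedures of CoLSTIM or VACDB.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_mle(Z, O, lam=1.0, steps=25):
    """Regularized logistic MLE for theta from duel feature differences Z and binary outcomes O."""
    d = Z.shape[1]
    theta = np.zeros(d)
    for _ in range(steps):                    # Newton iterations on the regularized log-likelihood
        p = sigmoid(Z @ theta)
        grad = Z.T @ (p - O) + lam * theta
        H = (Z * (p * (1 - p))[:, None]).T @ Z + lam * np.eye(d)
        theta -= np.linalg.solve(H, grad)
    return theta

def select_duel(features, theta, V_inv, beta):
    """Pick the greedy arm, then the challenger with the largest optimistic advantage over it."""
    scores = features @ theta
    a = int(np.argmax(scores))                # exploit: estimated best arm
    diffs = features - features[a]            # phi_b - phi_a for every candidate b
    bonus = np.sqrt(np.einsum("bi,ij,bj->b", diffs, V_inv, diffs))
    ucb = diffs @ theta + beta * bonus        # optimistic utility advantage of b over a
    ucb[a] = -np.inf                          # force a distinct second arm
    return a, int(np.argmax(ucb))

# Toy round: 8 candidate arms in d = 5 dimensions (illustrative numbers throughout).
rng = np.random.default_rng(0)
d, K, lam, beta = 5, 8, 1.0, 1.0
Z = rng.normal(size=(30, d))                          # past duel feature differences
O = rng.integers(0, 2, size=30).astype(float)         # past binary outcomes
theta_hat = fit_logistic_mle(Z, O, lam)
V_inv = np.linalg.inv(lam * np.eye(d) + Z.T @ Z)      # regularized design-matrix inverse
a_t, b_t = select_duel(rng.normal(size=(K, d)), theta_hat, V_inv, beta)
print(a_t, b_t)
```

In a full loop one would append the observed feature difference and outcome to `Z` and `O`, refresh `V_inv`, and refit after each round.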

3.2 Neural and Nonlinear Algorithms

Efforts to go beyond linear rewards employ neural architectures and NTK-based statistical analysis:

  • Neural Dueling Bandits (NDB-UCB, NDB-TS): Parameterize the latent utility $u(\cdot,\cdot)$ by a wide ReLU network and apply UCB or Thompson sampling, leveraging NTK theory for the confidence radius, yielding $O(\tilde{d}\sqrt{T})$ regret, where $\tilde{d}$ is the effective NTK dimension (Verma et al., 24 Jul 2024, Oh et al., 2 Jun 2025).
  • Variance-Aware Neural Dueling (NVLDB): Uses last-layer-only online learning with random features, weighted by uncertainty, to match linear minimax and variance-aware rates while providing computational efficiency (Oh et al., 2 Jun 2025).
  • Active collection and policy optimization: Neural-ADB and related methods couple active context selection, neural value estimation, and UCB/TS arm selection, guaranteeing $\tilde{O}(\sqrt{\tilde{d}/T})$ decay of the worst suboptimality gap (Verma et al., 16 Apr 2025). A toy last-layer feature construction illustrating these neural designs follows below.
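The sketch below illustrates, under simplifying assumptions, the last-layer pattern mentioned above: a fixed random ReLU embedding stands in for a learned neural representation, and the linear MLE/UCB machinery from Section 3.1 runs on top of it. Widths, dimensions, and names are illustrative, not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_a, m = 5, 5, 256                      # context dim, arm dim, random-feature width (illustrative)
W = rng.normal(size=(m, d_x + d_a)) / np.sqrt(d_x + d_a)

def neural_features(x, a):
    """Fixed random ReLU features standing in for a learned last-layer representation."""
    z = np.concatenate([x, a])
    return np.maximum(W @ z, 0.0) / np.sqrt(m)

# These m-dimensional features replace phi(x, a); only the final linear layer
# (the theta estimated by the logistic MLE) is then learned online from duels.
psi = neural_features(rng.normal(size=d_x), rng.normal(size=d_a))
print(psi.shape)
```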

3.3 Structural and Federated Extensions

  • Clustering of dueling bandits (COLDB/CONDB): Supports multiple users by adaptive graph-based user clustering, sharing data within clusters of users that share latent reward parameters, yielding $O(d\sqrt{mT})$ regret when the $u$ users fall into $m \ll u$ clusters (Wang et al., 4 Feb 2025).
  • Federated dueling bandits (FLDB-OGD): Agents independently conduct context-arm selection and local updates, periodically exchanging summary gradients and Gram matrices, allowing learning with formal trade-offs between communication rounds and regret (2502.01085); a toy aggregation sketch follows this list.
  • Adversarial feedback robustness (RCDB): Uncertainty-weighted GLM-MLE yields $\tilde{O}(d\sqrt{T}/\kappa + dC/\kappa)$ regret, where $C$ is the adversarial corruption budget and $\kappa$ lower-bounds the derivative of the link function; refined analyses for sigmoid links reduce the dependence on this curvature constant (Di et al., 16 Apr 2024).
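To illustrate the kind of communication involved in the federated setting (a deliberately simplified sketch, not the FLDB-OGD protocol itself), each agent could share only a Gram matrix and a gradient-style statistic computed from its local duels, which a server then combines into a single parameter correction; all names and numbers are illustrative.

```python
import numpy as np

def local_stats(Z, O, theta):
    """Statistics one agent shares: its Gram matrix and a logistic-gradient term (no raw duels)."""
    p = 1.0 / (1.0 + np.exp(-(Z @ theta)))    # predicted preference probabilities
    return Z.T @ Z, Z.T @ (O - p)

def aggregate(stats, d, lam=1.0):
    """Server combines agents' statistics into one regularized least-squares-style correction."""
    V, g = lam * np.eye(d), np.zeros(d)
    for gram, grad in stats:
        V += gram
        g += grad
    return np.linalg.solve(V, g)              # broadcast back and added to the shared parameter

# Toy usage with two agents, d = 4 (illustrative).
rng = np.random.default_rng(2)
d, theta = 4, np.zeros(4)
stats = [local_stats(rng.normal(size=(20, d)), rng.integers(0, 2, size=20).astype(float), theta)
         for _ in range(2)]
theta = theta + aggregate(stats, d)
print(theta)
```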

4. Recent Innovations and Representative Algorithms

| Algorithm / Advance | Setting | Regret Bound |
| --- | --- | --- |
| VACDB (Di et al., 2023) | Linear/GLM, variance-aware | $O\big(d\sqrt{\sum_t \sigma_t^2} + d\big)$ |
| ROAM: Recycle History (Sankagiri et al., 26 Aug 2025) | History features | $O(d\sqrt{T})$ |
| CoLSTIM (Bengs et al., 2022) | Linear stochastic transitivity (GLM) | $O(d\sqrt{T})$ (average regret) |
| Neural Dueling (Verma et al., 24 Jul 2024) | Nonlinear / NTK | $O(\tilde{d}\sqrt{T})$ |
| COLDB/CONDB (Wang et al., 4 Feb 2025) | Multi-user, clusters | $O(d\sqrt{mT})$; neural: $O(\sqrt{\tilde{d}\,mT})$ |
| FGTS.CDB (Thompson) (Li et al., 9 Apr 2024) | Linear, Bayesian | $\tilde{O}(d\sqrt{T})$ |
| RCDB (adversarial) (Di et al., 16 Apr 2024) | Corrupted feedback | $\tilde{O}(d\sqrt{T}/\kappa + dC/\kappa)$ |
| Kernelized/Borda (offline) (Mehta et al., 2023) | Nonparametric, offline | Suboptimality $O(1/\sqrt{T})$ (kernel information gain) |
| Federated dueling (2502.01085) | Multi-agent | $\tilde{O}(Md + \sqrt{T}\,d/\kappa)$ |

5. Refinements: Structure, Feedback, and Application-Specific Models

  • Offline and active query selection: Approaches that enable active selection of context/arm pairs for data-efficient RLHF-style feedback acquisition use kernel ridge regression and UCB-driven context-action selection (Mehta et al., 2023).
  • Conversational and attribute-based dueling: Incorporate relative feedback on arm attributes (“key-terms”) to accelerate learning in conversational recommender systems (Yang et al., 26 Jul 2024).
  • Borda regret for nontransitive models: Explore-then-commit and adversarial algorithms achieve $\tilde{O}(d^{2/3} T^{2/3})$ Borda regret, optimal even under GLMs with nontransitive feedback (Wu et al., 2023); a toy explore-then-commit sketch follows this list.
  • History recycling: ROAM and related history-augmented algorithms exploit free comparisons with past arms to drive efficient exploration, attaining strong empirical and theoretical performance (Sankagiri et al., 26 Aug 2025).
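As an illustration of the explore-then-commit idea for Borda regret, the sketch below estimates Borda scores from uniformly random duels and then commits to the empirical winner. The schedule, constants, and the `duel` interface are illustrative assumptions, not the tuned procedure from the cited work.

```python
import numpy as np

def borda_etc(duel, K, T, explore_frac=0.3, rng=None):
    """Explore-then-commit on Borda scores; duel(i, j) returns True if arm i beats arm j."""
    rng = rng or np.random.default_rng()
    wins, plays = np.zeros(K), np.zeros(K)
    T_explore = int(explore_frac * T)
    for _ in range(T_explore):                       # uniform exploration over random duels
        i, j = rng.choice(K, size=2, replace=False)
        o = duel(i, j)
        wins[i] += o
        wins[j] += 1 - o
        plays[i] += 1
        plays[j] += 1
    borda_hat = wins / np.maximum(plays, 1)          # empirical Borda-score estimates
    best = int(np.argmax(borda_hat))
    for _ in range(T - T_explore):                   # commit: duel the estimated Borda winner
        j = rng.choice([k for k in range(K) if k != best])
        duel(best, j)
    return best, borda_hat

# Toy usage: nontransitive 3-arm preference matrix P[i, j] = P(arm i beats arm j).
P = np.array([[0.5, 0.7, 0.4],
              [0.3, 0.5, 0.8],
              [0.6, 0.2, 0.5]])
env_rng = np.random.default_rng(3)
best, borda_hat = borda_etc(lambda i, j: env_rng.random() < P[i, j], K=3, T=2000)
print(best, borda_hat)
```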

6. Connections, Open Questions, and Future Directions

Key open questions and directions:

  • Variance adaptivity: Exploit fine-grained feedback variance for sharper regret, especially in near-deterministic feedback regimes (Di et al., 2023, Oh et al., 2 Jun 2025).
  • Scalable computation: Shallow exploration, last-layer neural parametrizations, and clustering/federated schemes enhance computational practicality (Wang et al., 4 Feb 2025, 2502.01085, Oh et al., 2 Jun 2025).
  • Nonlinear and nonparametric models: Deeper theoretical analysis for deep neural and kernelized dueling bandits (effective dimension, bias-variance tradeoff, sample complexity) (Verma et al., 24 Jul 2024, Verma et al., 16 Apr 2025, Mehta et al., 2023).
  • Adversarial feedback and robust estimation: Quantify resilience to systematic preference misreporting and partial observability, and refine confidence estimation under model misspecification (Di et al., 16 Apr 2024).
  • Active and interactive feedback: Principles for minimizing query and label costs by active selection in learning from expensive human feedback (Verma et al., 16 Apr 2025, Mehta et al., 2023).
  • Multi-agent, federated, and privacy-aware learning: Exploit collaboration without compromising user data privacy or incurring undue communication cost (2502.01085).

Contextual dueling bandits bring together bandit theory, preference learning, robust estimation, and deep representation learning. Their formulation and refinement are central to next-generation preference-centric learning systems, including RLHF, personalized ranking, and federated recommendation.
