
Neural Dueling Bandit Algorithms

Updated 22 December 2025
  • Neural dueling bandit algorithms are preference-based online learning methods that use deep neural networks to model complex, non-linear reward functions from pairwise comparisons.
  • They incorporate strategies like Neural-UCB and Neural-TS, leveraging NTK-based uncertainty estimates to balance exploration and exploitation effectively.
  • Empirical and theoretical analyses reveal improved regret bounds, scalability, and practical benefits in applications such as recommender systems and RLHF.

Neural dueling bandit algorithms constitute a class of preference-based online learning methods that leverage neural networks to model complex, non-linear reward functions in contextual dueling or preference-based bandit problems. Unlike classical approaches, which assume linear or simple parametric reward structures, neural dueling bandit methods are designed to handle settings with rich context, highly non-linear user utilities, possibly multiple users, and feedback limited to noisy pairwise comparisons. These algorithms have yielded rigorous theoretical guarantees and demonstrated empirical improvements in regret and scalability on challenging synthetic and real-world tasks (Verma et al., 24 Jul 2024, Wang et al., 4 Feb 2025, Verma et al., 16 Apr 2025, Oh et al., 2 Jun 2025).

1. Problem Formulation and Motivation

Neural dueling bandit settings embed the contextual dueling bandit framework into deep-representation models. At each round $t$:

  • A context $c_t \in \mathcal{C}$ is observed.
  • The learner selects a pair of arms $(a_{t,1}, a_{t,2}) \in \mathcal{A} \times \mathcal{A}$ (sometimes from a context-dependent set $X_t$).
  • A binary preference outcome $y_t \in \{0,1\}$ is observed, indicating which arm is preferred under $c_t$.
  • The outcome is assumed to follow a stochastic model such as Bradley–Terry–Luce:

$$\Pr[y_t = 1 \mid c_t, a_{t,1}, a_{t,2}] = \mu\big(f(\phi(c_t, a_{t,1})) - f(\phi(c_t, a_{t,2}))\big),$$

where $\mu(z) = 1/(1+e^{-z})$ and $f$ is an unknown latent utility function.

  • The objective is to minimize cumulative regret or suboptimality gap, e.g.,

$$R_T = \sum_{t=1}^T r_t^a, \qquad r_t^a = f(x_t^*) - \tfrac{1}{2}\big(f(x_{t,1}) + f(x_{t,2})\big),$$

where $x_t^*$ is the optimal arm in context $c_t$ and $x_{t,i} = \phi(c_t, a_{t,i})$ denotes the context-arm feature vector.
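
To make this feedback model concrete, the following sketch simulates one round of a contextual duel under the BTL link. Everything here is a hypothetical stand-in: the utility `f`, the feature vectors, and the chosen pair are illustrative, not part of any cited algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu(z):
    """BTL link: probability that the first arm wins given utility gap z."""
    return 1.0 / (1.0 + np.exp(-z))

def f(x):
    """Hypothetical non-linear latent utility, unknown to the learner."""
    return np.sin(x @ np.array([0.5, -1.0, 0.3]))

def duel(x1, x2):
    """Sample the binary preference y ~ Bernoulli(mu(f(x1) - f(x2)))."""
    return int(rng.random() < mu(f(x1) - f(x2)))

# One round: feature vectors phi(c_t, a) for K = 10 candidate arms.
X = rng.normal(size=(10, 3))
x1, x2 = X[0], X[1]        # placeholder pair; a real learner selects these
y = duel(x1, x2)           # observed preference outcome

# Per-round average regret of the chosen pair against the best arm.
r = f(X).max() - 0.5 * (f(x1) + f(x2))
```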

Neural dueling bandit algorithms address limitations of linear and kernel methods by learning $f$ with deep neural networks, enabling improved performance in settings such as recommendation with implicit user feedback, prompt optimization, and RLHF (Verma et al., 24 Jul 2024, Verma et al., 16 Apr 2025).

2. Neural Network-Based Utility Estimation

The latent utility $f$ is modeled via a deep (typically ReLU) network $h(x;\theta)$, often initialized in the Neural Tangent Kernel (NTK) regime:

  • Model:

$$h(x;\theta) = W_L\,\mathrm{ReLU}\big(W_{L-1}\cdots\mathrm{ReLU}(W_1 x)\cdots\big)$$

with parameters $\theta = \mathrm{vec}(W_1, \ldots, W_L)$ and width $m$.

  • Feature representer:

$$g(x;\theta_0) = \nabla_\theta h(x;\theta)\,\big|_{\theta = \theta_0}$$

  • Pairwise utility difference: $h(x;\theta) - h(x';\theta)$ estimates $f(x) - f(x')$.
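
A minimal sketch of this utility network and its gradient features, written in PyTorch as an assumed implementation substrate; the depth, width, and input dimension are illustrative, and real instantiations use much larger widths to approach the NTK regime.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, m = 3, 64   # input dimension and (illustrative) width

# Deep ReLU utility network h(x; theta).
h = nn.Sequential(
    nn.Linear(d, m), nn.ReLU(),
    nn.Linear(m, m), nn.ReLU(),
    nn.Linear(m, 1),
)

def grad_features(x):
    """g(x; theta): gradient of h(x; theta) w.r.t. all parameters, flattened."""
    out = h(x).squeeze()
    grads = torch.autograd.grad(out, h.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

g0 = grad_features(torch.randn(d))   # feature vector used for confidence widths
```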

The network is trained at each round (or batch) on previous duels using the negative log-likelihood:

$$L_t(\theta) = -\frac{1}{t-1} \sum_{s=1}^{t-1} \Big[ y_s \log \mu\big(h(x_{s,1};\theta) - h(x_{s,2};\theta)\big) + (1-y_s) \log\big(1 - \mu\big(h(x_{s,1};\theta) - h(x_{s,2};\theta)\big)\big) \Big] + \frac{\lambda}{2}\,\|\theta-\theta_0\|_2^2$$

This formulation is universal across principal neural dueling bandit algorithms (Verma et al., 24 Jul 2024, Wang et al., 4 Feb 2025, Verma et al., 16 Apr 2025, Oh et al., 2 Jun 2025).
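
Continuing the sketch above, one (hypothetical) gradient step on this regularized negative log-likelihood might look as follows; `theta0` is a frozen copy of the initialization and `lam` stands in for $\lambda$.

```python
import copy
import torch.nn.functional as F

theta0 = copy.deepcopy(h).requires_grad_(False)  # frozen initialization theta_0
lam = 0.1                                        # regularization strength (illustrative)
opt = torch.optim.SGD(h.parameters(), lr=1e-2)

def train_step(X1, X2, y):
    """X1, X2: (n, d) features of the two arms in past duels; y: (n,) outcomes."""
    logits = (h(X1) - h(X2)).squeeze(-1)         # h(x_{s,1}) - h(x_{s,2})
    nll = F.binary_cross_entropy_with_logits(logits, y.float())
    reg = sum(((p - p0) ** 2).sum()
              for p, p0 in zip(h.parameters(), theta0.parameters()))
    loss = nll + 0.5 * lam * reg                 # NLL + (lambda/2)||theta - theta_0||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Here `binary_cross_entropy_with_logits` computes exactly the BTL negative log-likelihood under the sigmoid link $\mu$.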

3. Exploration, Exploitation, and Algorithmic Principles

Exploration and exploitation are balanced via uncertainty quantification on the neural utility estimates. Two primary classes of algorithms are used:

  • Neural-UCB: After greedy selection of the first arm, the second arm is chosen to maximize

$$\mathrm{UCB}(x \mid x_{t,1}) = h(x;\theta_t) + \nu_T\,\sigma_{t-1}(x, x_{t,1})$$

where $\sigma_{t-1}$ is the Mahalanobis distance in the NTK feature space and $\nu_T$ is a problem- and calibration-dependent parameter.

  • Neural-TS (Thompson Sampling): For each candidate $x$, sample

$$r_t(x) \sim \mathcal{N}\big(h(x;\theta_t) - h(x_{t,1};\theta_t),\ \nu_T^2\,\sigma_{t-1}^2(x, x_{t,1})\big)$$

and select the maximizing $x$ as the second arm.

Variance-aware strategies (NVLDB) restrict the Gram matrix to last-layer features for computational efficiency and calibrate exploration bonuses by the empirical variance $\hat{\sigma}_i^2 = g(\Delta f_i)\,[1 - g(\Delta f_i)]$ of pairwise outcomes at each round (Oh et al., 2 Jun 2025).
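
The sketch below implements both second-arm rules against the gradient features defined earlier; `V` is the regularized Gram matrix of past duel features and `nu_T` the exploration parameter, both illustrative. Solving against the full Gram matrix at every candidate is exactly the cost that NVLDB's last-layer restriction avoids.

```python
nu_T = 1.0                                   # exploration parameter (illustrative)
p = sum(q.numel() for q in h.parameters())   # total parameter count
V = torch.eye(p)                             # lambda * I + accumulated duel features

def sigma(x, x1):
    """Confidence width ||g(x) - g(x1)||_{V^{-1}} of the pairwise difference."""
    z = grad_features(x) - grad_features(x1)
    return torch.sqrt(z @ torch.linalg.solve(V, z))

def pick_second_arm(X, x1, strategy="ucb"):
    scores = []
    for x in X:
        gap = (h(x) - h(x1)).squeeze()       # estimated utility difference
        width = nu_T * sigma(x, x1)
        if strategy == "ucb":
            scores.append(gap + width)                    # optimistic bonus
        else:                                             # Thompson sampling
            scores.append(gap + width * torch.randn(()))  # Gaussian perturbation
    return int(torch.stack(scores).argmax())
```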

The table below organizes the main algorithmic variants:

| Algorithm | Utility Model | Exploration | Gram Matrix Scope |
|-----------|---------------|-------------|-------------------|
| NDB-UCB/TS | Deep ReLU NN | UCB / TS | Full NTK |
| NVLDB-UCB/TS | Deep NN (shallow exploration) | UCB / TS, variance-aware | Last layer only |
| CONDB | Deep NN + clustering | UCB (optimistic) | NTK (per cluster) |

4. Multi-User Collaboration and Clustering

CONDB (Wang et al., 4 Feb 2025) extends the neural dueling bandit paradigm to multiple users by combining online clustering and neural modeling:

  • Users are clustered via an online graph $G_t = (\mathcal{U}, E_t)$; each connected component is a cluster.
  • Shared cluster-level neural networks $\Theta^{\mathrm{clust}}_t$ aggregate preference data from all users in a cluster and are trained via a pooled loss on all in-cluster duels.
  • After each duel, per-user networks $\Theta_{i_t,t}$ are fine-tuned on user-specific histories.
  • Clustering is refined by removing an edge $(i,\ell)$ whenever $\|\Theta_{i,t} - \Theta_{\ell,t}\|_2$ exceeds a theory-derived threshold, ensuring separation of distinct utility functions (see the sketch after this list).
  • Collaborative data sharing reduces sample complexity, leading to regret that scales with the true number of user types $m$ rather than the total number of users $u$.
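
A schematic of the edge-removal step, assuming `networkx` for graph bookkeeping; `threshold` stands in for the theory-derived separation bound and `user_models` for the per-user networks $\Theta_{i,t}$.

```python
import networkx as nx

def flatten(model):
    """Concatenate a model's parameters into one vector for comparison."""
    return torch.cat([q.detach().reshape(-1) for q in model.parameters()])

def refine_clusters(G, user_models, threshold):
    """Drop edge (i, l) when the per-user parameters have drifted apart."""
    for i, l in list(G.edges):
        if torch.norm(flatten(user_models[i]) - flatten(user_models[l])) > threshold:
            G.remove_edge(i, l)
    # Each connected component is a cluster that shares one pooled network.
    return list(nx.connected_components(G))
```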

5. Theoretical Guarantees

Theoretical analysis leverages NTK linearization, confidence-ellipsoid arguments, and the Mahalanobis geometry of neural feature spaces:

  • The base NDB-UCB/TS algorithms achieve

$$R_T = \widetilde{O}\left(\left(\frac{\sqrt{\widetilde{d}}}{\kappa_\mu} + B\sqrt{\frac{\lambda}{\kappa_\mu}}\right)\sqrt{T\,\widetilde{d}}\,\right)$$

where $\widetilde{d} = \log \det\big((\kappa_\mu/\lambda)\,H' + I\big)$ is the effective neural dimension, $\kappa_\mu$ the strong-convexity constant of the BTL link, and $B$ a bound on the NTK norm of the latent utility.

  • Variance-aware approach (NVLDB) achieves

$$R_T^a = \widetilde{O}\left(d\sqrt{\sum_{t=1}^T \sigma_t^2} + \sqrt{dT}\right)$$

where $\sigma_t^2$ is the true Bernoulli variance per round (Oh et al., 2 Jun 2025).

  • Multi-user collaboration (CONDB):

$$R_T = O\left(u\left(\frac{\widetilde{d}}{\kappa_\mu^2\,\tilde{\lambda}_x\,\gamma^2} + \frac{1}{\tilde{\lambda}_x^2}\right)\log T + \left(\frac{\sqrt{\widetilde{d}}}{\kappa_\mu} + B\sqrt{\lambda/\kappa_\mu}\right)\sqrt{\widetilde{d}\,m\,T}\right)$$

The second term demonstrates that, once clusters are separated, regret grows as $\sqrt{mT}$ rather than $\sqrt{uT}$ (Wang et al., 4 Feb 2025).

  • Suboptimality gap bounds for Neural-ADB:

$$\Delta_T^\pi \leq \widetilde{O}\left(\sqrt{\widetilde{d}/T}\right)$$

representing a suboptimality gap that decays at rate $\widetilde{O}(1/\sqrt{T})$ in the number of preference queries $T$ (Verma et al., 16 Apr 2025).

Key proof tools include linearization via infinite-width NTK theory, confidence-set constructions on neural parameters, and spectral bounds on Gram matrices.
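
Schematically, the confidence-set step takes the familiar generalized-linear form below; this is a simplified illustration in the notation above, not the exact statement or constants of any cited paper. With pairwise features $z_s = g(x_{s,1};\theta_0) - g(x_{s,2};\theta_0)$ and Gram matrix $V_t = \lambda I + \sum_{s<t} z_s z_s^\top$, an ellipsoid around the estimate $\hat{\theta}_t$ converts into a per-duel width:

$$\|\hat{\theta}_t - \theta^*\|_{V_t} \leq \beta_t \quad\Longrightarrow\quad \big|\langle z,\, \hat{\theta}_t - \theta^*\rangle\big| \leq \beta_t\,\|z\|_{V_t^{-1}},$$

which underlies the exploration bonus $\nu_T\,\sigma_{t-1}(x, x_{t,1})$ of Section 3; summing these widths via an elliptical-potential argument yields the $\sqrt{T\widetilde{d}}$-type regret terms above.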

6. Algorithmic Recipes and Practical Implementation

Principal neural dueling bandit algorithms employ the following operational sequence:

  1. Model initialization: Initialize the neural network at NTK or Gaussian weights.
  2. Data collection: At each round, observe context(s) and available arms.
  3. Arm/duel selection:
    • First arm: greedy maximization of neural utility.
    • Second arm: UCB bonus or Thompson sampling, with variance-aware bonuses in scalable variants.
    • Optionally, context selection is done to maximize exploration (Neural-ADB).
  4. Feedback and update: Observe pairwise preference; update the neural model by minimizing cross-entropy plus regularization.
  5. Exploration statistics update: Update Gram matrices for UCB/TS computation.
  6. Clustering (CONDB): Update user graphs based on model parameter divergence.
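
Reusing the hypothetical components from the earlier sketches, a single-user round of this recipe might be wired together as follows (with Thompson sampling chosen arbitrarily as the second-arm rule).

```python
def run_round(X, y_oracle):
    """One round: select a duel, observe the preference, update model and stats.

    X: (K, d) tensor of context-arm features; y_oracle(x1, x2) -> 0/1 preference.
    """
    global V
    # 3a. First arm: greedy maximization of the current utility estimate.
    with torch.no_grad():
        i1 = int(h(X).squeeze(-1).argmax())
    x1 = X[i1]
    # 3b. Second arm: posterior sampling over the estimated utility gap.
    x2 = X[pick_second_arm(X, x1, strategy="ts")]
    # 4. Observe the duel and take a gradient step on the BTL loss.
    y = y_oracle(x1, x2)
    train_step(x1.unsqueeze(0), x2.unsqueeze(0), torch.tensor([float(y)]))
    # 5. Update the Gram matrix with the observed pairwise feature difference.
    z = grad_features(x1) - grad_features(x2)
    V += torch.outer(z, z)
```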

Empirical protocols use batch or incremental gradient steps, often leveraging warm starts for efficiency in high-volume, online settings. Clustering and per-user adaptation are especially beneficial when user heterogeneity is moderate (Wang et al., 4 Feb 2025).

7. Empirical Results, Computational Aspects, and Applications

Experimental validation demonstrates:

  • Robust sublinear regret for non-linear reward functions (e.g., quadratic, trigonometric forms).
  • NVLDB variants (variance-aware, shallow exploration) outperform earlier neural methods and classic linear dueling bandits in both regret and computational efficiency: e.g., NVLDB–UCB–ASYM runs in roughly 1.7 minutes per 2000 rounds versus roughly 49 minutes for full-NTK NDB–UCB (Oh et al., 2 Jun 2025).
  • Multi-user clustering variants (CONDB) achieve a 20–40% reduction in total regret versus independent per-user models, with the most pronounced advantages at intermediate cluster sizes (Wang et al., 4 Feb 2025).
  • Neural-ADB attains the lowest suboptimality gaps in active preference-collection tasks, especially in high-dimensional-context or non-linear reward scenarios (Verma et al., 16 Apr 2025).

Application domains include scalable recommender systems, search/ranking with pairwise preference feedback, medical-trial design, and RLHF for LLMs, notably for active query selection in human preference acquisition (Wang et al., 4 Feb 2025, Verma et al., 16 Apr 2025, Verma et al., 24 Jul 2024).


Neural dueling bandit algorithms, through deep model-based utility estimation, exploration-calibrated uncertainty, and data-sharing strategies such as clustering, provide a rigorous and empirically validated foundation for sequential preference-based optimization under complex, real-world feedback channels.
