
Neural Dueling Bandit Algorithms

Updated 22 December 2025
  • Neural dueling bandit algorithms are preference-based online learning methods that use deep neural networks to model complex, non-linear reward functions from pairwise comparisons.
  • They incorporate strategies like Neural-UCB and Neural-TS, leveraging NTK-based uncertainty estimates to balance exploration and exploitation effectively.
  • Empirical and theoretical analyses reveal improved regret bounds, scalability, and practical benefits in applications such as recommender systems and RLHF.

Neural dueling bandit algorithms constitute a class of preference-based online learning methods that leverage neural networks to model complex, non-linear reward functions in contextual dueling or preference-based bandit problems. Unlike classical approaches, which assume linear or simple parametric reward structures, neural dueling bandit methods are designed to handle settings with rich context, highly non-linear user utilities, possibly multiple users, and feedback limited to noisy pairwise comparisons. These algorithms have yielded rigorous theoretical guarantees and demonstrated empirical improvements in regret and scalability on challenging synthetic and real-world tasks (Verma et al., 24 Jul 2024, Wang et al., 4 Feb 2025, Verma et al., 16 Apr 2025, Oh et al., 2 Jun 2025).

1. Problem Formulation and Motivation

Neural dueling bandit settings embed the contextual dueling bandit framework into deep-representation models. At each round $t$:

  • A context $c_t \in \mathcal{C}$ is observed.
  • The learner selects a pair of arms $(a_{t,1}, a_{t,2}) \in \mathcal{A} \times \mathcal{A}$ (sometimes from a context-dependent set $X_t$).
  • A binary preference outcome $y_t \in \{0,1\}$ is observed, indicating which arm is preferred under $c_t$.
  • The outcome is assumed to follow a stochastic model such as Bradley–Terry–Luce:

$$\Pr[y_t = 1 \mid c_t, a_{t,1}, a_{t,2}] = \mu\big(f(\phi(c_t, a_{t,1})) - f(\phi(c_t, a_{t,2}))\big),$$

where $\mu(z) = 1/(1+e^{-z})$ and $f$ is an unknown latent utility function.

  • The objective is to minimize cumulative regret or suboptimality gap, e.g.,

$$R_T = \sum_{t=1}^T r_t^a, \qquad r_t^a = f(x_t^*) - \tfrac{1}{2}\big(f(x_{t,1}) + f(x_{t,2})\big),$$

where $x_t^*$ is the optimal arm in context $c_t$ and $x_{t,i} = \phi(c_t, a_{t,i})$ denotes the context-arm feature vector.
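
To make this feedback model concrete, the following sketch simulates one round of a contextual duel under the BTL link. Everything here is a hypothetical stand-in: the utility `f`, the feature vectors, and the chosen pair are illustrative, not part of any cited algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu(z):
    """BTL link: probability that the first arm wins given utility gap z."""
    return 1.0 / (1.0 + np.exp(-z))

def f(x):
    """Hypothetical non-linear latent utility, unknown to the learner."""
    return np.sin(x @ np.array([0.5, -1.0, 0.3]))

def duel(x1, x2):
    """Sample the binary preference y ~ Bernoulli(mu(f(x1) - f(x2)))."""
    return int(rng.random() < mu(f(x1) - f(x2)))

# One round: feature vectors phi(c_t, a) for K = 10 candidate arms.
X = rng.normal(size=(10, 3))
x1, x2 = X[0], X[1]        # placeholder pair; a real learner selects these
y = duel(x1, x2)           # observed preference outcome

# Per-round average regret of the chosen pair against the best arm.
r = f(X).max() - 0.5 * (f(x1) + f(x2))
```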

Neural dueling bandit algorithms address limitations of linear and kernel methods by learning $f$ with deep neural networks, enabling improved performance in settings such as recommendation with implicit user feedback, prompt optimization, and RLHF (Verma et al., 24 Jul 2024, Verma et al., 16 Apr 2025).

2. Neural Network-Based Utility Estimation

The latent utility $f$ is modeled via a deep (typically ReLU) network $h(x;\theta)$, often initialized in the Neural Tangent Kernel (NTK) regime:

  • Model:

$$h(x;\theta) = W_L\,\mathrm{ReLU}\big(W_{L-1}\cdots\mathrm{ReLU}(W_1 x)\cdots\big)$$

with parameters $\theta = \mathrm{vec}(W_1, \ldots, W_L)$ and width $m$.

  • Feature representer:

$$g(x;\theta_0) = \nabla_\theta h(x;\theta)\,\big|_{\theta = \theta_0}$$

  • Pairwise utility difference: $h(x;\theta) - h(x';\theta)$ estimates $f(x) - f(x')$.
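
A minimal sketch of this utility network and its gradient features, written in PyTorch as an assumed implementation substrate; the depth, width, and input dimension are illustrative, and real instantiations use much larger widths to approach the NTK regime.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, m = 3, 64   # input dimension and (illustrative) width

# Deep ReLU utility network h(x; theta).
h = nn.Sequential(
    nn.Linear(d, m), nn.ReLU(),
    nn.Linear(m, m), nn.ReLU(),
    nn.Linear(m, 1),
)

def grad_features(x):
    """g(x; theta): gradient of h(x; theta) w.r.t. all parameters, flattened."""
    out = h(x).squeeze()
    grads = torch.autograd.grad(out, h.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

g0 = grad_features(torch.randn(d))   # feature vector used for confidence widths
```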

The network is trained at each round (or batch) on previous duels using the negative log-likelihood:

$$L_t(\theta) = -\frac{1}{t-1} \sum_{s=1}^{t-1} \Big[ y_s \log \mu\big(h(x_{s,1};\theta) - h(x_{s,2};\theta)\big) + (1-y_s) \log\big(1 - \mu\big(h(x_{s,1};\theta) - h(x_{s,2};\theta)\big)\big) \Big] + \frac{\lambda}{2}\,\|\theta-\theta_0\|_2^2$$

This formulation is universal across principal neural dueling bandit algorithms (Verma et al., 24 Jul 2024, Wang et al., 4 Feb 2025, Verma et al., 16 Apr 2025, Oh et al., 2 Jun 2025).
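
Continuing the sketch above, one (hypothetical) gradient step on this regularized negative log-likelihood might look as follows; `theta0` is a frozen copy of the initialization and `lam` stands in for $\lambda$.

```python
import copy
import torch.nn.functional as F

theta0 = copy.deepcopy(h).requires_grad_(False)  # frozen initialization theta_0
lam = 0.1                                        # regularization strength (illustrative)
opt = torch.optim.SGD(h.parameters(), lr=1e-2)

def train_step(X1, X2, y):
    """X1, X2: (n, d) features of the two arms in past duels; y: (n,) outcomes."""
    logits = (h(X1) - h(X2)).squeeze(-1)         # h(x_{s,1}) - h(x_{s,2})
    nll = F.binary_cross_entropy_with_logits(logits, y.float())
    reg = sum(((p - p0) ** 2).sum()
              for p, p0 in zip(h.parameters(), theta0.parameters()))
    loss = nll + 0.5 * lam * reg                 # NLL + (lambda/2)||theta - theta_0||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Here `binary_cross_entropy_with_logits` computes exactly the BTL negative log-likelihood under the sigmoid link $\mu$.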

3. Exploration, Exploitation, and Algorithmic Principles

Exploration and exploitation are balanced via uncertainty quantification on the neural utility estimates. Two primary classes of algorithms are used:

  • Neural-UCB: After greedy selection of the first arm, the second arm is chosen to maximize

$$\mathrm{UCB}(x \mid x_{t,1}) = h(x;\theta_t) + \nu_T\,\sigma_{t-1}(x, x_{t,1})$$

where $\sigma_{t-1}$ is the Mahalanobis distance in the NTK feature space and $\nu_T$ is a problem- and calibration-dependent parameter.

  • Neural-TS (Thompson Sampling): For each candidate $x$, sample

$$r_t(x) \sim \mathcal{N}\big(h(x;\theta_t) - h(x_{t,1};\theta_t),\ \nu_T^2\,\sigma_{t-1}^2(x, x_{t,1})\big)$$

and select the maximizing $x$ as the second arm.

Variance-aware strategies (NVLDB) restrict the Gram matrix to last-layer features for computational efficiency and calibrate exploration bonuses by the empirical variance $\hat{\sigma}_i^2 = g(\Delta f_i)\,[1 - g(\Delta f_i)]$ of pairwise outcomes at each round (Oh et al., 2 Jun 2025).
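
The sketch below implements both second-arm rules against the gradient features defined earlier; `V` is the regularized Gram matrix of past duel features and `nu_T` the exploration parameter, both illustrative. Solving against the full Gram matrix at every candidate is exactly the cost that NVLDB's last-layer restriction avoids.

```python
nu_T = 1.0                                   # exploration parameter (illustrative)
p = sum(q.numel() for q in h.parameters())   # total parameter count
V = torch.eye(p)                             # lambda * I + accumulated duel features

def sigma(x, x1):
    """Confidence width ||g(x) - g(x1)||_{V^{-1}} of the pairwise difference."""
    z = grad_features(x) - grad_features(x1)
    return torch.sqrt(z @ torch.linalg.solve(V, z))

def pick_second_arm(X, x1, strategy="ucb"):
    scores = []
    for x in X:
        gap = (h(x) - h(x1)).squeeze()       # estimated utility difference
        width = nu_T * sigma(x, x1)
        if strategy == "ucb":
            scores.append(gap + width)                    # optimistic bonus
        else:                                             # Thompson sampling
            scores.append(gap + width * torch.randn(()))  # Gaussian perturbation
    return int(torch.stack(scores).argmax())
```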

The table below organizes the main algorithmic variants:

| Algorithm | Utility Model | Exploration | Gram Matrix Scope |
|-----------|---------------|-------------|-------------------|
| NDB-UCB/TS | Deep ReLU NN | UCB / TS | Full NTK |
| NVLDB-UCB/TS | Deep NN (shallow exploration) | UCB / TS, variance-aware | Last layer only |
| CONDB | Deep NN + clustering | UCB (optimistic) | NTK (per cluster) |

4. Multi-User Collaboration and Clustering

CONDB (Wang et al., 4 Feb 2025) extends the neural dueling bandit paradigm to multiple users by combining online clustering and neural modeling:

  • Users are clustered via an online graph $G_t = (\mathcal{U}, E_t)$; each connected component is a cluster.
  • Shared cluster-level neural networks $\Theta^{\mathrm{clust}}_t$ aggregate preference data from all users in a cluster and are trained via a pooled loss on all in-cluster duels.
  • After each duel, per-user networks $\Theta_{i_t,t}$ are fine-tuned on user-specific histories.
  • Clustering is refined by removing an edge $(i,\ell)$ whenever $\|\Theta_{i,t} - \Theta_{\ell,t}\|_2$ exceeds a theory-derived threshold, ensuring separation of distinct utility functions (see the sketch after this list).
  • Collaborative data sharing reduces sample complexity, leading to regret that scales with the true number of user types $m$ rather than the total number of users $u$.
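
A schematic of the edge-removal step, assuming `networkx` for graph bookkeeping; `threshold` stands in for the theory-derived separation bound and `user_models` for the per-user networks $\Theta_{i,t}$.

```python
import networkx as nx

def flatten(model):
    """Concatenate a model's parameters into one vector for comparison."""
    return torch.cat([q.detach().reshape(-1) for q in model.parameters()])

def refine_clusters(G, user_models, threshold):
    """Drop edge (i, l) when the per-user parameters have drifted apart."""
    for i, l in list(G.edges):
        if torch.norm(flatten(user_models[i]) - flatten(user_models[l])) > threshold:
            G.remove_edge(i, l)
    # Each connected component is a cluster that shares one pooled network.
    return list(nx.connected_components(G))
```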

5. Theoretical Guarantees

Theoretical analysis leverages NTK linearization, confidence-ellipsoid arguments, and the Mahalanobis geometry of neural feature spaces:

  • The base NDB-UCB/TS algorithms achieve

$$R_T = \widetilde{O}\left(\left(\frac{\sqrt{\widetilde{d}}}{\kappa_\mu} + B\sqrt{\frac{\lambda}{\kappa_\mu}}\right)\sqrt{T\,\widetilde{d}}\,\right)$$

where $\widetilde{d} = \log \det\big((\kappa_\mu/\lambda)\,H' + I\big)$ is the effective neural dimension, $\kappa_\mu$ the strong-convexity constant of the BTL link, and $B$ a bound on the NTK norm of the latent utility.

  • Variance-aware approach (NVLDB) achieves

$$R_T^a = \widetilde{O}\left(d\sqrt{\sum_{t=1}^T \sigma_t^2} + \sqrt{dT}\right)$$

where $\sigma_t^2$ is the true Bernoulli variance per round (Oh et al., 2 Jun 2025).

  • Multi-user collaboration (CONDB):

$$R_T = O\left(u\left(\frac{\widetilde{d}}{\kappa_\mu^2\,\tilde{\lambda}_x\,\gamma^2} + \frac{1}{\tilde{\lambda}_x^2}\right)\log T + \left(\frac{\sqrt{\widetilde{d}}}{\kappa_\mu} + B\sqrt{\lambda/\kappa_\mu}\right)\sqrt{\widetilde{d}\,m\,T}\right)$$

The second term demonstrates that, once clusters are separated, regret grows as $\sqrt{mT}$ rather than $\sqrt{uT}$ (Wang et al., 4 Feb 2025).

  • Suboptimality gap bounds for Neural-ADB:

$$\Delta_T^\pi \leq \widetilde{O}\left(\sqrt{\widetilde{d}/T}\right)$$

representing a suboptimality gap that decays at rate $\widetilde{O}(1/\sqrt{T})$ in the number of preference queries $T$ (Verma et al., 16 Apr 2025).

Key proof tools include linearization via infinite-width NTK theory, confidence-set constructions on neural parameters, and spectral bounds on Gram matrices.
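
Schematically, the confidence-set step takes the familiar generalized-linear form below; this is a simplified illustration in the notation above, not the exact statement or constants of any cited paper. With pairwise features $z_s = g(x_{s,1};\theta_0) - g(x_{s,2};\theta_0)$ and Gram matrix $V_t = \lambda I + \sum_{s<t} z_s z_s^\top$, an ellipsoid around the estimate $\hat{\theta}_t$ converts into a per-duel width:

$$\|\hat{\theta}_t - \theta^*\|_{V_t} \leq \beta_t \quad\Longrightarrow\quad \big|\langle z,\, \hat{\theta}_t - \theta^*\rangle\big| \leq \beta_t\,\|z\|_{V_t^{-1}},$$

which underlies the exploration bonus $\nu_T\,\sigma_{t-1}(x, x_{t,1})$ of Section 3; summing these widths via an elliptical-potential argument yields the $\sqrt{T\widetilde{d}}$-type regret terms above.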

6. Algorithmic Recipes and Practical Implementation

Principal neural dueling bandit algorithms employ the following operational sequence:

  1. Model initialization: Initialize the neural network at NTK or Gaussian weights.
  2. Data collection: At each round, observe context(s) and available arms.
  3. Arm/duel selection:
    • First arm: greedy maximization of neural utility.
    • Second arm: UCB bonus or Thompson sampling, with variance-aware bonuses in scalable variants.
    • Optionally, context selection is done to maximize exploration (Neural-ADB).
  4. Feedback and update: Observe pairwise preference; update the neural model by minimizing cross-entropy plus regularization.
  5. Exploration statistics update: Update Gram matrices for UCB/TS computation.
  6. Clustering (CONDB): Update user graphs based on model parameter divergence.
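
Reusing the hypothetical components from the earlier sketches, a single-user round of this recipe might be wired together as follows (with Thompson sampling chosen arbitrarily as the second-arm rule).

```python
def run_round(X, y_oracle):
    """One round: select a duel, observe the preference, update model and stats.

    X: (K, d) tensor of context-arm features; y_oracle(x1, x2) -> 0/1 preference.
    """
    global V
    # 3a. First arm: greedy maximization of the current utility estimate.
    with torch.no_grad():
        i1 = int(h(X).squeeze(-1).argmax())
    x1 = X[i1]
    # 3b. Second arm: posterior sampling over the estimated utility gap.
    x2 = X[pick_second_arm(X, x1, strategy="ts")]
    # 4. Observe the duel and take a gradient step on the BTL loss.
    y = y_oracle(x1, x2)
    train_step(x1.unsqueeze(0), x2.unsqueeze(0), torch.tensor([float(y)]))
    # 5. Update the Gram matrix with the observed pairwise feature difference.
    z = grad_features(x1) - grad_features(x2)
    V += torch.outer(z, z)
```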

Empirical protocols use batch or incremental gradient steps, often leveraging warm starts for efficiency in high-volume, online settings. Clustering and per-user adaptation are especially beneficial when user heterogeneity is moderate (Wang et al., 4 Feb 2025).

7. Empirical Results, Computational Aspects, and Applications

Experimental validation demonstrates:

  • Robust sublinear regret for non-linear reward functions (e.g., quadratic, trigonometric forms).
  • NVLDB variants (variance-aware, shallow exploration) outperform earlier neural methods and classic linear dueling bandits in both regret and computational efficiency: e.g., NVLDB–UCB–ASYM runs in roughly 1.7 minutes per 2000 rounds versus roughly 49 minutes for full-NTK NDB–UCB (Oh et al., 2 Jun 2025).
  • Multi-user clustering variants (CONDB) achieve a 20–40% reduction in total regret versus independent per-user models, with the most pronounced advantages at intermediate cluster sizes (Wang et al., 4 Feb 2025).
  • Neural-ADB attains the lowest suboptimality gaps in active preference-collection tasks, especially in high-dimensional-context or non-linear reward scenarios (Verma et al., 16 Apr 2025).

Application domains include scalable recommender systems, search/ranking with pairwise preference feedback, medical-trial design, and RLHF for LLMs, notably for active query selection in human preference acquisition (Wang et al., 4 Feb 2025, Verma et al., 16 Apr 2025, Verma et al., 24 Jul 2024).


Neural dueling bandit algorithms, through deep model-based utility estimation, exploration-calibrated uncertainty, and data-sharing strategies such as clustering, provide a rigorous and empirically validated foundation for sequential preference-based optimization under complex, real-world feedback channels.
