Neural Dueling Bandit Algorithms
- Neural dueling bandit algorithms are preference-based online learning methods that use deep neural networks to model complex, non-linear reward functions from pairwise comparisons.
- They incorporate strategies like Neural-UCB and Neural-TS, leveraging NTK-based uncertainty estimates to balance exploration and exploitation effectively.
- Empirical and theoretical analyses reveal improved regret bounds, scalability, and practical benefits in applications such as recommender systems and RLHF.
Neural dueling bandit algorithms constitute a class of preference-based online learning methods that leverage neural networks to model complex, non-linear reward functions in contextual dueling or preference-based bandit problems. Unlike classical approaches, which assume linear or simple parametric reward structures, neural dueling bandit methods are designed to handle settings with rich context, highly non-linear user utilities, possibly multiple users, and feedback limited to noisy pairwise comparisons. These algorithms have yielded rigorous theoretical guarantees and demonstrated empirical improvements in regret and scalability on challenging synthetic and real-world tasks (Verma et al., 24 Jul 2024, Wang et al., 4 Feb 2025, Verma et al., 16 Apr 2025, Oh et al., 2 Jun 2025).
1. Problem Formulation and Motivation
Neural dueling bandit settings embed the contextual dueling bandit framework into deep-representation models. At each round $t = 1, \dots, T$:
- A context $x_t$ is observed.
- The learner selects a pair of arms $(a_t, b_t)$ (sometimes from a context-dependent set $\mathcal{A}_t$).
- A binary preference outcome $y_t \in \{0,1\}$ is observed, indicating which arm is preferred under $x_t$.
- The outcome is assumed to follow a stochastic model such as Bradley–Terry–Luce:
$$\mathbb{P}(a_t \succ b_t \mid x_t) = \mu\big(f(x_t, a_t) - f(x_t, b_t)\big),$$
where $\mu(z) = 1/(1+e^{-z})$ is the logistic link and $f$ is an unknown latent utility function.
- The objective is to minimize cumulative regret or suboptimality gap, e.g.,
$$R_T = \sum_{t=1}^{T} \left( f(x_t, a_t^*) - \tfrac{1}{2}\big[f(x_t, a_t) + f(x_t, b_t)\big] \right),$$
where $a_t^* = \arg\max_{a \in \mathcal{A}_t} f(x_t, a)$ is the optimal arm in context $x_t$.
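The feedback model above can be made concrete in a few lines. In the sketch below, `latent_utility` is a hypothetical non-linear stand-in for the unknown utility $f$, and the per-round regret is the average suboptimality of the chosen pair:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def latent_utility(x, a):
    # Hypothetical non-linear utility of arm a in context x (stand-in for f).
    return np.cos(x @ a) + 0.5 * (x @ a) ** 2

def btl_duel(x, a, b):
    """Sample a Bradley-Terry-Luce preference: True if arm a wins the duel."""
    p = sigmoid(latent_utility(x, a) - latent_utility(x, b))
    return rng.random() < p

# One round: observe a context, pick two candidate arms, get a noisy outcome.
d = 4
x = rng.normal(size=d)
arms = rng.normal(size=(5, d))
a, b = arms[0], arms[1]
y = btl_duel(x, a, b)

# Instantaneous dueling regret: best utility minus the pair's average utility.
utils = np.array([latent_utility(x, arm) for arm in arms])
inst_regret = utils.max() - 0.5 * (latent_utility(x, a) + latent_utility(x, b))
```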
Neural dueling bandit algorithms address limitations of linear and kernel methods by learning with deep neural networks, enabling improved performance in settings such as recommendation with implicit user feedback, prompt optimization, and RLHF (Verma et al., 24 Jul 2024, Verma et al., 16 Apr 2025).
2. Neural Network-Based Utility Estimation
The latent utility $f$ is modeled via a deep (typically ReLU) network $h(z; \theta)$, often analyzed in the Neural Tangent Kernel (NTK) regime:
- Model: a depth-$L$ ReLU network $h(z; \theta)$ applied to the context–arm feature $z$, with parameters $\theta \in \mathbb{R}^p$ and layer width $m$.
- Feature representation: the NTK feature map $g(z; \theta_0) = \nabla_\theta h(z; \theta_0)$, i.e., the network gradient at initialization.
- Pairwise utility difference: $h(z_{t,a}; \theta) - h(z_{t,b}; \theta)$ estimates $f(x_t, a) - f(x_t, b)$.
The network is trained at each round (or batch) on previous duels using the regularized negative log-likelihood
$$\mathcal{L}_t(\theta) = -\sum_{s=1}^{t-1} \Big[ y_s \log \mu\big(\Delta_s(\theta)\big) + (1 - y_s) \log\big(1 - \mu(\Delta_s(\theta))\big) \Big] + \frac{\lambda}{2}\,\|\theta - \theta_0\|_2^2, \qquad \Delta_s(\theta) = h(z_{s,a}; \theta) - h(z_{s,b}; \theta).$$
This formulation is universal across principal neural dueling bandit algorithms (Verma et al., 24 Jul 2024, Wang et al., 4 Feb 2025, Verma et al., 16 Apr 2025, Oh et al., 2 Jun 2025).
3. Exploration, Exploitation, and Algorithmic Principles
Exploration and exploitation are balanced via uncertainty quantification on the neural utility estimates. Two primary classes of algorithms are used:
- Neural-UCB: After greedy selection of the first arm $a_t = \arg\max_{a} h(z_{t,a}; \theta_t)$, the second arm is chosen to maximize
$$b_t = \arg\max_{b} \; h(z_{t,b}; \theta_t) + \beta_t\, \big\| g(z_{t,b}; \theta_0) - g(z_{t,a}; \theta_0) \big\|_{V_t^{-1}},$$
where $\|v\|_{V_t^{-1}} = \sqrt{v^\top V_t^{-1} v}$ is the Mahalanobis norm in the NTK feature space and $\beta_t$ is a problem/calibration-dependent parameter.
- Neural-TS (Thompson Sampling): For each arm $b$, sample
$$\tilde{f}_{t,b} \sim \mathcal{N}\Big( h(z_{t,b}; \theta_t),\; \beta_t^2\, \big\| g(z_{t,b}; \theta_0) \big\|_{V_t^{-1}}^2 \Big)$$
and select the $b$ maximizing $\tilde{f}_{t,b}$ as the second arm.
Variance-aware strategies (NVLDB) restrict the Gram matrix to last-layer features for computational efficiency and calibrate exploration bonuses by the empirical variance of pairwise outcomes at each round (Oh et al., 2 Jun 2025).
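Given utility estimates `f_hat`, per-arm features `phi`, and the inverse Gram matrix, both selection rules reduce to a few lines of linear algebra. The sketch below uses last-layer-style features, as in the shallow-exploration variants, and treats `beta` as a tunable knob rather than the theory-prescribed value:

```python
import numpy as np

rng = np.random.default_rng(2)

def select_duel_ucb(f_hat, phi, V_inv, beta=1.0):
    """First arm greedily; second arm by optimistic pairwise bonus.

    f_hat : (K,) estimated utilities;  phi : (K, p) last-layer features;
    V_inv : (p, p) inverse regularized Gram matrix;  beta : calibration knob.
    """
    a = int(np.argmax(f_hat))
    diff = phi - phi[a]                       # pairwise feature differences
    # Mahalanobis bonus ||phi_b - phi_a||_{V^{-1}} for every candidate b.
    bonus = np.sqrt(np.einsum("kp,pq,kq->k", diff, V_inv, diff))
    b = int(np.argmax(f_hat + beta * bonus))  # bonus is 0 for b == a
    return a, b

def select_duel_ts(f_hat, phi, V_inv, beta=1.0):
    """Thompson-sampling variant: perturb each utility by its model variance."""
    std = np.sqrt(np.einsum("kp,pq,kq->k", phi, V_inv, phi))
    a = int(np.argmax(f_hat))
    sampled = rng.normal(f_hat, beta * std)
    return a, int(np.argmax(sampled))

# Toy usage: 6 arms, 8-dim last-layer features, identity Gram (no data yet).
K, p = 6, 8
phi = rng.normal(size=(K, p))
f_hat = rng.normal(size=K)
V_inv = np.eye(p)                             # lambda = 1 regularizer only
a, b = select_duel_ucb(f_hat, phi, V_inv)
```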
The following table organizes the main algorithmic variants:
| Algorithm | Utility Model | Exploration | Gram Matrix Scope |
|---|---|---|---|
| NDB-UCB/TS | Deep ReLU NN | UCB/TS | Full NTK |
| NVLDB-UCB/TS | Deep NN | UCB/TS, variance-aware | Last-layer only |
| CONDB | Deep NN + clustering | UCB (optimistic) | NTK (per-cluster) |
4. Multi-User Collaboration and Clustering
CONDB (Wang et al., 4 Feb 2025) extends the neural dueling bandit paradigm to multiple users by combining online clustering and neural modeling:
- Users are clustered via an online graph . Each connected component is a cluster.
- Shared cluster-level neural networks aggregate preference data from users in a cluster, trained via pooled loss on all in-cluster duels.
- After each duel, per-user networks are fine-tuned with user-specific histories.
- Clustering is refined by removing the edge $(i, j)$ whenever the gap between the users' parameter estimates, $\|\theta_i - \theta_j\|_2$, exceeds a theory-derived threshold, ensuring separation of distinct utility functions.
- Collaborative data sharing reduces sample complexity, leading to regret that scales with the true number of user types $m$ rather than the total number of users $n$.
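The cluster-refinement step can be sketched as follows; the confidence width used in the edge test is a hypothetical stand-in for CONDB's theory-derived threshold:

```python
import numpy as np

def refine_clusters(thetas, n_samples, alpha=1.0):
    """Rebuild the user graph and read clusters off its connected components.

    thetas    : (n_users, p) per-user parameter estimates
    n_samples : (n_users,) number of duels observed per user
    alpha     : scale of the (hypothetical) confidence-width threshold
    """
    n = len(thetas)
    # Confidence width shrinks as a user accumulates duels.
    width = alpha * np.sqrt(np.log(1 + n_samples) / np.maximum(n_samples, 1))
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            gap = np.linalg.norm(thetas[i] - thetas[j])
            adj[i, j] = adj[j, i] = gap <= width[i] + width[j]
    # Connected components via iterative DFS.
    label, cluster = -np.ones(n, dtype=int), 0
    for s in range(n):
        if label[s] >= 0:
            continue
        stack = [s]
        while stack:
            u = stack.pop()
            if label[u] >= 0:
                continue
            label[u] = cluster
            stack.extend(np.flatnonzero(adj[u] & (label < 0)))
        cluster += 1
    return label

# Two well-separated user types should yield exactly two clusters.
rng = np.random.default_rng(3)
t1, t2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
thetas = np.vstack([t1 + 0.01 * rng.normal(size=2) for _ in range(3)]
                   + [t2 + 0.01 * rng.normal(size=2) for _ in range(3)])
labels = refine_clusters(thetas, n_samples=np.full(6, 500))
```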
5. Theoretical Guarantees
Theoretical analysis leverages NTK linearization, confidence-ellipsoid arguments, and the Mahalanobis geometry of neural feature spaces:
- Regret bounds for single-user neural dueling bandit algorithms take the form (Verma et al., 24 Jul 2024, Verma et al., 16 Apr 2025, Oh et al., 2 Jun 2025):
$$R_T = \tilde{O}\!\left( \frac{\big(\sqrt{\tilde{d}} + B\big)\sqrt{\tilde{d}\,T}}{\kappa_\mu} \right),$$
where $\tilde{d}$ is the effective neural (NTK) dimension, $\kappa_\mu$ the strong-convexity constant of the BTL link, and $B$ an NTK-norm bound on the latent utility.
- The variance-aware approach (NVLDB) achieves
$$R_T = \tilde{O}\!\left( \tilde{d}\,\sqrt{\textstyle\sum_{t=1}^{T} \sigma_t^2} + \tilde{d} \right),$$
where $\sigma_t^2$ is the true Bernoulli variance of the pairwise outcome at round $t$, so the bound adapts to low-noise instances (Oh et al., 2 Jun 2025).
- Multi-user collaboration (CONDB): the regret splits into a cluster-identification term, independent of the horizon up to logarithmic factors, plus a collaborative term of order $\tilde{O}\big(\tilde{d}\sqrt{m T}/\kappa_\mu\big)$. The second term demonstrates that, once clusters are separated, regret grows as $\sqrt{m}$ in the number of user types rather than as $\sqrt{n}$ in the total number of users (Wang et al., 4 Feb 2025).
- Suboptimality-gap bounds for Neural-ADB decay as
$$\tilde{O}\!\left( \sqrt{\tilde{d}/T} \right),$$
representing a sublinear rate in the number of preference queries $T$ (Verma et al., 16 Apr 2025).
Key proof tools include linearization via infinite-width NTK theory, confidence-set constructions on neural parameters, and spectral bounds on Gram matrices.
6. Algorithmic Recipes and Practical Implementation
Principal neural dueling bandit algorithms employ the following operational sequence:
- Model initialization: Initialize the neural network with standard Gaussian (NTK-regime) random weights.
- Data collection: At each round, observe context(s) and available arms.
- Arm/duel selection:
- First arm: greedy maximization of neural utility.
- Second arm: UCB bonus or Thompson sampling, with variance-aware bonuses in scalable variants.
- Optionally (Neural-ADB), the context itself is actively selected to maximize exploration.
- Feedback and update: Observe pairwise preference; update the neural model by minimizing cross-entropy plus regularization.
- Exploration statistics update: Update Gram matrices for UCB/TS computation.
- Clustering (CONDB): Update user graphs based on model parameter divergence.
Empirical protocols use batch or incremental gradient steps, often leveraging warm starts for efficiency in high-volume, online settings. Clustering and per-user adaptation are especially beneficial when user heterogeneity is moderate (Wang et al., 4 Feb 2025).
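As a self-contained sanity check of this operational sequence (select duel, observe preference, update model and Gram matrix), the sketch below collapses the neural model to a linear feature map with an online logistic update; `beta`, the regularizer, and the step-size schedule are illustrative choices, and a neural variant would swap `phi` for network features:

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d, K, T, lam, beta = 5, 8, 1500, 1.0, 0.5
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)      # hidden true utility direction
theta_hat = np.zeros(d)
V = lam * np.eye(d)                           # regularized Gram matrix
regret = []

for t in range(T):
    phi = rng.normal(size=(K, d))             # context-dependent arm features
    f_hat = phi @ theta_hat
    a = int(np.argmax(f_hat))                 # greedy first arm
    V_inv = np.linalg.inv(V)
    diff = phi - phi[a]
    bonus = np.sqrt(np.einsum("kp,pq,kq->k", diff, V_inv, diff))
    b = int(np.argmax(f_hat + beta * bonus))  # optimistic second arm
    # Noisy BTL preference between the chosen pair.
    y = rng.random() < sigmoid(phi[a] @ theta_star - phi[b] @ theta_star)
    # Online logistic step on the feature difference, then Gram update.
    z = phi[a] - phi[b]
    lr = 1.0 / np.sqrt(t + 1)
    theta_hat += lr * (float(y) - sigmoid(theta_hat @ z)) * z
    V += np.outer(z, z)
    best = (phi @ theta_star).max()
    regret.append(best - 0.5 * (phi[a] + phi[b]) @ theta_star)
```

Per-round regret is nonnegative by construction, and its running average shrinks as the utility estimate sharpens and the exploration bonus decays.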
7. Empirical Results, Computational Aspects, and Applications
Experimental validation demonstrates:
- Robust sublinear regret for non-linear reward functions (e.g., quadratic, trigonometric forms).
- NVLDB variants (variance-aware, shallow exploration) outperform earlier neural methods and classic linear-dueling bandits both in regret and computational efficiency: e.g., NVLDB–UCB–ASYM runs in 1.7 min/2000 rounds versus 49 min for full-NTK NDB–UCB (Oh et al., 2 Jun 2025).
- Multi-user clustering variants (CONDB) achieve 20–40% reduction in total regret versus independent models, with most pronounced advantages in settings with intermediate cluster sizes (Wang et al., 4 Feb 2025).
- Neural-ADB shows lowest suboptimality gaps in active preference collection tasks, especially in high-context or non-linear reward scenarios (Verma et al., 16 Apr 2025).
Application domains include scalable recommender systems, search/ranking with pairwise preference feedback, medical-trial design, and RLHF for LLMs, notably for active query selection in human preference acquisition (Wang et al., 4 Feb 2025, Verma et al., 16 Apr 2025, Verma et al., 24 Jul 2024).
Neural dueling bandit algorithms, through deep model-based utility estimation, exploration-calibrated uncertainty, and data-sharing strategies such as clustering, provide a rigorous and empirically validated foundation for sequential preference-based optimization under complex, real-world feedback channels.