
Neural Surrogates & Bandit Optimization

Updated 28 February 2026
  • Neural surrogates and bandit optimization use neural networks to model complex reward functions for effective sequential decision-making.
  • NTK theory underpins methods like Neural-UCB, Neural-TS, and NPR, providing sublinear regret guarantees and practical uncertainty estimation.
  • Recent advances integrate structural priors, pre-trained features, and scalable computation for robust performance in offline, online, and non-stationary environments.

Neural surrogates and bandit optimization refer to a class of algorithms that embed neural networks as flexible surrogate models for reward or preference signals in sequential decision-making, typically within multi-armed or contextual bandit frameworks. The core premise is to exploit the representational capacity of neural networks to approximate complex, non-linear reward functions and to drive exploration–exploitation trade-offs via methods such as upper confidence bounds (UCB), Thompson sampling (TS), or reward perturbations. The primary goal is to minimize cumulative regret, often under minimal parametric assumptions on the true reward process.

1. Neural Surrogate Models in Bandit Frameworks

Neural surrogates are neural networks—typically fully connected ReLU nets or graph neural networks—trained to model the reward function $f: \mathcal{X} \to \mathbb{R}$ or, more generally, its pairwise or contextualized variants. Given observed context–arm pairs and corresponding rewards or preference feedback, these models serve as the basis for arm selection and uncertainty quantification. Neural surrogates replace classical linear or kernel-based reward approximators and drive the computation of acquisition functions for exploration.

Theoretical underpinnings largely rely on neural tangent kernel (NTK) theory. In sufficiently wide networks, optimization remains in a "lazy training" regime where the neural predictor is well-approximated by its first-order Taylor expansion at initialization. Empirical gradients $\nabla_{\theta} f(x; \theta_0)$ act as high-dimensional random features, and all confidence bounds, variance estimates, or posterior updates become tractable linear-algebraic operations in this random feature space (Zhou et al., 2019, Zhang et al., 2020).
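As a concrete illustration of the gradient-as-feature view, the following minimal NumPy sketch computes the flattened parameter gradient of a one-hidden-layer ReLU network (the function and variable names are illustrative, not taken from any paper's reference code):

```python
import numpy as np

def relu_net(x, W1, w2):
    """Width-m one-hidden-layer ReLU net: f(x) = w2^T relu(W1 x) / sqrt(m)."""
    m = len(w2)
    return w2 @ np.maximum(W1 @ x, 0.0) / np.sqrt(m)

def grad_features(x, W1, w2):
    """Flattened gradient of f w.r.t. (W1, w2): the NTK random-feature vector."""
    m = len(w2)
    pre = W1 @ x
    act = (pre > 0).astype(float)
    dW1 = np.outer(w2 * act, x) / np.sqrt(m)   # d f / d W1 (ReLU mask on rows)
    dw2 = np.maximum(pre, 0.0) / np.sqrt(m)    # d f / d w2
    return np.concatenate([dW1.ravel(), dw2])
```

Confidence widths, posterior variances, and Gram matrices in the algorithms below are all linear-algebraic functions of such feature vectors.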

2. Regret Guarantees and the Effective Dimension

State-of-the-art neural bandit algorithms establish high-probability sublinear regret bounds that directly mirror classical linear and kernel methods, but with complexity controlled by the NTK's effective dimension $\tilde d$. Given $T$ rounds and $K$ arms, for a network with parameter vector $\theta$, the effective dimension is

$$\tilde d = \frac{\log \det(I + H/\lambda)}{\log(1 + TK/\lambda)}$$

where $H$ is the NTK Gram matrix on observed contexts and $\lambda$ is the regularization parameter (Zhou et al., 2019, Zhang et al., 2020, Jia et al., 2022, Qi et al., 2022, Kassraie et al., 2021, Verma et al., 2024).
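The formula above is a direct linear-algebra computation; a minimal NumPy sketch (using the log-determinant for numerical stability):

```python
import numpy as np

def effective_dimension(H, lam, T, K):
    """Effective dimension of the NTK Gram matrix H with ridge parameter lam,
    over T rounds and K arms (a transcription of the formula above)."""
    p = H.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(p) + H / lam)  # log det(I + H/lam)
    return logdet / np.log(1.0 + T * K / lam)
```

Intuitively, $\tilde d$ counts the directions in which the NTK Gram matrix has eigenvalues large relative to $\lambda$, so well-conditioned, low-complexity reward landscapes yield small $\tilde d$ and tighter regret.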

Canonical regret bounds for UCB or Thompson sampling variants attain

$$R_T = \tilde{O}(\tilde d \sqrt{T})$$

under standard sub-Gaussian reward noise, sufficient exploration, and network widths $m = \mathrm{poly}(T, K, L, \dots)$ large enough to guarantee NTK concentration. This rate is realized for vanilla rewards (Zhang et al., 2020), preference feedback (Verma et al., 2024), grouped-structure arms (Qi et al., 2022), and variants with perturbed surrogate rewards (Jia et al., 2022).

3. Exploration Strategies: UCB, Thompson Sampling, and Reward Perturbation

Neural-UCB: Uses the surrogate reward $f(x; \theta)$ as the acquisition mean and the NTK-based variance as the exploration bonus: $U_{t,a} = f(x_{t,a}; \theta_{t-1}) + \gamma_{t-1} \lVert \nabla_{\theta} f(x_{t,a}; \theta_{t-1}) / \sqrt{m} \rVert_{Z_{t-1}^{-1}}$, where $Z_{t-1}$ is the regularized empirical covariance over past gradients (Zhou et al., 2019, Verma et al., 2024).
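The acquisition score is a one-liner once the gradient feature and inverse covariance are available; a hedged NumPy sketch (names are illustrative):

```python
import numpy as np

def ucb_score(mean, g, Z_inv, gamma):
    """Neural-UCB acquisition: surrogate mean plus gamma times the gradient
    feature's Mahalanobis norm under the inverse covariance Z^{-1}."""
    return mean + gamma * np.sqrt(g @ Z_inv @ g)
```

At each round the learner plays the arm maximizing this score over the candidate contexts.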

Neural Thompson Sampling: Constructs a surrogate Gaussian posterior with mean $f(x; \theta_{t-1})$ and NTK-derived variance, and selects actions by sampling from this posterior (Zhang et al., 2020, Verma et al., 2024).
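The posterior draw reuses the same variance term as the UCB bonus; a minimal sketch, assuming a scalar exploration multiplier `nu` as in standard Thompson-sampling presentations:

```python
import numpy as np

def ts_sample(mean, g, Z_inv, nu, rng):
    """Sample a reward from the surrogate Gaussian posterior
    N(mean, nu^2 * g^T Z^{-1} g); the arm with the largest draw is played."""
    return rng.normal(mean, nu * np.sqrt(g @ Z_inv @ g))
```

Exploration arises because arms with high posterior variance occasionally draw large samples, without any explicit bonus term.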

NPR (Neural Perturbed Rewards): Avoids explicit confidence-set or posterior computations by adding i.i.d. Gaussian pseudo-noise to previous rewards when retraining the surrogate, thereby inducing optimism and efficient exploration without constructing an explicit uncertainty estimate. The policy is greedy with respect to the perturbed surrogate (Jia et al., 2022).
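The perturb-and-refit loop can be sketched in a few lines; here ridge regression stands in for the neural surrogate purely for illustration (the actual method retrains the network on the perturbed rewards):

```python
import numpy as np

def npr_refit(X, rewards, sigma, lam=1.0, rng=None):
    """Perturb all past rewards with fresh i.i.d. Gaussian pseudo-noise of
    scale sigma, then refit the surrogate (ridge regression as a stand-in).
    The policy is greedy w.r.t. the refit model x -> x @ theta."""
    rng = rng or np.random.default_rng()
    y = rewards + rng.normal(0.0, sigma, size=len(rewards))
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

Because the noise is redrawn at every refit, repeated training runs implicitly sample from an ensemble of plausible surrogates, which is what induces optimism.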

The table below summarizes these exploration strategies:

Algorithm | Exploration Mechanism | Complexity
Neural-UCB | Explicit NTK-based bonus | $O(p^2)$ for matrix inversion
Neural-TS | Sample from NTK-Gaussian posterior | $O(p^2)$ for variance calculation
NPR | Randomized reward perturbation | No bonus, no matrix inversion

4. Structural Priors and Surrogate Model Variants

Recent advances extend neural surrogate bandit methods to capture structural priors and data symmetries.

  • Dueling/Preference Feedback: Models such as Neural Dueling Bandits employ neural surrogates trained with pairwise cross-entropy loss under Bradley–Terry–Luce models, with regret guarantees mirroring the vanilla reward setting but with respect to pairwise differences (Verma et al., 2024).
  • Group-Structured Rewards (AGG-UCB): Bandit algorithms leveraging group structure, as in Arm Group Graph UCB, incorporate GNN-based surrogates to exploit inter-group correlations. This enables propagation of uncertainty and information across correlated arms, with regret guarantees scaling in the NTK dimension of the GNN-parameterized surrogate (Qi et al., 2022).
  • Pre-Trained Representations: E2TC algorithms leverage pre-trained neural representations, adapting only the last linear layer or conducting local fine-tuning; regret bounds depend on a misspecification parameter $\epsilon_0$ and are proved even for non-asymptotic, finite-width networks (Terekhov, 9 Jan 2025).
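For the preference-feedback case, the training objective is the pairwise logistic (cross-entropy) loss under the Bradley–Terry–Luce model; a generic sketch on surrogate scores (not any paper's exact implementation):

```python
import numpy as np

def btl_pairwise_loss(s1, s2, pref):
    """Bradley-Terry-Luce cross-entropy on two surrogate scores.
    P(arm 1 preferred over arm 2) = sigmoid(s1 - s2); pref is 1 if arm 1 won,
    0 if arm 2 won."""
    p = 1.0 / (1.0 + np.exp(-(s1 - s2)))
    return -(pref * np.log(p) + (1 - pref) * np.log(1 - p))
```

Gradients of this loss with respect to the network parameters play the same random-feature role as in the vanilla reward setting, now attached to score differences.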

5. Surrogates in Offline-Online and Non-Stationary Bandits

Neural surrogates enable efficient integration of auxiliary or pre-training data, particularly in settings where only surrogate or auxiliary reward signals are available offline.

  • ML-Generated Surrogate Rewards: The ML-Assisted UCB (MLA-UCB) wraps arbitrary ML reward predictions as potentially biased surrogates for online bandit optimization. Under joint Gaussianity, MLA-UCB constructs unbiased and variance-reduced estimates via control variates, achieving regret improvements that scale with $1/\left(1-\rho_k^2\right)$, where $\rho_k$ is the correlation between predicted and true rewards for arm $k$ (Ji et al., 20 Jun 2025).
  • Nonstationary Regimes: NeuralBandit and committee-based strategies demonstrate resilience to non-stationary reward processes due to rapid adaptation via retraining and model selection among committees of neural surrogates (Allesiardo et al., 2014).
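The control-variate idea can be sketched as follows, assuming paired online rewards and ML predictions plus a known offline mean of the predictions (an illustrative sketch, not the paper's exact estimator):

```python
import numpy as np

def cv_mean(y, yhat, mu_pred):
    """Variance-reduced estimate of an arm's mean reward: online rewards y,
    paired ML predictions yhat, and mu_pred = the predictions' known mean.
    The coefficient c is the variance-minimizing control-variate weight."""
    c = np.cov(y, yhat, ddof=1)[0, 1] / np.var(yhat, ddof=1)
    return np.mean(y) - c * (np.mean(yhat) - mu_pred)
```

When the predictions are highly correlated with the true rewards, the subtraction cancels most of the sampling noise, which is the source of the $1/(1-\rho_k^2)$ improvement.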

6. Practical Considerations and Computational Scalability

A salient theme concerns the computational cost of neural bandit algorithms. NTK-based uncertainty estimation, while theoretically well-controlled, involves maintaining and inverting large $p \times p$ covariance matrices ($p \sim m^2 L$ for width $m$ and depth $L$) per round or per update. NPR-type algorithms (Jia et al., 2022) achieve major practical speedups by avoiding explicit uncertainty computation, matching regret rates of explicit UCB/TS methods on synthetic and real-world benchmarks with significantly lower runtime.
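In practice the covariance inverse is maintained incrementally: each round adds a rank-one term $g g^T$, so the Sherman–Morrison identity updates $Z^{-1}$ in $O(p^2)$ rather than re-inverting in $O(p^3)$. A minimal sketch:

```python
import numpy as np

def sm_update(Z_inv, g):
    """Sherman-Morrison rank-one update of Z^{-1} after adding g g^T to Z:
    (Z + g g^T)^{-1} = Z^{-1} - (Z^{-1} g)(Z^{-1} g)^T / (1 + g^T Z^{-1} g)."""
    Zg = Z_inv @ g
    return Z_inv - np.outer(Zg, Zg) / (1.0 + g @ Zg)
```

This is the standard trick behind the $O(p^2)$ per-round costs quoted in the table above.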

Empirically, methods employing neural surrogates uniformly outperform classical linear and kernel baselines on complex reward structures, provided that the network width is sufficient and the surrogate is updated with adequate frequency (Zhang et al., 2020, Jia et al., 2022, Zhou et al., 2019). The NTK approximation requires large overparameterization for guarantees; in regimes with pre-trained features and small models, exploration and regularization must be rebalanced (Terekhov, 9 Jan 2025).

7. Limitations, Open Problems, and Extensions

The current theory and regret guarantees are contingent on the network operating in the NTK/lazy-training regime, requiring wide architectures. Extensions to finite-width deep nets with significant feature learning remain open, as does robust uncertainty quantification outside NTK locality. The E2TC framework makes progress in locally convex, non-lazy regimes by leveraging pre-training (Terekhov, 9 Jan 2025).

Reward perturbation methods suggest that optimism and exploration can be decoupled from explicit uncertainty metrics, pointing towards new "randomization-as-exploration" paradigms for deep surrogates (Jia et al., 2022). Group-structured and preference-based surrogates highlight the flexibility of the approach but raise challenges in scaling to extremely high-dimensional or continuous action spaces.

Future developments include streaming or amortized training regimes, further extension to infinite arm-spaces with NTK-kernel regression, and integration of model-based surrogates, such as GNNs or decision-tree ensembles, provided that rigorous regret controls can be established.
