
MNL Contextual Bandits & Non-linear Utilities

Updated 18 January 2026
  • The paper introduces MNL contextual bandits with non-linear utility functions to capture complex relationships in choice modeling.
  • It details algorithmic frameworks including Optimistic UCB, batched methods, regression-oracle schemes, and Thompson Sampling with theoretical regret guarantees.
  • The study provides rigorous analysis and practical guidelines, ensuring near-optimal regret under proper smoothness, boundedness, and realizability conditions.

Multinomial Logit (MNL) contextual bandits with non-linear utilities extend the standard contextual bandit framework by incorporating choice models and utility functions that capture complex, non-linear relationships between observed features and expected rewards. These models provide the statistical backbone for a wide range of sequential decision-making problems, such as online assortment in retail, personalized recommendation, and adaptive clinical trial design, by modeling user (or environment) choices among multiple alternatives according to probabilistic rules parameterized by utility functions that need not be linear.

1. Formal Problem Setting

The learner interacts with the environment over $T$ rounds. At each round $t$, the learner observes a context (e.g., feature vector(s) of available items) and selects an action, such as a context vector $x_t$ or an assortment $S_t$ of items, depending on the model variant. The user (or the environment) then selects an outcome $y_t \in \{0, 1, \dots, K\}$, where $0$ denotes an outside option (no selection), according to the MNL choice model parameterized by a non-linear utility function. The outcome probabilities are governed by

$P(y_t = i \mid x_t) = \frac{\exp(f_{\mathbf{w}^*}(x_{ti}))}{1 + \sum_{j=1}^K \exp(f_{\mathbf{w}^*}(x_{tj}))}$

for $i = 1, \dots, K$, and $P(y_t = 0 \mid x_t) = \left[1 + \sum_{j=1}^K \exp(f_{\mathbf{w}^*}(x_{tj}))\right]^{-1}$, where $f_{\mathbf{w}^*}$ is an unknown utility function belonging to a parametric or non-parametric class (e.g., neural networks, kernel machines).
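The choice probabilities above can be computed directly from the item utilities; the following is a minimal sketch, with the function name and the max-subtraction stabilization being illustrative choices rather than details from the papers:

```python
import numpy as np

def mnl_choice_probs(utilities):
    """Choice probabilities under the MNL model with an outside option.

    utilities: array of K utility values f(x_{t,1}), ..., f(x_{t,K}).
    Returns K+1 probabilities; index 0 is the outside (no-selection) option.
    """
    u = np.asarray(utilities, dtype=float)
    # Subtract the max before exponentiating for numerical stability;
    # the outside option contributes exp(0) = 1 to the denominator.
    m = max(u.max(), 0.0)
    ex = np.exp(np.concatenate(([0.0], u)) - m)
    return ex / ex.sum()
```

For example, two items with equal utility 0 each receive probability 1/3, matching the outside option.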

The learner observes a reward $r_{t, y_t}$ and aims to minimize cumulative regret relative to the best sequence of actions in hindsight: $\operatorname{Regret}_T = \mathbb{E}\left[ \sum_{t=1}^T R_t(S^*_t, \mathbf{w}^*) - R_t(S_t, \mathbf{w}^*) \right]$, where $R_t(S, \mathbf{w})$ denotes the expected reward for assortment $S$ under parameters $\mathbf{w}$.
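The expected assortment reward in the regret definition follows directly from the choice probabilities: weight each item's reward by its MNL selection probability, with the outside option contributing zero. A hedged sketch (names are illustrative):

```python
import numpy as np

def expected_mnl_reward(utilities, rewards):
    """Expected reward R(S, w) of an assortment under the MNL model.

    utilities: utilities f_w(x_i) for the items offered in assortment S.
    rewards:   per-item rewards r_i; the outside option yields zero reward.
    """
    u = np.asarray(utilities, dtype=float)
    r = np.asarray(rewards, dtype=float)
    ex = np.exp(u)
    probs = ex / (1.0 + ex.sum())   # P(choose item i), outside option absorbs the rest
    return float(probs @ r)
```

A single item with utility 0 and reward 1 is chosen with probability 1/2, so the expected reward is 0.5.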

2. Utility Function Classes and Key Assumptions

Recent work has relaxed the linearity assumption on $f_{\mathbf{w}}$ to allow for expressive families such as multi-layer neural networks or general bounded, smooth function classes. The only essential structural requirements are:

  • Realizability: $f_{\mathbf{w}^*} \in \mathcal{F}$, the function class used by the learner.
  • Boundedness and Smoothness: For all $f_{\mathbf{w}} \in \mathcal{F}$ and all contexts $\mathbf{x}$ with $\|\mathbf{x}\| \le 1$, $|f_{\mathbf{w}}(\mathbf{x})| \le C_f$, and the gradients and Hessians of $f_{\mathbf{w}}$ are bounded, ensuring Lipschitz continuity and curvature control.
  • Generalized Geometric Condition: The (expected) squared loss for parameter estimation is lower-bounded in a neighborhood of the ground-truth parameter to ensure learnability, as formalized by second-order growth or (locally) strong convexity moduli (Hwang et al., 11 Jan 2026).

These conditions underpin regret guarantees in non-linear MNL bandit algorithms and facilitate the use of confidence-set and regression-oracle-based techniques even in overparameterized or non-convex models.

3. Representative Algorithms

A variety of statistically and computationally efficient algorithms have been developed for MNL contextual bandits with non-linear utilities:

  • Optimistic UCB with Non-linear Utilities (ONL-MNL): This paradigm constructs confidence intervals over the parametric utility-function class and selects assortments/items by maximizing an upper confidence bound on the expected MNL reward. The algorithm proceeds by pure exploration for a fixed period, fitting a "pilot" estimator by minimizing regularized negative log-likelihood, then continually updating a confidence set and using a linearized MNL loss for efficient parameter tracking and exploration-exploitation (Hwang et al., 11 Jan 2026). The UCB at round $t$ is given by

$z_{ti} = f_{\hat{\mathbf{w}}_t}(x_{ti}) + \sqrt{\beta_t}\,\|\nabla f_{\hat{\mathbf{w}}_t}(x_{ti})\|_{\mathbf{V}_t^{-1}} + \frac{\beta_t C_h}{\lambda}$

and the best assortment is selected according to the resulting estimated MNL reward.

  • Batched and Rarely-Switching Algorithms (B-MNL-CB, RS-MNL): For scenarios where policy update frequency is constrained, these algorithms compute batchwise or infrequent parameter updates, exploiting MNL self-concordance and distributional optimal designs extended to multinomial settings. They retain near-optimal regret with $O(\log \log T)$ or $O(\log T)$ policy changes (Midigeshi et al., 5 Aug 2025).
  • Oracle-Based Exploration (Regression/Oracle-OFU Frameworks): Algorithms use regression oracles—offline or online log-loss minimization procedures for the function class—to periodically fit models on collected data. Exploration policies can be $\varepsilon$-greedy or use sophisticated barrier regularization for better coverage of the context-action space. These strategies lift to very general classes $\mathcal{F}$ provided suitable regression oracles exist (Zhang et al., 2024).
  • Feel-Good Thompson Sampling (FGTS): Thompson sampling for non-linear value-function classes is implemented by maintaining and updating a density over $\mathcal{F}$, using self-normalized martingale concentration results. The method is information-theoretically optimal, but sampling from the density is intractable for expressive classes (Zhang et al., 2024).
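The optimistic index $z_{ti}$ from the ONL-MNL description above can be sketched as follows; `beta_t`, `C_h`, and `lam` are treated as given hyperparameters, and all function names are illustrative rather than taken from the paper's implementation:

```python
import numpy as np

def ucb_index(f_hat, grad_f, x, V_inv, beta_t, C_h, lam):
    """Optimistic utility index: point estimate + gradient-weighted
    confidence width (the ||.||_{V_t^{-1}} norm) + curvature slack term.

    f_hat:  current utility estimate, x -> f_{w_hat}(x)
    grad_f: gradient of the utility w.r.t. parameters, x -> vector
    V_inv:  inverse of the design matrix V_t
    """
    g = grad_f(x)
    width = np.sqrt(beta_t) * np.sqrt(g @ V_inv @ g)   # ||grad f||_{V_t^{-1}}
    return f_hat(x) + width + beta_t * C_h / lam
```

Scoring every candidate item with this index and then maximizing the estimated MNL reward at the optimistic utilities yields the assortment selection step.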

4. Regret Bounds and Theoretical Guarantees

The regret analysis for MNL contextual bandits with non-linear utilities hinges on the interplay between the expressiveness of the utility class, curvature constants associated with the link function, and statistical properties of the parameter space.

  • Local Non-linearity Constant $\kappa_*$: In the special case of (generalized) linear MNL, sharp regret bounds depend on a problem-dependent non-linearity constant $\kappa_*$, which quantifies the local curvature of the softmax/reward map at the optimal action. Efficient optimistic-style algorithms achieve a rate of $\widetilde{O}(Kd\sqrt{T/\kappa_*})$, outperforming prior bounds that depended on global curvature constants or suffered exponential-in-parameter dependence (Boudart et al., 7 Jul 2025).
  • Non-linear Utilities: For general parametric $f_{\mathbf{w}}$ (e.g., neural networks), under mild geometric and smoothness conditions, computationally efficient UCB-type algorithms achieve $\tilde{O}(\sqrt{T})$ regret, matching the information-theoretic lower bound up to polylogarithmic factors. The analysis relies on concentration inequalities for MNL maximum-likelihood estimators, self-concordant inequalities for the log-likelihood, and elliptical potential arguments for confidence radii (Hwang et al., 11 Jan 2026).
  • Oracle-Based and Batched Regimes: With appropriate offline/online regression oracles, $\varepsilon$-greedy and barrier-regularized algorithms can achieve sublinear regret: $O((NK)^{1/3}T^{2/3})$ for straightforward $\varepsilon$-greedy, and $O(K^2\sqrt{NT})$ for barrier-based methods when $K$ is small. Rarely-switching approaches sustain $O(\sqrt{T})$ regret with only $O(\log T)$ policy switches (Midigeshi et al., 5 Aug 2025, Zhang et al., 2024).
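The oracle-based $\varepsilon$-greedy regime above can be illustrated with a minimal selection step. The top-$K$ exploitation rule here is a simplification (the analyzed algorithms maximize the estimated expected MNL reward, not raw utility), and all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_greedy_step(contexts, oracle_utilities, eps, K):
    """One round of oracle-based epsilon-greedy assortment selection.

    contexts:         (N, d) feature matrix for the N candidate items.
    oracle_utilities: callable mapping contexts -> estimated utilities,
                      e.g. the current fit of a log-loss regression oracle.
    Returns the indices of the K offered items.
    """
    N = len(contexts)
    if rng.random() < eps:
        # Explore: offer a uniformly random assortment of size K.
        return rng.choice(N, size=K, replace=False)
    # Exploit: offer the top-K items by estimated utility (simplified rule).
    u = oracle_utilities(contexts)
    return np.argsort(u)[-K:]
```

Between rounds, the regression oracle is periodically refit on the logged (context, choice) pairs, which is where the oracle assumptions of Section 4 enter.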

All these results are subject to either computational intractability in highly expressive models (e.g., general non-convex neural nets), dependence on the availability of efficient regression oracles, or mild geometric regularity/realizability assumptions.

5. Connections to Prior Work and Impact

Historically, contextual MNL bandits were studied under (generalized) linear utility functions, resulting in regret rates that scale poorly with parameter magnitude due to highly non-uniform curvature (a factor $\kappa = \exp(O(S))$ in regret bounds). The extension to non-linear utilities marks a critical advance on several fronts:

  • Algorithmic Generality: Oracle-based and UCB-style approaches accommodate neural nets, kernel methods, and other highly expressive function classes, provided they admit sufficiently accurate regression (log-loss) oracles and satisfy basic regularity (Hwang et al., 11 Jan 2026, Zhang et al., 2024).
  • Computational-Statistical Trade-offs: Efficient algorithms now achieve nearly minimax regret—for both stochastic and adversarial contexts—without the computational bottleneck associated with non-linear link inversion or intractable exploration distributions.
  • Practical Adaptivity: Batched and rarely-switching schemes address real-world deployment constraints, where policy updates are expensive or must be limited, with negligible statistical penalty compared to fully adaptive policies (Midigeshi et al., 5 Aug 2025).
  • Analysis Tools: The field has leveraged self-concordant analysis for the MNL link, concentration for martingale difference sequences, and a variety of trace-determinant and potential-based arguments to ensure tight confidence sets and robust exploration.

6. Practical Considerations and Extensions

Deployment of MNL contextual bandit algorithms with non-linear utilities requires evaluation of the computational tractability of the underlying regression oracles and the magnitude of the polynomial dependence on the number of items $K$ and the utility-parameter dimension $d$. Real-world experiments confirm theoretical performance, with UCB-type non-linear MNL algorithms demonstrating $O(\sqrt{T})$ regret scaling and outperforming baselines under both realizable and mild misspecification conditions (Hwang et al., 11 Jan 2026). Empirical scalability in assortment size and robustness to model misspecification are key strengths.

Active research directions include:

  • Reducing the polynomial dependence on $K$ in regret bounds and computation (Midigeshi et al., 5 Aug 2025).
  • Incorporating richer choice models beyond simple MNL (nested logit, probit).
  • Adapting to non-stationary environments via dynamic batching or adaptive regularization.
  • Developing tractable posterior sampling (TS) for non-linear MNL models or scalable approximation schemes for complex classes (Zhang et al., 2024).
  • Automating batch size or policy switch budget to optimize regret-adaptivity trade-off under operational constraints.

7. Summary Table of Representative Approaches

Algorithmic Framework | Utility Model | Regret Bound
Optimistic UCB (ONL-MNL) | Non-linear parametric | $\tilde{O}(\sqrt{T})$
Batched B-MNL-CB / RS-MNL | Linear MNL | $\tilde{O}(\sqrt{T})$ (few updates)
Regression-oracle $\varepsilon$-greedy | General $\mathcal{F}$ | $O((NK)^{1/3}T^{2/3})$
Barrier regularization | General $\mathcal{F}$ | $O(K^2\sqrt{NT})$ (small $K$)
Feel-Good Thompson Sampling (FGTS) | General $\mathcal{F}$ | $O(K^{2.5}\sqrt{dNT})$

This table summarizes main algorithmic paradigms and bounds as reported in (Hwang et al., 11 Jan 2026, Midigeshi et al., 5 Aug 2025), and (Zhang et al., 2024).

MNL contextual bandits with non-linear utilities now admit algorithms that are, in a wide range of practical and theoretical scenarios, both computationally and statistically efficient, and whose regret guarantees fundamentally match the optimal rates as dictated by local curvature and model complexity.
