Local Asymptotic Normality for Multi-Armed Bandits
(2512.12192v1)
Published 13 Dec 2025 in math.ST
Abstract: Van den Akker, Werker, and Zhou (2025) showed that the limit experiment, in the sense of Hájek-Le Cam, for (contextual) bandits whose arms' expected payoffs differ by $O(T^{-1/2})$, is Locally Asymptotically Quadratic (LAQ) but highly non-standard, being characterized by a system of coupled stochastic differential equations. The present paper considers the complementary case where the arms' expected payoffs are fixed with a unique optimal (in the sense of highest expected payoff) arm. It is shown that, under sampling schemes satisfying mild regularity conditions (including UCB and Thompson sampling), the model satisfies the standard Locally Asymptotically Normal (LAN) property.
The paper establishes the LAN property under classical MAB regimes by deriving a quadratic log-likelihood expansion using DQM and arm-specific Fisher information.
It contrasts classical asymptotics with weak-signal regimes, showing that insufficient sampling of suboptimal arms leads to significant deviations from normal approximations.
Empirical simulations demonstrate algorithmic sensitivity and finite-sample limitations, highlighting the need for robust inferential methods in adaptive experiments.
Overview
The paper "Local Asymptotic Normality for Multi-Armed Bandits" (2512.12192) establishes conditions under which the classical Locally Asymptotically Normal (LAN) property holds in stochastic multi-armed bandit (MAB) settings. It rigorously contrasts this regime with recent literature on weak-signal and nearly-tied bandits, where nonstandard limit experiments emerge. The work provides a thorough treatment of likelihood expansions, convergence of normalized score sequences, and finite-sample accuracy of asymptotic approximations, with practical implications both for statistical inference and the application of bandit algorithms.
Problem Setting and Assumptions
The analysis considers the canonical K-armed bandit framework, where each arm generates i.i.d. rewards from a parametric family. The focus is on models where one arm is uniquely optimal in terms of expected payoff, and all arm distributions satisfy differentiability in quadratic mean (DQM). Two asymptotic regimes are contrasted:
Classical Asymptotics (LAN regime): The means of the arms are strictly separated by gaps that do not shrink with $T$, and the sampling rules are such that the number of suboptimal arm draws is $O(\log T)$ (as with Thompson sampling and UCB-type algorithms).
Weak-Signal Asymptotics: The arms' means differ by $O(T^{-1/2})$, a regime where prior work has shown the limit experiment is Locally Asymptotically Quadratic (LAQ) and characterized via coupled SDEs [zhou2025bandit].
Key technical assumptions include the existence and structure of the score and Fisher information matrices, and weak regularity requirements on the adaptive sampling rules, all formalized to allow for possibly decoupled parameters between arms.
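The logarithmic sampling rate for suboptimal arms can be checked with a small simulation. The following is a minimal sketch, not the paper's experimental setup: it assumes unit-variance Gaussian rewards, two arms with means 0 and 1, and a UCB1-style index (empirical mean plus sqrt(2 log t / n)); the function name `ucb1_pull_counts` and all parameter values are illustrative choices.

```python
import math
import random


def ucb1_pull_counts(means, horizon, seed=0):
    """Run a UCB1-style rule on unit-variance Gaussian arms; return pull counts."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialise the indices
        else:
            # Index: empirical mean plus exploration bonus sqrt(2 log t / n_a)
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts


if __name__ == "__main__":
    # Arm 0 is suboptimal (assumed gap of 1.0, chosen for illustration only).
    for horizon in (500, 5000, 50000):
        counts = ucb1_pull_counts([0.0, 1.0], horizon)
        ratio = counts[0] / math.log(horizon)
        print(f"T={horizon}: pulls={counts}, suboptimal pulls / log T = {ratio:.2f}")
```

If the number of suboptimal pulls grows like log T, the printed ratio should stay of roughly the same order as T increases, while the share of suboptimal pulls shrinks.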
Main Theoretical Contributions
The central result is that under the classical regime—with unique optimality and appropriate regularity conditions on both parameterization and sampling rules—the experiment satisfies the standard LAN property. The log-likelihood ratio admits a quadratic expansion,
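$$\log \frac{dP_{\theta_T}}{dP_{\theta}} = h'\Delta_T - \frac{1}{2} h' I(\theta) h + o_P(1),$$
where $\theta_T$ is the local alternative to $\theta$ in direction $h$, $\Delta_T$ is the central sequence, and $I(\theta)$ is the Fisher information, appropriately aggregated over arms. The paper offers the following formal advances: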
Explicit Quadratic Expansion: The log-likelihood ratio over the adaptively sampled sequence is expanded using DQM and arm-sampling filtration, with precise tracking of effective information for each parameter component as a function of both the reward structure and sampling allocation (Proposition 1).
Component-wise LAN Structure: Asymptotic normality and convergence for the normalized score sequences per arm are established, with an aggregated LAN result over the full parameter vector (Proposition 2).
Clarification of Rates: The work makes explicit how rates of information accumulation ($T$ for the optimal arm, $\log T$ for suboptimal arms) affect estimator efficiency and the validity of the LAN approximation, including cases where parameter components are arm-specific.
The results hold for a wide class of sampling algorithms including Thompson sampling, UCB1, and their clipped or randomized variants. The analysis accommodates adaptive allocation and nonstandard parameter assignments.
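A special case makes the quadratic expansion concrete. For unit-variance Gaussian arms (an assumption made here for illustration, not the paper's general DQM setting), the log-likelihood ratio between means theta + delta and theta along an adaptively sampled path equals sum_k delta_k S_k - (1/2) sum_k N_k delta_k^2 exactly, where N_k is the number of pulls of arm k and S_k is the sum of centred rewards from arm k. The sampling rule's selection probabilities depend only on past observations, so they cancel in the ratio. The sketch below checks this identity numerically; the epsilon-greedy policy and all function names are hypothetical stand-ins.

```python
import math
import random


def simulate_path(theta, horizon, rng, eps=0.1):
    """Generate (arm, reward) pairs under an epsilon-greedy rule (stand-in policy)."""
    k = len(theta)
    counts = [0] * k
    sums = [0.0] * k
    path = []
    for t in range(horizon):
        if t < k:
            arm = t  # initial round-robin
        elif rng.random() < eps:
            arm = rng.randrange(k)  # explore uniformly
        else:
            arm = max(range(k), key=lambda a: sums[a] / counts[a])  # exploit
        x = rng.gauss(theta[arm], 1.0)
        counts[arm] += 1
        sums[arm] += x
        path.append((arm, x))
    return path


def log_normal_pdf(x, mean):
    """Log density of N(mean, 1) at x."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - mean) ** 2


def direct_log_lr(path, theta, delta):
    """Log-likelihood ratio computed term by term from the reward densities."""
    return sum(
        log_normal_pdf(x, theta[a] + delta[a]) - log_normal_pdf(x, theta[a])
        for a, x in path
    )


def quadratic_form(path, theta, delta):
    """Linear score term minus half the count-weighted quadratic term."""
    k = len(theta)
    counts = [0] * k
    score = [0.0] * k
    for a, x in path:
        counts[a] += 1
        score[a] += x - theta[a]
    linear = sum(delta[a] * score[a] for a in range(k))
    quad = 0.5 * sum(counts[a] * delta[a] ** 2 for a in range(k))
    return linear - quad


if __name__ == "__main__":
    theta = [0.0, 0.5, 1.0]
    delta = [0.1, -0.2, 0.05]
    path = simulate_path(theta, 2000, random.Random(0))
    print(direct_log_lr(path, theta, delta), quadratic_form(path, theta, delta))
```

In this special case the quadratic term for arm k is weighted by its pull count N_k, which is one way to see why the information available for a suboptimal arm is tied to how often the sampling rule visits it.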
Contrasts with Nonstandard Asymptotics
This LAN result stands in marked contrast with the weak-signal or nearly-tied regimes studied recently [zhou2025bandit; kuang2024weak; fan2025diffusion], where the Hájek-Le Cam limit experiment for MABs can be LAQ but not LAN, and the likelihood is governed by complex, high-dimensional diffusions. The current paper thus delineates a boundary: when the optimality gap is fixed and positive, "classical" asymptotic theory applies; when it vanishes at the $O(T^{-1/2})$ rate, strong nonstandard asymptotic phenomena appear.
Strong Numerical and Counterintuitive Findings
Monte Carlo simulations compare the LAN and LAQ limit regimes. Despite the theoretical guarantee of LAN under the stated assumptions, the paper demonstrates empirically that, for realistic sample sizes ($T=500$) and even with moderate optimality gaps, the actual finite-sample behavior of standard inferential statistics (e.g., Student's t for arm means or their differences) can deviate sharply from normal distributions. This holds especially for suboptimal arms: the central limit theory underlying the LAN regime yields poor approximations to the empirical distribution of test statistics associated with suboptimal arms or differences, often until $T$ becomes very large. The effect is most pronounced when suboptimal arms are rarely sampled (with probabilities decaying as $O(\log T / T)$), resulting in pathological or degenerate statistics.
Specific empirical findings include:
For small gaps: t-statistic distributions for both means and differences are non-normal, best described by the LAQ regime.
For moderate/large gaps: Standard normal approximation improves for the optimal arm mean, but not for suboptimal arms or contrasts. In some settings, the suboptimal arm is pulled only once, rendering variance estimates trivial or undefined.
Algorithmic sensitivity: The discrepancy from asymptotic normality is exacerbated under UCB1 compared to Thompson sampling, due to even slower allocation to suboptimal arms.
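The kind of Monte Carlo check described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's simulation design: rewards are unit-variance Gaussian, the policy is Gaussian Thompson sampling with an N(0, 1) prior on each arm mean, and the function names, gap, and replication count are all hypothetical choices. The sketch reports how often the per-arm t-statistic exceeds 1.96 in absolute value and how often it is undefined because the arm was pulled fewer than twice.

```python
import math
import random
import statistics


def thompson_rewards(means, horizon, rng):
    """Gaussian Thompson sampling (N(0,1) prior, unit variance); rewards per arm."""
    k = len(means)
    rewards = [[] for _ in range(k)]
    sums = [0.0] * k
    for _ in range(horizon):
        # Posterior for arm a is N(sum_a / (n_a + 1), 1 / (n_a + 1)).
        draws = [
            rng.gauss(sums[a] / (len(rewards[a]) + 1),
                      1.0 / math.sqrt(len(rewards[a]) + 1))
            for a in range(k)
        ]
        arm = max(range(k), key=lambda a: draws[a])
        x = rng.gauss(means[arm], 1.0)
        rewards[arm].append(x)
        sums[arm] += x
    return rewards


def t_statistic(sample, true_mean):
    """Student-type t-statistic; None when the variance estimate is unavailable."""
    if len(sample) < 2:
        return None
    sd = statistics.stdev(sample)
    if sd == 0.0:
        return None
    return math.sqrt(len(sample)) * (statistics.fmean(sample) - true_mean) / sd


def tail_frequency(means, arm, horizon=500, reps=200, seed=1):
    """Share of replications with |t| > 1.96, plus the count of undefined cases."""
    rng = random.Random(seed)
    stats, undefined = [], 0
    for _ in range(reps):
        t = t_statistic(thompson_rewards(means, horizon, rng)[arm], means[arm])
        if t is None:
            undefined += 1
        else:
            stats.append(t)
    exceed = sum(abs(t) > 1.96 for t in stats)
    return exceed / max(len(stats), 1), undefined


if __name__ == "__main__":
    means = [0.0, 1.0]  # arm 1 is optimal; the gap of 1.0 is an assumed value
    for arm in range(len(means)):
        freq, undefined = tail_frequency(means, arm)
        print(f"arm {arm}: P(|t| > 1.96) ~ {freq:.3f}, undefined in {undefined} runs")
```

Under a normal approximation the exceedance frequency would be close to 0.05, so comparing the two arms' frequencies gives a rough sense of how much the adaptive allocation distorts t-based inference at a given horizon.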
Implications for Inference and Practice
The theoretical guarantee of the LAN property enables classical asymptotic statistics in fixed-gap MABs, including the use of maximum likelihood estimators, Wald tests, and consistent confidence intervals for arm-specific parameters. However, the negative empirical findings highlight important caveats for practical inference: standard asymptotic regimes often provide misleading inferential results when samples are moderate in size and/or the optimality gap is not large. In practice, for parameters associated with suboptimal arms or contrasts, practitioners should be aware that t-type inference can be highly anti-conservative or erratic due to insufficient information accumulation.
This points to the need for caution when pursuing post-adaptive inference or statistical testing in MABs, and suggests that alternative approximations (such as diffusion-based LAQ asymptotics or bootstrap corrections) may offer more accurate uncertainty quantification in finite samples, especially for adaptive experiments and weak-signal settings.
Theoretical and Future Directions
The precise characterization of the LAN property deepens the statistical understanding of adaptive experiments and sequential allocation. The results provide a foundation for extending semiparametric and optimal inference tools to bandit settings with uniquely optimal arms and inform the design of inferentially robust algorithms.
For future research, several directions are natural:
Bridging Asymptotic Regimes: Development of hybrid or interpolating limit theorems that seamlessly transition between the classical LAN and SDE-based LAQ regimes, to cover all parameter gaps.
Finite Sample Corrections: Construction of finite-sample robust inferential methods, possibly leveraging high-fidelity simulations or explicit correction factors for low-information arms.
Extensions to Contextual and Nonparametric Bandits: Generalization to nonstationary, high-dimensional, or flexible model classes, incorporating modern contextual or structured bandit architectures.
Policy Design for Inferential Validity: Algorithmic modifications that balance regret minimization with sufficient exploration for valid inference across all arms.
Conclusion
This paper establishes the Locally Asymptotically Normal property for a broad class of classical MAB models under standard regularity conditions and uniquely optimal arms. While providing an affirmative answer for the validity of classical asymptotic inference in these settings, the study also issues a strong warning—finite-sample deviations from normality can be severe, particularly for inferential targets associated with under-sampled arms. This has both theoretical importance for the foundations of adaptive inference and practical consequences for policy evaluation and experimentation using bandit algorithms. The clear delineation of the classical and nonstandard regimes is an essential contribution toward the statistical theory of adaptive learning and dynamic experimentation.