
Lai–Wei Stability for Adaptive Bandits

Updated 25 December 2025
  • Lai–Wei stability is defined by criteria ensuring that adaptively sampled data yields estimators satisfying the central limit theorem, crucial for valid confidence intervals.
  • It guarantees that empirical allocations or covariance matrices converge to deterministic limits, mimicking i.i.d. sampling despite adaptive strategies.
  • Algorithmic approaches like penalized EXP4 can enforce this stability, achieving near-optimal regret while supporting robust inferential procedures.

The Lai–Wei stability condition is a fundamental criterion governing the validity of classical inferential procedures when data are collected adaptively, as in multi-armed bandits and linear contextual bandits. It offers a structural guarantee under which ordinary least squares (OLS) or sample mean estimators for bandit-collected data asymptotically mirror their behavior in the i.i.d. regime, permitting valid Wald-type confidence intervals without the “price of adaptivity” typically incurred in general adaptive analysis. This condition is both a theoretical tool for understanding bandit inference and a practical design principle for constructing algorithms enabling rigorous post-hoc statistical inference.

1. Formal Definition of Lai–Wei Stability

The original Lai–Wei stability condition, as formulated by Lai and Wei (1982), is defined in terms of the empirical allocation or covariance structure induced by the bandit sampling policy.

  • Multi-Armed Bandits: For a $K$-armed bandit with policy $\mathcal{A}$, let $n_{a,T}(\mathcal{A})$ denote the (random) number of times arm $a$ is pulled up to round $T$. The policy is stable if there exist deterministic sequences $\{n^*_{a,T}\}_{T\ge1}$ for each arm such that

$$\frac{n_{a,T}(\mathcal{A})}{n^*_{a,T}} \xrightarrow{P} 1, \qquad n^*_{a,T}\to\infty.$$

Stability equivalently requires that the conditional variances of the martingale increments associated with each arm grow deterministically.

  • Linear Contextual Bandits: For a linear model with contexts $x_t$ and arms $a_t$, let the empirical Gram matrix be $S_T = \sum_{t=1}^T \phi(a_t, x_t)\phi(a_t, x_t)^\top$. Stability holds if, for some deterministic, positive-definite sequence $\Sigma_T$,

$$\|\Sigma_T^{-1} S_T - I\|_{\mathrm{op}} \xrightarrow{P} 0,$$

so that $S_T$ concentrates in operator norm around $\Sigma_T$ (Praharaj et al., 23 Dec 2025).

This criterion demands that, asymptotically, the empirical allocation or Gram matrix be well-approximated by a deterministic limit, suppressing the random fluctuations typically induced by adaptive sampling.
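As a concrete illustration of the multi-armed criterion, the following minimal Python sketch checks empirically that the allocation ratios $n_{a,T}/n^*_{a,T}$ concentrate around 1. It assumes a uniform-random sampling policy, for which the deterministic benchmark is trivially $n^*_{a,T} = T/K$; all names and parameter values are illustrative, not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, n_reps = 3, 20_000, 50

# Illustrative policy: uniform-random arm choice, for which the
# deterministic benchmark allocation is n*_{a,T} = T / K.
ratios = np.empty((n_reps, K))
for r in range(n_reps):
    arms = rng.integers(0, K, size=T)          # arm pulled at each round
    counts = np.bincount(arms, minlength=K)    # n_{a,T} for each arm a
    ratios[r] = counts / (T / K)               # Lai-Wei ratio n_{a,T} / n*_{a,T}

# Under stability the ratios concentrate at 1 as T grows.
print("mean ratio per arm:", ratios.mean(axis=0))
print("std  of the ratio :", ratios.std(axis=0))
```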

2. Stability and the Central Limit Theorem for Adaptive Bandits

The main probabilistic implication of Lai–Wei stability is the restoration of the classical central limit theorem (CLT) for estimators computed from adaptively collected bandit data. When the stability condition holds alongside standard noise assumptions (e.g., i.i.d. sub-Gaussian noise):

  • Bandit Mean Estimation: For each arm $a$,

$$\frac{\sqrt{n_{a,T}}\,(\hat\mu_{a,T} - \mu_a)}{\sigma} \xrightarrow{d} N(0,1).$$

  • Contextual Bandits: The OLS estimator

$$\hat\beta_T = S_T^{-1} \sum_{t=1}^T \phi(a_t, x_t)\,\ell_t,$$

where $\ell_t$ is the observed outcome at round $t$, obeys

$$\sigma^{-1}\,\Sigma_T^{1/2}(\hat\beta_T - \beta^*) \xrightarrow{d} N(0, I_d),$$

and, for any fixed direction $a\in\mathbb{R}^d$,

$$\frac{a^\top(\hat\beta_T - \beta^*)}{\sigma\sqrt{a^\top S_T^{-1} a}} \xrightarrow{d} N(0,1).$$

These results restore the validity of standard inferential machinery to the adaptive context, conditional on the stability property (Praharaj et al., 23 Dec 2025, Praharaj et al., 24 Nov 2025).
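The directional CLT in the last display can be checked with a minimal simulation. The sketch below uses an i.i.d. Gaussian design as a stand-in for a stable adaptive design (under stability the adaptive design behaves asymptotically like a deterministic one); all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, n_reps, sigma = 4, 2_000, 1_000, 1.0
beta_star = rng.normal(size=d)
a = np.ones(d) / np.sqrt(d)                    # fixed direction of interest

z = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(size=(T, d))                # stable (here: i.i.d.) rows phi(a_t, x_t)
    y = X @ beta_star + sigma * rng.normal(size=T)
    S_T = X.T @ X                              # empirical Gram matrix
    beta_hat = np.linalg.solve(S_T, X.T @ y)   # OLS estimator
    se = sigma * np.sqrt(a @ np.linalg.solve(S_T, a))  # sigma * sqrt(a^T S_T^{-1} a)
    z[r] = a @ (beta_hat - beta_star) / se     # standardized directional error

print("mean, std of z:", z.mean(), z.std())    # ~ (0, 1) when the CLT holds
```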

3. Consequences for Confidence Intervals and the “Price of Adaptivity”

For general adaptively collected data, valid confidence intervals for linear functionals of model parameters require substantial inflation (typically of order $\sqrt{d\log T}$) to guarantee nominal coverage, a cost referred to as the “price of adaptivity.” Under Lai–Wei stability, however:

  • Wald-Type Intervals: For a target $u^\top\beta^*$, the interval

$$I^{\mathrm{Wald}}_T(u) = \left[\, u^\top \hat\beta_T \pm z_{1-\alpha/2}\,\hat\sigma \sqrt{u^\top S_T^{-1} u}\,\right]$$

achieves the correct asymptotic coverage $1-\alpha$ as $T\to\infty$, without inflation by $\sqrt{d\log T}$ (see the code sketch below).

  • Empirical Coverage: Simulation studies confirm that standardized errors are approximately normal and Wald intervals have sharp coverage whenever the algorithm enforces Lai–Wei stability (Praharaj et al., 23 Dec 2025).

Avoiding the adaptivity penalty thus permits efficient, trustworthy inference, conditional on stability.
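The Wald interval above is straightforward to compute once $\hat\beta_T$, $S_T$, and a noise-scale estimate are in hand. The helper below is a hedged sketch; the function name and signature are illustrative, not from the cited work.

```python
import numpy as np
from scipy import stats

def wald_interval(u, beta_hat, S_T, sigma_hat, alpha=0.05):
    """Wald-type CI for u^T beta*, asymptotically valid under
    Lai-Wei stability; S_T is the empirical Gram matrix."""
    z = stats.norm.ppf(1 - alpha / 2)          # normal quantile z_{1-alpha/2}
    center = float(u @ beta_hat)
    halfwidth = z * sigma_hat * np.sqrt(u @ np.linalg.solve(S_T, u))
    return center - halfwidth, center + halfwidth
```

Note what is absent: there is no $\sqrt{d\log T}$ widening term. Under stability the plain normal quantile suffices.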

4. Algorithmic Enforcement and Violation of Stability

Enforcement via Penalized EXP4

A notable approach to enforcing Lai–Wei stability is the use of a penalized EXP4-type algorithm in the contextual bandit setting, incorporating:

  • A floor $\epsilon$ on the mixture weights (guaranteeing that every expert is sampled with nonvanishing probability).
  • An explicit curvature penalty $\lambda R(w)$ (entropy-like) in the mirror-descent update.

These mechanisms cause the random data-collection weights $\bar{w}_T$ to concentrate on a deterministic vector $w^*_T$, and the empirical covariance $S_T$ to stably approximate $\Sigma_T^* = T\sum_k w^*_{T,k} \Sigma_k$, ensuring stability (Praharaj et al., 23 Dec 2025).
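A hedged sketch of what one step of such a penalized update might look like is given below; the exact update rule in the cited paper may differ, and all names here are illustrative.

```python
import numpy as np

def penalized_exp4_step(w, loss_hat, eta, lam, eps):
    """One mirror-descent step with an entropy-style curvature penalty
    (strength lam) and a floor eps on the mixture weights, mirroring the
    two mechanisms described above. `loss_hat` holds importance-weighted
    loss estimates, one per expert; w must have strictly positive entries."""
    K = len(w)
    # exponential-weights step; the extra lam * log(w) term plays the
    # role of the curvature-penalty gradient in the mirror-descent update
    logits = np.log(w) - eta * (loss_hat + lam * np.log(w))
    w_new = np.exp(logits - logits.max())      # stabilized softmax
    w_new /= w_new.sum()
    # exploration floor: every expert keeps probability at least eps
    return (1 - K * eps) * w_new + eps
```

The floor guarantees nonvanishing sampling of every expert, while the penalty damps fluctuations of the weights, which is what drives $\bar{w}_T$ toward a deterministic $w^*_T$.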

Violations in Minimax-Optimal UCB/KL-UCB Algorithms

Most minimax-optimal optimism-based (UCB and KL-UCB) bandit algorithms violate the Lai–Wei stability condition. In these algorithms the exploration bonuses are too small: once $n_{a,t} = \Theta(T/K)$ they are of constant order, failing to provide the randomization necessary for the empirical allocations to stabilize. This induces undesirable phenomena such as “lock-in” to a single arm with positive probability, so that $n_{a,T}/n^*_{a,T} \to 1$ fails to hold for all arms (Praharaj et al., 24 Nov 2025).

The instability is structural and arises from the algorithmic choice to shrink exploration bonuses in pursuit of minimax regret, which fundamentally conflicts with the requirements of stability.
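The lock-in phenomenon can be observed directly in simulation. The sketch below, which is illustrative rather than code from the cited papers, runs a classic UCB-1 bonus and a MOSS-style constant-order bonus on two arms with identical means. Under UCB-1 the arm-0 allocation share concentrates near 1/2, whereas under the minimax-style bonus it stays widely dispersed across runs, consistent with the absence of a deterministic limit $n^*_{a,T}$.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n_reps = 5_000, 100

def allocation_share(bonus):
    """Two arms with equal (zero) means; returns arm 0's share n_{0,T}/T."""
    n = np.ones(2)                      # pull each arm once to initialize
    s = rng.normal(size=2)              # running reward sums
    for t in range(2, T):
        a = int(np.argmax(s / n + bonus(n, t)))   # optimistic index
        n[a] += 1
        s[a] += rng.normal()
    return n[0] / T

ucb1 = lambda n, t: np.sqrt(2 * np.log(t) / n)                        # classic UCB-1
moss = lambda n, t: np.sqrt(np.maximum(np.log(T / (2 * n)), 0) / n)   # minimax-style

for name, bonus in [("UCB-1", ucb1), ("MOSS-style", moss)]:
    shares = np.array([allocation_share(bonus) for _ in range(n_reps)])
    print(f"{name:10s} mean share {shares.mean():.3f}, std {shares.std():.3f}")
```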

5. Implications: Trade-Offs Between Regret and Inferential Validity

The findings in the referenced works underscore a fundamental tension between minimax regret optimality and statistical stability. The two desiderata currently stand in opposition:

  • Classic UCB-1: Exhibits stability and hence supports valid CLTs and confidence intervals, but incurs $O(\sqrt{KT\log T})$ regret, which is suboptimal.
  • All known minimax-optimal strategies: Achieve regret $O(\sqrt{KT})$ but violate stability, so empirical means, OLS estimators, and Wald intervals lack the proper normality and coverage properties.

A positive resolution, i.e., a bandit strategy achieving $O(\sqrt{KT})$ regret together with CLT-level inferential validity, remains an open direction. Conversely, a negative resolution would establish an unavoidable trade-off between exploration-exploitation efficiency and post-hoc statistical regularity (Praharaj et al., 24 Nov 2025).

6. Practical Guidelines, Tests, and Numerical Results

When enforcing Lai–Wei stability in linear contextual bandits via penalized EXP4, several practical prescriptions and empirical observations arise (Praharaj et al., 23 Dec 2025):

  • Parameter Choices: $\epsilon = 1/(KT)$, $\eta \approx \sqrt{(\log K)/(KT)}$, and $\lambda = \gamma_T/\sqrt{T}$ with $\gamma_T \to \infty$ slowly (e.g., $\gamma_T = \sqrt{\log T}$).
  • Monitoring Stability: Compare the eigenvalues of $S_T$ with those of a plug-in estimate $\Sigma^*_T$ to verify that $\Sigma_T^{*-1} S_T \approx I$ in operator norm.
  • Empirical Validation: Simulations verify that standardized OLS errors are nearly $N(0,1)$ and that Wald intervals attain target coverage without $\sqrt{d\log T}$ inflation.
  • Regret Bound: The penalized EXP4 algorithm achieves near minimax-optimal regret (up to logarithmic factors), demonstrating that stability can be compatible with high statistical efficiency.

This suggests that by a judicious algorithmic design, one can achieve both valid inference and nearly optimal regret in some contextual bandit settings, though the general trade-off remains unsettled.
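A compact sketch of these prescriptions, with the stability diagnostic from the Monitoring Stability item implemented as an operator-norm check, is given below; variable names and the diagnostic helper are illustrative.

```python
import numpy as np

# Prescribed schedules for K experts and horizon T (values illustrative).
K, T = 10, 100_000
eps = 1.0 / (K * T)                     # floor on mixture weights
eta = np.sqrt(np.log(K) / (K * T))      # mirror-descent step size
lam = np.sqrt(np.log(T)) / np.sqrt(T)   # penalty weight, gamma_T = sqrt(log T)

def stability_diagnostic(S_T, Sigma_star):
    """Operator-norm distance || Sigma*_T^{-1} S_T - I ||_op;
    values near 0 indicate Lai-Wei stability is holding empirically."""
    M = np.linalg.solve(Sigma_star, S_T)
    return np.linalg.norm(M - np.eye(M.shape[0]), ord=2)
```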


| Algorithm Class | Stability (Lai–Wei) | Minimax-Optimal Regret | Valid CLT for Means |
|---|---|---|---|
| Classic UCB-1 | Yes | No ($O(\sqrt{KT\log T})$) | Yes |
| MOSS, KL-UCB, etc. | No | Yes ($O(\sqrt{KT})$) | No |
| Penalized EXP4 (contextual) | Yes | Yes (up to log factors) | Yes |

Valid CLT: whether the empirical mean or OLS estimator satisfies a CLT under adaptive sampling.


7. Open Questions and Research Directions

The existence of bandit algorithms simultaneously achieving minimax-optimal regret and Lai–Wei stability in the general setting is unresolved. No known algorithm achieves both $O(\sqrt{KT})$ regret and valid CLTs for the empirical means or OLS estimators under fully adaptive sampling. This trade-off motivates future investigation of:

  • Structural regularization or stochasticity in sampling policies to promote stability.
  • Theoretical lower bounds quantifying the interplay between regret and inferential validity.
  • Alternative confidence set constructions or bootstrapping procedures robust to instability.

These questions are of both theoretical and practical importance, guiding the development and assessment of bandit algorithms utilized in high-stakes decision settings where valid inference is critical (Praharaj et al., 23 Dec 2025, Praharaj et al., 24 Nov 2025).
