
Variance-Adaptive Confidence Sequences

Updated 25 December 2025
  • Variance-adaptive CSs are sequences of confidence intervals that adapt their width using the cumulative empirical variance, ensuring nonasymptotic and time-uniform coverage.
  • They leverage self-normalization and martingale exponential techniques to achieve optimal shrinking rates, matching limits dictated by the law of the iterated logarithm.
  • These methods extend to heavy-tailed data, matrix mean estimation, and adaptive online inference in settings like bandit algorithms and reinforcement learning.

A variance-adaptive confidence sequence (CS) is a sequence of confidence intervals for an online, possibly non-i.i.d., stochastic process, whose width adapts at each time $t$ to the empirical variance accumulated so far. Such sequences provide nonasymptotic, nonparametric, and time-uniform coverage guarantees, meaning the probability of ever excluding the true quantity of interest across all times is controlled at a prescribed level. Variance-adaptive CSs generalize classical fixed-variance (sub-Gaussian) boundaries, achieve optimal shrinking rates (matching the law of the iterated logarithm), and have been extended to settings such as matrix mean estimation, heavy-tailed data, and sampling without replacement.

1. Foundations and Nonparametric Setting

Variance-adaptive CSs, particularly those of the “empirical-Bernstein” type, are grounded in minimal assumptions. The prototypical setup involves a sequence of real-valued random variables $X_t \in [a,b]$, a predictable sequence of “predictions” $\hat X_t$, and the observed filtration $\{\mathcal{F}_t\}$. The only technical condition required is that the martingale difference sequence $Y_t = X_t - \mathbb{E}[X_t \mid \mathcal{F}_{t-1}]$ is almost surely bounded by $c = b-a$ (Howard et al., 2018).

The primary estimands are:

  • The mean process $\mu_t = t^{-1} \sum_{i=1}^t \mathbb{E}[X_i \mid \mathcal{F}_{i-1}]$
  • The variance process (empirical proxy) $V_t = \sum_{i=1}^t (X_i - \hat X_i)^2$

These sequences remain valid without independence, identical distribution, or strong tail assumptions; a minimal online computation of the two estimands is sketched below.
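As a concrete illustration, here is a minimal Python sketch (not from the cited papers) of the online updates for $\widehat{\mu}_t$ and $V_t$, using the running mean of past observations as the predictable prediction $\hat X_t$; the class name, and the prior guess of 0.5 for data in $[0,1]$, are illustrative choices.

```python
class OnlineEstimands:
    """Running mean and empirical-variance proxy V_t = sum_i (X_i - X_hat_i)^2."""

    def __init__(self, prior_guess: float = 0.5):
        self.n = 0
        self.sum_x = 0.0
        self.v = 0.0                    # accumulated variance proxy V_t
        self.prior_guess = prior_guess  # prediction X_hat_1 before any data

    def update(self, x: float) -> None:
        # Predictability: X_hat_t depends only on observations before time t.
        x_hat = self.sum_x / self.n if self.n > 0 else self.prior_guess
        self.v += (x - x_hat) ** 2
        self.sum_x += x
        self.n += 1

    @property
    def mean(self) -> float:
        return self.sum_x / self.n      # mu_hat_t = t^{-1} sum_i X_i
```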

2. Empirical-Bernstein (Variance-Adaptive) CS Construction

The empirical-Bernstein confidence sequence is built upon a self-normalization/martingale-exponential construction:

  • For all $\lambda \in [0, 1/c)$,

$$\mathbb{E}\left[\exp\left\{\lambda \sum_{i=1}^t Y_i - \psi_{E,c}(\lambda)\, V_t\right\}\right] \le 1$$

where $\psi_{E,c}(\lambda) = c^{-2}\left(-\ln(1-c\lambda) - c\lambda\right)$ (Howard et al., 2018).
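This inequality is easy to probe numerically. The following Monte Carlo sketch (illustrative, with assumed Bernoulli increments) uses centered Bernoulli($p$) differences, for which $c = 1$ and the conditional mean $p$ is a valid predictable prediction:

```python
import numpy as np

# Check that E[exp{lambda * sum(Y_i) - psi_{E,c}(lambda) * V_t}] <= 1
# for centered Bernoulli(p) increments with predictions X_hat_i = p.
rng = np.random.default_rng(0)
p, lam, c, t = 0.3, 0.5, 1.0, 50               # requires lambda in [0, 1/c)
psi = (-np.log(1.0 - c * lam) - c * lam) / c**2

x = rng.binomial(1, p, size=(100_000, t)).astype(float)
y = x - p                                       # bounded martingale differences
v = (y ** 2).sum(axis=1)                        # variance proxy V_t
z = np.exp(lam * y.sum(axis=1) - psi * v)
print(z.mean())                                 # <= 1 up to Monte Carlo error
```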

For any “subexponential” uniform boundary $u(v)$,

$$\mathbb{P}\left( \sup_{t\geq 1} \left|\widehat{\mu}_t - \mu_t\right| > u(V_t)/t \right) \leq 2\alpha$$

where $\widehat{\mu}_t = t^{-1} \sum_{i=1}^t X_i$, and $w_t = u(V_t)/t$ is the data-driven, variance-adaptive width.

A widely used closed-form instantiation is the “polynomial-stitched” boundary for $X_t \in [0,1]$ and coverage $1-2\alpha$:

$$C_t = \widehat{\mu}_t \pm \frac{1}{t}\left(1.7\sqrt{V_t\left[\log\log(2V_t)+3.8\right]} + 3.4\left[\log\log(2V_t)+3.8\right]\right)$$

as given in Eq. (27) of (Howard et al., 2018).
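In code, the boundary is nearly a one-liner. The sketch below is hedged: the floor inside the iterated logarithm, which keeps the formula defined for small $V_t$, is an implementation choice rather than part of Eq. (27), and the constants 1.7 and 3.8 are tied to the particular coverage level quoted there.

```python
import math

def stitched_radius(v_t: float, t: int) -> float:
    """Half-width of the polynomial-stitched CS at time t (cf. Eq. (27))."""
    # Guard so that log(log(2 V_t)) is defined when V_t is small.
    ll = math.log(math.log(max(2.0 * v_t, math.e))) + 3.8
    return (1.7 * math.sqrt(v_t * ll) + 3.4 * ll) / t

# Illustrative use: V_t = 40 after t = 1000 observations.
radius = stitched_radius(40.0, 1000)  # interval is mu_hat_t +/- radius
```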

3. Time-Uniform Coverage and LIL-Optimal Shrinkage

Variance-adaptive CSs provide time-uniform nonasymptotic coverage:

$$\mathbb{P}\left(\forall t \ge 1: \mu_t \in \left[\widehat{\mu}_t \mp u(V_t)/t\right]\right) \ge 1-2\alpha$$

The width $w_t$ adapts to observed variance and, for (sub-)i.i.d. data with variance $\sigma^2$, $V_t \asymp \sigma^2 t$, so:

  • $w_t \asymp \sqrt{\sigma^2 \log\log t / t}$
  • This matches the lower bound dictated by the law of the iterated logarithm (LIL) for uniform-in-time confidence intervals (Howard et al., 2018); a numerical illustration follows below.
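A quick numerical illustration of the rate, assuming $V_t \approx \sigma^2 t$ and reusing the stitched radius from Section 2 (a sketch, not a proof): the ratio $w_t / \sqrt{\sigma^2 \log\log t / t}$ drifts slowly toward a constant as the iterated logarithm comes to dominate the additive constants.

```python
import math

def stitched_radius(v_t: float, t: int) -> float:
    ll = math.log(math.log(max(2.0 * v_t, math.e))) + 3.8
    return (1.7 * math.sqrt(v_t * ll) + 3.4 * ll) / t

sigma2 = 0.04                              # assumed per-step variance
for t in (10**3, 10**5, 10**7, 10**9):
    w = stitched_radius(sigma2 * t, t)     # V_t ~ sigma^2 t
    lil = math.sqrt(sigma2 * math.log(math.log(t)) / t)
    print(t, round(w / lil, 2))            # ratio slowly approaches a constant
```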

4. Comparison to Fixed-Variance and Other Adaptive CSs

A sub-Gaussian (fixed-variance) CS with worst-case variance $(b-a)^2/4$ produces

$$\left|\widehat{\mu}_t-\mu\right| \le \frac{b-a}{2} \sqrt{\frac{2\log(1/\alpha)}{t}}$$

which can be extremely conservative if the actual variance is small.

Empirical-Bernstein (variance-adaptive) CSs instead use the empirical $V_t$, sharply tightening intervals when the process is low-variance. For Bernoulli(0.01) data, sub-Gaussian CSs can be $5\times$ wider than the empirical-Bernstein CS (Howard et al., 2018); a numerical comparison is sketched below.
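A back-of-the-envelope comparison (hedged: it assumes the stitched boundary above and a typical variance accumulation $V_t \approx p(1-p)t$, and the exact ratio depends on $t$ and on which boundaries are compared):

```python
import math

def stitched_radius(v_t: float, t: int) -> float:
    ll = math.log(math.log(max(2.0 * v_t, math.e))) + 3.8
    return (1.7 * math.sqrt(v_t * ll) + 3.4 * ll) / t

p, alpha, t = 0.01, 0.025, 10**6
v_t = p * (1 - p) * t                     # typical V_t for Bernoulli(0.01)
w_eb = stitched_radius(v_t, t)
# Fixed-variance sub-Gaussian width from the display above, with b - a = 1:
w_sg = 0.5 * math.sqrt(2.0 * math.log(1.0 / alpha) / t)
print(w_sg / w_eb)                        # sub-Gaussian width is several times wider
```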

Variance-adaptive CSs have extensions for heavy-tailed and infinite-variance settings, such as Catoni-style CSs under known-variance or $p$-th-moment bounds (Wang et al., 2022), and CSs that exploit nonnegativity constraints in heavy-tailed settings (Mineiro, 2022).

5. Methodological Extensions and Matrix Generalizations

Recent developments yield closed-form, mixture-based empirical-Bernstein CSs for both scalar and matrix means. The construction uses the variance proxy

$$V_t = \sum_{i=1}^t \psi_E\left(|X_i-\hat X_i|\right), \qquad \psi_E(\lambda) = -\ln(1-\lambda) - \lambda,$$

and defines the width

$$W_t = \frac{2}{t} \sqrt{U_t\left(\ell_\alpha + \frac{1}{2}\ln(2U_t)\right)}$$

with $U_t = 1/(2\kappa^2) + V_t$, where $\kappa > 0$ is a tuning parameter and $\ell_\alpha$ is an explicit logarithmic factor.
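Transcribed directly from the display above, a minimal sketch of the width computation; $\kappa$ and $\ell_\alpha$ are left as inputs because the source introduces them only as a tuning parameter and an explicit log factor, without reproducing $\ell_\alpha$ here.

```python
import math

def mixture_width(v_t: float, t: int, kappa: float, ell_alpha: float) -> float:
    """W_t = (2/t) * sqrt(U_t * (ell_alpha + 0.5 * ln(2 U_t)))."""
    u_t = 1.0 / (2.0 * kappa**2) + v_t     # U_t = 1/(2 kappa^2) + V_t
    return (2.0 / t) * math.sqrt(u_t * (ell_alpha + 0.5 * math.log(2.0 * u_t)))
```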

  • For a sequence of symmetric matrices $X_t$ with bounded eigenvalues, the same polynomial structure yields a CS for the maximal eigenvalue deviation:

$$\left|\gamma_{\max}\left(\bar X_t - M_t\right)\right| \le W_t$$

where $V_t = \sum_{i=1}^t \psi_E(|X_i - \hat X_i|)$ (interpreted in matrix norm) and $M_t = t^{-1} \sum_{i=1}^t \mathbb{E}_{i-1}[X_i]$ (Chugg et al., 24 Dec 2025).

A key property of these new CSs is that, in the constant-mean, i.i.d. regime, the limiting width scaled by $\sqrt{t/\log t}$ is independent of the confidence level $\alpha$, a provable improvement over previous closed-form solutions.

6. Applications and Empirical Performance

Variance-adaptive CSs are widely applicable:

  • Covariance matrix estimation
  • Sample average treatment effect inference under the Neyman-Rubin potential outcomes model
  • Bandit algorithms and A/B testing with continuous monitoring
  • Adaptive and safe inference in reinforcement learning and online learning
  • Sampling without replacement, yielding substantial improvements when the sample variance is much less than the worst-case variance of the population (Waudby-Smith et al., 2020)
  • Linear bandits, where variance-adaptive CSs are used to build ellipsoidal confidence sets for $\theta^*$ with widths scaling with the sum of observed conditional variances (Jun et al., 12 Feb 2024)

Empirical studies (Chugg et al., 24 Dec 2025) show these CSs match or outperform previous variance-adaptive CSs and maintain coverage over time horizons of up to $10^6$ samples. The gains are especially pronounced in low-variance, nonstationary, or time-varying-mean settings.

7. Theoretical and Practical Implications

Variance-adaptive CSs represent a sharp advance in anytime-valid inference, combining:

  • Time-uniform coverage with LIL-optimal shrinking
  • Fully nonparametric applicability, using data-driven variance proxies
  • The ability to handle non-i.i.d., martingale-dependent, and heavy-tailed settings (with appropriate extensions)
  • Closed-form, practically implementable expressions (e.g., the latest mixture-Bernstein CS (Chugg et al., 24 Dec 2025))
  • A robust foundation in self-normalized martingale concentration, built from mixture martingales, Ville's inequality, and polynomial “stitching”

Their flexibility and optimality have positioned variance-adaptive CSs as standard primitives in modern sequential estimation, especially as uncertainty quantification tools in high-frequency, online, or nonstationary environments (Howard et al., 2018, Chugg et al., 24 Dec 2025, Wang et al., 2022, Mineiro, 2022, Waudby-Smith et al., 2020, Jun et al., 12 Feb 2024).
