Empirical-Bernstein Confidence Sequence

Updated 1 January 2026

Empirical-Bernstein CS is a sequential method that constructs time-uniform confidence sets by incorporating observed sample variance for bounded data.
It employs martingale techniques and mixture strategies to achieve adaptive, tighter intervals for mean and variance estimation in diverse settings.
The method is effective for heavy-tailed data, matrix concentration, and sampling without replacement, offering robust and practical inference.

An empirical-Bernstein confidence sequence (CS) is a sequential inference method for constructing time-uniform confidence sets for means and related parameters of streams of bounded random variables, typically using empirical or self-normalized variance estimates instead of known parametric variance. The empirical-Bernstein CS generalizes the classical Bernstein inequality into the sequential setting, achieving strong coverage guarantees and variance-adaptivity, with applications to mean estimation, matrix concentration, sampling without replacement, heavy-tailed inference, and the analysis of stochastic processes in smooth Banach spaces.

1. Definition and Fundamental Principles

An empirical-Bernstein CS is a sequence of confidence balls or intervals $(C_t)_{t\ge 1}$ for a parameter (usually the mean $\mu$) such that, with probability at least $1-\alpha$, the region $C_t$ contains the true parameter simultaneously for all $t$, i.e. $\mathbb{P}(\forall t\ge 1: \mu\in C_t)\ge 1-\alpha$. Unlike Hoeffding-type bounds, empirical-Bernstein CSs incorporate the observed sample variance, yielding tighter, variance-adaptive intervals.

Let $(X_t)_{t=1}^{\infty}$ be adapted random elements in a separable Banach space $(\mathcal{X}, \|\cdot\|)$ that is $(2,D)$-smooth. Assuming $\mathbb{E}[X_t|X_1,\dots,X_{t-1}] = \mu$ and $\|X_t-\mu\| \leq B$ a.s., the core empirical-Bernstein CS for the mean states that, with probability at least $1-\alpha$:

$\forall t\ \ \|\bar\mu_t - \mu\| \le D\,\frac{ \frac{1}{4B}\sum_{i=1}^t\psi_E(\lambda_i)\|X_i-\bar\mu_{i-1}\|^2 + 4B\ln(2/\alpha) }{ \sum_{i=1}^t\lambda_i }$

where $\bar\mu_t = \sum_{i=1}^t\lambda_i X_i / \sum_{i=1}^t\lambda_i$ is the weighted running average, $\psi_E(\lambda) = -\ln(1-\lambda)-\lambda$ for $\lambda\in[0,1)$, and the $\lambda_t$ are predictable weights tuned using a running empirical variance (Martinez-Taboada et al., 2024). This ball achieves the classical Bernstein limiting rate: in the i.i.d. setting its radius at $t=n$ scales as $\sigma\sqrt{2\ln(2/\alpha)/n}$.
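As a rough illustration, the radius formula above can be evaluated online for real-valued data. The following is a minimal sketch, not the authors' implementation: it assumes $D = 1$ for scalar data, and the starting center `mu0`, the variance prior `var0`, the cap `lam_max`, and the specific plug-in rule for $\lambda_t$ are all hypothetical choices made for this example. Only the radius expression is taken from the text.

```python
import math

def psi_e(lam):
    """psi_E(lambda) = -ln(1 - lambda) - lambda, defined for 0 <= lambda < 1."""
    return -math.log1p(-lam) - lam

def eb_confidence_sequence(xs, alpha=0.05, B=1.0, D=1.0,
                           mu0=0.5, var0=0.25, lam_max=0.5):
    """Return a list of (center, radius) pairs, one per observation.

    mu0, var0, lam_max and the rule used for lam are illustrative
    assumptions; only the radius expression follows the text.
    """
    log_term = math.log(2.0 / alpha)
    sum_lam = 0.0       # sum of lambda_i
    sum_lam_x = 0.0     # sum of lambda_i * X_i
    sum_psi_sq = 0.0    # sum of psi_E(lambda_i) * (X_i - center_{i-1})^2
    sum_sq = 0.0        # sum of (X_i - center_{i-1})^2, for the variance proxy
    center = mu0        # plays the role of the initial weighted average
    out = []
    for t, x in enumerate(xs, start=1):
        # Predictable weight: computed from X_1, ..., X_{t-1} only.
        var_hat = (var0 + sum_sq) / t
        lam = min(lam_max,
                  4.0 * B * math.sqrt(2.0 * log_term
                                      / (var_hat * t * math.log(t + 1))))
        sq_dev = (x - center) ** 2      # deviation from the previous center
        sum_psi_sq += psi_e(lam) * sq_dev
        sum_sq += sq_dev
        sum_lam += lam
        sum_lam_x += lam * x
        center = sum_lam_x / sum_lam    # weighted running average
        radius = D * (sum_psi_sq / (4.0 * B) + 4.0 * B * log_term) / sum_lam
        out.append((center, radius))
    return out
```

Calling `eb_confidence_sequence(xs)` on a stream of values in $[0,1]$ returns one `(center, radius)` pair per time step. With the conservative choice `B=1.0` the cap `lam_max` binds for small samples, so the radius is wide early on and tightens as the sum of weights grows.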

2. Martingale Construction and Proof Methodology

Empirical-Bernstein CSs leverage the martingale/supermartingale technique, where a carefully constructed process is shown to be a nonnegative supermartingale. Typically, the process takes the form:

$S_t = \cosh\Bigl(\bigl\|\sum_{i=1}^t \lambda_i(X_i-\mu)\bigr\|/(4BD)\Bigr) \cdot \exp\Bigl(-\sum_{i=1}^t \psi_E(\lambda_i)\|X_i-\bar\mu_{i-1}\|^2/(4B)^2\Bigr)$

Ville's inequality is then applied, ensuring that with probability $\geq 1-\alpha$ the process stays below the threshold $1/\alpha$ uniformly in time, which yields the simultaneous coverage property (Martinez-Taboada et al., 2024). This analysis applies to both batch (fixed $n$) and sequential (stopping time) contexts. Key to the proof is controlling higher-order terms of Taylor expansions (for $\cosh$ and $\psi_E$) and exploiting the $(2,D)$-smoothness of the Banach space for vector-valued extensions.
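The step from the supermartingale to the confidence ball can be made explicit. The derivation below is a sketch under two assumptions: that $S_0 = 1$, and that the argument of $\cosh$ is read as the norm of the weighted sum $\sum_i \lambda_i (X_i - \mu)$. It uses only Ville's inequality and the elementary bound $\cosh(x) \ge e^{x}/2$.

```latex
\begin{align*}
&\text{Ville's inequality with } S_0 = 1: \quad
  \mathbb{P}\bigl(\exists t : S_t \ge 1/\alpha\bigr) \le \alpha. \\
&\text{Since } \cosh(x) \ge e^{x}/2, \text{ on the event }
  \{S_t < 1/\alpha \text{ for all } t\}: \\
&\qquad \frac{\bigl\|\sum_{i=1}^t \lambda_i (X_i - \mu)\bigr\|}{4BD}
  - \frac{\sum_{i=1}^t \psi_E(\lambda_i)\,\|X_i - \bar\mu_{i-1}\|^2}{(4B)^2}
  < \ln(2/\alpha). \\
&\text{Multiplying by } 4BD \text{ and dividing by }
  \textstyle\sum_{i=1}^t \lambda_i: \\
&\qquad \|\bar\mu_t - \mu\|
  < D\,\frac{\frac{1}{4B}\sum_{i=1}^t \psi_E(\lambda_i)\,
      \|X_i - \bar\mu_{i-1}\|^2 + 4B\ln(2/\alpha)}
     {\sum_{i=1}^t \lambda_i},
  \qquad
  \bar\mu_t = \frac{\sum_{i=1}^t \lambda_i X_i}{\sum_{i=1}^t \lambda_i}.
\end{align*}
```

The factor $2$ inside the logarithm comes from the $1/2$ in the lower bound on $\cosh$, and the final line agrees term by term with the confidence ball of Section 1.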

Mixture methods, stitching arguments, and nonparametric boundaries are employed to convert linear families of bounds into curved, adaptive boundaries (e.g., polynomial-stitching boundaries with $\sqrt{V_t\ln\ln V_t}$ scaling) (Howard et al., 2018).
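To illustrate how a mixture turns a family of linear bounds into a curved one, the sketch below averages processes of the form $\exp(\lambda S_t - \psi_E(\lambda) V_t)$ over a geometric grid of $\lambda$ values and solves for the crossing point by bisection. It assumes each such process is a nonnegative supermartingale under the null, which depends on the setting. The grid size, ratio, uniform weights, and function names are hypothetical choices for this example, not the construction of any cited paper.

```python
import math

def psi_e(lam):
    """psi_E(lambda) = -ln(1 - lambda) - lambda for 0 <= lambda < 1."""
    return -math.log1p(-lam) - lam

def geometric_grid(lam_max=0.5, ratio=2.0, k=8):
    """Geometric grid lam_max, lam_max/ratio, ... with uniform weights."""
    lams = [lam_max / ratio ** j for j in range(k)]
    weights = [1.0 / k] * k
    return lams, weights

def log_mixture(s, v, lams, weights):
    """log of sum_k w_k * exp(lam_k * s - psi_E(lam_k) * v), computed stably."""
    terms = [math.log(w) + lam * s - psi_e(lam) * v
             for lam, w in zip(lams, weights)]
    m = max(terms)
    return m + math.log(sum(math.exp(term - m) for term in terms))

def mixture_boundary(v, alpha, lams, weights):
    """Smallest s >= 0 at which the mixture reaches 1/alpha (bisection)."""
    target = math.log(1.0 / alpha)
    lo, hi = 0.0, 1.0
    while log_mixture(hi, v, lams, weights) < target:
        hi *= 2.0                      # expand until the threshold is crossed
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if log_mixture(mid, v, lams, weights) >= target:
            hi = mid
        else:
            lo = mid
    return hi
```

Each single $\lambda$ gives a boundary that is linear in the variance proxy `v`. The mixture boundary is concave and increasing in `v`, which is the "curved" shape referred to above.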

3. Sequential Mean and Variance Estimation

Empirical-Bernstein CSs are applied to both mean and variance estimation for bounded random variables:

Sequential Mean (Banach-valued): The construction in (Martinez-Taboada et al., 2024) yields dimension-free Euclidean or Hilbert-space CSs, centered on the weighted empirical mean, with a radius that scales with the empirical variance proxy.
Sharp Sequential Variance CS: For $X_t\in[0,1]$, the sharp time-uniform CS for the variance $\sigma^2$ is given by (Martinez-Taboada et al., 4 May 2025):

Upper endpoint:

$U_t = D_t + R_t(\alpha)$

where $D_t$ is the empirical variance average, and $R_t(\alpha)$ captures both the self-normalizing variance penalty and coverage adjustment.

The lower endpoint $L_t$ is the smallest root of a quadratic whose coefficients are computed adaptively from the data and from plug-in $\lambda$ sequences. Together, the two endpoints provide time-uniform coverage and asymptotically match the optimal Bernstein width.

4. Adaptivity, Optimality, and Extensions

Empirical-Bernstein CSs adapt to the observed sample variance, exploiting actual variability that may be much smaller than the worst case. Modern betting-based approaches (e.g., STaR-Bets (Voráček et al., 28 May 2025) and the betting confidence sets in the style of Waudby-Smith and Ramdas analyzed in (Shekhar et al., 2023)) iteratively recalculate the stake fraction at each step, adapting to the remaining sample size and the multiplicative growth still required of the test martingale, which further tightens the intervals.
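The betting idea can be sketched generically for $[0,1]$-valued data: for each candidate mean $m$, two wealth processes bet that the true mean lies above or below $m$, and $m$ is discarded once half of either wealth reaches $1/\alpha$. The code below is a simplified illustration of this principle, not STaR-Bets or any specific published algorithm. The grid of candidates, the truncation level `c`, the prior pseudo-observation, and the plug-in bet size are assumptions made for the example.

```python
import math

def betting_cs(xs, alpha=0.05, grid_size=201, c=0.5):
    """Grid-based betting confidence sequence for the mean of [0, 1] data.

    Returns a list of (lower, upper) pairs (None if every candidate is
    rejected). Grid, truncation level c and bet rule are illustrative.
    """
    grid = [(j + 0.5) / grid_size for j in range(grid_size)]  # means in (0, 1)
    log_up = [0.0] * grid_size     # log-wealth betting the mean exceeds m
    log_down = [0.0] * grid_size   # log-wealth betting the mean is below m
    alive = [True] * grid_size
    # Rejecting when max(K_up, K_down) / 2 >= 1/alpha is the same as
    # comparing the log-wealth against log(2 / alpha).
    log_2a = math.log(2.0 / alpha)
    sum_x, sum_sq = 0.5, 0.25      # prior pseudo-observation (assumption)
    out = []
    for t, x in enumerate(xs, start=1):
        var_hat = sum_sq / t       # uses past observations only
        bet = math.sqrt(2.0 * log_2a / (var_hat * t * math.log(t + 1)))
        for j, m in enumerate(grid):
            if not alive[j]:
                continue
            lam_up = min(bet, c / m)            # keeps each factor >= 1 - c
            lam_down = min(bet, c / (1.0 - m))
            log_up[j] += math.log1p(lam_up * (x - m))
            log_down[j] += math.log1p(-lam_down * (x - m))
            if max(log_up[j], log_down[j]) >= log_2a:
                alive[j] = False   # rejected candidates never return
        kept = [m for m, a in zip(grid, alive) if a]
        out.append((min(kept), max(kept)) if kept else None)
        # Update the running estimates only after the bet was placed.
        sum_x += x
        mean_hat = sum_x / (t + 1)
        sum_sq += (x - mean_hat) ** 2
    return out
```

Reporting the smallest and largest surviving grid points gives the convex hull of the surviving set, which can only be wider than the set itself. A finer grid or a root-finding step would be needed for endpoints more precise than the grid spacing.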

These methods match or nearly achieve fundamental information-theoretic lower bounds, both in first-order asymptotics and finite-sample regimes, via adaptive mixture strategies. In particular, the limiting interval width is within a $1+o(1)$ factor of the minimax inverse-KL projection bound (Shekhar et al., 2023), and empirical-Bernstein CSs constructed using STaR-Bets are consistently sharper than classical fixed-$\lambda$ or Maurer-Pontil empirical-Bernstein (MP-EB) strategies.

Vector- and matrix-valued versions extend to finite-dimensional spaces and symmetric matrices, e.g., the closed-form matrix CS in (Chugg et al., 24 Dec 2025), where the maximum eigenvalue deviation is bounded in terms of the empirical variance of eigenvalues via a trace supermartingale approach.

5. Heavy-Tailed and Nonstandard Data

Empirical-Bernstein CSs are robustified to heavy-tailed and nonparametric data using mixture boundaries and piecewise quadratic surrogates (Mineiro, 2022). When finite variance fails, the methodology adapts to finite $(1+\delta)$-th moment assumptions, with slack terms converging to zero for nonnegative, right heavy-tailed observations.

Efficient computational implementation uses sublinear sufficient statistic summarization and discretized mixtures over a geometric grid of exponents. Two-sided intervals can be constructed (e.g., for off-policy evaluation in contextual bandits) by applying the robust mixture CS to the flipped outcome process.

6. Practical Implementation and Tuning

Key implementation recommendations across frameworks:

Empirical variance proxy updates are performed on-the-fly using running means and squared deviations.
Predictable $\lambda_t$ sequences are set via minimization of theoretical expressions or using risk-spreading formulae involving current estimated variance and horizon (Waudby-Smith et al., 2020, Martinez-Taboada et al., 2024).
Additive corrections in denominators and numerators ensure numerical stability.
For batch CIs, the logarithmic correction in sample size optimizes interval width.
Coverage calibration via risk-splitting and randomized corrections restores exact or nominal $1-\alpha$ level coverage in both one-sided and two-sided contexts.
Vectorized and root-finding algorithms efficiently invert test martingales to intervals.
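The on-the-fly variance update in the first recommendation can be sketched with Welford's online algorithm, a standard technique named here as an illustration rather than as the method of any cited paper. The `regularized` method shows one way an additive correction might be applied; its `prior` value is a hypothetical default.

```python
class RunningVariance:
    """Welford's online algorithm for the running mean and variance."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the observations seen so far."""
        return self.m2 / self.n if self.n > 0 else 0.0

    def regularized(self, prior=0.25):
        """Variance proxy with an additive correction; never zero."""
        return (prior + self.m2) / (self.n + 1)
```

The regularized proxy stays strictly positive even before any data arrive, which avoids division by zero when it appears in the denominator of a $\lambda_t$ formula.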

Empirical-Bernstein CSs are implemented in R/Python packages, with practical parameter defaults and guidance for robust, anytime-valid sequential inference (Howard et al., 2018, Martinez-Taboada et al., 2024, Mineiro, 2022).

7. Asymptotic Behavior, Theoretical Comparisons, and Open Problems

Empirical-Bernstein CSs asymptotically yield interval width proportional to $\sigma\sqrt{2\ln(2/\alpha)/n}$, matching the oracle Bernstein bound even though the variance is unknown and must be estimated on the fly. In sequential contexts, advanced constructions achieve iterated-logarithm-type rates, with width $O(\sqrt{ (\log\log t)/t})$ in the self-normalized setting.
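A short numerical comparison shows what this rate means in practice. The sketch below evaluates the limiting Bernstein-type half-width against the standard fixed-sample two-sided Hoeffding half-width for data with range 1. The Hoeffding formula and the function names are introduced here for comparison only and do not come from the cited papers.

```python
import math

def bernstein_width(sigma, alpha, n):
    """Limiting half-width sigma * sqrt(2 ln(2/alpha) / n)."""
    return sigma * math.sqrt(2.0 * math.log(2.0 / alpha) / n)

def hoeffding_width(alpha, n, value_range=1.0):
    """Two-sided fixed-n Hoeffding half-width for data with the given range."""
    return value_range * math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

sigma, alpha, n = 0.1, 0.05, 1000
eb = bernstein_width(sigma, alpha, n)   # roughly 0.0086
hf = hoeffding_width(alpha, n)          # roughly 0.043
ratio = hf / eb                         # equals 1 / (2 * sigma) for range 1
```

For range-1 data the ratio of the two widths is $1/(2\sigma)$, so the variance-adaptive rate is narrower whenever $\sigma < 1/2$, and by a factor of 5 when $\sigma = 0.1$.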

In batch and sequential settings, empirical-Bernstein CSs dominate traditional approaches (e.g., Maurer-Pontil) in both theory and empirical performance, consistently yielding intervals 20–50% narrower for bounded distributions and maintaining coverage at the prescribed level (Martinez-Taboada et al., 4 May 2025, Voráček et al., 28 May 2025).

Extensions to sampling without replacement quantify the finite-population advantage, resulting in strictly narrower intervals than classic i.i.d. bounds due to population depletion corrections (Waudby-Smith et al., 2020). Robust mixture CSs in the heavy-tailed regime adaptively control type-I error without loss in tightness under finite variance, and are preferable in online settings or bandit deployments (Mineiro, 2022).

Open research directions include characterizing fundamental lower bounds for CSs in more general nonparametric models, optimal tuning for dependence structures or adversarial settings, and unified frameworks for vector/matrix-valued sequential inference across general Banach spaces.

Key References:

Martinez-Taboada & Ramdas, "Empirical Bernstein in smooth Banach spaces" (Martinez-Taboada et al., 2024)
Martinez-Taboada & Ramdas, "Sharp empirical Bernstein bounds for the variance of bounded random variables" (Martinez-Taboada et al., 4 May 2025)
Voráček & Orabona, "STaR-Bets: Sequential Target-Recalculating Bets for Tighter Confidence Intervals" (Voráček et al., 28 May 2025)
Howard et al., "Time-uniform, nonparametric, nonasymptotic confidence sequences" (Howard et al., 2018)
Mineiro, "A lower confidence sequence for the changing mean of non-negative right heavy-tailed observations with bounded mean" (Mineiro, 2022)
Waudby-Smith & Ramdas, "Confidence sequences for sampling without replacement" (Waudby-Smith et al., 2020)
Shekhar & Ramdas, "On the near-optimality of betting confidence sets for bounded means" (Shekhar et al., 2023)
Chugg et al., "Closed-form empirical Bernstein confidence sequences for scalars and matrices" (Chugg et al., 24 Dec 2025)