Stochastic Linear Bandits with Biased Offline Data
- Stochastic linear bandits with biased offline data are decision-making frameworks that leverage both online exploration and non-representative historical logs while facing distributional, confounding, and adversarial biases.
- They employ robust methods such as bias-corrected UCB, test-then-use strategies, and distributionally robust optimization (DRO) to actively adjust for offline data discrepancies and preserve safe online performance.
- Empirical and theoretical analyses show that when offline data is cautiously integrated, these bandits can achieve lower regret and faster best-arm identification under controlled bias conditions.
Stochastic linear bandits with biased offline data refer to settings where an agent aims to optimize cumulative or identification objectives with linear reward models, but possesses a finite set of historical (offline) data that is distributionally or structurally non-representative of the current online environment. This bias may arise from differences in the data-generating process, shifts in arm/context distribution, logging policies, confounding or selection bias, and even targeted data poisoning. Leveraging such offline data to guide, accelerate, or potentially undermine online bandit learning introduces unique methodological, algorithmic, and theoretical challenges.
1. Definitions, Problem Setting, and Sources of Bias
A stochastic linear bandit consists of a sequence of rounds where, at each round $t = 1, \dots, T$, the learner selects an action represented as a feature vector $x_t \in \mathbb{R}^d$ and observes a stochastic reward
$$r_t = \langle x_t, \theta^\star \rangle + \eta_t,$$
with unknown parameter $\theta^\star \in \mathbb{R}^d$ and noise $\eta_t$ (typically sub-Gaussian). The learner seeks to minimize the cumulative regret $\sum_{t=1}^{T} \langle x^\star - x_t, \theta^\star \rangle$, where $x^\star$ is the optimal action, or to identify the best arm.
Biased offline data is any historical sample set that is not representative of the online process; it may exhibit:
- Distribution shift: The parameter $\theta_{\text{off}}$ underlying the offline data may differ from the online parameter, i.e., $\theta_{\text{off}} \neq \theta^\star$, often quantified by a bias bound $\gamma \ge \|\theta_{\text{off}} - \theta^\star\|$ that may or may not be known.
- Covariate shift/confounding: The contexts or action-selection mechanisms may be policy-dependent, misspecified, or incomplete, potentially due to unobserved confounders or selection bias.
- Adversarial bias/poisoning: Offline data may be subject to corruption or manipulation, as in data poisoning attacks.
- Preference or incomplete reward information: Offline data may provide only relative or noisy preference judgments rather than absolute rewards.
These biases complicate estimation of $\theta^\star$, of policy values, and of the best arm, since classic bandit estimators assume that offline and online data are drawn i.i.d. from the same process.
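To make the setting concrete, the following minimal sketch simulates a biased offline log alongside the online environment it fails to represent; the Gaussian noise model, the parameter names (`theta_online`, `theta_offline`), and the specific bias magnitude are illustrative assumptions rather than constructions from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_offline = 5, 200

# Online environment: the parameter the learner must discover.
theta_online = rng.normal(size=d)
theta_online /= np.linalg.norm(theta_online)

# Offline log generated under a shifted parameter (distribution shift),
# with the bias magnitude gamma controlling the size of the discrepancy.
gamma = 0.3
shift = rng.normal(size=d)
theta_offline = theta_online + gamma * shift / np.linalg.norm(shift)

# The logging policy rarely explores half of the feature directions,
# which yields a poorly dispersed offline design (covariate shift).
X_off = rng.normal(size=(n_offline, d))
X_off[:, d // 2:] *= 0.1
y_off = X_off @ theta_offline + 0.1 * rng.normal(size=n_offline)

# Naive least squares on the offline log targets theta_offline, not theta_online.
theta_hat = np.linalg.lstsq(X_off, y_off, rcond=None)[0]
print("error w.r.t. online parameter:", np.linalg.norm(theta_hat - theta_online))
```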
2. Algorithmic Approaches for Exploiting or Mitigating Biased Offline Data
2.1 Robust Algorithm Design
Known Bias Bound
- Optimism-in-the-Face-of-Uncertainty (OFU) Algorithms: Methods like MIN-UCB blend online and offline data using confidence intervals that include a bias-correction term $\gamma$ (the known bound on $\|\theta_{\text{off}} - \theta^\star\|$). Schematically, for each action $x$ the index is
$$\mathrm{UCB}(x) = \min\Big\{ \langle x, \hat{\theta}_{\text{on}} \rangle + \beta_{\text{on}} \|x\|_{V_{\text{on}}^{-1}}, \ \ \langle x, \hat{\theta}_{\text{on+off}} \rangle + \beta_{\text{on+off}} \|x\|_{V_{\text{on+off}}^{-1}} + \gamma \|x\| \Big\},$$
i.e., the confidence bound is minimized over the online-only and offline+online options. This ensures safety: if the offline data is uninformative, the algorithm defaults to pure online UCB.
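A minimal sketch of this min-of-two-indices construction is shown below; the width constants `beta_on` and `beta_all`, and the simple $\gamma \|x\|$ inflation of the combined estimate, are simplified placeholders rather than the exact MIN-UCB radii.

```python
import numpy as np

def optimistic_index(x, theta_hat, V_inv, beta):
    """Generic optimistic index: point estimate plus a confidence width in the V^{-1} norm."""
    return x @ theta_hat + beta * np.sqrt(x @ V_inv @ x)

def min_ucb_index(x, theta_on, Vinv_on, beta_on, theta_all, Vinv_all, beta_all, gamma):
    """Take the tighter of the online-only index and the offline-augmented index
    inflated by the known bias bound gamma (schematic, not the exact paper radii)."""
    ucb_online = optimistic_index(x, theta_on, Vinv_on, beta_on)
    ucb_combined = optimistic_index(x, theta_all, Vinv_all, beta_all) + gamma * np.linalg.norm(x)
    return min(ucb_online, ucb_combined)
```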
Unknown Bias Bound
- Robust Test-then-Use Approaches: Start with a testing phase that uses only online data, estimate the realized bias, and leverage offline data only when the bias estimate is statistically insignificant. This policy guarantees at worst the pure online regret and improves regret as soon as the offline bias is small.
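A toy version of the test-then-use decision rule is sketched below; the particular statistic (the gap between online-only and offline-only estimates compared against the sum of their confidence widths) is an illustrative choice, not the exact test from the cited work.

```python
import numpy as np

def use_offline_data(theta_hat_online, theta_hat_offline, width_online, width_offline):
    """Decide, after an online-only testing phase, whether to pool offline data.

    The realized bias is proxied by the gap between the two parameter estimates;
    offline data is used only when that gap cannot be distinguished from zero
    given both confidence widths.
    """
    bias_estimate = np.linalg.norm(np.asarray(theta_hat_online) - np.asarray(theta_hat_offline))
    return bias_estimate <= width_online + width_offline
```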
Adversarial and Distributional Robustness
- Distributionally Robust Optimization (DRO) frameworks cast policy learning as minimization of the worst-case estimated risk or regret over an ambiguity set defined via $f$-divergence balls (e.g., chi-squared, KL) around the empirical offline distribution $\hat{P}$:
$$\hat{\pi} \in \arg\min_{\pi} \ \sup_{Q:\ D_f(Q \,\|\, \hat{P}) \le \rho} \ \mathbb{E}_{Q}\big[\ell(\pi)\big],$$
providing both tractability and uniform high-probability guarantees under generalized dataset shift (a numerical sketch of the worst-case evaluation appears after this list).
- PAC-Bayesian and Implicit Exploration Bounds: These approaches refine standard importance-weighted estimators by using parameter-free, high-probability uncertainty quantification that holds for arbitrary (possibly data-dependent) policies, even with biased or sparse offline data.
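As a concrete instance of the DRO evaluation referenced above, the worst case of the expected loss over a KL ball of radius $\rho$ around the empirical distribution admits the dual form $\inf_{\lambda > 0} \lambda \log \mathbb{E}_{\hat{P}}[e^{\ell/\lambda}] + \lambda \rho$; the sketch below evaluates this dual numerically. The radius `rho` and the sampled losses are placeholder inputs, and this is a generic KL-DRO computation rather than the exact estimator of the cited frameworks.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_dro_worst_case_mean(losses, rho):
    """Evaluate sup_{Q: KL(Q || P_hat) <= rho} E_Q[loss] via its scalar dual."""
    losses = np.asarray(losses, dtype=float)

    def dual(lam):
        # Numerically stable lam * log E[exp(loss / lam)] + lam * rho.
        z = losses / lam
        m = z.max()
        log_mgf = m + np.log(np.mean(np.exp(z - m)))
        return lam * log_mgf + lam * rho

    result = minimize_scalar(dual, bounds=(1e-6, 1e3), method="bounded")
    return result.fun

# Example: pessimistic (robust) value of a candidate policy's loss sample.
rng = np.random.default_rng(1)
print(kl_dro_worst_case_mean(rng.normal(0.5, 0.2, size=500), rho=0.1))
```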
Causal Approaches
- Structural Causal Model-based Bounds: Exploit the graphical structure of confounding and selection biases to compute conservative bounds for the interventional mean outcome, integrating only the valid, unconfounded components of offline data to warm-start bandit learning.
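The simplest special case of such conservative bounds is the assumption-free (Manski-style) bracket for an interventional mean when rewards lie in $[0, 1]$: $\mathbb{E}[Y \mid do(a)] \in \big[\mathbb{E}[Y \mid A=a]\,P(A=a),\ \mathbb{E}[Y \mid A=a]\,P(A=a) + P(A \neq a)\big]$. The sketch below computes this bracket from a confounded log; it is an illustrative special case, not the SCM-based procedure of the cited work.

```python
import numpy as np

def manski_bounds(actions, rewards, arm):
    """Assumption-free bounds on E[Y | do(arm)] from a confounded log, for rewards in [0, 1].

    Decompose E[Y | do(a)] = E[Y | A=a] P(A=a) + E[Y(a) | A!=a] P(A!=a) and bracket
    the unobservable counterfactual term by 0 and 1.
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    mask = actions == arm
    p_a = mask.mean()
    mean_obs = rewards[mask].mean() if mask.any() else 0.0
    lower = mean_obs * p_a
    upper = mean_obs * p_a + (1.0 - p_a)
    return lower, upper
```

Such brackets can warm-start a bandit: any arm whose upper bound falls below another arm's lower bound can be pruned before online exploration begins.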
2.2 Estimation and Meta-Learning
- Bias-regularized Estimators and Meta-learning: Methods such as BIAS-OFUL add a regularization term that shrinks the estimate towards a bias vector learned from prior offline tasks (a minimal estimator sketch follows this list). The bias vector can be estimated via per-task averaging, global ridge regression, or meta-learned from offline bandit experience for transfer to future tasks.
- Clustering and Dimension Reduction: Algorithms exploit user clustering or latent subspaces inferred from offline logs to aggregate limited data across similar users, improving inference robustness and sample efficiency, with theoretical guarantees that interpolating between pooled and individualized estimates can optimize the bias-variance trade-off.
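Below is a minimal sketch of the bias-regularized ridge estimator behind the first idea in this list, assuming the bias vector has already been obtained, e.g., by averaging parameters learned on earlier tasks; the regularization strength `lam` is a placeholder hyperparameter.

```python
import numpy as np

def bias_regularized_ridge(X, y, bias_vector, lam=1.0):
    """Ridge regression shrunk towards a meta-learned bias vector:
    argmin_theta ||y - X theta||^2 + lam * ||theta - bias_vector||^2,
    whose closed form is (X'X + lam I)^{-1} (X'y + lam * bias_vector)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * np.asarray(bias_vector))

# A simple per-task-averaging bias estimate from previously solved tasks:
# bias_vector = np.mean(np.stack(previous_task_parameters), axis=0)
```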
2.3 Defensive Measures Against Attacks
- Robustness to Poisoning: Data poisoning attacks highlight catastrophic risks when empirical means or indices can be manipulated by an adversary; optimization-based poisoning attacks can force a bandit to prefer any target arm at a small, hard-to-detect cost. Defenses include periodic validation, anomaly detection on historical rewards, and uncertainty quantification that is robust to small perturbations.
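A toy illustration of the anomaly-detection defense mentioned above is sketched below, flagging logged rewards that deviate strongly from each arm's robust center; the median/MAD statistic and the threshold `z_threshold` are illustrative choices, and practical defenses are considerably more involved.

```python
import numpy as np

def flag_suspicious_rewards(rewards_by_arm, z_threshold=4.0):
    """Flag logged rewards that deviate strongly from each arm's robust center.

    Uses the median and the MAD (scaled to be comparable to a standard deviation),
    so that a handful of poisoned entries does not distort the reference itself.
    """
    flags = {}
    for arm, rewards in rewards_by_arm.items():
        r = np.asarray(rewards, dtype=float)
        med = np.median(r)
        mad = 1.4826 * np.median(np.abs(r - med)) + 1e-12
        flags[arm] = np.abs(r - med) / mad > z_threshold
    return flags
```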
3. Theoretical Guarantees and Statistical Complexity
The following table summarizes main regret/sample complexity results:
Setting | Approach | Regret Bound (up to log factors) |
---|---|---|
Pure online | Standard OFU/UCB | $\tilde{O}(d\sqrt{T})$ |
Unbiased offline data | Warm-started estimation | Below the pure online rate when the offline data is plentiful and well-dispersed |
Biased offline, known bias bound | Bias-corrected OFU (e.g., MIN-UCB) | Never worse than pure online; improves as the bias bound shrinks and the offline dispersion grows |
Biased offline, unknown bias bound | Robust test-then-use phase | Pure online rate if the bias is large or cannot be certified small; improved rate once the estimated bias is small |
The key instance-dependent quantities controlling statistical complexity are the dispersion of the offline data (how well the offline actions span the feature space) and the bias bound $\gamma$.
- When $\gamma$ is small and the offline data is large and well-dispersed, regret can fall significantly below the pure online rate.
- When $\gamma$ is large or the offline data has little spread, the offline data is ignored and the algorithm reduces to the pure online rate.
For best-arm identification, instance-dependent lower bounds and optimal policies (such as LUCB-H and Track-and-Stop) match the minimax rates in the presence of possibly biased offline data, provided the bias bound is incorporated.
4. Empirical Evaluation and Practical Impact
Recent empirical and theoretical studies confirm the following:
- Robust algorithms strictly outperform purely online methods in cumulative and simple regret when the bias is moderate and the offline data is well-dispersed.
- They are never worse than online-only methods, switching to exploiting offline data only when statistical tests confirm its informativeness.
- In the presence of adversarial or confounded data, algorithms grounded in causal or distributionally robust frameworks maintain safety guarantees by leveraging only data components with sufficient inferential validity.
- Meta-learning and pooling strategies (bias-regularized estimation, clustering) yield large gains in sample-limited or cold-start regimes, especially in personalized recommendation or treatment-assignment scenarios.
- Attack models show that even small, hard-to-detect perturbations to offline data can subvert bandit learning and cause persistent misallocation.
5. Open Problems and Future Research Directions
Several challenges and open questions remain:
- Optimal bias estimation when the bias is not known: Test-then-use strategies are sub-optimal under significant bias; developing adaptive, instance-dependent policies for bias estimation is an active area.
- Combining offline and online exploration in structured or contextual settings: Unified theories for stochastic linear bandits with coverage gaps, heavy-tailed importance weights, and high-dimensional function classes are still in development.
- Robustness to dependence and distributional shift: Extending confidence-sequence and regret analyses to temporally dependent, adversarially corrupted, or causally dependent offline data.
- Automated hyperparameter transfer: Efficient sample-complexity and transfer guarantees for tuning exploration parameters using offline logs across task families.
- Defensive learning: Developing learning algorithms provably robust to data poisoning and adaptive adversaries; currently, no general defense exists under full adaptivity.
- Causal inference frameworks: Scalable, practical causal identification, especially under real-world scenario combinatorics, for robustly bounding the utility of biased data.
Summary Table: Influential Methods for Stochastic Linear Bandits with Biased Offline Data
Principle/Framework | Key Reference(s) | Mechanism for Handling Bias/Robustness |
---|---|---|
Bias-corrected OFU / MIN-UCB | (2405.02594, 2507.02762) | Adaptive selection based on offline bias bound or test-phase estimation |
Distributionally Robust Optimization (DRO) | (2011.06835) | Worst-case risk minimization over f-divergence ambiguity sets |
Implicit Exploration Estimator | (2309.15771) | Bias-tolerant, tail-robust estimator; removes coverage assumptions |
Meta-learning/Transfer/Clustering | (2005.08531, 2505.19043, 2405.17324) | Bias-regularized estimation, cluster-based pooling, latent subspace learning |
Causal Bounds | (2312.12731) | Prior bounds via SCM structure and partial identification |
Robust Confidence Sequences | (2505.20017) | Block-wise/online-to-confidence-set adaptation for non-i.i.d./biased data |
Adversarial Defense/Poisoning Analysis | (1905.06494) | Attack detection, hardening by design; convex optimization for attack modeling |
Stochastic linear bandits with biased offline data represent a modern, practically crucial variant of the standard bandit setting. With proper algorithmic design—incorporating bias awareness, robustness, and adaptive utilization—well-constructed offline data can yield substantial gains, but naively applying classic estimators, especially under adversarial or confounded logs, can lead to catastrophic misallocation. The continuing integration of robust statistics, meta-learning, causal inference, and adaptive online-offline blending is shaping this fast-evolving domain.