High-dimensional Linear Contextual Bandits

Updated 11 October 2025

High-dimensional linear contextual bandits are sequential decision-making problems where rewards rely on unknown linear models in high-dimensional feature spaces, challenging classical methods.
The adaptive pointwise estimator (PWE) integrates parameter and spectral sparsity to accurately predict rewards and overcome the curse of dimensionality.
The HOPE algorithm employs an explore-then-commit framework with PWE to achieve robust regret guarantees across homogeneous, mixed, and heterogeneous settings.

High-dimensional linear contextual bandit problems refer to sequential decision-making settings where, in each round, a learner observes high-dimensional context vectors for a set of actions (“arms”) and must select one to maximize the cumulative reward, with the expected reward determined by an unknown linear model of the context. The high-dimensional regime—where the number of features rivals or exceeds the number of interactions—fundamentally challenges classical exploration–exploitation strategies, as standard estimation and regret bounds become impractical without further structure or algorithmic adaptation.

1. Statistical and Structural Challenges in High Dimensions

The high-dimensional linear contextual bandit setting is characterized by the context dimensionality $p$ being of the same order or greater than the number of rounds $T$ ; this presents several key challenges:

Curse of Dimensionality: When $p \gg T$ , naïve least-squares estimation in the linear reward model $\mu_t^{(i)} = \langle\theta^{(i)}, x_t^{(i)}\rangle$ is severely underdetermined, precluding consistent parameter estimation or meaningful confidence intervals.
Structural Assumptions: To render the problem tractable, prior work has assumed either:
- Parameter sparsity, where each unknown coefficient vector $\theta^{(i)}$ is sparse; or
- Spectral sparsity, where the context covariance matrices $\Sigma^{(i)}$ have only a few large eigenvalues.
Homogeneity vs. Heterogeneity: Existing methods predominantly address homogeneous settings: either all arms are sparse (for example, Lasso-ETC, $[\text{1907.11362}]$ ), or all context covariance matrices are low-rank (e.g., ridgeless least-squares estimators, $[\text{2306.11017}]$ ). In practice, however, these structures can co-occur or be mixed, leading to heterogeneous regimes not handled by earlier estimators.

The high-dimensionality challenge is thus compounded if arms exhibit different forms of sparsity or when both parameter and spectral forms are present within a single problem instance.

2. Adaptive Pointwise Estimation and the HOPE Algorithm

To address the limitations arising from rigid, single-structure estimators, a pointwise estimator (PWE) is introduced to adaptively incorporate both parameter and spectral sparsity, which crucially accommodates mixed and heterogeneous sparsity regimes (Zhao et al., 9 Oct 2025):

Support Estimation: Determine a candidate support $S_1$ (containing the true support $S_0$ if $\theta^{(i)}$ is sparse) using variable selection (e.g., Lasso or sure independence screening).
Dimension Reduction: Restrict or truncate both the context $x$ and any initial parameter estimate to $S_1$ , effectively reducing the estimation problem’s dimension.
Model Transformation: Decompose the context via projection and augment the model with an invertible transformation $\Gamma_t^{(i)}$ (constructed from the spectral information of the contexts $X$ ) to sparsify the resultant nuisance term in the reward model.
Low-dimensional Estimation: Solve a penalized (e.g., Lasso) regression on the $N+1$ -dimensional transformed model to estimate a scaling parameter $\alpha_t^{(i)}$ , yielding a pointwise reward estimate

$\widehat{\mu}_t^{(i)} = \widehat{\alpha}_t^{(i)} \cdot \frac{\sqrt{N}\|x\|_2^2}{\|X x\|_2}.$

This PWE thus generalizes classical reward prediction: if the model is parameter-sparse, it delivers the same regret scaling as Lasso-ETC; if the eigenvalues decay quickly, it achieves the regret rates of spectral methods (ridgeless least squares); in mixed or heterogeneous cases, it adapts to the most favorable structure available.

The HOPE algorithm (“High-dimensional linear cOntextual bandits with Pointwise Estimator”) leverages this strategy in an Explore-Then-Commit (ETC) framework:

In the exploration phase, each arm is selected in round-robin fashion to gather independent samples.
In the exploitation phase, the PWE is instantiated (with either Lasso or RDL as initialization, chosen per-arm according to empirical suitability), allowing heterogeneous treatment across arms.
At each exploitation round, PWE is used to estimate each arm's reward, and the arm with the highest estimate is selected.

3. Regret Bounds Across Structural Regimes

By combining variable selection and spectral reduction, HOPE achieves regret guarantees that generalize and improve upon prior methods:

Parameter-Sparse Settings: For arms with $s_0$ -sparse parameters, the cumulative regret scales as

$R(T) = O\left(K^{1/3} s_0^{1/3} T^{2/3} \cdot \text{polylog}(T)\right),$

which matches the best known results from Lasso-based ETC methods in the high-dimensional sparse regime (Kim et al., 2019).

Spectral (Covariance-Sparse) Settings: When context covariance matrices have rapidly decaying eigenvalues (approximate low-rank), using RDL as the initial estimator provides a bound of order

$R(T) = \widetilde{O}\left(\max\left\{K^{1/2} p^{1/(2T^a)} T^{(a+2)/4}, K^{1/3} p^{2/(3T^a)} T^{(2-a)/3}\right\}\right),$

with $a$ determined by the eigenvalue decay.

Mixed Sparsity: In regimes where both forms of sparsity co-exist (e.g., $\theta^{(i)}$ is $s_0$ -sparse and $\Sigma^{(i)}$ has fast decaying eigenvalues), the regret improves to

$R(T) = \widetilde{O}(K^{1/3} M^{2/3} T^{2/3}),$

with $M$ capturing the effective rank over the estimated support.

Heterogeneous Settings: With arms divided into different structural classes (e.g., some sparse, some low-rank), HOPE applies the appropriate estimator per-arm; the overall regret is then determined by the worse of the two class-specific rates. This is the first method to provide regret guarantees in such mixed-structure regimes.

The regret analysis is supported by nonasymptotic error bounds for the PWE, incorporating both estimation (support recovery) and transformation errors.

4. Practical Adaptivity and Experimental Results

Empirical studies robustly support the theoretical claims:

In homogeneous settings (either all model- or covariance-sparse), HOPE matches the performance of state-of-the-art specialized algorithms (e.g., Lasso-ETC, RDL-ETC).
In mixed or heterogeneous cases, where previous methods fail due to model mismatch, HOPE consistently outperforms alternatives by selecting the appropriate estimator for each arm.
Experiments test scenarios with varying sparsity ratios, context spectra, and noise levels, confirming that HOPE delivers lower mean regret and reduced variance. Key performance cases assessed include: sparse $\theta$ with identity covariances, dense $\theta$ with decaying spectral covariances, and mixed sparsity across arms.

5. Mathematical Formulation and Pointwise Estimation Procedure

The central estimator operates as follows:

Given arm $i$ $i$ and context $x_t^{(i)}$ $x_{t}^{(i)}$ at round $t$ $t$ , after support estimation and model transformation:
- Form the model:
$y = \sqrt{N}\alpha_t^{(i)}z_t^{(i)} + \sqrt{N}\xi_t^{(i)} + \varepsilon,$

where $\alpha_t^{(i)}$ scales the reward, $z_t^{(i)}$ is transformed data, and $\xi_t^{(i)}$ is a transformed nuisance term. - Run Lasso on the $(N+1)$ -dimensional data to solve

$\widehat{\beta}_t^{(i)} = \arg\min_{\beta \in \mathbb{R}^{N+1}} \frac{1}{N}\|y - \beta\|_2^2 + \lambda_t^{(i)}\|\beta\|_1,$

and obtain the final estimate as

$\widehat{\mu}_t^{(i)} = \widehat{\alpha}_t^{(i)} \frac{\sqrt{N}\|x\|_2^2}{\|X x\|_2}.$
Regret analysis leverages model-dependent quantities such as

$M_{S_1} = \max_{i} \frac{\operatorname{tr}(\Sigma_{S_1}^{(i)})}{\|\Sigma_{S_1}^{(i)}\|_F},$

leading to bounds on the prediction error of the PWE and the resultant cumulative regret.

6. Implications and Future Directions

The HOPE algorithm’s adaptive navigation of both parameter and spectral sparsity substantially broadens the applicability of high-dimensional contextual bandit algorithms, including:

Real-world settings with unknown or heterogeneous structure, such as recommendation, personalized medicine, or online advertising, where some arms or contexts are governed by sparse signals while others are best described by low-rank or spectrally sparse representations.
A pathway for further generalizations, including extensions to nonlinear models (e.g., kernel methods, neural representations), development of new exploration strategies (not limited to ETC), and applications to full reinforcement learning scenarios where context/state spaces are high-dimensional and partial feedback is present.

A plausible implication is that the PWE-based approach can be integrated with UCB or Thompson Sampling frameworks if appropriate uncertainty quantification is extended to the transformed models. Future work is anticipated in adaptive exploration regimes and generalized context–reward structures.

This synthesis captures the key innovations of adapting pointwise estimation for mixed sparsity in high-dimensional linear contextual bandits. By supporting both homogeneous and heterogeneous structural assumptions, and providing matching regret guarantees across settings, this framework aligns with current trends in arXiv literature and recent advances in high-dimensional online learning (Zhao et al., 9 Oct 2025).

PDF Markdown Chat (Pro)

References (3)

Doubly-Robust Lasso Bandit (2019)

High-dimensional Contextual Bandit Problem without Sparsity (2023)

Navigating Sparsities in High-Dimensional Linear Contextual Bandits (2025)

Follow Topic

Get notified by email when new papers are published related to High-dimensional Linear Contextual Bandit Problems.