
Single-Index Bandit Algorithm

Updated 2 January 2026
  • Single-index bandit algorithms are methods that compute a per-arm scalar index from states or contexts to enable efficient sequential decision-making in complex resource allocation problems.
  • They leverage techniques such as dynamic programming, Lagrangian relaxation, and projection methods, leading to near-optimal regret rates even in high-dimensional or large-arm regimes.
  • Practical implementations in restless multi-armed and contextual bandit settings demonstrate scalability and robust performance, achieving sublinear optimality gaps under various operational constraints.

A single-index bandit algorithm denotes any bandit policy that leverages a per-arm scalar index—computed as a function of an arm’s current state (or context, belief, or estimated reward parameters)—to make sequential allocation or selection decisions. Two major forms have emerged in the literature: (i) indices derived via dynamic programming and Lagrangian relaxation, especially for (restless) resource allocation settings, and (ii) indices in semiparametric and nonparametric contextual bandits based on projections of high-dimensional covariates onto a 1-dimensional space, which exploit a single-index model for the expected reward. Both paradigms offer substantial computational and statistical gains, particularly in the regime of large arm populations or high-dimensional covariates, often achieving near-optimal or minimax regret rates and practical scalability.

1. Index-Based Bandit Algorithms in Restless and Classical Regimes

In classical and restless multi-armed bandit (RMAB) models, index policies are constructed via Lagrangian relaxation of per-step pull constraints, yielding decoupled Markov decision processes (MDPs) per arm. The canonical index, known as the Whittle index, is defined for each state as the infimum subsidy (or price) λ such that it becomes preferable to remain passive rather than activate the arm. For a finite-state restless bandit, the per-arm Bellman equations under state s and subsidy λ are

V(s;\lambda) = \max\left\{\,Q(s,0;\lambda),\; Q(s,1;\lambda)\,\right\}

with Q(s,a;\lambda) = r(s,a) + \lambda(1-a) + \beta \sum_{j} p^{a}_{s,j}\, V(j;\lambda) (Mittal et al., 2023). The index is

W(s) = \inf\{\lambda : Q(s,0;\lambda) \geq Q(s,1;\lambda)\}

and is efficiently computable via value iteration, a VI-grid algorithm, or adaptive-greedy techniques achieving O(K^3) runtime (Mittal et al., 2023, Akbarzadeh et al., 2020).
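The bisection-plus-value-iteration route to the Whittle index can be sketched concretely. The toy model below (a two-state arm with action-independent transitions, chosen so the index has a closed form W(s) = r(s,1) - r(s,0) to check against) is purely illustrative and not taken from the cited papers:

```python
def value_iteration(r, P, lam, beta=0.9, tol=1e-10):
    """Solve V(s) = max_a [ r[a][s] + lam*(1-a) + beta * sum_j P[a][s][j]*V(j) ]."""
    n = len(r[0])
    V = [0.0] * n
    while True:
        Vn = [max(r[a][s] + lam * (1 - a)
                  + beta * sum(P[a][s][j] * V[j] for j in range(n))
                  for a in (0, 1))
              for s in range(n)]
        if max(abs(u - v) for u, v in zip(Vn, V)) < tol:
            return Vn
        V = Vn

def whittle_index(r, P, s, beta=0.9, lo=-10.0, hi=10.0, iters=60):
    """Bisection on the subsidy: smallest lam at which passivity is optimal in s."""
    n = len(r[0])
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        V = value_iteration(r, P, lam, beta)
        q = [r[a][s] + lam * (1 - a)
             + beta * sum(P[a][s][j] * V[j] for j in range(n))
             for a in (0, 1)]
        if q[0] >= q[1]:
            hi = lam   # passive already preferred: the index is <= lam
        else:
            lo = lam
    return 0.5 * (lo + hi)

# Toy two-state arm: passive pays nothing, active pays 1 in state 1.
# Transitions are action-independent, so W(s) reduces to r(s,1) - r(s,0).
r = [[0.0, 0.0], [0.0, 1.0]]            # r[a][s]
P_common = [[0.5, 0.5], [0.2, 0.8]]     # P[s][j]
P = [P_common, P_common]                # P[a][s][j], identical across actions
```

With these action-independent transitions, `whittle_index(r, P, 1)` recovers 1.0 and `whittle_index(r, P, 0)` recovers 0.0, matching the closed form.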

Extensions to settings with scarce resources—such as finite-horizon, single-pull constraints—require further innovation. The "Single-Pull Index" (SPI) policy expands the per-arm state space to include absorbing "dummy" states post-pull, then solves an occupancy LP over the extended chain and defines a time- and state-dependent index by integrating fractional optimal activation probabilities from the LP solution (Xiong et al., 10 Jan 2025). This approach yields per-step O(N log N) runtime and sublinear optimality gaps as arm and budget populations scale.

Approximate single-index policies have also been studied for finite-horizon and non-exchangeable settings, where the delayed (idling) bandit property fails. Here, one constructs a per-arm dynamic program (DP), then schedules arms greedily according to per-arm density R_i/T_i (expected cumulative reward per expected number of pulls), achieving a constant-factor approximation to the true optimum (Guha et al., 2013).
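The density-ordering step itself is a one-liner once the per-arm DP has produced the (R_i, T_i) summaries. A minimal sketch; the arm names and values are hypothetical placeholders, not from the cited paper:

```python
def density_schedule(arms):
    """Order arms greedily by reward density R_i / T_i (expected cumulative
    reward per expected number of pulls), highest density first."""
    return [name for name, R, T in
            sorted(arms, key=lambda arm: arm[1] / arm[2], reverse=True)]

# Hypothetical per-arm DP summaries: (name, R_i, T_i).
arms = [("a", 4.0, 2.0), ("b", 9.0, 3.0), ("c", 1.0, 1.0)]
order = density_schedule(arms)   # densities 2.0, 3.0, 1.0 -> "b", "a", "c"
```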

2. Single-Index Bandits in Contextual and Nonparametric Settings

Single-index bandit algorithms in the contextual regime rely on projecting d-dimensional contexts (features) onto a learned direction θ (or v*), capturing the reward structure as f(θ⊤x) for an unknown link function f. This generalizes the generalized linear bandit setting by removing the requirement that f is known a priori.

For generalized linear bandits with unknown link functions, the key innovation is estimating θ* via Stein’s method (a score-matching estimator), using observed reward-feature pairs. Algorithms such as STOR (single-epoch explore-then-commit) and ESTOR (multi-epoch update) decouple exploration and exploitation, producing nearly optimal Õ(√T) regret, even in high-dimensional, sparse settings with sparsity s = ‖θ*‖₀ (Kang et al., 15 Jun 2025). GSTOR further relaxes all functional assumptions on f, with regret O(d^{3/8} T^{3/4}) for Gaussian designs.
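The score-matching idea can be illustrated for a Gaussian design: by Stein’s identity, E[y·x] is proportional to θ* regardless of the (differentiable) link, so the empirical moment recovers the index direction without knowing f. A toy simulation, assuming a tanh link and dimensions chosen for illustration only (this is the underlying identity, not the STOR/ESTOR algorithms as published):

```python
import math
import random

random.seed(0)
d, n = 3, 20000
theta_star = [0.6, 0.8, 0.0]       # unit-norm index direction (illustrative)
link = math.tanh                    # "unknown" monotone link (illustrative choice)

# For x ~ N(0, I_d) and y = f(theta*.x) + noise, Stein's identity gives
# E[y * x] = E[f'(theta*.x)] * theta*, so the sample moment points along theta*.
moment = [0.0] * d
for _ in range(n):
    x = [random.gauss(0.0, 1.0) for _ in range(d)]
    u = sum(t * xi for t, xi in zip(theta_star, x))
    y = link(u) + 0.1 * random.gauss(0.0, 1.0)
    for i in range(d):
        moment[i] += y * x[i] / n

norm = math.sqrt(sum(m * m for m in moment))
theta_hat = [m / norm for m in moment]
cosine = sum(a * b for a, b in zip(theta_hat, theta_star))
```

The cosine similarity between the normalized moment and θ* comes out very close to 1, confirming that only the direction, not the link, is needed at this step.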

In batched, nonparametric contextual bandits, the single-index assumption is leveraged to design dynamic binning and successive arm elimination strategies (BIDS algorithm), which perform discrete partitioning of the projected context space and adaptively refine the active set while achieving minimax-optimal regret independent of the ambient dimension d, thus bypassing the statistical curse of dimensionality (Arya et al., 1 Mar 2025).
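A stripped-down sketch of the binning-plus-elimination idea, assuming the index direction is already estimated and using a synthetic two-arm reward; this is a toy version of the principle, not the BIDS algorithm as published:

```python
import math
import random

random.seed(1)
theta = [1.0, 0.0]          # assume the 1-D index direction has been estimated
n_bins, n_rounds = 4, 4000
counts = [[0, 0] for _ in range(n_bins)]     # per-bin pull counts, 2 arms
sums = [[0.0, 0.0] for _ in range(n_bins)]   # per-bin reward sums
active = [[0, 1] for _ in range(n_bins)]     # surviving arms per bin

def mean_reward(arm, z):
    # Synthetic construction: arm 0 peaks at low index values, arm 1 at high ones.
    return 1.0 - abs(z - (0.25 if arm == 0 else 0.75))

for t in range(n_rounds):
    x = [random.random(), random.random()]
    z = theta[0] * x[0] + theta[1] * x[1]    # projected 1-D index in [0, 1)
    b = min(int(z * n_bins), n_bins - 1)
    arm = min(active[b], key=lambda a: counts[b][a])   # pull least-played survivor
    reward = mean_reward(arm, z) + 0.05 * random.gauss(0.0, 1.0)
    counts[b][arm] += 1
    sums[b][arm] += reward
    # Successive elimination with a Hoeffding-style confidence width.
    m = min(counts[b][a] for a in active[b])
    if len(active[b]) > 1 and m >= 30:
        mu = {a: sums[b][a] / counts[b][a] for a in active[b]}
        width = math.sqrt(2.0 * math.log(n_rounds) / m)
        best = max(mu.values())
        active[b] = [a for a in active[b] if mu[a] >= best - width]
```

After enough rounds, the outer bins retain only the arm that is actually better there, so the final policy is a piecewise-constant map from the projected index to an arm.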

3. Theoretical Guarantees and Optimality

Single-index and index-based bandit algorithms provide a rigorous connection between per-arm local policies and global performance:

  • In the infinite-horizon RMAB setting, the Whittle index policy is asymptotically optimal as the number of arms grows, with rigorous per-arm regret bounds (Akbarzadeh et al., 2020, Hu et al., 2017).
  • For finite-horizon, single-pull RMAB, the SPI algorithm attains a sublinear average optimality gap Õ(1/√ρ) as the number of arms and the budget scale (ρ is the scaling parameter) (Xiong et al., 10 Jan 2025).
  • In nonparametric contextual bandits, single-index algorithms achieve regret scaling as T^{1-(1+α)β/(2β+1)} (for margin condition α, link smoothness β), matching minimax lower bounds in the univariate case and universally outperforming fully nonparametric methods in high dimension (Ma et al., 31 Dec 2025, Arya et al., 1 Mar 2025).
  • Adaptivity to unknown smoothness is possible under self-similarity assumptions; without such structure, no policy can simultaneously achieve minimax rates for multiple unknown smoothness levels (Ma et al., 31 Dec 2025).
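For concreteness, the exponent in the regret rate above is easy to tabulate; the parameter values below are illustrative:

```python
def regret_exponent(alpha, beta):
    """Exponent e in the regret rate T^e, with
    e = 1 - (1 + alpha) * beta / (2 * beta + 1)
    for margin parameter alpha and link smoothness beta."""
    return 1.0 - (1.0 + alpha) * beta / (2.0 * beta + 1.0)

# No margin (alpha=0), Lipschitz link (beta=1): the familiar T^{2/3} rate.
# A strong margin (alpha=1) at the same smoothness improves this to T^{1/3}.
```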

4. Computational Techniques and Algorithmic Structure

Table: Representative Single-Index Policies

| Setting | Index/Policy Name | Computation |
| --- | --- | --- |
| Restless, finite-state RMAB | Whittle Index | VI, adaptive-greedy |
| Finite-horizon, single-pull RMAB | SPI | Occupancy LP + sort |
| Finite-horizon RMAB (multiple pulls/step) | Index policy (Hu et al., 2017) | DP + 1-D bisection |
| GLM with unknown link | ESTOR, STOR, GSTOR (Kang et al., 15 Jun 2025) | Stein’s method + ETC |
| Nonparametric contextual bandit | Batched SIB (Ma et al., 31 Dec 2025) | MRC + LPE, elimination |

Index computation involves DP or convex optimization for RMABs, and a combination of U-statistics (maximum rank correlation) and local polynomial regression (LPE) for contextual settings. Empirical studies show these algorithms are both scalable and robust, with runtime per period often scaling linearly (or nearly so) with the number of arms and state/context dimensions (Xiong et al., 10 Jan 2025, Ma et al., 31 Dec 2025).
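The maximum rank correlation (MRC) step can be illustrated in two dimensions by brute-force grid search over candidate directions: the U-statistic counts pairs whose reward ordering agrees with their projection ordering. A toy sketch with illustrative data (real implementations optimize the U-statistic far more cleverly):

```python
import math
import random

random.seed(2)
n = 200
theta_star = (math.cos(0.6), math.sin(0.6))   # true direction, angle 0.6 rad
data = []
for _ in range(n):
    x = (random.gauss(0.0, 1.0), random.gauss(0.0, 1.0))
    u = theta_star[0] * x[0] + theta_star[1] * x[1]
    data.append((x, u ** 3 + 0.3 * random.gauss(0.0, 1.0)))  # monotone link + noise

def rank_correlation(angle):
    """U-statistic: pairs whose reward order matches their projection order."""
    c, s = math.cos(angle), math.sin(angle)
    z = [c * x[0] + s * x[1] for x, _ in data]
    y = [yi for _, yi in data]
    return sum((y[i] > y[j]) == (z[i] > z[j])
               for i in range(n) for j in range(i))

angles = [a * math.pi / 180.0 for a in range(-90, 90)]   # 1-degree grid
angle_hat = max(angles, key=rank_correlation)            # maximizer near 0.6 rad
```

Because only the ranks of rewards enter the criterion, the monotone link drops out entirely; this is what makes MRC suitable when f is unknown.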

5. Empirical Performance and Impact

Extensive empirical evaluation confirms theoretical predictions:

  • SPI for single-pull RMAB achieves performance within 1–3% of the LP upper bound, consistently outperforming Whittle and non-index heuristics, especially as the system scales (Xiong et al., 10 Jan 2025).
  • In contextual regimes, SIB algorithms (ESTOR/STOR) match or improve over LinUCB, LinTS, and GLM-TSL under model misspecification and high dimension; GSTOR maintains sublinear regret under fully agnostic ff (Kang et al., 15 Jun 2025).
  • Batched SIB (BIDS) and nonparametric SIB algorithms dominate nonparametric competitors, especially as dd increases, with interpretable learned indices identifying key features in real datasets (Arya et al., 1 Mar 2025, Ma et al., 31 Dec 2025).

6. Connections, Extensions, and Limitations

Single-index bandit algorithms unify a diverse set of resource allocation, exploration, and learning problems under a common principle: use an efficiently computable per-arm scalar index to decouple the global policy into tractable per-arm decisions. This structure, however, relies on model assumptions (indexability, single-index structure, margin conditions) and may degrade if these are systematically violated. Adaptivity to unknown structural parameters remains fundamentally impossible without additional structure in nonparametric regimes, and empirical scaling depends on efficient implementation of the underlying index computation algorithms (Ma et al., 31 Dec 2025).

7. References

Representative sources for the development and analysis of single-index bandit algorithms include:

  • "Finite-Horizon Single-Pull Restless Bandits: An Efficient Index Policy For Scarce Resource Allocation" (Xiong et al., 10 Jan 2025)
  • "Conditions for indexability of restless bandits and an O(K3) algorithm to compute Whittle index" (Akbarzadeh et al., 2020)
  • "Indexability of Finite State Restless Multi-Armed Bandit and Rollout Policy" (Mittal et al., 2023)
  • "Single Index Bandits: Generalized Linear Contextual Bandits with Unknown Reward Functions" (Kang et al., 15 Jun 2025)
  • "Nonparametric Bandits with Single-Index Rewards: Optimality and Adaptivity" (Ma et al., 31 Dec 2025)
  • "Semi-Parametric Batched Global Multi-Armed Bandits with Covariates" (Arya et al., 1 Mar 2025)
  • "Approximation Algorithms for Bayesian Multi-Armed Bandit Problems" (Guha et al., 2013)
  • "An Asymptotically Optimal Index Policy for Finite-Horizon Restless Bandits" (Hu et al., 2017)
