Controlled Stochastic Bandit Problems

Updated 4 July 2026

Controlled stochastic bandits are sequential decision problems where the learner actively shapes the observed data by choosing actions that determine stochastic outcomes.
The framework encompasses models such as K-armed, convex, and contextual bandits, each optimizing performance metrics like cumulative regret or reward.
Algorithmic techniques like KL-UCB, confidence-based elimination, and safety-constrained methods balance exploration and exploitation under structured feedback.

The controlled stochastic bandit setting is a family of sequential decision problems in which a learner chooses actions, queries, or controls that directly determine what stochastic information is observed, and then uses those observations to optimize a performance criterion such as cumulative reward, cumulative cost, simple regret, or a safety-constrained objective. In the simplest stochastic $K$-armed model, the control is the arm-selection sequence $i(t)$ and rewards are independent random variables with arm-dependent means (Honda, 2019). In stochastic convex bandit optimization, the control is the query point $x_t$, the observation is a noisy function value $y_t=f(x_t)+\varepsilon_t$, and the learner controls where it samples (Agarwal et al., 2011). In stochastic contextual linear bandits, the action depends on a context or even on a distribution over contexts, while in bandit linear control the control $u_t$ affects both the next state and the scalar cost feedback (Zanette et al., 2021, Kirschner et al., 2019, Cassel et al., 2020). Across these formulations, the common structure is that the learner does not passively receive data: it actively shapes the data-generating process through sequential decisions under stochastic uncertainty.

1. Canonical formulation and scope

A controlled stochastic bandit problem typically specifies an action space, a stochastic feedback model conditional on the chosen action, and a benchmark against which performance is measured. In the standard stochastic $K$-armed bandit, there are $K$ arms, arm $i$ has unknown mean reward $\mu_i$, and the learner chooses $i(t)\in\{1,\dots,K\}$ at each round and observes a reward, often modeled as Bernoulli in the KL-UCB analysis (Honda, 2019). In the empirical-moment formulation, each arm has a reward distribution $F_i$ with bounded support, rewards are independent across time conditional on the chosen arm, and regret is measured as $\sum_{i:\mu_i<\mu^*}(\mu^*-\mu_i)\,N_i(T)$, where $\mu^*=\max_i\mu_i$ and $N_i(T)$ is the number of times arm $i$ is pulled in the first $T$ rounds (Honda et al., 2011).

The same controlled structure persists in richer models. In stochastic convex bandit optimization, the learner sequentially chooses $x_1,\dots,x_T$ in a convex, compact set $\mathcal{X}\subset\mathbb{R}^d$, observes only noisy function values, and minimizes

$$R_T=\sum_{t=1}^{T}\bigl(f(x_t)-\min_{x\in\mathcal{X}}f(x)\bigr),$$

with $f$ assumed convex and $L$-Lipschitz (Agarwal et al., 2011). In stochastic contextual linear bandits, each round begins with a context $c_t$ drawn i.i.d. from an unknown distribution $\mathcal{D}$, the learner sees a context-dependent action set $\mathcal{A}(c_t)$, and rewards follow

$$r_t=\phi(c_t,a_t)^\top\theta^\star+\eta_t,$$

where $\phi$ is a known feature map, $\theta^\star$ is the unknown reward parameter, and $\eta_t$ is mean-zero sub-Gaussian noise (Zanette et al., 2021).

A useful unifying description is that the learner’s action determines the conditional law of the observation. In stochastic convex bandits this is explicit in

$$y_t=f(x_t)+\varepsilon_t,$$

so the learner controls where it samples and the sampling changes the information obtained (Agarwal et al., 2011). In contextual models with distributional contexts, the learner observes only a context distribution $\mu_t$, not the realization $c_t\sim\mu_t$, chooses an action $x_t$, and then observes

$$y_t=f(x_t,c_t)+\varepsilon_t,$$

which again makes the chosen action the mechanism through which uncertainty is probed (Kirschner et al., 2019).

The scope of the setting is correspondingly broad. It includes semi-bandit models with multiple simultaneous plays and budget depletion (Zhou et al., 2017), continuous-time bandits with controlled restarts (Cayci et al., 2020), conservative and safety-constrained bandits (Wu et al., 2016, Amani et al., 2019, Lin et al., 2022), graph-feedback contextual bandits with side observations (Gong et al., 2023), and stochastic bandit formulations for discrete concave optimization and linear dynamical control (Oki et al., 2024, Cassel et al., 2020). This suggests that “controlled stochastic bandit” is best understood as a structural viewpoint rather than a single formal model.

2. Feedback models and observability

The defining informational feature of the setting is partial observability: the learner generally does not observe full reward functions, gradients, latent contexts, or all arm outcomes. What is observed depends on the chosen action and on the specific feedback model.

In zeroth-order stochastic convex optimization, feedback is noisy function evaluation. The learner may query any $x_t$ in the feasible set $\mathcal{X}$, but only sees a noisy value

$$y_t=f(x_t)+\varepsilon_t,$$

where $\varepsilon_t$ is independent, mean-zero, and $\sigma$-subgaussian, so the same point may be queried multiple times to build confidence intervals (Agarwal et al., 2011). In standard stochastic multi-armed bandits, the learner observes only the reward of the selected arm, with bounded or Bernoulli rewards depending on the formulation (Honda, 2019, Honda et al., 2011).

Contextual variants refine the same principle. In the distributional-context model, the learner observes the context distribution $\mu_t$, not the realized context $c_t$, and chooses based on expected features

$$\bar\psi_{x,t}=\mathbb{E}_{c\sim\mu_t}\bigl[\phi_{x,c}\bigr].$$

The effective noise then contains both observation noise and context-realization uncertainty, with

$$\xi_t=\bigl(\phi_{x_t,c_t}-\bar\psi_{x_t,t}\bigr)^\top\theta$$

acting as an additional mean-zero term in the linear case (Kirschner et al., 2019). A conservative extension keeps the same informational asymmetry while also requiring cumulative performance to stay above a $(1-\alpha)$ fraction of that of a baseline policy at every time (Lin et al., 2022).
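The decomposition behind the extra noise term can be written out in a few lines. This is a sketch under an assumed linear reward $y_t=\phi_{x_t,c_t}^\top\theta+\varepsilon_t$, with feature map $\phi$, unknown parameter $\theta$, and expected features $\bar\psi_{x,t}=\mathbb{E}_{c\sim\mu_t}[\phi_{x,c}]$; the symbols are illustrative and may differ from the notation of the cited paper.

```latex
\begin{align*}
y_t &= \phi_{x_t,c_t}^\top\theta + \varepsilon_t \\
    &= \bar\psi_{x_t,t}^\top\theta
       + \underbrace{\bigl(\phi_{x_t,c_t}-\bar\psi_{x_t,t}\bigr)^\top\theta}_{\xi_t}
       + \varepsilon_t, \\
\mathbb{E}\bigl[\xi_t \mid \mu_t, x_t\bigr]
    &= \bigl(\mathbb{E}_{c\sim\mu_t}[\phi_{x_t,c}]-\bar\psi_{x_t,t}\bigr)^\top\theta = 0 .
\end{align*}
```

Under these assumptions the learner can regress $y_t$ on the observable vector $\bar\psi_{x_t,t}$ and treat $\xi_t+\varepsilon_t$ as a single mean-zero noise term.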

Other feedback structures explicitly exploit side observations. In stochastic graph bandits, the learner observes the context $c_t$ and a directed graph $G_t$ over the actions, chooses one action $a_t$, and then observes the rewards of all actions in its out-neighborhood

$$N^{\mathrm{out}}_{G_t}(a_t)=\{a:\ (a_t\to a)\ \text{is an edge of}\ G_t\}.$$

The observation model is therefore neither pure bandit nor full information; it is shaped by the graph revealed at the round (Gong et al., 2023). In budget-constrained multiple-play semi-bandits, the learner plays exactly $K$ arms per round (here $K$ denotes the number of simultaneous plays, not the total number of arms) and observes individual rewards and costs for each selected arm, but not for unplayed arms (Zhou et al., 2017).
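The graph-feedback observation rule can be illustrated with a small sketch. It assumes the graph is stored as a dictionary from each action to the set of actions it reveals, and that an action reveals its own reward only if it has a self-loop; the function name `observe` and the toy data are hypothetical, not taken from the cited work.

```python
def observe(graph, action, rewards):
    """Return the rewards revealed by playing `action`.

    graph   : dict mapping each action to its out-neighborhood (a set);
              self-loops must be listed explicitly (an assumption here).
    rewards : dict mapping every action to its realized reward this round.
    """
    return {a: rewards[a] for a in sorted(graph.get(action, set()))}


actions = ["a", "b", "c"]
rewards = {"a": 0.2, "b": 0.7, "c": 0.5}  # hypothetical realized rewards

bandit_graph = {a: {a} for a in actions}              # self-loops only
full_info_graph = {a: set(actions) for a in actions}  # complete graph
side_obs_graph = {"a": {"a", "b"}, "b": {"b"}, "c": {"a", "c"}}

print(observe(bandit_graph, "a", rewards))     # only the played arm
print(observe(full_info_graph, "a", rewards))  # every arm
print(observe(side_obs_graph, "c", rewards))   # played arm plus side observations
```

In this representation, pure bandit feedback and full information are the two extreme graphs, and every other graph sits in between.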

A particularly distinctive feedback pattern appears in continuous-time bandits with controlled restarts. The learner selects both an arm $k$ and a restart time $t$. If the run is interrupted at cutoff $t$, the realized duration and reward become right-censored: writing $X_k$ for the completion time and $R_k$ for the reward of the attempt, the learner observes $\bigl(\min\{X_k,t\},\ R_k\,\mathbb{1}\{X_k\le t\}\bigr)$. The paper emphasizes that a larger cutoff reveals at least as much information as a smaller cutoff (Cayci et al., 2020).
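The claim that a larger cutoff is at least as informative as a smaller one can be checked mechanically. The sketch below assumes the feedback is the pair (observed duration, observed reward), with the reward collected only if the attempt completes before the cutoff; both function names are hypothetical.

```python
def censored_feedback(completion_time, reward, cutoff):
    """Feedback from one attempt that is restarted at `cutoff`.

    Assumption: an interrupted attempt yields zero reward and only
    reveals that the completion time exceeds the cutoff.
    """
    observed_duration = min(completion_time, cutoff)
    observed_reward = reward if completion_time <= cutoff else 0.0
    return observed_duration, observed_reward


def coarsen(feedback, smaller_cutoff):
    """Recompute the feedback a smaller cutoff would have produced,
    using only what was observed under a larger cutoff."""
    duration, reward = feedback
    return (min(duration, smaller_cutoff),
            reward if duration <= smaller_cutoff else 0.0)


# Hypothetical attempt: completes after 3.0 time units with reward 1.0.
wide = censored_feedback(3.0, 1.0, cutoff=5.0)
print(wide, coarsen(wide, 2.0), censored_feedback(3.0, 1.0, cutoff=2.0))
```

Because `coarsen` never needs the hidden completion time, anything learnable at the smaller cutoff is also learnable at the larger one.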

These models support a broad taxonomy of observability regimes.

Model	Controlled decision	Observed feedback
Stochastic convex bandit	Query point $x_t$	Noisy value $y_t=f(x_t)+\varepsilon_t$
$K$-armed stochastic bandit	Arm $i(t)$	Reward of chosen arm
Distributional contextual bandit	Action given $\mu_t$	Reward with hidden realized context
Graph contextual bandit	Action $a_t$ on graph $G_t$	Rewards on out-neighborhood
Budgeted semi-bandit	Subset $S_t$, $|S_t|=K$	Individual rewards and costs of played arms
Controlled restarts	Pair $(k,t)$	Right-censored duration and reward

A common misconception is that the benchmark should always be a policy that sees the finest latent state available in the environment. The distributional-context model shows otherwise: the relevant comparator is the best action given the observed distribution $\mu_t$, not the realized hidden context $c_t$, and the paper gives an example showing that competing against a policy that sees $c_t$ before acting can incur regret that grows linearly in $T$ (Kirschner et al., 2019).

3. Objectives and regret criteria

The controlled stochastic bandit setting does not have a single universal performance functional. Instead, the objective is determined by what the learner is meant to optimize under the available control and feedback constraints.

The most common criterion is cumulative regret or pseudo-regret. In stochastic convex bandit optimization this takes the form

$$R_T=\sum_{t=1}^{T}\bigl(f(x_t)-\min_{x\in\mathcal{X}}f(x)\bigr),$$

which is a cumulative optimization loss relative to the best fixed point in the feasible set (Agarwal et al., 2011). In stochastic $K$-armed bandits the pseudo-regret is

$$\bar R_T=\sum_{t=1}^{T}\bigl(\mu^*-\mu_{i(t)}\bigr),\qquad \mu^*=\max_i\mu_i,$$

or equivalently the arm-count form $\sum_{i}(\mu^*-\mu_i)\,N_i(T)$, where $N_i(T)$ counts the pulls of arm $i$ up to round $T$ (Honda, 2019, Honda et al., 2011). In contextual linear bandits with distributional contexts, the benchmark is the best action chosen from the observed distribution: $x_t^*=\arg\max_{x}\mathbb{E}_{c\sim\mu_t}[f(x,c)]$. This definition makes the benchmark information-compatible with the learner (Kirschner et al., 2019).
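The equivalence between the per-round and arm-count forms of pseudo-regret is easy to confirm numerically. The sketch below uses hypothetical arm means and a hypothetical pull sequence (arms are 0-indexed); the function names are illustrative.

```python
from collections import Counter


def pseudo_regret_by_round(means, pulls):
    """Sum over rounds of (best mean - mean of the arm pulled)."""
    best = max(means)
    return sum(best - means[arm] for arm in pulls)


def pseudo_regret_by_counts(means, pulls):
    """Sum over arms of (gap of the arm) * (number of pulls)."""
    best = max(means)
    counts = Counter(pulls)
    return sum((best - means[arm]) * n for arm, n in counts.items())


means = [0.9, 0.5, 0.4]        # hypothetical arm means
pulls = [0, 1, 2, 0, 0, 1, 0]  # hypothetical arm sequence i(1), ..., i(7)

print(pseudo_regret_by_round(means, pulls))
print(pseudo_regret_by_counts(means, pulls))
```

Both functions regroup the same terms, so they agree up to floating-point rounding; the arm-count form is the one most regret analyses bound, one arm at a time.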

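Other formulations depart substantially from reward maximization. In controlled experiment design for stochastic contextual linear bandits, the goal is to collect a dataset under a fixed exploration policy $\pi_e$ so that the recovered greedy policy $\hat\pi$ has small expected suboptimality

$$\mathbb{E}_{c\sim\mathcal{D}}\Bigl[\max_{a\in\mathcal{A}(c)}\phi(c,a)^\top\theta^\star-\phi\bigl(c,\hat\pi(c)\bigr)^\top\theta^\star\Bigr],$$

where $c$ is a context drawn from the context distribution $\mathcal{D}$, $\mathcal{A}(c)$ is its action set, $\phi$ is the feature map, and $\theta^\star$ is the unknown reward parameter. This is not immediate online regret minimization; it is policy-recovery quality after deliberate data collection (Zanette et al., 2021).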

Pure-exploration and offline-optimization variants use simple regret. For stochastic bandits over $M^\natural$-concave functions, after $T$ noisy value-oracle queries the learner outputs a point $\hat x$, and the expected simple regret is

$$\max_{x}f(x)-\mathbb{E}\bigl[f(\hat x)\bigr].$$

The same work also considers cumulative pseudo-regret

$$\sum_{t=1}^{T}\bigl(\max_{x}f(x)-f(x_t)\bigr),$$

showing that controlled stochastic bandit models can simultaneously support exploration-only and reward-accumulation viewpoints (Oki et al., 2024).

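Several papers replace standard regret with constrained or risk-adjusted criteria. In conservative bandits, the learner must maintain

$$\sum_{s=1}^{t}\mu_{i(s)}\ \ge\ (1-\alpha)\,\mu_0\,t\qquad\text{for all } t,$$

or the analogous realized-reward condition, while minimizing pseudo-regret; here $\mu_0$ is the mean of the default arm and $\alpha\in(0,1)$ is the tolerated fraction of loss (Wu et al., 2016). In linear stochastic bandits under safety constraints, the learner minimizes

$$R_T=\sum_{t=1}^{T}\bigl(\theta^{\star\top}x^\star-\theta^{\star\top}x_t\bigr)$$

subject to the unknown-parameter safety constraint $\theta^{\star\top}Bx_t\le c$ at every round with high probability, where $x^\star$ is the best safe action, $B$ is a known matrix, and $c$ is a known threshold (Amani et al., 2019). In Sharpe-ratio optimization, the object of interest is not cumulative reward but the empirical Sharpe ratio of the action sequence; each arm is scored by a stabilized index that depends on both its mean and its variance, and regret is defined relative to that index. This makes mean and variance jointly relevant to exploration (Shah et al., 2025).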
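The conservative constraint can be checked along a trajectory by tracking a running budget. The sketch below assumes the constraint takes the common form "cumulative mean reward up to round $t$ is at least $(1-\alpha)\mu_0 t$", with $\mu_0$ the mean of a default arm; the arm means, the pull sequences, and the function names are hypothetical.

```python
def budget_process(means, pulls, default_arm, alpha):
    """Budget Z_t = sum_{s<=t} mean(pulled arm) - (1 - alpha) * mu0 * t.

    Assumption: the constraint is stated on mean rewards (not realized
    rewards) and mu0 is the mean of `default_arm`.
    """
    mu0 = means[default_arm]
    cumulative, budget = 0.0, []
    for t, arm in enumerate(pulls, start=1):
        cumulative += means[arm]
        budget.append(cumulative - (1 - alpha) * mu0 * t)
    return budget


def is_conservative(means, pulls, default_arm, alpha):
    """True if the budget stays nonnegative at every round."""
    return all(z >= 0 for z in budget_process(means, pulls, default_arm, alpha))


means = {0: 0.5, 1: 0.2, 2: 0.8}  # arm 0 plays the role of the default arm

print(is_conservative(means, [1], 0, alpha=0.1))            # explores too early
print(is_conservative(means, [0] * 6 + [1], 0, alpha=0.1))  # explores after saving budget
```

Each default pull adds $\alpha\mu_0$ to the budget, so in this toy instance a poor arm can only be tried after enough default pulls have accumulated.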

Time-constrained models induce yet another criterion. In continuous-time bandits with controlled restarts, the natural quantity is the renewal reward rate

$$r_k(t)=\frac{\mathbb{E}\bigl[R_k\,\mathbb{1}\{X_k\le t\}\bigr]}{\mathbb{E}\bigl[\min\{X_k,t\}\bigr]},$$

and regret is measured relative to the optimal expected cumulative reward achievable before a time budget $\tau$; here $X_k$ and $R_k$ denote the completion time and reward of an attempt on arm $k$, and $t$ is the restart cutoff (Cayci et al., 2020). A plausible implication is that the controlled stochastic bandit setting is unified less by a single loss function than by a shared informational architecture: sequential control under stochastic, action-dependent feedback.
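A reward rate of this kind can be estimated from logged attempts. The sketch below assumes the rate is "expected reward collected before the cutoff, divided by expected time spent per attempt" (the usual renewal-reward ratio) and that full completion times are available for the logged samples; the data and the function name are hypothetical.

```python
def empirical_reward_rate(samples, cutoff):
    """Estimate reward per unit time when every attempt is restarted at `cutoff`.

    samples : list of (completion_time, reward) pairs with positive times.
    Assumption: an attempt interrupted at the cutoff earns no reward.
    """
    total_reward = sum(r for x, r in samples if x <= cutoff)
    total_time = sum(min(x, cutoff) for x, _ in samples)
    return total_reward / total_time


# Hypothetical heavy-tailed task: one quick success, one very slow success.
samples = [(1.0, 1.0), (10.0, 1.0)]

for cutoff in (2.0, 20.0):
    print(cutoff, empirical_reward_rate(samples, cutoff))
```

In this toy example the short cutoff gives the higher rate, because abandoning the slow attempt early frees time for new attempts; this is the trade-off that makes the restart time a genuine control variable.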

4. Algorithmic principles

Algorithm design in controlled stochastic bandits is driven by the interaction between estimation, confidence, and the geometry or structure of the action space. The classical paradigm is optimism under uncertainty. In KL-UCB and KL-UCB+, the index of arm $i$ is

$$U_i(t)=\sup\bigl\{\mu\in[0,1]:\ N_i(t)\,d\bigl(\hat\mu_i(t),\mu\bigr)\le\beta_i(t)\bigr\},$$

with $\beta_i(t)=\log t$ giving KL-UCB and $\beta_i(t)=\log\bigl(t/N_i(t)\bigr)$ giving KL-UCB+, where $\hat\mu_i(t)$ is the empirical mean of arm $i$, $N_i(t)$ is its number of pulls, and $d$ is the Bernoulli KL divergence (Honda, 2019). Conservative UCB modifies this by adding a budget-feasibility check before allowing the optimistic arm to be played, reverting to the default arm when the lower confidence bound on the budget would become negative (Wu et al., 2016). Safe-LUCB similarly restricts optimism to actions certified safe under every parameter in a confidence ellipsoid (Amani et al., 2019).

Confidence-driven elimination and repeated sampling appear in nonparametric and geometric models. In stochastic convex optimization with bandit feedback, the one-dimensional algorithm queries interval quartiles and a center point, constructs confidence intervals of a prescribed width by repeated sampling, and discards a quartile only when the evidence is strong enough. The center point acts as a sentinel that detects when the function dips in the middle, which the paper identifies as the key innovation over simpler interval-shrinking schemes (Agarwal et al., 2011). The high-dimensional version generalizes the ellipsoid method through a regular simplex, a sequence of pyramids, cone-cutting, and hat-raising, using repeated sampling to ensure that the confidence intervals are valid with high probability (Agarwal et al., 2011).

Moment-based and distribution-aware indices provide another principle. DMED-M replaces the full empirical distribution in DMED by the first $d$ empirical moments and schedules arm $i$ again when

$$N_i(t)\,D^{(d)}\bigl(\hat m_i(t),\hat\mu^*(t)\bigr)\le\log t-\log N_i(t),$$

where $\hat m_i(t)$ collects the first $d$ empirical moments of arm $i$, $\hat\mu^*(t)$ is the largest empirical mean, $N_i(t)$ is the number of pulls of arm $i$, and $D^{(d)}$ is the moment-based counterpart of the minimum-divergence term used in DMED. The resulting policy depends only on the first $d$ empirical moments of each arm and realizes a computational-complexity versus regret tradeoff (Honda et al., 2011). A different form of estimator adaptation appears in multi-armed bandits with limited control variates, where the learner combines a reward-only estimator and a control-variate estimator into a single estimate and then builds a $t$-distribution-based UCB index from the combined mean and variance estimate (Verma et al., 2026).

Structured exploration may also be deliberately non-reactive. In design of experiments for stochastic contextual linear bandits, the planner uses offline contexts and a reward-free LinUCB-style uncertainty criterion

$$\max_{a\in\mathcal{A}(c)}\ \|\phi(c,a)\|_{\Sigma^{-1}},$$

where $\Sigma$ is the regularized covariance matrix of the features selected so far, to produce a mixture of policies. The online sampler then keeps this exploration rule fixed while collecting rewards on fresh contexts (Zanette et al., 2021). The paper explicitly contrasts this with standard reactive algorithms such as UCB and LinUCB.

In graph-feedback contextual bandits, the exploration policy is shaped simultaneously by candidate-optimal actions, graph structure, and empirical gaps. AdaCB.G constructs an exploration set from an induced subgraph over actions that could still be optimal, then solves a linear program to assign action probabilities while ensuring enough side observations through the graph (Gong et al., 2023). In stochastic $M^\natural$-concave maximization, the key principle is to reduce each greedy update to a small stochastic bandit problem over feasible one-step increments and then compose local decisions via a robustness theorem that bounds the loss in final objective value in terms of the errors made in the individual greedy steps. This converts local estimation accuracy into global optimization quality (Oki et al., 2024).

Control can also be embedded in dynamics rather than static action choice. In bandit linear control, the disturbance-action policy

$$u_t=Kx_t+\sum_{i=1}^{H}M^{[i]}w_{t-i}$$

reparameterizes control so that the induced surrogate losses have bounded memory, after which a new reduction from bandit convex optimization with memory to standard bandit convex optimization becomes possible (Cassel et al., 2020). Here $x_t$ is the system state, $K$ is a fixed stabilizing gain matrix (not the number of arms), $w_{t-i}$ are past disturbances, and $M^{[1]},\dots,M^{[H]}$ are the learned parameters. This suggests that controlled stochastic bandit algorithms often hinge on an intermediate representation that makes action-dependent information accumulation tractable.

5. Constraints, structure, and specialized regimes

A major line of work studies controlled stochastic bandits under explicit safety, budget, or feasibility constraints. Conservative bandits introduce a default arm and require cumulative reward never to fall below a fixed fraction of the default strategy uniformly over time; the budget process is

$$Z_t=\sum_{s=1}^{t}\mu_{i(s)}-(1-\alpha)\,\mu_0\,t,$$

and safety means $Z_t\ge 0$ for all $t\in\{1,\dots,T\}$ (Wu et al., 2016). Linear stochastic bandits under safety constraints define the true safe set as

$$\mathcal{X}^{\mathrm{safe}}=\{x\in\mathcal{X}:\ \theta^{\star\top}Bx\le c\},$$

replace it with an inner approximation

$$\mathcal{X}^{\mathrm{safe}}_t=\{x\in\mathcal{X}:\ v^\top Bx\le c\ \text{ for all } v\in\mathcal{C}_t\},$$

and choose only actions certified safe under the current confidence ellipsoid $\mathcal{C}_t$ (Amani et al., 2019). A contextual analogue requires cumulative reward to remain above a $(1-\alpha)$ fraction of that of a baseline policy even though the realized context is hidden and only the context distribution is observed (Lin et al., 2022).
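Certifying an action against every parameter in a confidence ellipsoid reduces to a single closed-form test. The derivation below is a sketch under the assumption that the confidence set is an ellipsoid $\mathcal{C}_t=\{v:\|v-\hat\theta_t\|_{V_t}\le\beta_t\}$ with $V_t$ positive definite, and that the safety constraint is $v^\top Bx\le c$ with known $B$ and $c$; the symbols $\hat\theta_t$, $V_t$, and $\beta_t$ are illustrative notation rather than the cited paper's.

```latex
\begin{align*}
\max_{v\in\mathcal{C}_t} v^\top Bx
  &= \hat\theta_t^\top Bx + \max_{\|w\|_{V_t}\le\beta_t} w^\top Bx \\
  &= \hat\theta_t^\top Bx + \beta_t\,\|Bx\|_{V_t^{-1}}
     \qquad\text{(Cauchy--Schwarz in the $V_t$-norm)}, \\
\text{so}\quad
x \text{ is certified safe}
  &\iff \hat\theta_t^\top Bx + \beta_t\,\|Bx\|_{V_t^{-1}} \le c .
\end{align*}
```

The second term is the safety margin paid for uncertainty: it shrinks as $V_t$ grows in the direction $Bx$, which is how exploration gradually enlarges the certified set.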

Resource constraints lead to further specializations. In budget-constrained multiple-play semi-bandits, the learner must choose exactly $K$ out of $N$ arms per round, pay the realized costs of all selected arms, and stop when the chosen set would exceed the remaining budget. The stochastic oracle benchmark repeatedly plays the fixed best subset maximizing the sum of bang-per-buck ratios $\mu^r_i/\mu^c_i$, the mean reward of arm $i$ divided by its mean cost (Zhou et al., 2017). In continuous-time bandits with controlled restarts, restart times become decision variables, and the optimal static action is characterized by the reward-rate maximizer $(k^*,t^*)\in\arg\max_{k,t}r_k(t)$, where $r_k(t)$ is the renewal reward rate of arm $k$ under restart time $t$ (Cayci et al., 2020).
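For a fixed subset size, maximizing a sum of per-arm ratios amounts to taking the arms with the largest ratios. The sketch below assumes known, strictly positive mean costs; the arm names, numbers, and the function name `best_fixed_subset` are hypothetical.

```python
def best_fixed_subset(mean_rewards, mean_costs, k):
    """Return the k arms with the largest mean-reward / mean-cost ratio.

    Assumption: every mean cost is strictly positive, and the oracle
    knows the true means (a learner would plug in estimates instead).
    """
    ratios = {arm: mean_rewards[arm] / mean_costs[arm] for arm in mean_rewards}
    ranked = sorted(ratios, key=lambda arm: ratios[arm], reverse=True)
    return ranked[:k]


mean_rewards = {"a": 0.9, "b": 0.5, "c": 0.4}
mean_costs = {"a": 0.9, "b": 0.25, "c": 0.8}

print(best_fixed_subset(mean_rewards, mean_costs, k=2))
```

Here arm "b" ranks first despite its lower reward because it is cheap, which is the sense in which budgeted problems reward efficiency rather than raw payoff.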

Another line exploits latent structure in the reward model. Distributional-context bandits replace exact pre-decision context information by distributions over contexts (Kirschner et al., 2019). High-dimensional contextual bandits with missing covariates assume a sparse parameter $\theta^\star$, masked observations

$$\tilde x_{t,j}=m_{t,j}\,x_{t,j},\qquad m_{t,j}\in\{0,1\},$$

and missing completely at random covariates with coordinate-wise observation probabilities $p_j=\Pr(m_{t,j}=1)$, leading to a missingness-adjusted lasso plug-in policy (Jang et al., 2022). Graph-feedback bandits exploit side observations encoded by a directed graph revealed at each round (Gong et al., 2023).
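One standard way to adjust for covariates that are missing completely at random is inverse-probability reweighting: dividing each masked coordinate by its observation probability restores the correct mean. The simulation below is only an illustration of that idea, assuming zero-filled missing entries and known probabilities; it is not claimed to be the adjustment used in the cited paper, and all names and numbers are hypothetical.

```python
import random


def mask(x, probs, rng):
    """Zero out each coordinate independently; coordinate j survives with probability probs[j]."""
    return [xj if rng.random() < pj else 0.0 for xj, pj in zip(x, probs)]


def reweight(masked_x, probs):
    """Inverse-probability correction: E[masked_j / p_j] equals x_j."""
    return [xj / pj for xj, pj in zip(masked_x, probs)]


rng = random.Random(0)    # fixed seed for reproducibility
x = [2.0, -1.0, 0.5]      # hypothetical full covariate vector
probs = [0.5, 0.9, 0.25]  # hypothetical observation probabilities

n = 20000
average = [0.0] * len(x)
for _ in range(n):
    corrected = reweight(mask(x, probs, rng), probs)
    average = [a + c / n for a, c in zip(average, corrected)]

print(average)  # should be close to x, with more noise where probs[j] is small
```

The correction is unbiased but inflates variance by roughly $1/p_j$ per coordinate, which is consistent with the intuition that rarely observed covariates make learning slower.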

The setting also accommodates structured non-stationarity. Generalized non-stationary bandits assume time-varying arm means $\mu_i(t)$, controlled short-term variation of the best mean, and a geometric condition on the gap process. The same framework covers switching bandits, locally polynomial means, locally smooth means, and bounded-inflexion gap models, and the algorithm PrudentBandits performs confidence-based gap estimation, active-set maintenance, and significant change-point detection on gaps rather than raw means (Manegueu et al., 2021).

Risk-sensitive and economic-control variants extend the notion of “control” beyond arm choice alone. Sharpe-ratio bandits optimize a stabilized ratio depending jointly on mean and variance (Shah et al., 2025). Continuous-time two-armed restless bandits with imperfect information model a risky arm that raises future expected returns through accumulated human capital, a safe arm with deterministic payoff, and a hidden type; the optimal policy is a stopping rule characterized by an index that formally coincides with Gittins’ index (Fryer et al., 2015). A plausible implication is that the controlled stochastic bandit setting reaches into stochastic control whenever present choices alter not only information but also future opportunity sets.

6. Guarantees, lower bounds, and conceptual significance

Regret guarantees in controlled stochastic bandits reflect both classical exploration–exploitation limits and additional costs induced by geometry, constraints, or richer information structures. In stochastic convex optimization with bandit feedback, the generalized ellipsoid algorithm achieves $\tilde O(\mathrm{poly}(d)\sqrt{T})$ regret with high probability, and the paper states that any algorithm must incur at least $\Omega(\sqrt{T})$ regret, so the dependence on $T$ is optimal up to logarithmic factors (Agarwal et al., 2011). In the Bernoulli stochastic bandit, KL-UCB+ satisfies

$$\limsup_{T\to\infty}\frac{\mathbb{E}[\bar R_T]}{\log T}\le\sum_{i:\mu_i<\mu^*}\frac{\mu^*-\mu_i}{d(\mu_i,\mu^*)},$$

matching the Lai–Robbins lower bound and establishing asymptotic optimality of KL-UCB+ (Honda, 2019). Here $\bar R_T$ is the pseudo-regret, $\mu^*=\max_i\mu_i$, and $d$ is the Bernoulli KL divergence.
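The constant multiplying $\log T$ in a Lai–Robbins-type bound can be evaluated directly for a given Bernoulli instance. The sketch below assumes the standard form of that constant, the sum over suboptimal arms of gap divided by the KL divergence from the arm to the best arm; the instances and function names are hypothetical.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q) for p, q strictly in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lai_robbins_constant(means):
    """Sum over suboptimal arms of (mu* - mu_i) / kl(mu_i, mu*).

    Assumption: Bernoulli arms with means strictly between 0 and 1.
    """
    best = max(means)
    return sum((best - m) / bernoulli_kl(m, best) for m in means if m < best)


print(lai_robbins_constant([0.9, 0.8]))  # close competitor: large constant
print(lai_robbins_constant([0.9, 0.5]))  # distant competitor: small constant
```

The constant grows as a suboptimal arm approaches the best one, because the divergence shrinks quadratically in the gap while the per-pull loss shrinks only linearly.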

For contextual and structured models, rates depend on dimension or on the structure that mediates observation quality. The design-of-experiments approach to stochastic contextual linear bandits achieves near-minimax online sample complexity for recovery of a near-optimal policy from non-reactive exploration data (Zanette et al., 2021). Distributional-context linear bandits obtain order-optimal high-probability regret $\tilde O(d\sqrt{T})$, matching the classical linear contextual rate up to logarithmic factors (Kirschner et al., 2019). Graph-feedback contextual bandits obtain a minimax-style bound driven by the expected independence number of the feedback graph rather than the full action set size, and also admit a gap-dependent upper bound under a uniform gap condition (Gong et al., 2023). High-dimensional sparse contextual bandits with missing covariates admit a high-probability regret bound in which missingness worsens regret by at most a factor determined by the coordinate-wise observation probabilities (Jang et al., 2022).

Safety and conservativeness add explicit prices. Conservative UCB achieves the usual stochastic-bandit regret term plus an additive penalty of order $K/(\alpha\mu_0)$ up to logarithmic factors, where $\alpha$ is the conservativeness level and $\mu_0$ is the mean of the default arm, and the paper proves a lower bound showing that any algorithm satisfying the conservative constraint must pay at least $\Omega\bigl(K/(\alpha\mu_0)\bigr)$ additional regret in the worst case (Wu et al., 2016). In linear stochastic bandits under safety constraints, if the safety gap $\Delta$ is positive and known, Safe-LUCB yields $\tilde O(\sqrt{T})$ regret up to problem-dependent constants; when $\Delta=0$ or unknown, the worst-case bound degrades to $\tilde O(T^{2/3})$ (Amani et al., 2019). The conservative contextual extension decomposes regret into a standard linear-UCB term plus two time-independent constants, one for conservativeness and one for unknown contexts (Lin et al., 2022).

Stochastic structure can also separate sharply from adversarial structure. The $M^\natural$-concave maximization work gives $O(T^{-1/2})$ simple regret and $O(T^{2/3})$ cumulative regret under an unbiased noisy value oracle, but also proves that in the adversarial full-information setting no polynomial-time per-round algorithm can achieve $O(T^{1-c})$ regret for any constant $c>0$ unless $\mathsf{P}=\mathsf{NP}$ (Oki et al., 2024). In bandit linear control, strongly convex and smooth costs allow $\tilde O(\sqrt{T})$ pseudo-regret under bandit feedback, while full-information online control can achieve much faster rates, including logarithmic regret in some regimes (Cassel et al., 2020). This clarifies that partial feedback is not merely a technical nuisance; it changes the attainable rate.

Several specialized models retain the canonical logarithmic or square-root patterns after suitable reformulation. Continuous-time bandits with controlled restarts achieve $O(\log\tau)$ regret for finite restart sets and $O(\sqrt{\tau\log\tau})$ regret for continuous restart times, where $\tau$ is the time budget (Cayci et al., 2020). Corralling stochastic bandit algorithms yields regret close to that of the best base learner, with gap-dependent terms involving the gap between the best base learner and the others (Arora et al., 2020). Sharpe-ratio Thompson sampling achieves logarithmic regret with a matching lower bound in the Gaussian model, establishing order-optimality for this risk-sensitive objective (Shah et al., 2025). Generalized non-stationary bandits obtain a regret bound that recovers the classical switching-bandit rate up to logarithmic factors in the switching special case (Manegueu et al., 2021).

Taken together, these results show that the controlled stochastic bandit setting is not defined by a single reward model or a single algorithmic template. Its central property is that decisions simultaneously affect immediate performance and the information available for future decisions. Convexity, linear structure, graph side-observations, censoring, safety constraints, control variates, and restart choices all alter how this dual role is expressed, but the core research problem remains the same: to exploit stochastic regularity without losing control of the information process itself.