Bandits with Switching Costs

Updated 8 March 2026
  • Bandits with switching costs are online learning problems where each change in action incurs a penalty, complicating the trade-off between exploration and exploitation.
  • In the adversarial setting with unit switching costs, the minimax regret scales as $\widetilde\Theta(K^{1/3}T^{2/3})$, compared with $\Theta(\sqrt{KT})$ for standard multi-armed bandits.
  • Algorithmic techniques such as batching and mini-batching are employed to limit costly switches, striking a balance between information gathering and incurred penalties.

Bandits with switching costs refer to a class of online learning and sequential decision problems in which the learner, as in classical multi-armed bandits (MAB), chooses actions (arms) over a sequence of rounds, but now each change (switch) of action incurs an explicit and non-negligible cost. This formulation brings into focus the fundamental dilemma between exploration, which often necessitates switches to new actions to gather information, and the direct penalty levied for each such switch. The resulting model captures a wide range of scenarios—from adversarial and stochastic MAB to online learning with structured feedback graphs and general movement metrics—where control of adaptivity, exploration, and incurred cost is paramount. The minimax regret rates, which fundamentally differ from those in the classic bandit or expert settings, display distinctive scaling laws and complexity thresholds.

1. Formal Model and Regret Formulation

Let the time horizon be $T$ and let the set of available actions (arms) have cardinality $K$. At each round $t$, the learner selects an action $X_t \in [K]$. The adversary, which can be oblivious or have limited adaptivity, fixes a loss sequence

$$\ell_t \colon [K] \to [0,1], \qquad \text{for } t=1, \ldots, T.$$

The learner pays the instantaneous loss $\ell_t(X_t)$ and a unit cost whenever she switches:

$$C_T = \sum_{t=1}^T \ell_t(X_t) + \sum_{t=1}^T \mathbf{1}_{X_t \neq X_{t-1}},$$

with $X_0$ a dummy null action, so $X_1$ always counts as a switch. Regret is measured against the best fixed arm in hindsight:

$$R_T = C_T - \min_{x \in [K]} \sum_{t=1}^T \ell_t(x).$$

The minimax regret is the worst-case expected regret achievable by any (possibly randomized) learning strategy:

$$\mathcal{R}_T = \inf_{\text{learner}} \sup_{\ell_{1:T}} \mathbb{E}[R_T].$$

This formalism generalizes naturally to arbitrary movement cost functions $\Delta(i,j)$ (incurred for $X_{t-1} \to X_t$), combinatorial arms, feedback graphs, and more (Dekel et al., 2013, Koren et al., 2017, Rangi et al., 2018, Arora et al., 2019, Dong et al., 2024).
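The definitions above can be checked on a small example. The following is a minimal sketch, assuming losses are given as a list of per-round rows and actions as a list of arm indices; the function names `total_cost` and `regret` are illustrative and not taken from any of the cited papers.

```python
def total_cost(losses, actions):
    """Cumulative loss plus unit switching costs.

    losses[t][x] is the loss of arm x in round t (values in [0, 1]);
    actions[t] is the arm played in round t.
    X_0 is a dummy null action, so the first round always counts as a switch.
    """
    loss = sum(losses[t][a] for t, a in enumerate(actions))
    first_round_switch = 1 if actions else 0
    later_switches = sum(1 for prev, cur in zip(actions, actions[1:]) if prev != cur)
    return loss + first_round_switch + later_switches


def regret(losses, actions):
    """Total cost minus the cumulative loss of the best fixed arm in hindsight."""
    num_arms = len(losses[0])
    best_fixed = min(sum(row[x] for row in losses) for x in range(num_arms))
    return total_cost(losses, actions) - best_fixed
```

With alternating losses `[[0, 1], [1, 0], [0, 1]]`, the sequence `[0, 1, 0]` pays zero loss but three switches, while staying on arm 0 pays loss 1 and a single (initial) switch, so chasing the per-round best arm is the worse choice here.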

2. Minimax Regret: Scaling Laws and Lower Bounds

In adversarial bandits with unit switching costs, the central result is that, for $K \le T$, the minimax regret scales as (Dekel et al., 2013)

$$\mathcal{R}_T = \widetilde\Theta\bigl(K^{1/3}T^{2/3}\bigr).$$

This scaling is tight up to logarithmic factors, as proven by matching upper and lower bounds. The lower bound employs a randomized adversarial construction that hides the identity of the optimal arm behind a carefully designed multi-scale random walk (MRW) process. Information-theoretic arguments show that identifying the correct arm with sufficient confidence costs at least $\widetilde\Omega(K^{1/3}T^{2/3})$, paid either as regret from ignorance or as cumulative switching penalties (Dekel et al., 2013, Cesa-Bianchi et al., 2013).

By contrast, the full-information (expert) version, even with switching costs, admits a much smaller minimax regret:

$$\text{Full information (experts with switching costs): } \Theta(\sqrt{T \log K}).$$

The gap reflects the essential difficulty of exploration under bandit feedback when switching is penalized.

Partial-information settings with graph-based feedback (generalizing the expert and bandit extremes) yield minimax regret in terms of structural graph invariants such as the independence number $\alpha(G)$ or, more precisely, the domination number $\gamma(G)$ (Rangi et al., 2018, Arora et al., 2019):

$$R_T = \widetilde O\bigl(\gamma(G)^{1/3} T^{2/3}\bigr).$$

Metric movement costs and combinatorial actions lead to regret scaling in terms of covering-number-like complexity measures. For example, for a movement cost metric $\Delta$ with covering number $\mathcal{C}$:

$$\text{Regret} = \widetilde O\bigl(\max\{\mathcal{C}^{1/3} T^{2/3},\ \sqrt{KT}\}\bigr).$$

When the arms index a continuous metric space of Minkowski dimension $d$ and the losses are Lipschitz, the rate becomes $\widetilde\Theta(T^{\frac{d+1}{d+2}})$ (Koren et al., 2017).

3. Algorithmic Techniques: Batching, Mini-batching, and Perturbed Leaders

Optimal rates are achieved by algorithms that explicitly limit the total number of switches via batching structures. The central paradigm is to partition the horizon into $M$ epochs of length $L$, run a bandit subroutine (such as Follow-The-Perturbed-Leader, FPL, or Exp3-type) within each epoch, and only allow switches at epoch boundaries (Dekel et al., 2013, Altschuler et al., 2018):

  • Treating each epoch as a single round, the bandit subroutine's regret on the losses is $O(\sqrt{TLK\log K})$ in total: $O(\sqrt{MK\log K})$ in units of epoch-averaged loss, scaled by the epoch length $L$ (with $LM = T$).
  • Each epoch boundary incurs at most one switch, contributing at most $M = T/L$ to the total cost.
  • Balancing $\sqrt{TLK\log K}$ against $T/L$ yields $L \propto (T/(K\log K))^{1/3}$, $M = T/L$, and

$$\mathcal{R}_T = O\bigl((K\log K)^{1/3} T^{2/3}\bigr).$$

Epoch-based FPL and mini-batched Tsallis-INF are prototypical examples (Dekel et al., 2013, Amir et al., 2022, Rouyer et al., 2021). For more general structures (e.g., partial-information graphs or movement metrics), variants of EXP3/EXP4, log-barrier mirror descent, or John’s exploration are combined with adaptive batching and specialized loss estimators to attain optimal rates (Rangi et al., 2018, Koren et al., 2017, Dong et al., 2024).
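The batching idea can be illustrated with a minimal sketch of a mini-batched Exp3-style learner. This is an assumption-laden illustration and not the exact algorithm of any cited paper: the function names, the importance-weighted estimator on epoch-averaged losses, and the treatment of `batch_len` and `eta` as free parameters are all choices made for this example.

```python
import math
import random


def minibatched_exp3(losses, batch_len, eta, seed=0):
    """Play one arm per epoch of `batch_len` rounds; switch only at epoch boundaries.

    losses[t][x] is the loss of arm x in round t (values in [0, 1]).
    Returns the list of arms played, one per round.
    """
    rng = random.Random(seed)
    num_rounds, num_arms = len(losses), len(losses[0])
    cum_est = [0.0] * num_arms  # cumulative importance-weighted loss estimates
    actions = []
    for start in range(0, num_rounds, batch_len):
        # Exponential-weights distribution (shifted by the minimum for stability).
        shift = min(cum_est)
        weights = [math.exp(-eta * (c - shift)) for c in cum_est]
        total = sum(weights)
        probs = [w / total for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        block = losses[start:start + batch_len]
        actions.extend([arm] * len(block))
        # Bandit feedback: only the played arm's losses in this epoch are observed.
        avg_loss = sum(row[arm] for row in block) / len(block)
        cum_est[arm] += avg_loss / probs[arm]
    return actions


def count_switches(actions):
    """Number of switches, counting the first round as a switch from the null action."""
    return (1 if actions else 0) + sum(a != b for a, b in zip(actions, actions[1:]))
```

By construction the learner changes arms at most once per epoch, so `count_switches` is bounded by the number of epochs regardless of the loss sequence. A common choice for the step size, assumed here rather than prescribed by the source, is `eta = sqrt(log(K) / (K * M))` with `M` the number of epochs.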

4. Information-Theoretic Lower Bounds: The Multi-Scale Random Walk

The core lower bound construction employs an adversarial process in which the losses for each arm are constructed as a sum of a multi-scale random walk and a small bias (the optimality gap), with the best arm hidden. The MRW process has logarithmic depth and width, ensuring:

  • The arms’ losses are highly correlated, so bandit feedback on any arm reveals minimal information about the others unless the learner switches.
  • Each switch uncovers at most $O(\sqrt{\log T})$ bits; identifying the optimal arm with gap $\epsilon$ requires $\Omega(1/\epsilon^2)$ bits, entailing $\Omega(K^{1/3}T^{2/3}/\log T)$ total switches in the minimax regime.

The interplay of partial monitoring, the exploration required to learn the identity of the best arm, and the penalization of adaptivity by the switching cost is what fundamentally creates the $T^{2/3}$ minimax frontier (Dekel et al., 2013).
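The construction described above can be sketched in a few lines. This is a hedged illustration: the parent rule (subtract the largest power of two dividing $t$), the Gaussian step size, and the constants involving $9\log_2 T$ follow one common presentation of the multi-scale random walk and should be treated as assumptions of this sketch, as should the names `parent`, `depth`, and `mrw_losses`.

```python
import math
import random


def parent(t):
    """Parent of round t in the walk: t minus the largest power of two dividing t."""
    return t - (t & -t)


def depth(t):
    """Number of ancestors of round t before reaching round 0."""
    d = 0
    while t > 0:
        t = parent(t)
        d += 1
    return d


def mrw_losses(num_rounds, num_arms, seed=0):
    """Generate losses = clipped(random walk + 1/2 - gap on a hidden best arm)."""
    rng = random.Random(seed)
    log_t = max(1.0, math.log2(num_rounds))
    sigma = 1.0 / (9.0 * log_t)  # assumed step scale
    gap = num_arms ** (1 / 3) * num_rounds ** (-1 / 3) / (9.0 * log_t)  # assumed gap
    best = rng.randrange(num_arms)
    walk = [0.0] * (num_rounds + 1)
    losses = []
    for t in range(1, num_rounds + 1):
        # Each round perturbs its parent's value, not its predecessor's.
        walk[t] = walk[parent(t)] + rng.gauss(0.0, sigma)
        row = []
        for x in range(num_arms):
            raw = walk[t] + 0.5 - (gap if x == best else 0.0)
            row.append(min(1.0, max(0.0, raw)))  # clip into [0, 1]
        losses.append(row)
    return losses, best
```

Because every arm shares the same walk, observing a single arm over time says little about which arm carries the gap; comparing two arms requires playing both, that is, switching. The depth of any round is the number of one-bits in its binary representation, which is what keeps the walk's variance logarithmic in $T$.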

Extending this technique, analogous lower bounds are proven for combinatorial bandits ($I$ items active per round), semi-bandit feedback, and feedback graphs. Arguments based on Pinsker's inequality and the chain rule for KL divergence quantify the amount of information gained per switch and force the trade-off with regret (Dong et al., 2024, Rangi et al., 2018, Arora et al., 2019).

5. Extensions: Feedback Graphs, Metric Costs, and Combinatorial Actions

  • Feedback Graphs: Regret scales as $\widetilde O(\gamma(G)^{1/3} T^{2/3})$, where $\gamma(G)$ is the domination number, attained by constructing adaptive mini-batch OMD-based algorithms sensitive to the graph structure (Rangi et al., 2018, Arora et al., 2019).
  • Metric/General Movement Costs: When switching penalties are governed by a metric $\Delta(i,j)$, the optimal regret is $\widetilde O(\mathcal{C}^{1/3}T^{2/3})$, where $\mathcal{C}$ is the relevant covering number (the number of “effectively distinguishable” arms at metric scale) (Koren et al., 2017, Koren et al., 2017). Efficient algorithms use HST (hierarchically separated tree) approximations. In infinite metric spaces with Lipschitz losses, the rate $\widetilde\Theta(T^{\frac{d+1}{d+2}})$ is minimax optimal.
  • Combinatorial Bandits: For combinatorial arms (action sets of size $I$ from $K$ base arms), the minimax regret under a per-base-arm switching cost $\lambda$ is $\widetilde\Omega((\lambda K)^{1/3}(T I)^{2/3})$ (bandit feedback) or $\widetilde\Omega((\lambda K I)^{1/3}T^{2/3})$ (semi-bandit), with corresponding batch-based algorithms nearly matching these rates (Dong et al., 2024).
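To make the role of the domination number $\gamma(G)$ concrete, the sketch below computes a greedy dominating set of a feedback graph. It is illustrative only: the greedy rule gives an upper bound on $\gamma(G)$ rather than its exact value, and the representation of the graph as a dictionary `observes` (arm to set of arms whose losses it reveals) is an assumption of this example, not a convention from the cited papers.

```python
def greedy_dominating_set(observes):
    """Greedily pick arms until every arm is observed by some chosen arm.

    observes[i] is the set of arms whose losses are revealed when arm i is
    played; each arm is treated as observing itself. The size of the result
    upper-bounds the domination number of the feedback graph.
    """
    uncovered = set(observes)
    chosen = []
    while uncovered:
        # Pick the arm that reveals the most still-uncovered arms.
        best = max(
            sorted(observes),
            key=lambda i: len((observes[i] | {i}) & uncovered),
        )
        chosen.append(best)
        uncovered -= observes[best] | {best}
    return chosen
```

In the pure bandit case every arm observes only itself, so the set has size $K$ and the bound reduces to $K^{1/3}T^{2/3}$; in the full-information case a single arm dominates the graph.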

6. Contrasts with Full-Information and Other Bandit Regimes

Switching costs induce phase transitions not seen in classical MAB. In full-information (expert) settings, batching is unnecessary and minimax regret remains $\Theta(\sqrt{T})$ even with switching penalties (Cesa-Bianchi et al., 2013). In bandit regimes, the need to pay a cost to explore creates a hard trade-off between sufficient identification of optimal arms and cost minimization, inflating the regret exponent from $1/2$ to $2/3$ in $T$.
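A quick numerical comparison shows how large the gap between exponents $1/2$ and $2/3$ becomes. The sketch below evaluates the leading-order rates only, ignoring constants and logarithmic factors hidden in the $\widetilde\Theta$ notation; the function names are hypothetical, and the $\sqrt{KT}$ rate for standard bandits without switching costs is included as a well-known reference point.

```python
import math


def full_info_rate(num_rounds, num_arms):
    """Leading-order rate for experts (with or without switching costs)."""
    return math.sqrt(num_rounds * math.log(num_arms))


def bandit_rate(num_rounds, num_arms):
    """Leading-order rate for standard bandits without switching costs."""
    return math.sqrt(num_arms * num_rounds)


def switching_bandit_rate(num_rounds, num_arms):
    """Leading-order rate for bandits with unit switching costs (K <= T)."""
    return num_arms ** (1 / 3) * num_rounds ** (2 / 3)
```

The ratio of the switching-cost rate to the standard bandit rate is $(T/K)^{1/6}$, so the penalty for paying to explore grows polynomially with the horizon.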

For stochastic or “stochastically constrained” adversarial regimes, the best achievable regret interpolates between $O(\log T/\Delta^2)$ and $O(T^{2/3})$, depending on the gap parameter and the cost (Amir et al., 2022, Rouyer et al., 2021). In the stochastic regime with hard switch budgets, the optimal regret curve exhibits sharp phase transitions in the exponent as the budget passes specific thresholds, with each phase corresponding to how many full sweeps through the arms are permitted (Simchi-Levi et al., 2019).

7. Applications and Broader Implications

Bandits with switching costs model a broad array of online learning settings where adaptivity itself is expensive, including resource scheduling with setup costs, adaptive pricing under buyer patience, online routing and caching, and partial monitoring. They also subsume learning in adversarial MDPs with bandit feedback, for which the minimax regret is $\widetilde\Theta(T^{2/3})$ (Dekel et al., 2013). The same principles extend to Markovian bandits, where index-based policies (such as those computed by computationally efficient variants of the Asawa–Teneketzis index) become only approximately optimal, with no general exact index rule (Niño-Mora, 2023, Li et al., 2021). The analysis and techniques have inspired advances in feedback graph learning, combinatorial optimization under exploration constraints, and sequential decision problems with limited adaptivity.

