Bandits with Switching Costs

Updated 8 March 2026

Bandits with switching costs are online learning problems where each change in action incurs a penalty, complicating the trade-off between exploration and exploitation.
Minimax regret in the adversarial setting scales as $\widetilde\Theta(K^{1/3}T^{2/3})$, compared with $\Theta(\sqrt{KT})$ for standard multi-armed bandits.
Algorithmic techniques such as batching and mini-batching are employed to limit costly switches, striking a balance between information gathering and incurred penalties.

Bandits with switching costs refer to a class of online learning and sequential decision problems in which the learner, as in classical multi-armed bandits (MAB), chooses actions (arms) over a sequence of rounds, but now each change (switch) of action incurs an explicit and non-negligible cost. This formulation brings into focus the fundamental dilemma between exploration, which often necessitates switches to new actions to gather information, and the direct penalty levied for each such switch. The resulting model captures a wide range of scenarios—from adversarial and stochastic MAB to online learning with structured feedback graphs and general movement metrics—where control of adaptivity, exploration, and incurred cost is paramount. The minimax regret rates, which fundamentally differ from those in the classic bandit or expert settings, display distinctive scaling laws and complexity thresholds.

1. Formal Model and Regret Formulation

Let the time horizon be $T$ and the set of available actions (arms) have cardinality $K$. At each round $t$, the learner selects an action $X_t \in [K]$. The adversary, which can be oblivious or have limited adaptivity, fixes a loss sequence:

$\ell_t \colon [K] \to [0,1], \qquad \text{for } t=1, \ldots, T.$

The learner pays the instantaneous loss $\ell_t(X_t)$ and a unit cost whenever she switches:

$C_T = \sum_{t=1}^T \ell_t(X_t) + \sum_{t=1}^T \mathbf{1}_{X_t \neq X_{t-1}},$

with $X_0$ a dummy null action, so $X_1$ always counts as a switch. Regret is measured against the best fixed arm in hindsight:

$R_T = C_T - \min_{x \in [K]} \sum_{t=1}^T \ell_t(x).$

The minimax regret is the worst-case expected regret achievable by any (possibly randomized) learning strategy:

$\mathcal{R}_T = \inf_{\text{learner}} \sup_{\ell_{1:T}} \mathbb{E}[R_T].$

This formalism generalizes naturally to arbitrary movement cost functions $\Delta(i,j)$ (incurred for $X_{t-1} \to X_t$), combinatorial arms, feedback graphs, and more (Dekel et al., 2013, Koren et al., 2017, Rangi et al., 2018, Arora et al., 2019, Dong et al., 2024).
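The definitions of $C_T$ and $R_T$ can be checked numerically on a small example. The following is a minimal sketch, assuming the losses are given as a $T \times K$ array and arms are indexed from 0; the function name `switching_regret` and the `switch_cost` parameter are illustrative and not part of the cited papers.

```python
import numpy as np

def switching_regret(losses, actions, switch_cost=1.0):
    """Regret R_T and switch count for a fixed action sequence.

    losses:  (T, K) array with entries in [0, 1]
    actions: length-T sequence of arm indices in {0, ..., K-1}
    """
    losses = np.asarray(losses, dtype=float)
    actions = np.asarray(actions, dtype=int)
    T = losses.shape[0]
    # Sum of instantaneous losses ell_t(X_t).
    play_loss = losses[np.arange(T), actions].sum()
    # X_0 is a dummy null action, so round 1 always counts as a switch.
    n_switches = 1 + int(np.count_nonzero(actions[1:] != actions[:-1]))
    total_cost = play_loss + switch_cost * n_switches
    # Comparator: best fixed arm in hindsight (no switching cost charged).
    best_fixed = losses.sum(axis=0).min()
    return total_cost - best_fixed, n_switches

# Toy instance: arm 0 is better early, arm 1 is better late.
losses = np.array([[0.2, 0.9], [0.1, 0.8], [0.9, 0.3], [0.7, 0.1]])
regret, n_switches = switching_regret(losses, [0, 0, 1, 1])
print(regret, n_switches)
```

In this toy instance the learner's loss is below that of the best fixed arm, yet the two unit switching costs still leave it with positive regret.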

2. Minimax Regret: Scaling Laws and Lower Bounds

In adversarial bandits with unit switching costs, the central result is that minimax regret scales as $\widetilde\Theta(K^{1/3} T^{2/3})$ for $K \le T$ (Dekel et al., 2013): $\mathcal{R}_T = \widetilde\Theta\bigl(K^{1/3}T^{2/3}\bigr).$ This scaling is tight, as proven by matching upper and lower bounds. The lower bound employs a randomized adversarial construction that hides the identity of the optimal arm behind a carefully designed multi-scale random walk (MRW) process. Information-theoretic arguments show that identifying the correct arm with sufficient confidence costs at least $\Omega(K^{1/3}T^{2/3})$ in either regret from ignorance or in cumulative switching penalties (Dekel et al., 2013, Cesa-Bianchi et al., 2013).

By contrast, the full-information (expert) version, even with switching costs, admits a much smaller rate:

$\text{Full information (experts with switching costs): } \Theta(\sqrt{T \log K}).$

The gap reflects the essential difficulty of exploration under bandit feedback when switching is penalized. Partial information settings with graph-based feedback (generalizing the expert and bandit extremes) yield minimax regret in terms of structural graph invariants such as the independence number $\alpha(G)$ or, more precisely, the domination number $\gamma(G)$ (Rangi et al., 2018, Arora et al., 2019):

$R_T = \widetilde O\bigl(\gamma(G)^{1/3} T^{2/3}\bigr).$

Metric movement costs and combinatorial actions lead to regret scaling in terms of covering number-like complexity measures, e.g., for movement cost metric $\Delta$ with covering number $\mathcal{C}$:

$\text{Regret } = \widetilde O\bigl(\max\{\mathcal{C}^{1/3} T^{2/3},\ \sqrt{KT}\}\bigr).$

When the arms index a continuous metric space of Minkowski dimension $d$ and the adversary is Lipschitz, the rate becomes $\widetilde\Theta(T^{\tfrac{d+1}{d+2}})$ (Koren et al., 2017).

3. Algorithmic Techniques: Batching, Mini-batching, and Perturbed Leaders

Optimal rates are achieved by algorithms that explicitly limit the total number of switches via batching structures. The central paradigm is to partition the horizon into $M$ epochs of length $L$, play a single arm throughout each epoch, and run a bandit subroutine (such as Follow-The-Perturbed-Leader, FPL, or an Exp3-type method) over the $M$ epochs, feeding it the average loss of each epoch so that switches occur only at epoch boundaries (Dekel et al., 2013, Altschuler et al., 2018):

The subroutine's regret over $M$ epochs is $O(\sqrt{MK\log K})$ in units of epoch-averaged loss, which is $O(L\sqrt{MK\log K}) = O(\sqrt{TLK\log K})$ in actual loss.
Each epoch boundary incurs at most one switch, contributing at most $M = T/L$ to the total cost.
Balancing the two terms, $\sqrt{TLK\log K} = T/L$, yields $L \propto (T/(K\log K))^{1/3}$, $M = T/L \propto (K\log K)^{1/3}T^{2/3}$, and

$\mathcal{R}_T = O((K\log K)^{1/3} T^{2/3}).$

Epoch-based FPL and mini-batched Tsallis-INF are prototypical examples (Dekel et al., 2013, Amir et al., 2022, Rouyer et al., 2021). For more general structures (e.g., partial-information graphs or movement metrics), variants of EXP3/EXP4, log-barrier mirror descent, or John’s exploration are combined with adaptive batching and specialized loss estimators to attain optimal rates (Rangi et al., 2018, Koren et al., 2017, Dong et al., 2024).
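The batching idea can be sketched in a few lines. The code below is a minimal illustration, assuming an Exp3-style exponential-weights subroutine with importance-weighted estimates of the batch-averaged loss; it is not the exact algorithm of any cited paper, and the names `batched_exp3`, `batch_len`, and `eta` are hypothetical. The batch length is left as a parameter to be tuned against the switching cost.

```python
import numpy as np

def batched_exp3(losses, batch_len, eta, seed=0):
    """Play one arm per batch; update Exp3 weights once per batch.

    losses: (T, K) array in [0, 1], fixed in advance (oblivious adversary).
    Returns the length-T array of arms played.
    """
    losses = np.asarray(losses, dtype=float)
    T, K = losses.shape
    rng = np.random.default_rng(seed)
    cum_est = np.zeros(K)              # cumulative estimated batch losses
    actions = np.empty(T, dtype=int)
    for start in range(0, T, batch_len):
        stop = min(start + batch_len, T)
        # Exponential weights (shifted by the minimum for numerical stability).
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        arm = int(rng.choice(K, p=p))
        actions[start:stop] = arm      # no switch inside a batch
        # Bandit feedback: only the played arm's average batch loss is seen.
        avg_loss = losses[start:stop, arm].mean()
        cum_est[arm] += avg_loss / p[arm]   # importance-weighted estimate
    return actions

# Illustrative run on random losses (all parameter values are assumptions).
T, K, batch_len = 1000, 4, 10
M = int(np.ceil(T / batch_len))
demo_losses = np.random.default_rng(1).uniform(size=(T, K))
actions = batched_exp3(demo_losses, batch_len, eta=np.sqrt(np.log(K) / (K * M)))
n_switches = 1 + int(np.count_nonzero(actions[1:] != actions[:-1]))
print(n_switches, "switches over", M, "batches")
```

By construction the number of switches never exceeds the number of batches, which is what turns the choice of batch length into a direct trade-off between estimation error and switching cost.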

4. Information-Theoretic Lower Bounds: The Multi-Scale Random Walk

The core lower bound construction employs an adversarial process in which the losses for each arm are constructed as a sum of a multi-scale random walk and a small bias (the optimality gap), with the best arm hidden. The MRW process has logarithmic depth and width, ensuring:

All arms share the same random walk, so their losses are highly correlated: observing a single arm without switching reveals essentially nothing about which arm carries the bias.
Each switch uncovers at most $O(\sqrt{\log T})$ bits; identifying the optimal arm with gap $\epsilon$ requires $\Omega(1/\epsilon^2)$ bits, entailing $\Omega(K^{1/3}T^{2/3}/\log T)$ total switches for the minimax regime.

The interplay of partial monitoring, the exploration required to learn the identity of the best arm, and the penalization of adaptivity by the switching cost is what fundamentally creates the $T^{2/3}$ minimax frontier (Dekel et al., 2013).
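A loss sequence of this kind can be generated as follows. This is a minimal sketch, assuming the standard form of the construction in which the walk value at round $t$ is obtained from its "parent" round $\rho(t) = t - 2^{\delta(t)}$, with $\delta(t)$ the largest $i$ such that $2^i$ divides $t$, plus fresh Gaussian noise. The function names, the clipping to $[0,1]$, and the constants in the example call are illustrative assumptions rather than the exact choices of the cited papers.

```python
import numpy as np

def mrw_parent(t):
    """rho(t) = t - 2**delta(t); t & -t isolates the lowest set bit of t."""
    return t - (t & -t)

def mrw_losses(T, K, eps, sigma, seed=0):
    """Losses = shared multi-scale random walk + 1/2, minus eps on a hidden arm."""
    rng = np.random.default_rng(seed)
    best = int(rng.integers(K))                 # hidden optimal arm
    xi = rng.normal(0.0, sigma, size=T + 1)     # fresh Gaussian increments
    walk = np.zeros(T + 1)
    for t in range(1, T + 1):
        # Each round has at most about log2(T) ancestors, which keeps
        # the walk's range small (logarithmic depth).
        walk[t] = walk[mrw_parent(t)] + xi[t]
    losses = np.tile(walk[1:, None] + 0.5, (1, K))   # same walk for every arm
    losses[:, best] -= eps                           # small bias on the best arm
    return np.clip(losses, 0.0, 1.0), best

# Example call; the scalings of eps and sigma are illustrative assumptions.
T, K = 1024, 4
sigma = 1.0 / (9 * np.log2(T))
eps = K ** (1 / 3) * T ** (-1 / 3) / (9 * np.log2(T))
mrw, best = mrw_losses(T, K, eps, sigma)
```

Because every arm follows the same walk, the loss observed on one arm at one round says little about the bias; the learner has to compare arms at nearby rounds, and each such comparison requires a switch.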

Extending this technique, analogous lower bounds are proven for combinatorial bandits ($I$ items active per round), semi-bandit feedback, and feedback graphs. Pinsker's inequality and chain-rule KL-divergence arguments quantify the amount of information per switch and force the trade-off with regret (Dong et al., 2024, Rangi et al., 2018, Arora et al., 2019).

5. Extensions: Feedback Graphs, Metric Costs, and Combinatorial Actions

Feedback Graphs: Regret scales as $\widetilde O(\gamma(G)^{1/3} T^{2/3})$, where $\gamma(G)$ is the domination number, attained by constructing adaptive mini-batch OMD-based algorithms sensitive to the graph structure (Rangi et al., 2018, Arora et al., 2019).
Metric/General Movement Costs: When switching penalties are governed by a metric $\Delta(i,j)$, the optimal regret is $\widetilde O(\max\{\mathcal{C}^{1/3}T^{2/3},\ \sqrt{KT}\})$, where $\mathcal{C}$ is the relevant covering number (the number of “effectively distinguishable” arms at the relevant metric scale) (Koren et al., 2017, Koren et al., 2017). Efficient algorithms use HST (hierarchically separated tree) approximations. In infinite metric spaces of Minkowski dimension $d$ with Lipschitz losses, the rate $\widetilde\Theta(T^{\frac{d+1}{d+2}})$ is minimax optimal.
Combinatorial Bandits: For combinatorial arms (action sets of size $I$ from $K$ base arms), the minimax regret under a per-base-arm switching cost $\lambda$ is lower bounded by $\tilde{\Omega}((\lambda K)^{1/3}(T I)^{2/3})$ (bandit feedback) or $\tilde{\Omega}((\lambda K I)^{1/3}T^{2/3})$ (semi-bandit), with corresponding batch-based algorithms nearly matching these rates (Dong et al., 2024).
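To make the role of the domination number concrete, the sketch below computes a dominating set of a feedback graph with the classical greedy heuristic, which gives an upper bound on $\gamma(G)$ (computing $\gamma(G)$ exactly is NP-hard in general). It assumes the graph is given as a list where entry $i$ holds the arms whose losses are revealed when arm $i$ is played, and that every arm observes itself; the function name is hypothetical and the routine is not taken from the cited papers.

```python
def greedy_dominating_set(observes):
    """Greedy dominating set of a feedback graph.

    observes[i]: iterable of arms whose losses are revealed by playing arm i.
    Each arm is assumed to also observe its own loss.
    """
    K = len(observes)
    closed = [set(nbrs) | {i} for i, nbrs in enumerate(observes)]
    uncovered = set(range(K))
    chosen = []
    while uncovered:
        # Pick the arm that reveals the most not-yet-covered arms.
        i = max(range(K), key=lambda j: len(closed[j] & uncovered))
        chosen.append(i)
        uncovered -= closed[i]
    return chosen

# Two extremes with K = 5 arms:
star = [{1, 2, 3, 4}, set(), set(), set(), set()]   # arm 0 reveals everything
bandit = [set() for _ in range(5)]                  # each arm reveals only itself
print(len(greedy_dominating_set(star)), len(greedy_dominating_set(bandit)))
```

The bandit extreme needs all $K$ arms in the dominating set and recovers the $K^{1/3}T^{2/3}$ rate, while a graph in which one arm reveals every loss has a dominating set of size one.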

6. Contrasts with Full-Information and Other Bandit Regimes

Switching costs induce phase transitions not seen in classical MAB. In full-information (expert) settings, batching is unnecessary and minimax regret remains $\Theta(\sqrt{T \log K})$ even with switching penalties (Cesa-Bianchi et al., 2013). In bandit regimes, the need to pay a cost to explore creates a hard trade-off between sufficient identification of optimal arms and cost minimization, inflating the regret exponent from $1/2$ to $2/3$ in $T$.
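A quick numerical comparison shows how large the gap between the exponents is in practice. The sketch below evaluates only the leading-order terms, with constants and logarithmic factors in $T$ dropped, so the numbers are orders of magnitude rather than actual regret values. The $\sqrt{KT}$ term is the standard minimax rate for adversarial bandits without switching costs, and the function name is illustrative.

```python
import math

def leading_rates(T, K):
    """Leading-order regret terms (constants and log T factors dropped)."""
    return {
        "experts_with_switching": math.sqrt(T * math.log(K)),
        "bandit_no_switching": math.sqrt(K * T),
        "bandit_with_switching": K ** (1 / 3) * T ** (2 / 3),
    }

for T in (10**4, 10**6, 10**8):
    rates = leading_rates(T, K=10)
    print(T, {name: round(value) for name, value in rates.items()})
```

The ratio between the switching-cost bandit rate and the standard bandit rate is $T^{1/6}/K^{1/6}$, so the penalty for costly exploration keeps growing with the horizon.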

For stochastic or “stochastically constrained” adversarial regimes, the best achievable regret interpolates between $O(\log T/\Delta^2)$ and $O(T^{2/3})$, depending on the gap parameter $\Delta$ (here the suboptimality gap between arms, not the movement cost of Section 1) and the switching cost (Amir et al., 2022, Rouyer et al., 2021). In the stochastic regime with hard switch budgets, the optimal regret curve exhibits sharp phase transitions in the exponent as the budget passes specific thresholds, with each phase corresponding to how many full sweeps through the arms are permitted (Simchi-Levi et al., 2019).

7. Applications and Broader Implications

Bandits with switching costs model a broad array of online learning settings where adaptivity itself is expensive, including resource scheduling with setup costs, adaptive pricing under buyer patience, online routing and caching, and partial monitoring. They are also a special case of learning in adversarial MDPs with bandit feedback, so the lower bound carries over and the minimax regret there is $\widetilde\Theta(T^{2/3})$ (Dekel et al., 2013). The same principles extend to Markovian bandits, where index-based policies (such as those computed by computationally efficient variants of the Asawa–Teneketzis index) become only approximately optimal, with no general exact index rule (Niño-Mora, 2023, Li et al., 2021). The analysis and techniques have inspired advances in feedback graph learning, combinatorial optimization under exploration constraints, and sequential decision problems with limited adaptivity.

References:

"Bandits with Switching Costs: $T^{2/3}$ Regret" (Dekel et al., 2013)
"Online Learning with Switching Costs and Other Adaptive Adversaries" (Cesa-Bianchi et al., 2013)
"Online learning with feedback graphs and switching costs" (Rangi et al., 2018)
"Multi-Armed Bandits with Metric Movement Costs" (Koren et al., 2017)
"A faster index algorithm and a computational study for bandits with switching costs" (Niño-Mora, 2023)
"Online learning over a finite action set with limited switching" (Altschuler et al., 2018)
"Phase Transitions in Bandits with Switching Constraints" (Simchi-Levi et al., 2019)
"Bandits with Feedback Graphs and Switching Costs" (Arora et al., 2019)
"Adversarial Combinatorial Bandits with Switching Costs" (Dong et al., 2024)
"Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits" (Luo et al., 2022)
"Bandits with Movement Costs and Adaptive Pricing" (Koren et al., 2017)
"Multinomial Logit Bandit with Low Switching Cost" (Dong et al., 2020)
"An Algorithm for Stochastic and Adversarial Bandits with Switching Costs" (Rouyer et al., 2021)
"Better Best of Both Worlds Bounds for Bandits with Switching Costs" (Amir et al., 2022)
"Multi-token Markov Game with Switching Costs" (Li et al., 2021)