Regime-Switching Bandits
- Regime-switching bandits are sequential decision models that account for abrupt or hidden changes in reward distributions, extending classical stationary bandit frameworks.
- They include piecewise-stationary models with change-point detection and latent Markov-modulated approaches leveraging spectral methods for robust parameter estimation.
- Algorithms like PrudentBandits and SEEU provide theoretical guarantees with logarithmic or sublinear regret, adapting dynamically to nonstationary conditions.
Regime-switching bandits are sequential decision-making models in which the reward distributions of available arms evolve according to an underlying process that features abrupt or latent transitions between discrete regimes. Such models subsume classical stochastic bandit settings as special cases when there is no regime change, and generalize to various nonstationary environments prevalent in information systems, financial markets, and adaptive control. Two principal and widely studied instantiations are the piecewise-stationary (change-point) model, wherein abrupt reward changes occur at unknown time points, and Markov-modulated bandits, in which arm rewards are determined by an unobserved, often ergodic, Markov process. These frameworks require online learning algorithms to dynamically balance exploration and exploitation, while tracking or detecting regime changes and adapting policies in the presence of nonstationarity.
1. Formal Models of Regime Switching
Regime-switching bandits can be formalized through two main paradigms: (i) piecewise-stationary (change-point) models, and (ii) latent Markov-modulated reward models.
(a) Piecewise-Stationary (Switching) Bandits
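Let $\mathcal{K} = \{1, \dots, K\}$ denote the arm set. At each time $t \in \{1, \dots, T\}$, pulling arm $k_t \in \mathcal{K}$ yields a reward $X_t$ with $\mathbb{E}[X_t] = \mu_{k_t}(t)$. The horizon is partitioned by unknown breakpoints creating regimes; within each, $\mu_k(t)$ is constant for all $k \in \mathcal{K}$. The number of switches is denoted $\Upsilon_T$. The canonical regret is $R_T = \sum_{t=1}^{T} \bigl(\mu^*(t) - \mu_{k_t}(t)\bigr)$ with $\mu^*(t) = \max_{k \in \mathcal{K}} \mu_k(t)$ (Manegueu et al., 2021).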
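The piecewise-stationary model and its dynamic regret (the cumulative difference between the best mean at each round and the mean of the pulled arm) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: Bernoulli rewards, 0-indexed rounds, and the helper names `segment_means` and `dynamic_regret` are all hypothetical choices for illustration, not taken from the cited work.

```python
import random

def segment_means(t, breakpoints, means_per_segment):
    """Return the mean vector in force at round t (0-indexed)."""
    seg = sum(1 for b in breakpoints if t >= b)  # number of breakpoints already passed
    return means_per_segment[seg]

def dynamic_regret(policy, breakpoints, means_per_segment, horizon, seed=0):
    """Run `policy` against Bernoulli arms with piecewise-constant means and
    return the cumulative dynamic pseudo-regret."""
    rng = random.Random(seed)
    regret = 0.0
    history = []  # (arm, reward) pairs visible to the policy
    for t in range(horizon):
        mu = segment_means(t, breakpoints, means_per_segment)
        arm = policy(t, history)
        reward = 1.0 if rng.random() < mu[arm] else 0.0
        history.append((arm, reward))
        regret += max(mu) - mu[arm]  # gap to the best arm of the current regime
    return regret

# Two arms, one switch at t = 500 that swaps the better arm.
breakpoints = [500]
means = [[0.8, 0.2], [0.2, 0.8]]
always_arm_0 = lambda t, history: 0
total = dynamic_regret(always_arm_0, breakpoints, means, horizon=1000)
print(round(total, 6))
```

A policy that never leaves arm 0 incurs no regret before the switch and a gap of 0.6 on each of the 500 rounds after it, which shows why a stationary policy suffers linear regret once a regime change goes undetected.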
(b) Hidden Markov Model (HMM)-Modulated Bandits
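Here, a latent, finite-state, ergodic Markov chain $\{M_t\}_{t \ge 1}$, with unknown transition matrix $P$ and states $\mathcal{M} = \{1, \dots, M\}$, controls the reward distribution:
- At time $t$, state $M_t = m \in \mathcal{M}$.
- Pulling arm $k$ yields a reward drawn from $Q(\cdot \mid m, k)$, mean $\mu_{m,k}$.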
- The agent observes only rewards, not the current state (Zhou et al., 2020).
This framework generalizes the piecewise-stationary model and captures persistent latent regime dynamics frequently observed in finance or networked control.
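A Markov-modulated environment of this kind can be simulated in a few lines. The sketch below assumes Bernoulli rewards and a chain that advances one step after every pull; the class name `MarkovModulatedBandit` and the numerical values are hypothetical and chosen only for illustration.

```python
import random

class MarkovModulatedBandit:
    """Bernoulli bandit whose arm means depend on a hidden Markov state."""

    def __init__(self, transition, means, seed=0):
        self.transition = transition  # transition[m][m2] = P(next = m2 | current = m)
        self.means = means            # means[m][k] = mean reward of arm k in state m
        self.rng = random.Random(seed)
        self.state = 0                # hidden from the agent

    def pull(self, arm):
        """Return a reward for `arm`, then advance the hidden state."""
        reward = 1.0 if self.rng.random() < self.means[self.state][arm] else 0.0
        row = self.transition[self.state]
        self.state = self.rng.choices(range(len(row)), weights=row)[0]
        return reward

env = MarkovModulatedBandit(
    transition=[[0.95, 0.05], [0.10, 0.90]],  # persistent regimes
    means=[[0.9, 0.1], [0.2, 0.7]],           # the best arm differs across states
)
rewards = [env.pull(arm=0) for _ in range(10)]
```

Because the best arm depends on the hidden state, an agent that only sees `rewards` has to infer the regime from the reward sequence itself.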
2. Algorithms for Regime-Switching Bandits
(a) Algorithms for Piecewise-Stationary Bandits
The PrudentBandits algorithm (Manegueu et al., 2021) is a unified method for various nonstationary regimes, specializing to switching bandits with a known (or overestimated) number of stationary segments. It operates as follows:
- Maintains an active set of arms, based on when each arm was last pulled and a lower bound on the reward gap.
- Pulls each arm in the active set once per round; updates sample-based statistics.
- Performs change-point detection by comparing pairwise estimated gaps between subintervals within each episode; triggers episode reset upon significant discrepancy.
- Achieves detection with high probability and avoids spurious resets in the piecewise-constant case.
(b) Spectral and Belief-state Algorithms for Latent Markov Regimes
The SEEU (Spectral Exploration+UCB Exploitation) algorithm (Zhou et al., 2020) integrates spectral learning and belief-state planning:
- Alternates fixed-length exploration phases (uniform random arm selection) for HMM parameter estimation via method-of-moments tensor decompositions.
- In exploitation phases, computes the most optimistic model in the confidence region (in mean matrix $\mu$ and transition matrix $P$) and then follows the optimal belief-state policy for the induced POMDP.
- Belief state is recursively updated using Bayesian filtering; confidence sets shrink with further exploration.
(c) Conformal Prediction with Regime Detection
The Conformal Bandit framework (Cuonzo et al., 10 Dec 2025) integrates Conformalized Quantile Regression (CQR) for finite-time coverage with online regime-inference using a fitted HMM (with states, e.g., “Bull”, “Neutral”, “Bear” for financial returns):
- At each round, uses the current HMM posterior over regimes to guide both prediction intervals and arm selection via regime-adaptive policies (seeking upside in bullish or neutral regimes, protecting against downside in bearish ones).
- Provides robust statistical coverage and adapts exploration according to changing regimes.
3. Regret Analysis and Theoretical Guarantees
(a) Piecewise-Stationary Bandits
In the switching case, PrudentBandits admits a regret bound that, up to logarithmic factors, matches the minimax lower bound for switching bandits shown in prior work (Garivier–Moulines ’11, etc.) (Manegueu et al., 2021). The bound holds under bounded and well-separated gaps with a known number of stationary segments.
(b) Markov-Modulated Bandits
For the latent regime case, SEEU achieves, with high probability, a regret bound $R_T \le C\, T^{2/3} \sqrt{\log T}$, with $C$ a constant determined by model parameters, mixing rates, and spectral identifiability (Zhou et al., 2020). The proof decomposes regret into exploration, exploitation, and model/belief estimation error, with spectral error propagation controlled via Lipschitz bounds in the belief state. In the reported finite-sample experiments, SEEU is the only method that achieves sublinear regret; the other methods exhibit linear regret because they cannot adapt to latent regime transitions.
(c) Conformal Regime-Aware Guarantees
The Regime-Aware Conformal Bandit policy provides finite-sample coverage at the nominal level $1 - \alpha$ for all arms and rounds under Adaptive Conformal Inference, irrespective of regime changes (Cuonzo et al., 10 Dec 2025). Empirical results show near-logarithmic regret and high risk-adjusted returns in financial switching environments, though no explicit closed-form regret bound is established for the regime-switching setting.
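The Adaptive Conformal Inference mechanism can be sketched with its standard online update, in which the working miscoverage level moves toward the target after each covered or missed outcome. The example below is an illustration under stated assumptions, not the cited policy: it uses absolute-residual scores rather than Conformalized Quantile Regression, and the names `conformal_quantile`, `aci_step`, and the step size are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Empirical (1 - alpha) quantile of past scores with the usual (n + 1)
    finite-sample correction; infinite when too few scores are available."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf   # interval covers everything
    if rank < 1:
        return -math.inf  # interval is empty
    return sorted(scores)[rank - 1]

def aci_step(alpha_t, target_alpha, covered, step_size=0.05):
    """ACI update: lower alpha_t after a miss (wider intervals next round),
    raise it slightly after a covered outcome."""
    err = 0.0 if covered else 1.0
    return alpha_t + step_size * (target_alpha - err)

alpha_t, target = 0.1, 0.1
scores = []
stream = [(0.0, 0.2), (0.0, -0.1), (0.0, 0.3), (0.0, 2.5)]  # (prediction, outcome) pairs
for pred, y in stream:
    q = conformal_quantile(scores, alpha_t)
    covered = abs(y - pred) <= q       # interval is [pred - q, pred + q]
    alpha_t = aci_step(alpha_t, target, covered)
    scores.append(abs(y - pred))
```

The update uses only whether the last interval covered the outcome, so it makes no assumption that the data are stationary; this is the property that lets coverage be tracked across regime changes.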
4. Learning Regimes: Change-Point Detection and Hidden State Inference
Change-Point Detection in Piecewise-Stationary Models
Algorithms such as PrudentBandits perform pairwise gap-based change-point tests between episode subintervals to detect abrupt changes. Given two time intervals and an arm $k$, if the difference in estimated average gaps exceeds adaptive thresholds, an episode reset is triggered. In piecewise-constant models, this detects regime changes while avoiding spurious resets with high probability (Manegueu et al., 2021).
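The idea of a gap-based test can be illustrated with a simplified two-interval comparison. This is a sketch, not the PrudentBandits test itself: it assumes rewards in [0, 1], uses a fixed Hoeffding radius in place of the paper's adaptive thresholds, and the function names are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def hoeffding_radius(n, delta):
    """Half-width of a two-sided Hoeffding interval for n samples in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def gap_changed(arm_a_first, arm_b_first, arm_a_second, arm_b_second, delta=0.01):
    """Flag a change when the estimated gap between two arms differs across two
    intervals by more than the sum of the four confidence radii."""
    gap_first = mean(arm_a_first) - mean(arm_b_first)
    gap_second = mean(arm_a_second) - mean(arm_b_second)
    threshold = sum(
        hoeffding_radius(len(xs), delta)
        for xs in (arm_a_first, arm_b_first, arm_a_second, arm_b_second)
    )
    return abs(gap_first - gap_second) > threshold

# Toy data: the two arms swap roles between the first and second interval.
ones, zeros = [1.0] * 200, [0.0] * 200
print(gap_changed(ones, zeros, zeros, ones))  # gaps +1 and -1
print(gap_changed(ones, zeros, ones, zeros))  # identical gaps
```

When the true gap is the same in both intervals, a union bound over the four sample means gives a false-alarm probability of at most `4 * delta`, which is the sense in which such a test avoids spurious resets.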
Hidden Markov Model (HMM) Estimation
In the Markov-modulated model, learning requires
- Estimating transition and reward parameters using method-of-moments tensor decomposition during exploration phases (Zhou et al., 2020).
- Recursively updating beliefs via HMM forward filters.
- Adapting confidence sets on model parameters and planning optimistically to minimize regret under uncertainty in both current regime and arm rewards.
Adaptive detection of regime changes is performed via these estimators, enabling the learning policy to exploit variations in the underlying state sequence.
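The recursive belief update is the standard HMM forward filter: a correction step that reweights states by the likelihood of the observed reward, followed by a prediction step through the transition matrix. The sketch below assumes Bernoulli rewards and known (or already estimated) parameters; the function name `belief_update` and the numerical values are hypothetical.

```python
def belief_update(belief, arm, reward, transition, means):
    """One step of the HMM forward filter for Bernoulli rewards.

    belief[m] is P(state = m | history) before the pull; the return value is
    the belief over the next state after observing `reward` from `arm`."""
    n = len(belief)
    # Correction: weight each state by the likelihood of the observed reward.
    weighted = [
        belief[m] * (means[m][arm] if reward == 1 else 1.0 - means[m][arm])
        for m in range(n)
    ]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("observation has zero likelihood under every state")
    posterior = [w / total for w in weighted]
    # Prediction: push the posterior through the transition matrix.
    return [sum(posterior[m] * transition[m][m2] for m in range(n)) for m2 in range(n)]

P = [[0.95, 0.05], [0.10, 0.90]]   # transition matrix
mu = [[0.9, 0.1], [0.2, 0.7]]      # mu[m][k]: mean of arm k in state m
b = [0.5, 0.5]
for arm, reward in [(0, 1), (0, 1), (1, 0)]:
    b = belief_update(b, arm, reward, P, mu)
```

In a learning algorithm the true `P` and `mu` are replaced by estimates, so estimation error propagates into the belief; this is the error that the regret analysis has to control.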
5. Practical Considerations and Empirical Performance
Parameter Knowledge and Adaptivity
In PrudentBandits, an upper bound on the number of regimes (equivalently, on the number of switches) must be specified. When this number is unknown, adaptation strategies from recent literature (e.g., Auer et al. ’19) are recommended. The spectral approach in HMM-modulated settings requires a minimum exploration sample size for consistent estimation (Manegueu et al., 2021; Zhou et al., 2020).
Computation
PrudentBandits requires computation of gap estimates for all pairs of subintervals per round; this can be mitigated by windowing strategies. The spectral method incurs a computational load for tensor decomposition and POMDP value iteration during exploitation phases.
Experimental Results
- SEEU outperforms ε-greedy, sliding-window UCB, and Exp3.S in simulated Markov-regime environments, with regret consistent with $T^{2/3}$ scaling (Zhou et al., 2020).
- Regime-Aware Conformal Bandits show higher returns, Sharpe ratios, and lower drawdowns compared to UCB1 and non-regime-aware conformal policies in portfolio allocation tasks with HMM-detected market regimes (Cuonzo et al., 10 Dec 2025).
- Coverage guarantees for conformal intervals hold in small-gap and regime-switching conditions, with tighter intervals than classical UCB methods.
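For reference, the sliding-window UCB baseline named in the comparison above can be sketched as follows. This is a minimal illustration assuming rewards in [0, 1]; the class name, the exploration constant `xi`, and the toy reward rule are hypothetical and do not reproduce the experimental setup of the cited papers.

```python
import math
from collections import deque

class SlidingWindowUCB:
    """UCB index computed only from the last `window` pulls."""

    def __init__(self, n_arms, window, xi=0.6):
        self.n_arms = n_arms
        self.xi = xi                        # exploration constant (illustrative value)
        self.recent = deque(maxlen=window)  # (arm, reward) pairs inside the window

    def select(self):
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.recent:
            counts[arm] += 1
            sums[arm] += reward
        # An arm with no sample left in the window is pulled first.
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm
        log_term = math.log(len(self.recent))

        def index(arm):
            bonus = math.sqrt(self.xi * log_term / counts[arm])
            return sums[arm] / counts[arm] + bonus

        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        self.recent.append((arm, reward))   # the deque drops the oldest pair itself

agent = SlidingWindowUCB(n_arms=2, window=50)
pulls = [0, 0]
for t in range(200):
    arm = agent.select()
    reward = 1.0 if arm == 0 else 0.0       # toy deterministic rewards
    agent.update(arm, reward)
    pulls[arm] += 1
```

Forgetting old samples lets the index follow abrupt changes in the means, but the window carries no model of which regime is active, which is consistent with the limitation of such baselines under latent Markov switching reported above.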
6. Relations to Broader Bandit and Nonstationary Learning Literature
Classical multi-armed bandit theory assumes stationary reward distributions. Regime-switching bandits generalize this, enabling learning in abruptly or persistently nonstationary environments. The piecewise-stationary and Markov regime-switching frameworks encompass a variety of real-world settings, from communication systems to financial decision-making.
Recent work demonstrates that minimax optimal regret rates in switching bandits can be achieved by combining efficient change-point detection with gap-based tests, while spectral approaches and belief-state planning unlock tractable and statistically efficient learning in latent state models. The integration of conformal prediction tools provides an additional statistical guarantee, finite-sample coverage, in online and nonstationary settings; this property is not available in classical frameworks that target regret minimization alone.
Empirical and theoretical evidence suggests that modular architectures capable of handling both abrupt and smoothly varying nonstationarity, as well as latent regime structure, offer strong performance and adaptability in complex sequential learning environments.
Key References:
- "Generalized non-stationary bandits" (Manegueu et al., 2021)
- "Regime Switching Bandits" (Zhou et al., 2020)
- "Conformal Bandits: Bringing statistical validity and reward efficiency to the small-gap regime" (Cuonzo et al., 10 Dec 2025)