Expert Selection Bandits in MDPs
- Expert selection bandits in MDPs are multi-armed bandit models that incorporate regime-switching to adapt to dynamic, non-stationary reward structures.
- They employ techniques like gap estimation, spectral methods for HMMs, and conformal inference to detect regime changes and optimize arm selection.
- Theoretical analyses and empirical evaluations demonstrate sublinear regret bounds and improved performance in finance, recommender systems, and control applications.
A regime-switching bandit refers to a class of multi-armed bandit models in which the underlying reward distributions of arms are influenced by non-stationary, piecewise-constant, or Markovian regime changes. These models are motivated by applications in finance, recommender systems, and control, where latent environmental states cause abrupt changes in reward structure over time. Approaches to regime-switching bandits include statistical changepoint detection algorithms, spectral methods for hidden Markov models, and frameworks integrating conformal inference with regime-aware learning policies.
1. Formal Definitions and Modeling Variants
The regime-switching bandit framework encompasses several formalizations, all of which generalize the classical stationary multi-armed bandit:
- Piecewise-Stationary (Switching Regimes): The time horizon is partitioned by unknown breakpoints into non-overlapping stationary regimes. For each arm $k$ and time $t$, the reward $X_{k,t}$ satisfies
$$\mathbb{E}[X_{k,t}] = \mu_k(t),$$
where $\mu_k(t)$ is piecewise-constant, changing only at breakpoints (Manegueu et al., 2021).
- Markovian Regime-Switching: Rewards are modulated by a latent Markov process over $M$ states. The underlying state $M_t$ evolves as an ergodic Markov chain with transition matrix $P$, which is unobserved. Each arm $k$ has regime-dependent means $\mu_{k,m}$, $m = 1, \dots, M$ (Zhou et al., 2020).
- HMM-Driven Reward Dynamics (as in applications to finance): The latent regime $M_t$ evolves via a homogeneous Markov chain, and observed rewards or vectors (e.g., log-returns) are generated conditionally on the regime, often assumed Gaussian with state-dependent means and covariances. Arms may correspond to actions such as portfolio allocations (Cuonzo et al., 10 Dec 2025).
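- Regret Metric: In both the piecewise-stationary and Markovian settings, the benchmark is an oracle that, at every $t$, selects the arm $k_t^* = \arg\max_k \mu_k(t)$ (or the optimal POMDP policy given access to $P$ and emission laws), with regret defined as
$$R_T = \sum_{t=1}^{T} \big( \mu_{k_t^*}(t) - \mu_{k_t}(t) \big),$$
where $k_t$ is the arm played at round $t$, or, for Markovian regimes, as $R_T = T\rho^* - \sum_{t=1}^{T} r_t$, where $r_t$ is the reward received at round $t$ and $\rho^*$ is the steady-state average reward under the optimal belief-policy (Manegueu et al., 2021, Zhou et al., 2020).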
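As a small illustration of the dynamic-oracle regret for the piecewise-stationary case, the sketch below sums, over rounds, the difference between the best mean at that round and the mean of the arm actually played. The function name `dynamic_regret` and the toy means are assumptions made for this example and do not come from the cited papers.

```python
def dynamic_regret(means, pulls):
    """Sum over rounds of (best mean at round t) - (mean of the arm played at t).

    means[t][k] is the (hypothetical) mean reward of arm k at round t.
    pulls[t] is the index of the arm played at round t.
    """
    return sum(max(row) - row[k] for row, k in zip(means, pulls))


# Two arms and one breakpoint after round 3: the best arm flips from 0 to 1.
means = [[0.9, 0.1]] * 3 + [[0.2, 0.8]] * 3

always_arm_0 = [0] * 6          # ignores the regime change
oracle = [0, 0, 0, 1, 1, 1]     # tracks the best arm in every regime

regret_fixed = dynamic_regret(means, always_arm_0)
regret_oracle = dynamic_regret(means, oracle)
```

A policy that never reacts to the breakpoint pays the gap of the second regime in every remaining round, while the oracle incurs no regret by construction.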
2. Algorithmic Methods for Regime-Switching Bandits
Piecewise-Stationary Regimes: PrudentBandits
The PrudentBandits algorithm (Manegueu et al., 2021) is a unified method for piecewise-stationary or “switching” bandits. It operates by maintaining statistics over episodes demarcated by changepoints:
- Active arm set construction: Selects arms for which the time-since-last-pull exceeds a lower bound tied to gap estimates.
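- Gap estimation: For arm $k$ over subinterval $I$,
$$\hat{\Delta}_k(I) = \max_{j} \hat{\mu}_j(I) - \hat{\mu}_k(I), \qquad \hat{\mu}_k(I) = \frac{1}{N_k(I)} \sum_{t \in I:\, k_t = k} X_{k,t},$$
where the max is over arms present throughout the interval and $N_k(I)$ is the number of arm pulls.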
- Lower bounds on gaps: Apply statistical confidence corrections tailored for switching settings.
- Change-point detection: Detections are triggered by significant discrepancies in gap estimates between subintervals, ensuring avoidance of false alarms and timely detection of true breakpoints.
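The gap-estimation and change-detection steps above can be sketched roughly as follows. This is a simplified illustration, not the PrudentBandits procedure itself: the helper names are hypothetical, the fixed `threshold` stands in for the paper's confidence corrections, and arms are compared whenever they were pulled at least once in both subintervals.

```python
def empirical_gaps(history, start, end):
    """Empirical gaps on rounds [start, end).

    history is a list of (arm, reward) pairs, one per round.
    Returns {arm: best empirical mean - this arm's empirical mean}
    for every arm pulled at least once in the interval.
    """
    sums, counts = {}, {}
    for arm, reward in history[start:end]:
        sums[arm] = sums.get(arm, 0.0) + reward
        counts[arm] = counts.get(arm, 0) + 1
    means = {arm: sums[arm] / counts[arm] for arm in sums}
    best = max(means.values())
    return {arm: best - m for arm, m in means.items()}


def gap_shift_detected(history, split, threshold):
    """Flag a change if some arm's gap differs by more than `threshold`
    between history[:split] and history[split:] (illustrative rule only)."""
    before = empirical_gaps(history, 0, split)
    after = empirical_gaps(history, split, len(history))
    shared = before.keys() & after.keys()
    return any(abs(before[a] - after[a]) > threshold for a in shared)


# Toy data: arm 0 is best for 8 rounds, then arm 1 becomes best.
switching = [(0, 1.0), (1, 0.0)] * 4 + [(0, 0.0), (1, 1.0)] * 4
stationary = [(0, 1.0), (1, 0.0)] * 8

found_switch = gap_shift_detected(switching, 8, 0.5)
found_nothing = gap_shift_detected(stationary, 8, 0.5)
```

Because the test compares gaps rather than raw means, a shift that moves all arm means by the same amount would leave both gap estimates unchanged and would not trigger a detection.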
Markovian Regimes: Spectral and Belief-based Methods
The SEEU algorithm (Zhou et al., 2020) leverages hidden Markov model (HMM) structure by:
- Spectral Method-of-Moments: Empirically estimates the HMM’s transition matrix $P$ and reward distribution means $\mu_{k,m}$ via tensor decompositions of second- and third-order moments of observed arm-reward pairs.
- Belief-state Tracking: Updates a belief vector $b_t$ over regimes using Bayesian filtering, allowing the agent to act optimally according to the inferred regime distribution.
- Optimistic Model Selection: Maintains high-probability confidence sets around parameter estimates and solves for the optimal belief-policy for the most optimistic plausible model at each episode.
- Phased Learning: Alternates between uniform exploration phases (to ensure identifiability of the HMM) and exploitation using the best policy inferred so far.
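The belief-state tracking step can be illustrated with a single round of HMM filtering. The sketch below assumes Gaussian rewards with a known standard deviation and known (or already estimated) parameters; the function name `belief_update` and the toy numbers are assumptions for this example, not taken from the SEEU paper.

```python
import math


def belief_update(belief, P, arm_means, arm, reward, sigma=1.0):
    """One round of Bayesian filtering over latent regimes.

    belief[m]       : current probability of regime m
    P[i][j]         : probability of moving from regime i to regime j
    arm_means[k][m] : mean reward of arm k in regime m
    Gaussian rewards with standard deviation `sigma` are an assumption.
    """
    n = len(belief)
    # Correct: reweight the belief by the likelihood of the observed reward.
    weighted = [
        belief[m] * math.exp(-0.5 * ((reward - arm_means[arm][m]) / sigma) ** 2)
        for m in range(n)
    ]
    total = sum(weighted)
    posterior = [w / total for w in weighted]
    # Predict: push the posterior one step through the Markov chain.
    return [sum(posterior[i] * P[i][j] for i in range(n)) for j in range(n)]


P = [[0.9, 0.1], [0.1, 0.9]]    # sticky two-regime chain
arm_means = [[1.0, 0.0]]        # one arm: mean 1 in regime 0, mean 0 in regime 1
new_belief = belief_update([0.5, 0.5], P, arm_means, arm=0, reward=1.0)
```

Observing a reward of 1.0 shifts probability toward regime 0, and the prediction step then pulls the belief slightly back toward uniform because the regime may switch before the next round.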
Conformal Bandits with Regime Detection
The Conformal Bandit framework (Cuonzo et al., 10 Dec 2025) is integrated with regime-sensing HMMs:
- Conformalized Quantile Regression (CQR): For each arm, constructs finite-sample-valid predictive intervals via adaptive calibration scores, adapting to observed contextual covariates.
- HMM Filtering: At each round, updates regime posterior probabilities based on new contextual information, using a transition model fitted by the EM algorithm and Gaussian emission parameters estimated from historical data.
- Regime-aware arm selection: Uses regime estimates (e.g., Bull, Bear, Neutral) to select arms via regime-specific decision indices: for example, pick the arm with the highest upper confidence bound in Bull or Neutral regimes, or protect the downside by selecting the arm with the highest lower bound in a Bear regime.
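A minimal sketch of two of the ingredients above, under stated assumptions: `cqr_margin` computes the standard split-conformal CQR correction (which may differ from the adaptive calibration scores used in the cited work), and `select_arm` applies the regime-specific index rule. Both function names, the calibration data, and the intervals are hypothetical.

```python
import math


def cqr_margin(cal_lo, cal_hi, cal_y, alpha):
    """Split-CQR margin: the ceil((n + 1)(1 - alpha))-th smallest conformity score.

    cal_lo / cal_hi are predicted lower / upper quantiles on a calibration set,
    cal_y the realized rewards. The calibrated interval is [lo - q, hi + q].
    """
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(cal_lo, cal_hi, cal_y))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def select_arm(intervals, regime):
    """Pick by lower bound in a 'Bear' regime, by upper bound otherwise."""
    side = 0 if regime == "Bear" else 1
    return max(range(len(intervals)), key=lambda k: intervals[k][side])


# Nine calibration points with a constant raw interval [0, 1].
cal_y = [0.5, 1.1, -0.2, 0.3, 1.4, 0.9, 0.0, 1.2, 0.6]
margin = cqr_margin([0.0] * 9, [1.0] * 9, cal_y, alpha=0.2)

# Arm 0 is riskier (wider interval); arm 1 has the better worst case.
intervals = [(-0.05, 0.30), (0.01, 0.10)]
bull_choice = select_arm(intervals, "Bull")
bear_choice = select_arm(intervals, "Bear")
```

The same calibrated intervals serve both modes: only the side of the interval used for ranking changes with the inferred regime.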
3. Regret Bounds and Theoretical Guarantees
Piecewise-Stationary Case
For $L$ switches (i.e., $L+1$ stationary segments) over horizon $T$ and $K$ arms, the main regret result for PrudentBandits (Manegueu et al., 2021) is
$$R_T = \tilde{O}\big(\sqrt{K L T}\big),$$
which matches the minimax optimal rate $\sqrt{KLT}$ up to a logarithmic factor, assuming gaps between regime-best and sub-optimal arms are bounded away from zero and the number of regimes is known or tightly upper-bounded.
Markovian Regimes
For the Markovian regime-switching bandit, the SEEU algorithm achieves (Zhou et al., 2020):
$$R_T = \tilde{O}\big(T^{2/3}\big)$$
with high probability, under ergodicity, full-rank emission matrix, and sufficient exploration. This rate is sublinear but does not reach the $\sqrt{T}$ minimax rate, due to the need for repeated HMM parameter estimation under partial observability.
Conformal Bandits
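In the regime-switching Conformal Bandit setting, theoretical guarantees are provided for finite-sample coverage of the constructed prediction intervals for each arm-reward, uniformly over arms and time:
$$\mathbb{P}\big( Y_{k,t} \in \hat{C}_{k,t} \big) \ge 1 - \alpha \quad \text{for all arms } k \text{ and rounds } t,$$
where $Y_{k,t}$ is the reward of arm $k$ at round $t$, $\hat{C}_{k,t}$ its prediction interval, and $\alpha$ the miscoverage level, for any sequence of regimes (Cuonzo et al., 10 Dec 2025). No explicit regret bound is proved for this setting, but empirical results indicate near-logarithmic regret in small-gap regimes, outperforming standard stationary UCB methods under heavy-tailed and tiny-gap scenarios.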
4. Practical Aspects and Implementation Considerations
- Parameter Knowledge: Switching-bandit methods require an upper bound on the number of regime changes $L$ (equivalently, on the number of stationary segments $L+1$). Fully adaptive routines for unknown numbers of regimes remain an active area (Manegueu et al., 2021).
- Detection Mechanisms: The efficacy of gap-based versus mean-based changepoint detection is highlighted: gap-based approaches yield sharper detection events when only relative arm performance is relevant.
- Spectral Methods: Identification of latent HMM structure via tensor decompositions is data-efficient but requires episodes of uniform exploration.
- Regime Inference: In financial applications, HMMs with Gaussian emissions are learned on historical data and used online for regime inference, which then modulates actionable decisions (Cuonzo et al., 10 Dec 2025).
- Computational Complexity: Algorithms updating statistics over all arms and subintervals can be computationally intensive at large $T$ or with many arms; windowed estimation and episode batching are practical mitigations.
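As one way to picture the windowed-estimation mitigation, the sketch below keeps a per-arm running mean over the most recent rewards with constant work per update. The class name `WindowedMean` and the window length are assumptions for illustration and are not part of any of the cited algorithms.

```python
from collections import deque


class WindowedMean:
    """Running mean over the most recent `window` rewards of a single arm."""

    def __init__(self, window):
        self.buffer = deque(maxlen=window)
        self.total = 0.0

    def update(self, reward):
        # Subtract the value that the bounded deque is about to evict.
        if len(self.buffer) == self.buffer.maxlen:
            self.total -= self.buffer[0]
        self.buffer.append(reward)
        self.total += reward
        return self.total / len(self.buffer)


# The arm's reward jumps from 1 to 4; a window of 3 forgets the old regime
# after three new observations.
estimator = WindowedMean(window=3)
trace = [estimator.update(x) for x in [1, 1, 1, 4, 4, 4]]
```

A shorter window adapts faster after a regime change but yields noisier estimates within a regime, so the window length trades detection delay against variance.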
5. Empirical and Applied Evaluations
Piecewise-Stationary Bandits
Theory-driven evaluations for PrudentBandits demonstrate optimal regret scaling and flexibility for handling mixed forms of nonstationarity (including polynomial drift and inflection-bounded variations) under a single code base (Manegueu et al., 2021).
Markovian Regimes
Proof-of-concept experiments for the SEEU algorithm demonstrate sublinear (slope $\approx 2/3$ in log-log plots) regret, matching theoretical rates and outperforming baselines such as $\epsilon$-greedy, sliding-window UCB, and Exp3.S, all of which suffer linear regret in the presence of hidden regime switching (Zhou et al., 2020).
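A log-log slope of this kind can be estimated with an ordinary least-squares fit of log-regret against log-horizon. The sketch below uses a synthetic regret curve whose exponent (2/3) is chosen purely for illustration; the data are not taken from the cited experiments, and `loglog_slope` is a hypothetical helper.

```python
import math


def loglog_slope(horizons, regrets):
    """Least-squares slope of log(regret) against log(horizon)."""
    xs = [math.log(t) for t in horizons]
    ys = [math.log(r) for r in regrets]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic curves: one growing like T^(2/3), one growing linearly in T.
horizons = [1e3, 1e4, 1e5, 1e6]
sublinear = [5.0 * t ** (2.0 / 3.0) for t in horizons]
linear = [0.1 * t for t in horizons]

slope_sublinear = loglog_slope(horizons, sublinear)
slope_linear = loglog_slope(horizons, linear)
```

A fitted slope clearly below 1 indicates sublinear growth, whereas a slope near 1 is the signature of linear regret.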
Financial Regime-Switching Applications
The Regime-Aware Conformal Bandit method, applied to portfolio selection on real ETF data, shows improved partial-information metrics (cumulative wealth, Sharpe ratio, drawdown) over standard UCB and stationary conformal approaches. The HMM-augmented conformal framework preserves finite-sample predictive coverage and delivers higher risk-adjusted returns. Empirical simulations reveal that all conformal variants maintain nominal (typically ~80%) coverage, with interval widths narrower than non-conformal UCB, particularly in small-gap, heavy-tail reward distributions (Cuonzo et al., 10 Dec 2025).
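For readers unfamiliar with the portfolio metrics mentioned above, the following sketch computes a per-period Sharpe ratio and the maximum drawdown from a list of simple returns. It assumes a zero risk-free rate and no annualization, which may differ from the conventions used in the cited study; the function names and the return series are illustrative.

```python
import math


def sharpe_ratio(returns):
    """Per-period Sharpe ratio, assuming a zero risk-free rate (not annualized)."""
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(variance)


def max_drawdown(returns):
    """Largest peak-to-trough fractional loss of the cumulative wealth curve."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = max(worst, 1.0 - wealth / peak)
    return worst


# Toy return series: a gain, a large loss, then a partial recovery.
drawdown = max_drawdown([0.10, -0.50, 0.20])
sharpe = sharpe_ratio([0.01, 0.03])
```

Cumulative wealth rewards raw growth, while the Sharpe ratio and drawdown penalize volatility and deep losses, which is why they are reported together when comparing regime-aware and stationary policies.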
6. Comparison to Classical and Related Frameworks
| Model type | Regret Bound | Regime Modeling Approach |
|---|---|---|
| Piecewise-stationary (PrudentBandits) | $\tilde{O}(\sqrt{KLT})$ | Deterministic breakpoints |
| Markovian regimes (SEEU) | $\tilde{O}(T^{2/3})$ | Hidden finite-state MC |
| Regime-aware Conformal Bandit | Empirical near-logarithmic | HMM + conformal inference |
Classical switching bandit methods achieve the minimax $\sqrt{KLT}$ regret (for $K$ arms, $L$ switches, and horizon $T$) up to log factors, assuming a known or tightly bounded number of regimes and reasonably separated arm means. Markovian approaches yield higher regret but do not require prior knowledge of regime structure, instead learning latent transitions online. Conformal bandit methods emphasize statistical coverage in addition to regret, providing prediction validity guarantees not generally available in classical approaches.
7. Perspectives and Future Directions
Open directions include online adaptation to an unknown and potentially unbounded number of latent regimes, tighter regret bounds for bandits under hidden Markov dependence, optimal exploration schedules for spectral learning phases, and integration of finite-sample coverage guarantees with minimax-optimal regret. In nonparametric and heavy-tailed small-gap settings, coupling regime detection with robust inference methods demonstrates advantages for both risk-adjusted rewards and predictive validity (Cuonzo et al., 10 Dec 2025). Robust and scalable methods for computational efficiency and real-time deployment are increasingly significant given high-dimensional and fast-switching environments.