Expert Selection Bandits in MDPs

Updated 31 May 2026

Expert selection bandits in MDPs are multi-armed bandit models that incorporate regime-switching to adapt to dynamic, non-stationary reward structures.
They employ techniques like gap estimation, spectral methods for HMMs, and conformal inference to detect regime changes and optimize arm selection.
Theoretical analyses and empirical evaluations demonstrate sublinear regret bounds and improved performance in finance, recommender systems, and control applications.

Regime-switching bandits are a class of multi-armed bandit models in which the reward distributions of the arms are subject to non-stationary, piecewise-constant, or Markovian regime changes. These models are motivated by applications in finance, recommender systems, and control, where latent environmental states cause abrupt changes in reward structure over time. Approaches include statistical changepoint detection algorithms, spectral methods for hidden Markov models, and frameworks that integrate conformal inference with regime-aware learning policies.

1. Formal Definitions and Modeling Variants

The regime-switching bandit framework encompasses several formalizations, all of which generalize the classical stationary multi-armed bandit:

Piecewise-Stationary (Switching Regimes): The time horizon $\{1,2,\ldots,T\}$ is partitioned by unknown breakpoints $1 = \bar\tau_1 < \cdots < \bar\tau_{M+1} = T+1$ into $M$ non-overlapping stationary regimes. For each arm $k\in\mathcal{K} = \{1,\ldots,K\}$ and time $t$, the reward is

$X_{k,t} = \mu_{k}(t) + \epsilon_{k,t}, \quad X_{k,t}\in [0,1],\quad \mathbb{E}[\epsilon_{k,t}] = 0,$

where $\mu_k(t)$ is piecewise-constant, changing only at breakpoints $\bar\tau_m$ (Manegueu et al., 2021).

Markovian Regime-Switching: Rewards are modulated by a latent Markov process $\{S_t\}_{t=1}^T$ over $M$ states. The underlying state evolves as an ergodic Markov chain with transition matrix $P$, which is unobserved. Each arm $k$ has regime-dependent means $\mu_k(s)$, $s\in\{1,\ldots,M\}$ (Zhou et al., 2020).
HMM-Driven Reward Dynamics (as in applications to finance): The latent regime $S_t$ evolves via a homogeneous Markov chain, and observed rewards or vectors (e.g., log-returns) are generated conditionally on the regime, often assumed Gaussian with state-dependent means and covariances. Arms may correspond to actions such as portfolio allocations (Cuonzo et al., 10 Dec 2025).
Regret Metric: In both settings, the benchmark is an oracle that, at every $t$, selects the arm $k_t^* = \arg\max_{k\in\mathcal{K}} \mu_k(t)$ (or the optimal POMDP policy given access to $P$ and emission laws), with regret defined as

$R_T = \sum_{t=1}^{T} \big(\mu_{k_t^*}(t) - \mu_{k_t}(t)\big),$

or, for Markovian regimes, as $R_T = T\rho^* - \sum_{t=1}^{T} X_{k_t,t}$, where $k_t$ is the arm pulled at round $t$ and $\rho^*$ is the steady-state average reward under the optimal belief-policy (Manegueu et al., 2021, Zhou et al., 2020).
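As a rough illustration of the piecewise-stationary model and its oracle benchmark, the following Python sketch evaluates piecewise-constant means from a breakpoint list and accumulates regret against the per-round best arm. The function names, the two-arm example, and the breakpoint values are hypothetical and chosen only for illustration; they do not come from the cited papers.

```python
import bisect

def piecewise_mean(breakpoints, segment_means, t):
    """Return mu_k(t) for one arm.

    breakpoints:   sorted list [tau_1 = 1, ..., tau_{M+1} = T + 1]
    segment_means: list of M means, one per stationary regime
    """
    if not (breakpoints[0] <= t < breakpoints[-1]):
        raise ValueError("t lies outside the horizon")
    # Index m of the regime with tau_m <= t < tau_{m+1}
    m = bisect.bisect_right(breakpoints, t) - 1
    return segment_means[m]

def dynamic_regret(breakpoints, means_by_arm, pulls):
    """Regret against the oracle that plays the best arm at every round.

    means_by_arm[k] holds the per-regime means of arm k;
    pulls[t - 1] is the arm chosen at round t.
    """
    total = 0.0
    for t, k_t in enumerate(pulls, start=1):
        mu_t = [piecewise_mean(breakpoints, m, t) for m in means_by_arm]
        total += max(mu_t) - mu_t[k_t]
    return total

# Hypothetical example: T = 6, M = 2 regimes, best arm flips at t = 4.
breakpoints = [1, 4, 7]
means_by_arm = [[0.9, 0.1], [0.1, 0.9]]
regret_fixed = dynamic_regret(breakpoints, means_by_arm, [0] * 6)
regret_oracle = dynamic_regret(breakpoints, means_by_arm, [0, 0, 0, 1, 1, 1])
```

A policy that never leaves arm 0 pays the full gap in every round of the second regime, while the oracle sequence incurs zero regret.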

2. Algorithmic Methods for Regime-Switching Bandits

Piecewise-Stationary Regimes: PrudentBandits

The PrudentBandits algorithm (Manegueu et al., 2021) is a unified method for piecewise-stationary or “switching” bandits. It operates by maintaining statistics over episodes demarcated by changepoints:

Active arm set construction: Selects arms for which the time-since-last-pull exceeds a lower bound tied to gap estimates.
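Gap estimation: For arm $k$ over subinterval $I$,

$\hat\Delta_k(I) = \max_{j} \hat\mu_j(I) - \hat\mu_k(I), \quad \hat\mu_j(I) = \frac{1}{N_j(I)} \sum_{t\in I} X_{j,t}\,\mathbf{1}\{\text{arm } j \text{ pulled at } t\},$

where the max is over arms present throughout the interval and $N_j(I)$ is the number of arm pulls.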

Lower bounds on gaps: Apply statistical confidence corrections tailored for switching settings.
Change-point detection: Detections are triggered by significant discrepancies in gap estimates between subintervals, ensuring avoidance of false alarms and timely detection of true breakpoints.
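The idea of comparing gap estimates across subintervals can be sketched in a few lines of Python. This is a simplified stand-in, not the PrudentBandits test itself: it assumes rewards in [0, 1], at least one sample per arm in each interval, and a plain Hoeffding radius as the confidence correction. All function names are hypothetical.

```python
import math

def gap_estimates(rewards_by_arm):
    """Estimated gap of each arm to the empirically best arm on one interval."""
    means = {k: sum(xs) / len(xs) for k, xs in rewards_by_arm.items()}
    best = max(means.values())
    return {k: best - m for k, m in means.items()}

def hoeffding_radius(n, delta):
    """Two-sided deviation bound for the mean of n rewards in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def gap_shift_detected(interval_a, interval_b, arm, delta=0.05):
    """Flag a change when the two gap estimates differ by more than their noise."""
    diff = abs(gap_estimates(interval_a)[arm] - gap_estimates(interval_b)[arm])
    # Each gap estimate combines two empirical means, hence the factor 2.
    n_a = min(len(xs) for xs in interval_a.values())
    n_b = min(len(xs) for xs in interval_b.values())
    threshold = 2.0 * (hoeffding_radius(n_a, delta) + hoeffding_radius(n_b, delta))
    return diff > threshold

# Hypothetical data: the two arms swap roles between the intervals.
before = {0: [0.9] * 100, 1: [0.1] * 100}
after = {0: [0.1] * 100, 1: [0.9] * 100}
changed = gap_shift_detected(before, after, arm=1)
unchanged = gap_shift_detected(before, before, arm=1)
```

Working with gaps rather than raw means makes the test insensitive to shifts that move all arms by the same amount, which is the kind of change that does not affect which arm is best.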

Markovian Regimes: Spectral and Belief-based Methods

The SEEU algorithm (Zhou et al., 2020) leverages hidden Markov model (HMM) structure by:

Spectral Method-of-Moments: Empirically estimates the HMM’s transition matrix $P$ and reward distribution means $\mu$ via tensor decompositions of second- and third-order moments of observed arm-reward pairs.
Belief-state Tracking: Updates a belief vector $b_t$ over regimes using Bayesian filtering, allowing the agent to act optimally according to the inferred regime distribution.
Optimistic Model Selection: Maintains high-probability confidence sets around parameter estimates and solves for the optimal belief-policy for the most optimistic plausible model at each episode.
Phased Learning: Alternates between uniform exploration phases (to ensure identifiability of the HMM) and exploitation using the best policy inferred so far.
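The belief-tracking step can be illustrated with a standard HMM forward filter. The sketch below assumes Bernoulli rewards and known (or already estimated) parameters; the argument names `P`, `mu`, and `belief` are illustrative and the code is not taken from the SEEU implementation.

```python
def belief_update(belief, P, mu, arm, reward):
    """One step of Bayesian filtering over regimes.

    belief[s]  : current probability of regime s
    P[s][s2]   : transition probability from regime s to regime s2
    mu[s][arm] : Bernoulli reward mean of `arm` in regime s (assumed emission model)
    reward     : observed reward, 0 or 1
    """
    M = len(belief)
    # Correction: reweight each regime by the likelihood of the observed reward.
    lik = [mu[s][arm] if reward == 1 else 1.0 - mu[s][arm] for s in range(M)]
    posterior = [belief[s] * lik[s] for s in range(M)]
    z = sum(posterior)
    if z == 0.0:
        raise ValueError("observation has zero likelihood under the model")
    posterior = [p / z for p in posterior]
    # Prediction: push the posterior through the transition matrix.
    return [sum(posterior[s] * P[s][s2] for s in range(M)) for s2 in range(M)]

# Hypothetical two-regime, two-arm example.
P = [[0.9, 0.1], [0.1, 0.9]]
mu = [[0.8, 0.2], [0.2, 0.8]]
next_belief = belief_update([0.5, 0.5], P, mu, arm=0, reward=1)
```

Observing a reward of 1 from arm 0 shifts mass toward the regime in which arm 0 pays well; the transition step then pulls the belief slightly back toward uniform.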

Conformal Bandits with Regime Detection

The Conformal Bandit framework (Cuonzo et al., 10 Dec 2025) combines conformal prediction intervals with a regime-sensing HMM:

Conformalized Quantile Regression (CQR): For each arm, constructs finite-sample-valid predictive intervals via adaptive calibration scores, adapting to observed contextual covariates.
HMM Filtering: At each round, updates regime posterior probabilities based on new contextual information, using a transition model and Gaussian emission parameters fitted by the EM algorithm on historical data.
Regime-aware arm selection: Uses regime estimates (e.g., Bull, Bear, Neutral) to select arms via regime-specific decision indices: e.g., pick the arm with the highest upper confidence bound in Bull/Neutral regimes, or protect the downside by selecting the arm with the highest lower bound in Bear regimes.
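A minimal Python sketch of the regime-specific decision index described above follows. It assumes each arm already has a predictive interval and that the regime is chosen as the most probable state of the filtered posterior; the arm names and interval values are hypothetical.

```python
def most_likely_regime(posterior):
    """posterior: dict mapping regime name -> filtered probability."""
    return max(posterior, key=posterior.get)

def select_arm(intervals, regime):
    """intervals: dict mapping arm -> (lower, upper) predictive interval.

    Bear: maximize the lower bound (downside protection).
    Bull or Neutral: maximize the upper bound (optimism).
    """
    if regime == "Bear":
        return max(intervals, key=lambda arm: intervals[arm][0])
    return max(intervals, key=lambda arm: intervals[arm][1])

# Hypothetical portfolio arms with per-round return intervals.
intervals = {"equity_heavy": (-0.05, 0.10), "bond_heavy": (0.00, 0.04)}
bull_choice = select_arm(
    intervals, most_likely_regime({"Bull": 0.7, "Neutral": 0.2, "Bear": 0.1})
)
bear_choice = select_arm(
    intervals, most_likely_regime({"Bull": 0.1, "Neutral": 0.2, "Bear": 0.7})
)
```

In this example the wide-interval arm is chosen when the Bull state dominates, and the arm with the better worst case is chosen when the Bear state dominates.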

3. Regret Bounds and Theoretical Guarantees

Piecewise-Stationary Case

For $M-1$ switches (i.e., $M$ stationary segments) over horizon $T$ and $K$ arms, the main regret result for PrudentBandits (Manegueu et al., 2021) is

$R_T = \tilde{O}\big(\sqrt{MKT}\big),$

which matches the minimax optimal rate $\sqrt{MKT}$ up to a logarithmic factor, assuming gaps between regime-best and sub-optimal arms are bounded away from zero and the number of regimes is known or tightly upper-bounded.

Markovian Regimes

For the Markovian regime-switching bandit, the SEEU algorithm achieves (Zhou et al., 2020):

$R_T = O\big(T^{2/3}\sqrt{\log T}\big)$

with high probability, under ergodicity, full-rank emission matrix, and sufficient exploration. This rate is sublinear but does not reach the $\sqrt{T}$ minimax rate, due to the need for repeated HMM parameter estimation under partial observability.

Conformal Bandits

In the regime-switching Conformal Bandit setting, theoretical guarantees are provided for finite-sample coverage of the constructed prediction intervals $\hat{C}_{k,t}$ for each arm-reward at miscoverage level $\alpha$, uniformly over arms and time:

$\mathbb{P}\big(X_{k,t} \in \hat{C}_{k,t}\big) \ge 1-\alpha \quad \text{for all } k\in\mathcal{K},\ t\le T,$

for any sequence of regimes (Cuonzo et al., 10 Dec 2025). No explicit regret bound is proved for this setting, but empirical results indicate near-logarithmic regret in small-gap regimes, outperforming standard stationary UCB methods under heavy-tailed and tiny-gap scenarios.
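To make the finite-sample calibration step concrete, here is a small Python sketch of the standard conformalized quantile regression correction. It assumes that lower and upper quantile predictions are already available on a held-out calibration set; the function names and numbers are hypothetical and not taken from the cited paper.

```python
import math

def cqr_correction(lo_preds, hi_preds, ys, alpha):
    """Conformal correction from calibration data.

    Score of each point: how far y falls outside [lo, hi]
    (negative when y lies strictly inside the predicted band).
    """
    n = len(ys)
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(lo_preds, hi_preds, ys))
    # Finite-sample rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]

def cqr_interval(lo, hi, correction):
    """Widen (or shrink) the raw quantile band by the calibrated correction."""
    return lo - correction, hi + correction

# Hypothetical calibration set: raw band [0, 10], two observations fall outside.
lo_preds = [0] * 9
hi_preds = [10] * 9
ys = [5, 5, 5, 5, 5, 5, 5, 12, 13]
q = cqr_correction(lo_preds, hi_preds, ys, alpha=0.2)
interval = cqr_interval(0, 10, q)
```

The `(n + 1)` in the rank is what gives the marginal coverage guarantee at finite sample size, at the cost of a slightly wider interval than the plain empirical quantile would produce.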

4. Practical Aspects and Implementation Considerations

Parameter Knowledge: Switching-bandit methods require an upper bound on the number of regimes $M$ (equivalently, on the $M-1$ regime changes). Fully adaptive routines for unknown numbers of regimes remain an active area (Manegueu et al., 2021).
Detection Mechanisms: Gap-based changepoint detection yields sharper detection events than mean-based detection when only relative arm performance is relevant.
Spectral Methods: Identification of latent HMM structure via tensor decompositions is data-efficient but requires episodes of uniform exploration.
Regime Inference: In financial applications, HMMs with Gaussian emissions are learned on historical data and used online for regime inference, which then modulates actionable decisions (Cuonzo et al., 10 Dec 2025).
Computational Complexity: Algorithms updating statistics over all arms and subintervals can be computationally intensive at large $T$ or with many arms; windowed estimation and episode batching are practical mitigations.
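Windowed estimation, mentioned above as a mitigation, can be as simple as a running mean over the most recent observations of an arm. The class below is a generic sketch with a hypothetical name and window size, not a component of any of the cited algorithms.

```python
from collections import deque

class WindowedMean:
    """Running mean over the last `window` rewards of one arm, O(1) per update."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.values = deque()
        self.total = 0.0

    def update(self, x):
        self.values.append(x)
        self.total += x
        # Drop the oldest reward once the window is full.
        if len(self.values) > self.window:
            self.total -= self.values.popleft()

    def mean(self):
        if not self.values:
            raise ValueError("no observations yet")
        return self.total / len(self.values)

# Hypothetical reward stream whose level jumps from 1 to 4.
est = WindowedMean(window=3)
for x in [1, 1, 1, 4, 4, 4]:
    est.update(x)
current = est.mean()
```

After the jump, the estimate depends only on post-change rewards once the window has turned over, which bounds both memory and the lag in tracking a new regime.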

5. Empirical and Applied Evaluations

Piecewise-Stationary Bandits

Theory-driven evaluations for PrudentBandits demonstrate optimal regret scaling and flexibility for handling mixed forms of nonstationarity (including polynomial drift and inflection-bounded variations) under a single code base (Manegueu et al., 2021).

Markovian Regimes

Proof-of-concept experiments for the SEEU algorithm demonstrate sublinear (slope $\approx 2/3$ in log-log plots) regret, matching theoretical rates and outperforming baselines such as $\epsilon$-greedy, sliding-window UCB, and Exp3.S, all of which suffer linear regret in the presence of hidden regime switching (Zhou et al., 2020).
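Reading a growth rate off a log-log plot amounts to fitting a straight line to log-regret against log-time. The sketch below does this with ordinary least squares on synthetic data; the curve $3\,T^{2/3}$ is an assumption made purely for illustration and is not experimental data from the cited work.

```python
import math

def loglog_slope(ts, regrets):
    """Least-squares slope of log(regret) against log(t)."""
    xs = [math.log(t) for t in ts]
    ys = [math.log(r) for r in regrets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic regret curve growing like T^(2/3) (illustrative assumption).
ts = [10 ** p for p in range(2, 7)]
regrets = [3.0 * t ** (2.0 / 3.0) for t in ts]
slope = loglog_slope(ts, regrets)
```

A fitted slope near 1 would indicate linear regret, while a slope near 2/3 or 1/2 indicates the corresponding sublinear rate; the multiplicative constant only shifts the intercept.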

Financial Regime-Switching Applications

The Regime-Aware Conformal Bandit method, applied to portfolio selection on real ETF data, shows improved partial-information metrics (cumulative wealth, Sharpe ratio, drawdown) over standard UCB and stationary conformal approaches. The HMM-augmented conformal framework preserves finite-sample predictive coverage and delivers higher risk-adjusted returns. Empirical simulations reveal that all conformal variants maintain nominal (typically ~80%) coverage, with interval widths narrower than non-conformal UCB, particularly in small-gap, heavy-tail reward distributions (Cuonzo et al., 10 Dec 2025).

6. Comparison to Classical and Related Frameworks

Model type	Regret Bound	Regime Modeling Approach
Piecewise-stationary (PrudentBandits)	$\tilde{O}(\sqrt{MKT})$	Deterministic breakpoints
Markovian regimes (SEEU)	$O(T^{2/3}\sqrt{\log T})$	Hidden finite-state Markov chain
Regime-aware Conformal Bandit	None proved (empirically near-logarithmic)	HMM + conformal inference

Classical switching bandit methods achieve the minimax $\sqrt{T}$ regret up to log factors, assuming a known (or tightly upper-bounded) number of regimes and reasonably separated arm means. Markovian approaches yield higher regret but do not require prior knowledge of regime structure, instead learning latent transitions online. Conformal bandit methods emphasize statistical coverage in addition to regret, providing prediction validity guarantees not generally available in classical approaches.

7. Perspectives and Future Directions

Open directions include online adaptation to an unknown and potentially unbounded number of latent regimes, tighter regret bounds for bandits under hidden Markov dependence, optimal exploration schedules for spectral learning phases, and integration of finite-sample coverage guarantees with minimax-optimal regret. In nonparametric and heavy-tailed small-gap settings, coupling regime detection with robust inference methods shows advantages for both risk-adjusted rewards and predictive validity (Cuonzo et al., 10 Dec 2025). Computationally efficient, scalable methods suited to real-time deployment are increasingly important in high-dimensional and fast-switching environments.


References (3)

Manegueu et al. (2021). Generalized non-stationary bandits.

Zhou et al. (2020). Regime Switching Bandits.

Cuonzo et al. (2025). Conformal Bandits: Bringing statistical validity and reward efficiency to the small-gap regime.
