Expert Selection Bandits in MDPs

Updated 31 May 2026
  • Expert selection bandits in MDPs are multi-armed bandit models that incorporate regime-switching to adapt to dynamic, non-stationary reward structures.
  • They employ techniques like gap estimation, spectral methods for HMMs, and conformal inference to detect regime changes and optimize arm selection.
  • Theoretical analyses and empirical evaluations demonstrate sublinear regret bounds and improved performance in finance, recommender systems, and control applications.

A regime-switching bandit refers to a class of multi-armed bandit models in which the underlying reward distributions of arms are influenced by non-stationary, piecewise-constant, or Markovian regime changes. These models are motivated by applications in finance, recommender systems, and control, where latent environmental states cause abrupt changes in reward structure over time. Approaches to regime-switching bandits include statistical changepoint detection algorithms, spectral methods for hidden Markov models, and frameworks integrating conformal inference with regime-aware learning policies.

1. Formal Definitions and Modeling Variants

The regime-switching bandit framework encompasses several formalizations, all of which generalize the classical stationary multi-armed bandit:

  • Piecewise-Stationary (Switching Regimes): The time horizon $\{1,2,\ldots,T\}$ is partitioned by unknown breakpoints $1 = \bar\tau_1 < \cdots < \bar\tau_{M+1} = T+1$ into $M$ non-overlapping stationary regimes. For each arm $k\in\mathcal{K} = \{1,\ldots,K\}$ and time $t$, the reward is

$$X_{k,t} = \mu_{k}(t) + \epsilon_{k,t}, \quad X_{k,t}\in [0,1],\quad \mathbb{E}[\epsilon_{k,t}] = 0,$$

where $\mu_k(t)$ is piecewise-constant, changing only at breakpoints $\bar\tau_m$ (Manegueu et al., 2021).

  • Markovian Regime-Switching: Rewards are modulated by a latent Markov process $\{S_t\}_{t=1}^T$ over $M$ states. The unobserved state evolves as an ergodic Markov chain with transition matrix $P$. Each arm $k$ has regime-dependent means $\mu_k(s)$, $s \in \{1,\ldots,M\}$ (Zhou et al., 2020).
  • HMM-Driven Reward Dynamics (as in applications to finance): The latent regime $S_t$ evolves via a homogeneous Markov chain, and observed rewards or vectors (e.g., log-returns) are generated conditionally on the regime, often assumed Gaussian with state-dependent means and covariances. Arms may correspond to actions such as portfolio allocations (Cuonzo et al., 10 Dec 2025).
  • Regret Metric: In both settings, the benchmark is an oracle that, at every $t$, selects the arm $k_t^* = \arg\max_{k\in\mathcal{K}} \mu_k(t)$ (or the optimal POMDP policy given access to $P$ and emission laws), with regret defined as

$$R_T = \sum_{t=1}^{T} \Big( \mu_{k_t^*}(t) - \mu_{k_t}(t) \Big),$$

where $k_t$ is the arm pulled at round $t$, or, for Markovian regimes, as $R_T = T\rho^* - \sum_{t=1}^{T} X_{k_t,t}$, where $\rho^*$ is the steady-state average reward under the optimal belief-policy (Manegueu et al., 2021, Zhou et al., 2020).
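The piecewise-stationary definition and its regret can be made concrete with a small sketch. The helper names (`piecewise_means`, `dynamic_regret`) and the two-regime example are illustrative assumptions, not taken from the cited papers; regret is computed on mean rewards against the per-round best arm.

```python
def piecewise_means(breakpoints, means_per_regime, horizon):
    """Return mu[t][k] for t = 1..horizon (index 0 is unused).

    breakpoints: sorted regime start times, beginning with 1.
    means_per_regime: one list of K arm means per regime.
    """
    mu = [None] * (horizon + 1)
    regime = 0
    for t in range(1, horizon + 1):
        # Move to the next regime once its start time is reached.
        while regime + 1 < len(breakpoints) and t >= breakpoints[regime + 1]:
            regime += 1
        mu[t] = means_per_regime[regime]
    return mu


def dynamic_regret(mu, pulls):
    """Sum over rounds of (best mean at t) minus (mean of the arm pulled at t)."""
    return sum(max(mu[t]) - mu[t][arm] for t, arm in enumerate(pulls, start=1))


# Hypothetical example: two regimes over T = 10 rounds; the best arm flips at t = 6.
mu = piecewise_means([1, 6], [[0.9, 0.1], [0.2, 0.8]], horizon=10)
always_arm0 = [0] * 10           # ignores the regime change
oracle = [0] * 5 + [1] * 5       # switches exactly at the breakpoint
print(dynamic_regret(mu, always_arm0), dynamic_regret(mu, oracle))
```

A policy that never switches pays the new gap on every round after the breakpoint, while the oracle incurs no regret.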

2. Algorithmic Methods for Regime-Switching Bandits

Piecewise-Stationary Regimes: PrudentBandits

The PrudentBandits algorithm (Manegueu et al., 2021) is a unified method for piecewise-stationary or “switching” bandits. It operates by maintaining statistics over episodes demarcated by changepoints:

  • Active arm set construction: Selects arms for which the time-since-last-pull exceeds a lower bound tied to gap estimates.
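  • Gap estimation: For arm $k$ over subinterval $I = [s, t]$,

$$\hat\Delta_k(I) = \max_{j} \hat\mu_j(I) - \hat\mu_k(I), \qquad \hat\mu_k(I) = \frac{1}{N_k(I)} \sum_{u \in I:\, k_u = k} X_{k,u},$$

where the max is over arms present throughout the interval and $N_k(I)$ is the number of pulls of arm $k$ in $I$.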

  • Lower bounds on gaps: Apply statistical confidence corrections tailored for switching settings.
  • Change-point detection: Detections are triggered by significant discrepancies in gap estimates between subintervals, ensuring avoidance of false alarms and timely detection of true breakpoints.
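The gap-estimation and change-detection steps above can be sketched as follows. This is a simplified illustration, not the PrudentBandits procedure itself: the function names, the two-half comparison, and the Hoeffding-style threshold with parameter `delta` are all assumptions made for the example.

```python
import math


def empirical_gaps(history, arms, start, end):
    """Gap estimates on rounds start..end (inclusive).

    history: list of (round, arm, reward) tuples.
    Returns {arm: (gap_estimate, pulls)} for arms pulled in the window.
    """
    totals = {k: 0.0 for k in arms}
    counts = {k: 0 for k in arms}
    for t, k, x in history:
        if start <= t <= end:
            totals[k] += x
            counts[k] += 1
    means = {k: totals[k] / counts[k] for k in arms if counts[k] > 0}
    if not means:
        return {}
    best = max(means.values())
    return {k: (best - m, counts[k]) for k, m in means.items()}


def gap_changed(history, arms, start, mid, end, delta=0.05):
    """Flag a change when some arm's gap differs between the two subintervals
    by more than the sum of Hoeffding-style radii (an assumed threshold)."""
    left = empirical_gaps(history, arms, start, mid)
    right = empirical_gaps(history, arms, mid + 1, end)
    for k in arms:
        if k in left and k in right:
            (g1, n1), (g2, n2) = left[k], right[k]
            radius = (math.sqrt(2 * math.log(1 / delta) / n1)
                      + math.sqrt(2 * math.log(1 / delta) / n2))
            if abs(g1 - g2) > radius:
                return True
    return False


# Hypothetical data: arms alternate; the arm means swap after round 100.
switching = [(t, t % 2, 0.9 if (t % 2 == 0) == (t <= 100) else 0.1)
             for t in range(1, 201)]
stationary = [(t, t % 2, 0.9 if t % 2 == 0 else 0.1) for t in range(1, 201)]
print(gap_changed(switching, [0, 1], 1, 100, 200))
print(gap_changed(stationary, [0, 1], 1, 100, 200))
```

Comparing gaps rather than raw means makes the test insensitive to shifts that move all arms by the same amount, which is the motivation for gap-based detection.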

Markovian Regimes: Spectral and Belief-based Methods

The SEEU algorithm (Zhou et al., 2020) leverages hidden Markov model (HMM) structure by:

  • Spectral Method-of-Moments: Empirically estimates the HMM’s transition matrix $P$ and regime-dependent reward means $\mu_k(s)$ via tensor decompositions of second- and third-order moments of observed arm-reward pairs.
  • Belief-state Tracking: Updates a belief vector $b_t$ over regimes using Bayesian filtering, allowing the agent to act optimally according to the inferred regime distribution.
  • Optimistic Model Selection: Maintains high-probability confidence sets around parameter estimates and solves for the optimal belief-policy for the most optimistic plausible model at each episode.
  • Phased Learning: Alternates between uniform exploration phases (to ensure identifiability of the HMM) and exploitation using the best policy inferred so far.
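The belief-state tracking step can be illustrated with a standard HMM filtering update. The sketch below assumes the transition matrix and regime-dependent means are already known (in SEEU they are estimated spectrally), uses Gaussian reward noise as a modeling assumption, and selects arms myopically rather than solving for the optimal belief-policy. All names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def belief_update(belief, P, arm_means, arm, reward, std=1.0):
    """One Bayesian filtering step.

    belief[s]: probability that the regime was s when the arm was pulled.
    P[s][s2]: regime transition probabilities.
    arm_means[s][k]: mean reward of arm k in regime s.
    Returns the belief over the regime at the next round.
    """
    M = len(belief)
    # Correction: weight each regime by the likelihood of the observed reward.
    posterior = [belief[s] * gaussian_pdf(reward, arm_means[s][arm], std)
                 for s in range(M)]
    total = sum(posterior)
    posterior = [p / total for p in posterior]
    # Prediction: propagate the posterior through the transition matrix.
    return [sum(posterior[s] * P[s][s2] for s in range(M)) for s2 in range(M)]


def greedy_arm(belief, arm_means):
    """Myopic choice: maximise the belief-averaged mean reward."""
    K = len(arm_means[0])
    expected = [sum(belief[s] * arm_means[s][k] for s in range(len(belief)))
                for k in range(K)]
    return max(range(K), key=lambda k: expected[k])


# Hypothetical two-regime, two-arm model with sticky regimes.
P = [[0.9, 0.1], [0.1, 0.9]]
arm_means = [[1.0, 0.0], [0.0, 1.0]]
b = belief_update([0.5, 0.5], P, arm_means, arm=0, reward=1.0, std=0.5)
print(b, greedy_arm(b, arm_means))
```

A high reward on arm 0 shifts the belief toward the regime in which arm 0 is good, and the transition matrix then tempers that belief to account for a possible switch before the next round.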

Conformal Bandits with Regime Detection

The Conformal Bandit framework (Cuonzo et al., 10 Dec 2025) is integrated with regime-sensing HMMs:

  • Conformalized Quantile Regression (CQR): For each arm, constructs finite-sample-valid predictive intervals via adaptive calibration scores, adapting to observed contextual covariates.
  • HMM Filtering: At each round, updates regime posterior probabilities based on new contextual information, using a transition model fitted with the EM algorithm and Gaussian emission parameters estimated from historical data.
  • Regime-aware arm selection: Uses regime estimates (e.g., Bull, Bear, Neutral) to select arms via regime-specific decision indices: for example, pick the arm with the highest upper confidence bound in Bull or Neutral regimes, or protect the downside by selecting the arm with the highest lower bound in a Bear regime.
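The interval construction and regime-aware selection rule can be sketched in a few lines. For brevity this example uses split conformal prediction with absolute residuals as scores, which is simpler than the Conformalized Quantile Regression used in the cited work; the function names and the regime labels passed as strings are assumptions for illustration.

```python
import math


def conformal_radius(scores, alpha):
    """Finite-sample conformal quantile of the calibration scores.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest score and returns
    infinity when the calibration set is too small for that level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def arm_interval(prediction, calibration_residuals, alpha=0.2):
    """Symmetric interval around a point prediction (absolute-residual scores)."""
    q = conformal_radius([abs(r) for r in calibration_residuals], alpha)
    return prediction - q, prediction + q


def select_arm(intervals, regime):
    """Highest upper bound in Bull/Neutral, highest lower bound in Bear."""
    side = 0 if regime == "Bear" else 1
    return max(range(len(intervals)), key=lambda k: intervals[k][side])


# Hypothetical intervals: arm 0 is wide and optimistic, arm 1 narrow and safe.
intervals = [(0.0, 1.0), (0.3, 0.6)]
print(select_arm(intervals, "Bull"), select_arm(intervals, "Bear"))
```

The same intervals lead to different choices depending on the inferred regime: the optimistic index favors the wide interval, while the pessimistic index favors the arm with the better worst case.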

3. Regret Bounds and Theoretical Guarantees

Piecewise-Stationary Case

For $M-1$ switches (i.e., $M$ stationary segments) over horizon $T$ and $K$ arms, the main regret result for PrudentBandits (Manegueu et al., 2021) is

$$R_T = \tilde{O}\big(\sqrt{KMT}\big),$$

which matches the minimax optimal rate $\sqrt{KMT}$ up to a logarithmic factor, assuming gaps between regime-best and sub-optimal arms are bounded away from zero and the number of regimes is known or tightly upper-bounded.

Markovian Regimes

For the Markovian regime-switching bandit, the SEEU algorithm achieves (Zhou et al., 2020):

$$R_T = O\big(T^{2/3}\sqrt{\log T}\big)$$

with high probability, under ergodicity, a full-rank emission matrix, and sufficient exploration. This rate is sublinear but does not reach the $\sqrt{T}$ minimax rate, due to the need for repeated HMM parameter estimation under partial observability.
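As a rough worked comparison, assuming the two rates are of order $T^{2/3}$ and $\sqrt{T}$ and ignoring constants and logarithmic factors:

```latex
\[
\frac{T^{2/3}}{T^{1/2}} = T^{1/6},
\qquad\text{so for } T = 10^{6}:\quad
T^{2/3} = 10^{4},\quad T^{1/2} = 10^{3},\quad T^{1/6} = 10 .
\]
```

Under these assumptions the price of learning the hidden regime structure grows only slowly with the horizon, as a factor of $T^{1/6}$.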

Conformal Bandits

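In the regime-switching Conformal Bandit setting, theoretical guarantees are provided for finite-sample coverage of the prediction interval $\hat{C}_{k,t}$ constructed for each arm-reward at nominal level $1-\alpha$, uniformly over arms and time:

$$\mathbb{P}\big(X_{k,t} \in \hat{C}_{k,t}\big) \ge 1 - \alpha \quad \text{for all } k \in \mathcal{K},\ t \in \{1,\ldots,T\},$$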

for any sequence of regimes (Cuonzo et al., 10 Dec 2025). No explicit regret bound is proved for this setting, but empirical results indicate near-logarithmic regret in small-gap regimes, outperforming standard stationary UCB methods under heavy-tailed and tiny-gap scenarios.

4. Practical Aspects and Implementation Considerations

  • Parameter Knowledge: Switching-bandit methods require an upper bound on the number of regime changes (equivalently, on the number of stationary segments $M$). Fully adaptive routines for unknown numbers of regimes remain an active area (Manegueu et al., 2021).
  • Detection Mechanisms: Compared with mean-based changepoint detection, gap-based detection yields sharper detection events when only relative arm performance is relevant.
  • Spectral Methods: Identification of latent HMM structure via tensor decompositions is data-efficient but requires episodes of uniform exploration.
  • Regime Inference: In financial applications, HMMs with Gaussian emissions are learned on historical data and used online for regime inference, which then modulates actionable decisions (Cuonzo et al., 10 Dec 2025).
  • Computational Complexity: Algorithms updating statistics over all arms and subintervals can be computationally intensive at large $T$ or with many arms; windowed estimation and episode batching are practical mitigations.
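Windowed estimation, mentioned above as a mitigation, can be sketched with a fixed-size buffer per arm. The class name `WindowedMean` and the window length are illustrative assumptions; the point is that each update costs constant time regardless of the horizon.

```python
from collections import deque


class WindowedMean:
    """Mean of the most recent `window` rewards, updated in O(1) per reward."""

    def __init__(self, window):
        self.window = window
        self.buffer = deque()
        self.total = 0.0

    def add(self, reward):
        self.buffer.append(reward)
        self.total += reward
        if len(self.buffer) > self.window:
            # Drop the oldest reward so that stale regimes fade out.
            self.total -= self.buffer.popleft()

    def mean(self):
        return self.total / len(self.buffer) if self.buffer else 0.0


# Hypothetical reward stream for one arm: the mean drops from 1 to 0.
est = WindowedMean(window=3)
for r in [1, 1, 1, 0, 0, 0]:
    est.add(r)
print(est.mean())
```

A short window tracks regime changes quickly but gives noisier estimates, so the window length trades detection delay against variance.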

5. Empirical and Applied Evaluations

Piecewise-Stationary Bandits

Theory-driven evaluations for PrudentBandits demonstrate optimal regret scaling and flexibility for handling mixed forms of nonstationarity (including polynomial drift and inflection-bounded variations) under a single code base (Manegueu et al., 2021).

Markovian Regimes

Proof-of-concept experiments for the SEEU algorithm demonstrate sublinear regret (slope $\approx 2/3$ in log-log plots), matching theoretical rates and outperforming baselines such as $\epsilon$-greedy, sliding-window UCB, and Exp3.S, all of which suffer linear regret in the presence of hidden regime switching (Zhou et al., 2020).
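Reading a regret exponent off a log-log plot amounts to a least-squares fit of log-regret against log-horizon. The sketch below shows that fit on synthetic curves; the function name `loglog_slope` and the example data are assumptions, not results from the cited experiments.

```python
import math


def loglog_slope(horizons, regrets):
    """Least-squares slope of log(regret) against log(horizon)."""
    xs = [math.log(t) for t in horizons]
    ys = [math.log(r) for r in regrets]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Synthetic curves: one growing like T^(2/3), one growing linearly.
horizons = [100, 1000, 10000, 100000]
sublinear = [3.0 * t ** (2 / 3) for t in horizons]
linear = [0.5 * t for t in horizons]
print(loglog_slope(horizons, sublinear), loglog_slope(horizons, linear))
```

A fitted slope near 1 indicates linear regret, while a slope clearly below 1 indicates that the per-round regret is vanishing.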

Financial Regime-Switching Applications

The Regime-Aware Conformal Bandit method, applied to portfolio selection on real ETF data, shows improved partial-information metrics (cumulative wealth, Sharpe ratio, drawdown) over standard UCB and stationary conformal approaches. The HMM-augmented conformal framework preserves finite-sample predictive coverage and delivers higher risk-adjusted returns. Empirical simulations reveal that all conformal variants maintain nominal (typically ~80%) coverage, with interval widths narrower than non-conformal UCB, particularly in small-gap, heavy-tail reward distributions (Cuonzo et al., 10 Dec 2025).
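The evaluation metrics named above have standard definitions, sketched here for reference. The function names, the per-period (non-annualised) Sharpe ratio, and the toy inputs are assumptions for illustration and are not taken from the cited experiments.

```python
import math


def cumulative_wealth(returns, start=1.0):
    """Wealth path obtained by compounding simple per-period returns."""
    wealth = [start]
    for r in returns:
        wealth.append(wealth[-1] * (1.0 + r))
    return wealth


def sharpe_ratio(returns, risk_free=0.0):
    """Per-period Sharpe ratio using the sample standard deviation."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((e - mean) ** 2 for e in excess) / (len(excess) - 1)
    return mean / math.sqrt(var)


def max_drawdown(wealth):
    """Largest relative drop from a running peak of the wealth path."""
    peak, worst = wealth[0], 0.0
    for w in wealth:
        peak = max(peak, w)
        worst = max(worst, (peak - w) / peak)
    return worst


def empirical_coverage(intervals, outcomes):
    """Fraction of outcomes that fall inside their prediction interval."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, outcomes))
    return hits / len(outcomes)


# Toy inputs, purely illustrative.
wealth = cumulative_wealth([0.1, -0.5, 0.2])
print(max_drawdown(wealth), sharpe_ratio([0.01, 0.03]))
```

Empirical coverage computed this way is the quantity compared against the nominal level when checking that conformal intervals remain valid across regimes.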

| Model type | Regret bound | Regime modeling approach |
|---|---|---|
| Piecewise-stationary (PrudentBandits) | $\tilde{O}(\sqrt{KMT})$ | Deterministic breakpoints |
| Markovian regimes (SEEU) | $O(T^{2/3}\sqrt{\log T})$ | Hidden finite-state Markov chain |
| Regime-aware Conformal Bandit | Empirical near-logarithmic | HMM + conformal inference |

Classical switching bandit methods achieve the minimax $\sqrt{KMT}$ regret up to log factors, assuming a known or tightly upper-bounded number of regimes and reasonably separated arm means. Markovian approaches yield higher regret but do not require prior knowledge of regime structure, instead learning latent transitions online. Conformal bandit methods emphasize statistical coverage in addition to regret, providing prediction validity guarantees not generally available in classical approaches.

7. Perspectives and Future Directions

Open directions include online adaptation to an unknown and potentially unbounded number of latent regimes, tighter regret bounds for bandits under hidden Markov dependence, optimal exploration schedules for spectral learning phases, and integration of finite-sample coverage guarantees with minimax-optimal regret. In nonparametric and heavy-tailed small-gap settings, coupling regime detection with robust inference methods demonstrates advantages for both risk-adjusted rewards and predictive validity (Cuonzo et al., 10 Dec 2025). Scalable, computationally efficient methods suited to real-time deployment are increasingly important in high-dimensional and fast-switching environments.
