Heterogeneous RMAB: Models, Algorithms, and Theory

Updated 12 November 2025
  • Heterogeneous RMABs are problems where independent, non-identical Markov arms evolve at every time step, whether activated or not.
  • They enable efficient resource allocation in uncertain environments using occupancy LP relaxations, online mirror descent, and index-based scheduling.
  • Advanced algorithms achieve regret of order Õ(H√T) in adversarial settings and logarithmic regret in stationary regimes through careful exploration-exploitation trade-offs.

A heterogeneous restless multi-armed bandit (RMAB) is a generalization of the classical bandit model in which a collection of independent, non-identical Markov decision processes ("arms") evolve restlessly—meaning each arm's state transitions at every step regardless of whether it is chosen ("activated") or left passive. Heterogeneity arises from arm-specific state spaces, reward structures, and transition kernels. The decision-maker faces an instantaneous activation constraint (e.g., at most $B$ arms can be activated per period) and seeks to maximize cumulative (possibly adversarial) rewards, often under partial (bandit) feedback and with unknown arm dynamics. The RMAB is a canonical model for resource allocation, scheduling, communication, and intervention planning in nonstationary, uncertain environments.

1. Formal Model Specification

Let $N$ denote the number of arms, indexed by $n\in[N]$. Arm $n$ has its own finite state space $\mathcal{S}_n$ ($|\mathcal{S}_n|<\infty$) and binary action set $\mathcal{A}=\{0,1\}$ ("passive" or "activate"). At each decision epoch $h\in\{1,\dots,H\}$ within episode $t\in\{1,\dots,T\}$, the decision-maker chooses an action $A_n^{t,h} \in \mathcal{A}$ for each arm, subject to a hard constraint:

$$\sum_{n=1}^N A_n^{t,h} \leq B$$

Each arm evolves according to its own action-dependent, unknown Markov kernel:

$$P_n(s' \mid s, a) \in [0,1], \quad s, s' \in \mathcal{S}_n,\; a\in\{0,1\}$$

The environment may present adversarial, non-stationary reward functions selected arbitrarily at each episode: $r_n^t: \mathcal{S}_n \times \{0,1\} \to [0,1]$. In the "bandit feedback" regime, the learner only observes rewards for the state-action pairs actually visited: $r_n^t(S_n^{t,h}, A_n^{t,h})$.

Benchmarking uses a noncausal offline optimal policy $\pi^{\text{opt}}$ maximizing cumulative expected reward under the hard constraint. Regret after $T$ episodes is

$$\Delta(T) = \sum_{t=1}^T R_t(\pi^{\text{opt}}) - \sum_{t=1}^T R_t(\pi^t)$$

where $R_t(\pi)$ is the expected reward under policy $\pi$ in episode $t$, given all transition kernels $\{P_n\}$.
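
To make the episodic model concrete, the following minimal Python sketch simulates one episode of a heterogeneous RMAB under a hard per-step budget. The instance parameters, the uniformly random placeholder policy, and all variable names are illustrative assumptions, not drawn from any cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance: N heterogeneous arms, each with its own state-space
# size, action-dependent transition kernels P_n[a], and reward table r_n(s, a).
N, B, H = 5, 2, 20                        # arms, per-step budget, horizon
state_sizes = rng.integers(2, 5, size=N)  # heterogeneous |S_n|

def random_kernel(S):
    """Row-stochastic |S| x |S| transition matrix for one action."""
    P = rng.random((S, S))
    return P / P.sum(axis=1, keepdims=True)

arms = []
for S in state_sizes:
    arms.append({
        "P": {a: random_kernel(S) for a in (0, 1)},   # unknown to the learner
        "r": rng.random((S, 2)),                      # rewards in [0, 1]
        "s": rng.integers(S),                         # current state
    })

total_reward = 0.0
for h in range(H):
    # Placeholder policy: activate B arms uniformly at random
    # (an index policy would rank the arms here instead).
    active = set(rng.choice(N, size=B, replace=False))
    for n, arm in enumerate(arms):
        a = 1 if n in active else 0
        total_reward += arm["r"][arm["s"], a]
        # Restless dynamics: every arm transitions, activated or not.
        arm["s"] = rng.choice(len(arm["P"][a]), p=arm["P"][a][arm["s"]])

print(f"episode reward under the random policy: {total_reward:.2f}")
```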

2. Algorithmic Frameworks for Heterogeneous Restless Bandits

2.1. Model-Based Approaches

  • Confidence-Set Estimation: Tracks empirical transition statistics per arm (state-action visit counts $C_n(s,a)$ and transition counts $C_n(s,a,s')$), forming confidence sets $\mathcal{P}_n^t(s,a)$ using Hoeffding-style bounds; a minimal numerical sketch follows this list. Key update:

$$\delta_n^t(s,a) = \sqrt{\frac{\ln\!\left(4|\mathcal{S}|^2 N (t-1) H/\epsilon\right)}{2\, C_n^{t-1}(s,a)}}$$

  • Occupancy-Measure Linear Program Relaxations: The per-step activation constraint is relaxed to an average constraint, and a linear program over time-dependent occupancy measures $\mu_n(s,a;h)$ is constructed. The state-action-state occupancy $z_n(s,a,s';h)$ encodes transitions consistent with the estimated confidence sets.
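
As a minimal sketch of the confidence-radius computation in the first bullet above (the array layout, function signature, and example numbers are assumptions made purely for illustration):

```python
import numpy as np

def confidence_radius(C_prev, S_size, N, H, t, eps=0.05):
    """Hoeffding-style radius delta_n^t(s, a) for one arm.

    C_prev: visit counts C_n^{t-1}(s, a), shape (|S_n|, 2).
    S_size: |S| used in the union bound; N, H, t, eps as in the text.
    """
    counts = np.maximum(C_prev, 1)                      # avoid division by zero
    log_term = np.log(4 * S_size**2 * N * max(t - 1, 1) * H / eps)
    return np.sqrt(log_term / (2 * counts))

# Example: a 3-state arm visited a handful of times before episode t = 5.
C = np.array([[4, 1], [0, 2], [7, 3]])
print(confidence_radius(C, S_size=3, N=10, H=20, t=5))
```

The resulting radii define, entrywise, the confidence set around the empirical transition frequencies used by the occupancy LP.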

2.2. Online Convex Optimization (OMD Approach)

  • Mirror Descent on Occupancy Measures: The learner updates occupancy measures via Online Mirror Descent (OMD) steps:

$$z^t \leftarrow \arg\max_{z \in \mathcal{Z}^t} \langle z,\; \eta\,\hat{r}^{t-1} \rangle - D_{\mathrm{KL}}\!\left(z \,\|\, z^{t-1}\right)$$

where $\hat{r}^{t-1}$ is a high-probability optimistic reward estimator constructed from bandit feedback, and $\mathcal{Z}^t$ enforces the balance, budget, and confidence constraints.
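
The exact OMD step requires a projection onto $\mathcal{Z}^t$. The sketch below shows only the unconstrained entropic (exponentiated-gradient) part of the update over a probability simplex, a simplified stand-in for the full constrained argmax; the function name and toy data are assumptions.

```python
import numpy as np

def omd_entropic_step(z_prev, r_hat, eta):
    """One KL-regularized mirror-descent step on a flattened occupancy vector.

    Solves argmax_z <z, eta * r_hat> - KL(z || z_prev) over the simplex;
    projecting onto the full constraint set Z^t (flow balance, budget,
    confidence sets) is omitted here and would follow this step.
    """
    logits = np.log(np.maximum(z_prev, 1e-12)) + eta * r_hat
    z = np.exp(logits - logits.max())          # numerically stable softmax
    return z / z.sum()

# Toy usage: 6 occupancy entries, optimistic reward estimates, step size eta.
z0 = np.full(6, 1 / 6)
r_hat = np.array([0.2, 0.9, 0.1, 0.5, 0.7, 0.3])
print(omd_entropic_step(z0, r_hat, eta=0.5))
```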

2.3. Bandit Feedback: Biased Reward Estimation

  • Implicit Exploration Estimators: For each $(s,a)$, the estimator is

$$\bar{r}_n^t(s,a) = \frac{H\, r_n^t(s,a)}{\max\{c_n^t(s,a),\,1\}}$$

where $c_n^t(s,a)$ is the number of visits to $(s,a)$ in episode $t$; an exploration bonus $\delta_n^t(s,a)$ is added, the estimate is capped at $1$, and unvisited pairs receive an optimistic default value (a minimal sketch follows this list).

  • Projection to Hard Budgets: After solving the relaxed LP and performing the OMD update, a Reward-Maximizing Index (RMI) is constructed for each arm-state pair. At step $(t,h)$, the arms with the largest RMI values are activated, satisfying the hard per-period constraint.
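
A minimal sketch of the implicit-exploration estimator from the first bullet above, following the displayed formula; the array-based bookkeeping and all names are assumptions for illustration.

```python
import numpy as np

def optimistic_reward_estimate(r_obs, visits, delta, H):
    """Biased (optimistic) reward estimate for one arm over one episode.

    r_obs:  observed rewards r_n^t(s, a), shape (|S_n|, 2); entries for
            unvisited pairs are ignored.
    visits: episode visit counts c_n^t(s, a), same shape.
    delta:  exploration bonuses delta_n^t(s, a), same shape.
    H:      episode length.
    """
    r_bar = H * r_obs / np.maximum(visits, 1)       # inverse-frequency weighting
    r_bar = np.minimum(r_bar + delta, 1.0)          # add bonus, cap at 1
    r_bar[visits == 0] = 1.0                        # optimistic default
    return r_bar

# Toy usage for a 3-state arm: unvisited pairs stay fully optimistic.
r_obs  = np.array([[0.4, 0.0], [0.2, 0.6], [0.0, 0.1]])
visits = np.array([[3,   0  ], [1,   2  ], [0,   4  ]])
delta  = np.full_like(r_obs, 0.05)
print(optimistic_reward_estimate(r_obs, visits, delta, H=10))
```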

2.4. Index-Based Scheduling

  • Whittle-Index and Priority Rules: Much of the literature (for stochastic or stationary-reward settings) constructs Whittle indices via Lagrangian relaxation and then schedules the top-$B$ arms per period. In the adversarial/bandit-feedback case, an index is computed from OMD-derived occupancy probabilities:

$$\mathcal{I}_n^t(s;h) = \frac{\sum_{s'} z_n^{t,*}(s,1,s';h)}{\sum_{b \in \{0,1\}}\sum_{s'} z_n^{t,*}(s,b,s';h)}$$

and the top $B$ arms are selected in each round.
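
A minimal sketch of the occupancy-derived index and top-$B$ selection; the occupancy tensor layout and all names are illustrative assumptions.

```python
import numpy as np

def activation_index(z, s, h):
    """Index I_n^t(s; h): activation mass implied by the occupancy solution.

    z: occupancy tensor z_n^{t,*}(s, a, s'; h) for one arm,
       shape (|S_n|, 2, |S_n|, H).  The layout is an assumption.
    """
    active_mass = z[s, 1, :, h].sum()
    total_mass = z[s, :, :, h].sum()
    return active_mass / max(total_mass, 1e-12)

def select_top_B(indices, B):
    """Activate the B arms with the largest current index values."""
    order = np.argsort(indices)[::-1]
    return set(order[:B].tolist())

# Toy usage: 4 arms, each with a random (unnormalized) occupancy tensor.
rng = np.random.default_rng(1)
S, H, N, B = 3, 5, 4, 2
zs = [rng.random((S, 2, S, H)) for _ in range(N)]
states = [0, 2, 1, 0]                               # current state of each arm
idx = np.array([activation_index(zs[n], states[n], h=0) for n in range(N)])
print(idx, select_top_B(idx, B))
```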

3. Theoretical Guarantees and Regret Rates

3.1. Adversarial, Bandit-Feedback RMABs

  • High-Probability Regret Bounds: For adversarial rewards, unknown heterogeneous transitions, hard constraints, and bandit feedback, the UCMD-ARMAB algorithm achieves regret

$$\widetilde{O}(H\sqrt{T})$$

with probability at least $1-3\epsilon$ when the step size is $\eta = \sqrt{\ln(|\mathcal{S}|^2 N)/T}$ (Xiong et al., 2 May 2024). The bound combines the OMD optimization error, reward-estimation concentration, and the error from enforcing the hard activation constraint via the RMI.

3.2. Stochastic, Weakly Coupled and Whittle-Relaxed Models

  • Logarithmic Regret for Stationary Rewards: For stochastic settings with unknown per-arm Markov transitions and rewards, epoch-based exploration-exploitation schemes (DSEE, ASR, LEMP) achieve

$$R(T) = O(\log T)$$

with explicit dependence on gaps, mixing times, and arm/state heterogeneity (Liu et al., 2010, Gafni et al., 2019, Gafni et al., 2021, Gafni et al., 2022).

  • Fluid LP/MPC Schemes: In the infinite-horizon average-reward regime, LP-update (model-predictive) control with randomized rounding, under uniform ergodicity, achieves an

$$O\!\left(\frac{\log N}{\sqrt{N}}\right)$$

optimality gap relative to the Whittle-relaxed upper bound (Narasimha et al., 11 Nov 2025); a minimal rounding sketch follows this list.

  • Extensions to Multi-Resource, Multi-Worker RMABs: Lagrangian relaxations and specialized index policies accommodate heterogeneous worker/resource constraints, per-task costs, and fairness, empirically matching optimal reward within $2\%$–$10\%$ (Biswas et al., 2023).
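
As a hedged illustration of the "randomized rounding" step mentioned in the fluid LP/MPC bullet above, the sketch below rounds a fractional activation vector to a hard budget. This is a generic rounding heuristic written for illustration under stated assumptions, not the specific procedure of Narasimha et al.

```python
import numpy as np

def round_activations(x, B, rng):
    """Round a fractional activation vector x (with sum(x) <= B) to a hard
    budget of at most B activations.

    Illustrative scheme: sample arm n independently with probability x[n];
    if the budget is exceeded, keep the B sampled arms with the largest
    fractional values.
    """
    sampled = np.flatnonzero(rng.random(len(x)) < x)
    if len(sampled) > B:
        keep = sampled[np.argsort(x[sampled])[::-1][:B]]
        return set(keep.tolist())
    return set(sampled.tolist())

# Toy usage: a fractional LP solution over 6 arms with budget B = 2.
rng = np.random.default_rng(2)
x = np.array([0.9, 0.1, 0.4, 0.3, 0.2, 0.1])
print(round_activations(x, B=2, rng=rng))
```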

4. Algorithmic Landscapes and Methodological Variants

| Regime | Setting & Feedback | Heterogeneity | Regret/Optimality | Core Algorithmic Ingredients |
|---|---|---|---|---|
| Adversarial RMAB (Xiong et al., 2 May 2024) | Episodic, bandit | Markov kernels, rewards | $\widetilde{O}(H\sqrt{T})$ | Occupancy-OMD, UCB confidence sets, RMI |
| Stationary, deterministic | Infinite-horizon, full | Arm transitions & rewards | $O(\log T)$ | DSEE, ASR, per-arm adaptive rates |
| Average-reward, ergodicity | Fluid, offline LP | Arbitrary transitions | $O(\log N/\sqrt{N})$ gap | Model-predictive control, LP rounding |
| Resource/fairness constraints | Multi-resource/workers | Per-worker costs/budgets | $\approx$ optimal | Multi-worker Whittle, balanced allocation |

These variants address key challenges: unknown, nonstationary, adversarial rewards; arm-specific transitions and reward structures; hard activation or budget constraints; and partial (bandit) feedback.

5. Unique Challenges in Heterogeneous and Bandit-Feedback Regimes

Heterogeneity introduces several difficulties absent in homogeneous or i.i.d. settings:

  • Transition Estimation: Each arm's distinct transition kernel requires independent confidence-set management; convergence rates depend on state-space size, mixing gaps, and the time spent on each arm.
  • Exploration-Exploitation Trade-off: Static or worst-case exploration schedules may incur suboptimal regret (oversampling easy arms). Modern approaches use per-arm (and sometimes per-state) data-driven estimates of "hardness" (e.g., reward gap, mixing rate; see $D^i_s$ in LEMP (Gafni et al., 2022)).
  • Bandit and Adversarial Feedback: Only a subset of rewards is observed; reward estimation must maintain optimism for unvisited state-action pairs, requiring optimistic or unbiased estimators and confidence bonuses.
  • Instantaneous Activation Constraints: Hard per-step activation bounds must be satisfied by mapping relaxed solutions (from occupancy LPs or OMD) back to feasible actions, which may reduce empirical regret performance if not carefully managed.

6. Extensions and Applications

Heterogeneous RMABs appear in cognitive radio (dynamic spectrum access), queueing and server allocation, project monitoring, health intervention planning, financial portfolio management, anti-poaching surveillance, and more. Contemporary extensions address:

  • Streaming and Finite-Horizon: Arms may arrive, depart, or have finite lifetimes (streaming or finite-horizon RMABs). Efficient interpolation and index-decay techniques yield scalable, near-optimal solutions with accompanying guarantees (Mate et al., 2021).
  • Global/Contextual State and Multi-Agent Settings: Arms' rewards and dynamics may depend on exogenous global states (e.g., epidemics, system overload)—necessitating learning both per-arm and global transitions (Gafni et al., 2021, Gafni et al., 2022).
  • Resource Pooling, Competition, and Reservation: Arms may compete for several shared, capacity-limited resources, requiring multi-dimensional relaxations and pricing (Lagrange multipliers) to modulate admission control and priority (Fu et al., 2018).
  • Deep and Feature-Based Generalization: Recent architectures pretrain shared policies over arm-level features, supporting opt-in/opt-out (streaming) arms and generalizing to continuous state/action spaces and multi-action settings (Zhao et al., 2023).

Notable open problems in heterogeneous RMAB research include:

  • Adversarial, Non-Stationary Regimes: Tight characterization of achievable regret under the combination of adversarial rewards, unknown transitions, hard instantaneous constraints, and bandit feedback (Xiong et al., 2 May 2024).
  • Scalability and Generalization: Efficient, scalable computation and learning with large NN, high-dimensional/continuous state/action spaces, and feature-driven transitions.
  • Multi-resource and Fairness: Optimal or near-optimal policies when arms/tasks require combinatorial resources—ensuring fairness across resource/provider types (Biswas et al., 2023).
  • Partial Observability and Non-indexability: Many realistic RMAB instances are non-indexable (lack monotonicity or threshold policies), requiring deployment of fluid LP, model-predictive, or reinforcement learning approaches that do not rely on index structure (Narasimha et al., 11 Nov 2025).

Heterogeneous RMAB theory and algorithms thus unite sequential learning, stochastic control, convex optimization, and online bandit feedback, setting the foundation for a wide swath of real-world resource allocation and decision-making under uncertainty.
