Heterogeneous RMAB: Models, Algorithms, and Theory

Updated 12 November 2025
  • Heterogeneous RMABs are problems where independent, non-identical Markov arms evolve at every time step, whether activated or not.
  • They enable efficient resource allocation in uncertain environments using occupancy LP relaxations, online mirror descent, and index-based scheduling.
  • Advanced algorithms achieve regret of order Õ(H√T) in adversarial settings and logarithmic regret in stationary regimes through careful exploration-exploitation trade-offs.

A heterogeneous restless multi-armed bandit (RMAB) is a generalization of the classical bandit model in which a collection of independent, non-identical Markov decision processes ("arms") evolve restlessly—meaning each arm's state transitions at every step regardless of whether it is chosen ("activated") or left passive. Heterogeneity arises from arm-specific state spaces, reward structures, and transition kernels. The decision-maker faces an instantaneous activation constraint (e.g., at most $B$ arms can be activated per period) and seeks to maximize cumulative (possibly adversarial) rewards, often under partial (bandit) feedback and with unknown arm dynamics. The RMAB is a canonical model for resource allocation, scheduling, communication, and intervention planning in nonstationary, uncertain environments.

1. Formal Model Specification

Let $N$ denote the number of arms, indexed by $n\in[N]$. Arm $n$ has its own finite state space $\mathcal{S}_n$ ($|\mathcal{S}_n|<\infty$) and binary action set $\mathcal{A}=\{0,1\}$ ("passive" or "activate"). At each decision epoch $h\in\{1,\dots,H\}$ within episode $t\in\{1,\dots,T\}$, the decision-maker chooses an action $A_n^{t,h} \in \mathcal{A}$ for each arm, subject to a hard constraint:

$$\sum_{n=1}^N A_n^{t,h} \leq B$$

Each arm evolves according to its own action-dependent, unknown Markov kernel:

$$P_n(s' \mid s, a) \in [0,1], \quad s, s' \in \mathcal{S}_n,\; a\in\{0,1\}$$

The environment may present adversarial, non-stationary reward functions selected arbitrarily at each episode: $r_n^t: \mathcal{S}_n \times \{0,1\} \to [0,1]$. In the "bandit feedback" regime, the learner only observes rewards for the state-action pairs actually visited: $r_n^t(S_n^{t,h}, A_n^{t,h})$.

Benchmarking uses a noncausal offline optimal policy $\pi^{\text{opt}}$ maximizing cumulative expected reward under the hard constraint. Regret after $T$ episodes is

$$\Delta(T) = \sum_{t=1}^T R_t(\pi^{\text{opt}}) - \sum_{t=1}^T R_t(\pi^t)$$

where $R_t(\pi)$ is the expected reward under policy $\pi$ in episode $t$, given all transition kernels $\{P_n\}$.
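
To make the episodic model concrete, the following minimal Python sketch simulates one episode of a heterogeneous RMAB under a hard per-step budget. The instance parameters, the uniformly random placeholder policy, and all variable names are illustrative assumptions, not drawn from any cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance: N heterogeneous arms, each with its own state-space
# size, action-dependent transition kernels P_n[a], and reward table r_n(s, a).
N, B, H = 5, 2, 20                        # arms, per-step budget, horizon
state_sizes = rng.integers(2, 5, size=N)  # heterogeneous |S_n|

def random_kernel(S):
    """Row-stochastic |S| x |S| transition matrix for one action."""
    P = rng.random((S, S))
    return P / P.sum(axis=1, keepdims=True)

arms = []
for S in state_sizes:
    arms.append({
        "P": {a: random_kernel(S) for a in (0, 1)},   # unknown to the learner
        "r": rng.random((S, 2)),                      # rewards in [0, 1]
        "s": rng.integers(S),                         # current state
    })

total_reward = 0.0
for h in range(H):
    # Placeholder policy: activate B arms uniformly at random
    # (an index policy would rank the arms here instead).
    active = set(rng.choice(N, size=B, replace=False))
    for n, arm in enumerate(arms):
        a = 1 if n in active else 0
        total_reward += arm["r"][arm["s"], a]
        # Restless dynamics: every arm transitions, activated or not.
        arm["s"] = rng.choice(len(arm["P"][a]), p=arm["P"][a][arm["s"]])

print(f"episode reward under the random policy: {total_reward:.2f}")
```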

2. Algorithmic Frameworks for Heterogeneous Restless Bandits

2.1. Model-Based Approaches

  • Confidence-Set Estimation: Tracks empirical transition statistics per arm (state-action visit counts $C_n(s,a)$ and transition counts $C_n(s,a,s')$), forming confidence sets $\mathcal{P}_n^t(s,a)$ using Hoeffding-style bounds; a minimal numerical sketch follows this list. Key update:

$$\delta_n^t(s,a) = \sqrt{\frac{\ln\!\left(4|\mathcal{S}|^2 N (t-1) H/\epsilon\right)}{2\, C_n^{t-1}(s,a)}}$$

  • Occupancy-Measure Linear Program Relaxations: The per-step activation constraint is relaxed to an average constraint, and a linear program over time-dependent occupancy measures $\mu_n(s,a;h)$ is constructed. The state-action-state occupancy $z_n(s,a,s';h)$ encodes transitions consistent with the estimated confidence sets.
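
As a minimal sketch of the confidence-radius computation in the first bullet above (the array layout, function signature, and example numbers are assumptions made purely for illustration):

```python
import numpy as np

def confidence_radius(C_prev, S_size, N, H, t, eps=0.05):
    """Hoeffding-style radius delta_n^t(s, a) for one arm.

    C_prev: visit counts C_n^{t-1}(s, a), shape (|S_n|, 2).
    S_size: |S| used in the union bound; N, H, t, eps as in the text.
    """
    counts = np.maximum(C_prev, 1)                      # avoid division by zero
    log_term = np.log(4 * S_size**2 * N * max(t - 1, 1) * H / eps)
    return np.sqrt(log_term / (2 * counts))

# Example: a 3-state arm visited a handful of times before episode t = 5.
C = np.array([[4, 1], [0, 2], [7, 3]])
print(confidence_radius(C, S_size=3, N=10, H=20, t=5))
```

The resulting radii define, entrywise, the confidence set around the empirical transition frequencies used by the occupancy LP.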

2.2. Online Convex Optimization (OMD Approach)

  • Mirror Descent on Occupancy Measures: The learner updates occupancy measures via Online Mirror Descent (OMD) steps:

$$z^t \leftarrow \arg\max_{z \in \mathcal{Z}^t} \langle z,\; \eta\,\hat{r}^{t-1} \rangle - D_{\mathrm{KL}}\!\left(z \,\|\, z^{t-1}\right)$$

where $\hat{r}^{t-1}$ is a high-probability optimistic reward estimator constructed from bandit feedback, and $\mathcal{Z}^t$ enforces the balance, budget, and confidence constraints.
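
The exact OMD step requires a projection onto $\mathcal{Z}^t$. The sketch below shows only the unconstrained entropic (exponentiated-gradient) part of the update over a probability simplex, a simplified stand-in for the full constrained argmax; the function name and toy data are assumptions.

```python
import numpy as np

def omd_entropic_step(z_prev, r_hat, eta):
    """One KL-regularized mirror-descent step on a flattened occupancy vector.

    Solves argmax_z <z, eta * r_hat> - KL(z || z_prev) over the simplex;
    projecting onto the full constraint set Z^t (flow balance, budget,
    confidence sets) is omitted here and would follow this step.
    """
    logits = np.log(np.maximum(z_prev, 1e-12)) + eta * r_hat
    z = np.exp(logits - logits.max())          # numerically stable softmax
    return z / z.sum()

# Toy usage: 6 occupancy entries, optimistic reward estimates, step size eta.
z0 = np.full(6, 1 / 6)
r_hat = np.array([0.2, 0.9, 0.1, 0.5, 0.7, 0.3])
print(omd_entropic_step(z0, r_hat, eta=0.5))
```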

2.3. Bandit Feedback: Biased Reward Estimation

  • Implicit Exploration Estimators: For each $(s,a)$, the estimator is

$$\bar{r}_n^t(s,a) = \frac{H\, r_n^t(s,a)}{\max\{c_n^t(s,a),\,1\}}$$

where $c_n^t(s,a)$ is the number of visits to $(s,a)$ in episode $t$; an exploration bonus $\delta_n^t(s,a)$ is added, the estimate is capped at $1$, and unvisited pairs receive an optimistic default value (a minimal sketch follows this list).

  • Projection to Hard Budgets: After solving the relaxed LP and performing the OMD update, a Reward-Maximizing Index (RMI) is constructed for each arm-state pair. At step $(t,h)$, the arms with the largest RMI values are activated, satisfying the hard per-period constraint.
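
A minimal sketch of the implicit-exploration estimator from the first bullet above, following the displayed formula; the array-based bookkeeping and all names are assumptions for illustration.

```python
import numpy as np

def optimistic_reward_estimate(r_obs, visits, delta, H):
    """Biased (optimistic) reward estimate for one arm over one episode.

    r_obs:  observed rewards r_n^t(s, a), shape (|S_n|, 2); entries for
            unvisited pairs are ignored.
    visits: episode visit counts c_n^t(s, a), same shape.
    delta:  exploration bonuses delta_n^t(s, a), same shape.
    H:      episode length.
    """
    r_bar = H * r_obs / np.maximum(visits, 1)       # inverse-frequency weighting
    r_bar = np.minimum(r_bar + delta, 1.0)          # add bonus, cap at 1
    r_bar[visits == 0] = 1.0                        # optimistic default
    return r_bar

# Toy usage for a 3-state arm: unvisited pairs stay fully optimistic.
r_obs  = np.array([[0.4, 0.0], [0.2, 0.6], [0.0, 0.1]])
visits = np.array([[3,   0  ], [1,   2  ], [0,   4  ]])
delta  = np.full_like(r_obs, 0.05)
print(optimistic_reward_estimate(r_obs, visits, delta, H=10))
```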

2.4. Index-Based Scheduling

  • Whittle-Index and Priority Rules: Much of the literature (for stochastic or stationary-reward settings) constructs Whittle indices via Lagrangian relaxation and then schedules the top-$B$ arms per period. In the adversarial/bandit-feedback case, an index is computed from OMD-derived occupancy probabilities:

$$\mathcal{I}_n^t(s;h) = \frac{\sum_{s'} z_n^{t,*}(s,1,s';h)}{\sum_{b \in \{0,1\}}\sum_{s'} z_n^{t,*}(s,b,s';h)}$$

and the top $B$ arms are selected in each round.
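
A minimal sketch of the occupancy-derived index and top-$B$ selection; the occupancy tensor layout and all names are illustrative assumptions.

```python
import numpy as np

def activation_index(z, s, h):
    """Index I_n^t(s; h): activation mass implied by the occupancy solution.

    z: occupancy tensor z_n^{t,*}(s, a, s'; h) for one arm,
       shape (|S_n|, 2, |S_n|, H).  The layout is an assumption.
    """
    active_mass = z[s, 1, :, h].sum()
    total_mass = z[s, :, :, h].sum()
    return active_mass / max(total_mass, 1e-12)

def select_top_B(indices, B):
    """Activate the B arms with the largest current index values."""
    order = np.argsort(indices)[::-1]
    return set(order[:B].tolist())

# Toy usage: 4 arms, each with a random (unnormalized) occupancy tensor.
rng = np.random.default_rng(1)
S, H, N, B = 3, 5, 4, 2
zs = [rng.random((S, 2, S, H)) for _ in range(N)]
states = [0, 2, 1, 0]                               # current state of each arm
idx = np.array([activation_index(zs[n], states[n], h=0) for n in range(N)])
print(idx, select_top_B(idx, B))
```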

3. Theoretical Guarantees and Regret Rates

3.1. Adversarial, Bandit-Feedback RMABs

  • High-Probability Regret Bounds: For adversarial rewards, unknown heterogeneous transitions, hard constraints, and bandit feedback, the UCMD-ARMAB algorithm achieves regret

$$\widetilde{O}(H\sqrt{T})$$

with probability at least $1-3\epsilon$ when the step size is $\eta = \sqrt{\ln(|\mathcal{S}|^2 N)/T}$ (Xiong et al., 2 May 2024). The bound combines the OMD optimization error, reward-estimation concentration, and the error from enforcing the hard activation constraint via the RMI.

3.2. Stochastic, Weakly Coupled and Whittle-Relaxed Models

  • Logarithmic Regret for Stationary Rewards: For stochastic settings with unknown per-arm Markov transitions and rewards, epoch-based exploration-exploitation schemes (DSEE, ASR, LEMP) achieve

$$R(T) = O(\log T)$$

with explicit dependence on gaps, mixing times, and arm/state heterogeneity (Liu et al., 2010, Gafni et al., 2019, Gafni et al., 2021, Gafni et al., 2022).

  • Fluid LP/MPC Schemes: In the infinite-horizon average-reward regime, LP-update (model-predictive) control with randomized rounding, under uniform ergodicity, achieves an

$$O\!\left(\frac{\log N}{\sqrt{N}}\right)$$

optimality gap relative to the Whittle-relaxed upper bound (Narasimha et al., 11 Nov 2025); a minimal rounding sketch follows this list.

  • Extensions to Multi-Resource, Multi-Worker RMABs: Lagrangian relaxations and specialized index policies accommodate heterogeneous worker/resource constraints, per-task costs, and fairness, empirically matching optimal reward within $2\%$–$10\%$ (Biswas et al., 2023).
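
As a hedged illustration of the "randomized rounding" step mentioned in the fluid LP/MPC bullet above, the sketch below rounds a fractional activation vector to a hard budget. This is a generic rounding heuristic written for illustration under stated assumptions, not the specific procedure of Narasimha et al.

```python
import numpy as np

def round_activations(x, B, rng):
    """Round a fractional activation vector x (with sum(x) <= B) to a hard
    budget of at most B activations.

    Illustrative scheme: sample arm n independently with probability x[n];
    if the budget is exceeded, keep the B sampled arms with the largest
    fractional values.
    """
    sampled = np.flatnonzero(rng.random(len(x)) < x)
    if len(sampled) > B:
        keep = sampled[np.argsort(x[sampled])[::-1][:B]]
        return set(keep.tolist())
    return set(sampled.tolist())

# Toy usage: a fractional LP solution over 6 arms with budget B = 2.
rng = np.random.default_rng(2)
x = np.array([0.9, 0.1, 0.4, 0.3, 0.2, 0.1])
print(round_activations(x, B=2, rng=rng))
```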

4. Algorithmic Landscapes and Methodological Variants

| Regime | Setting & Feedback | Heterogeneity | Regret/Optimality | Core Algorithmic Ingredients |
|---|---|---|---|---|
| Adversarial RMAB (Xiong et al., 2 May 2024) | Episodic, bandit | Markov kernels, rewards | $\widetilde{O}(H\sqrt{T})$ | Occupancy-OMD, UCB confidence sets, RMI |
| Stationary, deterministic | Infinite-horizon, full | Arm transitions & rewards | $O(\log T)$ | DSEE, ASR, per-arm adaptive rates |
| Average-reward, ergodicity | Fluid, offline LP | Arbitrary transitions | $O(\log N/\sqrt{N})$ gap | Model-predictive control, LP rounding |
| Resource/fairness constraints | Multi-resource/workers | Per-worker costs/budgets | $\approx$ optimal | Multi-worker Whittle, balanced allocation |

These variants address key challenges: unknown, nonstationary, adversarial rewards; arm-specific transitions and reward structures; hard activation or budget constraints; and partial (bandit) feedback.

5. Unique Challenges in Heterogeneous and Bandit-Feedback Regimes

Heterogeneity introduces several difficulties absent in homogeneous or i.i.d. settings:

  • Transition Estimation: Each arm's distinct transition kernel requires independent confidence-set management; convergence rates depend on state-space size, mixing gaps, and the time spent on each arm.
  • Exploration-Exploitation Trade-off: Static or worst-case exploration schedules may incur suboptimal regret (oversampling easy arms). Modern approaches use per-arm (and sometimes per-state) data-driven estimates of "hardness" (e.g., reward gap, mixing rate; see $D^i_s$ in LEMP (Gafni et al., 2022)).
  • Bandit and Adversarial Feedback: Only a subset of rewards is observed; reward estimation must maintain optimism for unvisited state-action pairs, requiring optimistic or unbiased estimators and confidence bonuses.
  • Instantaneous Activation Constraints: Hard per-step activation bounds must be satisfied by mapping relaxed solutions (from occupancy LPs or OMD) back to feasible actions, which may reduce empirical regret performance if not carefully managed.

6. Extensions and Applications

Heterogeneous RMABs appear in cognitive radio (dynamic spectrum access), queueing and server allocation, project monitoring, health intervention planning, financial portfolio management, anti-poaching surveillance, and more. Contemporary extensions address:

  • Streaming and Finite-Horizon: Arms may arrive, depart, or have finite lifetimes (streaming or finite-horizon RMABs). Efficient interpolation and index-decay techniques yield scalable, near-optimal solutions with accompanying guarantees (Mate et al., 2021).
  • Global/Contextual State and Multi-Agent Settings: Arms' rewards and dynamics may depend on exogenous global states (e.g., epidemics, system overload)—necessitating learning both per-arm and global transitions (Gafni et al., 2021, Gafni et al., 2022).
  • Resource Pooling, Competition, and Reservation: Arms may compete for several shared, capacity-limited resources, requiring multi-dimensional relaxations and pricing (Lagrange multipliers) to modulate admission control and priority (Fu et al., 2018).
  • Deep and Feature-Based Generalization: Recent architectures pretrain shared policies over arm-level features, supporting opt-in/opt-out (streaming) arms and generalizing to continuous state/action spaces and multi-action settings (Zhao et al., 2023).

Notable open problems in heterogeneous RMAB research include:

  • Adversarial, Non-Stationary Regimes: Tight characterization of achievable regret under the combination of adversarial rewards, unknown transitions, hard instantaneous constraints, and bandit feedback (Xiong et al., 2 May 2024).
  • Scalability and Generalization: Efficient, scalable computation and learning with large NN, high-dimensional/continuous state/action spaces, and feature-driven transitions.
  • Multi-resource and Fairness: Optimal or near-optimal policies when arms/tasks require combinatorial resources—ensuring fairness across resource/provider types (Biswas et al., 2023).
  • Partial Observability and Non-indexability: Many realistic RMAB instances are non-indexable (lack monotonicity or threshold policies), requiring deployment of fluid LP, model-predictive, or reinforcement learning approaches that do not rely on index structure (Narasimha et al., 11 Nov 2025).

Heterogeneous RMAB theory and algorithms thus unite sequential learning, stochastic control, convex optimization, and online bandit feedback, setting the foundation for a wide swath of real-world resource allocation and decision-making under uncertainty.
