
Reinforcement and Bandit Learning

Updated 28 January 2026
  • Reinforcement and bandit learning are decision-making frameworks balancing exploration and exploitation under uncertainty.
  • They employ rigorous mathematical formulations and performance metrics, utilizing algorithms such as UCB, Thompson Sampling, and EXP methods.
  • Modern research unifies these paradigms to drive applications in robotics, large-scale language models, photonic computation, and wireless systems.

Reinforcement and bandit learning are two core paradigms for sequential decision-making under uncertainty, unified by the exploration-exploitation tradeoff and the goal of regret minimization, but differing in their assumptions about state, action, and temporal structure. Modern research integrates these frameworks, advances their theoretical guarantees, and deploys them across diverse application areas, from large-scale language systems to photonic computation and robotics.

1. Mathematical Foundations and Problem Formulations

Multi-Armed Bandit Problems. The prototypical stochastic multi-armed bandit (MAB) consists of $K$ arms, each associated with an unknown reward distribution $P_i$ with mean $\mu_i$. At each time $t$, an agent selects arm $A_t$, observes $X_t \sim P_{A_t}$, and aims to maximize cumulative expected reward. Performance is evaluated via the regret:

$$R_T(\pi) = T \mu^* - \mathbb{E}_\pi\left[\sum_{t=1}^T X_t\right], \quad \mu^* = \max_{i} \mu_i,$$

with various decompositions and regret bounds under both minimax and instance-dependent formulations. Bandit variants include contextual bandits (where a context $c_t$ influences the reward distribution of each arm) and continuum-armed bandits (where the action set is uncountably infinite, e.g., a compact subset of $\mathbb{R}^d$) (Zhou et al., 2024).
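
To make the regret definition concrete, here is a minimal simulation sketch, assuming unit-variance Gaussian arms and a uniformly random baseline policy (both illustrative choices, not drawn from any cited paper):

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianBandit:
    """Stochastic K-armed bandit with unit-variance Gaussian rewards."""
    def __init__(self, means):
        self.means = np.asarray(means, dtype=float)

    def pull(self, arm):
        return rng.normal(self.means[arm], 1.0)

def cumulative_regret(bandit, policy, T):
    """Single-run estimate of R_T = T * mu_star - sum of collected rewards."""
    mu_star = bandit.means.max()
    collected = 0.0
    for t in range(T):
        collected += bandit.pull(policy(t))
    return T * mu_star - collected

# A uniformly random policy on a 3-armed bandit accrues linear regret.
bandit = GaussianBandit([0.1, 0.5, 0.9])
random_policy = lambda t: int(rng.integers(len(bandit.means)))
print(cumulative_regret(bandit, random_policy, T=1000))
```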

Reinforcement Learning (RL) and MDPs. RL generalizes bandits to Markov Decision Processes $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $P(s'|s,a)$, reward function $R(s,a)$, and discount factor $\gamma$. The agent's objective is to find a policy $\pi$ that maximizes

$$J(\pi) = \mathbb{E}^\pi\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right].$$

With $|\mathcal{S}| = 1$, the MDP reduces to the bandit problem (Lin, 2022; Combrink et al., 2022).
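
As a quick sanity check of this reduction (writing $\bar{r}(a)$ for the mean reward of action $a$ and assuming a stationary policy $\pi$), the discounted objective collapses to an arm-selection problem:

$$J(\pi) = \mathbb{E}^\pi\left[\sum_{t=0}^\infty \gamma^t r(s_0, a_t)\right] = \sum_{t=0}^\infty \gamma^t \sum_{a \in \mathcal{A}} \pi(a)\,\bar{r}(a) = \frac{1}{1-\gamma} \sum_{a \in \mathcal{A}} \pi(a)\,\bar{r}(a),$$

which is maximized by deterministically playing $\arg\max_a \bar{r}(a)$, i.e., by identifying the best arm.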

Restless Multi-Armed Bandits (RMABs). In RMABs, each arm is itself an MDP, possibly evolving under its own dynamics whether activated or not. The agent simultaneously controls a subset of arms subject to activation constraints, and recent research has analyzed adversarial RMABs with unknown transition and reward processes (Xiong et al., 2024).

2. Algorithmic Paradigms and Exploration–Exploitation Tradeoff

Frequentist Approaches. Classic MAB algorithms exemplify the tradeoff:

  • $\epsilon$-Greedy: Randomly explores with probability $\epsilon$, otherwise exploits the empirically best arm. Achieves $O(K \ln T)$ regret with a suitably decaying $\epsilon$ schedule (Lin, 2022; Zhou et al., 2024; Combrink et al., 2022).
  • Upper Confidence Bound (UCB): Selects the arm maximizing an optimism-in-the-face-of-uncertainty index (empirical mean plus a confidence bonus), guaranteeing $O\left(\sum_{i:\Delta_i > 0} \frac{\ln T}{\Delta_i}\right)$ regret (Zhou et al., 2024). A minimal sketch of $\epsilon$-greedy and UCB1 appears after this list.
  • Explore-Then-Commit: Uniform exploration for a fixed period, then exploitation, with matching $O(\sqrt{KT})$ regret.
  • Fractional Moment Algorithms: Preference rules based on tunable higher moments, offering a PAC guarantee (finding an $\epsilon$-optimal arm in $O(n)$ samples) and low empirical regret (B et al., 2012).
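
The sketch below illustrates $\epsilon$-greedy and the UCB1 index on a Bernoulli bandit, assuming rewards in $[0,1]$ and the standard $\sqrt{2 \ln t / n_i}$ bonus; it is an illustrative implementation rather than code from the cited works.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_bandit(means, T, select):
    """Play T rounds; `select(t, counts, sums)` returns an arm index."""
    counts = np.zeros(len(means))   # number of pulls per arm
    sums = np.zeros(len(means))     # cumulative reward per arm
    regret = 0.0
    for t in range(1, T + 1):
        arm = select(t, counts, sums)
        reward = float(rng.random() < means[arm])
        counts[arm] += 1
        sums[arm] += reward
        regret += max(means) - means[arm]   # pseudo-regret
    return regret

def eps_greedy(eps=0.1):
    def select(t, counts, sums):
        if counts.min() == 0 or rng.random() < eps:
            return int(rng.integers(len(counts)))        # explore
        return int(np.argmax(sums / counts))             # exploit
    return select

def ucb1(t, counts, sums):
    if counts.min() == 0:
        return int(np.argmin(counts))                    # pull each arm once
    return int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))

means = [0.3, 0.5, 0.7]
print(run_bandit(means, 5000, eps_greedy(0.1)))
print(run_bandit(means, 5000, ucb1))
```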

Bayesian and Posterior Sampling Methods.

  • Thompson Sampling (TS): Maintains a posterior over arm means, samples a mean for each arm, and picks the arm with the highest sample. Achieves $O(\sqrt{KT \ln T})$ regret in Bernoulli bandits (Zhou et al., 2024).
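
A minimal Beta-Bernoulli Thompson Sampling sketch (assuming Bernoulli rewards and uniform Beta(1,1) priors; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)

def thompson_sampling(means, T):
    """Thompson Sampling with independent Beta posteriors per arm."""
    K = len(means)
    alpha = np.ones(K)   # pseudo-counts of successes
    beta = np.ones(K)    # pseudo-counts of failures
    regret = 0.0
    for _ in range(T):
        theta = rng.beta(alpha, beta)        # one posterior sample per arm
        arm = int(np.argmax(theta))          # play the highest sample
        reward = float(rng.random() < means[arm])
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        regret += max(means) - means[arm]
    return regret

print(thompson_sampling([0.3, 0.5, 0.7], 5000))
```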

EXP-Based and Adversarial Extensions.

  • EXP3, EXP4.P: Designed for adversarial and contextual bandit problems, these algorithms maintain exponential weights over arms or experts, offering $O^*(\sqrt{T})$ regret under sub-Gaussian and unbounded rewards as well as in linear contextual settings (Xu et al., 2020); a minimal EXP3 sketch follows this list.
  • RL+Bandit Hybrids for Learning Rates and Agent Selection: Bandit wrappers can optimize learning rates in deep RL (e.g., LRRL employing EXP3 on learning-rate arms), and select among multiple RL architectures during deployment, with guarantees similar to their bandit core (Donâncio et al., 2024, Merentitis et al., 2019).
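
A minimal EXP3 sketch, assuming rewards in $[0,1]$ and a fixed exploration rate; this is an illustrative implementation, not the exact variant analyzed in the cited work.

```python
import numpy as np

rng = np.random.default_rng(3)

def exp3(reward_fn, K, T, gamma=0.1):
    """EXP3: exponential weights with importance-weighted reward estimates."""
    weights = np.ones(K)
    total = 0.0
    for t in range(T):
        probs = (1 - gamma) * weights / weights.sum() + gamma / K
        arm = int(rng.choice(K, p=probs))
        x = reward_fn(t, arm)                 # possibly adversarial reward in [0, 1]
        total += x
        x_hat = x / probs[arm]                # unbiased estimate for the chosen arm
        weights[arm] *= np.exp(gamma * x_hat / K)
        weights /= weights.max()              # rescale for stability; probs unchanged
    return total

# Example: a non-stationary reward whose best arm switches halfway through.
switching = lambda t, arm: float(arm == (0 if t < 500 else 2))
print(exp3(switching, K=3, T=1000))
```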

Model-Based Nonlinear Extensions.

  • ViOlin: For nonlinear contextual bandit and deterministic-dynamics RL, this model-based algorithm alternates between maximizing “virtual returns” (incorporating not just reward but also gradient and Hessian information) and online model updates, provably converging to local maxima with regret scaling with sequential Rademacher complexity rather than the covering number or eluder dimension (Dong et al., 2021).

3. Unified Algorithms and Bandit-RL Adaptivity

Reduction and Unification. The structural connection between bandit and RL frameworks is evident: in the “tabular contextual bandit” setting, MDP algorithms can automatically adapt their regret bounds—without user intervention—to the bandit regime.

  • Ubev-S: Pools counts across time, employs range-dependent bonuses, and obtains, in contextual bandit regimes ($P(s'|s,a) = \mu(s')$), optimal regret $R(T) = \tilde{O}(\sqrt{SAT})$, while recovering standard MDP rates for environments with genuine transition structure (Zanette et al., 2019). These bounds are achieved by automatically exploiting rapid mixing and the collapse of long-horizon value ranges in bandit-like MDPs.

Multi-Agent and Physical Hybridization. Banditized Q-learning (DBQL) and Parallel Bandit RL (PBRL) decompose the RL problem into local bandits over state–action pairs, enabling parallelization and even photonic hardware implementation. Quantum interference is leveraged for conflict-free update scheduling among agents, yielding faster convergence as measured by cumulative Q-table loss, especially when state-action selections are driven by anti-correlated (chaotic) random sources (Shinkawa et al., 2022, Urushibara et al., 2022).

4. Contextual Bandits and Representation Learning

Deep and LLM-Augmented Contextual Bandits. Leveraging LLMs as context encoders in contextual bandits increases the representational power for complex, unstructured contexts (e.g., text descriptions). In synthetic experiments, LLM-augmented contextual bandits demonstrate improved cumulative reward and reduced regret over linear bandits, UCB, and $\epsilon$-greedy approaches; for $T = 1000$ actions, cumulative regret is reduced by approximately 20%, and action selection frequencies are better aligned with contextually optimal choices (Baheri et al., 2023).
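
A hedged sketch of one way to realize this pattern: a hypothetical embed function stands in for an LLM encoder, and a standard disjoint LinUCB policy operates on the resulting features (the cited work's exact architecture may differ).

```python
import numpy as np

def embed(text: str, dim: int = 16) -> np.ndarray:
    """Stand-in for an LLM context encoder: hash-seeded random features.

    Hypothetical placeholder; swap in a real embedding model in practice.
    """
    seed = abs(hash(text)) % (2**32)
    return np.random.default_rng(seed).normal(size=dim)

class LinUCB:
    """Disjoint LinUCB: one ridge-regression estimate per arm."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]     # X^T X + I
        self.b = [np.zeros(dim) for _ in range(n_arms)]   # X^T y

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Usage: encode the raw text context, pick an arm, observe reward, update.
policy = LinUCB(n_arms=3, dim=16)
x = embed("user asks about a refund policy")
arm = policy.select(x)
policy.update(arm, x, reward=1.0)
```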

Continuous-State Bandit and RL for Wireless and Signal Processing.

Actor-critic architectures such as DCB-DDPG (deep contextual bandit using deterministic policy gradient) and DRL-DDPG (for RL with MDP structure) address high-dimensional continuous control (e.g., IRS-assisted massive MIMO communication), with the bandit variant favored when decision problems are one-shot in each independent channel realization (Pereira-Ruisánchez et al., 2024).

5. Advanced Bandit Applications and Data-Driven Fitting

Bandits in Model Management and ML Ops. Online model selection among deployed models in ML Ops can be cast as a non-stationary bandit problem, with batch-wise model switching managed by $\epsilon$-greedy, UCB, or Thompson Sampling. On both balanced and highly imbalanced datasets, bandit-based controllers outperform static validation or A/B testing by maintaining adaptability and quick rollback under model drift (McClendon et al., 28 Mar 2025).
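
A minimal sketch of such a controller, assuming batch-level reward feedback (e.g., accuracy on labeled traffic) and using exponentially discounted averages with $\epsilon$-greedy switching to handle non-stationarity; the controller in the cited work may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(4)

class ModelRouter:
    """Batch-wise epsilon-greedy routing over deployed models.

    Discounted averages let the controller track drift and roll back quickly
    when the currently favored model degrades.
    """
    def __init__(self, n_models, eps=0.1, discount=0.9):
        self.eps = eps
        self.discount = discount
        self.value = np.zeros(n_models)    # discounted reward sum per model
        self.weight = np.zeros(n_models)   # discounted observation count

    def choose(self):
        if self.weight.min() == 0 or rng.random() < self.eps:
            return int(rng.integers(len(self.value)))
        return int(np.argmax(self.value / self.weight))

    def update(self, model, batch_reward):
        self.value *= self.discount
        self.weight *= self.discount
        self.value[model] += batch_reward
        self.weight[model] += 1.0

# Per batch: route traffic to one model, then feed back its observed metric.
router = ModelRouter(n_models=3)
m = router.choose()
router.update(m, batch_reward=0.87)
```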

Risk-Aware RL via Bandit-Based Adaptation.

Families of risk-conditioned policies are trained under CVaR constraints, and a multi-armed bandit UCB framework adapts among them at deployment. The method achieves nearly a 2x improvement in mean and tail (CVaR) return in unseen robotic settings, with adaptation converging to the optimal risk level within minutes or thousands of time steps (Zeng et al., 16 Oct 2025).
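
A hedged sketch of the adaptation step, assuming one pre-trained policy per CVaR level and a hypothetical rollout function that returns the episode return of the deployed policy; the cited method's exact index and schedule may differ.

```python
import numpy as np

def ucb_risk_adaptation(rollout, risk_levels, n_rounds, c=2.0):
    """UCB over a discrete family of risk-conditioned policies."""
    K = len(risk_levels)
    counts = np.zeros(K)
    sums = np.zeros(K)
    for t in range(1, n_rounds + 1):
        if counts.min() == 0:
            arm = int(np.argmin(counts))                 # try each level once
        else:
            index = sums / counts + c * np.sqrt(np.log(t) / counts)
            arm = int(np.argmax(index))
        ret = rollout(risk_levels[arm])                  # deploy and observe return
        counts[arm] += 1
        sums[arm] += ret
    return risk_levels[int(np.argmax(sums / np.maximum(counts, 1)))]

# Dummy simulator: returns peak near risk level 0.5 (purely illustrative).
rng = np.random.default_rng(5)
dummy_rollout = lambda risk: rng.normal(1.0 - abs(risk - 0.5), 0.1)
print(ucb_risk_adaptation(dummy_rollout, [0.1, 0.3, 0.5, 0.7, 0.9], n_rounds=200))
```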

Behavioral Data Fitting.

Convex relaxations enable efficient maximum-likelihood estimation of RL model parameters (e.g., “forgetting Q-learning” variants) from observed behavioral data in bandit tasks, facilitating robust parameter recovery and value trajectory estimation even in the presence of non-convex dynamics (Zhu et al., 6 Nov 2025).
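
The sketch below shows the underlying likelihood-based fitting with a generic softmax delta-rule Q-learning model (parameters: learning rate and inverse temperature), not the exact forgetting variant or convex relaxation studied in the cited work.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, choices, rewards, n_arms):
    """Negative log-likelihood of observed choices under softmax Q-learning."""
    alpha, beta = params
    Q = np.zeros(n_arms)
    nll = 0.0
    for a, r in zip(choices, rewards):
        logits = beta * Q
        log_probs = logits - (logits.max() + np.log(np.sum(np.exp(logits - logits.max()))))
        nll -= log_probs[a]
        Q[a] += alpha * (r - Q[a])       # delta-rule update on the chosen arm
    return nll

def fit(choices, rewards, n_arms):
    res = minimize(neg_log_likelihood, x0=[0.3, 2.0],
                   args=(choices, rewards, n_arms),
                   bounds=[(1e-3, 1.0), (1e-3, 20.0)])
    return res.x   # estimated (learning rate, inverse temperature)

# Synthetic two-armed data just to exercise the fit.
rng = np.random.default_rng(6)
choices = rng.integers(0, 2, size=200)
rewards = (rng.random(200) < np.where(choices == 1, 0.7, 0.3)).astype(float)
print(fit(choices, rewards, n_arms=2))
```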

6. Human Feedback, Sequence Prediction, and Structured Bandits

Bandit-RL for Structured Output and Sequence Learning. Machine translation and other sequence prediction tasks are naturally modeled as contextual bandit or RL problems, where bandit feedback consists of per-output scalar rewards (e.g., BLEU score, human rating). Actor-critic and policy gradient algorithms, adapted to sequence models, can optimize directly for such sparse and noisy feedback. Both simulated and true human feedback (cardinal or ordinal) can be integrated via learned reward estimators, yielding statistically significant improvements over supervised baselines, robust even in the presence of high granularity, variance, and skew in the feedback (Sharaf et al., 2017, Kreutzer et al., 2018, Nguyen et al., 2017).
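
A minimal illustration of the sequence-level bandit-feedback objective: a REINFORCE-style surrogate loss in which one scalar reward (e.g., a BLEU score or human rating) weights the log-probability of the sampled output, with baseline subtraction to reduce variance. The numbers are purely illustrative.

```python
import numpy as np

def bandit_sequence_loss(token_log_probs, reward, baseline=0.0):
    """REINFORCE surrogate loss for a sampled output sequence.

    token_log_probs: log pi(y_t | y_<t, x) for each generated token.
    reward: a single scalar for the whole sequence (bandit feedback).
    Minimizing this loss increases the probability of high-reward outputs.
    """
    return -(reward - baseline) * np.sum(np.asarray(token_log_probs))

# A sampled translation with four tokens and a simulated sentence-level reward.
print(bandit_sequence_loss([-0.2, -1.1, -0.4, -0.7], reward=0.62, baseline=0.5))
```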

Surrogate Rewards and Agent Selection. Bandit frameworks can select among multiple concurrently learning RL agents by constructing surrogate rewards that combine environmental returns with model uncertainty (e.g., information gain via VIME), allowing early and robust concentration on optimal agent configurations (Merentitis et al., 2019).

7. Theoretical and Practical Outlook

Concentration Inequalities and Regret Analysis. Modern analyses leverage non-asymptotic concentration inequalities (Hoeffding, Bernstein) to establish high-confidence regret bounds, with algorithms such as UCB1, TS, and MOSS achieving instance-dependent and minimax-optimal rates (Zhou et al., 2024). Extensions to heavy-tailed, adversarial, or scale-free reward settings are realized via truncation and robust confidence intervals (EXP4.P, EXP3.P, and variants) (Xu et al., 2020).
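
As an illustration of how such inequalities drive the analysis (a standard textbook-style derivation, not specific to any one cited paper): for an arm $i$ pulled $n_i$ times with rewards in $[0,1]$, Hoeffding's inequality gives

$$\Pr\left(\hat{\mu}_i \le \mu_i - \sqrt{\frac{2 \ln t}{n_i}}\right) \le \exp\left(-2 n_i \cdot \frac{2 \ln t}{n_i}\right) = t^{-4},$$

so the UCB1 index $\hat{\mu}_i + \sqrt{2 \ln t / n_i}$ upper-bounds the true mean except with polynomially small probability; a union bound over arms and rounds then confines suboptimal pulls to these rare failure events plus the $O(\ln T / \Delta_i^2)$ rounds needed to shrink the bonus below the gap $\Delta_i$.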

Open Challenges and Research Frontiers.

  • Infinite-horizon, function approximation, and POMDP extensions of adaptive RL/bandit algorithms remain open.
  • Non-stationary environments, structured bandits (sparsity, meta-learning), and causal and human-feedback integration are active areas of research.
  • Theoretical analysis of hybrid deep RL–bandit frameworks, especially with complex function approximators, remains unresolved.
  • Physical and photonic implementations introduce constraints and opportunities for design, requiring careful handling of information flow, synchronization, and error correction (Urushibara et al., 2022, Shinkawa et al., 2022).

These advances underline the deepening synthesis of reinforcement and bandit learning, facilitating both methodological innovation and application in increasingly complex, uncertain, and high-dimensional domains.
