Context-Free Multi-Armed Bandit Router

Updated 28 November 2025

Context-Free Multi-Armed Bandit Router is defined as a sequential learning framework that models routing as a multi-armed bandit problem without using local contextual information.
It spans various settings including stochastic, adversarial, and combinatorial scenarios to address challenges such as delayed feedback and congestion-dependent costs.
Key algorithms such as tuned ε-greedy, UCB, and ZDD-based methods come with provable sublinear regret bounds and enable efficient real-time routing in complex networks.

A context-free multi-armed bandit (MAB) router is a class of adaptive routing strategies for networks in which the decision process does not depend on local contextual features but instead relies on sequential learning to minimize a cost or maximize a reward associated with chosen routes or forwarding actions. The problem is modeled as a multi-armed bandit instance, often with combinatorial, adversarial, stochastic, delayed, or congestion-dependent structures, and solved using online learning algorithms to ensure sublinear regret relative to optimal (clairvoyant) routing or forwarding policies.

1. Formal Definition and Core Model Variants

A context-free MAB router operates on a network described as a directed graph $G = (V, E)$, where routes or forwarding actions (arms) are selected sequentially at each time $t$. The context-free designation indicates that routing decisions do not exploit side information (e.g., user identifiers or local traffic states) but are based solely on the aggregate historical performance of previously selected arms. Key model settings include:

Stochastic MAB Routing: At each time $t$, the algorithm selects an arm (route, link, or forwarding neighbor) and observes a random cost (e.g., delay or loss) drawn i.i.d. from an unknown per-arm distribution. The goal is to minimize cumulative expected cost or regret over $T$ rounds, relative to the best arm in hindsight (Avrachenkov et al., 2012).
Combinatorial and Dependent-Arm Routing: Arms correspond to paths (multi-hop sequences or super-arms), with reward or cost structure exhibiting strong dependencies due to overlapping links. The reward (or delay) of a path is determined by the linear combination of its edge components, which are themselves stochastically evolving and unknown (Liu et al., 2012).
Adversarial and Congestion-Sensitive Routing: The cost assigned to each arm is determined by an adversary, possibly as a function of past actions (to model congestion or traffic load) (Sakaue et al., 2017, Awasthi et al., 2023).
Bandit with Delays: Observed feedback is delayed; at each decision time, only a subset of previously selected arms have revealed their outcomes (as in CCN interest forwarding) (Avrachenkov et al., 2012).
Queueing and Joint Routing/Scheduling: In queueing networks, the control must both route and schedule packets, observing costs only for selected actions and crucially balancing stability (bounded queue lengths) and routing cost minimization (Chadaga et al., 3 Sep 2025).

2. Fundamental Algorithms and Methodological Advances

Several algorithmic families underpin context-free MAB routers, each tailored to the stochastic, adversarial, combinatorial, or queueing nature of the environment:

Classic MAB Policies: Standard $\varepsilon$-greedy, tuned $\varepsilon$-greedy (with decaying exploration rate $\varepsilon_0/t$), and UCB-style policies (including lower-confidence-bound adaptations for delay minimization) operate efficiently and provably achieve optimal logarithmic regret even with substantial feedback delays (Avrachenkov et al., 2012).
Linear and Combinatorial Bandits: For routing where path rewards depend on shared links, bandit algorithms that explicitly exploit the linear structure (e.g., barycentric spanner + DSEE schedule) achieve regret scaling as $\tilde{O}(m d^3 \log T)$, where $m$ is the number of links and $d$ the linear subspace dimension (Liu et al., 2012). Combinatorial Thompson Sampling (CTS) provides a regret guarantee of $\mathcal O\left(\sum_{i=1}^{m}\frac{\log T}{p_i \Delta_i}\right)$, alongside a separate Bayesian regret bound, and matches UCB-type algorithms in empirical performance (Hüyük et al., 2019).
Adversarial Combinatorial Bandit (C-ComBand with ZDD Encoding): To enable scalable computation over exponentially large super-arm sets (e.g., all $s$–$t$ paths), decision sets are encoded as zero-suppressed binary decision diagrams (ZDDs). C-ComBand maintains per-arm weights and leverages dynamic programming over the ZDD; it achieves high-probability $O(T^{2/3})$ or expected $O(\sqrt{T})$ regret, with per-round complexity between $O(|V|)$ and $O(d\,|V|)$, where $|V|$ here denotes the number of ZDD nodes rather than graph vertices (Sakaue et al., 2017).
Drift-plus-Penalty with Bandit Exploration: In joint routing/scheduling with queueing, Lyapunov drift-plus-penalty is combined with optimistic (lower-confidence) cost estimation to ensure stability (bounded queues) and sublinear regret $O(\sqrt{T} \log T)$ compared to the optimal static policy (Chadaga et al., 3 Sep 2025).
Congestion-Aware Bandit Routing (Carmab): For scenarios where the cost of route usage is congestion-dependent (cost increases with recent usage frequency in the last $\Delta$ rounds), the problem is re-cast as a small-diameter MDP and solved using UCRL2-style algorithms with short-term resets. The Carmab algorithm achieves policy regret of $\tilde{O}(\sqrt{K \Delta T})$ for $K$ routes and time horizon $T$ (Awasthi et al., 2023).
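As an illustration of the first family above, the following is a minimal Python sketch of a lower-confidence-bound (LCB) router operating under delayed feedback. It is not the implementation from Avrachenkov et al.; the class name `DelayedLCBRouter`, the confidence constant `c`, the fixed feedback delay, and the uniform noise model are all assumptions chosen for the example.

```python
import math
import random
from collections import deque


class DelayedLCBRouter:
    """LCB arm selection for cost minimisation (hypothetical helper class)."""

    def __init__(self, n_arms, c=2.0):
        self.n_arms = n_arms
        self.c = c                       # assumed confidence-width constant
        self.counts = [0] * n_arms       # number of OBSERVED costs per arm
        self.means = [0.0] * n_arms      # running mean of observed costs
        self.t = 0                       # number of decisions made so far

    def select(self):
        self.t += 1
        # Round-robin until every arm has at least one observed cost.
        if any(n == 0 for n in self.counts):
            return (self.t - 1) % self.n_arms

        def lcb(arm):
            bonus = math.sqrt(self.c * math.log(self.t) / self.counts[arm])
            return self.means[arm] - bonus   # optimistic (low) cost estimate

        return min(range(self.n_arms), key=lcb)

    def update(self, arm, cost):
        # Incremental mean update, applied only when feedback arrives.
        self.counts[arm] += 1
        self.means[arm] += (cost - self.means[arm]) / self.counts[arm]


def simulate(mean_costs, delay, horizon, seed=0):
    """Run the router against synthetic arms; return pull counts per arm."""
    rng = random.Random(seed)
    router = DelayedLCBRouter(len(mean_costs))
    in_flight = deque()                  # (arrival_time, arm, cost)
    pulls = [0] * len(mean_costs)
    for t in range(horizon):
        # Deliver all feedback whose delay has elapsed.
        while in_flight and in_flight[0][0] <= t:
            _, arm, cost = in_flight.popleft()
            router.update(arm, cost)
        arm = router.select()
        pulls[arm] += 1
        cost = mean_costs[arm] + rng.uniform(-0.1, 0.1)  # assumed noise model
        in_flight.append((t + delay, arm, cost))
    return pulls
```

Calling `simulate([0.9, 0.3, 0.7], delay=5, horizon=2000)` returns how often each arm was chosen; in this toy setting the lowest-cost arm should dominate the pull counts. Estimates are updated only when feedback arrives, so the decision rule itself is unchanged by the delay.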

3. Theoretical Guarantees and Regret Analyses

A central metric for context-free MAB routers is regret: the expected excess cost compared to a benchmark (e.g., optimal static path, policy, or allocation) after $T$ rounds. The following summary provides key theoretical results:

Setting	Regret Bound	Reference
Stochastic MAB + delays	$O(\ln T)$ (tuned $\varepsilon$-greedy, UCB)	(Avrachenkov et al., 2012)
Linear bandit on paths	$O(m d^3 \log T)$, with $m$ links and path dimension $d$	(Liu et al., 2012)
Heavy-tailed statistics	$O(d T^{1/q})$ when the $q$-th moment is finite	(Liu et al., 2012)
CTS semi-bandit model	$O\left(\sum_{i=1}^m \frac{\log T}{p_i \Delta_i}\right)$; Bayesian: $O\left(\max \left\{\mathbb{E}\left[m\sqrt{T\log T/p^*}\right],\ \mathbb{E}\left[m^2/p^*\right] \right\}\right)$	(Hüyük et al., 2019)
Adversarial ZDD-combinatorial	$O(T^{2/3})$ high-probability, $O(\sqrt{T})$ expected	(Sakaue et al., 2017)
Congested bandits (Carmab)	$\tilde{O}(\sqrt{K \Delta T})$ policy regret	(Awasthi et al., 2023)
Drift+optimistic for queuing	$O(\sqrt{T} \log T)$ regret, queue stability	(Chadaga et al., 3 Sep 2025)

These regret rates indicate what is achievable in each structural regime (adversarial, combinatorial, stochastic with dependency, queueing dynamics, delay, congestion).
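For concreteness, regret in the simplest stochastic setting can be written as follows. The notation is an assumption introduced here for illustration: $\mu_k$ is the mean cost of arm $k$, $a_t$ the arm chosen at round $t$, $N_k(T)$ the number of times arm $k$ is chosen up to round $T$, and $\delta_k = \mu_k - \min_j \mu_j$ the cost gap of arm $k$.

$$
R(T) = \mathbb{E}\left[\sum_{t=1}^{T} \mu_{a_t}\right] - T \min_{k} \mu_k = \sum_{k} \delta_k \, \mathbb{E}\left[N_k(T)\right]
$$

A bound is sublinear when $R(T)/T \to 0$, meaning the average per-round cost approaches that of the best fixed arm. The combinatorial, adversarial, and policy-regret variants in the table use different benchmarks, so this expression does not apply to them verbatim.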

4. Computational and Implementation Considerations

Efficient realization of context-free MAB routers hinges on data structure and algorithmic design:

ZDD-based routing: Construct the ZDD for the feasible routing sets (e.g., all $s$–$t$ paths) via frontier-based search, then perform dynamic programming across ZDD nodes for forward/backward weights, sampling, and co-occurrence calculations. Operations (sampling, loss estimation, weight updates) scale between $O(|V|)$ and $O(d\,|V|)$, where the number of ZDD nodes $|V|$ is far smaller than the number of feasible sets $|S|$, drastically reducing memory and computation compared with explicit enumeration (Sakaue et al., 2017).
Barycentric spanners: For dependency-exploiting routing, compute a $d$-dimensional barycentric spanner of paths, perform periodic exploration of these basis arms, and use fast linear interpolation. The exploitation step can use on-the-fly shortest-path solvers given estimated link costs; per-slot complexity reduces to $O(m + n \log n)$ for $m$ links and $n$ nodes (Liu et al., 2012).
Online updates: Maintain per-arm or per-link estimates and counters incrementally; in delayed feedback regimes, update using only observed outcomes and prioritize deterministic “Round-Robin” initial exploration to mitigate variance (Avrachenkov et al., 2012).
Queueing control: In DPOP, the per-slot optimization (given queue lengths and current cost estimates) is solved to maximize a drift-plus-penalty objective; confidence intervals on unknown costs are adaptively shrunk to satisfy the exploration–exploitation tradeoff (Chadaga et al., 3 Sep 2025).
Congestion tracking: Maintain a sliding-window count for each route to track short-term congestion. MDP-based planning exploits the finite history window, enabling efficient value iteration or policy iteration when the congestion memory $\Delta$ is small (Awasthi et al., 2023).
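The backward-weight dynamic program mentioned for ZDD-based routing can be illustrated on a plain directed acyclic graph. The sketch below is not a ZDD and not the C-ComBand implementation; it only shows, under that simplification, how per-node backward weights let one sample a path with probability proportional to the product of its edge weights without enumerating paths. The function names, the adjacency-dict representation, and the toy graph are assumptions.

```python
import random


def backward_weights(edges, weights, order, target):
    """B[v] = total weight of all v->target paths.

    A path's weight is the product of its edge weights.
    `edges` maps node -> list of successors; `order` is a topological order.
    """
    B = {v: 0.0 for v in order}
    B[target] = 1.0
    # Visit nodes in reverse topological order so successors are done first.
    for v in reversed(order):
        if v == target:
            continue
        B[v] = sum(weights[(v, u)] * B[u] for u in edges.get(v, []))
    return B


def sample_path(edges, weights, B, source, target, rng):
    """Sample a source->target path with probability weight(path) / B[source]."""
    path, v = [source], source
    while v != target:
        succ = edges[v]
        # Probability of stepping to u is proportional to w(v,u) * B[u].
        probs = [weights[(v, u)] * B[u] for u in succ]
        v = rng.choices(succ, weights=probs, k=1)[0]
        path.append(v)
    return path


# Toy graph with three s->t paths: s-a-t, s-a-b-t, s-b-t.
edges = {"s": ["a", "b"], "a": ["b", "t"], "b": ["t"]}
weights = {(u, v): 1.0 for u in edges for v in edges[u]}
B = backward_weights(edges, weights, ["s", "a", "b", "t"], "t")
# With unit weights, B["s"] equals the number of s->t paths, i.e. 3.0.
path = sample_path(edges, weights, B, "s", "t", random.Random(0))
```

Both routines touch each edge a bounded number of times, so their cost grows with the size of the graph rather than with the number of paths, which is the property the ZDD encoding exploits at much larger scale.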

5. Practical Guidelines for Deployment

Empirical and implementation guidance distilled from studies includes:

For stochastic routing with delays and no context, use tuned $\varepsilon$-greedy or a delay-tolerant UCB variant with minimal initial exploration (one Round-Robin pass over the $K$ arms suffices). Convergence to optimal routing is typically rapid: the initial phase can be as short as a handful of time slots, and exploration parameter settings follow from the theoretical guarantees (Avrachenkov et al., 2012).
In combinatorial (path-based) settings, compress the feasible path space with ZDDs, set the exploration schedule ($\gamma_t \sim t^{-1/3}$ or $t^{-1/2}$) to control regret, and perform weight updates and top-down sampling each round with minimal memory overhead (Sakaue et al., 2017).
When link dependencies are strong, leverage the ASPR construction: compute a barycentric spanner, alternate scheduled exploration with exploitation using linear estimates, and use the near-logarithmic exploration variant to circumvent unknown parameter estimation (Liu et al., 2012).
In dynamic queueing networks, initialize with at least one exploration per link, balance the drift and cost penalty using $V\sim \sqrt{T}$, and set optimism/confidence width parameters proportional to log-time and observed sample size. Under proper parameterization, queue stability is guaranteed, regret grows sublinearly, and the time-averaged cost approaches that of the static optimum (Chadaga et al., 3 Sep 2025).
In congestion-sensitive routing, select the congestion window $\Delta$ to match the natural time-scale of network congestion; check that the state-space growth (on the order of $K^\Delta$) caused by the MDP reformulation remains manageable, and leverage resetting epochs to ensure statistical efficiency (Awasthi et al., 2023).
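The congestion-window bookkeeping described above can be sketched in a few lines of Python. This is an illustrative helper, not code from the Carmab paper: the class name `CongestionWindow` and its methods are hypothetical, and `window` (standing in for $\Delta$) is assumed to be at least 1.

```python
from collections import Counter, deque


class CongestionWindow:
    """Tracks how often each route was used in the last `window` rounds."""

    def __init__(self, n_routes, window):
        self.n_routes = n_routes
        self.window = window                  # plays the role of Delta, >= 1
        self.history = deque(maxlen=window)   # most recent route choices
        self.counts = Counter()               # per-route usage inside window

    def record(self, route):
        # If the window is full, the oldest choice is about to be evicted.
        if len(self.history) == self.window:
            self.counts[self.history[0]] -= 1
        self.history.append(route)            # deque drops the oldest entry
        self.counts[route] += 1

    def load(self, route):
        """Number of times `route` was chosen within the current window."""
        return self.counts[route]

    def n_history_states(self):
        """Upper bound K**Delta on distinct ordered histories of full length."""
        return self.n_routes ** self.window
```

For example, with `CongestionWindow(n_routes=3, window=2)`, recording routes 0, 0, 1 leaves a load of 1 on route 0 and 1 on route 1, and `n_history_states()` is 9. Checking that last number before committing to a value of $\Delta$ gives a quick sense of whether the MDP reformulation stays tractable.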

6. Empirical Findings and Observed Performance

Empirical evaluation across several network topologies and traffic scenarios confirms:

For CCN interest forwarding, all tested MAB algorithms (tuned $\varepsilon$-greedy, UCB, standard $\varepsilon$-greedy) achieve near-optimal forwarding, with $>95\%$ of interests routed optimally by $t\sim30$, even under delayed feedback (Avrachenkov et al., 2012).
ZDD-based adversarial combinatorial bandit routers demonstrate that, even when the set of feasible paths is exponentially large (e.g., $|S| \sim 10^{11}$ for certain network grids), the ZDD encoding enables real-time operations, sublinear regret, and empirical avoidance of congestion in competitive scenarios (Sakaue et al., 2017).
DPOP routers in queueing networks reduce transmission cost to the static optimum and stabilize queue backlogs rapidly; regret remains below the predicted $O(\sqrt{T} \log T)$ bound across all tested noise and arrival rate regimes. Backlog and cost curves closely track the “oracle” (fully informed) policy after an initial transient (Chadaga et al., 3 Sep 2025).
In congestion-aware bandit routing, Carmab learns to interleave or cycle routes, thereby avoiding persistent congestion; regret scales as predicted, and route utilization dynamically adjusts to maintain performance under dynamic load (Awasthi et al., 2023).

Context-free MAB routers intersect a range of research areas:

Combinatorial bandits: Allow for subset-based action spaces, crucial for path and Steiner-tree routing (Hüyük et al., 2019, Sakaue et al., 2017).
Network control and stochastic optimization: Extensions to joint scheduling, admission control, and throughput optimization with unknown link qualities and queueing dynamics (Chadaga et al., 3 Sep 2025).
Multi-player scenarios and congestion games: Adversarial and congestion-aware routing bandits address the strategic interactions and traffic-coupled payoff structures common in real-world settings (Sakaue et al., 2017, Awasthi et al., 2023).
Heavy-tailed and delayed-response settings: Policies adapt exploration rates and use flexible scheduling to ensure low regret under unpredictable or slowly observed costs (Liu et al., 2012, Avrachenkov et al., 2012).

A plausible implication is that, by integrating structural knowledge (e.g., combinatorial constraints, delay profiles, congestion dynamics, queueing feedback) and leveraging compact representation and adaptive estimation, context-free MAB routing provides a robust foundation for sequential decision making in large-scale and uncertain network environments.