Queue Length Regret Decomposition

Updated 3 February 2026
  • The queue length regret decomposition framework splits the performance gap between learning-based and oracle policies into interpretable components of queue dynamics.
  • It leverages structural insights such as regenerative cycles, busy periods, and policy-switching to yield sharper regret bounds than standard bandit models.
  • The framework guides algorithm design with queue-aware exploration, forced scheduling, and coupling techniques to achieve near-optimal performance in complex queueing systems.

A queue length regret decomposition framework provides a rigorous methodology to analyze, bound, and optimize the performance difference—in queueing terms—between learning-based scheduling/control policies and optimal omniscient policies (usually called “oracle” or “genie” policies) in stochastic or adversarial queueing systems. The central technical innovation is to decompose queue-length regret into interpretable components linked to scheduling or learning errors, and to exploit structural properties of queues—such as regenerative cycles or birth–death process dynamics—to obtain sharper regret bounds than those available for generic bandits or MDPs. Over the last decade, this framework has unified a variety of algorithmic and analytic results across discrete-time, continuous-time, single-server, multi-server, and context-aware queueing models.

1. Formal Definition of Queue-Length Regret

Let $Q^\pi(t)$ denote the queue length (or queue vector in networked systems) at time $t$ under a policy $\pi$, and let $Q^*(t)$ be the queue length under an oracle policy that always selects the maximally stabilizing or profit-maximizing action. The queue-length regret up to time $T$ is typically defined as

$$R_Q^\pi(T) = \mathbb{E}\left[ Q^\pi(T) - Q^*(T) \right]$$

or, for cumulative/average formulations,

$$R^\pi(T) = \mathbb{E}\left[ \sum_{t=0}^{T-1} Q^\pi(t) - \sum_{t=0}^{T-1} Q^*(t) \right]$$

In adversarial or nonstationary settings, the queue-length regret is sometimes defined as the worst-case (over time and comparators) excess backlog,

$$R_Q^\pi(T) = \max_{t \le T} \max_{i \in [N]} \left[ Q^\pi(t) - Q^i(t) \right]$$

where $Q^i(t)$ is the backlog that would result from always selecting the $i$-th service/resource/action (Krishnakumar et al., 23 Jan 2025).
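As a concrete illustration, both definitions can be estimated by Monte Carlo on a toy two-server Bernoulli queue. This is a minimal sketch under assumed parameters (arrival rate $\lambda = 0.3$, service rates $(0.5, 0.8)$, and a blindly alternating comparison policy); none of these choices is taken from the cited papers.

```python
import random

def simulate(policy, T, lam=0.3, mus=(0.5, 0.8), seed=0):
    """Discrete-time single queue: Bernoulli(lam) arrivals; each slot the
    policy picks a server i and, if the queue is non-empty, a departure
    occurs with probability mus[i].  Returns the trajectory Q(0..T)."""
    rng = random.Random(seed)
    Q, traj = 0, [0]
    for t in range(T):
        i = policy(t, Q)
        offer = rng.random() < mus[i]      # service offered by chosen server
        arrival = rng.random() < lam
        Q = max(Q + arrival - (Q > 0) * offer, 0)
        traj.append(Q)
    return traj

T = 10_000
oracle = lambda t, Q: 1   # genie: always the faster server
naive = lambda t, Q: t % 2  # alternates blindly between servers

q_pi, q_star = simulate(naive, T), simulate(oracle, T)

final_regret = q_pi[T] - q_star[T]                      # sample of R_Q^pi(T)
cum_regret = sum(p - s for p, s in zip(q_pi, q_star))   # sample of R^pi(T)
print(final_regret, cum_regret)
```

A single pair of paths gives only a noisy sample of the expectations; averaging over many seeds approximates $R_Q^\pi(T)$ and $R^\pi(T)$.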

2. Foundational Decomposition Structures

The core of queue-length regret decomposition rests on upper-bounding queue regret via event-based, temporal, or process-theoretic partitions. Several archetypes have been established:

  • Cumulative Rate-Loss Decomposition: For single or multi-queue systems, queue-regret is bounded above by the sum over time of the instantaneous differences between the algorithmic and optimal number of departures (or reward),

$$Q_u(t) - Q_u^*(t) \leq \sum_{s=1}^{t} \left[ S_u^*(s) - S_u(s) \right]$$

where $S_u(s)$ and $S_u^*(s)$ are the service outcomes under the algorithm and the oracle, respectively (Krishnasamy et al., 2016).
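In expectation, the rate-loss bound is easy to check numerically. The sketch below couples the learner and the oracle to a common arrival stream and common per-arm service offers, then compares the average backlog gap with the average cumulative rate loss. The parameters and the uniform learner are illustrative assumptions; the bound holds in expectation, not necessarily on every sample path.

```python
import random

def coupled_run(T, lam, mus, rng):
    """One coupled sample path: shared arrivals and per-arm service offers.
    The oracle always plays the last arm of mus, assumed to be the best.
    Returns (Q_learner - Q_oracle, sum_s [S*(s) - S_u(s)])."""
    q_u = q_star = 0
    rate_loss = 0
    for _ in range(T):
        arrival = rng.random() < lam
        offers = [rng.random() < mu for mu in mus]
        u = rng.randrange(len(mus))           # uniform (uninformed) learner
        rate_loss += offers[-1] - offers[u]
        q_u = max(q_u + arrival - (q_u > 0) * offers[u], 0)
        q_star = max(q_star + arrival - (q_star > 0) * offers[-1], 0)
    return q_u - q_star, rate_loss

rng = random.Random(0)
runs = [coupled_run(2_000, 0.3, (0.5, 0.8), rng) for _ in range(200)]
avg_gap = sum(g for g, _ in runs) / len(runs)
avg_loss = sum(l for _, l in runs) / len(runs)
print(avg_gap, avg_loss)   # the backlog gap sits well below the rate loss
assert avg_gap <= avg_loss
```

With these rates the per-slot expected rate loss is $0.8 - 0.65 = 0.15$, so the right-hand side grows linearly while the backlog gap stays small, illustrating why the bound is loose in the stable regime.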

  • Busy Period/Phase Partitioning: Define busy periods as maximal consecutive time intervals during which the queue is non-empty, and partition time into busy and idle periods. Regret can then be decomposed over these intervals. For instance, the queue-length regret of a policy $\pi_1$ can be bounded as

$$R^{\pi_1}(T) \leq \sum_{i \neq i^*} \mathbb{E}[Z_i]\, \mathbb{E}[S_i(T)]$$

where $Z_i$ is the total backlog in busy periods served by suboptimal server $i$ and $S_i(T)$ is their count up to $T$ (Stahlbuhk et al., 2020).
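The busy-period partition itself is straightforward to extract from a trajectory. The helper below (illustrative, not from the cited paper) also returns the backlog accumulated in each period, i.e. the per-period contributions that terms like $Z_i$ aggregate:

```python
def busy_periods(traj):
    """Maximal intervals [start, end) where the queue is non-empty,
    together with the total backlog accumulated over each interval."""
    periods, start = [], None
    for t, q in enumerate(traj):
        if q > 0 and start is None:
            start = t                         # busy period begins
        elif q == 0 and start is not None:
            periods.append((start, t, sum(traj[start:t])))
            start = None                      # queue emptied: period ends
    if start is not None:                     # trajectory ends mid-period
        periods.append((start, len(traj), sum(traj[start:])))
    return periods

print(busy_periods([0, 1, 2, 1, 0, 0, 3, 1, 0]))
# -> [(1, 4, 4), (6, 8, 4)]
```

The emptying epochs between periods are exactly the regeneration points at which accumulated regret is "erased" in the late-stage analysis.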

  • Policy-Switching Queue Couplings: In settings with context or state (e.g., jobs with features), policy-switching queues construct a layered process: run policy $\pi$ up to time $s$, then switch to the optimal policy $\pi^*$, and analyze the contractive effect. The telescoping sum formula,

$$R_T = \mathbb{E}[Q(T) - Q^*(T)] = \sum_{t=1}^{T-1} \mathbb{E}\left[ Q(t,T) - Q(t-1,T) \right]$$

decomposes regret into the incremental effect of policy mismatches (Bae et al., 27 Jan 2026).
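The telescoping identity holds pathwise once the switched processes share common random numbers, which a short simulation can confirm. Here the two policies, the rates, and the convention $Q(s,T)$ = "run $\pi$ for the first $s$ slots, then $\pi^*$" are illustrative assumptions, not the exact construction of the cited paper.

```python
import random

def switched_queue(s, T, arrivals, offers, pi, pi_star):
    """Queue length at time T when pi controls the first s slots and the
    oracle pi_star controls the rest, under common random numbers."""
    Q = 0
    for t in range(T):
        i = (pi if t < s else pi_star)(t, Q)
        Q = max(Q + arrivals[t] - (Q > 0) * offers[t][i], 0)
    return Q

rng = random.Random(1)
T, lam, mus = 400, 0.3, (0.5, 0.8)
arrivals = [rng.random() < lam for _ in range(T)]
offers = [[rng.random() < mu for mu in mus] for _ in range(T)]

pi = lambda t, Q: t % 2    # a deliberately poor alternating policy
pi_star = lambda t, Q: 1   # oracle: always the faster arm

Q_switch = [switched_queue(s, T, arrivals, offers, pi, pi_star)
            for s in range(T)]
increments = [Q_switch[t] - Q_switch[t - 1] for t in range(1, T)]
assert sum(increments) == Q_switch[T - 1] - Q_switch[0]  # telescoping, pathwise
print(Q_switch[0], Q_switch[T - 1], sum(increments))
```

Each increment isolates the effect of letting $\pi$ act in one extra slot; the expectations of these increments are the terms of the displayed sum.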

  • Subinterval Adversarial Regret Bounds: For adversarially varying service rates or arrival processes, queue regret is controllable by the maximum "bandit loss" over all subintervals,

$$R_Q^\pi(T) \leq \sup_{I \subseteq [T]} \max_{i \in [N]} \sum_{t \in I} \left[ S_i(t) - \langle \mathbf{S}(t), \mathbf{X}^\pi(t) \rangle \right]$$

(Krishnakumar et al., 23 Jan 2025).
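For a small horizon, the right-hand side of the subinterval bound can be computed by brute force. The sketch below assumes $\mathbf{X}^\pi(t)$ is the policy's one-hot (or randomized) selection vector and enumerates all intervals, which is $O(T^2 N)$; a Kadane-style max-subarray pass per arm would reduce this to $O(TN)$.

```python
def worst_subinterval_loss(S, X):
    """sup over intervals I of [T] and arms i of
    sum_{t in I} (S[t][i] - <S[t], X[t]>), by exhaustive search."""
    T, N = len(S), len(S[0])
    earned = [sum(s * x for s, x in zip(S[t], X[t])) for t in range(T)]
    best = 0.0                       # the empty interval contributes 0
    for i in range(N):
        for a in range(T):           # interval start
            acc = 0.0
            for t in range(a, T):    # interval end (inclusive)
                acc += S[t][i] - earned[t]
                best = max(best, acc)
    return best

# Policy always plays arm 0, while arm 1 is served only in slot 1:
S = [[1, 0], [0, 1], [1, 0]]
X = [[1, 0], [1, 0], [1, 0]]
print(worst_subinterval_loss(S, X))   # -> 1.0
```

Restricting to the full interval $I = [T]$ recovers ordinary adversarial bandit regret; the supremum over subintervals is what makes the bound strong enough to control queue backlog.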

3. Stagewise and Termwise Queue-Regret Dynamics

The decomposition framework unveils structural regimes in queue-regret evolution.

  • Early-Stage (Unstable/Non-Regenerative) Regime: In the early regime, when learning algorithms have not sufficiently identified the optimal action, queues may remain backlogged and fail to regenerate. Regret escalates logarithmically due to cumulative "suboptimal pulls" as in multi-armed bandits:

$$\Psi_u(t) = \Omega\!\left(\frac{\log t}{\log \log t}\right)$$

where $\Psi_u(t)$ denotes the queue-regret at time $t$; matching $O(\log T)$ upper bounds hold under standard bandit policies (Krishnasamy et al., 2016, Stahlbuhk et al., 2020).

  • Late-Stage (Stable/Regenerative) Regime: Once the system has had enough exploration to consistently exploit the optimal action and exceed the arrival rate, the queues stabilize and regenerate—i.e., they regularly empty, and residual regret is "erased" at these epochs. Queue-regret now behaves like the "derivative" of cumulative bandit regret,

$$\Psi_u(t) = O\!\left(\mathrm{poly}(\log t)/t\right)$$

and vanishing per-slot or $O(1)$ total queue regret becomes achievable under suitable algorithms (Krishnasamy et al., 2016, Stahlbuhk et al., 2020).

  • Adversarial and Nonstationary Environments: Regret in time-varying or adversarial settings can be bounded by uniform subinterval bandit regret, resulting in polynomial or polylogarithmic bounds, such as

$$R_Q(T) = \widetilde{O}\!\left(\sqrt{N}\, T^{3/4}\right)$$

(Krishnakumar et al., 23 Jan 2025), or

$$R_T = O(\ln^2 T)$$

for adversarial contexts (Bae et al., 27 Jan 2026).

  • Learning-Queue Tradeoffs: In two-sided markets, queue-regret is tied to reward regret via a tunable parameter $\gamma$:

$$R(T) = \widetilde{O}(T^{1-\gamma}), \quad \overline{Q}(T) = \widetilde{O}(T^{\gamma/2}), \quad \max\text{-queue}(T) = O(T^{\gamma})$$

exhibiting a Pareto frontier between exploitation (low regret) and queue length (Yang et al., 15 Oct 2025).
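The early/late regime split can be made visible in simulation by tracking the expected queue gap of a learning scheduler against the genie over many coupled runs. This is a rough sketch: the $\epsilon$-greedy learner, the rates, and the checkpoints are illustrative assumptions, and the exact crossover point depends on all of them.

```python
import random

def queue_regret_curve(T, checkpoints, runs=200, lam=0.3, mus=(0.5, 0.8)):
    """Monte Carlo estimate of E[Q^pi(t) - Q^*(t)] at the given checkpoints,
    for an epsilon-greedy learner (eps_t = min(1, 20/t)) coupled to the
    genie via shared arrivals and per-arm service offers."""
    totals = {t: 0.0 for t in checkpoints}
    for r in range(runs):
        rng = random.Random(r)
        q_u = q_star = 0
        pulls, wins = [1, 1], [0, 0]
        for t in range(1, T + 1):
            arrival = rng.random() < lam
            offers = [rng.random() < mu for mu in mus]
            if rng.random() < min(1.0, 20 / t):
                u = rng.randrange(2)          # explore
            else:                             # exploit empirical best
                u = 0 if wins[0] / pulls[0] > wins[1] / pulls[1] else 1
            pulls[u] += 1
            wins[u] += offers[u]
            q_u = max(q_u + arrival - (q_u > 0) * offers[u], 0)
            q_star = max(q_star + arrival - (q_star > 0) * offers[1], 0)
            if t in totals:
                totals[t] += q_u - q_star
    return {t: v / runs for t, v in totals.items()}

curve = queue_regret_curve(3_000, checkpoints=(50, 500, 3_000))
print(curve)   # gap typically peaks early, then shrinks as queues regenerate
```

In typical runs the gap is largest while the learner is still identifying the faster server and decays once exploitation dominates and the queue regenerates, mirroring the two regimes described above.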

4. Algorithmic Design and Policy Implications

Analysis of the decomposition framework yields concrete algorithmic prescriptions:

  • Forced Exploration in Bandit Scheduling: Algorithms such as Q-UCB and Q-ThS interleave exploitation with forced exploration at a vanishing probability $\epsilon_t \asymp (\log^2 t)/t$ to guarantee sufficient learning for queue stabilization.
  • Queue-Aware Exploration Schedules: Algorithms designed to minimize queue-regret must preferentially explore in idle periods ("free exploration windows") to avoid incurring additional backlog and may use time-out thresholds to mitigate the risk of excessively long busy periods (Stahlbuhk et al., 2020).
  • Policy Coupling in Contextual/Structured Queues: The policy-switching coupling approach isolates the impact of suboptimal decisions to a single time slot and bounds their effect on the future queue trajectory, leveraging contractivity under optimal control (Bae et al., 27 Jan 2026).
  • Gradient and Bisection Tricks in Pricing/Matching: In two-sided queueing markets, stochastic zeroth-order gradient methods with random price perturbations directly exploit the decomposition to maintain negative drift and control queue sizes while learning unknown demand and supply curves (Yang et al., 15 Oct 2025).
  • Bias/Span Control in Reinforcement Learning: In queueing MDPs with birth-death structure, explicit bias-span bounds $O(S)$ replace exponential-diameter factors, enabling $O(\sqrt{AT})$ regret even as the state space or MDP diameter explodes (Anselmi et al., 2023, Weber et al., 2024).
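The forced-exploration prescription in the first bullet can be sketched as follows. The constant $c$, the greedy exploit step, and the bandit-feedback model are illustrative simplifications of the Q-ThS-style schedule, not the exact algorithm from the cited papers.

```python
import math
import random

def forced_exploration_scheduler(T, lam=0.3, mus=(0.5, 0.8), c=3.0, seed=0):
    """Each slot: with probability eps_t = min(1, c*log(t+1)^2/t) pick a
    uniformly random server (forced exploration); otherwise exploit the
    empirically best one.  Returns the queue-length trajectory."""
    rng = random.Random(seed)
    N = len(mus)
    pulls, wins = [1] * N, [0] * N     # smoothed empirical service rates
    Q, traj = 0, []
    for t in range(1, T + 1):
        eps = min(1.0, c * math.log(t + 1) ** 2 / t)
        if rng.random() < eps:
            i = rng.randrange(N)                       # forced exploration
        else:
            i = max(range(N), key=lambda k: wins[k] / pulls[k])
        served = rng.random() < mus[i]
        pulls[i] += 1
        wins[i] += served                              # bandit feedback
        Q = max(Q + (rng.random() < lam) - (Q > 0) * served, 0)
        traj.append(Q)
    return traj

traj = forced_exploration_scheduler(20_000)
print(max(traj), traj[-1])
```

Because $\epsilon_t \to 0$ like $(\log^2 t)/t$, the expected number of forced explorations up to $T$ is only $O(\log^3 T)$: enough for identification, yet vanishing per slot so the queue can stabilize.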

5. Model Extensions and Unifying Themes

The queue length regret decomposition framework generalizes across diverse queueing models:

| System Class | Decomposition Mechanism | Queue-Regret Behavior |
| --- | --- | --- |
| Single-/multi-server queues (Krishnasamy et al., 2016) | Regenerative cycles, busy/idle periods | $O(\log T)$, then $O(1/t)$ |
| Switch networks (Krishnasamy et al., 2016) | Multi-dimensional vector regret | As above, per queue |
| Channel scheduling (Stahlbuhk et al., 2020, Krishnakumar et al., 23 Jan 2025) | Busy periods, subintervals | $O(1)$ or polynomial in $T$ |
| Contextual bandits (Bae et al., 27 Jan 2026) | Coupling, policy-switching | $\widetilde{O}(T^{-1/4})$ |
| Two-sided markets (Yang et al., 15 Oct 2025) | Learning-queue tradeoff | $\widetilde{O}(T^{1-\gamma})$ |
| M/M/c/S RL (Anselmi et al., 2023, Weber et al., 2024) | Per-state analysis, bias-span | $O(S \log T + \sqrt{mT \log T})$ |

Key unifying principles include: queueing dynamics concentrate learning errors; regenerative phenomena allow past regret to be "washed away"; and appropriate decomposition isolates exploration losses from intrinsic queue-drift effects. Extensions to multi-class, networked, or adversarial environments primarily demand more intricate coupling or busy-period analyses but preserve the same separation of learning and queue-induced regret components.

6. Implications and Theoretical Significance

The queue length regret decomposition framework produces several crucial insights:

  • Sublinear and Order-Optimal Regret: Classical bandit approaches without queue awareness suffer $\Omega(\log T)$ regret, but queue-regret-aware policies can achieve $O(1)$ total or even vanishing per-slot regret, provided the system is stabilizable (Stahlbuhk et al., 2020).
  • Decoupling from State Space Diameter: In queue-MDPs, decomposition leverages the steady-state measure and bias structure, yielding regret bounds that eschew exponential dependence on queue capacity or state count (Anselmi et al., 2023, Weber et al., 2024).
  • Tradeoff Frontiers: The decomposition exposes explicit tradeoffs between regret and queue-length, representing fundamental performance limits in online learning for queueing control systems (Yang et al., 15 Oct 2025).
  • Robustness to Instability: In adversarial or non-stationary regimes, the decomposition formalism ensures that the arrival process and transient instability do not affect the additive regret bounds (arrivals cancel in the Lindley representation) (Krishnakumar et al., 23 Jan 2025).
  • Algorithmic Guideposts: It prescribes forced queue-aware exploration, adaptive timeouts, and explicit coupling as central design ingredients for near-optimal performance.

7. Broader Impact and Open Directions

The queue length regret decomposition framework informs a range of research on stochastic control, RL for queueing, and online optimization in dynamic resource allocation systems. It has driven the development of algorithms that achieve polylogarithmic or even constant regret in previously intractable settings and clarified the impact of queue dynamics on online learning rates. Ongoing challenges include optimizing regret in adversarial or non-stationary queues with partial feedback, designing algorithms that interpolate smoothly between different performance regimes, and extending decomposition techniques to networks with complex feedback or service interaction structures (Krishnasamy et al., 2016, Stahlbuhk et al., 2020, Krishnakumar et al., 23 Jan 2025, Yang et al., 15 Oct 2025, Anselmi et al., 2023, Weber et al., 2024, Bae et al., 27 Jan 2026).

The framework continues to unify queueing theory and online learning, yielding both policy prescriptions and analytic sharpness across diverse service systems.
