Queue Length Regret Decomposition

Updated 3 February 2026
  • The queue length regret decomposition framework splits the performance gap between learning-based and oracle policies into interpretable components of queue dynamics.
  • It leverages structural insights such as regenerative cycles, busy periods, and policy-switching to yield sharper regret bounds than standard bandit models.
  • The framework guides algorithm design with queue-aware exploration, forced scheduling, and coupling techniques to achieve near-optimal performance in complex queueing systems.

A queue length regret decomposition framework provides a rigorous methodology to analyze, bound, and optimize the performance difference—in queueing terms—between learning-based scheduling/control policies and optimal omniscient policies (usually called “oracle” or “genie” policies) in stochastic or adversarial queueing systems. The central technical innovation is to decompose queue-length regret into interpretable components linked to scheduling or learning errors, and to exploit structural properties of queues—such as regenerative cycles or birth–death process dynamics—to obtain sharper regret bounds than those available for generic bandits or MDPs. Over the last decade, this framework has unified a variety of algorithmic and analytic results across discrete-time, continuous-time, single-server, multi-server, and context-aware queueing models.

1. Formal Definition of Queue-Length Regret

Let $Q^\pi(t)$ denote the queue length (or queue vector in networked systems) at time $t$ under a policy $\pi$, and let $Q^*(t)$ be the queue length under an oracle policy that always selects the maximally stabilizing or profit-maximizing action. The queue-length regret up to time $T$ is typically defined as

$$R_Q^\pi(T) = \mathbb{E}\left[ Q^\pi(T) - Q^*(T) \right]$$

or, for cumulative/average formulations,

$$R^\pi(T) = \mathbb{E}\left[ \sum_{t=0}^{T-1} Q^\pi(t) - \sum_{t=0}^{T-1} Q^*(t) \right]$$

In adversarial or nonstationary settings, the queue-length regret is sometimes defined as the worst-case (over time and comparators) excess backlog,

$$R_Q^\pi(T) = \max_{t \le T} \max_{i \in [N]} \left[ Q^\pi(t) - Q^i(t) \right]$$

where $Q^i(t)$ is the backlog that would result from always selecting the $i$-th service/resource/action (Krishnakumar et al., 23 Jan 2025).
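As a concrete illustration, both definitions can be estimated by Monte Carlo on a toy two-server Bernoulli queue. This is a minimal sketch under assumed parameters (arrival rate $\lambda = 0.3$, service rates $(0.5, 0.8)$, and a blindly alternating comparison policy); none of these choices is taken from the cited papers.

```python
import random

def simulate(policy, T, lam=0.3, mus=(0.5, 0.8), seed=0):
    """Discrete-time single queue: Bernoulli(lam) arrivals; each slot the
    policy picks a server i and, if the queue is non-empty, a departure
    occurs with probability mus[i].  Returns the trajectory Q(0..T)."""
    rng = random.Random(seed)
    Q, traj = 0, [0]
    for t in range(T):
        i = policy(t, Q)
        offer = rng.random() < mus[i]      # service offered by chosen server
        arrival = rng.random() < lam
        Q = max(Q + arrival - (Q > 0) * offer, 0)
        traj.append(Q)
    return traj

T = 10_000
oracle = lambda t, Q: 1   # genie: always the faster server
naive = lambda t, Q: t % 2  # alternates blindly between servers

q_pi, q_star = simulate(naive, T), simulate(oracle, T)

final_regret = q_pi[T] - q_star[T]                      # sample of R_Q^pi(T)
cum_regret = sum(p - s for p, s in zip(q_pi, q_star))   # sample of R^pi(T)
print(final_regret, cum_regret)
```

A single pair of paths gives only a noisy sample of the expectations; averaging over many seeds approximates $R_Q^\pi(T)$ and $R^\pi(T)$.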

2. Foundational Decomposition Structures

The core of queue-length regret decomposition rests on upper-bounding queue regret via event-based, temporal, or process-theoretic partitions. Several archetypes have been established:

  • Cumulative Rate-Loss Decomposition: For single or multi-queue systems, queue-regret is bounded above by the sum over time of the instantaneous differences between the algorithmic and optimal number of departures (or reward),

$$Q_u(t) - Q_u^*(t) \leq \sum_{s=1}^{t} \left[ S_u^*(s) - S_u(s) \right]$$

where $S_u(s)$ and $S_u^*(s)$ are the service outcomes under the algorithm and the oracle, respectively (Krishnasamy et al., 2016).
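In expectation, the rate-loss bound is easy to check numerically. The sketch below couples the learner and the oracle to a common arrival stream and common per-arm service offers, then compares the average backlog gap with the average cumulative rate loss. The parameters and the uniform learner are illustrative assumptions; the bound holds in expectation, not necessarily on every sample path.

```python
import random

def coupled_run(T, lam, mus, rng):
    """One coupled sample path: shared arrivals and per-arm service offers.
    The oracle always plays the last arm of mus, assumed to be the best.
    Returns (Q_learner - Q_oracle, sum_s [S*(s) - S_u(s)])."""
    q_u = q_star = 0
    rate_loss = 0
    for _ in range(T):
        arrival = rng.random() < lam
        offers = [rng.random() < mu for mu in mus]
        u = rng.randrange(len(mus))           # uniform (uninformed) learner
        rate_loss += offers[-1] - offers[u]
        q_u = max(q_u + arrival - (q_u > 0) * offers[u], 0)
        q_star = max(q_star + arrival - (q_star > 0) * offers[-1], 0)
    return q_u - q_star, rate_loss

rng = random.Random(0)
runs = [coupled_run(2_000, 0.3, (0.5, 0.8), rng) for _ in range(200)]
avg_gap = sum(g for g, _ in runs) / len(runs)
avg_loss = sum(l for _, l in runs) / len(runs)
print(avg_gap, avg_loss)   # the backlog gap sits well below the rate loss
assert avg_gap <= avg_loss
```

With these rates the per-slot expected rate loss is $0.8 - 0.65 = 0.15$, so the right-hand side grows linearly while the backlog gap stays small, illustrating why the bound is loose in the stable regime.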

  • Busy Period/Phase Partitioning: Define busy periods as maximal consecutive time intervals during which the queue is non-empty, and partition time into busy and idle periods. Regret can then be decomposed over these intervals. For instance, the queue-length regret of a policy $\pi_1$ can be bounded as

$$R^{\pi_1}(T) \leq \sum_{i \neq i^*} \mathbb{E}[Z_i]\, \mathbb{E}[S_i(T)]$$

where $Z_i$ is the total backlog in busy periods served by suboptimal server $i$ and $S_i(T)$ is their count up to $T$ (Stahlbuhk et al., 2020).
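The busy-period partition itself is straightforward to extract from a trajectory. The helper below (illustrative, not from the cited paper) also returns the backlog accumulated in each period, i.e. the per-period contributions that terms like $Z_i$ aggregate:

```python
def busy_periods(traj):
    """Maximal intervals [start, end) where the queue is non-empty,
    together with the total backlog accumulated over each interval."""
    periods, start = [], None
    for t, q in enumerate(traj):
        if q > 0 and start is None:
            start = t                         # busy period begins
        elif q == 0 and start is not None:
            periods.append((start, t, sum(traj[start:t])))
            start = None                      # queue emptied: period ends
    if start is not None:                     # trajectory ends mid-period
        periods.append((start, len(traj), sum(traj[start:])))
    return periods

print(busy_periods([0, 1, 2, 1, 0, 0, 3, 1, 0]))
# -> [(1, 4, 4), (6, 8, 4)]
```

The emptying epochs between periods are exactly the regeneration points at which accumulated regret is "erased" in the late-stage analysis.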

  • Policy-Switching Queue Couplings: In settings with context or state (e.g., jobs with features), policy-switching queues construct a layered process: run policy $\pi$ up to time $s$, then switch to the optimal policy $\pi^*$, and analyze the contractive effect. The telescoping sum formula,

$$R_T = \mathbb{E}[Q(T) - Q^*(T)] = \sum_{t=1}^{T-1} \mathbb{E}\left[ Q(t,T) - Q(t-1,T) \right]$$

decomposes regret into the incremental effect of policy mismatches (Bae et al., 27 Jan 2026).
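The telescoping identity holds pathwise once the switched processes share common random numbers, which a short simulation can confirm. Here the two policies, the rates, and the convention $Q(s,T)$ = "run $\pi$ for the first $s$ slots, then $\pi^*$" are illustrative assumptions, not the exact construction of the cited paper.

```python
import random

def switched_queue(s, T, arrivals, offers, pi, pi_star):
    """Queue length at time T when pi controls the first s slots and the
    oracle pi_star controls the rest, under common random numbers."""
    Q = 0
    for t in range(T):
        i = (pi if t < s else pi_star)(t, Q)
        Q = max(Q + arrivals[t] - (Q > 0) * offers[t][i], 0)
    return Q

rng = random.Random(1)
T, lam, mus = 400, 0.3, (0.5, 0.8)
arrivals = [rng.random() < lam for _ in range(T)]
offers = [[rng.random() < mu for mu in mus] for _ in range(T)]

pi = lambda t, Q: t % 2    # a deliberately poor alternating policy
pi_star = lambda t, Q: 1   # oracle: always the faster arm

Q_switch = [switched_queue(s, T, arrivals, offers, pi, pi_star)
            for s in range(T)]
increments = [Q_switch[t] - Q_switch[t - 1] for t in range(1, T)]
assert sum(increments) == Q_switch[T - 1] - Q_switch[0]  # telescoping, pathwise
print(Q_switch[0], Q_switch[T - 1], sum(increments))
```

Each increment isolates the effect of letting $\pi$ act in one extra slot; the expectations of these increments are the terms of the displayed sum.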

  • Subinterval Adversarial Regret Bounds: For adversarially varying service rates or arrival processes, queue regret is controllable by the maximum "bandit loss" over all subintervals,

$$R_Q^\pi(T) \leq \sup_{I \subseteq [T]} \max_{i \in [N]} \sum_{t \in I} \left[ S_i(t) - \langle \mathbf{S}(t), \mathbf{X}^\pi(t) \rangle \right]$$

(Krishnakumar et al., 23 Jan 2025).
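For a small horizon, the right-hand side of the subinterval bound can be computed by brute force. The sketch below assumes $\mathbf{X}^\pi(t)$ is the policy's one-hot (or randomized) selection vector and enumerates all intervals, which is $O(T^2 N)$; a Kadane-style max-subarray pass per arm would reduce this to $O(TN)$.

```python
def worst_subinterval_loss(S, X):
    """sup over intervals I of [T] and arms i of
    sum_{t in I} (S[t][i] - <S[t], X[t]>), by exhaustive search."""
    T, N = len(S), len(S[0])
    earned = [sum(s * x for s, x in zip(S[t], X[t])) for t in range(T)]
    best = 0.0                       # the empty interval contributes 0
    for i in range(N):
        for a in range(T):           # interval start
            acc = 0.0
            for t in range(a, T):    # interval end (inclusive)
                acc += S[t][i] - earned[t]
                best = max(best, acc)
    return best

# Policy always plays arm 0, while arm 1 is served only in slot 1:
S = [[1, 0], [0, 1], [1, 0]]
X = [[1, 0], [1, 0], [1, 0]]
print(worst_subinterval_loss(S, X))   # -> 1.0
```

Restricting to the full interval $I = [T]$ recovers ordinary adversarial bandit regret; the supremum over subintervals is what makes the bound strong enough to control queue backlog.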

3. Stagewise and Termwise Queue-Regret Dynamics

The decomposition framework unveils structural regimes in queue-regret evolution.

  • Early-Stage (Unstable/Non-Regenerative) Regime: In the early regime, when learning algorithms have not sufficiently identified the optimal action, queues may remain backlogged and fail to regenerate. Regret escalates logarithmically due to cumulative "suboptimal pulls" as in multi-armed bandits:

$$\Psi_u(t) = \Omega\!\left(\frac{\log t}{\log \log t}\right)$$

where $\Psi_u(t)$ denotes the queue-regret at time $t$; matching $O(\log T)$ upper bounds hold under standard bandit policies (Krishnasamy et al., 2016, Stahlbuhk et al., 2020).

  • Late-Stage (Stable/Regenerative) Regime: Once the system has had enough exploration to consistently exploit the optimal action and exceed the arrival rate, the queues stabilize and regenerate—i.e., they regularly empty, and residual regret is "erased" at these epochs. Queue-regret now behaves like the "derivative" of cumulative bandit regret,

$$\Psi_u(t) = O\!\left(\mathrm{poly}(\log t)/t\right)$$

and vanishing per-slot or $O(1)$ total queue regret becomes achievable under suitable algorithms (Krishnasamy et al., 2016, Stahlbuhk et al., 2020).

  • Adversarial and Nonstationary Environments: Regret in time-varying or adversarial settings can be bounded by uniform subinterval bandit regret, resulting in polynomial or polylogarithmic bounds, such as

$$R_Q(T) = \widetilde{O}\!\left(\sqrt{N}\, T^{3/4}\right)$$

(Krishnakumar et al., 23 Jan 2025), or

$$R_T = O(\ln^2 T)$$

for adversarial contexts (Bae et al., 27 Jan 2026).

  • Learning-Queue Tradeoffs: In two-sided markets, queue-regret is tied to reward regret via a tunable parameter $\gamma$:

$$R(T) = \widetilde{O}(T^{1-\gamma}), \quad \overline{Q}(T) = \widetilde{O}(T^{\gamma/2}), \quad \max\text{-queue}(T) = O(T^{\gamma})$$

exhibiting a Pareto frontier between exploitation (low regret) and queue length (Yang et al., 15 Oct 2025).
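The early/late regime split can be made visible in simulation by tracking the expected queue gap of a learning scheduler against the genie over many coupled runs. This is a rough sketch: the $\epsilon$-greedy learner, the rates, and the checkpoints are illustrative assumptions, and the exact crossover point depends on all of them.

```python
import random

def queue_regret_curve(T, checkpoints, runs=200, lam=0.3, mus=(0.5, 0.8)):
    """Monte Carlo estimate of E[Q^pi(t) - Q^*(t)] at the given checkpoints,
    for an epsilon-greedy learner (eps_t = min(1, 20/t)) coupled to the
    genie via shared arrivals and per-arm service offers."""
    totals = {t: 0.0 for t in checkpoints}
    for r in range(runs):
        rng = random.Random(r)
        q_u = q_star = 0
        pulls, wins = [1, 1], [0, 0]
        for t in range(1, T + 1):
            arrival = rng.random() < lam
            offers = [rng.random() < mu for mu in mus]
            if rng.random() < min(1.0, 20 / t):
                u = rng.randrange(2)          # explore
            else:                             # exploit empirical best
                u = 0 if wins[0] / pulls[0] > wins[1] / pulls[1] else 1
            pulls[u] += 1
            wins[u] += offers[u]
            q_u = max(q_u + arrival - (q_u > 0) * offers[u], 0)
            q_star = max(q_star + arrival - (q_star > 0) * offers[1], 0)
            if t in totals:
                totals[t] += q_u - q_star
    return {t: v / runs for t, v in totals.items()}

curve = queue_regret_curve(3_000, checkpoints=(50, 500, 3_000))
print(curve)   # gap typically peaks early, then shrinks as queues regenerate
```

In typical runs the gap is largest while the learner is still identifying the faster server and decays once exploitation dominates and the queue regenerates, mirroring the two regimes described above.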

4. Algorithmic Design and Policy Implications

Analysis of the decomposition framework yields concrete algorithmic prescriptions:

  • Forced Exploration in Bandit Scheduling: Algorithms such as Q-UCB and Q-ThS interleave exploitation with forced exploration at a vanishing probability $\epsilon_t \asymp (\log^2 t)/t$ to guarantee sufficient learning for queue stabilization.
  • Queue-Aware Exploration Schedules: Algorithms designed to minimize queue-regret must preferentially explore in idle periods ("free exploration windows") to avoid incurring additional backlog and may use time-out thresholds to mitigate the risk of excessively long busy periods (Stahlbuhk et al., 2020).
  • Policy Coupling in Contextual/Structured Queues: The policy-switching coupling approach isolates the impact of suboptimal decisions to a single time slot and bounds their effect on the future queue trajectory, leveraging contractivity under optimal control (Bae et al., 27 Jan 2026).
  • Gradient and Bisection Tricks in Pricing/Matching: In two-sided queueing markets, stochastic zeroth-order gradient methods with random price perturbations directly exploit the decomposition to maintain negative drift and control queue sizes while learning unknown demand and supply curves (Yang et al., 15 Oct 2025).
  • Bias/Span Control in Reinforcement Learning: In queueing MDPs with birth-death structure, explicit bias-span bounds $O(S)$ replace exponential-diameter factors, enabling $O(\sqrt{AT})$ regret even as the state space or MDP diameter explodes (Anselmi et al., 2023, Weber et al., 2024).
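The forced-exploration prescription in the first bullet can be sketched as follows. The constant $c$, the greedy exploit step, and the bandit-feedback model are illustrative simplifications of the Q-ThS-style schedule, not the exact algorithm from the cited papers.

```python
import math
import random

def forced_exploration_scheduler(T, lam=0.3, mus=(0.5, 0.8), c=3.0, seed=0):
    """Each slot: with probability eps_t = min(1, c*log(t+1)^2/t) pick a
    uniformly random server (forced exploration); otherwise exploit the
    empirically best one.  Returns the queue-length trajectory."""
    rng = random.Random(seed)
    N = len(mus)
    pulls, wins = [1] * N, [0] * N     # smoothed empirical service rates
    Q, traj = 0, []
    for t in range(1, T + 1):
        eps = min(1.0, c * math.log(t + 1) ** 2 / t)
        if rng.random() < eps:
            i = rng.randrange(N)                       # forced exploration
        else:
            i = max(range(N), key=lambda k: wins[k] / pulls[k])
        served = rng.random() < mus[i]
        pulls[i] += 1
        wins[i] += served                              # bandit feedback
        Q = max(Q + (rng.random() < lam) - (Q > 0) * served, 0)
        traj.append(Q)
    return traj

traj = forced_exploration_scheduler(20_000)
print(max(traj), traj[-1])
```

Because $\epsilon_t \to 0$ like $(\log^2 t)/t$, the expected number of forced explorations up to $T$ is only $O(\log^3 T)$: enough for identification, yet vanishing per slot so the queue can stabilize.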

5. Model Extensions and Unifying Themes

The queue length regret decomposition framework generalizes across diverse queueing models:

| System Class | Decomposition Mechanism | Queue-Regret Behavior |
| --- | --- | --- |
| Single-/multi-server queues (Krishnasamy et al., 2016) | Regenerative cycles, busy/idle periods | $O(\log T)$, then $O(1/t)$ |
| Switch networks (Krishnasamy et al., 2016) | Multi-dimensional vector regret | As above, per queue |
| Channel scheduling (Stahlbuhk et al., 2020, Krishnakumar et al., 23 Jan 2025) | Busy periods, subintervals | $O(1)$ or polynomial in $T$ |
| Contextual bandits (Bae et al., 27 Jan 2026) | Coupling, policy-switching | $\widetilde{O}(T^{-1/4})$ |
| Two-sided markets (Yang et al., 15 Oct 2025) | Learning-queue tradeoff | $\widetilde{O}(T^{1-\gamma})$ |
| M/M/c/S RL (Anselmi et al., 2023, Weber et al., 2024) | Per-state analysis, bias-span | $O(S \log T + \sqrt{mT \log T})$ |

Key unifying principles include: queueing dynamics concentrate learning errors; regenerative phenomena allow past regret to be "washed away"; and appropriate decomposition isolates exploration losses from intrinsic queue-drift effects. Extensions to multi-class, networked, or adversarial environments primarily demand more intricate coupling or busy-period analyses but preserve the same separation of learning and queue-induced regret components.

6. Implications and Theoretical Significance

The queue length regret decomposition framework produces several crucial insights:

  • Sublinear and Order-Optimal Regret: Classical bandit approaches without queue awareness suffer $\Omega(\log T)$ regret, but queue-regret-aware policies can achieve $O(1)$ total or even vanishing per-slot regret, provided the system is stabilizable (Stahlbuhk et al., 2020).
  • Decoupling from State Space Diameter: In queue-MDPs, decomposition leverages the steady-state measure and bias structure, yielding regret bounds that eschew exponential dependence on queue capacity or state count (Anselmi et al., 2023, Weber et al., 2024).
  • Tradeoff Frontiers: The decomposition exposes explicit tradeoffs between regret and queue-length, representing fundamental performance limits in online learning for queueing control systems (Yang et al., 15 Oct 2025).
  • Robustness to Instability: In adversarial or non-stationary regimes, the decomposition formalism ensures that the arrival process and transient instability do not affect the additive regret bounds (arrivals cancel in the Lindley representation) (Krishnakumar et al., 23 Jan 2025).
  • Algorithmic Guideposts: It prescribes forced queue-aware exploration, adaptive timeouts, and explicit coupling as central design ingredients for near-optimal performance.

7. Broader Impact and Open Directions

The queue length regret decomposition framework informs a range of research on stochastic control, RL for queueing, and online optimization in dynamic resource allocation systems. It has driven the development of algorithms that achieve polylogarithmic or even constant regret in previously intractable settings and clarified the impact of queue dynamics on online learning rates. Ongoing challenges include optimizing regret in adversarial or non-stationary queues with partial feedback, designing algorithms that interpolate smoothly between different performance regimes, and extending decomposition techniques to networks with complex feedback or service interaction structures (Krishnasamy et al., 2016, Stahlbuhk et al., 2020, Krishnakumar et al., 23 Jan 2025, Yang et al., 15 Oct 2025, Anselmi et al., 2023, Weber et al., 2024, Bae et al., 27 Jan 2026).

The framework continues to unify queueing theory and online learning, yielding both policy prescriptions and analytic sharpness across diverse service systems.
