CQB-MNL: Contextual Queueing Bandits
- The paper introduces a unified framework combining queueing control with contextual MNL bandit learning to jointly optimize system stability and cumulative reward.
- It develops algorithmic approaches, including UCB-QMB, TS-QMB, and OFU-based methods, that adaptively address exploration-exploitation trade-offs in nonstationary settings.
- Empirical evaluations in LLM service routing and multi-server matching demonstrate sublinear routing regret and enhanced queue stability under dynamic, adversarial conditions.
Contextual Queueing Bandits with Multinomial Logit Feedback (CQB-MNL) formalize the union of queueing control problems and contextual multinomial logit (MNL) bandit learning, allowing efficient routing and scheduling in systems with multiple agents, servers, and resource constraints. The CQB-MNL model generalizes standard bandit and queueing matching problems by incorporating context features, transient queue states, and MNL-based stochastic feedback, enabling joint optimization of system stability and cumulative reward or regret in adversarial, nonstationary environments. This framework is especially prominent for learning-driven resource allocation and service orchestration in domains such as LLM serving, cloud scheduling, and networked queueing.
1. Mathematical Model and System Dynamics
CQB-MNL centrally models a multi-agent, multi-server system where contextual jobs (or queries) are matched to heterogeneous servers using online learning under uncertain and evolving preferences. The key model components are as follows:
- System State: The environment has $N$ queues (or job classes) and $K$ servers. At each discrete time $t$, the queue state $Q_i(t)$ represents the number of pending jobs in queue $i$. The feature space $\mathcal{X} \subseteq \mathbb{R}^d$ describes job or query contexts.
- Arrivals and Scheduling: Jobs arrive into queue $i$ as Bernoulli($\lambda_i$) processes. At each time step, nonempty queues are presented to the assignment policy, which forms (possibly disjoint) assortments of pending jobs for each server, subject to cardinality constraints.
- Context Embedding: Each agent (queue or job) is represented by a normalized feature vector $x_i \in \mathbb{R}^d$; each server $k$ has an unknown parameter $\theta_k \in \mathbb{R}^d$.
- MNL Feedback Model: The service outcome for each assortment $S$ offered to server $k$ follows the multinomial logit
$$\mathbb{P}(\text{job } i \text{ served} \mid S, k) = \frac{\exp(x_i^\top \theta_k)}{1 + \sum_{j \in S} \exp(x_j^\top \theta_k)}, \qquad i \in S,$$
with the null (no-service) choice probability
$$\mathbb{P}(\text{no service} \mid S, k) = \frac{1}{1 + \sum_{j \in S} \exp(x_j^\top \theta_k)}.$$
- Transitions: The queueing system evolves via
$$Q_i(t+1) = Q_i(t) + A_i(t) - D_i(t),$$
where $A_i(t) \in \{0,1\}$ is the arrival indicator and $D_i(t) \in \{0,1\}$ is a binary departure variable determined by the MNL outcome. Assortments and assignments induce competition for service within each server.
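As a concrete illustration, the dynamics above can be simulated directly. The sketch below assumes standard MNL choice probabilities; the arrival rates, feature values, server parameters, and assortment structure are illustrative assumptions, not values from the cited papers.

```python
import numpy as np

def mnl_probs(X, theta):
    """MNL choice probabilities over an assortment (rows of X = job features),
    including the null (no-service) option at index 0."""
    w = np.exp(X @ theta)                   # exponentiated utilities
    denom = 1.0 + w.sum()
    return np.concatenate(([1.0 / denom], w / denom))  # [null, job_1, ...]

def simulate_step(rng, Q, lam, X, thetas, assortments):
    """One transition Q(t+1) = Q(t) + A(t) - D(t) under MNL service outcomes.
    assortments[k] lists the queue indices offered to server k (disjoint)."""
    A = rng.binomial(1, lam)                # Bernoulli(lambda_i) arrivals
    D = np.zeros_like(Q)
    for k, S in enumerate(assortments):
        S = [i for i in S if Q[i] > 0]      # only nonempty queues compete
        if not S:
            continue
        p = mnl_probs(X[S], thetas[k])
        choice = rng.choice(len(S) + 1, p=p)  # 0 = no service this round
        if choice > 0:
            D[S[choice - 1]] = 1            # the chosen job departs
    return Q + A - D

rng = np.random.default_rng(0)
Q = np.zeros(3, dtype=int)
X = rng.normal(size=(3, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalized contexts
thetas = rng.normal(size=(2, 4))
for _ in range(100):
    Q = simulate_step(rng, Q, lam=np.array([0.3, 0.3, 0.2]),
                      X=X, thetas=thetas, assortments=[[0, 1], [2]])
```

Note that jobs in the same assortment compete for the single service slot through the shared MNL denominator, which is exactly the coupling the model description emphasizes.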
These components are foundational across model variants, including generalized resource sharing, cloud LLM serving, and matching networks (Kim et al., 2024, Bae et al., 2 Feb 2026).
2. Multinomial Logit Feedback and Implicit Preference Learning
MNL models capture the fundamental uncertainty and choice structure over which job or query is served or routed at each scheduling epoch. They introduce stochasticity in selection due to latent or unknown preferences, modeled via the exponentiated context–parameter dot product. In LLM-serving contexts, for instance, user query retrials and implicit feedback can be modeled as whether any server (LLM) succeeded in satisfying the query (departure) or not (retry), providing implicit labels for MNL-based learning (Bae et al., 2 Feb 2026).
Properties and implications:
- The log-likelihood of observed feedback yields a convex loss for parameter estimation.
- The MNL model induces competition among jobs in the same assortment, a departure from models that assume independent, static service probabilities.
- Regularity and identifiability require bounded context norms, bounded parameter norms, and a positivity condition lower-bounding the choice probabilities (a regularity constant $\kappa > 0$), ensuring tractable learning and finite confidence sets (Kim et al., 2024).
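The convex log-likelihood mentioned above can be optimized directly. Below is a minimal maximum-likelihood sketch for a single server's parameter, fit by plain gradient descent on synthetic feedback; the step size, sample count, and feature distribution are illustrative assumptions.

```python
import numpy as np

def mnl_nll(theta, data):
    """Negative log-likelihood of MNL feedback; convex in theta.
    data: list of (X, c), where X holds the assortment's feature rows and
    c is the observed choice (0 = null / no service, i >= 1 = job i-1)."""
    loss = 0.0
    for X, c in data:
        w = np.exp(X @ theta)
        denom = 1.0 + w.sum()
        loss -= np.log((1.0 if c == 0 else w[c - 1]) / denom)
    return loss

def mnl_nll_grad(theta, data):
    g = np.zeros_like(theta)
    for X, c in data:
        w = np.exp(X @ theta)
        g += X.T @ (w / (1.0 + w.sum()))   # model's expected served feature
        if c > 0:
            g -= X[c - 1]                  # observed served feature
    return g

def fit_mnl(data, d, steps=300, lr=0.5):
    theta = np.zeros(d)
    for _ in range(steps):
        theta -= lr * mnl_nll_grad(theta, data) / len(data)
    return theta

# Synthetic feedback from a known parameter, then refit it.
rng = np.random.default_rng(1)
theta_true = np.array([1.0, -1.0])
data = []
for _ in range(400):
    X = rng.normal(size=(3, 2))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalized contexts
    w = np.exp(X @ theta_true)
    p = np.concatenate(([1.0], w)) / (1.0 + w.sum())
    data.append((X, rng.choice(4, p=p)))
theta_hat = fit_mnl(data, d=2)
```

Because the loss is convex, gradient descent converges to the maximum-likelihood estimate; in the regularized variants used by the cited algorithms, a ridge term is added to the same objective.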
3. CQB-MNL Algorithmic Methodologies
Algorithmic schemes for CQB-MNL integrate bandit learning with queueing control. The principal methodologies are:
- Online Upper Confidence Bound (UCB) Approaches: UCB-QMB maintains server-specific confidence sets and computes optimistic estimates for service probabilities. It solves a MaxWeight-like scheduling problem with these UCB rates, selecting assignments that simultaneously exploit MaxWeight throughput optimality and incentivize exploration via confidence bonuses (Kim et al., 2024).
- Thompson Sampling (TS) Approaches: TS-QMB draws multiple samples from the posterior distribution over for each server, computes optimistic rates from these samples, and schedules based on these sampled rewards, addressing the computational complexity and robustness to distributional mis-specification (Kim et al., 2024, Bae et al., 2 Feb 2026).
- OFU-MNL++ / OFU-MNL for Generalized MNL Bandits: Recent advances derive self-concordant, variance-dependent confidence sets for MNL bandit learning, leading to algorithms achieving optimal (up to constants) regret rates, with independence from maximum assortment size and reduced dependence on parameter bounds (Lee et al., 14 Feb 2025). These methods are plug-and-play in CQB-MNL if queue length or waiting time is embedded into agent contexts.
- ACQB (Anytime Contextual Queueing Bandit): For LLM service, ACQB blends Thompson sampling and forced exploration, balancing exploitation with regular, randomized exploration steps during new arrivals, enhancing learning under heavy-tailed or adversarial contexts (Bae et al., 2 Feb 2026).
Algorithmic variants may use utility-aligned or contrastive embeddings (e.g., via InfoNCE loss on performance/cost pairs) for context features, server-disjoint or shared parameterizations, and regularized maximum-likelihood or mirror-descent updates as appropriate.
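To make the UCB-plus-MaxWeight step concrete, the sketch below scores each queue-server pair by queue length times an optimistic service-rate estimate and solves the resulting assignment by brute force. The single-job rate formula, the elliptical confidence bonus, and the brute-force matching are simplifications assumed for illustration, not the exact routine of the cited algorithms.

```python
import numpy as np
from itertools import permutations

def ucb_rate(x, theta_hat, V_inv, beta):
    """Optimistic single-job service probability under the MNL model:
    plug-in estimate plus an elliptical confidence bonus, clipped to 1."""
    u = x @ theta_hat
    mean = np.exp(u) / (1.0 + np.exp(u))
    bonus = beta * np.sqrt(x @ V_inv @ x)
    return min(1.0, mean + bonus)

def maxweight_assign(Q, X, theta_hats, V_invs, beta):
    """Optimistic MaxWeight: pick an injective server-to-queue map that
    maximizes the sum over servers k of Q[i] * ucb_rate(queue i, server k)."""
    N, K = len(Q), len(theta_hats)
    W = np.array([[Q[i] * ucb_rate(X[i], theta_hats[k], V_invs[k], beta)
                   for k in range(K)] for i in range(N)])
    best, best_w = None, -np.inf
    for perm in permutations(range(N), K):   # perm[k] = queue for server k
        w = sum(W[perm[k], k] for k in range(K))
        if w > best_w:
            best, best_w = perm, w
    return {k: i for k, i in enumerate(best) if Q[i] > 0}
```

Longer queues dominate the weights, which is the source of the MaxWeight stability property, while the bonus term inflates uncertain rates and drives exploration. A real implementation would replace the brute-force matching with a polynomial-time assignment solver.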
4. Theoretical Guarantees: Regret and Queue Stability
CQB-MNL approaches are evaluated by both service quality (regret) and queue stability. Theoretical results depend on system properties:
- Queue Stability: Under MNL regularity, feature boundedness, and “traffic slackness” (existence of an optimal matching whose service probability exceeds the arrival rate by a margin $\epsilon > 0$), algorithms achieve strong stability:
$$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{i} \mathbb{E}[Q_i(t)] < \infty.$$
This holds for optimistic (UCB/TS) MaxWeight policies (Kim et al., 2024).
- Routing or Service Regret: Regret is measured relative to a clairvoyant system with access to the true parameters:
$$\mathrm{Reg}(T) = \sum_{t=1}^{T} \big( r_t(\pi^*) - r_t(\pi) \big),$$
where $\pi^*$ denotes the clairvoyant policy and $\pi$ the learning policy.
- Under contextual MNL bandit models using OFU-MNL++ (and variants), regret is bounded in terms of the context dimension $d$ and the cumulative conditional reward variance, independent of the maximum assortment size (Lee et al., 14 Feb 2025).
- In the ACQB or TS-QMB protocols, cumulative regret grows sublinearly in the horizon $T$, while queue-length regret decays to zero for large $T$ (Bae et al., 2 Feb 2026).
- For UCB-QMB, regret admits a bound that scales with the maximum queue length $Q_{\max}$ (Kim et al., 2024).
These results hold without explicit dependence on assortment size, under adversarial or i.i.d. context sequences, and for rich system structures.
5. Applications and Empirical Evaluations
CQB-MNL frameworks are extensively validated in domains requiring dynamic routing and scheduling of contextual jobs with uncertain preferences:
- LLM Service Routing: CQB-MNL (ACQB) is deployed to route conversational queries across large server clusters (SPROUT, EmbedLLM, RouterBench datasets), using contrastive-learned embeddings and per-LLM parameters. The algorithms maintain low queue lengths and achieve sublinear routing regret, outperforming random, standard MaxWeight, and prior contextual bandit baselines across varying loads and server counts (Bae et al., 2 Feb 2026).
- General Multi-Queue Multi-Server Matching: UCB-QMB and TS-QMB stabilize large-scale systems with feature-rich agents and nonstationary reward structures, providing robust performance across synthetic and real-world matching environments (Kim et al., 2024).
- Contextual Queueing Bandits: These approaches generalize prior matching/queueing bandit designs by enabling parameter sharing via context, improved adaptation to non-i.i.d. arrivals, and explicit handling of MNL-induced competition within servers (Lee et al., 14 Feb 2025, Kim et al., 2024).
Empirical studies affirm that CQB-MNL algorithms closely approach the performance of clairvoyant oracles in both queue-length and routing regret over diverse network and server regimes, with contrastive alignment of embeddings further improving sample efficiency and regret minimization (Bae et al., 2 Feb 2026).
6. Assumptions, Conditions, and Adaptability
CQB-MNL systems rely on several technical conditions:
- Feature and Parameter Boundedness: Required for concentration and confidence-bound guarantees: $\|x\|_2 \le 1$ and $\|\theta_k\|_2 \le B$, or as dictated by the system scaling (Kim et al., 2024, Lee et al., 14 Feb 2025).
- Regularity Constants: An MNL feedback regularity constant $\kappa > 0$ lower-bounds the choice probabilities.
- Traffic Slackness: Existence of a disjoint matching with surplus service capacity over job arrivals ensures queue stability.
- Context Embedding Adaptability: Queue lengths, waiting times, or batch information can be embedded as context coordinates, provided normalization to maintain boundedness, allowing seamless integration with confidence-bound and MNL bandit methodologies (Lee et al., 14 Feb 2025).
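The embedding-adaptability point amounts to a simple clip-and-rescale step: append queue-state coordinates to the job features, then renormalize so the augmented context keeps norm at most 1. The cap values below are illustrative assumptions.

```python
import numpy as np

def embed_context(x, queue_len, wait_time, q_cap=100.0, w_cap=60.0):
    """Append normalized queue-state coordinates to a job feature vector and
    rescale so the augmented context keeps norm at most 1 (illustrative caps)."""
    q = min(queue_len, q_cap) / q_cap       # clip-and-scale to [0, 1]
    w = min(wait_time, w_cap) / w_cap
    z = np.concatenate([x, [q, w]])
    n = np.linalg.norm(z)
    return z / max(n, 1.0)                  # only shrink, never inflate
```

Because the result satisfies the same norm bound as the raw features, it can be fed unchanged into the confidence-set constructions of the MNL bandit methods described above.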
A plausible implication is that CQB-MNL admits flexible extension to settings with time-varying, adversarial, or highly structured contextual information, provided these core regularity and boundedness assumptions hold. This offers a unified route to scalable, theoretically grounded, and empirically validated sequential resource allocation.
7. Connections and Future Extensions
CQB-MNL bridges queueing theory, contextual bandits, and generalized online assortment optimization, incorporating competition, parameter sharing, and MNL dynamics across application areas. It extends MaxWeight and classic queueing matching control by adding (i) contextual generalization, (ii) stochastic/competitive MNL feedback, and (iii) rigorous learning-theoretic regret and stability guarantees (Kim et al., 2024, Lee et al., 14 Feb 2025, Bae et al., 2 Feb 2026). Ongoing research addresses:
- Improved computational scaling and implementation for large-scale, real-time service systems.
- Tighter confidence bounds for combinatorial MNL bandits in adversarial regimes.
- Learning under nonstationary arrivals and multi-resource constraints.
- Integration with utility-aligned contextual representation learning (e.g., contrastive or cost-sensitive embeddings).
- Richer feedback models beyond MNL and exploration-exploitation-externalities in general networked queueing bandits.
Empirical validation and methodological development continue, supported by open-source benchmarks and extensive evaluation in realistic environments (Bae et al., 2 Feb 2026, Kim et al., 2024).
Key Papers Referenced:
- "Improved Online Confidence Bounds for Multinomial Logistic Bandits" (Lee et al., 14 Feb 2025)
- "Learning to Route and Schedule LLMs from User Retrials via Contextual Queueing Bandits" (Bae et al., 2 Feb 2026)
- "Queueing Matching Bandits with Preference Feedback" (Kim et al., 2024)