
CQB-MNL: Contextual Queueing Bandits

Updated 9 February 2026
  • The paper introduces a unified framework combining queueing control with contextual MNL bandit learning to jointly optimize system stability and cumulative reward.
  • It develops algorithmic approaches, including UCB-QMB, TS-QMB, and OFU-based methods, that adaptively address exploration-exploitation trade-offs in nonstationary settings.
  • Empirical evaluations in LLM service routing and multi-server matching demonstrate sublinear routing regret and enhanced queue stability under dynamic, adversarial conditions.

Contextual Queueing Bandits with Multinomial Logit Feedback (CQB-MNL) formalize the union of queueing control problems and contextual multinomial logit (MNL) bandit learning, allowing efficient routing and scheduling in systems with multiple agents, servers, and resource constraints. The CQB-MNL model generalizes standard bandit and queueing matching problems by incorporating context features, transient queue states, and MNL-based stochastic feedback, enabling joint optimization of system stability and cumulative reward or regret in adversarial, nonstationary environments. This framework is especially prominent for learning-driven resource allocation and service orchestration in domains such as LLM serving, cloud scheduling, and networked queueing.

1. Mathematical Model and System Dynamics

CQB-MNL centrally models a multi-agent, multi-server system where contextual jobs (or queries) are matched to heterogeneous servers using online learning under uncertain and evolving preferences. The key model components are as follows:

  • System State: The environment has $N$ queues (or job classes) and $K$ servers. At each discrete time $t$, the queue state $Q_n(t)$ is the number of pending jobs in queue $n$. The feature space $\mathcal{X} \subset \mathbb{R}^d$ describes job or query contexts.
  • Arrivals and Scheduling: Jobs arrive into queue $n$ as Bernoulli($\lambda_n$) processes. At each time step, nonempty queues are presented to the assignment policy, which forms (possibly disjoint) assortments $S_{k,t}$ of pending jobs for each server, subject to cardinality constraints.
  • Context Embedding: Each agent (queue or job) $n$ is represented by a normalized feature vector $x_n \in \mathbb{R}^d$; each server $k$ has an unknown parameter $\theta_k \in \mathbb{R}^d$.
  • MNL Feedback Model: The service outcome for each assortment SS and server kk follows the multinomial logit:

$$\mu(n \mid S, \theta_k) = \frac{\exp(x_n^\top \theta_k)}{1 + \sum_{m \in S} \exp(x_m^\top \theta_k)},$$

with the null (no-service) choice probability

$$\mu(n_0 \mid S, \theta_k) = \frac{1}{1 + \sum_{m \in S} \exp(x_m^\top \theta_k)}.$$

  • Transitions: The queueing system evolves via

$$Q_n(t+1) = [Q_n(t) + A_n(t) - D_n(t)]^+,$$

where $A_n(t)$ is the arrival indicator and $D_n(t)$ is a binary departure variable, determined by the MNL outcome. Assortments and assignments induce competition for service within each server.

These components are foundational across model variants, including generalized resource sharing, cloud LLM serving, and matching networks (Kim et al., 2024, Bae et al., 2 Feb 2026).
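The components above can be sketched directly in code. The following is a minimal illustration (function names and toy dimensions are ours, not from the cited papers): it computes the MNL service probabilities for an assortment and applies one queue transition.

```python
import numpy as np

def mnl_probs(x, theta):
    """MNL choice probabilities for one server's assortment.

    x: (|S|, d) feature matrix of the jobs offered; theta: (d,) server
    parameter. Returns (p_serve, p_null): per-job service probabilities
    and the no-service (null) probability.
    """
    scores = np.exp(x @ theta)        # exp(x_n^T theta_k) for each job in S
    denom = 1.0 + scores.sum()        # 1 + sum over the assortment
    return scores / denom, 1.0 / denom

def queue_step(q, arrivals, departures):
    """One transition Q(t+1) = [Q(t) + A(t) - D(t)]^+ per queue."""
    return np.maximum(q + arrivals - departures, 0)

# Toy check: normalized contexts and parameter, 3 jobs, d = 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4)); x /= np.linalg.norm(x, axis=1, keepdims=True)
theta = rng.normal(size=4); theta /= np.linalg.norm(theta)
p, p0 = mnl_probs(x, theta)
```

Note that `p` and `p0` always sum to one, reflecting that exactly one job (or the null outcome) is realized per server per epoch.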

2. Multinomial Logit Feedback and Implicit Preference Learning

MNL models capture the fundamental uncertainty and choice structure over which job or query is served or routed at each scheduling epoch. They introduce stochasticity in selection due to latent or unknown preferences, modeled via the exponentiated context–parameter dot product. In LLM-serving contexts, for instance, user query retrials and implicit feedback can be modeled as whether any server (LLM) succeeded in satisfying the query (departure) or not (retry), providing implicit labels for MNL-based learning (Bae et al., 2 Feb 2026).

Properties and implications:

  • The log-likelihood of observed feedback yields a convex loss for parameter estimation.
  • The MNL model induces competition among jobs offered in the same assortment, a departure from models that assume independent, static service probabilities.
  • Regularity and identifiability require bounded context norms, bounded parameter norms, and a positivity condition on $\inf_{x,j,S,\theta} p_j(x,S,\theta)\, p_0(x,S,\theta)$, ensuring tractable learning and finite confidence sets (Kim et al., 2024).
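The convex log-likelihood noted above can be written down concretely. Below is a minimal sketch of the negative log-likelihood for MNL feedback under implicit labels (`-1` denoting the null/retry outcome); the function name and input conventions are illustrative, not from the cited papers.

```python
import numpy as np

def mnl_neg_log_likelihood(theta, X, S_list, y_list):
    """Negative log-likelihood of observed MNL feedback; convex in theta
    (a log-sum-exp term minus a linear term per round).

    X: (N, d) context matrix; S_list: offered assortments (index lists);
    y_list: observed outcome per round (served job index, or -1 for
    no service / retry).
    """
    nll = 0.0
    for S, y in zip(S_list, y_list):
        scores = np.exp(X[S] @ theta)
        log_denom = np.log1p(scores.sum())   # log(1 + sum_{m in S} exp(...))
        if y == -1:                          # null outcome: -log p_0
            nll += log_denom
        else:                                # served job y: -log p_y
            nll += log_denom - X[y] @ theta
    return nll

# At theta = 0 every exponentiated score is 1, so with a two-job
# assortment the loss per round is log(1 + 2) = log 3.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
nll = mnl_neg_log_likelihood(np.zeros(2), X, [[0, 1]], [0])  # -> log(3)
```

Because the loss is convex, any standard convex solver (or regularized maximum likelihood, as used by the algorithms in Section 3) recovers a unique estimate under the identifiability conditions above.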

3. CQB-MNL Algorithmic Methodologies

Algorithmic schemes for CQB-MNL integrate bandit learning with queueing control. The principal methodologies are:

  • Online Upper Confidence Bound (UCB) Approaches: UCB-QMB maintains server-specific confidence sets and computes optimistic estimates for service probabilities. It solves a MaxWeight-like scheduling problem with these UCB rates, selecting assignments that simultaneously exploit MaxWeight throughput optimality and incentivize exploration via confidence bonuses (Kim et al., 2024).
  • Thompson Sampling (TS) Approaches: TS-QMB draws multiple samples from the posterior distribution over $\theta_k$ for each server, computes optimistic rates from these samples, and schedules based on the sampled rewards, which reduces computational cost and improves robustness to distributional mis-specification (Kim et al., 2024, Bae et al., 2 Feb 2026).
  • OFU-MNL++ / OFU-MN$^2$L for Generalized MNL Bandits: Recent advances derive self-concordant, variance-dependent confidence sets for MNL bandit learning, yielding algorithms with optimal (up to constants) regret rates that are independent of the maximum assortment size $K$ and have reduced dependence on parameter bounds (Lee et al., 14 Feb 2025). These methods are plug-and-play in CQB-MNL if queue length or waiting time is embedded into agent contexts.
  • ACQB (Anytime Contextual Queueing Bandit): For LLM service, ACQB blends Thompson sampling and forced exploration, balancing exploitation with regular, randomized exploration steps during new arrivals, enhancing learning under heavy-tailed or adversarial contexts (Bae et al., 2 Feb 2026).

Algorithmic variants may use utility-aligned or contrastive embeddings (e.g., via InfoNCE loss on performance/cost pairs) for context features, server-disjoint or shared parameterizations, and regularized maximum-likelihood or mirror-descent updates as appropriate.
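The optimistic MaxWeight step shared by the UCB/TS approaches can be sketched as follows. This is a greedy illustration under assumed inputs (queue lengths and a matrix of optimistic rate estimates), not the papers' exact disjoint-matching optimization; `optimistic_maxweight` is a hypothetical name.

```python
import numpy as np

def optimistic_maxweight(q, ucb_rates, capacity):
    """Greedy MaxWeight-style assignment using optimistic (UCB or
    posterior-sampled) service rates.

    q: (N,) queue lengths; ucb_rates: (K, N) optimistic estimates of the
    service rate for each (server, queue) pair; capacity: max assortment
    size per server. Returns per-server assortments (disjoint index lists).
    """
    weights = ucb_rates * q               # MaxWeight score Q_n * mu_hat_{k,n}
    assigned = set()
    assortments = []
    for k in range(weights.shape[0]):
        order = np.argsort(-weights[k])   # highest optimistic weight first
        S = [n for n in order
             if q[n] > 0 and n not in assigned and weights[k, n] > 0][:capacity]
        assigned.update(S)
        assortments.append(S)
    return assortments

q = np.array([5.0, 0.0, 3.0])
ucb = np.array([[0.5, 0.9, 0.2],
                [0.1, 0.1, 0.8]])
S = optimistic_maxweight(q, ucb, capacity=1)
```

Weighting optimistic rates by queue lengths is what couples throughput optimality (MaxWeight) with exploration (the confidence bonus inflates uncertain rates, so under-explored pairs get scheduled).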

4. Theoretical Guarantees: Regret and Queue Stability

CQB-MNL approaches are evaluated by both service quality (regret) and queue stability. Theoretical results depend on system properties:

  • Queue Stability: Under MNL regularity, feature boundedness, and “traffic slackness” (existence of an optimal matching whose service probability exceeds the arrival rate by at least $\epsilon$), algorithms achieve strong stability:

$$\mathcal{Q}(T) = \frac{1}{T}\sum_{t=1}^T\sum_{n=1}^N \mathbb{E}[Q_n(t)] = O(\min\{N,K\}/\epsilon).$$

This holds for optimistic (UCB/TS) MaxWeight policies (Kim et al., 2024).

  • Routing or Service Regret: Regret is measured relative to a clairvoyant system with access to true parameters:
    • Under contextual MNL bandit models using OFU-MNL++ (and variants), regret is bounded by $O\left(d \log T \sqrt{\sum_{t=1}^T \sigma_t^2}\right)$ for rewards with conditional variance $\sigma_t^2$ and context dimension $d$, independent of $K$ (Lee et al., 14 Feb 2025).
    • In the ACQB or TS-QMB protocols, cumulative regret scales as $\tilde O(\sqrt{t})$, while queue-length regret decays as $\tilde O(t^{-1/4})$ for large $t$ (Bae et al., 2 Feb 2026).
    • For UCB-QMB, regret admits the bound $\tilde O\left(\min\left\{\frac{d}{\kappa}\sqrt{KT}\, Q_{\max},\; N(dK/\kappa^2 \epsilon^3)^{1/4}\, T^{3/4}\right\}\right)$, where $Q_{\max}$ denotes the maximum queue length (Kim et al., 2024).

These results hold without explicit dependence on assortment size, under adversarial or i.i.d. context sequences, and for rich system structures.
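The stability metric $\mathcal{Q}(T)$ above is straightforward to estimate empirically. The sketch below (our illustration, with hypothetical names) computes it from a sampled trajectory, using a toy single-queue Bernoulli system with positive slack $\mu - \lambda$ so the time average stays bounded.

```python
import numpy as np

def avg_queue_length(Q_traj):
    """Empirical estimate of Q(T) = (1/T) sum_t sum_n Q_n(t)
    from a queue-length trajectory of shape (T, N)."""
    return Q_traj.sum(axis=1).mean()

# Toy single-queue system: arrival rate lam, service success rate mu,
# slackness epsilon = mu - lam = 0.2 > 0 keeps the queue strongly stable.
rng = np.random.default_rng(1)
lam, mu, T = 0.3, 0.5, 5000
q, traj = 0, []
for _ in range(T):
    a = rng.random() < lam              # Bernoulli arrival A(t)
    d = rng.random() < mu               # Bernoulli departure attempt D(t)
    q = max(q + int(a) - int(d), 0)     # Q(t+1) = [Q(t) + A(t) - D(t)]^+
    traj.append([q])
avg = avg_queue_length(np.array(traj))  # stays O(1/epsilon), well below 1/0.2
```

In experiments, this time-averaged total queue length is the quantity compared against the $O(\min\{N,K\}/\epsilon)$ bound.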

5. Applications and Empirical Evaluations

CQB-MNL frameworks are extensively validated in domains requiring dynamic routing and scheduling of contextual jobs with uncertain preferences:

  • LLM Service Routing: CQB-MNL (ACQB) is deployed to route conversational queries across large server clusters (SPROUT, EmbedLLM, RouterBench datasets), using contrastive-learned embeddings and per-LLM parameters. The algorithms maintain low queue lengths and achieve sublinear routing regret, outperforming random, standard MaxWeight, and prior contextual bandit baselines across varying loads and server counts (Bae et al., 2 Feb 2026).
  • General Multi-Queue Multi-Server Matching: UCB-QMB and TS-QMB stabilize large-scale systems with feature-rich agents and nonstationary reward structures, providing robust performance across synthetic and real-world matching environments (Kim et al., 2024).
  • Contextual Queueing Bandits: These approaches generalize prior matching/queueing bandit designs by enabling parameter sharing via context, improved adaptation in non-i.i.d. arrivals, and explicit handling of competitive MNL competition within servers (Lee et al., 14 Feb 2025, Kim et al., 2024).

Empirical studies affirm that CQB-MNL algorithms closely approach the performance of clairvoyant oracles in both queue-length and routing regret over diverse network and server regimes, with contrastive alignment of embeddings further improving sample efficiency and regret minimization (Bae et al., 2 Feb 2026).

6. Assumptions, Conditions, and Adaptability

CQB-MNL systems rely on several technical conditions:

  • Feature and Parameter Boundedness: Required for concentration and confidence-bound guarantees: $\|x_n\| \le 1$ and $\|\theta_k\| \le 1$, or as dictated by the system scaling (Kim et al., 2024, Lee et al., 14 Feb 2025).
  • Regularity Constants: An MNL feedback regularity constant $\kappa$ ensures lower-bounded choice probabilities.
  • Traffic Slackness: Existence of a disjoint matching with surplus service capacity over job arrivals ensures queue stability.
  • Context Embedding Adaptability: Queue lengths, waiting times, or batch information can be embedded as context coordinates, provided normalization to maintain boundedness, allowing seamless integration with confidence-bound and MNL bandit methodologies (Lee et al., 14 Feb 2025).
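The last point, embedding queue state into the context while preserving boundedness, amounts to a simple normalization step. A minimal sketch (assuming a known cap `q_max` on queue length; the helper name is ours):

```python
import numpy as np

def augment_context(x, queue_len, q_max):
    """Append a normalized queue-length coordinate to a job context and
    renormalize so the boundedness assumption ||x|| <= 1 still holds."""
    z = np.append(x, queue_len / max(q_max, 1))  # scale queue length into [0, 1]
    norm = np.linalg.norm(z)
    return z / norm if norm > 1 else z           # project back into unit ball

x = np.array([0.6, 0.8])                 # unit-norm base context
z = augment_context(x, queue_len=4, q_max=8)
```

Because the augmented vector stays inside the unit ball, the concentration and confidence-set arguments above apply unchanged to the enlarged context.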

A plausible implication is that CQB-MNL admits flexible extension to settings with time-varying, adversarial, or highly structured contextual information, provided these core regularity and boundedness assumptions hold. This offers a unified route to scalable, theoretically grounded, and empirically validated sequential resource allocation.

7. Connections and Future Extensions

CQB-MNL bridges queueing theory, contextual bandits, and generalized online assortment optimization, incorporating competition, parameter sharing, and MNL dynamics across application areas. It extends MaxWeight and classic queueing matching control by adding (i) contextual generalization, (ii) stochastic/competitive MNL feedback, and (iii) rigorous learning-theoretic regret and stability guarantees (Kim et al., 2024, Lee et al., 14 Feb 2025, Bae et al., 2 Feb 2026). Ongoing research addresses:

  • Improved computational scaling and implementation for large-scale, real-time service systems.
  • Tighter confidence bounds for combinatorial MNL bandits in adversarial regimes.
  • Learning under nonstationary arrivals and multi-resource constraints.
  • Integration with utility-aligned contextual representation learning (e.g., contrastive or cost-sensitive embeddings).
  • Richer feedback models beyond MNL, and exploration-exploitation externalities in general networked queueing bandits.

Empirical validation and methodological development continue, supported by open-source benchmarks and extensive evaluation in realistic environments (Bae et al., 2 Feb 2026, Kim et al., 2024).

