- The paper introduces a dual-pool token-budget routing mechanism that aligns resource provisioning with real workload demands.
- It demonstrates 31–42% GPU savings with significant latency and reliability improvements through targeted segregation of short and long token pools.
- A closed-form cost model and self-calibrating EMA method are provided, reducing mis-routing errors below 1% and ensuring efficient scaling.
Dual-Pool Token-Budget Routing: Cost-Efficient and Reliable LLM Serving
Motivation and Problem Analysis
The paper addresses a fundamental inefficiency in the standard configuration for production LLM fleets, specifically in vLLM deployments. These systems provision every instance for the worst-case context length (e.g., 64K tokens) to ensure coverage for long-context requests. Empirical workload analyses, such as those from Azure and LMSYS, reveal that 80–95% of requests are short (≤8K tokens), resulting in substantial over-allocation of KV-cache memory and significant under-utilization of concurrency. This configuration–traffic mismatch leads to both economic inefficiencies (4–8× throughput capacity wasted) and operational unreliability, manifesting as OOM crashes, preemption storms, and increased request rejection rates.
The authors identify that substantial improvements in both cost and reliability can be unlocked by aligning resource provisioning with actual workload distributions rather than the worst-case scenario. Existing efforts, such as vLLM’s chunked prefill, only partially address compute-level bottlenecks without solving the memory over-provisioning issue. The root cause is that static configuration for peak context aligns poorly with bursty, short-dominated traffic.
Dual-Pool Token-Budget Routing: Methodology
The proposed solution is dual-pool token-budget routing—a constant-time, lightweight fleet-level dispatch mechanism. The fleet is partitioned into two homogeneous pools:
- Short Pool: Configured for high concurrency with a small maximum context length (e.g., 8K tokens, 128 concurrent sequences).
- Long Pool: Maintains the original large context window to serve all requests (e.g., 64K tokens, 16 concurrent sequences).
Each incoming request is routed based on its estimated total token budget, which is the sum of input (prompt) and max output token counts. Estimation leverages a per-category bytes-to-token ratio, learned online using an EMA from the usage.prompt_tokens feedback field. This scheme is tokenizer-independent and adapts to heterogeneous traffic (e.g., code, prose, CJK), resolving systematic mis-routing caused by tokenizer fertility disparities (2602.11174).
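A minimal sketch of this estimator, assuming a simple per-category EMA (the class name, smoothing factor, and prior value are illustrative, not from the paper):

```python
class BytesPerTokenEMA:
    """Online per-category EMA of the bytes-per-token ratio."""

    def __init__(self, alpha=0.1, prior=4.0):
        self.alpha = alpha    # EMA smoothing factor (assumed value)
        self.prior = prior    # prior ratio, ~4 bytes/token for English prose
        self.ratios = {}      # category -> current EMA of bytes per token

    def update(self, category, prompt_bytes, prompt_tokens):
        """Fold in the server's usage.prompt_tokens feedback after a request."""
        observed = prompt_bytes / max(prompt_tokens, 1)
        prev = self.ratios.get(category, self.prior)
        self.ratios[category] = (1 - self.alpha) * prev + self.alpha * observed

    def estimate_budget(self, category, prompt_bytes, max_output_tokens):
        """Estimated total budget = estimated prompt tokens + max output tokens."""
        ratio = self.ratios.get(category, self.prior)
        return prompt_bytes / ratio + max_output_tokens
```

Because it only consumes byte counts and the server's reported token usage, the scheme never touches a tokenizer, which is what makes it tokenizer-independent.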
Routing is governed by a conservative threshold and employs load-aware spillover logic to handle burst overloads. Critical implementation features include:
- Conservative estimation (bias toward safety by subtracting one sigma from the learned bytes-per-token ratio, which inflates the token estimate).
- Per-category calibration to converge quickly to accurate ratios (within 3.5% after ~50 samples per category).
- Always routing requests that exceed the short pool’s capacity directly to the long pool, guaranteeing feasibility.
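The decision logic combining the features above can be sketched as follows; the threshold, spill level, and function signature are assumptions for illustration, and the paper's exact policy may differ:

```python
def route(prompt_bytes, max_output_tokens, ratio, ratio_sigma,
          short_pool_load, threshold=8_192, spill_at=0.9):
    """Return 'short' or 'long' for one request.

    ratio / ratio_sigma: learned bytes-per-token mean and one-sigma spread.
    short_pool_load:     current short-pool utilization in [0, 1].
    """
    # Conservative estimation: subtract one sigma from the ratio, which
    # inflates the token estimate and biases borderline requests long.
    conservative_ratio = max(ratio - ratio_sigma, 1.0)
    est_tokens = prompt_bytes / conservative_ratio + max_output_tokens

    if est_tokens > threshold:
        return "long"    # exceeds short-pool budget; long pool is always feasible
    if short_pool_load >= spill_at:
        return "long"    # load-aware spillover under burst overload
    return "short"
```

Every branch terminates in a pool assignment after a constant amount of arithmetic, which is why the dispatch is constant-time per request.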
Analytical Cost Model
The paper introduces a closed-form model for quantifying potential GPU savings:
ΔG / G_homo = α · (1 − 1/ρ)
where α is the fraction of short traffic (requests below the routing threshold) and ρ is the per-GPU throughput gain of the short pool relative to the long pool. The authors show through fleet-scale estimation that sensible parameters (e.g., α ≈ 0.80, ρ ≈ 4) yield savings in the range of 35–60%, with the formula consistently serving as a conservative lower bound because several sources of efficiency (e.g., KV-cache occupancy, activation memory asymmetry) are left unmodeled.
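The closed form is directly computable; a worked instance with the example parameters from the text:

```python
def fractional_gpu_savings(alpha, rho):
    """ΔG / G_homo = α · (1 − 1/ρ): alpha is the short-traffic fraction,
    rho the per-GPU throughput gain of the short pool over the long pool."""
    return alpha * (1 - 1 / rho)

# Example parameters from the text: α ≈ 0.80, ρ ≈ 4 gives 60% of the fleet saved.
savings = fractional_gpu_savings(0.80, 4)
```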
Empirical Evaluation
Traces, Models, and Configurations
Evaluation is conducted using real-world traces: Azure LLM Inference (heavy tail, 80% sub-2K tokens) and LMSYS-Chat-1M (compact, mean prompt ~70 tokens), with Llama-3-70B deployed on A100 GPUs. Through discrete-event simulation, the study compares standard homogeneous provisioning against dual-pool token-budget routing.
Cost Reduction
Dual-pool routing reduces fleet-wide GPU usage by 31–42% (up to $2.86M in annual savings at AWS rates for 1,000 req/s). A case projection for Qwen3-235B-A22B (MI300X, 10,000 req/s) yields $15.4M/year in savings (1,576 → 1,096 GPUs).
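As a quick sanity check, the quoted GPU counts for the Qwen3 projection imply a fleet reduction consistent with the reported 31–42% band:

```python
before, after = 1_576, 1_096      # GPU counts before/after dual-pool routing (from the text)
savings_frac = (before - after) / before
# roughly 30.5% of the fleet, at the lower edge of the reported range
```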
Reliability and Latency
The approach reduces preemption events by 5.4× (from 47.3 to 8.7 per thousand requests) and lowers OOM incidence by 5.3×, raising the success rate from 99.69% to 99.95% under high load. P50 TTFT improves by 33% (0.42s → 0.28s) and P99 by 6%, primarily by eliminating head-of-line blocking and unlocking higher batch concurrency for short requests. Critically, most of the improvement comes without added deployment or operational complexity: just two pools capture ≥98% of the achievable savings.
Calibration and Threshold Setting
The self-calibrating EMA converges rapidly (≤3.5% error with 50 samples/category) and drives mis-routing rates below 1%. Savings are robust to threshold selection—a value in the 4K–16K range achieves ≥80% of peak possible savings.
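The quoted convergence bound is consistent with simple EMA arithmetic. A small check, assuming a smoothing factor of 0.1 (the paper's actual constant is not given here):

```python
def ema_after(prior, true_value, alpha, n):
    """Value of an EMA after n identical observations of true_value."""
    est = prior
    for _ in range(n):
        est = (1 - alpha) * est + alpha * true_value
    return est

# Start from a prior of 4.0 bytes/token with a true ratio of 3.0.
est = ema_after(4.0, 3.0, alpha=0.1, n=50)
rel_error = abs(est - 3.0) / 3.0   # well inside the quoted 3.5% band
```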
Practical and Theoretical Implications
Practically, dual-pool token-budget routing provides a scalable and immediately deployable mechanism for major reductions in cloud and on-prem inference cost, while substantially improving reliability at fleet scale. The method is orthogonal to, and composes with, leading per-GPU optimizations: PagedAttention (Kwon et al., 2023), continuous batching, prefill–decode disaggregation, and various KV-cache compression techniques (Ma et al., 5 Jan 2026, Yang et al., 17 Mar 2025). The analytically derived cost model allows practitioners to perform pre-deployment savings audits using only aggregate workload distributions and throughput measurements.
Theoretically, the work reinforces the importance of workload-aware resource provisioning and demonstrates that system-level routing policies can unlock global fleet efficiency unattainable by instance-local optimizations alone. It also exposes the inefficiency of one-size-fits-all configuration for large LLM fleets operating at scale.
Future Directions
The paper proposes several augmentations:
- Adaptive thresholds informed by real-time observability signals (preemption, OOM, rejections).
- Lightweight prompt compression for borderline requests to expand the “short” pool without increasing its configured capacity.
- More granular pool partitioning, though the marginal benefit is modest versus added operational complexity.
Automated, self-optimizing LLM serving systems that continuously adapt to workload dynamics are a logical next step, potentially leveraging online learning from observability metrics for closed-loop adjustment.
Conclusion
Dual-pool token-budget routing effectively resolves the over-provisioning and reliability challenges endemic to homogeneous LLM fleet deployments. Through an analytically principled, empirically validated, and operationally simple approach—requiring only constant-time overhead, no tokenizer dependencies, and trivial calibration—this method achieves significant cost reductions, eliminates preemption for the majority of traffic, and improves both latency and reliability. Its compatibility with existing scheduling and memory optimization techniques positions it as a practical, high-impact tool for modern fleet-scale LLM serving deployments.