- The paper introduces a dual-pool token-budget routing mechanism that aligns resource provisioning with real workload demands.
- It demonstrates 31–42% GPU savings with significant latency and reliability improvements through targeted segregation of short and long token pools.
- A closed-form cost model and self-calibrating EMA method are provided, reducing mis-routing errors below 1% and ensuring efficient scaling.
Dual-Pool Token-Budget Routing: Cost-Efficient and Reliable LLM Serving
Motivation and Problem Analysis
The paper addresses a fundamental inefficiency in the standard configuration for production LLM fleets, specifically in vLLM deployments. These systems provision every instance for the worst-case context length (e.g., 64K tokens) to ensure coverage for long-context requests. Empirical workload analyses, such as those from Azure and LMSYS, reveal that 80–95% of requests are short (≤8K tokens), resulting in substantial over-allocation of KV-cache memory and significant under-utilization of concurrency. This configuration–traffic mismatch leads to both economic inefficiencies (4–8× throughput capacity wasted) and operational unreliability, manifesting as OOM crashes, preemption storms, and increased request rejection rates.
The authors identify that substantial improvements in both cost and reliability can be unlocked by aligning resource provisioning with actual workload distributions rather than the worst-case scenario. Existing efforts, such as vLLM’s chunked prefill, only partially address compute-level bottlenecks without solving the memory over-provisioning issue. The root cause is that static configuration for peak context aligns poorly with bursty, short-dominated traffic.
Dual-Pool Token-Budget Routing: Methodology
The proposed solution is dual-pool token-budget routing—a constant-time, lightweight fleet-level dispatch mechanism. The fleet is partitioned into two homogeneous pools:
- Short Pool: Configured for high concurrency with a small maximum context length (e.g., 8K tokens, 128 concurrent sequences).
- Long Pool: Maintains the original large context window to serve all requests (e.g., 64K tokens, 16 concurrent sequences).
Each incoming request is routed based on its estimated total token budget, which is the sum of input (prompt) and max output token counts. Estimation leverages a per-category bytes-to-token ratio, learned online using an EMA from the usage.prompt_tokens feedback field. This scheme is tokenizer-independent and adapts to heterogeneous traffic (e.g., code, prose, CJK), resolving systematic mis-routing caused by tokenizer fertility disparities (2602.11174).
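A minimal sketch of this estimator, assuming a simple per-category EMA (the class name, smoothing factor, and prior value are illustrative, not from the paper):

```python
class BytesPerTokenEMA:
    """Online per-category EMA of the bytes-per-token ratio."""

    def __init__(self, alpha=0.1, prior=4.0):
        self.alpha = alpha    # EMA smoothing factor (assumed value)
        self.prior = prior    # prior ratio, ~4 bytes/token for English prose
        self.ratios = {}      # category -> current EMA of bytes per token

    def update(self, category, prompt_bytes, prompt_tokens):
        """Fold in the server's usage.prompt_tokens feedback after a request."""
        observed = prompt_bytes / max(prompt_tokens, 1)
        prev = self.ratios.get(category, self.prior)
        self.ratios[category] = (1 - self.alpha) * prev + self.alpha * observed

    def estimate_budget(self, category, prompt_bytes, max_output_tokens):
        """Estimated total budget = estimated prompt tokens + max output tokens."""
        ratio = self.ratios.get(category, self.prior)
        return prompt_bytes / ratio + max_output_tokens
```

Because it only consumes byte counts and the server's reported token usage, the scheme never touches a tokenizer, which is what makes it tokenizer-independent.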
Routing is governed by a conservative threshold and employs load-aware spillover logic to handle burst overloads. Critical implementation features include:
- Conservative estimation (bias toward safety by subtracting one sigma from the learned bytes-per-token ratio, which inflates the token estimate).
- Per-category calibration to converge quickly to accurate ratios (within 3.5% after ~50 samples per category).
- Always routing requests that exceed the short pool’s capacity directly to the long pool, guaranteeing feasibility.
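The decision logic combining the features above can be sketched as follows; the threshold, spill level, and function signature are assumptions for illustration, and the paper's exact policy may differ:

```python
def route(prompt_bytes, max_output_tokens, ratio, ratio_sigma,
          short_pool_load, threshold=8_192, spill_at=0.9):
    """Return 'short' or 'long' for one request.

    ratio / ratio_sigma: learned bytes-per-token mean and one-sigma spread.
    short_pool_load:     current short-pool utilization in [0, 1].
    """
    # Conservative estimation: subtract one sigma from the ratio, which
    # inflates the token estimate and biases borderline requests long.
    conservative_ratio = max(ratio - ratio_sigma, 1.0)
    est_tokens = prompt_bytes / conservative_ratio + max_output_tokens

    if est_tokens > threshold:
        return "long"    # exceeds short-pool budget; long pool is always feasible
    if short_pool_load >= spill_at:
        return "long"    # load-aware spillover under burst overload
    return "short"
```

Every branch terminates in a pool assignment after a constant amount of arithmetic, which is why the dispatch is constant-time per request.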
Analytical Cost Model
The paper introduces a closed-form model for quantifying potential GPU savings:
ΔG / G_homo = α · (1 − 1/ρ)
where α is the fraction of short traffic (requests below the routing threshold) and ρ is the per-GPU throughput gain of the short pool relative to the long pool. The authors show through fleet-scale estimation that sensible parameters (e.g., α ≈ 0.80, ρ ≈ 4) yield savings in the range of 35–60%, with the formula consistently serving as a conservative lower bound because several sources of efficiency (e.g., KV-cache occupancy, activation memory asymmetry) are left unmodeled.
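The closed form is directly computable; a worked instance with the example parameters from the text:

```python
def fractional_gpu_savings(alpha, rho):
    """ΔG / G_homo = α · (1 − 1/ρ): alpha is the short-traffic fraction,
    rho the per-GPU throughput gain of the short pool over the long pool."""
    return alpha * (1 - 1 / rho)

# Example parameters from the text: α ≈ 0.80, ρ ≈ 4 gives 60% of the fleet saved.
savings = fractional_gpu_savings(0.80, 4)
```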
Empirical Evaluation
Traces, Models, and Configurations
Evaluation is conducted using real-world traces: Azure LLM Inference (heavy tail, 80% sub-2K tokens) and LMSYS-Chat-1M (compact, mean prompt ~70 tokens), with Llama-3-70B deployed on A100 GPUs. Through discrete-event simulation, the study compares standard homogeneous provisioning against dual-pool token-budget routing.
Cost Reduction
Dual-pool routing reduces fleet-wide GPU usage by 31–42% (up to $2.86M in annual savings at AWS rates for 1,000 req/s). A case projection for Qwen3-235B-A22B (MI300X, 10,000 req/s) yields $15.4M/year in savings (1,576 → 1,096 GPUs).
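As a quick sanity check, the quoted GPU counts for the Qwen3 projection imply a fleet reduction consistent with the reported 31–42% band:

```python
before, after = 1_576, 1_096      # GPU counts before/after dual-pool routing (from the text)
savings_frac = (before - after) / before
# roughly 30.5% of the fleet, at the lower edge of the reported range
```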
Reliability and Latency
The approach reduces preemption events by 5.4× (from 47.3 to 8.7 per thousand requests) and lowers OOM incidence by 5.3×, raising the success rate from 99.69% to 99.95% under high load. P50 TTFT improves by 33% (0.42s → 0.28s) and P99 by 6%, primarily by eliminating head-of-line blocking and unlocking higher batch concurrency for short requests. Critically, most of the improvement comes without added deployment or operational complexity: just two pools capture ≥98% of the achievable savings.
Calibration and Threshold Setting
The self-calibrating EMA converges rapidly (≤3.5% error with 50 samples/category) and drives mis-routing rates below 1%. Savings are robust to threshold selection—a value in the 4K–16K range achieves ≥80% of peak possible savings.
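The quoted convergence bound is consistent with simple EMA arithmetic. A small check, assuming a smoothing factor of 0.1 (the paper's actual constant is not given here):

```python
def ema_after(prior, true_value, alpha, n):
    """Value of an EMA after n identical observations of true_value."""
    est = prior
    for _ in range(n):
        est = (1 - alpha) * est + alpha * true_value
    return est

# Start from a prior of 4.0 bytes/token with a true ratio of 3.0.
est = ema_after(4.0, 3.0, alpha=0.1, n=50)
rel_error = abs(est - 3.0) / 3.0   # well inside the quoted 3.5% band
```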
Practical and Theoretical Implications
Practically, dual-pool token-budget routing provides a scalable and immediately deployable mechanism for major reductions in cloud and on-prem inference cost, while substantially improving reliability at fleet scale. The method is orthogonal to, and composes with, leading per-GPU optimizations: PagedAttention (Kwon et al., 2023), continuous batching, prefill–decode disaggregation, and various KV-cache compression techniques (Ma et al., 5 Jan 2026, Yang et al., 17 Mar 2025). The analytically derived cost model allows practitioners to perform pre-deployment savings audits using only aggregate workload distributions and throughput measurements.
Theoretically, the work reinforces the importance of workload-aware resource provisioning and demonstrates that system-level routing policies can unlock global fleet efficiency unattainable by instance-local optimizations alone. It also exposes the inefficiency of one-size-fits-all configuration for large LLM fleets operating at scale.
Future Directions
The paper proposes several augmentations:
- Adaptive thresholds informed by real-time observability signals (preemption, OOM, rejections).
- Lightweight prompt compression for borderline requests to expand the “short” pool without increasing its configured capacity.
- More granular pool partitioning, though the marginal benefit is modest versus added operational complexity.
Automated, self-optimizing LLM serving systems that continuously adapt to workload dynamics are a logical next step, potentially leveraging online learning from observability metrics for closed-loop adjustment.
Conclusion
Dual-pool token-budget routing effectively resolves the over-provisioning and reliability challenges endemic to homogeneous LLM fleet deployments. Through an analytically principled, empirically validated, and operationally simple approach—requiring only constant-time overhead, no tokenizer dependencies, and trivial calibration—this method achieves significant cost reductions, eliminates preemption for the majority of traffic, and improves both latency and reliability. Its compatibility with existing scheduling and memory optimization techniques positions it as a practical, high-impact tool for modern fleet-scale LLM serving deployments.