Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale

Published 8 Apr 2026 in cs.DC, cs.OS, and cs.PF | (2604.06970v1)

Abstract: When output token counts can be predicted at submission time (Gan et al., 2026), client-side scheduling against a black-box LLM API becomes semi-clairvoyant: decisions condition on coarse token priors even though the provider's internals remain hidden. We decompose this boundary problem into three separable concerns: allocation (inter-class share via adaptive DRR), ordering (intra-class sequencing with feasible-set scoring), and overload control (explicit admit/defer/reject on a cost ladder). An information ladder experiment shows that coarse magnitude priors -- not class labels alone -- are the practical threshold for useful client control; removing magnitude inflates short-request P95 by up to $5.8\times$ and degrades deadline satisfaction. Under balanced / high congestion the full stack achieves 100% completion, 100% deadline satisfaction, and useful goodput of $4.2 \pm 1.6$ SLO-meeting requests/s with short P95 within tens of milliseconds of quota-tiered isolation. A predictor-noise sweep confirms graceful degradation under up to 60% multiplicative error. Heavy-dominated regimes separate policies on completion, tail, and interpretable shedding. We further compare short-priority allocation (biased toward interactive traffic) with Fair Queuing (round-robin across classes): Fair Queuing achieves +32% short-request P90 improvement over FIFO with only +17% long-request overhead, versus Short-Priority's +27% / +116% trade-off -- demonstrating that the allocation layer accommodates different fairness objectives without changing the remaining stack. We contribute the three-layer client-side decomposition, controlled evaluation of joint metrics across regimes, allocation-policy alternatives, and overload-policy evidence linking cost-ladder shedding to the stated service objective.

Authors (6)

Summary

The paper introduces a layered scheduling framework that uses coarse output token priors to enable semi-clairvoyant admission, allocation, and ordering decisions.
It employs adaptive Deficit Round Robin for inter-class allocation and urgency-weighted heuristics for intra-class ordering to optimize latency and throughput.
Empirical evaluations demonstrate that using coarse priors nearly matches oracle performance, significantly improving SLO compliance and useful goodput under varied workloads.

Semi-Clairvoyant Scheduling for Black-Box LLM Inference: A Layered Decomposition

Problem Setting and Motivation

The paper "Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale" (2604.06970) addresses client-side scheduling for black-box LLM APIs, where the provider's internal scheduling, batching, and queuing are opaque to the client. Traditional approaches offer limited leverage because request cost, which is largely determined by output token length, is not observable at submission time. Building on recent advances in output-length prediction, the authors show that coarse, per-request output token priors turn the client into a semi-clairvoyant scheduler that can make more principled admission, allocation, and ordering decisions before submission.

The scheduling challenge is restructured as a three-way decomposition:

Allocation: Inter-class share management via adaptive Deficit Round Robin (DRR).
Ordering: Intra-class selection through slowdown-aware, urgency-weighted heuristics.
Overload Control: Explicit, bucketed admit/defer/reject at admission, targeting cost-aware service objectives.

Empirical evaluation and ablation demonstrate how these layers, empowered with coarse magnitude information, substantially improve latency SLOs, completion rate, and useful goodput under various workload regimes.

Figure 1: Data flow for the client scheduling stack, with allocation, ordering, and overload control layered before the black-box LLM API.

Three-Layer Scheduling Stack: Design and Rationale

The decomposition into allocation, ordering, and overload control is not only analytically clean but operationally meaningful, mapping distinct pathologies and interventions to separate layers:

Allocation employs adaptive DRR with weight scaling based on observed congestion, ensuring interactive requests maintain protected share under stress. The client’s share decision adjusts dynamically as load fluctuates, optimizing both fairness and responsiveness.
Ordering within classes, particularly the heavy class, uses a feasible-set score that combines queue residence, predicted cost, and deadline urgency: $w_1 \cdot (\text{wait}/\text{cost}) - w_2 \cdot (\text{size}/\text{ref}) + w_3 \cdot \text{urgency}$. This structure mitigates head-of-line blocking and preserves deadline-sensitive work.
Overload Control is implemented as a severity-based admit/defer/reject decision at admission, with explicit mapping from estimated cost (medium/long/xlong) to action. The process replaces implicit timeouts and provider-side failures with interpretable, client-side shedding.
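The ordering score in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the weights, the reference size `ref_tokens`, the `tokens_per_s` rate used to turn a token prior into a service-time estimate, the shape of `urgency`, and the feasibility rule (predicted service time fits within the remaining deadline) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    req_id: str
    wait_s: float       # time already spent in the client-side queue
    pred_tokens: float  # coarse output-token prior (stands in for cost and size)
    deadline_s: float   # seconds remaining until the request's deadline


def urgency(deadline_s: float, horizon_s: float = 30.0) -> float:
    """Hypothetical urgency term in [0, 1]: 0 beyond the horizon, 1 at the deadline."""
    return min(1.0, max(0.0, 1.0 - deadline_s / horizon_s))


def score(r: Pending, w1: float = 1.0, w2: float = 0.5, w3: float = 2.0,
          ref_tokens: float = 1024.0, tokens_per_s: float = 50.0) -> float:
    """w1 * (wait / cost) - w2 * (size / ref) + w3 * urgency."""
    est_cost_s = max(r.pred_tokens / tokens_per_s, 1e-9)  # avoid division by zero
    return (w1 * (r.wait_s / est_cost_s)
            - w2 * (r.pred_tokens / ref_tokens)
            + w3 * urgency(r.deadline_s))


def pick_next(queue: list, tokens_per_s: float = 50.0, **weights) -> Pending:
    """Score only the feasible set; fall back to the whole queue if it is empty."""
    feasible = [r for r in queue if r.pred_tokens / tokens_per_s <= r.deadline_s]
    candidates = feasible or list(queue)
    return max(candidates, key=lambda r: score(r, tokens_per_s=tokens_per_s, **weights))


if __name__ == "__main__":
    queue = [
        Pending("long", wait_s=2.0, pred_tokens=4000, deadline_s=120.0),
        Pending("short", wait_s=2.0, pred_tokens=100, deadline_s=10.0),
    ]
    print(pick_next(queue).req_id)
```

With these illustrative weights, a small request that has waited as long as a large one scores higher on both the wait/cost and size terms, which is how a score of this shape counters head-of-line blocking.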

Decoupling these concerns allows for independent diagnosis and policy adaptation. Failures in completion, tail latency, or deadline satisfaction can be traced to their corresponding layer, as corroborated by the paper’s layerwise progression analysis.
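To make the allocation layer concrete, the following is a hedged sketch of Deficit Round Robin with a congestion-scaled quantum. The DRR mechanics (per-class deficit counters, a quantum added each round, dispatch while the head fits) are standard; the adaptation rule, the class names, `max_boost`, and the normalized `congestion` signal are hypothetical choices for illustration and are not taken from the paper.

```python
from collections import deque


class AdaptiveDRR:
    """Deficit Round Robin over request classes, costs measured in predicted tokens."""

    def __init__(self, base_quantum, boost_class="short", max_boost=3.0):
        self.base = dict(base_quantum)             # class -> tokens credited per round
        self.queues = {c: deque() for c in self.base}
        self.deficit = {c: 0.0 for c in self.base}
        self.boost_class = boost_class             # class protected under stress (assumed)
        self.max_boost = max_boost
        self.congestion = 0.0                      # 0 = idle, 1 = saturated (assumed signal)

    def observe_congestion(self, level):
        self.congestion = min(1.0, max(0.0, level))

    def quantum(self, cls):
        q = self.base[cls]
        if cls == self.boost_class:
            # Assumed rule: scale the protected class linearly up to max_boost.
            q *= 1.0 + (self.max_boost - 1.0) * self.congestion
        return q

    def enqueue(self, cls, req_id, pred_tokens):
        self.queues[cls].append((req_id, pred_tokens))

    def next_round(self):
        """Run one DRR round and return the request ids dispatched in order."""
        dispatched = []
        for cls, q in self.queues.items():
            if not q:
                self.deficit[cls] = 0.0            # idle classes do not bank credit
                continue
            self.deficit[cls] += self.quantum(cls)
            while q and q[0][1] <= self.deficit[cls]:
                req_id, cost = q.popleft()
                self.deficit[cls] -= cost
                dispatched.append(req_id)
            if not q:
                self.deficit[cls] = 0.0
        return dispatched


if __name__ == "__main__":
    drr = AdaptiveDRR({"short": 200, "heavy": 200})
    for i in range(3):
        drr.enqueue("short", f"s{i + 1}", 150)
    drr.enqueue("heavy", "h1", 500)
    drr.observe_congestion(1.0)
    print(drr.next_round())
```

Because only the quantum changes, swapping in a different share rule (for example, equal quanta for a fair-queuing flavor) does not touch the ordering or overload layers.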

Empirical Evaluation: Information and Policy Ladders

Influence of Information Quality

An information ladder experiment highlights the necessity of per-request magnitude priors versus class-only or no-information controls.

Blind (no-information) policies—where requests lack any length prior—result in severe short-tail latency inflation (short-request P95 up to 5.8× higher), degraded deadline satisfaction, and poor useful goodput.
Class-only policies, which know routing class but not per-request magnitude, regain routing structure but fail to anticipate congestion within a class.
Coarse semi-clairvoyant priors afford meaningful control, situating the joint operating point for completion, tail latency, and goodput well above blind or class-only.
Oracle (exact) knowledge adds minimal marginal value over coarse priors for short tail and goodput, confirming the practical sufficiency of coarse predictability.

Figure 2: Impact of information fidelity on short-request P95, completion rate, and SLO-meeting goodput across regimes; coarse magnitude is necessary for practical control.
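One way to picture the four rungs of the information ladder is as the set of fields the scheduler is allowed to see for each request. The sketch below is purely illustrative: the bucket edges, the bucket names, the two-class split, and the `visible_prior` helper are assumptions for the example, not the paper's experimental harness.

```python
BUCKET_EDGES = (256, 1024, 4096)                 # hypothetical token thresholds
BUCKET_NAMES = ("short", "medium", "long", "xlong")


def visible_prior(true_tokens: int, level: str) -> dict:
    """Return what the scheduler may see about a request at a given ladder rung."""
    cls = "short" if true_tokens <= BUCKET_EDGES[0] else "heavy"
    bucket = BUCKET_NAMES[sum(true_tokens > e for e in BUCKET_EDGES)]
    if level == "blind":
        return {}                                 # no length information at all
    if level == "class":
        return {"cls": cls}                       # routing class only
    if level == "coarse":
        return {"cls": cls, "bucket": bucket}     # class plus coarse magnitude
    if level == "oracle":
        return {"cls": cls, "bucket": bucket, "tokens": true_tokens}
    raise ValueError(f"unknown information level: {level}")


if __name__ == "__main__":
    for level in ("blind", "class", "coarse", "oracle"):
        print(level, visible_prior(1500, level))
```

Under this framing, a class-only scheduler cannot tell a 300-token heavy request from a 5000-token one, which is the within-class blindness the experiment isolates.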

Policy Comparisons and Regime Sensitivity

The main policy evaluations, conducted on a calibrated congestion-aware mock provider, reveal:

In balanced regimes, the full stack (allocation, ordering, overload control) achieves 100% completion, 100% deadline satisfaction, and short P95 within tens of milliseconds of simpler quota-tiered isolation, but with higher useful goodput and explicit overload actions.
In heavy-dominated regimes, trade-offs become sharper: quota-tiered policies can minimize global tail at the cost of lower completion, while the full stack optimizes all joint metrics except for cases of heavy request starvation.

Figure 3: Short-request P95 vs. completion rate for core policy variants; structured semi-clairvoyant stacks outperform blind/naive dispatch, especially under stress.

Figure 4: Useful goodput vs. global P95 demonstrates distinct regime-dependent trade-offs for structured policies.
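The joint metrics compared in these figures can be computed from per-request outcomes roughly as follows. This is a sketch under stated assumptions: the `Outcome` record, the nearest-rank percentile, and the choice to measure deadline satisfaction over completed requests are illustrative conventions, and the paper may define these quantities differently. Useful goodput is taken here as SLO-meeting requests per second, matching the wording in the abstract.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    cls: str                      # e.g. "short" or "heavy"
    latency_s: Optional[float]    # None if the request was rejected or never completed
    deadline_s: float


def percentile(values, p):
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(outcomes, window_s):
    done = [o for o in outcomes if o.latency_s is not None]
    met = [o for o in done if o.latency_s <= o.deadline_s]
    short = [o.latency_s for o in done if o.cls == "short"]
    return {
        "completion_rate": len(done) / len(outcomes),
        # Assumed convention: measured over completed requests only.
        "deadline_satisfaction": len(met) / len(done) if done else 0.0,
        "useful_goodput_rps": len(met) / window_s,
        "short_p95_s": percentile(short, 95) if short else None,
    }


if __name__ == "__main__":
    sample = [
        Outcome("short", 0.2, 1.0),
        Outcome("short", 0.4, 1.0),
        Outcome("heavy", 12.0, 10.0),
        Outcome("heavy", None, 10.0),
    ]
    print(summarize(sample, window_s=2.0))
```

Reporting the four numbers together matters because a policy can trade one for another, for instance lowering tail latency simply by completing fewer requests.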

Alternative Allocation and Overload Control

Allocation Layer: Between Short-Priority (biased toward interactive traffic) and Fair Queuing (round-robin across classes), Fair Queuing provides the better balance. Relative to FIFO, it improves short-request P90 by 32% at a cost of only 17% long-request overhead, whereas Short-Priority improves short-request P90 by 27% at a cost of 116% long-request overhead. This exposes a tunable fairness-performance spectrum without rearchitecting the stack.

Overload Control: The cost-ladder bucket policy ensures that rejections concentrate on the most expensive (xlong) requests, protecting interactivity and SLO-compliance for short and medium jobs.
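A cost ladder of this kind can be sketched as a lookup from (severity, cost bucket) to an action. Every number and mapping below is a hypothetical placeholder: the token edges, the queue-depth thresholds that define severity, and the specific actions per rung are chosen only to show the shape of the policy, in which shedding starts at the most expensive bucket and moves down as pressure rises.

```python
BUCKET_EDGES = [(256, "short"), (1024, "medium"), (4096, "long")]  # hypothetical edges

# Hypothetical ladder: severity -> {bucket: action}; anything not listed is admitted.
LADDER = {
    0: {},
    1: {"xlong": "defer"},
    2: {"long": "defer", "xlong": "reject"},
    3: {"medium": "defer", "long": "reject", "xlong": "reject"},
}


def bucket(pred_tokens: float) -> str:
    """Map a coarse output-token prior to a cost bucket."""
    for edge, name in BUCKET_EDGES:
        if pred_tokens <= edge:
            return name
    return "xlong"


def severity(queue_depth: int, thresholds=(50, 100, 200)) -> int:
    """Count how many (assumed) queue-depth thresholds have been crossed."""
    return sum(queue_depth >= t for t in thresholds)


def decide(pred_tokens: float, queue_depth: int) -> str:
    """Return 'admit', 'defer', or 'reject' for a request at admission time."""
    return LADDER[severity(queue_depth)].get(bucket(pred_tokens), "admit")


if __name__ == "__main__":
    for tokens in (100, 800, 2000, 8000):
        print(tokens, [decide(tokens, depth) for depth in (10, 60, 120, 250)])
```

In this sketch short requests are never shed at any severity, and each decision is traceable to a named bucket and rung rather than to a timeout.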

Figure 5: Overload actions under the default bucket policy—sacrifice is focused on xlong requests.

Figure 6: Comparison across overload policies—cost ladder maximizes useful goodput and SLO satisfaction with explicit, interpretable control.

Robustness: Ablations, Layerwise Progression, and Sensitivity

Layerwise ablation (Figure 7) demonstrates that each layer contributes orthogonally to joint metrics, with final policies avoiding naive trade-offs that starve completion or inflate tails for short requests.

Figure 7: Progression from naive dispatch to quota, adaptive DRR, and full stack elucidates the layered impact on short P95, goodput, and completion.

Sensitivity sweeps confirm operational stability: perturbing the overload threshold by $\pm 20\%$ leads to bounded, smooth changes in completion and goodput. Predictor quality sweeps (up to 60% multiplicative noise) result in graceful, regime-dependent degradation, underscoring that coarse, not oracle, priors are sufficient for robust scheduling.
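A predictor-noise sweep of this kind can be approximated by perturbing true output lengths multiplicatively and checking how often the coarse bucket changes. The sketch below assumes a uniform error model, `true * (1 + u)` with `u` drawn from `[-err, err]`, and hypothetical bucket edges; the paper's actual noise distribution and thresholds are not specified in this summary.

```python
import random


def noisy_prior(true_tokens: float, err: float, rng: random.Random) -> float:
    """Apply multiplicative error: true * (1 + u), u ~ Uniform[-err, err] (assumed model)."""
    return true_tokens * (1.0 + rng.uniform(-err, err))


def bucket(tokens: float, edges=(256, 1024, 4096)) -> int:
    """Index of the coarse cost bucket (0 = smallest); edges are hypothetical."""
    return sum(tokens > e for e in edges)


def misbucket_rate(true_lengths, err: float, seed: int = 0) -> float:
    """Fraction of requests whose coarse bucket changes under the noisy prior."""
    rng = random.Random(seed)
    flips = sum(bucket(noisy_prior(t, err, rng)) != bucket(t) for t in true_lengths)
    return flips / len(true_lengths)


if __name__ == "__main__":
    lengths = [64, 200, 300, 900, 1500, 5000]
    for err in (0.0, 0.2, 0.4, 0.6):
        print(f"err={err:.1f} misbucket_rate={misbucket_rate(lengths, err):.2f}")
```

The intuition this illustrates is that only requests near a bucket edge can be misclassified, so decisions keyed on coarse buckets are affected by only part of the error in the underlying prediction.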

Figure 8: Predictor noise sweep—metrics drift smoothly with increasing error in length prediction, demonstrating resilience of the layered stack.

Practical and Theoretical Implications

This research formalizes the client-API boundary as a semi-clairvoyant scheduling problem and demonstrates that meaningful, actionable controls over SLOs and throughput can be achieved with only coarse, pre-dispatch length priors. The decomposition allows for modular tuning: allocation policies can be selected for fairness or prioritized class performance, overload rejection can be shaped transparently, and system-induced sacrifices are interpretable externally.

Practically, this advances the maturity of client etiquette for multi-tenant LLM APIs, bridging the gap between server-side scheduling (e.g., vLLM/PagedAttention, DistServe, Sarathi-Serve) and the levers available before the black-box boundary. Theoretically, the explicit mapping of regime-dependent Pareto frontiers for joint metrics grounds future work in deployable, data-driven evidence.

Limitations and Future Directions

The study relies on a mock provider with linear latency scaling, calibrated but not representative of proprietary vendor models. Thresholds are hand-tuned; full trace-driven evaluation and automated parameterization remain open. Heavy-dominated workloads are more sensitive to predictor error, exposing opportunities for adaptive or hybrid admission control.

Potential avenues for future exploration include:

Integration with production predictor pipelines and real-time traces,
Automated or learning-based adaptation for thresholds and heuristics,
Closer coupling with in-engine scheduling by exposing additional feedback signals,
Extension to non-token-dominated resource models.

Conclusion

By systematically structuring client-side scheduling into allocation, ordering, and overload control—and empirically validating their separability and robustness under semi-clairvoyant information—the paper provides both a practical toolset and a conceptual foundation for effective request shaping at the black-box LLM API boundary. This layered decomposition enables tunable, transparent policies that optimize latency, throughput, and fairness, extending the frontier for AI service operators in diverse production contexts.
