Predictive Scheduling Under Output Length Uncertainty

Updated 18 November 2025
  • Predictive scheduling under output length uncertainty is a framework that uses estimated task durations to optimize resource allocation and reduce latency.
  • Methodologies integrate Bayesian modeling, conformal prediction, and size-based policies, demonstrating up to 2× reductions in mean completion times in empirical workloads.
  • Real-world applications in LLM inference, cloud platforms, and URLLC traffic highlight the need for robust prediction calibration and adaptive scheduling mechanisms.

Predictive scheduling under output length uncertainty addresses the challenge of efficient resource allocation, job ordering, and latency minimization when job durations or output sizes are not known a priori but can be estimated or inferred at runtime. This problem is fundamental in large-scale AI/ML systems, LLM serving, cloud infrastructures, and ultra-reliable low-latency communication (URLLC), where arrival and output processes are only partially predictable. Recent research formalizes, analyzes, and optimizes scheduling policies that leverage predictions—possibly noisy or interval-valued—of output lengths, and quantifies the effect of such uncertainty on system performance. Methods span Bayesian modeling, conformal prediction, mechanism design, and queue-theoretic analysis, with direct system prototypes demonstrating substantial gains in empirical workloads.

1. Formal Models of Output-Length Uncertainty in Scheduling

Modern predictive scheduling literature models jobs as possessing true but unknown processing or output requirements, which must be estimated or inferred for scheduling. A typical formal model involves:

  • For each job $i$, the true output length or service time $X_i$ is unobserved at arrival.
  • Predictions $\hat{X}_i$ are produced either by a trained regressor/classifier (Shahout et al., 2024), derived empirically as prediction intervals $[\ell_i, u_i]$ (Chen et al., 20 Aug 2025), or drawn from a learned joint distribution $g(x, y)$ (Mitzenmacher, 2019).
  • Compound workloads or pipelines can be represented as uncertain DAGs, with some stages governed by random, prediction-driven durations (e.g., token-by-token LLM generation) (Zhu et al., 4 Apr 2025).
  • In cloud and networking settings, job submissions may report distributions $P_j(a, d)$ over arrival and duration, with strategic reporting addressed by incentive-aligned scheduling (Babaioff et al., 2022).

This model abstracts across application domains (LLM inference, cloud platforms, network packet scheduling) while retaining the key property that output-length uncertainty is a first-order constraint on scheduling performance and implementability. The sketch below makes these job abstractions concrete.
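
The following minimal sketch renders the abstractions above as code; the field names are illustrative, not drawn from any of the cited papers:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Job:
    """One job under output-length uncertainty (illustrative field names)."""
    job_id: int
    true_size: float                                 # X_i: hidden at arrival
    predicted_size: Optional[float] = None           # point estimate of X_i
    interval: Optional[Tuple[float, float]] = None   # [l_i, u_i] prediction interval

    def scheduler_view(self) -> dict:
        """What a scheduler may legally observe: predictions, never true_size."""
        return {"id": self.job_id,
                "predicted_size": self.predicted_size,
                "interval": self.interval}
```

A scheduling policy is then a rule that orders or batches jobs using only `scheduler_view()`, with the true size revealed only as the job executes.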

2. Predictive Scheduling Methodologies

Size-Based Policies and Prediction Integration

Classic size-based policies—Shortest Job First (SJF), Shortest Remaining Processing Time (SRPT)—are optimal for mean latency when job sizes are known. In predictive settings, these policies are generalized as follows (a minimal sketch appears after the list):

  • SPJF (Shortest Predicted Job First): Orders jobs by $\hat{X}_i$, equivalent to SJF if predictions are perfect (Mitzenmacher, 2019).
  • SPRPT (Shortest Predicted Remaining Processing Time): Preempts long (predicted) jobs for shorter predicted arrivals, but care is required: predictions may be unreliable or change dynamically as more output is revealed (Shahout et al., 2024).
  • Limited Preemption (for LLMs): Preemption is restricted when memory overhead or partial KV-caches render preemption costly; prediction error is folded into policy cutoff thresholds (Shahout et al., 2024).
  • Interval-based Algorithms: Conservative ($\mathcal{A}_{\max}$) and adaptive ($\mathcal{A}_{\min}$) methods use prediction intervals $[\ell_i, u_i]$, packing jobs under the worst-case bound $u_i$ or under the lower bound $\ell_i$ with adaptive back-off, managing the robustness vs. efficiency trade-off (Chen et al., 20 Aug 2025).
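
As referenced above, a minimal sketch of the SPJF ordering and a limited-preemption cutoff rule; the error margin and preemption cost are illustrative assumptions, not values from the cited papers:

```python
def spjf_order(predicted_sizes):
    """SPJF: serve waiting jobs in increasing predicted size.
    `predicted_sizes` maps job_id -> prediction; returns a service order."""
    return [job for _, job in sorted((p, j) for j, p in predicted_sizes.items())]

def should_preempt(pred_remaining_running, pred_size_new,
                   error_margin=0.2, preemption_cost=1.0):
    """Limited-preemption rule (illustrative): preempt the running job only
    when the newcomer's predicted size beats the predicted remaining work by
    a cushion covering prediction error plus the cost of saving and restoring
    state (e.g., a partial KV-cache)."""
    cutoff = pred_remaining_running * (1.0 - error_margin) - preemption_cost
    return pred_size_new < cutoff
```

With perfect predictions and zero overheads (`error_margin=0`, `preemption_cost=0`), the rule degenerates to classic SRPT.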

Stochastic and Learning-Augmented Techniques

  • Bayesian Networks and Entropy-Driven Scheduling: For compound or DAG jobs, a Bayesian network captures empirical correlations in stage durations; posteriors are updated online as stages complete, and mutual information/entropy is used to prioritize high-uncertainty-reduction stages (Zhu et al., 4 Apr 2025).
  • Online Conformal Prediction: Dynamic resource allocation under adversarial output uncertainty is achieved by recalibrating prediction intervals online, guaranteeing specified drop/error rates regardless of predictor calibration (Cohen et al., 2023).
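
A quantile-tracking sketch of such online recalibration, in the spirit of adaptive conformal methods; the learning rate and the nonconformity-score framing are assumptions, not the paper's exact construction:

```python
def calibrate_online(scores, alpha=0.1, lr=0.05):
    """Online interval calibration over a stream of nonconformity scores.
    theta is the working threshold: it widens after each miss and shrinks
    after each cover, so the long-run miss rate tracks alpha on any
    individual sequence, with no i.i.d. or calibration assumption."""
    theta, misses = 1.0, 0.0
    for score in scores:
        miss = 1.0 if score > theta else 0.0  # realization fell outside interval
        misses += miss
        theta += lr * (miss - alpha)          # widen on miss, shrink on cover
    return theta, misses / len(scores)
```

In the scheduling context, `theta` sizes the per-slot resource reservation; the guarantee is on the realized drop rate, not on the predictor being well calibrated.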

Mechanism-Design Approaches

  • Incentive-Aligned Scheduling: In cloud platforms where statement-of-work (SoW) submissions report job duration distributions, incentive-compatible mechanisms post prices tied to predictions, achieving near-optimal social welfare and robust scheduling guarantees (Babaioff et al., 2022).
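
A deliberately simplified sketch of the posted-price idea; the actual mechanism in (Babaioff et al., 2022) is more involved, but the key structural point survives: the price is computed from the platform's own predictor rather than the user's report, so a take-it-or-leave-it offer leaves no incentive to misreport:

```python
def posted_price(pred_mean_duration, current_load, base_rate=1.0):
    """Hypothetical posted-price rule: price scales with the platform's
    predicted job duration and current system load. All parameters here
    are illustrative assumptions."""
    return base_rate * pred_mean_duration * (1.0 + current_load)

def admit(reported_value, pred_mean_duration, current_load):
    """Accept a job iff its reported value clears the posted price."""
    return reported_value >= posted_price(pred_mean_duration, current_load)
```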

3. Performance Metrics, Analysis, and Guarantees

The central technical contributions in this area are quantitative performance analyses, competitive ratio theorems, and "price of misprediction" bounds.

Price of Misprediction

Defined as the ratio of performance (e.g., mean waiting or sojourn time) under predicted-size scheduling vs. ideal scheduling with true job sizes (Mitzenmacher, 2019):

$$\mathrm{PoM} = \frac{M_A(\text{predicted info})}{M_A(\text{true info})} \ge 1$$

For Poisson-exponential models, even noisy predictors yield significant improvement over naive scheduling; for example, SPJF with exponential prediction inflates latency by only a factor of $4/3$ over SJF, while FIFO can be orders-of-magnitude worse (Mitzenmacher, 2019).
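
The price of misprediction can be probed with a small discrete-event simulation. Everything below—the load, the misprediction model (predictions exponentially distributed around the true size), and all constants—is an illustrative assumption rather than the paper's exact setup:

```python
import heapq, random

def mean_sojourn(policy, lam=0.7, mu=1.0, n_jobs=100_000, seed=0):
    """Non-preemptive single-server queue: Poisson(lam) arrivals, Exp(mu)
    true sizes. policy in {"fifo", "sjf", "spjf"}; SPJF sorts by a noisy
    prediction X_hat ~ Exp(mean = true size), one simple instance of the
    joint-density model g(x, y)."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    for i in range(n_jobs):
        t += rng.expovariate(lam)
        size = rng.expovariate(mu)
        if policy == "fifo":
            key = float(i)                     # arrival order
        elif policy == "sjf":
            key = size                         # oracle: true size
        else:
            key = rng.expovariate(1.0 / size)  # noisy prediction, mean = size
        arrivals.append((t, key, size))
    clock, i, waiting, total = 0.0, 0, [], 0.0
    while i < len(arrivals) or waiting:
        if not waiting:                        # server idle: jump to next arrival
            clock = max(clock, arrivals[i][0])
        while i < len(arrivals) and arrivals[i][0] <= clock:
            arr, key, size = arrivals[i]
            heapq.heappush(waiting, (key, arr, size))
            i += 1
        key, arr, size = heapq.heappop(waiting)
        clock += size                          # run the chosen job to completion
        total += clock - arr                   # accumulate sojourn time
    return total / n_jobs

# Empirical price of misprediction for SPJF:
pom = mean_sojourn("spjf") / mean_sojourn("sjf")
```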

Robustness to Prediction Error

  • Interval-based Scheduling: Conservative $\mathcal{A}_{\max}$ is $\Theta(\alpha^{-1})$-competitive, where $\alpha = \ell / u$; adaptive $\mathcal{A}_{\min}$ achieves an $O(\log(1/\alpha))$ competitive ratio (and is optimal for large systems), remaining robust even under adversarially wide prediction intervals (Chen et al., 20 Aug 2025).
  • Entropy-Driven LLM Schedulers: LLMSched achieves 14–79% lower mean job completion time on realistic workloads, with entropy-based exploration tightening downstream estimates and reducing worst-case latency in complex DAG jobs (Zhu et al., 4 Apr 2025).
  • Individual-Sequence Reliabilities: Online conformal approaches guarantee drop rates within $O(1/F)$ of the target for all realization sequences, regardless of predictor calibration (Cohen et al., 2023).

Resource Constraints and Preemption Costs

In GPU-based LLM serving, memory constraints dominate batch and preemptive policy design; policies that over-provision (using upper prediction bounds) waste valuable concurrency, while under-provisioning risks evictions and recomputation. Theoretical models analyze these trade-offs explicitly (Chen et al., 20 Aug 2025, Shahout et al., 2024).
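
A minimal admission sketch contrasting the two provisioning extremes described above; the memory units and interval format are illustrative:

```python
def admit_batch(queue, mem_budget, mode="amin"):
    """Admit a prefix of `queue` (a list of (req_id, (lo, hi)) prediction
    intervals on output tokens) into an LLM batch under a KV-cache budget.
    mode "amax": reserve the upper bound hi per request; never evicts, but
    underfills the batch when intervals are wide.
    mode "amin": reserve the lower bound lo; fills aggressively, and the
    runtime must back off (evict/recompute) if true lengths exceed lo."""
    used, admitted = 0.0, []
    for req_id, (lo, hi) in queue:
        need = hi if mode == "amax" else lo
        if used + need > mem_budget:
            break
        used += need
        admitted.append(req_id)
    return admitted
```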

4. System Implementations and Empirical Evaluation

Recent system-level work implements and validates predictive schedulers with practical predictors:

  • Embedding-based LLM Length Prediction: Extracts token-wise hidden states from LLM decoders and feeds them to bin classifiers that predict the remaining output length (see the sketch after this list). Combined with limited-preemption SPRPT-style scheduling, this approach reduces mean completion latency by 1.66–2.01× and mean time-to-first-token (TTFT) by up to 24.07× compared to FCFS/vLLM baselines at high load (Shahout et al., 2024).
  • LLMSched DAG-style Schedulers: Bayesian-entropy approaches enable practical prioritization in compound LLM pipelines, matching theoretical gains with simulation and testbed evidence (Zhu et al., 4 Apr 2025).
  • Truthful Cloud Resource Schedulers: Demonstrate that, with black-box predictors, truthful reporting mechanisms combined with posted-price menus achieve $O(\log H)$ or $O(1)$ competitive welfare, even under adversarial job submissions (Babaioff et al., 2022).
  • Interval-Based Algorithms in LLM Serving: Simulation on LMSYS-Chat-1M test cases demonstrates that adaptive policies exploiting only lower bounds rival hindsight-optimal scheduling, even under highly noisy prediction intervals (Chen et al., 20 Aug 2025).
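
As referenced in the first item above, a sketch in the spirit of the embedding-based length predictor; the hidden size, bin count, and head architecture are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class LengthBinHead(nn.Module):
    """Map a decoder hidden state to a distribution over buckets of
    *remaining* output length; a scheduler can re-rank running requests
    at each decode step using the expected bucket (SPRPT-style)."""
    def __init__(self, hidden_dim=4096, n_bins=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_bins),
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_dim) -> bin probabilities: (batch, n_bins)
        return self.head(hidden_state).softmax(dim=-1)
```

Because the head reuses hidden states the decoder already computes, refreshing the prediction every step costs little relative to decoding itself.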

5. Advanced Topics: DAGs, Information Gain, and Uncertain Topologies

Compound or hierarchical job topologies introduce an additional layer of uncertainty, affecting both stage durations (through branching chains) and the job structure itself.

  • DAG Modeling: Nodes partitioned into regular (deterministic), LLM (random-duration), and dynamic (structure-uncertain) stages (Zhu et al., 4 Apr 2025).
  • Information-Gain Scheduling: Stages prioritized not only by current remaining workload, but also by estimated ability to reduce global job-topology or downstream latency uncertainty, as quantified by mutual information and conditional entropy.
  • Structure and Duration Updates via BNs: Every realized output updates the posterior distributions over downstream durations, enabling real-time re-prioritization and improved estimates in subsequent scheduling rounds (a toy information-gain sketch follows this list).
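
A toy illustration of the information-gain rule using discrete joint distributions (the numbers are made up purely for illustration): a ready stage whose outcome is strongly coupled to the downstream duration is scheduled ahead of an equally sized stage whose outcome is uninformative.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(S; D) for a discrete joint {(s, d): p} over a stage outcome S and
    a downstream duration D. The entropy-driven rule runs the ready stage
    maximizing I(S; D): the one whose outcome most reduces uncertainty
    about the rest of the job."""
    ps, pd = defaultdict(float), defaultdict(float)
    for (s, d), p in joint.items():
        ps[s] += p
        pd[d] += p
    return sum(p * math.log2(p / (ps[s] * pd[d]))
               for (s, d), p in joint.items() if p > 0)

# Stage A's outcome nearly determines the downstream duration; stage B's
# outcome is independent of it, so A is explored first.
joint_a = {(0, 0): 0.45, (1, 1): 0.45, (0, 1): 0.05, (1, 0): 0.05}
joint_b = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
assert mutual_information(joint_a) > mutual_information(joint_b)
```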

6. Application Domains

Predictive scheduling under output length uncertainty finds application in:

  • LLM Inference Serving: Token-wise autoregressive decoding requires per-request length prediction and memory management; batch size and concurrency are throttled by anticipated length uncertainty (Chen et al., 20 Aug 2025, Shahout et al., 2024).
  • Compound LLM Applications: Workflow orchestration with external API and LLM calls, where chain structure and stage durations are a priori uncertain (Zhu et al., 4 Apr 2025).
  • Cloud Platforms/ML Pipelines: Statistical regularity in job arrivals and durations enables predictive truthful scheduling for social welfare maximization (Babaioff et al., 2022).
  • URLLC Traffic in 5G/6G: Slot allocation with strict reliability/latency trade-offs under unpredictable packet-generation processes, managed by online conformal prediction (Cohen et al., 2023).

7. Limitations and Open Challenges

  • Prediction Calibration: System performance is sensitive to the quality of prediction; while adaptive and robust algorithms mitigate worst-case impacts, substantial gain is attainable only if prediction error is not adversarially coupled to output size (Chen et al., 20 Aug 2025, Mitzenmacher, 2019).
  • Preemption/Resume Overhead: Especially acute in systems with large model state (e.g., LLMs), where resumption after preemption entails costly recomputation of decoder states (Shahout et al., 2024).
  • Dynamic Topology Handling: For DAGs or chain-of-thought systems, uncertainty in structure itself (branching, loop iteration count) is challenging; Bayesian and entropy-based scheduling help, yet the identification of optimal information-yielding exploration stages remains system- and workload-specific (Zhu et al., 4 Apr 2025).
  • Incentive Alignment: In cloud and shared platforms, truthfulness in duration reporting can only be guaranteed under carefully designed payment and reporting mechanisms (Babaioff et al., 2022).

Key references:

  • LLMSched: "LLMSched: Uncertainty-Aware Workload Scheduling for Compound LLM Applications" (Zhu et al., 4 Apr 2025)
  • Embedding-based LLM scheduling: "Don't Stop Me Now: Embedding Based Scheduling for LLMs" (Shahout et al., 2024)
  • Robust LLM inference: "Adaptively Robust LLM Inference Optimization under Prediction Uncertainty" (Chen et al., 20 Aug 2025)
  • Conformal reliability scheduling: "Guaranteed Dynamic Scheduling of Ultra-Reliable Low-Latency Traffic via Conformal Prediction" (Cohen et al., 2023)
  • Incentive mechanisms for cloud workloads: "Truthful Online Scheduling of Cloud Workloads under Uncertainty" (Babaioff et al., 2022)
  • Price-of-misprediction analysis: "Scheduling with Predictions and the Price of Misprediction" (Mitzenmacher, 2019)
