
TimeBill Framework for Deadline-Driven LLMs

Updated 15 January 2026
  • TimeBill is a time-budgeted inference framework that adapts KV cache eviction ratios to meet hard deadlines while optimizing response quality.
  • It integrates response length prediction, execution time estimation, and closed-form optimization to balance latency and fidelity in LLM outputs.
  • Experimental benchmarks demonstrate up to 15% higher average scores and robust deadline compliance in real-time, safety-critical applications.

TimeBill is a time-budgeted inference framework for LLMs, designed to guarantee hard deadline compliance while maximizing LLM response quality in time-critical applications. It introduces fine-grained runtime prediction and analytic modeling tailored to autoregressive LLMs, enabling per-inference adaptation of the key-value (KV) cache eviction ratio. This method overcomes the inefficiency of prior approaches using global or fixed eviction ratios, especially for tasks with diverse real-time constraints and variable prompt/response structures (Fan et al., 26 Dec 2025).

1. Problem Definition and Objectives

TimeBill addresses the challenge of deploying LLMs in scenarios with stringent deadlines (e.g., robotics, autonomous vehicles, industrial automation), where the inherent uncertainty in autoregressive decoding causes unpredictable execution times. The central objectives of the TimeBill framework are:

  • To guarantee that the predicted worst-case latency $\hat{t}_{\mathrm{WCET}}$ does not exceed a user-specified budget $T$.
  • To choose the minimal possible eviction ratio $\alpha$ for the KV cache, ensuring maximal fidelity of the generated response while respecting the deadline.

The core difficulty arises from the linear time complexity of autoregressive generation, coupled with response length variability and complex prompt effects on latency. Traditional fixed-ratio cache strategies cannot simultaneously optimize quality and deadline compliance (Fan et al., 26 Dec 2025).

2. Architectural Components and Workflow

TimeBill is structured into three tightly integrated components:

  • Response Length Predictor (RLP): Casts response length prediction as a multi-class classification task. Given a prompt $x$ of length $N_x$, it predicts a bucket index $\hat n$ (bucket size $B$), yielding the response length estimate $\hat N = \min(\hat n B, N_{\max})$.
  • Execution Time Estimator (ETE): Uses offline profiling to fit closed-form models for the prefill and decode phases. Prefill time is modeled as $t_{\mathrm{prefill}}(N_x) = a N_x^2 + b N_x + c$, and single-step decode time as $t_{\mathrm{decode\_step}}(N_{\mathrm{kv}}) = p N_{\mathrm{kv}} + q$. Total decode time sums over all output tokens, accounting for the dynamic KV cache length under the current eviction ratio.
  • Time-Budgeted Decoder: Solves for the minimal $\alpha$ such that the total predicted execution time (using an inflated response length for the worst case) plus the RLP overhead does not exceed $T$. The optimal $\alpha^*$ determines the fraction of the KV cache evicted after the prefill phase.

The workflow is highly parallelized: RLP and worst-case time estimation are run concurrently with the LLM prefill phase, utilizing available CPU/GPU resources.
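
The overlap between length prediction and prefill can be sketched in a few lines. The following is a hypothetical Python sketch of that scheduling, not the paper's implementation; every function here (the heuristic `rlp_predict_length`, the stub `llm_prefill`) is an illustrative stand-in:

```python
import concurrent.futures
import time

def rlp_predict_length(prompt_tokens, bucket_size=16, n_max=8192):
    # Placeholder heuristic standing in for the classifier-based RLP.
    bucket = max(1, len(prompt_tokens) // 64)
    return min(bucket * bucket_size, n_max)

def llm_prefill(prompt_tokens):
    # Placeholder for the GPU prefill phase.
    time.sleep(0.01)
    return "kv_cache_handle"

def timebill_prefill_stage(prompt_tokens):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        # RLP runs on the CPU while the prefill occupies the "GPU".
        rlp_future = pool.submit(rlp_predict_length, prompt_tokens)
        kv_cache = llm_prefill(prompt_tokens)   # main thread
        n_hat = rlp_future.result()             # overlapped with prefill
    return kv_cache, n_hat

kv, n_hat = timebill_prefill_stage(list(range(1024)))
print(n_hat)  # response length estimate, ready by the end of prefill
```

Because the RLP model is small (0.5B parameters, Section 4), its prediction typically finishes within the prefill window, so the eviction decision adds no extra latency on the critical path.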

3. Mathematical Models and Optimization

The essential mathematical formulations running through TimeBill include:

  • Response Length Prediction: $\hat n = \arg\max_j \hat p_j$, $\hat N = \min(\hat n B, N_{\max})$, where $f_\theta(x)$ is a transformer-based classifier producing the bucket probabilities $\hat{\mathbf{p}}$.
  • Prefill and Decode Time Estimation:
$$\begin{aligned} t_{\mathrm{prefill}}(N_x) &\approx a N_x^2 + b N_x + c \\ N_{\mathrm{kv}}^i &= (1-\alpha) N_x + (i-1) \\ t_{\mathrm{decode\_step}}(N_{\mathrm{kv}}^i) &\approx p N_{\mathrm{kv}}^i + q \\ \hat t_{\mathrm{decode}}(\alpha, \hat N) &= p(1-\alpha) N_x (\hat N - 1) + p\,\frac{(\hat N - 2)(\hat N - 1)}{2} + q(\hat N - 1) \end{aligned}$$
  • Worst-case Length Inflation: $\hat N_W = \min(k \hat N, N_{\max})$ for a pessimism factor $k \ge 1$.
  • Total Predicted Latency: $\hat t_{\mathrm{WCET}}(x, \alpha) = t_{\mathrm{prefill}}(N_x) + \hat t_{\mathrm{decode}}(\alpha, \hat N_W)$
  • Optimization for the Eviction Ratio:
$$\min_{0 \le \alpha \le \alpha_{\max}} \alpha \quad \text{s.t.} \quad t_{\mathrm{predict}}(x) + \hat t_{\mathrm{WCET}}(x, \alpha) \leq T$$
    This is solved in closed form:

$$\alpha^* = \min\left\{ \alpha_{\max},\ 1 - \frac{T - t_{\mathrm{predict}}(x) - t_{\mathrm{prefill}}(N_x) - p\,\frac{(\hat N_W - 2)(\hat N_W - 1)}{2} - q(\hat N_W - 1)}{p\, N_x\, (\hat N_W - 1)} \right\}$$
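
The closed-form solution is straightforward to check numerically. A minimal Python sketch follows; the coefficients $(a, b, c, p, q)$ and the scenario values are illustrative assumptions, not the paper's profiled numbers:

```python
def t_prefill(n_x, a, b, c):
    """Quadratic prefill-time model fitted from offline profiling."""
    return a * n_x**2 + b * n_x + c

def t_decode(alpha, n_hat, n_x, p, q):
    """Total decode time under eviction ratio alpha (sum of linear steps)."""
    return (p * (1 - alpha) * n_x * (n_hat - 1)
            + p * (n_hat - 2) * (n_hat - 1) / 2
            + q * (n_hat - 1))

def optimal_alpha(T, t_predict, n_x, n_hat_w, a, b, c, p, q, alpha_max=0.95):
    """Closed-form minimal eviction ratio meeting deadline T."""
    slack = (T - t_predict - t_prefill(n_x, a, b, c)
             - p * (n_hat_w - 2) * (n_hat_w - 1) / 2
             - q * (n_hat_w - 1))
    alpha = 1 - slack / (p * n_x * (n_hat_w - 1))
    return min(alpha_max, max(0.0, alpha))  # clamp to the feasible range

# Illustrative coefficients (NOT the paper's measured values):
a, b, c = 2e-9, 5e-6, 0.02   # prefill model (seconds)
p, q = 1e-6, 1e-3            # per-step decode model (seconds)
n_x, n_hat_w, T, t_pred = 8192, 1280, 5.0, 0.05

alpha_star = optimal_alpha(T, t_pred, n_x, n_hat_w, a, b, c, p, q)
total = t_pred + t_prefill(n_x, a, b, c) + t_decode(alpha_star, n_hat_w, n_x, p, q)
print(round(alpha_star, 3), total <= T + 1e-6)
```

At the unclamped optimum, the deadline constraint is tight: the predicted total latency equals $T$, which is why the minimal $\alpha$ can be read off in closed form rather than searched for.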

4. Implementation and Deployment Aspects

  • Model Choices: The framework targets LLMs such as Qwen2.5-7B-Instruct (32,768-token context, 8,192-token max generation) and an RLP model based on Qwen2.5-0.5B-Instruct with 512 buckets ($B = 16$).
  • Profiling for ETE: Empirical measurements are made for prompt lengths $N_x \in \{0, 1024, \ldots, 32768\}$ for prefill and for varying KV-cache sizes, fitting $(a, b, c)$ for prefill and $(p, q)$ for decode steps. The mean absolute percentage errors are 1.22% (prefill) and 1.69% (decode step), indicating a close fit.
  • Resource Utilization: TimeBill is implemented with PyTorch and custom CUDA kernels for efficient KV cache eviction. Hardware includes Intel Xeon Platinum 8350C CPUs and NVIDIA A40 GPUs.
  • Prompt Compression: If $t_{\mathrm{predict}}$ (the RLP overhead) would exceed the prefill computation window, any prompt compression method can be used to produce $x_p$ such that $\hat t_{\mathrm{predict}}(x_p) \leq t_{\mathrm{prefill}}(x)$, ensuring the RLP does not delay inference.
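
The profiling fits behind the ETE amount to ordinary polynomial regression. A sketch with NumPy on synthetic timings (generated from known ground-truth coefficients so the fit can be verified; none of these numbers come from the paper):

```python
import numpy as np

# Synthetic "measured" prefill and decode-step times, generated from
# assumed ground-truth coefficients rather than real hardware profiling.
true_a, true_b, true_c = 2e-9, 5e-6, 0.02
prompt_lens = np.arange(1024, 32769, 1024, dtype=float)
prefill_times = true_a * prompt_lens**2 + true_b * prompt_lens + true_c

# Fit the quadratic prefill model t_prefill(N_x) = a*N_x^2 + b*N_x + c.
a, b, c = np.polyfit(prompt_lens, prefill_times, deg=2)

true_p, true_q = 1e-6, 1e-3
kv_lens = np.arange(1024, 32769, 1024, dtype=float)
decode_times = true_p * kv_lens + true_q

# Fit the linear per-step decode model t_step(N_kv) = p*N_kv + q.
p, q = np.polyfit(kv_lens, decode_times, deg=1)

print(np.isclose(a, true_a, rtol=1e-3, atol=0),
      np.isclose(p, true_p, rtol=1e-6, atol=0))
```

On real hardware the measured times are noisy, and the reported 1.22%/1.69% mean absolute percentage errors quantify how well such fits track actual runtimes.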

5. Experimental Results and Benchmarks

TimeBill was evaluated on LongBench (bilingual, multi-task long context) using the following metrics:

  • Quality Metrics: F1, ROUGE-L, Levenshtein distance, aggregated as “average score.”
  • Timing Strategies and Overrun Policies:
    • Kill: any job overrun is dropped (score = 0).
    • Skip-Next: if an overrun is imminent, subsequent prompts are skipped until completion.
  • Completion Rate: The fraction of tasks finishing before the deadline.
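
One plausible reading of these overrun policies can be expressed as a small scoring simulation. The sketch below is an interpretation, not the paper's evaluation harness; the job stream and the debt-absorption model for Skip-Next are invented for illustration:

```python
def evaluate(jobs, budget, policy="kill"):
    """Score a stream of (runtime, quality) jobs under an overrun policy.

    kill:      an overrunning job is dropped and scores 0.
    skip-next: an overrunning job finishes and scores, but following
               slots are skipped (score 0) until the overrun is absorbed.
    """
    scores, completed, debt = [], 0, 0.0
    for runtime, quality in jobs:
        if policy == "skip-next" and debt > 0:
            debt -= budget              # this slot absorbs earlier overrun
            scores.append(0.0)
            continue
        if runtime <= budget:
            scores.append(quality)
            completed += 1
        elif policy == "kill":
            scores.append(0.0)          # overrun job dropped entirely
        else:                            # skip-next: job runs to completion
            scores.append(quality)
            completed += 1
            debt += runtime - budget
    return sum(scores) / len(scores), completed / len(jobs)

# Invented (runtime_seconds, quality_score) pairs:
jobs = [(4.0, 0.8), (6.5, 0.9), (3.0, 0.7), (5.5, 0.6), (4.5, 0.85)]
avg_kill, rate_kill = evaluate(jobs, budget=5.0, policy="kill")
print(round(avg_kill, 3), rate_kill)
```

Under Kill, every overrun costs its full quality score, which is why tightening the predicted worst-case latency below the budget matters more than average-case speed.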

Baselines include:

  • Vanilla LLM (no cache eviction),
  • Fixed-$\alpha$ SnapKV (25%, 50%, 75%, 95%),
  • AWQ 4-bit weight quantization (Fan et al., 26 Dec 2025).

Key Findings:

  • RLP achieves MAE ≈ 42.7 tokens and RMSE ≈ 78.1, with $R^2 = 0.723$, outperforming 5- and 10-class BERT models on this task.
  • End-to-end predicted latency closely tracks actual runtime, with $\hat t_{\mathrm{WCET}}$ always upper-bounding the true runtime.
  • Under time budgets $T = 5$–$10$ s, TimeBill achieves up to 15% higher average score than the vanilla LLM and matches the completion rate of fixed $\alpha = 95\%$ SnapKV.
  • Performance peaks at $k = 5$ for length inflation, consistent with the "5× pessimism" rule common in hard real-time systems.

6. Significance and Impact

TimeBill establishes a systematic approach for meeting hard deadlines with LLMs, leveraging runtime modeling and analytic optimization to balance latency and answer quality. By integrating a fine-grained, LLM-tailored response length predictor, closed-form execution time models based on empirical hardware profiling, and an effective cache management scheme, TimeBill demonstrates robust empirical improvements in deadline completion rates and output fidelity. Its framework generalizes to any scenario with stringent real-time LLM requirements and has direct applicability to industrial, robotic, and safety-critical deployments (Fan et al., 26 Dec 2025).
