TimeBill Framework for Deadline-Driven LLMs
- TimeBill is a time-budgeted inference framework that adapts KV cache eviction ratios to meet hard deadlines while optimizing response quality.
- It integrates response length prediction, execution time estimation, and closed-form optimization to balance latency and fidelity in LLM outputs.
- Experimental benchmarks demonstrate up to 15% higher average scores and robust deadline compliance in real-time, safety-critical applications.
TimeBill is a time-budgeted inference framework for LLMs, designed to guarantee hard deadline compliance while maximizing LLM response quality in time-critical applications. It introduces fine-grained runtime prediction and analytic modeling tailored to autoregressive LLMs, enabling per-inference adaptation of the key-value (KV) cache eviction ratio. This method overcomes the inefficiency of prior approaches using global or fixed eviction ratios, especially for tasks with diverse real-time constraints and variable prompt/response structures (Fan et al., 26 Dec 2025).
1. Problem Definition and Objectives
TimeBill addresses the challenge of deploying LLMs in scenarios with stringent deadlines (e.g., robotics, autonomous vehicles, industrial automation), where the inherent uncertainty of autoregressive decoding causes unpredictable execution times. The central objectives of the TimeBill framework are:
- To guarantee that the predicted worst-case latency does not exceed a user-specified time budget.
- To choose the minimal possible eviction ratio for the KV cache, ensuring maximal fidelity of the generated response while respecting the deadline.
The core difficulty arises from the linear time complexity of autoregressive generation, coupled with response length variability and complex prompt effects on latency. Traditional fixed-ratio cache strategies cannot simultaneously optimize quality and deadline compliance (Fan et al., 26 Dec 2025).
2. Architectural Components and Workflow
TimeBill is structured into three tightly integrated components:
- Response Length Predictor (RLP): Casts response length prediction as a multi-class classification task. Given a prompt of length $n$, it predicts one of a set of length buckets of size $B$, outputting a response length estimate $\hat{m}$.
- Execution Time Estimator (ETE): Uses offline profiling to fit closed-form models for the prefill and decode phases. Prefill time is modeled as a function of prompt length, and single-step decode time as a function of the current KV-cache length. Total decode time sums the per-step model over all output tokens, incorporating the dynamic KV-cache length under the current eviction ratio.
- Time-Budgeted Decoder: Solves for the minimal eviction ratio $\rho$ such that the total predicted execution time (computed with the inflated, worst-case response length) plus the RLP overhead does not exceed the time budget $T_b$. The optimal $\rho^{*}$ is used to evict the corresponding fraction of the KV cache after the prefill phase.
The workflow is highly parallelized: RLP and worst-case time estimation are run concurrently with the LLM prefill phase, utilizing available CPU/GPU resources.
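The concurrency pattern can be sketched as follows. This is a toy illustration with stand-in components, not the paper's implementation: `predict_length`, `run_prefill`, and `choose_eviction_ratio` are placeholder names, and the timing constants are invented.

```python
from concurrent.futures import ThreadPoolExecutor

def predict_length(prompt_tokens):
    # Placeholder RLP: one length bucket (size 16) per 4 prompt tokens.
    return 16 * (len(prompt_tokens) // 4 + 1)

def run_prefill(prompt_tokens):
    # Placeholder prefill returning a mock KV cache, one entry per token.
    return list(prompt_tokens)

def choose_eviction_ratio(budget_s, est_len, per_token_s=0.01, max_ratio=0.95):
    # Placeholder for the closed-form solver described in Section 3:
    # evict only as much as needed to fit the decode phase in the budget.
    needed_s = est_len * per_token_s
    if needed_s <= budget_s:
        return 0.0
    return min(max_ratio, 1.0 - budget_s / needed_s)

def timebill_prefill_phase(prompt_tokens, budget_s):
    # RLP runs concurrently with the LLM prefill, hiding its overhead.
    with ThreadPoolExecutor(max_workers=1) as pool:
        length_future = pool.submit(predict_length, prompt_tokens)
        kv_cache = run_prefill(prompt_tokens)   # main thread: prefill
        est_len = length_future.result()        # join before decoding
    rho = choose_eviction_ratio(budget_s, est_len)
    keep = int(len(kv_cache) * (1.0 - rho))     # evict a rho fraction
    return kv_cache[len(kv_cache) - keep:], rho
```

The eviction here simply drops the oldest cache entries; real eviction policies such as SnapKV select entries by attention importance instead.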
3. Mathematical Models and Optimization
The essential mathematical formulations running through TimeBill can be stated as follows, writing $n$ for the prompt length, $\hat{m}$ for the predicted response length, $\rho$ for the KV-cache eviction ratio, and $T_b$ for the time budget:
- Response Length Prediction: $\hat{m} = B \cdot \arg\max_k p_k$, where a transformer-based classifier produces probabilities $p_1, \dots, p_K$ over $K$ length buckets of size $B$.
- Prefill and Decode Time Estimation: $T_{\text{pre}}(n) = a_2 n^2 + a_1 n + a_0$ for the prefill phase, and $t_{\text{dec}}(\ell) = d\,\ell + e$ for a single decode step with current KV-cache length $\ell$.
- Worst-case Length Inflation: $\hat{m}' = \gamma\,\hat{m}$ for pessimism factor $\gamma \geq 1$.
- Total Predicted Latency: $\hat{T}(\rho) = T_{\text{pre}}(n) + \sum_{i=1}^{\hat{m}'} t_{\text{dec}}\big((1-\rho)\,n + i\big)$.
- Optimization for Eviction Ratio: $\min_{\rho} \rho$ subject to $\hat{T}(\rho) \leq T_b$. Since $\hat{T}(\rho)$ is affine in $\rho$, this is solved in closed form:
$$\rho^{*} = \min\!\left(\max\!\left(1 - \frac{T_b - T_{\text{pre}}(n) - e\,\hat{m}' - d\,\frac{\hat{m}'(\hat{m}'+1)}{2}}{d\,n\,\hat{m}'},\ 0\right),\ \rho_{\max}\right)$$
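Because the predicted latency is affine in the eviction ratio, the minimal feasible ratio has a one-line solution. The sketch below is our own rendering under assumed model shapes, quadratic prefill time in prompt length and per-step decode time linear in KV-cache length; all coefficient names (`a2, a1, a0, d, e`) are illustrative, not the paper's notation.

```python
def optimal_eviction_ratio(n, m_hat, gamma, budget,
                           a2, a1, a0, d, e, rho_max=0.95):
    """Minimal KV-cache eviction ratio meeting the time budget (sketch).

    Assumes prefill time a2*n^2 + a1*n + a0 and per-step decode time
    d*cache_len + e, with cache length (1 - rho)*n + i at decode step i.
    """
    m = gamma * m_hat                        # worst-case inflated length
    t_pre = a2 * n * n + a1 * n + a0         # predicted prefill time
    # Total decode time: sum_{i=1..m} [d*((1 - rho)*n + i) + e]
    #                  = d*(1 - rho)*n*m + d*m*(m + 1)/2 + e*m,
    # which is affine in rho, so the budget constraint solves directly.
    slack = budget - t_pre - d * m * (m + 1) / 2 - e * m
    rho = 1.0 - slack / (d * n * m)
    return min(max(rho, 0.0), rho_max)       # clip to the feasible range
```

With a generous budget the solver returns 0 (no eviction); as the budget tightens, the ratio rises until it saturates at `rho_max`.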
4. Implementation and Deployment Aspects
- Model Choices: The framework targets LLMs such as Qwen2.5-7B-Instruct (32,768-token context, 8,192-token maximum generation) and an RLP model based on Qwen2.5-0.5B-Instruct with 512 buckets (bucket size of 16 tokens).
- Profiling for ETE: Empirical measurements are taken across a range of prompt lengths for the prefill phase and across varying KV-cache sizes for decode steps, fitting the prefill-time and per-step decode-time models. The mean absolute percentage errors are 1.22% (prefill) and 1.69% (decode step), indicating a close fit.
- Resource Utilization: TimeBill is implemented with PyTorch and custom CUDA kernels for efficient KV cache eviction. Hardware includes Intel Xeon Platinum 8350C CPUs and NVIDIA A40 GPUs.
- Prompt Compression: If the RLP overhead would exceed the prefill computation window, any prompt compression method can be used to shorten the RLP input so that length prediction completes within the prefill window, ensuring RLP does not delay inference.
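The ETE profiling step can be illustrated with an ordinary least-squares fit, assuming a quadratic prefill model in prompt length and a linear decode-step model in KV-cache length. This is a sketch, not the paper's code; the measurement arrays below are synthetic stand-ins for real hardware timings.

```python
import numpy as np

# Synthetic profiling data standing in for real hardware measurements.
prompt_lens = np.array([512, 1024, 2048, 4096, 8192], dtype=float)
prefill_times = 1e-8 * prompt_lens**2 + 2e-5 * prompt_lens + 0.01   # seconds

cache_lens = np.array([1024, 2048, 4096, 8192], dtype=float)
decode_step_times = 3e-6 * cache_lens + 0.004   # seconds per generated token

# Quadratic fit for prefill, linear fit for a single decode step.
a2, a1, a0 = np.polyfit(prompt_lens, prefill_times, deg=2)
d, e = np.polyfit(cache_lens, decode_step_times, deg=1)

def mape(y_true, y_pred):
    # Mean absolute percentage error, the fit metric reported by the paper.
    return 100.0 * np.mean(np.abs((y_pred - y_true) / y_true))

prefill_mape = mape(prefill_times, np.polyval([a2, a1, a0], prompt_lens))
decode_mape = mape(decode_step_times, np.polyval([d, e], cache_lens))
```

On real profiles the residual MAPE reflects measurement noise; here the data is exactly polynomial, so the fit is essentially perfect.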
5. Experimental Results and Benchmarks
TimeBill was evaluated on LongBench (a bilingual, multi-task long-context benchmark) using the following metrics and evaluation protocols:
- Quality Metrics: F1, ROUGE-L, Levenshtein distance, aggregated as “average score.”
- Timing Strategies and Overrun Policies:
- Kill: any job overrun is dropped (score = 0).
- Skip-Next: if an overrun is imminent, subsequent prompts are skipped until completion.
- Completion Rate: The fraction of tasks finishing before the deadline.
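As a minimal sketch (our own illustration, not the benchmark harness), the Kill policy's effect on average score and completion rate can be computed as:

```python
def evaluate_kill_policy(scores, runtimes, deadline):
    """Score a batch under the Kill policy: any task that overruns the
    deadline is dropped and contributes a score of 0."""
    kept = [s if t <= deadline else 0.0 for s, t in zip(scores, runtimes)]
    completion_rate = sum(t <= deadline for t in runtimes) / len(runtimes)
    return sum(kept) / len(kept), completion_rate
```

For example, with per-task scores `[0.8, 0.9, 0.7]`, runtimes `[1.0, 2.5, 0.5]` s, and a 2 s deadline, the middle task overruns, giving an average score of 0.5 and a completion rate of 2/3.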
Baselines include:
- Vanilla LLM (no cache eviction),
- Fixed SnapKV (25%, 50%, 75%, 95%),
- AWQ 4-bit weight quantization (Fan et al., 26 Dec 2025).
Key Findings:
- RLP achieves MAE ≈ 42.7 tokens and RMSE ≈ 78.1 tokens, outperforming 5- or 10-class BERT models for this task.
- End-to-end predicted latency closely tracks actual runtime, with the worst-case estimate always upper bounding the true runtime.
- Under time budgets of up to 10 s, TimeBill achieves up to a 15% higher average score than the vanilla LLM while matching the completion rate of fixed SnapKV.
- Performance peaks at a length-inflation factor of 5, confirming the “5× pessimism” rule common in hard real-time systems.
6. Significance and Impact
TimeBill establishes a systematic approach for meeting hard deadlines with LLMs, leveraging runtime modeling and analytic optimization to balance latency and answer quality. By integrating a fine-grained, LLM-tailored response length predictor, closed-form execution time models based on empirical hardware profiling, and an effective cache management scheme, TimeBill demonstrates robust empirical improvements in deadline completion rates and output fidelity. Its framework generalizes to any scenario with stringent real-time LLM requirements and has direct applicability to industrial, robotic, and safety-critical deployments (Fan et al., 26 Dec 2025).