- The paper introduces a predictive framework that adapts inference strategies to meet strict time budgets in safety-critical applications.
- It employs fine-grained response length prediction and analytical execution time estimation to dynamically configure model runtime settings.
- Empirical evaluations report sub-2% mean absolute percentage error (MAPE) in execution time estimation and higher response scores under fixed time budgets than static baseline approaches.
TimeBill: Time-Budgeted Inference for LLMs
Introduction and Motivation
The deployment of LLMs in time-sensitive applications—robotics, autonomous driving, control systems, and industrial automation—requires not only accurate linguistic responses but also adherence to stringent time budgets for decision-making and safety-critical operations. Standard auto-regressive generation mechanisms in LLMs introduce significant uncertainty in inference latencies, compounded by input-dependent and model-specific response lengths. Existing approaches leveraging offline compression (quantization, pruning) and online optimization (KV cache eviction/quantization) are either insensitive to dynamic time budgets or suboptimal in adapting runtime configurations, failing to guarantee timely, high-quality responses across heterogeneous tasks.
The "TimeBill" framework formalizes and addresses these shortcomings by proposing a predictive, adaptive mechanism that achieves a Pareto-optimal balance between inference latency and response performance. TimeBill introduces fine-grained response length prediction and analytical execution time estimation, enabling principled runtime configuration of cache eviction ratios in accordance with user-specified time budgets.
Figure 1: An example of different inference strategies. The vanilla inference may overrun and miss the deadline, resulting in incomplete output. The As-Fast-As-Possible (AFAP) strategy will degrade the response performance, while time-budgeted inference improves the response performance under timing constraints.
Framework Overview
TimeBill maps the time-budgeted inference problem into a constrained optimization over LLM runtime configurations, abstracted as follows: maximize the response performance metric $M(\cdot)$ while ensuring that end-to-end inference finishes within the time budget $T$ and produces at most $N_{\max}$ tokens. The framework decomposes the multi-stage inference into prefill and decoding phases; for each, execution time is characterized as a function of input length, model structure, and KV cache status.
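In this notation, the abstraction can be stated as a constrained program over the runtime configuration, here the KV cache eviction ratio $\alpha$; this is a schematic restatement, and the paper's exact constraint set may differ:

$$
\max_{\alpha \in [0,\, \alpha_{\max}]} \; M\big(\mathbf{y}(\alpha)\big)
\quad \text{s.t.} \quad
t_{\text{e2e}}(\alpha) = t_{\text{prefill-phase}} + \sum_{i=1}^{N} t_{\text{decoding-step}}(i, \alpha) \le T,
\qquad N \le N_{\max}
$$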
Key challenges addressed:
- Uncertain response lengths: auto-regressive generation leaves the number of output tokens, and hence total latency, unknown before decoding starts.
- Workload-dependent execution time: per-token latency varies with input length, model structure, and KV cache status, and must be estimated ahead of time.
- Budget-aware configuration: the KV cache eviction ratio must be chosen per job so the deadline is met with minimal degradation of response quality.
Predictive Mechanisms: Response Length Predictor and Execution Time Estimator
Fine-Grained Response Length Predictor (RLP)
The RLP, built on a Small LLM (SLM) and designed around bucketed classification, predicts the response length for a given prompt with higher granularity than prior art. Knowledge distillation aligns the predictor with target-LLM behavior using empirical training pairs $(x_j, N_j)$, where a classification label encodes the bucket containing the observed response length for each prompt. Unlike BERT-based proxies, the SLM architecture can process longer prompts and achieves lower MAE and RMSE, especially as the number of buckets (the granularity) increases.
Figure 3: The overview of the proposed fine-grained response length predictor (RLP).
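To make the bucketed-classification design concrete, the sketch below shows how observed response lengths become class labels for distillation and how a predicted bucket is decoded back to a token count. The bucket width, bucket count, and sample prompts are illustrative assumptions, not the paper's settings:

```python
def length_to_bucket(n_tokens: int, bucket_width: int = 50, num_buckets: int = 40) -> int:
    """Map an observed response length to a classification label.
    Bucket width/count are hypothetical; the paper's granularity may differ."""
    return min(n_tokens // bucket_width, num_buckets - 1)

def bucket_to_length(bucket: int, bucket_width: int = 50) -> int:
    """Decode a predicted bucket back to a point estimate of response length
    (here, the bucket midpoint)."""
    return bucket * bucket_width + bucket_width // 2

# Hypothetical training pairs (prompt x_j, observed response length N_j),
# collected by running the target LLM, as in the distillation setup.
dataset = [("Summarize the following report ...", 412),
           ("Answer yes or no: ...", 3)]
labels = [length_to_bucket(n) for _, n in dataset]
# These labels train a classification head on top of the SLM; at inference
# time the argmax bucket is decoded back to tokens via bucket_to_length().
```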
Workload-Guided Execution Time Estimator (ETE)
TimeBill’s ETE analytically models FLOPs for each stage (Norm, CausalAttention, FeedForward, LMHead), deriving execution time as quadratic and linear terms in input size and KV cache length. The actual hardware-dependent coefficients are learned via profiling, allowing accurate real-time execution prediction. The ETE also accommodates pessimistic WCET guarantees via a configurable factor $k$ applied to the predicted response length, supporting hard real-time safety.
Figure 4: The timeline of TimeBill, where incoming arrows represent inputs (e.g., $(\mathbf{x}_1, N_{\mathbf{x}_1})$) and outgoing arrows represent outputs (e.g., $(\hat{\mathbf{y}}_1, \alpha_1^*)$).
Figure 5: Fitted curves for estimating $\hat{t}_{\text{prefill-phase}}$ and $\hat{t}_{\text{decoding-step}}$.
Figure 6: The performance of estimating $\hat{t}_{\text{e2e}}$ and $\hat{t}_{\text{WCET}}$.
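As a concrete illustration of the workload-guided estimator, the following sketch fits the hardware-dependent coefficients from profiled measurements and rolls them into an end-to-end estimate. All profiling numbers are hypothetical, and the per-step cache model (scaling the cache by `1 - evict_ratio`) is an assumption for illustration, not the paper's implementation:

```python
import numpy as np

# Profiled measurements on the target GPU (hypothetical values):
# prefill time vs. prompt length, and per-step decoding time vs. KV cache length.
prompt_lens   = np.array([128, 512, 1024, 2048, 4096])
prefill_times = np.array([0.021, 0.085, 0.180, 0.410, 0.950])  # seconds
cache_lens    = np.array([128, 512, 1024, 2048, 4096])
step_times    = np.array([0.018, 0.019, 0.021, 0.024, 0.031])  # seconds

# Prefill cost carries quadratic and linear terms in input length (attention
# is quadratic; FFN/LMHead are linear); each decoding step is linear in cache length.
prefill_coef = np.polyfit(prompt_lens, prefill_times, deg=2)
step_coef    = np.polyfit(cache_lens, step_times, deg=1)

def estimate_e2e(prompt_len: int, pred_len: int, evict_ratio: float = 0.0,
                 k: float = 1.0) -> float:
    """Estimate end-to-end latency; k > 1 inflates the predicted response
    length for a pessimistic, WCET-style bound."""
    t = np.polyval(prefill_coef, prompt_len)
    n = int(k * pred_len)
    for i in range(n):
        # Assumed cache model: eviction shrinks the effective KV cache length.
        cache = (prompt_len + i) * (1.0 - evict_ratio)
        t += np.polyval(step_coef, cache)
    return float(t)
```

With `k = 1.0` this yields the nominal estimate $\hat{t}_{\text{e2e}}$; larger `k` produces the pessimistic $\hat{t}_{\text{WCET}}$ bound from Figure 6.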
Time-Budgeted Efficient Inference: Optimization and System Deployment
The framework solves for the minimal KV cache eviction ratio $\alpha^*$ required to satisfy the given time budget $T$ per job, subject to constraints on completion rates and the maximal allowed degradation ($\alpha_{\max}$). This optimization consumes the predictive outputs of the RLP and ETE, running both at runtime in parallel with the model’s prefill phase via CPU/GPU co-processing. Prompt compression is optionally invoked to further reduce input overhead, so that deadline-violating inference is proactively avoided.
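A minimal sketch of this per-job solve follows, assuming the latency estimate decreases monotonically in the eviction ratio so that a binary search suffices (the paper may solve this differently). Here `estimate_e2e` is the estimator sketched earlier, and `alpha_max`/`tol` are illustrative values:

```python
def minimal_eviction_ratio(estimate_e2e, prompt_len: int, pred_len: int,
                           budget_s: float, alpha_max: float = 0.8,
                           tol: float = 1e-3) -> float | None:
    """Binary-search the smallest eviction ratio alpha whose estimated
    end-to-end latency fits the budget T. Returns None when even alpha_max
    cannot meet the deadline, so the job can be skipped or degraded
    instead of overrunning."""
    if estimate_e2e(prompt_len, pred_len, evict_ratio=0.0) <= budget_s:
        return 0.0                       # vanilla inference already fits
    if estimate_e2e(prompt_len, pred_len, evict_ratio=alpha_max) > budget_s:
        return None                      # infeasible even at maximal eviction
    lo, hi = 0.0, alpha_max              # latency assumed monotone in alpha
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if estimate_e2e(prompt_len, pred_len, evict_ratio=mid) <= budget_s:
            hi = mid                     # mid meets the deadline; try smaller
        else:
            lo = mid                     # mid overruns; evict more
    return hi
```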
Empirical Evaluation
Efficacy of RLP and ETE
Evaluation of Qwen2.5-7B-Instruct on LongBench demonstrates that the fine-grained RLP outperforms the ProxyModel and S3 (BERT-based) predictors, achieving the lowest MAE/RMSE at the highest bucket granularity. The ETE achieves sub-2% MAPEs for both prefill and decoding estimation, providing reliable upper bounds on WCET, which is crucial for real-time guarantees.
In benchmarks against baselines (Vanilla, fixed $\alpha$, AWQ quantization) under the Kill and Skip-Next overrun strategies, TimeBill attains higher average response scores while maintaining competitive task completion rates, outperforming static and quantized approaches across varying time budgets.
Figure 7: The average scores and completion rates of different approaches under Kill and Skip-Next.
Sensitivity to Pessimistic Factor
Experiments analyzing the effect of the WCET pessimistic factor $k$ (range $[1, 8]$) confirm the need for careful tuning. A moderate $k$ (e.g., $k = 5$) increases reliability (completion rate) without substantial loss in average score, but excessive conservatism ($k > 5$) forces unnecessarily high eviction ratios $\alpha$, harming output fidelity.
Figure 8: The average scores and completion rates with different pessimistic factors $k$ under the overrun strategy Kill, where the time budget $T = 5$ s.
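Schematically, the pessimistic factor inflates the predicted response length $\hat{N}$ before the per-step estimates are summed, which is why a large $k$ forces correspondingly larger eviction ratios. This is a restatement of the mechanism in the notation of Figures 5 and 6, not necessarily the paper's exact formula:

$$
\hat{t}_{\text{WCET}} \;=\; \hat{t}_{\text{prefill-phase}} \;+\; \sum_{i=1}^{k \cdot \hat{N}} \hat{t}_{\text{decoding-step}}(i)
$$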
Implications, Applications, and Future Directions
TimeBill’s unified predictive and adaptive approach enables precise control over inference latencies in LLM deployment for time-critical systems. The framework is agnostic to target LLM architecture and compatible with offline compression and online quantization methods, potentially extensible to ensembles or multi-modal models. For future work, integrating reinforcement learning with online feedback might further optimize response-performance trade-offs under fluctuating load and mixed-criticality constraints. The modeling framework could expand to distributed setups or incorporate richer hardware heterogeneity. Methodologically, theoretical bounds on WCET approximation error and joint optimization for multi-instance scheduling remain open topics.
Conclusion
TimeBill presents a formal, robust solution for time-budgeted inference in LLM deployment, integrating fine-grained response prediction, analytical time estimation, and adaptive runtime configuration. Empirical results validate the superiority of this framework in meeting hard real-time deadlines without sacrificing output quality, highlighting its practical value for AI-driven, safety-critical systems.