TimeBill: Time-Budgeted Decoding for LLMs
- TimeBill formalizes decoding as a constrained optimization problem, maximizing response quality under a strict wall-clock deadline.
- It leverages fine-grained prediction of response length and execution time to adaptively control KV-cache eviction and manage computational resources.
- The framework achieves substantial latency reductions with minimal quality loss, making it well suited to real-time and safety-critical applications.
Time-Budgeted Decoding (TimeBill) refers to a family of frameworks and algorithms that guarantee large neural or sequence models—most prominently LLMs—produce outputs within a strict wall-clock time budget, while optimally balancing computational efficiency and prediction quality. This paradigm is central to deploying LLMs in real-time or safety-critical systems, including robotics, autonomous driving, and industrial automation, where exceeding the allotted response time may compromise system correctness or safety (Fan et al., 26 Dec 2025).
1. Motivation and Problem Definition
Autoregressive generation in LLMs has highly variable latency stemming from unpredictable output lengths and the inherently sequential, token-by-token decoding process. Traditional efficiency methods such as fixed-ratio key-value (KV) cache eviction cannot guarantee the dual objectives of maximizing task accuracy (e.g., F1, BLEU, ROUGE) and ensuring inference completes within an externally imposed deadline (Fan et al., 26 Dec 2025). Time-budgeted decoding (TimeBill) formalizes this requirement as a constrained optimization problem:
$$
\max_{\pi} \; \mathbb{E}_{x}\!\left[ Q\big(\pi(x)\big) \right] \quad \text{subject to} \quad T(\pi, x) \le T_{\text{budget}},
$$

where $\pi$ is the decoding policy, $x$ is an input, and $T(\pi, x)$ is the measured end-to-end inference time, including prefill and autoregressive components. Related paradigms in simultaneous translation and coding theory recast such constraints as soft or hard budgets on compute, latency, or output length (Zheng et al., 2020, Xu, 2021).
2. Core Algorithmic Frameworks
Recent work has instantiated TimeBill in multiple frameworks distinguished by three principal components: (i) precise response length or inference-time prediction, (ii) adaptive runtime control of compute-saving mechanisms, and (iii) explicit enforcement and monitoring of end-to-end time adherence.
2.1 Fine-grained Response Length and Execution Time Prediction
TimeBill (Fan et al., 26 Dec 2025) introduces a fine-grained Response Length Predictor (RLP), a compact transformer trained to estimate the future response length $\hat{L}_r$ given a prompt $x$. This estimation supports a workload-aware Execution Time Estimator (ETE), modeling prefill and decoding latency as closed-form polynomials of sequence length, prompt length, and dynamic KV-cache size. Hardware-specific coefficients are learned via profiling and regression:

$$
\hat{T}(L_p, \alpha) \;=\; T_{\text{prefill}}(L_p) \;+\; \sum_{t=1}^{\hat{L}_r} T_{\text{decode}}(C_t), \qquad C_t = (L_p + t)\,(1 - \alpha),
$$

where $L_p$ is the prompt length, $\alpha$ is the eviction ratio, and $C_t$ the effective context window at step $t$.
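To make the estimator concrete, here is a minimal sketch assuming a linear per-step decode-latency model and a quadratic prefill model; the coefficient names (`a0..a2`, `b0`, `b1`), their polynomial degrees, and the profiling procedure are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def fit_latency_model(context_sizes, measured_ms):
    """Least-squares fit of per-token decode latency vs. effective
    context size, from profiling runs on the target hardware."""
    X = np.stack(
        [np.ones_like(context_sizes, dtype=float), context_sizes], axis=1
    )
    coef, *_ = np.linalg.lstsq(X, measured_ms, rcond=None)
    return coef  # [b0, b1]

def estimate_total_time(prompt_len, resp_len, evict_ratio,
                        prefill_coef, decode_coef):
    """Prefill time plus the sum of per-token decode times, where each
    step sees a KV cache shrunk by the eviction ratio."""
    a0, a1, a2 = prefill_coef          # hypothetical quadratic prefill model
    b0, b1 = decode_coef               # hypothetical linear decode model
    t_prefill = a0 + a1 * prompt_len + a2 * prompt_len ** 2
    t_decode = 0.0
    for t in range(1, resp_len + 1):
        ctx = (prompt_len + t) * (1.0 - evict_ratio)  # effective KV size C_t
        t_decode += b0 + b1 * ctx
    return t_prefill + t_decode
```

A larger eviction ratio shrinks every step's effective context, so the estimate decreases monotonically in `evict_ratio`, which is what lets the controller below trade memory savings against the deadline.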
2.2 Adaptive Runtime Policy Optimization
TimeBill's inference controller solves a constrained minimization:
$$
\min_{\alpha \in [0,1]} \; \alpha \quad \text{subject to} \quad T_{\text{prefill}}(L_p) + \sum_{t=1}^{\tilde{L}_r} T_{\text{decode}}(C_t) \le T_{\text{budget}},
$$

where $\tilde{L}_r$ is a pessimistically adjusted output length. This yields a closed-form optimal eviction ratio $\alpha^*$ (fraction of KV-cache to evict), selecting the least aggressive memory-saving configuration required to meet $T_{\text{budget}}$ while minimizing loss of quality.
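Under a linear decode-latency model, total decode time is linear in the eviction ratio, so the least aggressive feasible ratio has a closed form. The sketch below illustrates this; the latency models and coefficient layout are the same illustrative assumptions as above, not the paper's exact formulas.

```python
def optimal_evict_ratio(prompt_len, resp_len, budget_ms,
                        prefill_coef, decode_coef):
    """Smallest eviction ratio alpha in [0, 1] such that the estimated
    end-to-end time fits the budget, assuming per-step decode latency
    b0 + b1 * (prompt_len + t) * (1 - alpha)."""
    a0, a1, a2 = prefill_coef
    b0, b1 = decode_coef
    t_prefill = a0 + a1 * prompt_len + a2 * prompt_len ** 2
    ctx_sum = sum(prompt_len + t for t in range(1, resp_len + 1))
    fixed = t_prefill + b0 * resp_len      # alpha-independent part
    slack = budget_ms - fixed
    full_ctx_cost = b1 * ctx_sum           # context cost at alpha = 0
    if full_ctx_cost <= slack:
        return 0.0                         # no eviction needed
    if slack <= 0:
        return 1.0                         # budget infeasible; evict maximally
    # fixed + (1 - alpha) * full_ctx_cost <= budget  =>  closed form:
    return 1.0 - slack / full_ctx_cost
```

The clamping at 0 and 1 mirrors the "least aggressive configuration" behavior: eviction is only dialed up as far as the deadline demands.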
Alternative approaches employ discrete routing across diverse decoding strategies (best-of-N, beam search, voting) with utility-maximizing selection under learned latency and quality predictors, or reinforcement learning (policy-gradient or GRPO) that directly incorporates time cost into the reward (Huang et al., 11 Sep 2025, Li et al., 16 May 2025, Wen et al., 24 Aug 2025).
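The discrete-routing variant can be sketched as a one-liner over learned predictors: pick the highest-predicted-quality strategy whose predicted latency fits the budget. The strategy names and predictor values below are illustrative placeholders, not outputs of any cited system.

```python
def route(strategies, budget_s):
    """Utility-maximizing selection over decoding strategies.

    strategies: list of (name, predicted_quality, predicted_latency_s)
    tuples, as produced by learned quality/latency predictors.
    """
    feasible = [s for s in strategies if s[2] <= budget_s]
    if not feasible:
        # Nothing fits: fall back to the fastest strategy available.
        return min(strategies, key=lambda s: s[2])[0]
    # Among budget-feasible strategies, maximize predicted quality.
    return max(feasible, key=lambda s: s[1])[0]
```

For example, with a 3-second budget a router like this would skip best-of-N sampling and settle for beam search, while a generous budget unlocks the more expensive strategy.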
2.3 Control and Enforcement of Time Budgets
BudgetThinker and SelfBudgeter frameworks (Wen et al., 24 Aug 2025, Li et al., 16 May 2025) enforce time budgets via control tokens and runtime monitoring. BudgetThinker, for example, injects control tokens at prescribed fractions of the budget:
$$
N_{\text{budget}} = \left\lfloor T_{\text{budget}} \,/\, \bar{t}_{\text{token}} \right\rfloor,
$$

with $N_{\text{budget}}$ mapped from the wall-clock time budget $T_{\text{budget}}$ via the empirical per-token latency $\bar{t}_{\text{token}}$, and control tokens inserted at prescribed fractions of $N_{\text{budget}}$. This makes the model aware of budget exhaustion, enabling precise truncation and consistent downstream performance.
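The Token-to-Time mapping and injection schedule can be sketched as below; the choice of quarter-budget fractions is an illustrative assumption rather than BudgetThinker's documented schedule.

```python
def token_budget(wall_clock_s, per_token_latency_s):
    """Map a wall-clock budget to a token budget via the empirically
    measured per-token decoding latency (floor, to stay conservative)."""
    return int(wall_clock_s / per_token_latency_s)

def control_token_positions(n_budget, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Token indices at which budget-remaining control tokens are
    injected so the model can track its own budget exhaustion."""
    return [int(f * n_budget) for f in fractions]
```

At serving time the runtime would emit a control token whenever generation crosses one of these indices, and hard-truncate at the final one.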
3. Extensions and Alternative Methodologies
Several methodologies have generalized or extended the TimeBill paradigm:
- Simultaneous Translation with Opportunistic Decoding: The ODTC framework for low-latency translation overgenerates extra tokens within a speculative window, correcting mistakes as more source context arrives, trading off latency, quality (BLEU), and revision rate within a hard time or token budget (Zheng et al., 2020).
- Inference-Time Adaptive Candidate Selection: φ-Decoding leverages foresight sampling (short rollouts), adaptive in-width and in-depth pruning, and a value–alignment distribution to optimize token selection under explicit compute or time constraints, delivering improved pass@1 accuracy per FLOP budget (Xu et al., 17 Mar 2025).
- Anytime Decoding by MCTS: In channel coding, MCTS enables anytime decoding by improving sequence accuracy monotonically with available compute time, naturally yielding an error–time tradeoff controllable by a budget (Xu, 2021).
The commonality across these frameworks is a predict–allocate–enforce cycle: (1) predict likely resource consumption, (2) allocate or schedule computation or generated content to fit within $T_{\text{budget}}$, and (3) enforce constraints via hard decoding rules, control tokens, or runtime truncation.
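The cycle above can be sketched as a single decoding loop. The helpers `predict_length`, `choose_config`, and `generate_token` are hypothetical stand-ins for a length predictor, a runtime-policy optimizer, and the model's decode step; enforcement here is a hard wall-clock cutoff with runtime truncation.

```python
import time

def budgeted_decode(prompt, budget_s, predict_length, choose_config,
                    generate_token):
    """Predict-allocate-enforce loop for time-budgeted decoding."""
    est_len = predict_length(prompt)                    # 1. predict
    config = choose_config(prompt, est_len, budget_s)   # 2. allocate
    deadline = time.monotonic() + budget_s
    tokens = []
    for _ in range(est_len):
        if time.monotonic() >= deadline:                # 3. enforce
            break                                       # runtime truncation
        tok = generate_token(prompt, tokens, config)
        if tok is None:                                 # model emitted EOS
            break
        tokens.append(tok)
    return tokens
```

Real systems refine each step (re-predicting length mid-generation, re-solving the allocation), but the control flow is the same.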
4. Empirical Performance and Trade-offs
| Method / Setting | Deadline Adherence | Accuracy/Score (F1, BLEU, ROUGE) | Resource Savings |
|---|---|---|---|
| TimeBill (KV eviction + RLP) | ~95% deadline satisfaction | +10 points over static baselines | Variable (α*) |
| BudgetThinker (CoT) | ~87% adherence | ≥80% baseline accuracy | 4× speedup |
| SelfBudgeter | ~78% budget match | <2.2% drop (MATH), +3.2% (GSM8K) | −61–74% length |
| φ-Decoding | Sub-second regime | +1–1.3% over strong baselines | 6× less FLOPs |
| ODTC (simultaneous translation) | <8% revision rate | +3.1 BLEU over wait-k | −2.4 RAL |
These results confirm substantial latency reductions—routinely 2–4×—with minor or negligible performance loss when mechanisms are properly calibrated. A key pattern is the importance of accurate prediction (of tokens, time, or cost) and continuous feedback/monitoring for adherence, especially under variance in prompt shape or hardware (Fan et al., 26 Dec 2025, Wen et al., 24 Aug 2025, Li et al., 16 May 2025).
5. Practical Considerations and Limitations
Deployment of TimeBill and related time-budgeted inference frameworks requires:
- Calibration or profiling per hardware target (e.g., recalibration of the prefill- and decode-latency coefficients per GPU/CPU) (Fan et al., 26 Dec 2025).
- Adaptation of the training and inference loop: fine-tuning or supervised adaptation with control tokens, and possible architectural changes for budget-predicting heads (Wen et al., 24 Aug 2025).
- Re-tuning for very tight deadlines: quality inevitably degrades if the minimum prefill or critical chain-of-thought steps exceed $T_{\text{budget}}$.
- For control-token methods, retraining is essential, as vanilla LLMs ignore unrecognized signals (Wen et al., 24 Aug 2025, Li et al., 16 May 2025).
- In Token-to-Time mapping, empirical per-token latency estimation must account for batch/load effects; under saturation, the mapping may become nonlinear (Wen et al., 24 Aug 2025).
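A batch-aware calibration addressing the nonlinearity noted above can be sketched as a polynomial fit of per-token latency against concurrent batch size; the quadratic degree and profiling numbers are illustrative assumptions.

```python
import numpy as np

def fit_per_token_latency(batch_sizes, latencies_ms, degree=2):
    """Polynomial fit of per-token latency vs. concurrent batch size,
    capturing the bend in the mapping as the server approaches
    saturation (degree=2 is an assumption, not a measured law)."""
    return np.polyfit(batch_sizes, latencies_ms, degree)

def tokens_for_budget(budget_ms, batch_size, coef):
    """Token budget under the current load, looked up at dispatch time."""
    per_token = np.polyval(coef, batch_size)
    return int(budget_ms / per_token)
```

The same wall-clock budget thus yields a smaller token budget when the server is loaded, which is exactly the adjustment a load-oblivious linear mapping misses.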
These frameworks are extensible to multi-dimensional optimization, e.g., trading off quantization level and early stopping along with cache-control (Fan et al., 26 Dec 2025), and can be integrated with per-query or per-user scheduling (Huang et al., 11 Sep 2025).
6. Connections to Broader Research and Future Directions
Time-budgeted decoding intersects with several active research frontiers:
- Multi-Knob Dynamic Inference: Integrating adaptive quantization, speculative decoding, and layer skipping with time-aware routing (Fan et al., 26 Dec 2025).
- Learning-Based Routing: Deploying shallow policy networks for real-time selection of configuration parameters under fluctuating latency and load (Fan et al., 26 Dec 2025, Huang et al., 11 Sep 2025).
- Progressive and Chunked Generation: Decoding in blocks with dynamic per-block time allocation to further optimize throughput and deadline adherence (Fan et al., 26 Dec 2025).
- Revision-Aware Metrics for Simultaneous and Streaming Systems: Incorporating lag, revision, and correction rates (e.g., Revision-Aware Lagging [RAL]) for measuring adherence and perceptual quality in streaming and real-time outputs (Zheng et al., 2020).
- Resource-Contingent Reasoning: Allowing user interruption or continuation based on predicted or observed latency, with immediate realization of partial progress (Li et al., 16 May 2025).
A plausible implication is that TimeBill and its variants will form the substrate of LLM deployment in latency-critical, resource-constrained, and real-time settings, as continual improvements in prediction accuracy, dynamic adaptation, and multitarget optimization are realized. Robust enforcement of compute or time budgets is increasingly viewed as foundational to safe, predictable, and scalable LLM deployment (Fan et al., 26 Dec 2025, Wen et al., 24 Aug 2025).