
TimeBill: Time-Budgeted Decoding for LLMs

Updated 7 January 2026
  • The paper demonstrates that TimeBill formalizes decoding as a constrained optimization problem, maximizing quality under strict wall-clock deadlines.
  • It leverages fine-grained prediction of response lengths and execution time to adaptively control memory eviction and manage computational resources.
  • The framework achieves notable latency improvements with minimal quality loss, making it essential for real-time and safety-critical applications.

Time-Budgeted Decoding (TimeBill) refers to a family of frameworks and algorithms that guarantee large neural or sequence models—most prominently LLMs—produce outputs within a strict wall-clock time budget, while optimally balancing computational efficiency and prediction quality. This paradigm is central to deploying LLMs in real-time or safety-critical systems, including robotics, autonomous driving, and industrial automation, where exceeding the allotted response time may compromise system correctness or safety (Fan et al., 26 Dec 2025).

1. Motivation and Problem Definition

Autoregressive generation in LLMs has highly variable latency, stemming from unpredictable output lengths and the inherently sequential, token-by-token decoding process. Traditional efficiency methods such as fixed-ratio key-value (KV) cache eviction cannot guarantee the dual objectives of maximizing task accuracy (e.g., F1, BLEU, ROUGE) and ensuring inference completes within an externally imposed deadline $T$ (Fan et al., 26 Dec 2025). Time-budgeted decoding (TimeBill) formalizes this requirement as a constrained optimization problem:

$$\max_{G} \ \text{Quality}(G(x)) \quad \text{subject to} \quad t_{\mathrm{e2e}}(G, x) \leq T$$

where $G$ is the decoding policy, $x$ is an input, and $t_{\mathrm{e2e}}$ is the measured end-to-end inference time, including prefill and autoregressive components. Related paradigms in simultaneous translation and coding theory recast such constraints as soft or hard budgets on compute, latency, or output length (Zheng et al., 2020, Xu, 2021).
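To make the constraint concrete, here is a minimal sketch (not the paper's algorithm) of a controller that picks the highest-quality decoding policy whose predicted end-to-end time fits the deadline; the policy names and the quality/latency numbers are invented for illustration:

```python
# Hypothetical sketch: choose the decoding policy G with the highest
# predicted quality whose estimated end-to-end time t_e2e fits the
# deadline T. Candidates and their (quality, time) values are invented.

def select_policy(candidates, deadline):
    """candidates: list of (name, predicted_quality, predicted_t_e2e)."""
    feasible = [c for c in candidates if c[2] <= deadline]
    if not feasible:
        return None  # no configuration can meet the deadline
    return max(feasible, key=lambda c: c[1])  # maximize predicted quality

policies = [
    ("full-cache decoding", 0.92, 1.40),     # best quality, slowest
    ("moderate KV eviction", 0.89, 0.95),
    ("aggressive KV eviction", 0.80, 0.60),
]
best = select_policy(policies, deadline=1.0)  # moderate eviction is feasible
```

The same selection logic underlies the routing-based variants discussed later: replace the hard-coded tuples with learned quality and latency predictors.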

2. Core Algorithmic Frameworks

Recent work has instantiated TimeBill in multiple frameworks distinguished by three principal components: (i) precise response length or inference-time prediction, (ii) adaptive runtime control of compute-saving mechanisms, and (iii) explicit enforcement and monitoring of end-to-end time adherence.

2.1 Fine-grained Response Length and Execution Time Prediction

TimeBill (Fan et al., 26 Dec 2025) introduces a fine-grained Response Length Predictor (RLP), a compact transformer trained to estimate the future response length $N$ given a prompt $x$. This estimation supports a workload-aware Execution Time Estimator (ETE), modeling prefill and decoding latency as closed-form polynomials of sequence length, prompt length, and dynamic KV-cache size. Hardware-specific coefficients are learned via profiling and regression:

$$\hat{t}_{\mathrm{prefill}}(x) = a N_x^2 + b N_x + c; \qquad \hat{t}_{\mathrm{decode}}^{\,i}(x, \alpha) = p N_{\mathrm{kv}}^{\,i} + q$$

where $N_x$ is the prompt length, $\alpha$ is the eviction ratio, and $N_{\mathrm{kv}}^{\,i}$ is the effective context window at step $i$.
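The latency model above can be sketched in a few lines. The coefficients below are illustrative placeholders; a real deployment would fit hardware-specific values via profiling and regression, as the paper describes:

```python
# Sketch of the workload-aware execution-time estimator (ETE).
# Coefficients a, b, c, p, q are hardware-specific and would be fit by
# profiling + regression; the values here are invented placeholders.

def t_prefill(n_prompt, a=2e-8, b=1e-4, c=5e-3):
    """Quadratic prefill-latency model: a*N_x^2 + b*N_x + c (seconds)."""
    return a * n_prompt**2 + b * n_prompt + c

def t_decode_step(n_kv, p=2e-5, q=8e-3):
    """Linear per-step decode latency in the effective KV-cache size."""
    return p * n_kv + q

def t_e2e(n_prompt, n_out, alpha):
    """Estimated end-to-end time: prefill plus n_out decode steps, where
    the eviction ratio alpha shrinks the effective context at each step."""
    total = t_prefill(n_prompt)
    for i in range(n_out):
        n_kv = (n_prompt + i) * (1.0 - alpha)  # retained cache entries
        total += t_decode_step(n_kv)
    return total
```

A larger eviction ratio directly reduces the estimated decode time, which is the lever the runtime controller in the next subsection optimizes over.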

2.2 Adaptive Runtime Policy Optimization

TimeBill's inference controller solves a constrained minimization:

$$\min_{0 \leq \alpha \leq \alpha_{\max}} \alpha \quad \text{s.t.} \quad \hat{t}_{\mathrm{WCET}}(x, \alpha, \hat{N}_W) \leq T$$

where $\hat{N}_W$ is a pessimistically adjusted output length. This yields a closed-form optimal $\alpha^*$ (the fraction of the KV cache to evict), selecting the least aggressive memory-saving configuration required to meet $T$ while minimizing loss of quality.
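Because the estimated decode time is linear in the retained-cache fraction $(1 - \alpha)$, the minimization admits a closed form. The sketch below, with invented coefficients matching the quadratic-prefill/linear-decode model above, shows one way such a solver could look; it is not the paper's implementation:

```python
# Hedged sketch of the controller: solve for the smallest eviction ratio
# alpha such that estimated worst-case time fits the deadline T.
# All coefficients are illustrative, not the paper's profiled values.

def optimal_alpha(n_prompt, n_out_pessimistic, deadline,
                  a=2e-8, b=1e-4, c=5e-3, p=2e-5, q=8e-3, alpha_max=0.95):
    prefill = a * n_prompt**2 + b * n_prompt + c
    # Decode time = sum_i [ p * (n_prompt + i) * (1 - alpha) + q ]
    kv_sum = sum(n_prompt + i for i in range(n_out_pessimistic))
    fixed = prefill + q * n_out_pessimistic   # time alpha cannot reduce
    if fixed > deadline:
        return None                           # deadline infeasible outright
    if p * kv_sum <= deadline - fixed:
        return 0.0                            # no eviction needed
    alpha = 1.0 - (deadline - fixed) / (p * kv_sum)
    return alpha if alpha <= alpha_max else None
```

Returning `None` when even $\alpha_{\max}$ cannot satisfy the deadline mirrors the need for a fallback policy (e.g., rejecting or rescheduling the request) when the budget is infeasible.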

Alternative approaches employ discrete routing across diverse decoding strategies (best-of-N, beam search, voting) with utility-maximizing selection under learned latency and quality predictors, or reinforcement learning (policy-gradient or GRPO) that directly incorporates time cost into the reward (Huang et al., 11 Sep 2025, Li et al., 16 May 2025, Wen et al., 24 Aug 2025).

2.3 Control and Enforcement of Time Budgets

BudgetThinker and SelfBudgeter frameworks (Wen et al., 24 Aug 2025, Li et al., 16 May 2025) enforce time budgets via control tokens and runtime monitoring. BudgetThinker, for example, injects $K$ control tokens at prescribed fractions of the budget:

$$t_k = k \left\lfloor \frac{B}{K} \right\rfloor, \quad k = 1, \ldots, K$$

with $B$ mapped from the wall-clock time budget $T$ via empirical per-token latency. This makes the model aware of budget exhaustion, enabling precise truncation and consistent downstream performance.
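A sketch of this scheduling step, under the assumption of a constant measured per-token latency (the function name is hypothetical, not BudgetThinker's API):

```python
# Hypothetical sketch of BudgetThinker-style control-token placement:
# map the wall-clock budget T to a token budget B via measured per-token
# latency, then schedule K reminder positions at equal fractions of B.

def control_token_positions(T, per_token_latency, K):
    B = int(T / per_token_latency)   # token budget from wall-clock budget
    step = B // K                    # floor(B / K), as in the formula above
    return B, [k * step for k in range(1, K + 1)]

B, positions = control_token_positions(T=10.0, per_token_latency=0.02, K=4)
# B = 500, positions = [125, 250, 375, 500]
```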

3. Extensions and Alternative Methodologies

Several methodologies have generalized or extended the TimeBill paradigm:

  • Simultaneous Translation with Opportunistic Decoding: The ODTC framework for low-latency translation overgenerates extra tokens within a speculative window, correcting mistakes as more source context arrives, trading off latency, quality (BLEU), and revision rate within a hard time or token budget (Zheng et al., 2020).
  • Inference-Time Adaptive Candidate Selection: φ-Decoding leverages foresight sampling (short rollouts), adaptive in-width and in-depth pruning, and a value–alignment distribution to optimize token selection under explicit compute or time constraints, delivering improved pass@1 accuracy per FLOP budget (Xu et al., 17 Mar 2025).
  • Anytime Decoding by MCTS: In channel coding, MCTS enables anytime decoding by improving sequence accuracy monotonically with available compute time, naturally yielding an error–time tradeoff controllable by a budget $T$ (Xu, 2021).

The commonality across these frameworks is a predict–allocate–enforce cycle: (1) predict likely resource consumption, (2) allocate or schedule computation or generated content to fit within $T$, and (3) enforce constraints via hard decoding rules, control tokens, or runtime truncation.
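The predict–allocate–enforce cycle can be sketched as a single loop; `predict_length` and `generate_token` below stand in for whatever length predictor and model a real system would use:

```python
# Minimal sketch of the predict-allocate-enforce cycle common to these
# frameworks. The callables are stand-ins for a real predictor and model.

import time

def budgeted_decode(predict_length, generate_token, deadline):
    start = time.monotonic()
    n_pred = predict_length()                     # 1. predict resource need
    budget_per_token = deadline / max(n_pred, 1)  # 2. allocate time per token
    out = []
    for _ in range(n_pred):
        remaining = deadline - (time.monotonic() - start)
        if remaining < budget_per_token:          # 3. enforce: truncate early
            break
        tok = generate_token()
        if tok is None:                           # natural end of sequence
            break
        out.append(tok)
    return out
```

Real systems refine each stage (pessimistic length margins, per-step re-estimation, control-token reminders), but the control flow follows this shape.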

4. Empirical Performance and Trade-offs

| Method / Setting | Deadline Adherence | Accuracy/Score (F1, BLEU, ROUGE) | Resource Savings |
|---|---|---|---|
| TimeBill (KV eviction + RLP) | Matches (α = 95%) | +10 points over static baselines | Variable ($\alpha^*$) |
| BudgetThinker (CoT) | ~87% at $B = 2000$ | ≥80% baseline accuracy | 4× speedup |
| SelfBudgeter | ~78% budget match | <2.2% drop (MATH), +3.2% (GSM8K) | −61–74% length |
| φ-Decoding | Sub-second regime | +1–1.3% over strong baselines | 6× fewer FLOPs |
| ODTC (simultaneous translation) | <8% revision rate | +3.1 BLEU over wait-k | −2.4 RAL |

These results confirm substantial latency reductions—routinely 2–4×—with minor or negligible performance loss when mechanisms are properly calibrated. A key pattern is the importance of accurate prediction (of tokens, time, or cost) and continuous feedback/monitoring for adherence, especially under variance in prompt shape or hardware (Fan et al., 26 Dec 2025, Wen et al., 24 Aug 2025, Li et al., 16 May 2025).

5. Practical Considerations and Limitations

Deployment of TimeBill and related time-budgeted inference frameworks requires:

  • Calibration or profiling per hardware target (e.g., recalibration of the coefficients of $\hat{t}_{\mathrm{prefill}}$ and $\hat{t}_{\mathrm{decode}}$ for each GPU/CPU) (Fan et al., 26 Dec 2025).
  • Adaptation of the training and inference loop: fine-tuning or supervised adaptation with control tokens, and possible architectural changes for budget-predicting heads (Wen et al., 24 Aug 2025).
  • Re-tuning for very tight deadlines: quality inevitably degrades if the minimum prefill time or critical chain-of-thought steps exceed $T$.
  • For control-token methods, retraining is essential, as vanilla LLMs ignore unrecognized signals (Wen et al., 24 Aug 2025, Li et al., 16 May 2025).
  • In Token-to-Time mapping, empirical per-token latency estimation must account for batch/load effects; under saturation, the mapping may become nonlinear (Wen et al., 24 Aug 2025).
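As an illustration of why the mapping may become nonlinear, the sketch below adds a simple load correction past a saturation batch size; the saturation point and slope are made-up placeholders, not measured values:

```python
# Illustrative token-to-time mapping with a load correction: below a
# saturation batch size, per-token latency is roughly constant; above it,
# contention makes latency grow with batch size, so the same wall-clock
# budget T buys fewer tokens. All constants are invented placeholders.

def token_budget(T, base_latency=0.02, batch_size=1, saturation=8, slope=0.004):
    if batch_size <= saturation:
        per_token = base_latency
    else:  # past saturation, each extra request slows every token down
        per_token = base_latency + slope * (batch_size - saturation)
    return int(T / per_token)
```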

These frameworks are extensible to multi-dimensional optimization, e.g., trading off quantization level and early stopping along with cache-control (Fan et al., 26 Dec 2025), and can be integrated with per-query or per-user scheduling (Huang et al., 11 Sep 2025).

6. Connections to Broader Research and Future Directions

Time-budgeted decoding intersects with several active research frontiers:

  • Multi-Knob Dynamic Inference: Integrating adaptive quantization, speculative decoding, and layer skipping with time-aware routing (Fan et al., 26 Dec 2025).
  • Learning-Based Routing: Deploying shallow policy networks for real-time selection of configuration parameters under fluctuating latency and load (Fan et al., 26 Dec 2025, Huang et al., 11 Sep 2025).
  • Progressive and Chunked Generation: Decoding in blocks with dynamic per-block time allocation to further optimize throughput and deadline adherence (Fan et al., 26 Dec 2025).
  • Revision-Aware Metrics for Simultaneous and Streaming Systems: Incorporating lag, revision, and correction rates (e.g., Revision-Aware Lagging [RAL]) for measuring adherence and perceptual quality in streaming and real-time outputs (Zheng et al., 2020).
  • Resource-Contingent Reasoning: Allowing user interruption or continuation based on predicted or observed latency, with immediate realization of partial progress (Li et al., 16 May 2025).

A plausible implication is that TimeBill and its variants will form the substrate of LLM deployment in latency-critical, resource-constrained, and real-time settings, as continual improvements in prediction accuracy, dynamic adaptation, and multitarget optimization are realized. Robust enforcement of compute or time budgets is increasingly viewed as foundational to safe, predictable, and scalable LLM deployment (Fan et al., 26 Dec 2025, Wen et al., 24 Aug 2025).
