TimeBill: Time-Budgeted Decoding for LLMs
- TimeBill formalizes decoding as a constrained optimization problem, maximizing response quality under a strict wall-clock deadline.
- It leverages fine-grained prediction of response length and execution time to adaptively control KV-cache eviction and manage computational resources.
- The framework achieves substantial latency reductions with minimal quality loss, making it well suited to real-time and safety-critical applications.
Time-Budgeted Decoding (TimeBill) refers to a family of frameworks and algorithms that guarantee large neural or sequence models—most prominently LLMs—produce outputs within a strict wall-clock time budget, while optimally balancing computational efficiency and prediction quality. This paradigm is central to deploying LLMs in real-time or safety-critical systems, including robotics, autonomous driving, and industrial automation, where exceeding the allotted response time may compromise system correctness or safety (Fan et al., 26 Dec 2025).
1. Motivation and Problem Definition
Autoregressive generation in LLMs has highly variable latency stemming from unpredictable output lengths and the inherently sequential, token-by-token decoding process. Traditional efficiency methods such as fixed-ratio key-value (KV) cache eviction cannot guarantee the dual objectives of maximizing task accuracy (e.g., F1, BLEU, ROUGE) and ensuring inference completes within an externally imposed deadline (Fan et al., 26 Dec 2025). Time-budgeted decoding (TimeBill) formalizes this requirement as a constrained optimization problem:
$$
\max_{\pi} \; \mathbb{E}_{x}\!\left[ Q\big(\pi(x)\big) \right] \quad \text{subject to} \quad T(\pi, x) \le T_{\text{budget}},
$$

where $\pi$ is the decoding policy, $x$ is an input, and $T(\pi, x)$ is the measured end-to-end inference time, including prefill and autoregressive components. Related paradigms in simultaneous translation and coding theory recast such constraints as soft or hard budgets on compute, latency, or output length (Zheng et al., 2020, Xu, 2021).
2. Core Algorithmic Frameworks
Recent work has instantiated TimeBill in multiple frameworks distinguished by three principal components: (i) precise response length or inference-time prediction, (ii) adaptive runtime control of compute-saving mechanisms, and (iii) explicit enforcement and monitoring of end-to-end time adherence.
2.1 Fine-grained Response Length and Execution Time Prediction
TimeBill (Fan et al., 26 Dec 2025) introduces a fine-grained Response Length Predictor (RLP), a compact transformer trained to estimate the future response length $\hat{L}_r$ given a prompt $x$. This estimation supports a workload-aware Execution Time Estimator (ETE), modeling prefill and decoding latency as closed-form polynomials of sequence length, prompt length, and dynamic KV-cache size. Hardware-specific coefficients are learned via profiling and regression:

$$
\hat{T}(L_p, \alpha) \;=\; T_{\text{prefill}}(L_p) \;+\; \sum_{t=1}^{\hat{L}_r} T_{\text{decode}}(C_t), \qquad C_t = (L_p + t)\,(1 - \alpha),
$$

where $L_p$ is the prompt length, $\alpha$ is the eviction ratio, and $C_t$ the effective context window at step $t$.
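To make the estimator concrete, here is a minimal sketch assuming a linear per-step decode-latency model and a quadratic prefill model; the coefficient names (`a0..a2`, `b0`, `b1`), their polynomial degrees, and the profiling procedure are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def fit_latency_model(context_sizes, measured_ms):
    """Least-squares fit of per-token decode latency vs. effective
    context size, from profiling runs on the target hardware."""
    X = np.stack(
        [np.ones_like(context_sizes, dtype=float), context_sizes], axis=1
    )
    coef, *_ = np.linalg.lstsq(X, measured_ms, rcond=None)
    return coef  # [b0, b1]

def estimate_total_time(prompt_len, resp_len, evict_ratio,
                        prefill_coef, decode_coef):
    """Prefill time plus the sum of per-token decode times, where each
    step sees a KV cache shrunk by the eviction ratio."""
    a0, a1, a2 = prefill_coef          # hypothetical quadratic prefill model
    b0, b1 = decode_coef               # hypothetical linear decode model
    t_prefill = a0 + a1 * prompt_len + a2 * prompt_len ** 2
    t_decode = 0.0
    for t in range(1, resp_len + 1):
        ctx = (prompt_len + t) * (1.0 - evict_ratio)  # effective KV size C_t
        t_decode += b0 + b1 * ctx
    return t_prefill + t_decode
```

A larger eviction ratio shrinks every step's effective context, so the estimate decreases monotonically in `evict_ratio`, which is what lets the controller below trade memory savings against the deadline.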
2.2 Adaptive Runtime Policy Optimization
TimeBill's inference controller solves a constrained minimization:
$$
\min_{\alpha \in [0,1]} \; \alpha \quad \text{subject to} \quad T_{\text{prefill}}(L_p) + \sum_{t=1}^{\tilde{L}_r} T_{\text{decode}}(C_t) \le T_{\text{budget}},
$$

where $\tilde{L}_r$ is a pessimistically adjusted output length. This yields a closed-form optimal eviction ratio $\alpha^*$ (fraction of KV-cache to evict), selecting the least aggressive memory-saving configuration required to meet $T_{\text{budget}}$ while minimizing loss of quality.
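Under a linear decode-latency model, total decode time is linear in the eviction ratio, so the least aggressive feasible ratio has a closed form. The sketch below illustrates this; the latency models and coefficient layout are the same illustrative assumptions as above, not the paper's exact formulas.

```python
def optimal_evict_ratio(prompt_len, resp_len, budget_ms,
                        prefill_coef, decode_coef):
    """Smallest eviction ratio alpha in [0, 1] such that the estimated
    end-to-end time fits the budget, assuming per-step decode latency
    b0 + b1 * (prompt_len + t) * (1 - alpha)."""
    a0, a1, a2 = prefill_coef
    b0, b1 = decode_coef
    t_prefill = a0 + a1 * prompt_len + a2 * prompt_len ** 2
    ctx_sum = sum(prompt_len + t for t in range(1, resp_len + 1))
    fixed = t_prefill + b0 * resp_len      # alpha-independent part
    slack = budget_ms - fixed
    full_ctx_cost = b1 * ctx_sum           # context cost at alpha = 0
    if full_ctx_cost <= slack:
        return 0.0                         # no eviction needed
    if slack <= 0:
        return 1.0                         # budget infeasible; evict maximally
    # fixed + (1 - alpha) * full_ctx_cost <= budget  =>  closed form:
    return 1.0 - slack / full_ctx_cost
```

The clamping at 0 and 1 mirrors the "least aggressive configuration" behavior: eviction is only dialed up as far as the deadline demands.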
Alternative approaches employ discrete routing across diverse decoding strategies (best-of-N, beam search, voting) with utility-maximizing selection under learned latency and quality predictors, or reinforcement learning (policy-gradient or GRPO) that directly incorporates time cost into the reward (Huang et al., 11 Sep 2025, Li et al., 16 May 2025, Wen et al., 24 Aug 2025).
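The discrete-routing variant can be sketched as a one-liner over learned predictors: pick the highest-predicted-quality strategy whose predicted latency fits the budget. The strategy names and predictor values below are illustrative placeholders, not outputs of any cited system.

```python
def route(strategies, budget_s):
    """Utility-maximizing selection over decoding strategies.

    strategies: list of (name, predicted_quality, predicted_latency_s)
    tuples, as produced by learned quality/latency predictors.
    """
    feasible = [s for s in strategies if s[2] <= budget_s]
    if not feasible:
        # Nothing fits: fall back to the fastest strategy available.
        return min(strategies, key=lambda s: s[2])[0]
    # Among budget-feasible strategies, maximize predicted quality.
    return max(feasible, key=lambda s: s[1])[0]
```

For example, with a 3-second budget a router like this would skip best-of-N sampling and settle for beam search, while a generous budget unlocks the more expensive strategy.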
2.3 Control and Enforcement of Time Budgets
BudgetThinker and SelfBudgeter frameworks (Wen et al., 24 Aug 2025, Li et al., 16 May 2025) enforce time budgets via control tokens and runtime monitoring. BudgetThinker, for example, injects control tokens at prescribed fractions of the budget:
$$
N_{\text{budget}} = \left\lfloor T_{\text{budget}} \,/\, \bar{t}_{\text{token}} \right\rfloor,
$$

with $N_{\text{budget}}$ mapped from the wall-clock time budget $T_{\text{budget}}$ via the empirical per-token latency $\bar{t}_{\text{token}}$, and control tokens inserted at prescribed fractions of $N_{\text{budget}}$. This makes the model aware of budget exhaustion, enabling precise truncation and consistent downstream performance.
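The Token-to-Time mapping and injection schedule can be sketched as below; the choice of quarter-budget fractions is an illustrative assumption rather than BudgetThinker's documented schedule.

```python
def token_budget(wall_clock_s, per_token_latency_s):
    """Map a wall-clock budget to a token budget via the empirically
    measured per-token decoding latency (floor, to stay conservative)."""
    return int(wall_clock_s / per_token_latency_s)

def control_token_positions(n_budget, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Token indices at which budget-remaining control tokens are
    injected so the model can track its own budget exhaustion."""
    return [int(f * n_budget) for f in fractions]
```

At serving time the runtime would emit a control token whenever generation crosses one of these indices, and hard-truncate at the final one.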
3. Extensions and Alternative Methodologies
Several methodologies have generalized or extended the TimeBill paradigm:
- Simultaneous Translation with Opportunistic Decoding: The ODTC framework for low-latency translation overgenerates extra tokens within a speculative window, correcting mistakes as more source context arrives, trading off latency, quality (BLEU), and revision rate within a hard time or token budget (Zheng et al., 2020).
- Inference-Time Adaptive Candidate Selection: φ-Decoding leverages foresight sampling (short rollouts), adaptive in-width and in-depth pruning, and a value–alignment distribution to optimize token selection under explicit compute or time constraints, delivering improved pass@1 accuracy per FLOP budget (Xu et al., 17 Mar 2025).
- Anytime Decoding by MCTS: In channel coding, MCTS enables anytime decoding by improving sequence accuracy monotonically with available compute time, naturally yielding an error–time tradeoff controllable by a budget (Xu, 2021).
The commonality across these frameworks is a predict–allocate–enforce cycle: (1) predict likely resource consumption, (2) allocate or schedule computation or generated content to fit within $T_{\text{budget}}$, and (3) enforce constraints via hard decoding rules, control tokens, or runtime truncation.
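The cycle above can be sketched as a single decoding loop. The helpers `predict_length`, `choose_config`, and `generate_token` are hypothetical stand-ins for a length predictor, a runtime-policy optimizer, and the model's decode step; enforcement here is a hard wall-clock cutoff with runtime truncation.

```python
import time

def budgeted_decode(prompt, budget_s, predict_length, choose_config,
                    generate_token):
    """Predict-allocate-enforce loop for time-budgeted decoding."""
    est_len = predict_length(prompt)                    # 1. predict
    config = choose_config(prompt, est_len, budget_s)   # 2. allocate
    deadline = time.monotonic() + budget_s
    tokens = []
    for _ in range(est_len):
        if time.monotonic() >= deadline:                # 3. enforce
            break                                       # runtime truncation
        tok = generate_token(prompt, tokens, config)
        if tok is None:                                 # model emitted EOS
            break
        tokens.append(tok)
    return tokens
```

Real systems refine each step (re-predicting length mid-generation, re-solving the allocation), but the control flow is the same.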
4. Empirical Performance and Trade-offs
| Method / Setting | Deadline Adherence | Accuracy/Score (F1, BLEU, ROUGE) | Resource Savings |
|---|---|---|---|
| TimeBill (KV eviction + RLP) | ~95% deadline satisfaction | +10 points over static baselines | Variable (α*) |
| BudgetThinker (CoT) | ~87% adherence | ≥80% baseline accuracy | 4× speedup |
| SelfBudgeter | ~78% budget match | <2.2% drop (MATH), +3.2% (GSM8K) | −61–74% length |
| φ-Decoding | Sub-second regime | +1–1.3% over strong baselines | 6× less FLOPs |
| ODTC (simultaneous translation) | <8% revision rate | +3.1 BLEU over wait-k | −2.4 RAL |
These results confirm substantial latency reductions—routinely 2–4×—with minor or negligible performance loss when mechanisms are properly calibrated. A key pattern is the importance of accurate prediction (of tokens, time, or cost) and continuous feedback/monitoring for adherence, especially under variance in prompt shape or hardware (Fan et al., 26 Dec 2025, Wen et al., 24 Aug 2025, Li et al., 16 May 2025).
5. Practical Considerations and Limitations
Deployment of TimeBill and related time-budgeted inference frameworks requires:
- Calibration or profiling per hardware target (e.g., recalibration of the prefill- and decode-latency coefficients per GPU/CPU) (Fan et al., 26 Dec 2025).
- Adaptation of the training and inference loop: fine-tuning or supervised adaptation with control tokens, and possible architectural changes for budget-predicting heads (Wen et al., 24 Aug 2025).
- Re-tuning for very tight deadlines: quality inevitably degrades if the minimum prefill or critical chain-of-thought steps exceed $T_{\text{budget}}$.
- For control-token methods, retraining is essential, as vanilla LLMs ignore unrecognized signals (Wen et al., 24 Aug 2025, Li et al., 16 May 2025).
- In Token-to-Time mapping, empirical per-token latency estimation must account for batch/load effects; under saturation, the mapping may become nonlinear (Wen et al., 24 Aug 2025).
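A batch-aware calibration addressing the nonlinearity noted above can be sketched as a polynomial fit of per-token latency against concurrent batch size; the quadratic degree and profiling numbers are illustrative assumptions.

```python
import numpy as np

def fit_per_token_latency(batch_sizes, latencies_ms, degree=2):
    """Polynomial fit of per-token latency vs. concurrent batch size,
    capturing the bend in the mapping as the server approaches
    saturation (degree=2 is an assumption, not a measured law)."""
    return np.polyfit(batch_sizes, latencies_ms, degree)

def tokens_for_budget(budget_ms, batch_size, coef):
    """Token budget under the current load, looked up at dispatch time."""
    per_token = np.polyval(coef, batch_size)
    return int(budget_ms / per_token)
```

The same wall-clock budget thus yields a smaller token budget when the server is loaded, which is exactly the adjustment a load-oblivious linear mapping misses.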
These frameworks are extensible to multi-dimensional optimization, e.g., trading off quantization level and early stopping along with cache-control (Fan et al., 26 Dec 2025), and can be integrated with per-query or per-user scheduling (Huang et al., 11 Sep 2025).
6. Connections to Broader Research and Future Directions
Time-budgeted decoding intersects with several active research frontiers:
- Multi-Knob Dynamic Inference: Integrating adaptive quantization, speculative decoding, and layer skipping with time-aware routing (Fan et al., 26 Dec 2025).
- Learning-Based Routing: Deploying shallow policy networks for real-time selection of configuration parameters under fluctuating latency and load (Fan et al., 26 Dec 2025, Huang et al., 11 Sep 2025).
- Progressive and Chunked Generation: Decoding in blocks with dynamic per-block time allocation to further optimize throughput and deadline adherence (Fan et al., 26 Dec 2025).
- Revision-Aware Metrics for Simultaneous and Streaming Systems: Incorporating lag, revision, and correction rates (e.g., Revision-Aware Lagging [RAL]) for measuring adherence and perceptual quality in streaming and real-time outputs (Zheng et al., 2020).
- Resource-Contingent Reasoning: Allowing user interruption or continuation based on predicted or observed latency, with immediate realization of partial progress (Li et al., 16 May 2025).
A plausible implication is that TimeBill and its variants will form the substrate of LLM deployment in latency-critical, resource-constrained, and real-time settings, as continual improvements in prediction accuracy, dynamic adaptation, and multitarget optimization are realized. Robust enforcement of compute or time budgets is increasingly viewed as foundational to safe, predictable, and scalable LLM deployment (Fan et al., 26 Dec 2025, Wen et al., 24 Aug 2025).