Two-Stage Inference-Time Budget Control

Updated 4 July 2026
  • Two-Stage Inference-Time Budget Control is a framework that divides inference into a planning phase and a selective execution phase to enforce explicit resource budgets.
  • It dynamically allocates and regulates resources such as tokens, latency, and compute across processing stages to optimize performance.
  • The approach integrates diverse techniques—including runtime prediction, supervised fine-tuning, and reinforcement learning—to balance efficiency and accuracy under constrained budgets.

Two-stage inference-time budget control denotes a family of methods that separates budgeted inference into an upstream assessment, screening, or planning step and a downstream execution step that spends a constrained resource more selectively. In the recent literature, the controlled resource may be reasoning tokens, total generation length, sampled reasoning paths, wall-clock latency, executed Transformer layers, model capacity by phase, or the number of expensive response-model calls. The topic is heterogeneous: some systems implement a literal two-part runtime pipeline, such as predict-then-actuate latency control or cheap-answer-then-escalate serving, whereas others are “two-stage” only in training and should not be misread as two-phase inference algorithms (Fan et al., 26 Dec 2025, Brown et al., 1 Feb 2026, Yuan et al., 30 Apr 2026, Zhou et al., 8 Jun 2026, Wen et al., 24 Aug 2025).

1. Resource models and formal objectives

A first unifying feature is that the literature makes the controlled budget explicit. In token-budgeted reasoning, the controlled quantity is often the length of the intermediate reasoning process. BudgetThinker formulates budget-aware reasoning for chain-of-thought generation with a target budget $B$: the model should adapt its reasoning to finish within that budget while preserving as much task accuracy as possible, and the budget is enforced over the generated reasoning segment before the final answer (Wen et al., 24 Aug 2025). SelfBudgeter uses an explicit textual interface, a <budget>...</budget> field followed by a <solution>...</solution> field, so that the model first emits a token budget and then a solution, treating output length as a proxy for inference cost and user wait time (Li et al., 16 May 2025). BET uses a related but continuous interface in which the budget is measured in reasoning tokens inside the <think> block, and the maximum completion length is $L_{\max}=16{,}384$ (Zhou et al., 12 May 2026).

Other papers formalize different resources. TimeBill treats the resource as end-to-end wall-clock latency per request and writes the time-budgeted inference problem as maximizing a response-performance metric $\mathcal M(\hat{\mathbf y}(\theta), \mathbf y)$ subject to $t_{\mathrm{e2e}}(x,\theta)\le T$ and $N\le N_{\max}$, where $T$ is the request deadline and $\theta$ is instantiated as a KV-cache eviction ratio $\alpha$ (Fan et al., 26 Dec 2025). Adaptive Test-Time Compute Allocation instead assumes a discrete action set $\mathcal B=\{b_1,\ldots,b_K\}$ and optimizes expected accuracy under an average compute budget,

$$V(B)=\max_{\pi\in\Pi} \mathbb{E}_{x\sim D}\big[\mathrm{Acc}(x,\pi(x))\big]\quad \text{s.t.}\quad \mathbb{E}_{x\sim D}\big[C(\pi(x))\big]\le B,$$

so the budget is global and population-level rather than per request (Zhai et al., 16 Apr 2026). ROI-Reasoning moves one step further and allocates a hard total token budget across an ordered set of multiple problems, making early over-spending directly detrimental to later tasks (Zhao et al., 7 Jan 2026).

Several systems budget internal compute rather than output length. BUDDY uses a normalized depth budget that is converted to an integer number of executed middle layers, and it always executes the first and last Transformer blocks (Zhou et al., 8 Jun 2026). DDC jointly budgets sampling width and reasoning depth, minimizing total generated tokens subject to a target reliability constraint on the final consensus answer (Xu et al., 14 May 2026). Star Elastic controls compute by switching among nested submodels between the explicit thinking phase and the answering phase, so the budget becomes phase-specific model capacity rather than a single monolithic scale factor (Taghibakhshi et al., 8 May 2026).

This diversity suggests that “budget control” is not tied to one metric. The common structure is selective resource allocation under explicit constraints, but the resource may be tokens, latency, paths, layers, experts, or model scale.

2. Canonical runtime architectures

Across the literature, several recurrent two-stage architectures appear. One pattern is predict-then-actuate. TimeBill first predicts latent workload through a Response Length Predictor and an Execution Time Estimator, then chooses the smallest KV-cache eviction ratio $\alpha$ whose predicted worst-case latency satisfies the time budget (Fan et al., 26 Dec 2025). Predictive Scheduling first runs a lightweight predictor, either a hidden-state MLP or a LoRA-based classifier, and then allocates a fixed total token budget across queries by a greedy allocator or by exhaustive search over difficulty-tier budgets (Brown et al., 1 Feb 2026). Adaptive Test-Time Compute Allocation first solves the constrained allocation problem offline by Lagrangian relaxation to obtain oracle budget labels, then trains a lightweight classifier to predict those oracle actions from cheap features at deployment (Zhai et al., 16 Apr 2026). Veroic first produces a cheap default response, then uses verifiable observations of that response to decide whether to accept it or trigger a higher-cost inference pathway (Yuan et al., 30 Apr 2026).
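The predict-then-actuate pattern can be illustrated with a minimal sketch. It is not TimeBill's implementation: the linear latency model, its coefficients, the candidate ratio grid, the safety factor, and the function names are all assumptions made for this example.

```python
from typing import Optional, Sequence


def predicted_latency_ms(prompt_tokens: int, response_tokens: int, alpha: float,
                         prefill_ms_per_token: float = 0.05,
                         decode_ms_per_cached_token: float = 0.002) -> float:
    """Hypothetical analytical latency model (assumption, not from the paper).

    Prefill cost is linear in prompt length; each decode step costs time
    proportional to the KV-cache entries kept after evicting a fraction alpha.
    """
    kept_entries = prompt_tokens * (1.0 - alpha)
    prefill = prefill_ms_per_token * prompt_tokens
    decode = response_tokens * decode_ms_per_cached_token * kept_entries
    return prefill + decode


def choose_eviction_ratio(prompt_tokens: int,
                          predicted_response_tokens: int,
                          deadline_ms: float,
                          candidates: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 0.9),
                          safety_factor: float = 1.2) -> Optional[float]:
    """Stage 2: return the smallest candidate ratio that meets the deadline.

    The stage-1 length prediction is inflated by safety_factor to obtain a
    worst-case estimate. Returns None when no candidate is feasible.
    """
    worst_case_tokens = round(predicted_response_tokens * safety_factor)
    for alpha in sorted(candidates):
        if predicted_latency_ms(prompt_tokens, worst_case_tokens, alpha) <= deadline_ms:
            return alpha
    return None
```

Scanning candidates in ascending order reflects the stated preference for the smallest eviction ratio, on the assumption that less eviction preserves more response quality.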

A second pattern is internal pre-reasoning planning within one autoregressive pass. SelfBudgeter makes the model emit <budget>...</budget> before <solution>...</solution>, so the model first predicts the reasoning budget and then spends it (Li et al., 16 May 2025). BET similarly emits a <predict> block with solvability and budget before the <think> block, and trains explicit “short solve,” “nice fold,” and “hero call” behaviors (Zhou et al., 12 May 2026). ROI-Reasoning forces pre-computation meta-cognition via <predicted_level>Level-k</predicted_level> tags, where Level-3 marks a problem as too difficult and tells the model to skip reasoning and give a fixed fallback answer (Zhao et al., 7 Jan 2026). These methods do not separate planning and execution into different models, but they still implement a genuine two-part runtime behavior: estimate first, reason second.
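A budget-first output of this kind can be checked mechanically. The sketch below parses a <budget>...</budget><solution>...</solution> response and tests whether the solution stays within the declared budget. The integer-only budget format, the whitespace token counter, and the function names are assumptions for illustration; a real system would count tokens with the model's own tokenizer.

```python
import re
from typing import Callable, Optional, Tuple

# Assumed format: an integer budget followed by the solution text.
_BUDGET_FIRST = re.compile(
    r"<budget>\s*(\d+)\s*</budget>\s*<solution>(.*?)</solution>",
    re.DOTALL,
)


def parse_budget_first(output: str) -> Optional[Tuple[int, str]]:
    """Return (declared_budget, solution_text), or None if the format is violated."""
    match = _BUDGET_FIRST.search(output)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


def within_budget(output: str,
                  count_tokens: Callable[[str], int] = lambda s: len(s.split()),
                  tolerance: float = 0.0) -> bool:
    """True if the solution length is at most budget * (1 + tolerance).

    count_tokens defaults to whitespace splitting, a stand-in for a tokenizer.
    """
    parsed = parse_budget_first(output)
    if parsed is None:
        return False
    budget, solution = parsed
    return count_tokens(solution) <= budget * (1.0 + tolerance)
```

In a controlled mode, the caller would pre-fill the <budget> field and let the model generate only the solution; the same check then measures adherence to the user-specified value.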

A third pattern is phase-specific execution. Star Elastic preserves an explicit thinking phase followed by an answering phase and allows a different nested submodel in each phase; the paper compares several thinking/answering pairings and reports one of them as the best default (Taghibakhshi et al., 8 May 2026).

Important exceptions clarify the terminology. BudgetThinker is frequently grouped with two-stage budget control, but its “two-stage” label refers to supervised fine-tuning followed by reinforcement learning; inference itself is a single decoding process with periodic control-token insertion (Wen et al., 24 Aug 2025). ORBIT is likewise multi-stage in training and exposes explicit Low, Mid, High, and Xhigh reasoning modes at inference time, but it does not learn the stage-1 mode selector from input (Liang et al., 13 Jan 2026).

3. Token-budgeted reasoning and controllable chain-of-thought

BudgetThinker is a direct attempt to make a reasoning model obey a user-specified token budget during inference. Its central mechanism is a fixed set of control tokens that are injected deterministically during decoding at budget-fraction milestones. The backbone model conditions on these inserted tokens exactly as on normal prior context. Training uses a two-stage pipeline: supervised fine-tuning on a 41k-example reasoning dataset, followed by GRPO with a composite reward combining correctness, format, and a length-aware term that penalizes overshooting much more strongly than undershooting. The RL curriculum first steps through progressively smaller budgets and then trains on a mixture of sampled budgets. In evaluation on MATH-500, AMC 2023, and AIME 2024, the abstract reports an average accuracy improvement of 4.9% across tested budgets, while the main text reports average gains of 4.2% over the original models and 5.7% over ThinkPrune on MATH-500 and AMC 2023, together with improved budget following ratio and budget utilization ratio (Wen et al., 24 Aug 2025).
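The deterministic insertion of control tokens at budget-fraction milestones can be sketched as follows. The number of milestones, the token strings, and the function names are hypothetical, and the loop replays a fixed token list purely to show where insertions land; in real decoding the model would condition on each inserted token before producing the next one.

```python
from typing import Dict, List


def milestone_positions(budget: int, num_milestones: int) -> Dict[int, str]:
    """Map reasoning-token counts to control tokens at evenly spaced fractions.

    With num_milestones = 3 and budget = 8, markers fall after tokens 2, 4, 6.
    The marker strings are invented for this sketch.
    """
    parts = num_milestones + 1
    positions: Dict[int, str] = {}
    for i in range(1, num_milestones + 1):
        position = (budget * i) // parts
        positions[position] = f"<remaining_{parts - i}/{parts}>"
    return positions


def decode_with_control_tokens(generated: List[str], budget: int,
                               num_milestones: int) -> List[str]:
    """Replay a token stream, inserting markers and truncating at the budget."""
    marks = milestone_positions(budget, num_milestones)
    output: List[str] = []
    for step, token in enumerate(generated[:budget], start=1):
        output.append(token)
        if step in marks:
            output.append(marks[step])  # inserted by the runtime, not sampled
    return output
```

Because the markers are appended by the runtime rather than sampled, they add a small fixed overhead that does not count against the reasoning budget in this sketch.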

SelfBudgeter uses a more explicit budget-first protocol. Stage 1 teaches the model to output a numeric budget before the solution; Stage 2 applies budget-guided GRPO with a budget penalty and a Precise Budget Control Reward. In autonomous mode, the model predicts its own budget; in controlled mode, the user can pre-fill the <budget> field directly. The paper reports that SelfBudgeter achieves up to 74.47% response length compression on MATH at the cost of a 2.16-point accuracy drop relative to the baseline, and that on GSM8K its best variant improves accuracy from 78.32% to 81.50% while reducing average tokens from 1737.92 to 662.08 (Li et al., 16 May 2025).

BET generalizes token budgeting beyond length compression by tying budget allocation to policy-dependent solvability. It estimates current-policy solvability by Monte Carlo rollouts, defines an efficient solution cost from the shortest correct trajectories, and learns three named behaviors: short solve, nice fold, and hero call. The fold action is explicit, via \boxed{<Unsolvable>}, and is rewarded only when rollout-derived solvability is effectively zero. Across seven benchmarks and three base models, BET reduces reasoning tokens by about 55% on average while improving overall performance, and it transfers zero-shot from mathematical reasoning to scientific QA and logical reasoning (Zhou et al., 12 May 2026).

This line of work shows two distinct conceptions of inference-time budget control. One conception prioritizes budget adherence: BudgetThinker and SelfBudgeter are explicit about matching or approximately matching a target budget. The other prioritizes computational return on investment: BET and ROI-Reasoning emphasize deciding whether a problem is worth solving at all under a shared token cap (Zhao et al., 7 Jan 2026). Budget-Aware Anytime Reasoning is adjacent but importantly different. It does not learn an online controller; instead it defines the Anytime Index and uses LLM-synthesized preference pairs at fixed token budgets to improve the quality of partial solutions under interruption (Zhang et al., 16 Jan 2026).

4. Latency, depth, routing, and phase-specific model selection

TimeBill is the most explicit two-stage latency controller in this literature. Stage 1 consists of an SLM-based Response Length Predictor (RLP) and a workload-guided analytical Execution Time Estimator (ETE). The predictor maps a prompt to a bucketed response-length estimate, and the ETE models prefill and decode cost analytically, then inflates the prediction to a worst-case estimate. Stage 2 solves for the smallest eviction ratio $\alpha$ that meets the request deadline $T$. The paper reports that the RLP with 512 buckets achieves MAE 42.71 and RMSE 78.13; the ETE reaches a Mean Absolute Percentage Error of 1.22% for prefill and 1.69% for decoding-step estimation; and TimeBill achieves the highest average response performance scores among tested approaches while keeping its task completion rate on par with a compared baseline (Fan et al., 26 Dec 2025).

BUDDY brings the same predict-then-actuate idea inside the Transformer. Its optional Budget Predictor chooses an input-dependent compute level when the user does not specify one, and the Decision Module then scores middle layers and deterministically executes the top-scoring subset consistent with the budget. The selected mask can be recomputed at every decoding step using first-layer KV cache information as a lightweight global context source. On Llama3-8B, the paper reports that a single multi-budget model retains approximately 99.9% of original accuracy at 12.5% sparsity, 90.8% at 25%, 80.7% at 37.5%, and 71.3% at 50%, while also supporting strict budget control and decode-time rerouting (Zhou et al., 8 Jun 2026).
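Budgeted layer selection of this kind reduces to a top-k choice over middle-layer scores. The sketch below assumes that the budget is the fraction of middle layers to execute and that it is converted to a count by flooring; the conversion rule, the scores, and the function name are illustrative assumptions rather than BUDDY's actual routing rule.

```python
from typing import List, Sequence


def select_layers(scores: Sequence[float], budget: float) -> List[int]:
    """Return the sorted indices of Transformer blocks to execute.

    scores: one relevance score per block (hypothetical Decision Module output).
    budget: fraction in [0, 1] of *middle* blocks to run (assumed semantics).
    The first and last blocks are always executed.
    """
    num_blocks = len(scores)
    if num_blocks < 2:
        raise ValueError("need at least a first and a last block")
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must lie in [0, 1]")

    middle = list(range(1, num_blocks - 1))
    keep = int(budget * len(middle))  # assumed conversion: floor
    top = sorted(middle, key=lambda i: scores[i], reverse=True)[:keep]
    return sorted([0, *top, num_blocks - 1])
```

Calling this function once per decoding step with fresh scores would correspond to decode-time rerouting; calling it once per request would fix the mask for the whole generation.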

Star Elastic addresses a different axis: model capacity by reasoning phase. It turns one parent model into a nested family of submodels and preserves the standard reasoning protocol of a token-bounded thinking phase followed by an answering phase. The system can therefore run, for example, a smaller model during the long token-heavy thinking phase and a larger model during the short accuracy-critical answering phase. The paper identifies a best default thinking/answering pairing and reports that dynamic per-phase model selection yields up to 16% higher accuracy and 1.9× lower latency. The reported experiments still recompute cache states when switching models, so the measured timings already include switching overhead (Taghibakhshi et al., 8 May 2026).

Budget-Aware Value Tree (BAVT) extends budget control from pure decoding to tool-augmented search. It is not an explicit hard two-stage scheduler, but it naturally decomposes into early broad exploration and late focused exploitation through the remaining-budget ratio. As this ratio falls, the node-selection distribution sharpens from diffuse sampling toward near-greedy exploitation, and a budget backstop forces answer generation when resources are nearly exhausted (Li et al., 13 Mar 2026).
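One way to picture this annealing is a softmax over node values whose temperature shrinks with the remaining budget. The sketch below is an illustration under stated assumptions, not the paper's formula: it assumes the ratio is remaining budget divided by total budget, uses that ratio directly as the temperature with a small floor, and uses a hypothetical backstop threshold.

```python
import math
from typing import List, Sequence


def selection_distribution(values: Sequence[float], remaining: float, total: float,
                           min_temperature: float = 0.05) -> List[float]:
    """Softmax over node values with temperature tied to the remaining budget.

    A high ratio gives a diffuse distribution (exploration); a low ratio gives
    a sharply peaked one (near-greedy exploitation).
    """
    ratio = max(0.0, min(1.0, remaining / total))
    temperature = max(min_temperature, ratio)
    peak = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp((v - peak) / temperature) for v in values]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


def must_answer_now(remaining: float, total: float,
                    backstop_ratio: float = 0.1) -> bool:
    """Budget backstop: force answer generation when the budget is nearly spent."""
    return remaining / total <= backstop_ratio
```

The absence of a hard switch point in this sketch mirrors the description of the mechanism as continuous annealing rather than a discrete stage boundary.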

5. Shared-budget allocation, escalation, and deferral

Some of the clearest two-stage controllers allocate a shared budget across many units rather than controlling one request in isolation. Predictive Scheduling assumes a fixed total token budget over a batch of GSM8K queries. Stage 1 predicts either a full early-stopping probability vector or a coarse difficulty class before any full reasoning trace is generated. Stage 2 then allocates 16-token windows greedily across queries, or solves a discrete class-level allocation problem over easy, medium, and hard budgets. The paper reports up to 7.9 percentage points of absolute accuracy gain over uniform budgeting at identical token cost, and finds that middle transformer layers 12–17 are most informative for hidden-state prediction, with layer 16 reaching Pearson correlation 0.742 (Brown et al., 1 Feb 2026).
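The greedy variant of stage 2 can be sketched with a priority queue over marginal gains. The 16-token window comes from the description above; the shape of the predictor output (a per-query curve of predicted success probability by window count), the stopping rule, and the function name are assumptions for illustration.

```python
import heapq
from typing import List, Sequence, Tuple

WINDOW = 16  # tokens per allocation unit


def greedy_allocate(success_curves: Sequence[Sequence[float]],
                    total_tokens: int) -> List[int]:
    """Allocate token windows to queries by largest predicted marginal gain.

    success_curves[q][w] is the (hypothetical) predicted probability that
    query q is solved when given w windows, for w = 0, 1, 2, ...
    Returns the number of tokens assigned to each query.
    """
    windows_left = total_tokens // WINDOW
    windows = [0] * len(success_curves)

    # Max-heap via negated gains: (negative marginal gain, query index).
    heap: List[Tuple[float, int]] = []
    for q, curve in enumerate(success_curves):
        if len(curve) > 1:
            heapq.heappush(heap, (-(curve[1] - curve[0]), q))

    while windows_left > 0 and heap:
        negative_gain, q = heapq.heappop(heap)
        if -negative_gain <= 0:
            break  # no query is predicted to benefit from one more window
        windows[q] += 1
        windows_left -= 1
        curve, w = success_curves[q], windows[q]
        if w + 1 < len(curve):
            heapq.heappush(heap, (-(curve[w + 1] - curve[w]), q))

    return [w * WINDOW for w in windows]
```

With curves `[0.0, 0.9, 0.95, 0.96]` for an easy query and `[0.0, 0.1, 0.5, 0.8]` for a hard one, a 64-token budget gives the easy query one window and the hard query three, whereas uniform allocation would give each two.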

Adaptive Test-Time Compute Allocation formalizes the same idea more abstractly. In the Solve stage, Lagrangian relaxation yields a per-instance oracle action, and a binary search over the Lagrange multiplier targets the desired average budget through the monotone oracle-induced cost curve. In the Learn stage, a lightweight classifier predicts those oracle actions from 16 cheap features. In self-consistency experiments on MATH and GSM8K, the method achieves up to 12.8% relative accuracy improvement on MATH under matched budget constraints and over 91% imitation accuracy while closely tracking the oracle upper bound (Zhai et al., 16 Apr 2026).
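The Solve stage can be sketched directly from the constrained objective. Under a standard Lagrangian relaxation with multiplier λ ≥ 0, each instance independently picks the action maximizing accuracy minus λ times cost, and λ is tuned by bisection until the average cost meets the budget. The per-instance accuracy table, the search bounds, and the function names are assumptions for illustration; the paper's exact notation is not reproduced here.

```python
from typing import List, Sequence, Tuple


def oracle_actions(acc: Sequence[Sequence[float]], costs: Sequence[float],
                   lam: float) -> List[int]:
    """Per-instance argmax of acc[i][k] - lam * costs[k].

    Ties resolve to the lowest action index, which is the cheapest action
    when costs are listed in ascending order.
    """
    return [max(range(len(costs)), key=lambda k: row[k] - lam * costs[k])
            for row in acc]


def average_cost(actions: Sequence[int], costs: Sequence[float]) -> float:
    return sum(costs[k] for k in actions) / len(actions)


def solve_multiplier(acc: Sequence[Sequence[float]], costs: Sequence[float],
                     budget: float, lo: float = 0.0, hi: float = 10.0,
                     iters: int = 50) -> Tuple[float, List[int]]:
    """Bisect for the smallest multiplier whose oracle policy fits the budget.

    Assumes the average cost is non-increasing in lam and that the initial
    upper bound hi is large enough to be feasible.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if average_cost(oracle_actions(acc, costs, mid), costs) <= budget:
            hi = mid
        else:
            lo = mid
    return hi, oracle_actions(acc, costs, hi)
```

The returned action list plays the role of the oracle labels; the Learn stage would then fit a classifier from cheap features to these labels, which this sketch does not cover.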

ROI-Reasoning applies shared-budget control to ordered multi-problem reasoning. It treats a three-problem “test paper” under a strict global token cap as an Ordered Stochastic Multiple-Choice Knapsack Problem, uses Meta-Cognitive Fine-Tuning to emit <predicted_level>Level-k</predicted_level> tags before reasoning, and then uses Rationality-Aware Reinforcement Learning to optimize solve-or-skip decisions and reasoning length under the hard cap. Under the Hard/512 setting, the paper reports Score 0.93 for MFT+RARL versus 0.81 for MFT and 0.15 for the base model, with corresponding Regret values of 0.16, 0.11, and 2.73 (Zhao et al., 7 Jan 2026).

Veroic treats repeated cheap-answer-then-escalate decisions as a long-horizon partially observable control problem. The latent state is whether the default response is reliable, the belief state is updated from verifiable hard and soft signals, and the policy decides between accepting the default output and triggering a stronger inference pathway, subject to a discounted budget constraint on escalation cost. In CMIS settings such as escalation from LLaMA-3.1-8B to LLaMA-3.1-70B, Veroic improves task quality, calibration, and long-horizon robustness relative to thresholding baselines (Yuan et al., 30 Apr 2026).
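The belief-state part of this loop can be illustrated with a single Bayesian update on one binary verification signal. This is a deliberately simplified sketch: the likelihood values, the fixed threshold, the escalation counter standing in for the discounted budget, and the function names are all assumptions, and the one-step threshold rule shown here is closer to the baselines than to Veroic's long-horizon policy.

```python
def update_belief(prior: float, signal_passed: bool,
                  p_pass_if_reliable: float = 0.9,
                  p_pass_if_unreliable: float = 0.3) -> float:
    """Posterior probability that the default response is reliable.

    Applies Bayes' rule to one binary verifiable signal; the two likelihoods
    are hypothetical calibration constants.
    """
    like_reliable = p_pass_if_reliable if signal_passed else 1.0 - p_pass_if_reliable
    like_unreliable = p_pass_if_unreliable if signal_passed else 1.0 - p_pass_if_unreliable
    numerator = like_reliable * prior
    return numerator / (numerator + like_unreliable * (1.0 - prior))


def decide(belief: float, escalations_left: int, threshold: float = 0.5) -> str:
    """Accept the cheap output unless belief is low and budget remains."""
    if belief < threshold and escalations_left > 0:
        return "escalate"
    return "accept"
```

A policy that plans over the horizon would also weigh how many future requests the remaining budget must cover, which a fixed threshold ignores.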

HIA is similar in spirit but works in black-box prompt optimization. It generates a pool of candidate prompt modifications, scores them with heuristic reward models, keeps only the top-scoring candidates, and spends expensive response-model calls only on that filtered subset. In the low-budget regime, the paper reports HelpSteer single-objective goal completion improving from 24.00 to 31.00 for BoN+H relative to BoN+Random, a 29% relative improvement (Nakamura et al., 7 Aug 2025). An adjacent but importantly different line is Budgeted Multiple-Expert Deferral, which uses selective expert-cost querying to reduce training-time cost when learning a two-stage router; its budget mechanism is not inference-time deployment control in the strict sense (DeSalvo et al., 30 Oct 2025).
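The filter-then-spend structure can be sketched generically. The scoring functions, candidate strings, and function name below are placeholders for illustration; they do not reproduce HIA's reward models or prompt-modification procedure.

```python
from typing import Callable, Sequence, Tuple


def filter_then_spend(candidates: Sequence[str],
                      cheap_score: Callable[[str], float],
                      expensive_eval: Callable[[str], float],
                      keep: int) -> Tuple[str, float, int]:
    """Rank candidates with a cheap heuristic, then evaluate only the top `keep`.

    Returns (best candidate, its expensive score, number of expensive calls).
    """
    ranked = sorted(candidates, key=cheap_score, reverse=True)
    shortlist = ranked[:keep]
    # Expensive calls are spent only on the shortlist.
    results = [(expensive_eval(c), c) for c in shortlist]
    best_value, best = max(results)
    return best, best_value, len(shortlist)
```

The number of expensive calls is capped at `keep` regardless of pool size, so the quality of the outcome depends on how well the cheap heuristic correlates with the expensive score: a candidate the heuristic ranks low is never evaluated, even if it would have scored best.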

6. Evaluation patterns, misconceptions, and limitations

The literature evaluates two-stage budget control with several recurring metrics. Token-budgeted reasoners report accuracy under a maximum budget, budget following ratio, budget utilization ratio, and length-compression figures (Wen et al., 24 Aug 2025, Li et al., 16 May 2025). Latency-oriented systems report completion rate, average response score, and deadline-sensitive performance under overrun strategies such as Kill and Skip-Next (Fan et al., 26 Dec 2025). Shared-budget allocators use score, regret, and oracle imitation accuracy (Zhao et al., 7 Jan 2026, Zhai et al., 16 Apr 2026). Belief-based escalators add Brier score, NLL, ECE, low-quality occurrence, CVaR, and recovery delay (Yuan et al., 30 Apr 2026). Anytime reasoning adds the Anytime Index to quantify how quickly quality improves with additional reasoning tokens (Zhang et al., 16 Jan 2026).

Several misconceptions recur. First, “two-stage” does not always mean a two-phase inference controller. BudgetThinker’s two stages are supervised fine-tuning and reinforcement learning; inference remains a single autoregressive process with periodic control-token insertion (Wen et al., 24 Aug 2025). ORBIT is multi-stage in training and exposes explicit budget modes at inference, but it does not learn a stage-1 mode selector from input (Liang et al., 13 Jan 2026). BAVT is often interpretable as early exploration followed by late exploitation, but the paper’s mechanism is continuous budget annealing rather than a hard stage boundary (Li et al., 13 Mar 2026). By contrast, TimeBill, Predictive Scheduling, Veroic, SelfBudgeter, BET, and Star Elastic do implement unmistakable runtime decompositions (Fan et al., 26 Dec 2025, Brown et al., 1 Feb 2026, Yuan et al., 30 Apr 2026, Li et al., 16 May 2025, Zhou et al., 12 May 2026, Taghibakhshi et al., 8 May 2026).

Second, most budgets are proxies rather than exact deployment costs. BudgetThinker explicitly notes that token budget is only a proxy for actual latency and that inserted control tokens add small overhead (Wen et al., 24 Aug 2025). BUDDY gives strict executed-layer-count control, but the paper reports that realized speedups are not perfectly proportional because routing and gather/scatter overhead can offset savings at light pruning (Zhou et al., 8 Jun 2026). TimeBill’s ETE coefficients are hardware- and implementation-dependent and must be recalibrated on new systems (Fan et al., 26 Dec 2025). Star Elastic currently recomputes cache states on model switches, so the present latency gains are conservative with respect to future cache-reuse-capable runtimes (Taghibakhshi et al., 8 May 2026).

Third, more inference-time compute is not uniformly beneficial. The robustness study based on budget forcing shows that in hidden chain-of-thought settings, larger reasoning budgets can improve robustness for prompt injection and prompt extraction, but when intermediate reasoning is exposed or tool-integrated, increased inference-time computation can reduce robustness, with an inverse scaling law formalized through the monotone growth of malicious-token exposure risk with chain length (Wu et al., 21 Jul 2025). This is a genuine controversy for deployment: a second budget stage can improve performance under one threat model and worsen security under another.

Finally, two-stage control inherits the failure modes of both stages. Predictors can be miscalibrated, as shown by the crossing point where fine-grained size-based scheduling can underperform uniform allocation at larger budgets (Brown et al., 1 Feb 2026). Belief-based escalation depends on the informativeness of black-box verifiable signals (Yuan et al., 30 Apr 2026). Preference-data prompting improves partial-solution quality but still lacks an explicit online stopping rule (Zhang et al., 16 Jan 2026). These limitations do not negate the framework; they delimit its current operational regimes.

Taken together, the literature defines two-stage inference-time budget control less as one algorithm than as a design principle: estimate, screen, or plan before spending scarce inference resources, then execute under an explicit budget with mechanisms that preserve accuracy, calibration, or utility as much as possible. The principal open distinction is no longer whether to control inference budgets, but which budget is being controlled, where the control signal is computed, and whether the two-stage decomposition occurs in training, in inference, or in both.
