Dual-Phase Test-Time Scaling

Updated 3 July 2026

Dual-Phase Test-Time Scaling is a framework that splits inference into two phases, each addressing distinct uncertainty or reasoning challenges.
It allocates compute efficiently by performing a lightweight initial pass and invoking deeper processing only for difficult instances.
The methodology has shown improved accuracy and adaptive resource usage, with applications in CTR prediction, multi-step LLM reasoning, and code synthesis.

Dual-Phase Test-Time Scaling refers to a class of inference-time frameworks for machine learning systems—especially LLMs and differentiable prediction architectures—where the computation spent during prediction is split across two distinct, complementary phases. Each phase targets different dimensions of uncertainty, reasoning, or coverage. Dual-phase scaling is motivated by the need to (1) maximize the efficiency of inference, (2) adapt compute allocation to instance difficulty, and (3) robustly enhance coverage and accuracy without additional training. Approaches are highly application-dependent, spanning CTR prediction, multi-step LLM reasoning, code synthesis, and out-of-distribution generalization. Representative paradigms include selective inference (UTTSI), Markovian decomposition-contraction (Atom of Thoughts), verifier-guided hybrid scaling, reward-guided planning/execution separation, and latent-space fast-slow adaptation.

1. Motivations and Core Principles

Dual-phase test-time scaling frameworks are grounded in the insight that the limitations of single-pass or single-phase inference manifest as poor generalization on rare feature patterns, error accumulation in long-horizon reasoning, or suboptimal compute allocation. The core principles underlying dual-phase scaling are:

Separation of Uncertainty Sources or Reasoning Roles: Each phase targets different aspects—e.g., “fast” feature filtering vs. “deep” stochastic exploration in CTR (Zhang et al., 24 May 2026), or separating planning from execution in sequential reasoning (Cui et al., 29 Sep 2025).
Efficiency in Compute Allocation: By adaptively assigning deeper computational pathways only to “hard” instances identified in an initial phase, such methods maintain overall efficiency and control worst-case latency.
Modularity and Model-Agnosticism: In frameworks such as UTTSI, no model weights are updated, and the dual-phase wrapper is compatible with a broad range of differentiable backbones.

This structure stands in contrast to single-phase approaches that allocate uniform compute or rely on static, fixed selection policies.

2. Dual-Phase Methodologies Across Domains

Dual-phase scaling admits several architectural and algorithmic instantiations:

(a) Selective Inference with Uncertainty-Triggered Exploration

UTTSI (Uncertainty-Triggered Test-Time Selective Inference) for click-through rate (CTR) prediction (Zhang et al., 24 May 2026) exemplifies dual-phase scaling in industrial tabular models:

Phase I: Lightweight forward-backward pass, dual-signal uncertainty estimation (combining logit magnitude and frequency priors), adaptive per-instance feature filtering, and a “refined” inference using only reliable features.
Phase II: For uncertain examples (quantified via the phase I uncertainty score), stochastic feature-path exploration is performed via masking and Bernoulli sampling; predictions from $K(x)$ sampled feature subsets are ensembled using consistency-weighted aggregation.
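The two-phase control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the UTTSI implementation: `toy_ctr_model`, its weights, the per-feature `reliability` scores, the 0.5 cut-offs, and the fixed `k` are all hypothetical, the uncertainty score is passed in rather than estimated, and a plain mean stands in for the consistency-weighted aggregation.

```python
import math
import random
from typing import Optional, Sequence, Tuple


def toy_ctr_model(features: Sequence[float], mask: Sequence[int]) -> float:
    """Hypothetical stand-in for a frozen CTR backbone: logistic score over unmasked features."""
    weights = [0.8, -0.5, 0.3, 0.1]  # illustrative fixed weights, not taken from the paper
    logit = sum(w * f * m for w, f, m in zip(weights, features, mask))
    return 1.0 / (1.0 + math.exp(-logit))


def dual_phase_predict(
    features: Sequence[float],
    reliability: Sequence[float],   # assumed per-feature reliability in [0, 1]
    uncertainty: float,             # assumed Phase I uncertainty score in [0, 1]
    threshold: float = 0.5,
    k: int = 8,
    rng: Optional[random.Random] = None,
) -> Tuple[float, int]:
    """Return (prediction, number of forward passes used)."""
    rng = rng or random.Random(0)

    # Phase I: one refined pass that keeps only features judged reliable.
    base_mask = [1 if r >= 0.5 else 0 for r in reliability]
    refined = toy_ctr_model(features, base_mask)
    if uncertainty <= threshold:
        return refined, 1  # confident instance: stop after the cheap pass

    # Phase II: sample k Bernoulli feature masks (keep probability = reliability)
    # and ensemble the resulting predictions.
    preds = []
    for _ in range(k):
        mask = [1 if rng.random() < r else 0 for r in reliability]
        preds.append(toy_ctr_model(features, mask))
    return sum(preds) / len(preds), 1 + k
```

Calling `dual_phase_predict(x, rel, uncertainty=0.1)` uses a single pass, while `uncertainty=0.9` with `k=4` uses five; the extra cost falls only on instances flagged as uncertain.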

(b) Decomposition-Contraction: Markovian Reasoning

Atom of Thoughts (AoT) (Teng et al., 17 Feb 2025) defines a two-phase loop for LLM reasoning:

Phase 1 (DAG Decomposition): The model decomposes the complex query into a directed acyclic graph (DAG) of atomic subquestions, partitioned into independent and dependent nodes.
Phase 2 (Contraction): Answers to independent subquestions are integrated as “axioms,” and dependent subquestions contracted to a new, self-contained query. This Markov process continues until an irreducible atomic question is reached.

This approach ensures answer-equivalence invariance across phases and naturally integrates with test-time scaling methods like self-consistency or Tree-of-Thoughts.
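The decomposition-contraction loop can be illustrated with a toy in which arithmetic expressions stand in for questions and exact evaluation stands in for the LLM. This is only an analogy under that assumption, and the names `is_atomic`, `Contract`, and `aot_solve` are hypothetical. Subexpressions whose operands are both literals play the role of independent subquestions; each contraction replaces them with their answers and yields a new, self-contained expression with the same value.

```python
import ast
import operator

# Supported binary operators for the toy "questions".
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def is_atomic(node: ast.AST) -> bool:
    """An independent subquestion: a binary op whose operands are both literals."""
    return (
        isinstance(node, ast.BinOp)
        and isinstance(node.left, ast.Constant)
        and isinstance(node.right, ast.Constant)
    )


class Contract(ast.NodeTransformer):
    """One contraction step: replace each independent subquestion by its answer."""

    def visit_BinOp(self, node: ast.BinOp) -> ast.AST:
        if is_atomic(node):
            answer = OPS[type(node.op)](node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value=answer), node)
        self.generic_visit(node)  # dependent node: contract its children only
        return node


def aot_solve(question: str, max_steps: int = 10):
    """Contract until the question is a single literal; return (answer, trace of states)."""
    tree = ast.parse(question, mode="eval")
    trace = [question]
    for _ in range(max_steps):
        if isinstance(tree.body, ast.Constant):
            break
        tree = ast.fix_missing_locations(Contract().visit(tree))
        trace.append(ast.unparse(tree))
    return tree.body.value, trace
```

For `aot_solve("(2+3)*(4+1)")` the trace has three states, each evaluating to 25. Every state is built from the previous state alone, which mirrors the Markov property, and the shared value mirrors answer equivalence.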

(c) Hybrid Step-Level Reasoning: Parallel and Sequential

Step-level verifier-guided hybrid scaling (Chang et al., 21 Jul 2025) orchestrates two complementary mechanisms:

Sequential self-refinement: A process verifier (PRM) triggers conditional critique-and-rewrite at reasoning steps falling below a reward threshold, iteratively refining low-confidence steps until verification improves.
Parallel sampling: Breadth is maintained by sampling multiple candidate expansions at each node; ultimate action selection is dynamically balanced using PUCT scores within a Monte Carlo Tree Search (MCTS).
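A minimal sketch of PUCT-based selection among sampled candidate steps follows. It uses the AlphaZero-style form of the score, $Q + c \cdot P \cdot \sqrt{N_{\rm parent}} / (1 + N_{\rm child})$; the exact variant, the constant `c_puct`, and the `Child` container are assumptions here, since this article does not specify them.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Child:
    prior: float             # policy prior for this candidate step
    visits: int = 0          # how often this candidate has been explored
    value_sum: float = 0.0   # accumulated verifier (PRM) reward

    @property
    def q(self) -> float:
        """Mean reward so far; 0.0 for an unvisited candidate."""
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(child: Child, parent_visits: int, c_puct: float = 1.5) -> float:
    """Exploitation term (q) plus a prior-weighted exploration bonus."""
    bonus = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return child.q + bonus


def select(children: List[Child], c_puct: float = 1.5) -> int:
    """Index of the child with the highest PUCT score."""
    parent_visits = sum(c.visits for c in children)
    return max(
        range(len(children)),
        key=lambda i: puct_score(children[i], parent_visits, c_puct),
    )
```

With one well-explored candidate (`visits=10`, mean reward 0.6) and one unvisited candidate with the same prior, `select` picks the unvisited one, because its exploration bonus is not yet discounted by visits. This is how breadth is maintained alongside refinement of promising steps.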

(d) Explicit Planning–Execution Separation

DREAM (Cui et al., 29 Sep 2025) introduces:

Phase 1 (Planning): The algorithm samples and verifies high-level plans (subgoal propositions) with a planning-specific reward model, allowing for early pruning of unpromising trajectories.
Phase 2 (Execution): For promising plans, multiple execution proposals are sampled and verified, focusing resource allocation on tactical details only for strategic pathways likely to succeed.

Dynamic per-step budget allocation uses intermediate reward signals to shift budget toward difficult steps and halt on easy ones.
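The planning and execution split with early stopping and pruning can be sketched as below. This is a hedged illustration, not DREAM itself: the proposer and scorer callables, the thresholds `prune_below` and `accept`, and the budgets are hypothetical placeholders for the LLM samplers and reward models.

```python
from typing import Callable, Optional, Tuple


def sample_until_good(
    propose: Callable[[], str],
    score: Callable[[str], float],
    max_samples: int,
    accept: float,
) -> Tuple[Optional[str], float, int]:
    """Draw candidates until one scores >= accept or the budget is spent.

    Returns (best candidate, its score, samples used), so easy steps stop early.
    """
    best, best_score, used = None, float("-inf"), 0
    for _ in range(max_samples):
        candidate = propose()
        used += 1
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
        if best_score >= accept:
            break
    return best, best_score, used


def plan_then_execute(
    propose_plan: Callable[[], str],
    score_plan: Callable[[str], float],
    propose_exec: Callable[[str], str],
    score_exec: Callable[[str], float],
    plan_budget: int = 4,
    exec_budget: int = 4,
    prune_below: float = 0.3,
    accept: float = 0.8,
):
    """Return ((plan, execution) or None, total samples used)."""
    # Phase 1: sample and verify high-level plans.
    plan, plan_score, used_plan = sample_until_good(
        propose_plan, score_plan, plan_budget, accept
    )
    if plan_score < prune_below:
        return None, used_plan  # prune: spend no execution budget on a weak plan

    # Phase 2: sample and verify executions of the surviving plan.
    execution, _, used_exec = sample_until_good(
        lambda: propose_exec(plan), score_exec, exec_budget, accept
    )
    return (plan, execution), used_plan + used_exec
```

In this sketch the budget adapts in two ways: sampling halts as soon as a candidate clears `accept`, and execution sampling is skipped entirely when no plan clears `prune_below`.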

3. Formal Frameworks and Quantitative Foundations

A unifying property across domains is the rigorous formulation of dual-phase scaling in terms of probability, reward, or uncertainty:

Compute Budgeting and Marginal Gains: Models such as TTSPM (Wang et al., 26 May 2025) render the expected gain after $N$ parallel or sequential scaling units as $F(N) = F_{\max}[1-(1-p_x)^N]$, with the saturation threshold ($N^*$) marking when additional compute ceases to yield substantial improvement.
Uncertainty Quantification: In UTTSI, the uncertainty function

$u(x) = 1-\left[\alpha\,s_{\rm model}(x) + (1-\alpha)\,s_{\rm freq}(x)\right]$

determines the number $K(x)$ of deep explorations using logit and frequency-derived signals.

Consistency-Weighted Aggregation: Final predictions are formed by exponentially down-weighting inconsistent path outputs:

$P_{\rm final} = \frac{\sum_{i} w_i P_i}{\sum_{i} w_i}$

with $w_i = \exp(-\lambda|P_i - \bar{P}|)$.
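The uncertainty score and the aggregation rule above translate directly into code. The sketch below makes several assumptions that the text does not fix: $\bar{P}$ is taken to be the arithmetic mean of the path predictions, both signals lie in $[0, 1]$, and the mapping from $u(x)$ to $K(x)$ in `num_explorations` (a linear ramp above a threshold) is purely hypothetical, as are the default values of `alpha`, `lam`, `k_max`, and `threshold`.

```python
import math
from typing import Sequence


def uncertainty(s_model: float, s_freq: float, alpha: float = 0.5) -> float:
    """u(x) = 1 - [alpha * s_model + (1 - alpha) * s_freq]."""
    return 1.0 - (alpha * s_model + (1.0 - alpha) * s_freq)


def num_explorations(u: float, k_max: int = 8, threshold: float = 0.5) -> int:
    """Hypothetical K(x): zero below the threshold, then a linear ramp up to k_max."""
    if u <= threshold:
        return 0
    return max(1, math.ceil(k_max * (u - threshold) / (1.0 - threshold)))


def consistency_weighted(preds: Sequence[float], lam: float = 5.0) -> float:
    """Weighted mean with w_i = exp(-lam * |P_i - mean(P)|)."""
    mean = sum(preds) / len(preds)  # assumption: P-bar is the arithmetic mean
    weights = [math.exp(-lam * abs(p - mean)) for p in preds]
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)
```

For path predictions `[0.2, 0.2, 0.8]`, the plain mean is 0.4, while the consistency-weighted result lies below 0.4 because the outlying path at 0.8 receives the smallest weight.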

Instance-adaptive, budget-aware, and reward-guided formulations characterize leading frameworks in this paradigm.
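As a worked example of the saturation model $F(N) = F_{\max}[1-(1-p_x)^N]$, the marginal gain of one more scaling unit is $F(N+1) - F(N) = F_{\max}\,p_x\,(1-p_x)^N$, which decays geometrically. The sketch below assumes one possible definition of the saturation budget, namely the smallest $N$ whose marginal gain falls below a tolerance `eps`; this definition and the function names are assumptions and may differ from the one used in TTSPM.

```python
import math


def expected_gain(n: int, f_max: float, p: float) -> float:
    """F(N) = f_max * [1 - (1 - p)^N]."""
    return f_max * (1.0 - (1.0 - p) ** n)


def marginal_gain(n: int, f_max: float, p: float) -> float:
    """F(N + 1) - F(N) = f_max * p * (1 - p)^N."""
    return f_max * p * (1.0 - p) ** n


def saturation_budget(f_max: float, p: float, eps: float) -> int:
    """Smallest N with marginal_gain(N) < eps (assumed definition of N*)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if eps > f_max * p:
        return 0  # even the first unit gains less than eps
    # (1 - p)^N < eps / (f_max * p)  <=>  N > log(eps / (f_max * p)) / log(1 - p);
    # the inequality flips because log(1 - p) is negative.
    bound = math.log(eps / (f_max * p)) / math.log(1.0 - p)
    return math.floor(bound) + 1
```

With `f_max=1.0`, `p=0.5`, and `eps=0.01`, the marginal gain is $0.5^{N+1}$, which first drops below 0.01 at $N = 6$, so `saturation_budget` returns 6. Budget beyond that point buys less than `eps` per unit under this model.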

4. Empirical Outcomes and Comparative Analysis

Dual-phase scaling approaches yield consistent gains in both accuracy and compute efficiency:

Methodology	Key Performance Gains	Reference
UTTSI (CTR)	+0.0044/+0.0046 AUC (offline), +5.3% relative CTR (online A/B, $p<0.01$)	(Zhang et al., 24 May 2026)
AoT (HotpotQA)	+3.4 F1 over strong baseline, hit rate $89.8\%$, strong answer-equivalence guarantee	(Teng et al., 17 Feb 2025)
Hybrid TTS	+16.8 MAJ@8 obs. (MATH500, Qwen2.5-3B), +3 pts over Best-of-N	(Chang et al., 21 Jul 2025)
DREAM(+)	Up to +10 pp accuracy @ fixed token budget (GSM8K/MBPP), 10–30% reduction in tokens	(Cui et al., 29 Sep 2025)

Ablation studies consistently show that removing a phase or its core component (uncertainty-based deep exploration, DAG decomposition, or step-level pruning/remediation) causes a measurable loss of performance.

In synchronous settings (e.g., UTTSI), all deep explorations are parallelizable, preserving worst-case wall-clock latency.

5. Practical Considerations, Limitations, and Applicability

Dual-phase test-time scaling methods have several practical properties:

Model-Agnostic Implementation: Most such frameworks do not require retraining or modifying base model weights, but leverage forward/backward passes, gradient-based attributions, and lightweight frequency sketches.
Compute Overhead: Typical average compute overhead is 2.5–3$\times$ (UTTSI, step-level hybrid), with the overhead heavily concentrated on “hard” instances only; worst-case latency can be controlled.
Error Modes: Limitations include sensitivity to poor Phase 1 decomposition (AoT), dependence on the quality of step-level verifiers, and the possibility of compounding errors when contraction or filtering misrepresents the original problem structure.
Combinatorial Potential: Dual-phase scaling naturally integrates with other scaling paradigms (e.g., self-consistency, Tree-of-Thoughts, large-slate voting) and budget-aware schemes, providing a combinatorially rich design space.

Applicability spans tabular CTR, general LLM reasoning, code synthesis, math/coding benchmarks, and potentially other domains where stepwise confidence or decompositional structure can be exploited.

6. Future Trajectories and Theoretical Unification

Active research directions include:

Generalization Across Domains: Transferring dual-phase decomposition to open-ended reasoning, agentic planning, and long-context tasks.
Adaptive Budgeting: Combining dual-phase scaling with real-time dynamic resource allocation, as in DREAM(+) (Cui et al., 29 Sep 2025).
Theoretical Saturation Analysis: Leveraging closed-form “saturation budgets” for optimal allocation between phases in compute-constrained deployment (Wang et al., 26 May 2025).
Reflection and Correction: Iterative improvement via answer-preserving re-decomposition (see the limitations section of Teng et al., 17 Feb 2025).
Dual-Space Training: Exploiting the duality of generation and judgment (as in DuST; Jiao et al., 11 May 2026) to enhance both selection and generation within test-time scaling pipelines.

A plausible implication is that the dual-phase construct may generalize as a common abstraction for resource-efficient, adaptive inference across machine learning and reasoning systems beyond the domains surveyed here.

References:

(Zhang et al., 24 May 2026) Selective Test-Time Compute Scaling for Click-Through Rate Prediction via Uncertainty-Triggered Feature Path Exploration
(Teng et al., 17 Feb 2025) Atom of Thoughts for Markov LLM Test-Time Scaling
(Chang et al., 21 Jul 2025) Step-level Verifier-guided Hybrid Test-Time Scaling for LLMs
(Cui et al., 29 Sep 2025) Adaptive Test-Time Reasoning via Reward-Guided Dual-Phase Search
(Wang et al., 26 May 2025) Scaling over Scaling: Exploring Test-Time Scaling Plateau in Large Reasoning Models
(Jiao et al., 11 May 2026) Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling