- The paper presents φ-Decoding, a novel inference-time method that enhances LLM reasoning by simulating future steps and balancing exploration with exploitation.
- It employs dynamic advantage estimation and alignment assessment via clustering to re-weight candidate steps based on simulated future rewards.
- The approach integrates dynamic pruning strategies to optimize computational cost while achieving significant performance gains on benchmarks like GSM8K and AIME.
This paper introduces φ-Decoding, a novel inference-time optimization algorithm designed to improve the reasoning capabilities of LLMs by balancing exploration and exploitation more effectively than previous methods. It addresses the limitations of standard auto-regressive decoding (which is short-sighted) and search-based methods like Tree-of-Thought (ToT) or Monte Carlo Tree Search (MCTS) (which can involve excessive exploration in vast search spaces).
The core idea is adaptive foresight sampling. Instead of conditioning only on the past steps $a_{<t}$, φ-Decoding estimates the value of a candidate next step $a_t$ by simulating future steps $a_{>t}$ and evaluating their quality. The probability of selecting a step $\hat{a}_t$ is re-weighted by an estimated reward function $R$ derived from these foresight simulations:

$$\hat{a}_t \sim p_\theta(a_t \mid x, a_{<t}) \, \exp\!\big[\, R(x, a_{\le t}, a_{>t}) / \tau \,\big]$$
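A minimal sketch of this re-weighted step sampling, assuming the candidate steps' model log-probabilities and precomputed reward estimates are already available as arrays (the function name and signature are illustrative, not the paper's API):

```python
import numpy as np

def sample_step(step_logprobs: np.ndarray, rewards: np.ndarray, tau: float = 0.6) -> int:
    """Sample a candidate index ~ p_theta(a_t | x, a_<t) * exp(R / tau).

    step_logprobs: log p_theta(a_t | x, a_<t) for each candidate step.
    rewards:       estimated reward R(x, a_<=t, a_>t) for each candidate.
    """
    logits = step_logprobs + rewards / tau        # combine step prior with foresight reward
    probs = np.exp(logits - logits.max())         # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```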
The key innovation lies in how the step value estimation function R is constructed. It combines two complementary perspectives:
- Dynamic Advantage Estimation ($A_t$): This estimates the absolute benefit of a candidate step $a_t$. It is the difference between the average log probability of the foresight path starting from $a_t$ ($F_t$) and that of the foresight path from the previous step $a_{t-1}$ ($F_{t-1}$):

$$A_t = F_t - F_{t-1}$$

where $F_t = p_\theta(a_{>t} \mid x, a_{\le t})$, implemented as the average log probability of the foresight sequence to mitigate length bias. This captures the confidence gain provided by the step.
- Alignment Assessment by Clustering ($C_t$): This provides a relative value estimate to guard against the model being confidently wrong (local optima). After generating foresight paths (rollouts) for multiple candidate steps, the paths are clustered (using TF-IDF features in the main experiments, or sentence embeddings). The alignment score $C_t$ of a step $a_t$ is the normalized size of the cluster its foresight path belongs to:

$$C_t = \frac{|\mathrm{Cluster}(a_t)|}{\#\,\text{Foresight Paths}}$$

Steps whose foresight paths are consistent with many other candidate steps receive higher alignment scores; a code sketch of both $A_t$ and $C_t$ follows below.
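A minimal sketch of both step-value signals, assuming each candidate's foresight path is available as text plus per-token log probabilities, and using TF-IDF features with k-means (via scikit-learn) as one plausible clustering choice; all helper names here are illustrative rather than the paper's API:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def avg_logprob(token_logprobs: list[float]) -> float:
    # Length-normalized log probability of a foresight path (mitigates length bias).
    return float(np.mean(token_logprobs))

def advantage(curr_path_logprobs: list[float], prev_path_logprobs: list[float]) -> float:
    # A_t = F_t - F_{t-1}: confidence gain of the candidate step over the previous step.
    return avg_logprob(curr_path_logprobs) - avg_logprob(prev_path_logprobs)

def alignment_scores(foresight_paths: list[str], k: int = 3) -> list[float]:
    # C_t per candidate: fraction of all foresight paths falling in the same cluster.
    features = TfidfVectorizer().fit_transform(foresight_paths)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(features)
    sizes = {c: int((labels == c).sum()) for c in set(labels)}
    return [sizes[label] / len(foresight_paths) for label in labels]
```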
The final reward $R$ combines normalized versions of the advantage and alignment scores; steps are sampled from their joint distribution:

$$R(x, a_{\le t}, a_{>t}) = \mathrm{Norm}(A_t) + \mathrm{Norm}(C_t)$$

where $\mathrm{Norm}(v) = \dfrac{\exp(v/\tau_v)}{\sum_{a_t} \exp(v/\tau_v)}$. In the implementation, $\tau_1 = \tau_2 = 0.6$ and the two terms are weighted equally.
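A sketch of the combined score, assuming the advantage and alignment values of all candidates at the current step are collected into arrays and normalized with a softmax over candidates at temperature 0.6, as stated above (function names are illustrative):

```python
import numpy as np

def softmax_norm(values: np.ndarray, tau: float = 0.6) -> np.ndarray:
    """Norm(v): softmax over the candidate steps with temperature tau."""
    z = np.exp((values - values.max()) / tau)   # shift by max for numerical stability
    return z / z.sum()

def combined_reward(advantages: np.ndarray, alignments: np.ndarray) -> np.ndarray:
    """R = Norm(A_t) + Norm(C_t), equally weighted."""
    return softmax_norm(advantages) + softmax_norm(alignments)
```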
To manage the computational cost introduced by foresight sampling and avoid "overthinking", φ-Decoding incorporates a Dynamic Pruning Strategy:
- In-Width Pruning: Before running the computationally expensive foresight simulation for all candidate steps (generated via beam search with M beams and N rollouts per beam), this step filters out unpromising candidates. It computes the mean ($\mu_t$) and standard deviation ($\sigma_t$) of the initial generation probabilities $s_t = p_\theta(a_t \mid x, a_{<t})$ over all $M \times N$ candidates and keeps only those with $s_t^{(i)} \ge \mu_t - \sigma_t$ for foresight simulation.
- In-Depth Pruning: This strategy enables early stopping to save computation on later, potentially easier steps. It leverages the clustering results from the Alignment Assessment: if the largest cluster contains a fraction of the foresight paths exceeding a threshold $\delta$ (e.g., $\delta = 0.7$), the algorithm stops the step-by-step foresight process and reverts to standard auto-regressive generation for the remainder of the sequence. This applies only after a minimum number of foresight steps ($T_{\min}$) have been taken. A sketch of both pruning rules follows this list.
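A compact sketch of the two pruning rules under the thresholds stated above; the candidate probabilities and cluster sizes are assumed to be collected into arrays, and the function names are illustrative:

```python
import numpy as np

def in_width_prune(step_probs: np.ndarray) -> np.ndarray:
    """Keep indices of candidates whose initial step probability is >= mean - std."""
    threshold = step_probs.mean() - step_probs.std()
    return np.flatnonzero(step_probs >= threshold)

def should_stop_early(cluster_sizes: list[int], n_paths: int, delta: float = 0.7) -> bool:
    """In-depth pruning: stop foresight once a single cluster dominates the rollouts."""
    return max(cluster_sizes) / n_paths > delta
```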
Implementation:
- The algorithm operates stepwise, maintaining M active beams (sequences).
- At each step t, N candidate next steps are sampled for each beam.
- In-width pruning filters these candidates.
- Foresight paths of length Tmin to Tmax tokens are generated for the remaining candidates using the LLM.
- Advantage (At) and Alignment (Ct) scores are calculated based on these paths.
- The combined score R is used to re-weight the initial probabilities, and M steps are sampled to form the beams for step t+1.
- In-depth pruning checks if early stopping is applicable.
- The process uses the vLLM engine for efficient inference on GPUs.
- Hyperparameters (M, N, K, Tmin, Tmax, δ) are tuned per task and model (see Appendix Table 4 for examples). For LLaMA3.1-8B on GSM8K, typical values are M=4, N=4, K=3, Tmin=4, Tmax=8, δ=0.7 (a hypothetical configuration sketch follows this list).
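A hypothetical configuration object collecting the hyperparameters listed above; the values shown are the LLaMA3.1-8B / GSM8K settings mentioned in the summary, while the dataclass and field names are illustrative, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class PhiDecodingConfig:
    num_beams: int = 4        # M: active beams maintained per step
    num_rollouts: int = 4     # N: candidate steps sampled per beam
    num_clusters: int = 3     # K: clusters used for alignment assessment
    t_min: int = 4            # minimum foresight steps before early stopping may trigger
    t_max: int = 8            # maximum foresight depth
    delta: float = 0.7        # dominant-cluster threshold for in-depth pruning
    tau: float = 0.6          # softmax temperature used in Norm(.)
```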
Algorithm Pseudocode Overview (Algorithm 1):
```
function phi_decoding(x, model, M, N, T_min, T_max, K, delta):
    beams = initialize_beams(x)
    previous_path_prob = {}                        // stores F_{t-1} per prefix for the advantage
    for t = 1 to MAX_STEPS:
        // Step Rollout (generate M*N candidates)
        candidates = {}
        for beam in beams:
            next_steps, step_probs = sample_next_steps(model, beam, N)
            add_candidates(candidates, next_steps, step_probs, beam)

        // In-Width Pruning: drop low-probability candidates before foresight
        pruned_candidates = in_width_prune(candidates)

        // Step Foresight & Value Estimation
        step_values = {}
        foresight_paths = {}
        for cand_step, beam_prefix in pruned_candidates:
            path, path_prob = generate_foresight(model, beam_prefix + cand_step, T_max)
            foresight_paths[cand_step] = path
            advantage = calculate_advantage(path_prob, previous_path_prob.get(beam_prefix, 0))
            step_values[cand_step] = {"advantage": advantage}
            previous_path_prob[beam_prefix + cand_step] = path_prob   // becomes F_{t-1} next step

        // Alignment Assessment via clustering of the foresight paths
        clusters = cluster_paths(foresight_paths, K)
        for cand_step in step_values:
            alignment = calculate_alignment(cand_step, clusters, len(foresight_paths))
            step_values[cand_step]["alignment"] = alignment
            step_values[cand_step]["final_value"] = combine_scores(
                step_values[cand_step]["advantage"], alignment)

        // Sample M steps to form the beams for step t+1
        beams = sample_next_beams(pruned_candidates, step_values, M)

        // In-Depth Pruning: stop early once one cluster dominates
        if t >= T_min and check_early_stop(clusters, len(foresight_paths), delta):
            return complete_autoregressive(model, beams[0])   // complete best beam

    // Fallback if max steps reached
    return complete_autoregressive(model, beams[0])
```
Evaluation and Results:
- Tested on GSM8K, MATH-500, GPQA, ReClor, LogiQA, ARC-C, and AIME benchmarks.
- Compared against Auto-Regressive (CoT), ToT, MCTS, Guided Decoding, and Predictive Decoding.
- Used LLaMA3.1 (8B, 70B), Mistral-v0.3-7B, Qwen2.5-3B, and R1-Distill-LLaMA-8B models.
- φ-Decoding significantly outperformed CoT (e.g., +14.6% avg on LLaMA3.1-8B) and strong baselines across benchmarks, often with lower or comparable computational cost (FLOPS).
- Showed strong inference-time scaling: performance improved consistently with increased compute budget, outperforming other methods at similar FLOPS levels (Figure 1).
- Ablation studies confirmed the positive contributions of foresight sampling, clustering, and dynamic pruning (Table 2). Pruning significantly reduced FLOPS while sometimes even improving accuracy by filtering noise.
- Demonstrated strong generalization across model sizes (3B to 70B) and effectiveness on challenging competition-level tasks like AIME, improving performance even for reasoning-specialized models such as R1-Distill-LLaMA-8B (Table 3, Table 5, Appendix C).
- Analysis suggested its step value estimation is more accurate than baselines and correlates positively with final task performance (Figure 2).
Practical Implications:
- φ-Decoding offers a practical way to boost the reasoning performance of existing LLMs at inference time without requiring model retraining or external reward models.
- It provides a better trade-off between performance and computational cost compared to methods like MCTS or ToT, making advanced reasoning more feasible.
- The dynamic pruning allows for adaptive compute allocation, spending more resources on difficult steps and saving compute on easier ones.
- It can be implemented as a decoding strategy within existing LLM serving frameworks like vLLM.
In summary, φ-Decoding presents an effective and relatively efficient inference-time algorithm for improving LLM reasoning by combining foresight simulation, a novel step value estimation based on advantage and alignment, and dynamic pruning strategies. Its strong empirical results and scalability make it a promising technique for practical applications requiring robust reasoning.