$φ$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation (2503.13288v1)

Published 17 Mar 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. While previous search-based strategies address the short-sightedness of auto-regressive generation, the vast search space leads to excessive exploration and insufficient exploitation. To strike an efficient balance to derive the optimal step, we frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain globally optimal step estimation. Built on it, we propose a novel decoding strategy, named $\phi$-Decoding. To provide a precise and expressive estimation of step value, $\phi$-Decoding approximates two distributions via foresight and clustering. Sampling from the joint distribution, the optimal steps can be selected for exploitation. To support adaptive computation allocation, we propose in-width and in-depth pruning strategies, featuring a light-weight solution to achieve inference efficiency. Extensive experiments across seven benchmarks show $\phi$-Decoding outperforms strong baselines in both performance and efficiency. Additional analysis demonstrates its generalization across various LLMs and scalability across a wide range of computing budgets. The code will be released at https://github.com/xufangzhi/phi-Decoding, and the open-source PyPI package is coming soon.

Summary

  • The paper presents the novel φ-Decoding method that enhances LLM reasoning by simulating future steps and balancing exploration with exploitation.
  • It employs dynamic advantage estimation and alignment assessment via clustering to re-weight candidate steps based on simulated future rewards.
  • The approach integrates dynamic pruning strategies to optimize computational cost while achieving significant performance gains on benchmarks like GSM8K and AIME.

This paper introduces φ-Decoding, a novel inference-time optimization algorithm designed to improve the reasoning capabilities of LLMs by balancing exploration and exploitation more effectively than previous methods. It addresses the limitations of standard auto-regressive decoding (which is short-sighted) and search-based methods like Tree-of-Thought (ToT) or Monte Carlo Tree Search (MCTS) (which can involve excessive exploration in vast search spaces).

The core idea is adaptive foresight sampling. Instead of conditioning only on the past steps $\mathbf{a}_{<t}$, φ-Decoding estimates the value of taking a potential next step $a_t$ by simulating future steps $\mathbf{a}_{>t}$ and evaluating their quality. The probability of selecting a step $\hat{a}_t$ is adjusted based on an estimated reward function $R$ derived from these future simulations:

$$\hat{a}_t \sim p_\theta(a_t \mid x, \mathbf{a}_{<t}) \, \exp\left[ R(x, \mathbf{a}_{\leq t}, \mathbf{a}_{>t}) / \tau \right]$$
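As a rough illustration of this re-weighting (a minimal sketch, not the released implementation; the function and argument names are hypothetical), the sampling step can be written as:

import numpy as np

def foresight_sample(step_logprobs, rewards, tau=0.6, rng=None):
    # Sample one candidate index from p_theta(a_t | x, a_<t) * exp(R / tau).
    # step_logprobs: log p_theta for each candidate next step (assumed given).
    # rewards: foresight-derived reward R for each candidate (assumed given).
    rng = rng or np.random.default_rng()
    logits = np.asarray(step_logprobs) + np.asarray(rewards) / tau
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Hypothetical usage with three candidate steps:
chosen = foresight_sample(step_logprobs=[-1.2, -0.8, -2.0], rewards=[0.3, 0.9, 0.1])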

The key innovation lies in how the step value estimation function $R$ is constructed. It combines two complementary perspectives:

  1. Dynamic Advantage Estimation ($A_t$): This estimates the absolute benefit of a candidate step $a_t$. It is calculated as the difference between the average log probability of the foresight path starting from $a_t$ ($F_t$) and that of the foresight path from the previous step $a_{t-1}$ ($F_{t-1}$):

    $$A_t = F_t - F_{t-1}$$

    where $F_t = p_\theta(\mathbf{a}_{>t} \mid x, \mathbf{a}_{<t}, a_t)$, implemented using the average log probability of the sequence to mitigate length bias. This captures the uncertainty or confidence gain provided by the step.

  2. Alignment Assessment by Clustering ($C_t$): This provides a relative value estimate to combat the risk of the model being confidently wrong (local optima). After generating foresight paths (rollouts) for multiple candidate steps, these paths are clustered (using TF-IDF in the main experiments, or sentence embeddings). The alignment score $C_t$ for a step $a_t$ is the normalized size of the cluster its foresight path belongs to:

    $$C_t = \frac{|\mathrm{Cluster}(a_t)|}{\#\,\mathrm{Foresight\ Paths}}$$

    Steps leading to future paths consistent with many other candidate steps receive higher alignment scores. A small sketch of both computations follows this list.
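A minimal sketch of how these two scores could be computed, assuming each candidate's foresight rollout is available as text along with its per-token log probabilities (helper names are illustrative, not the paper's code):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def advantage_scores(path_token_logprobs, prev_avg_logprob):
    # A_t = F_t - F_{t-1}, with F taken as the average token log probability
    # of the foresight path (length-normalized to mitigate length bias).
    F_t = np.array([np.mean(lp) for lp in path_token_logprobs])
    return F_t - prev_avg_logprob

def alignment_scores(foresight_texts, num_clusters=3):
    # C_t = |Cluster(a_t)| / (# foresight paths), with TF-IDF features + k-means.
    features = TfidfVectorizer().fit_transform(foresight_texts)
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(features)
    cluster_sizes = np.bincount(labels, minlength=num_clusters)
    return cluster_sizes[labels] / len(foresight_texts)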

The final reward $R$ combines normalized versions of the Advantage and Alignment scores, sampling from their joint distribution:

$$R(x, \mathbf{a}_{\leq t}, \mathbf{a}_{>t}) = \mathrm{Norm}(A_t) + \mathrm{Norm}(C_t)$$

where $\mathrm{Norm}(v) = \frac{\exp(v / \tau_v)}{\sum_{a_t} \exp(v / \tau_v)}$. In the implementation, $\tau_1 = \tau_2 = 0.6$ and equal weighting is used.
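Continuing the sketch above (illustrative only; $\tau_1 = \tau_2 = 0.6$ as reported), the normalization and combination might look like:

import numpy as np

def softmax_norm(values, tau=0.6):
    # Norm(v) = exp(v / tau) / sum over candidate steps of exp(v / tau)
    v = np.asarray(values) / tau
    e = np.exp(v - v.max())  # numerically stable
    return e / e.sum()

def combined_reward(advantages, alignments, tau=0.6):
    # R = Norm(A_t) + Norm(C_t), with equal weighting
    return softmax_norm(advantages, tau) + softmax_norm(alignments, tau)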

To manage the computational cost introduced by foresight sampling and avoid "overthinking", φ-Decoding incorporates a Dynamic Pruning Strategy:

  1. In-Width Pruning: Before performing the computationally expensive foresight simulation for all candidate steps (generated via beam search with $M$ beams and $N$ rollouts per beam), this step filters out unpromising candidates. It calculates the mean ($\mu_t$) and standard deviation ($\sigma_t$) of the initial generation probabilities $s_t = p_\theta(a_t \mid x, \mathbf{a}_{<t})$ for all $M \times N$ candidates. Only candidates with $s_t^{(i)} \geq \mu_t - \sigma_t$ are kept for foresight simulation.
  2. In-Depth Pruning: This strategy enables early stopping to save computation on later, potentially easier steps. It leverages the clustering results from the Alignment Assessment. If the largest cluster contains a fraction of the foresight paths exceeding a threshold $\delta$ (e.g., $\delta = 0.7$), the algorithm stops the step-by-step foresight process and reverts to standard auto-regressive generation for the remainder of the sequence. This occurs only after a minimum number of foresight steps ($T_{\mathrm{min}}$) have been taken. A sketch of both pruning rules follows this list.
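A rough sketch of the two pruning rules under the same assumptions as above (hypothetical helper names; whether the candidate scores are raw probabilities or length-normalized log probabilities is an implementation detail, the filtering rule is identical):

import numpy as np

def in_width_prune(candidate_scores):
    # Keep candidates whose generation score satisfies s >= mu - sigma.
    s = np.asarray(candidate_scores)
    keep = s >= (s.mean() - s.std())
    return np.flatnonzero(keep)  # indices of surviving candidates

def should_stop_early(cluster_sizes, num_paths, t, t_min=4, delta=0.7):
    # In-depth pruning: stop foresight once the largest cluster covers at least
    # a delta fraction of the foresight paths, but only after t_min foresight steps.
    if t < t_min:
        return False
    return max(cluster_sizes) / num_paths >= delta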

Implementation:

  • The algorithm operates stepwise, maintaining $M$ active beams (sequences).
  • At each step $t$, $N$ candidate next steps are sampled for each beam.
  • In-width pruning filters these candidates.
  • Foresight paths of $T_{\mathrm{min}}$ to $T_{\mathrm{max}}$ steps are generated for the remaining candidates using the LLM.
  • Advantage ($A_t$) and Alignment ($C_t$) scores are calculated based on these paths.
  • The combined score $R$ is used to re-weight the initial probabilities, and $M$ steps are sampled to form the beams for step $t+1$.
  • In-depth pruning checks if early stopping is applicable.
  • The process uses the vLLM engine for efficient inference on GPUs.
  • Hyperparameters ($M$, $N$, $K$, $T_{\mathrm{min}}$, $T_{\mathrm{max}}$, $\delta$) are tuned per task and model (see Appendix Table 4 for examples). For LLaMA3.1-8B on GSM8K, typical values are $M=4$, $N=4$, $K=3$, $T_{\mathrm{min}}=4$, $T_{\mathrm{max}}=8$, $\delta=0.7$.

Algorithm Pseudocode Overview (Algorithm 1):

function phi_decoding(x, model, M, N, T_min, T_max, K, delta):
  beams = initialize_beams(x)
  previous_path_prob = {}  // F_{t-1}: average log prob of each beam's previous foresight path
  for t = 1 to MAX_STEPS:
    candidates = {}
    // Step Rollout (Generate M*N candidates)
    for beam in beams:
      next_steps, step_probs = sample_next_steps(model, beam, N)
      add_candidates(candidates, next_steps, step_probs, beam)

    // In-Width Pruning
    pruned_candidates = in_width_prune(candidates)

    // Step Foresight & Value Estimation
    step_values = {}
    foresight_paths = {}
    for cand_step, beam_prefix in pruned_candidates:
      path, path_prob = generate_foresight(model, beam_prefix + cand_step, T_max)
      foresight_paths[cand_step] = path
      advantage = calculate_advantage(path_prob, previous_path_prob[beam_prefix])  // A_t = F_t - F_{t-1}
      step_values[cand_step] = {"advantage": advantage}

    clusters = cluster_paths(foresight_paths, K)
    for cand_step in step_values:
      alignment = calculate_alignment(cand_step, clusters, len(foresight_paths))
      step_values[cand_step]["alignment"] = alignment
      combined_value = combine_scores(step_values[cand_step]["advantage"], alignment)
      step_values[cand_step]["final_value"] = combined_value

    // Sample M steps for the next beams, re-weighting the initial step probabilities by final_value
    next_beams = sample_next_beams(pruned_candidates, step_values, M)
    beams = next_beams
    // (each selected step's foresight path probability becomes F_{t-1} for the next iteration)

    // In-Depth Pruning
    if t >= T_min and check_early_stop(clusters, len(foresight_paths), delta):
      final_sequence = complete_autoregressive(model, beams[0]) // Complete best beam
      return final_sequence

  // Fallback if max steps reached
  final_sequence = complete_autoregressive(model, beams[0])
  return final_sequence

Evaluation and Results:

  • Tested on GSM8K, MATH-500, GPQA, ReClor, LogiQA, ARC-C, and AIME benchmarks.
  • Compared against Auto-Regressive (CoT), ToT, MCTS, Guided Decoding, and Predictive Decoding.
  • Used LLaMA3.1 (8B, 70B), Mistral-v0.3-7B, Qwen2.5-3B, and R1-Distill-LLaMA-8B models.
  • φ-Decoding significantly outperformed CoT (e.g., +14.6% avg on LLaMA3.1-8B) and strong baselines across benchmarks, often with lower or comparable computational cost (FLOPS).
  • Showed strong inference-time scaling: performance improved consistently with increased compute budget, outperforming other methods at similar FLOPS levels (Figure 1).
  • Ablation studies confirmed the positive contributions of foresight sampling, clustering, and dynamic pruning (Table 2). Pruning significantly reduced FLOPS while sometimes even improving accuracy by filtering noise.
  • Demonstrated good generalization across model sizes (3B to 70B) and effectiveness even on challenging competition-level tasks like AIME, improving performance even for specialized models like DeepSeek-R1 (Table 3, Table 5, Appendix C).
  • Analysis suggested its step value estimation is more accurate than baselines and correlates positively with final task performance (Figure 2).

Practical Implications:

  • φ-Decoding offers a practical way to boost the reasoning performance of existing LLMs at inference time without requiring model retraining or external reward models.
  • It provides a better trade-off between performance and computational cost compared to methods like MCTS or ToT, making advanced reasoning more feasible.
  • The dynamic pruning allows for adaptive compute allocation, spending more resources on difficult steps and saving compute on easier ones.
  • It can be implemented as a decoding strategy within existing LLM serving frameworks like vLLM.

In summary, φ-Decoding presents an effective and relatively efficient inference-time algorithm for improving LLM reasoning by combining foresight simulation, a novel step value estimation based on advantage and alignment, and dynamic pruning strategies. Its strong empirical results and scalability make it a promising technique for practical applications requiring robust reasoning.
