A*-Decoding: Token-Efficient Inference Scaling (2505.13672v1)

Published 19 May 2025 in cs.AI

Abstract: Inference-time scaling has emerged as a powerful alternative to parameter scaling for improving LLM performance on complex reasoning tasks. While existing methods have shown strong performance gains under fixed compute budgets, there has been little focus on optimally utilizing that budget during inference. In this work, we introduce A*-decoding, a search-based inference-time strategy that builds on the A* search algorithm to optimally utilize a fixed compute budget by prioritizing high-quality reasoning paths during generation. We frame LLM decoding as a structured search in a state space of partial solutions, applying the A* transition model to identify promising continuations guided by an external process supervision signal. In our experiments, A*-decoding reaches the performance levels of strong inference scaling baselines like best-of-N and particle filtering while using up to 3x fewer tokens and 30% fewer PRM passes under equivalent compute budgets. On the MATH500 and AIME 2024 benchmarks, A*-decoding enables Llama-3.2-1B-Instruct to match the performance of the 70x larger Llama-3.1-70B-Instruct, and allows Qwen3-1.7B to reach o1-like reasoning accuracy. These results highlight the power of structured search in decoding, offering an alternative to brute-force sampling or scale-driven gains. Our work demonstrates how thoughtful inference-time strategies can enhance reasoning in SLMs, pointing toward future advances in more efficient and scalable LLM deployment.

Summary

A*-Decoding: Token-Efficient Inference Scaling

In a departure from traditional approaches that improve performance by scaling language models' parameter count and training data, the paper "A*-Decoding: Token-Efficient Inference Scaling" introduces an inference-time strategy that applies structured search, specifically the A* search algorithm, to enhance reasoning in small language models (SLMs). This work posits that inference-time compute is an underexploited resource for improving performance on complex reasoning tasks, especially under fixed compute budgets.

Core Contribution and Methodology

The paper introduces A*-decoding, a novel framework that reimagines LLM decoding as a structured search task in a state space of partial solutions. Utilizing the A* search algorithm, the proposed method focuses on identifying high-quality reasoning paths by considering both heuristic estimates and cost functions. This approach optimally allocates compute resources during inference, contrasting with brute-force methods like best-of-N and particle filtering, which typically increase sample count or diversity without explicitly optimizing token usage.
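
The following is a minimal, illustrative sketch of such a best-first search loop in Python. The generate_steps, prm_score, and is_complete callables, along with the beam_width and cost_weight parameters, are placeholders introduced here for illustration; the paper's exact state representation, cost function, and heuristic may differ.

import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Node:
    f: float                             # priority: weighted cost plus heuristic (lower is better)
    g: float = field(compare=False)      # cost accumulated so far, e.g. tokens generated on this path
    steps: tuple = field(compare=False)  # partial solution: the reasoning steps emitted so far


def a_star_decode(problem, generate_steps, prm_score, is_complete,
                  beam_width=4, max_expansions=50, cost_weight=1e-3):
    """Best-first search over partial solutions (illustrative sketch).

    generate_steps(problem, steps) -> list of (next_step, token_cost) candidates
    prm_score(problem, steps)      -> estimated quality in [0, 1] (higher is better)
    is_complete(steps)             -> True once a final answer has been produced
    """
    frontier = [Node(f=0.0, g=0.0, steps=())]
    for _ in range(max_expansions):
        if not frontier:
            break
        node = heapq.heappop(frontier)            # most promising partial solution so far
        if is_complete(node.steps):
            return node.steps
        for step, token_cost in generate_steps(problem, node.steps)[:beam_width]:
            steps = node.steps + (step,)
            g = node.g + token_cost
            h = 1.0 - prm_score(problem, steps)   # turn PRM quality into an estimated remaining cost
            heapq.heappush(frontier, Node(f=cost_weight * g + h, g=g, steps=steps))
    return None  # expansion budget exhausted without a complete solution

Keying the priority queue on a weighted sum of accumulated token cost and a PRM-derived heuristic is one way to trade generation budget against estimated solution quality; the weighting here is an assumption, not a value taken from the paper.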

The framework operationalizes A* search over a dynamic graph of candidate continuations, guided by a process-supervised heuristic derived from Process Reward Models (PRMs). These models supply step-level feedback that steers the search toward high-value trajectories while pruning less promising paths. Compared to traditional methods, A*-decoding reduces token usage by up to 3x while maintaining or improving performance.
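
One plausible way to supply the continuation generator and PRM heuristic used in the sketch above is shown below. The lm.sample and prm.score calls are hypothetical stand-ins for a policy model's generation API and a PRM's step-scoring API, and aggregating per-step scores with a minimum over prefixes is an assumption rather than the paper's stated rule.

def make_generate_steps(lm, n_samples=4, stop="\n"):
    """Sample a few candidate next reasoning steps from the policy LM.

    `lm.sample` is a placeholder generation call assumed to return a
    (step_text, token_count) pair for one sampled continuation.
    """
    def generate_steps(problem, steps):
        prompt = problem + "\n" + "".join(s + stop for s in steps)
        return [lm.sample(prompt, stop=stop) for _ in range(n_samples)]
    return generate_steps


def make_prm_score(prm):
    """Score a partial solution with a process reward model (PRM).

    `prm.score` is a placeholder returning a quality in [0, 1] for a step
    prefix; taking the minimum over prefixes penalizes any weak intermediate
    step, which helps prune low-value branches early.
    """
    def prm_score(problem, steps):
        if not steps:
            return 0.5  # neutral prior for the empty prefix
        return min(prm.score(problem, steps[:i + 1]) for i in range(len(steps)))
    return prm_score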

Experimental Insights and Numerical Results

The evaluation is conducted on the math benchmarks MATH500 and AIME 2024, which demand precise, token-efficient generation. Notably, Llama-3.2-1B-Instruct, a relatively small model, matches the performance of the substantially larger Llama-3.1-70B-Instruct, and Qwen3-1.7B reaches o1-like reasoning accuracy while preserving token efficiency. This indicates that intelligent inference-time strategies can effectively bridge the performance gap between smaller and larger models.

A*-decoding achieves competitive exact-match accuracy while generating up to 3x fewer tokens and requiring 30% fewer PRM passes than competing approaches under equivalent compute budgets.

Implications and Future Research Directions

The implications of this research are manifold. Practically, it paves the way for cost-effective deployment of SLMs in resource-constrained environments, significantly lowering operational costs associated with inference-time compute. Theoretically, it challenges the entrenched model of performance gains through parameter scaling alone, suggesting that strategic inference-time interventions can offer sustainable and scalable enhancements.

Future research could expand on integrating different types of process supervision signals, allowing A*-decoding to adapt to diverse problem domains while maintaining inference efficiency. Further exploration into model-specific optimizations and adaptive tuning of hyperparameters for various tasks could amplify the benefits observed, providing deeper insights into state space traversal dynamics.

Overall, this paper contributes a robust decoding framework that merges structured search algorithms with LM policies, opening new pathways for LLM deployment under constrained compute budgets.
