A*-Decoding: Token-Efficient Inference Scaling
In a departure from traditional approaches that scale large language models (LLMs) in terms of parameter count and training data, the paper "A*-Decoding: Token-Efficient Inference Scaling" introduces an inference-time strategy that applies structured search, specifically the A* search algorithm, to improve reasoning in small language models (SLMs). The work argues that inference-time compute is an underused lever for improving performance on complex reasoning tasks, especially under fixed compute budgets.
Core Contribution and Methodology
The paper introduces A*-decoding, a framework that recasts LLM decoding as a structured search problem over a state space of partial solutions. Using the A* search algorithm, the method identifies high-quality reasoning paths by combining the cost already incurred with heuristic estimates of a partial solution's promise. This lets it allocate inference compute more deliberately than brute-force methods such as best-of-N and particle filtering, which typically increase sample count or diversity without explicitly optimizing token usage.
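In standard A* terms, each partial solution s in the search frontier is ranked by the sum of the cost already paid and a heuristic estimate of how promising it is to continue. This is the textbook decomposition the method builds on; the paper's exact cost and heuristic definitions may differ in detail.

```latex
f(s) = g(s) + h(s)
```

Here g(s) is the cost accumulated by the partial solution so far (for instance, tokens generated) and h(s) is a heuristic estimate of the value of completing it, which A*-decoding derives from process reward signals.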
The framework runs A* search over a dynamically expanded graph of candidate continuations, guided by a process-supervised heuristic derived from Process Reward Models (PRMs). These models provide step-level feedback that steers the search toward high-value trajectories while pruning less promising paths. Compared to traditional sampling-based methods, A*-decoding reduces token usage by up to a factor of three while maintaining or improving performance.
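To make the search framing concrete, the sketch below outlines how such a loop might look in Python: a priority queue of partial solutions, expansion by sampling candidate continuations from the policy model, and a priority that combines accumulated generation cost with a PRM-informed heuristic. The helpers generate_candidates, prm_score, generation_cost, and is_complete are hypothetical toy stand-ins for the policy model, the PRM, and the stopping check; this is a minimal illustration of the idea, not the paper's implementation.

```python
import heapq

# Toy stand-ins so the sketch runs end to end. In a real system these would be
# the policy language model, the process reward model (PRM), and a proper
# answer-completion check; the names and signatures here are hypothetical.
def generate_candidates(partial, k):
    return [f" step{i}" for i in range(k)]          # pretend LM continuations

def prm_score(candidate):
    return 1.0 / (1.0 + len(candidate))             # pretend PRM reward in (0, 1]

def generation_cost(candidate):
    return float(len(candidate.split()))            # pretend token count

def is_complete(candidate):
    return candidate.endswith("step0 step0 step0")  # pretend "final answer reached"

def a_star_decode(prompt, beam_width=4, max_expansions=64):
    """Minimal A*-style decoding loop over partial solutions."""
    frontier = [(0.0, prompt)]   # (priority, partial solution); lowest priority pops first
    expansions = 0

    while frontier and expansions < max_expansions:
        _, partial = heapq.heappop(frontier)
        if is_complete(partial):
            return partial
        expansions += 1

        # Expand the node: sample a few candidate next reasoning steps.
        for step in generate_candidates(partial, k=beam_width):
            candidate = partial + step
            g = generation_cost(candidate)   # cost paid so far (e.g. tokens)
            h = -prm_score(candidate)        # PRM-informed promise, negated to minimize
            # In practice g and h need a shared scale or an explicit weighting.
            heapq.heappush(frontier, (g + h, candidate))

    # No complete solution within budget: return the most promising partial one.
    return min(frontier, key=lambda item: item[0])[1] if frontier else prompt

print(a_star_decode("Solve: 2 + 2 ="))
```

The key design point is that the frontier is global: weak branches simply stop being expanded once better-scoring partial solutions dominate the queue, which is where the token savings over best-of-N sampling come from.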
Experimental Insights and Numerical Results
The evaluation covers the math benchmarks MATH500 and AIME 2024, reasoning tasks that demand precise and token-efficient generation. Notably, Llama-3.2-1B-Instruct, a relatively small model, reaches the performance of substantially larger models such as Llama-3.1-70B-Instruct. Qwen3-1.7B goes further, matching o1-like reasoning accuracy while preserving token efficiency. These results suggest that intelligent inference-time strategies can substantially close the performance gap between smaller and larger models.
Under equivalent compute budgets, A*-decoding achieves competitive exact-match accuracy while generating up to 3x fewer tokens and requiring roughly 30% fewer PRM passes than competing approaches.
Implications and Future Research Directions
The implications of this research are manifold. Practically, it paves the way for cost-effective deployment of SLMs in resource-constrained environments, significantly lowering the operational costs of inference-time compute. Theoretically, it challenges the entrenched assumption that performance gains must come from parameter scaling alone, suggesting that strategic inference-time interventions can deliver sustainable and scalable improvements.
Future research could explore integrating other types of process supervision signals, allowing A*-decoding to adapt to diverse problem domains while maintaining inference efficiency. Further work on model-specific optimizations and adaptive, task-dependent hyperparameter tuning could amplify the observed benefits and offer deeper insight into how the state space is traversed.
Overall, the paper contributes a robust decoding framework that merges structured search algorithms with modern LM policies, opening practical paths for deploying LLMs under tight compute constraints.