
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (2408.00724v2)

Published 1 Aug 2024 in cs.AI

Abstract: While the scaling laws of LLM training have been extensively studied, optimal inference configurations of LLMs remain underexplored. We study inference scaling laws and compute-optimal inference, focusing on the trade-offs between model sizes and generating additional tokens with different inference strategies. As a first step towards understanding and designing compute-optimal inference methods, we studied cost-performance trade-offs for inference strategies such as greedy search, majority voting, best-of-$n$, weighted voting, and two different tree search algorithms, using different model sizes and compute budgets. Our findings indicate smaller models (e.g., Llemma-7B) can outperform larger models given the same computation budgets, and that smaller models paired with advanced inference algorithms yield Pareto-optimal cost-performance trade-offs. For instance, the Llemma-7B model, equipped with our novel tree search algorithm, consistently outperforms Llemma-34B with standard majority voting on the MATH benchmark across all FLOPs budgets. We hope these findings contribute to a broader understanding of inference scaling laws for LLMs.

An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with LLMs

The paper "An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with LLMs," authored by Yangzhen Wu et al., explores the computation-efficiency trade-offs during inference with LLMs. Previous research has extensively explored optimal training configurations involving model sizes and compute budgets, but inference-time efficiency remains underexplored. This paper contributes significantly by assessing the performance and computational efficiency of multiple inference strategies—such as Greedy Search, Best-of-N, Majority Voting, and tree search variants—across different model sizes and computational environments.

Key Findings

  1. Inference Computation Scaling Laws:
    • The research finds that the optimal model size depends on the available computation budget. With limited budgets, smaller models outperform larger ones because, at a fixed FLOPs budget, they can generate and aggregate many more samples; error rates typically fall as computation increases and then saturate.
    • Through regression analysis, the paper establishes a relationship between inference FLOPs and the optimal model size, $\log_{10}(C) = 1.19 \log_{10}(N) + 2.03$, enabling practitioners to estimate the optimal model size for a given computational constraint (see the worked sketch after this list).
  2. Comparative Performance Analysis:
    • Smaller models such as Llemma-7B are shown to match the performance of larger models like Llemma-34B while using significantly fewer FLOPs (approximately half). This highlights that deploying smaller models with sophisticated inference algorithms can yield competitive performance at lower cost.
    • Different inference strategies yield different levels of improvement. Weighted Majority Voting consistently outperforms standard Majority Voting, in line with its higher theoretical accuracy limit when the reward model is effective (see the voting example after this list).
  3. Introduction of REBASE:
    • The paper introduces a novel tree search algorithm called Reward Balanced Search (REBASE). The method balances exploitation and exploration by using node rewards from a Process Reward Model (PRM) to decide how much sampling budget each partial solution receives.
    • Empirically, REBASE is more computationally efficient than traditional Monte Carlo Tree Search (MCTS), achieving higher accuracy with significantly fewer generated tokens and lower computational cost.
    • For example, REBASE enabled Llemma-7B to remain competitive with Llemma-34B on mathematical reasoning tasks while using fewer FLOPs, underscoring its value for practical, budget-constrained applications (a simplified sketch of the expansion step follows this list).
  4. Theoretical Contributions:
    • The paper theoretically characterizes the asymptotic behavior of Majority Voting and Weighted Majority Voting through Theorems 1 and 2: accuracy converges as the number of samples grows, with a higher limiting accuracy for Weighted Voting under an effective reward model (a notational sketch follows this list).
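
To make the fitted relationship in point 1 concrete, here is a minimal sketch of how one might invert the regression to estimate a compute-optimal model size for a given FLOPs budget. The helper names and the ~2N-FLOPs-per-token decoding approximation are illustrative assumptions, not code from the paper.

```python
import math

# Fitted inference scaling law reported in the paper:
# log10(C) = 1.19 * log10(N) + 2.03,
# where C is the inference FLOPs budget and N the parameter count.
SLOPE, INTERCEPT = 1.19, 2.03

def optimal_model_size(flops_budget: float) -> float:
    """Invert the regression to estimate the compute-optimal
    parameter count N for a given inference budget C."""
    return 10 ** ((math.log10(flops_budget) - INTERCEPT) / SLOPE)

def inference_flops(n_params: float, n_tokens: int) -> float:
    """Rough decoding cost via the common ~2*N-FLOPs-per-token
    approximation (an assumption used here for illustration)."""
    return 2.0 * n_params * n_tokens

# Example: estimate the model size the fit favors at a 1e15 FLOPs budget.
print(f"{optimal_model_size(1e15):.2e} parameters")
```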
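
The contrast in point 2 between the two voting schemes is easy to state in code. Below is a minimal sketch with hypothetical samples and reward-model scores: a wrong answer can win on raw counts while the correct one wins on reward mass.

```python
from collections import Counter, defaultdict

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent final answer among sampled solutions."""
    return Counter(answers).most_common(1)[0][0]

def weighted_vote(answers: list[str], rewards: list[float]) -> str:
    """Return the answer with the largest total reward-model score
    instead of the largest raw count."""
    mass: dict[str, float] = defaultdict(float)
    for answer, reward in zip(answers, rewards):
        mass[answer] += reward
    return max(mass, key=mass.get)

# Hypothetical samples: "41" is more frequent, but "42" scores higher.
answers = ["42", "41", "42", "41", "41"]
rewards = [0.9, 0.2, 0.8, 0.3, 0.1]
print(majority_vote(answers))           # 41  (3 votes vs 2)
print(weighted_vote(answers, rewards))  # 42  (mass 1.7 vs 0.6)
```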
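
Based on the description in point 3, the core of REBASE can be sketched as a per-depth expansion step that allocates a fixed sampling budget across frontier nodes in proportion to a softmax over their PRM scores. The `score` and `expand` callables stand in for a reward model and an LLM sampler, and the temperature is an assumed hyperparameter; this is a sketch of the idea, not the authors' implementation.

```python
import math

def rebase_step(frontier, score, expand, budget: int, temperature: float = 1.0):
    """One reward-balanced expansion step: split `budget` child samples
    across frontier nodes proportionally to softmax(reward / temperature)."""
    rewards = [score(node) for node in frontier]  # PRM score per partial solution
    top = max(rewards)
    weights = [math.exp((r - top) / temperature) for r in rewards]  # stable softmax
    total = sum(weights)
    children = []
    for node, weight in zip(frontier, weights):
        k = round(budget * weight / total)  # this node's share (rounding may drop a few)
        children.extend(expand(node, k))    # sample k continuations from the LLM
    return children
```

High-reward partial solutions receive most of the sampling budget (exploitation) while lower-reward ones still receive a share (exploration), which is how the method sidesteps the expensive rollouts of MCTS.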
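
For point 4, the flavor of the limiting statements can be written compactly; the notation below is ours, not the paper's. Let $p_a$ be the probability that the model samples final answer $a$ and $\bar{r}_a$ the mean reward of samples with answer $a$. By the law of large numbers, empirical vote counts and reward masses converge to these quantities, so over a distribution of questions with correct answer $a^\ast$:

```latex
% Sketch of the asymptotic behavior (our notation, not the paper's).
\lim_{n \to \infty} \mathrm{Acc}_{\mathrm{MV}}(n)
  = \Pr\left[\, p_{a^\ast} > \max_{a \neq a^\ast} p_a \,\right],
\qquad
\lim_{n \to \infty} \mathrm{Acc}_{\mathrm{WV}}(n)
  = \Pr\left[\, p_{a^\ast}\,\bar{r}_{a^\ast} > \max_{a \neq a^\ast} p_a\,\bar{r}_a \,\right].
```

This is why an effective reward model raises the accuracy ceiling relative to unweighted voting: it can shift the argmax toward the correct answer even when that answer is not the most frequently sampled one.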

Implications and Future Directions

Practical Implications

The findings provide a compelling argument for the deployment of smaller, optimized models with advanced decoding strategies (such as REBASE) in real-world applications where compute resources are limited. This has immense practical relevance for deploying LLMs on end-devices, enhancing problem-solving accuracy without incurring prohibitive computational costs.

Theoretical Implications

On a theoretical front, this paper advances our understanding of inference-time computation scaling laws. By systematically analyzing various configurations, it lays the groundwork for future research on optimizing inference methods for different model families and task types.

Future Developments in AI

The paper's results could spur further studies into compute-optimal strategies across a broader range of generative tasks beyond mathematical problem-solving. Exploring how different training datasets and varied domains influence these inference strategies holds potential for broader AI applications. Moreover, refinements in tree search algorithms inspired by REBASE could inspire enhancements in other computationally intensive LLM applications.

Conclusion

Wu et al. have provided a thorough and insightful analysis that bridges a significant research gap concerning inference-time efficiency of LLMs. Their empirical findings, combined with the innovative REBASE algorithm, offer a balanced view of the trade-offs between model size, computational cost, and accuracy. This work is bound to influence both theoretical research and practical deployment strategies in the field of LLMs.

Authors (5)
  1. Yangzhen Wu (2 papers)
  2. Zhiqing Sun (35 papers)
  3. Shanda Li (15 papers)
  4. Sean Welleck (54 papers)
  5. Yiming Yang (151 papers)