An Analytical Perspective on "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling"
The paper under examination, "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling," explores the efficacy of scaling inference compute by increasing the number of samples generated per problem. Rather than the traditional focus on scaling model size or pre-training data, it emphasizes the potential benefits of spending additional compute during the inference phase.
Summary of Findings
The authors demonstrate that scaling the number of samples generated during inference leads to substantial improvements in the fraction of problems solved (referred to as coverage) across various domains and tasks. Notably, they show that in domains where automatic verification of answers is feasible—such as coding tasks or formal proofs—improvements in coverage directly enhance model performance.
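In pseudocode, the core recipe is simple. The sketch below is a minimal illustration, assuming hypothetical generate_sample and verify callables standing in for a model call and an automatic verifier (e.g., unit tests or a proof checker); neither name comes from the paper.

```python
def solve_with_repeated_sampling(problem, generate_sample, verify, k):
    """Draw up to k independent samples; return the first one that passes verification.

    `generate_sample` and `verify` are placeholders for a model call and an
    automatic verifier (e.g., unit tests or a proof checker).
    """
    for _ in range(k):
        candidate = generate_sample(problem)
        if verify(problem, candidate):
            return candidate  # this problem counts toward coverage
    return None  # unsolved within the sample budget
```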
Key Numerical Results:
- SWE-bench Lite Enhancement: With 250 samples from DeepSeek-Coder-V2-Instruct, the fraction of solved issues rises from 15.9% (single attempt) to 56%. This surpasses the single-attempt state-of-the-art (SOTA) of 43%, which was achieved with more advanced (and costlier) models such as GPT-4o and Claude 3.5 Sonnet.
- Cost Efficiency: Drawing five samples from the cheaper DeepSeek-Coder-V2-Instruct solves more issues than a single attempt from premium models like GPT-4o or Claude 3.5 Sonnet, while being over three times cheaper.
- Coverage Scaling: On math word problems (GSM8K and MATH), coverage with Llama-3 models exceeds 95% at 10,000 samples.
Repeated Sampling: Coverage and Precision
The paper frames the benefits of repeated sampling in terms of two key properties (a short estimator sketch follows this list):
- Coverage: the fraction of problems for which at least one generated sample is a correct solution; coverage grows as more samples are drawn.
- Precision: the ability to identify a correct sample within the pool of generations; this becomes the bottleneck once coverage is high.
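Coverage in this sense is essentially pass@k. A minimal sketch of the standard unbiased estimator follows; the helper names are mine, not the paper's, and the example counts are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct, estimated
    without bias from n drawn samples of which c were verified correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def coverage(per_problem: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)

# Example: three problems, 100 samples each, with 0, 3, and 40 verified-correct samples.
print(coverage([(100, 0), (100, 3), (100, 40)], k=10))
```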
The empirical evaluations confirm that across a range of tasks, from GSM8K and MATH to competitive programming and formal proofs in Lean 4, the relationship between coverage and the number of samples is often log-linear. This behavior hints at the existence of inference-time scaling laws akin to those observed for training compute.
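To make "log-linear" concrete, here is a toy fit of coverage against the logarithm of the sample budget. The coverage values are illustrative placeholders, and the authors' own curve fits may use a different functional form, so treat this purely as a sketch of the shape of the relationship.

```python
import numpy as np

# Hypothetical coverage measurements at increasing sample budgets (illustrative only).
k = np.array([1, 10, 100, 1_000, 10_000])
cov = np.array([0.10, 0.28, 0.52, 0.74, 0.90])

# "Log-linear" means coverage grows roughly linearly in log(k): cov ≈ a + b * ln(k).
b, a = np.polyfit(np.log(k), cov, deg=1)
print(f"fitted curve: coverage ≈ {a:.3f} + {b:.3f} * ln(k)")
```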
Verification Challenges and Future Directions
While automatic verifiers (e.g., proof checkers, unit tests) yield direct performance improvements from increased coverage, tasks without such verifiers pose a unique challenge. Standard methods like majority voting or reward models plateau beyond several hundred samples, which prevents them from fully leveraging the expanded sample budget.
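For intuition, the simplest verifier-free selection rule looks like the generic sketch below; it is not the paper's implementation, and the sampled answers are made up. Rules of this kind saturate because, past a point, extra samples rarely change which answer holds the plurality, even as rarer correct answers keep appearing in the pool.

```python
from collections import Counter

def majority_vote(final_answers: list[str]) -> str:
    """Return the most common final answer among the generated samples."""
    return Counter(final_answers).most_common(1)[0][0]

# Seven sampled final answers to a hypothetical math word problem.
samples = ["42", "41", "42", "42", "17", "42", "41"]
print(majority_vote(samples))  # -> "42"
```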
The authors highlight that improving tools for identifying correct samples is crucial, particularly for domains that lack robust verification mechanisms. Future research could focus on enhancing verification methods or developing algorithms that can effectively learn from multiple generations, thereby amplifying correct solutions even when embedded in a larger set of incorrect ones.
Practical and Theoretical Implications
Practical Implications:
- Cost Optimization: The findings suggest that practitioners can achieve better cost-performance trade-offs with repeated sampling from less capable but cheaper models than with single attempts from top-tier models (a back-of-the-envelope sketch follows this list).
- System Efficiency: Repeated sampling creates a distinctive inference workload in which many attempts share the same prompt, so throughput optimizations that exploit this shared prefix could lower operational costs further.
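A back-of-the-envelope comparison makes the cost trade-off concrete. The per-attempt prices below are arbitrary placeholders, not figures from the paper.

```python
# Hypothetical per-attempt costs in dollars (placeholders, not the paper's numbers).
cheap_cost_per_attempt = 0.01    # e.g., a smaller open-weight model
premium_cost_per_attempt = 0.15  # e.g., a frontier API model

k = 5  # number of attempts from the cheaper model
cheap_total = k * cheap_cost_per_attempt
premium_total = premium_cost_per_attempt

print(f"{k} cheap attempts: ${cheap_total:.2f} vs. 1 premium attempt: ${premium_total:.2f}")
# Repeated sampling wins whenever the cheaper model's k-attempt coverage exceeds the
# premium model's single-attempt success rate at lower total cost.
```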
Theoretical Implications:
- Inference-Time Scaling Laws: The observed log-linear relationship between coverage and the number of samples presents an intriguing line of inquiry. Defining these scaling laws more precisely could enable more predictable improvements and resource allocation strategies during model deployment.
- Verifier Development: The gap between coverage and realized performance in domains without automatic verifiers signals a need for better verification methods, whether models that can self-assess or external evaluators with higher precision.
Conclusion
The paper provides a valuable contribution to understanding how scaling inference compute via repeated sampling can significantly enhance LLM performance across a range of tasks. By rethinking how inference is performed, the authors make a case for more cost-efficient and performance-optimized model deployments. The key insights into coverage expansion, together with the need for better precision mechanisms, outline fertile ground for future AI research and application.