Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
This presentation explores a paradigm shift in improving language model performance: instead of scaling model size or training data, the authors demonstrate how generating multiple samples at inference time can dramatically increase problem-solving success. Examining tasks from coding to mathematical reasoning, they reveal log-linear scaling laws for coverage and show that amplifying cheaper models with repeated sampling can outperform single attempts from premium models at a fraction of the cost. The talk covers the methodology, striking empirical results including a 56% solve rate on SWE-bench Lite, and the critical challenge of identifying correct solutions in large sample pools.

Script
What if the secret to better AI performance isn't building bigger models, but simply asking the same model to try again and again? This paper reveals a surprisingly powerful approach to scaling inference compute through repeated sampling.
Building on that idea, the authors demonstrate that generating multiple candidate solutions and using domain-specific verifiers creates a powerful performance multiplier. Coverage, the fraction of problems solved by at least one sample, increases predictably as more samples are generated.
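The sample-then-verify loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` and `verify` are hypothetical stand-ins for a model call and a domain-specific checker such as a unit-test harness or proof checker.

```python
import random

def solve_with_repeated_sampling(problem, generate, verify, k=100):
    """Draw up to k candidate solutions; return the first one the
    verifier accepts, or None if the sample budget is exhausted.
    `generate` and `verify` are hypothetical stand-ins for a model
    call and a domain verifier (unit tests, proof checker)."""
    for _ in range(k):
        candidate = generate(problem)
        if verify(problem, candidate):
            return candidate
    return None  # no sample in the budget passed verification

# Toy demo: the "model" guesses digits, the "verifier" checks parity.
random.seed(0)
answer = solve_with_repeated_sampling(
    "toy", lambda p: random.randint(0, 9), lambda p, c: c % 2 == 0
)
```

The key design point is that the verifier, not the model, decides success, which is why domains with automatic checkers benefit from this approach immediately.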
Let's examine the remarkable empirical evidence.
The results on software engineering benchmarks are particularly striking. By amplifying an open-source model through repeated sampling, the researchers exceeded the single-attempt performance of premium models, fundamentally challenging assumptions about model capability versus computational strategy.
This visualization captures the core finding across multiple domains. Notice how coverage rises systematically with more samples for coding, mathematics, and formal proofs. The SWE-bench curve shows the dramatic jump from baseline to the 56% solve rate, while mathematical reasoning tasks approach near-perfect coverage with sufficient samples. The log-linear relationship suggests predictable scaling behavior similar to training compute laws.
This comparison reveals a fundamental insight about resource allocation. The researchers found that generating multiple samples from a less capable but cheaper model proved more cost-effective than single attempts from premium models, while simultaneously achieving better results.
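The resource-allocation trade-off comes down to simple arithmetic. The prices below are made-up placeholders for illustration, not figures from the paper:

```python
# Hypothetical per-sample API prices (illustrative only, not from the paper).
cheap_price_per_sample = 0.002    # e.g. a small open-weights model
premium_price_per_sample = 0.060  # e.g. a frontier model
samples_from_cheap_model = 5

cheap_total = cheap_price_per_sample * samples_from_cheap_model

# Five attempts from the cheap model cost a fraction of one premium
# attempt; if coverage at k=5 exceeds the premium model's single-attempt
# success rate, the cheap route wins on both cost and accuracy.
assert cheap_total < premium_price_per_sample
```

Whether the inequality pays off in practice depends on how quickly the cheaper model's coverage curve rises with k on the task at hand.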
The universality of these findings is noteworthy. Repeated sampling delivered coverage improvements regardless of model size, training approach, or architecture family, suggesting this is a fundamental property of language model inference rather than a quirk of specific models.
However, converting coverage into practical performance requires solving a critical problem.
This limitation reveals the frontier of this approach. While tasks with unit tests or proof checkers benefit immediately from expanded coverage, domains lacking robust verification see diminishing returns because existing selection methods cannot effectively identify rare correct solutions buried among many incorrect attempts.
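Without an automatic verifier, a common fallback selection heuristic is majority voting over final answers. The sketch below shows why it struggles in exactly the regime described above: it can only surface an answer that is already frequent among the samples, so a rare correct solution stays buried.

```python
from collections import Counter

def majority_vote(answers):
    """Select the most frequent final answer among sampled outputs --
    a common verifier-free selection heuristic (e.g. for math answers).
    It fails when the correct answer is rare in the sample pool."""
    return Counter(answers).most_common(1)[0][0]

# The one correct answer ("7") is outvoted by a popular wrong one.
picked = majority_vote(["12", "12", "12", "7", "9"])
```

This is why improving verification, rather than generating still more samples, is framed as the binding constraint for unverifiable domains.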
Looking forward, these findings open multiple research directions. Formalizing inference-time scaling laws could enable predictable performance improvements, while breakthroughs in solution verification would unlock the full potential of repeated sampling across all domains where language models are deployed.
The Large Language Monkeys paper demonstrates that sometimes the path to better AI isn't building a bigger brain, but learning to think multiple times. Visit EmergentMind.com to explore this research and discover how inference-time compute scaling is reshaping our approach to language model performance.