Scaling LLM Inference with Optimized Sample Compute Allocation (2410.22480v1)
Abstract: Sampling is a basic operation in many LLM inference-time algorithms. To scale up inference efficiently with limited compute, it is crucial to find an optimal allocation of the sample compute budget: Which sampling configurations (model, temperature, language, etc.) do we use? How many samples do we generate in each configuration? We formulate these choices as a learning problem and propose OSCA, an algorithm that Optimizes Sample Compute Allocation by finding an optimal mix of different inference configurations. Our experiments show that with our learned mixed allocation, we can achieve accuracy better than the best single configuration with 128x less compute on code generation and 25x less compute on 4 reasoning tasks. OSCA is also effective in agentic workflows beyond single-turn tasks, achieving better accuracy on SWE-Bench with 3x less compute than the default configuration. Our code and generations are released at https://github.com/LeiLiLab/OSCA.
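To make the allocation problem concrete: given a total budget of N samples and per-sample success-probability estimates for each configuration, the goal is to choose how many samples each configuration gets so that the chance of at least one sample solving each problem is maximized. Below is a minimal sketch of a greedy allocator in this spirit; it is an illustration, not the paper's exact OSCA algorithm, and the function name `optimize_allocation` and the `success_probs` estimates (assumed to come from held-out problems) are hypothetical.

```python
import numpy as np

def optimize_allocation(success_probs: np.ndarray, budget: int) -> np.ndarray:
    """Greedily split a sample budget across inference configurations.

    success_probs[i, c] is an estimate (from held-out problems) of the
    probability that a single sample from configuration c solves problem i.
    Returns an integer allocation over configurations summing to `budget`.
    """
    num_problems, num_configs = success_probs.shape
    # log P(problem i still unsolved) under the current allocation; starts at log 1 = 0.
    log_unsolved = np.zeros(num_problems)
    # log(1 - p): adding one sample from config c multiplies the failure
    # probability of problem i by (1 - p[i, c]), i.e., adds this term in log space.
    log_miss = np.log1p(-np.clip(success_probs, 0.0, 1.0 - 1e-9))
    alloc = np.zeros(num_configs, dtype=int)
    for _ in range(budget):
        # Expected pass rate after adding one more sample from each candidate config.
        gains = [np.mean(1.0 - np.exp(log_unsolved + log_miss[:, c]))
                 for c in range(num_configs)]
        best = int(np.argmax(gains))
        alloc[best] += 1
        log_unsolved += log_miss[:, best]
    return alloc

# Toy usage: two configurations (e.g., two temperatures), budget of 8 samples.
p = np.array([[0.60, 0.10],
              [0.05, 0.30],
              [0.20, 0.20]])
print(optimize_allocation(p, budget=8))
```

The sketch reflects the abstract's core claim: because different configurations tend to solve different problems, a learned mix of configurations can reach higher coverage than spending the entire budget on the single best configuration.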