The Larger the Better? Improved LLM Code-Generation via Budget Reallocation (2404.00725v2)

Published 31 Mar 2024 in cs.SE, cs.AI, cs.CL, and cs.LG

Abstract: It is a common belief that large language models (LLMs) are better than smaller-sized ones. However, larger models also require significantly more time and compute during inference. This begs the question: what happens when both models operate under the same budget? (e.g., compute, run-time). To address this question, we analyze code generation LLMs of various sizes and make comparisons such as running a 70B model once vs. generating five outputs from a 13B model. We consider a standard unit-test setup, which can be used to select the correct output from the smaller model. Our findings reveal that the repeated use of smaller models can yield consistent improvements, with gains of up to 15% across five tasks. On the other hand, in scenarios where unit-tests are unavailable, a ranking-based selection of candidates from the smaller model falls short of the performance of a single output from larger ones. Our results highlight the potential of using smaller models instead of larger ones, and the importance of studying approaches for ranking LLM outputs.

Improved Code Generation with LLMs via Budget Reallocation

The paper "The Larger the Better? Improved LLM Code-Generation via Budget Reallocation" provides a nuanced analysis of LLMs concerning their effectiveness in code generation tasks under constrained computational budgets. Contrary to established belief, which assumes that larger LLMs inherently yield better performance, this research explores the gains achievable by optimally reallocating computational resources between model size and the number of inference passes—a paradigm shift from the traditional approach of scaling models.

The authors conduct a comparative study of LLMs of various sizes, evaluating their code-generation performance under identical budgets. The primary finding is that smaller models such as the 7B and 13B Code Llama variants can significantly outperform the larger 34B and 70B variants in fixed-budget scenarios. These evaluations were conducted on widely recognized benchmarks such as HumanEval, MBPP, and APPS, with performance improvements reaching up to 15%.
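
As a rough illustration of how such budget matching works (a sketch using the common approximation of about 2 inference FLOPs per parameter per generated token, not the paper's exact accounting), the number of smaller-model samples that fit into one larger-model pass can be estimated from the ratio of parameter counts, assuming comparable output lengths:

```python
# Back-of-the-envelope sketch (our assumption, not the paper's exact accounting):
# approximate decoder inference cost as ~2 FLOPs per parameter per generated
# token, and assume comparable output lengths for both models.

def flops_per_token(n_params: float) -> float:
    """Approximate inference FLOPs needed to generate one token."""
    return 2.0 * n_params

def samples_within_budget(large_params: float, small_params: float) -> int:
    """How many small-model generations fit in the budget of one large-model pass."""
    return int(flops_per_token(large_params) // flops_per_token(small_params))

print(samples_within_budget(70e9, 13e9))  # -> 5: one 70B pass ~ five 13B generations
```

Under this approximation the answer reduces to the ratio of parameter counts, consistent with the abstract's example of trading one 70B pass for roughly five 13B generations.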

The methodological cornerstone of the paper is the use of both FLOPs and wall-time as budgetary constraints, which allows multiple sampled generations from smaller models to be compared against single passes of larger counterparts. In this setup, outputs are produced and selected based on an adaptation of the pass@k metric, which traditionally evaluates model performance over multiple outputs but is applied here under computational budget constraints.
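
For reference, the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021), which the paper adapts to budget-constrained comparisons, can be computed as follows (a minimal sketch; the variable names are ours):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that at
    least one of k samples, drawn without replacement from n generations of
    which c pass the unit tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 100 generations, 15 of them correct, budget allows k = 5 picks.
print(round(pass_at_k(n=100, c=15, k=5), 3))  # -> 0.564
```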

Results across benchmarks consistently indicate that, given the same computational budget, smaller models generating multiple candidates not only match but exceed the performance of their larger alternatives. For example, on the competition split of APPS, the most demanding task in the evaluation, the 13B model exhibited superior performance across almost all computational thresholds.
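
In the unit-test setting, selecting among the smaller model's candidates reduces to executing each candidate against the task's tests and keeping one that passes. A minimal sketch is shown below; the `passes_tests` callback is a hypothetical stand-in for a sandboxed test runner, not an interface from the paper:

```python
from typing import Callable, Iterable, Optional

def select_by_unit_tests(
    candidates: Iterable[str],
    passes_tests: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate program that passes the task's unit tests.

    `passes_tests` is a hypothetical callback that would execute a candidate
    against the unit tests in a sandbox and report whether they all pass.
    """
    for program in candidates:
        if passes_tests(program):
            return program
    return None  # no candidate passed; a fallback (e.g. ranking) is needed
```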

The paper also addresses a compelling secondary question: what happens when no automatic evaluation signal, such as unit tests, is available? The authors explore ranking-based selection mechanisms that use NLL (negative log-likelihood) scores, and find that using larger models as rankers of the smaller model's generations improves outcomes. Nonetheless, these ranking-based approaches did not consistently outperform a single greedy output from a larger model, suggesting an ongoing trade-off between selection sophistication and inherent model capability.
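
A minimal sketch of such NLL-based ranking is given below, assuming a Hugging Face causal LM as the ranker; the model name and setup are our assumptions rather than the authors' exact configuration. Each candidate is scored by its average per-token negative log-likelihood under the ranker, and the lowest-scoring candidate is selected:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative ranker choice (placeholder assumption, not the paper's setup).
ranker_name = "codellama/CodeLlama-34b-hf"
tokenizer = AutoTokenizer.from_pretrained(ranker_name)
ranker = AutoModelForCausalLM.from_pretrained(ranker_name, torch_dtype=torch.float16)
ranker.eval()

@torch.no_grad()
def mean_nll(prompt: str, completion: str) -> float:
    """Average per-token NLL of the completion, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss
    return ranker(full_ids, labels=labels).loss.item()

def rank_candidates(prompt: str, candidates: list[str]) -> str:
    """Return the candidate with the lowest mean NLL under the ranker."""
    return min(candidates, key=lambda c: mean_nll(prompt, c))
```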

The implications of this research are significant both theoretically and practically. The findings point to a feasible alternative to large-scale models that improves computational efficiency without markedly compromising performance, which is particularly relevant given the rising deployment costs associated with model scaling. Furthermore, the paper highlights the need to optimize not just model architectures but also inference-time policies, taking into account different deployment constraints and application requirements.

Looking forward, the paper provides an empirical base for continued research into adaptive computation allocation within AI systems. With the release of substantial data comprising over a million outputs from smaller models, the authors facilitate future investigations into optimization across model scales and ranking strategies.

In conclusion, this paper contributes critical insights and operational strategies to the evolving field of efficient AI by questioning the hegemony of increasingly larger models, thereby offering an efficient alternative rooted in strategic computational deployments. As the AI community advances, embracing such nuanced explorations will be pivotal for informed decision-making in model training and deployment.

Authors (5)
  1. Michael Hassid (12 papers)
  2. Tal Remez (26 papers)
  3. Jonas Gehring (14 papers)
  4. Roy Schwartz (74 papers)
  5. Yossi Adi (96 papers)