The Larger the Better? Improved LLM Code-Generation via Budget Reallocation (2404.00725v2)

Published 31 Mar 2024 in cs.SE, cs.AI, cs.CL, and cs.LG

Abstract: It is a common belief that large language models (LLMs) are better than smaller-sized ones. However, larger models also require significantly more time and compute during inference. This begs the question: what happens when both models operate under the same budget? (e.g., compute, run-time). To address this question, we analyze code generation LLMs of various sizes and make comparisons such as running a 70B model once vs. generating five outputs from a 13B model. We consider a standard unit-test setup, which can be used to select the correct output from the smaller model. Our findings reveal that the repeated use of smaller models can yield consistent improvements, with gains of up to 15% across five tasks. On the other hand, in scenarios where unit-tests are unavailable, a ranking-based selection of candidates from the smaller model falls short of the performance of a single output from larger ones. Our results highlight the potential of using smaller models instead of larger ones, and the importance of studying approaches for ranking LLM outputs.

Improved Code Generation with LLMs via Budget Reallocation

The paper "The Larger the Better? Improved LLM Code-Generation via Budget Reallocation" provides a nuanced analysis of LLMs concerning their effectiveness in code generation tasks under constrained computational budgets. Contrary to established belief, which assumes that larger LLMs inherently yield better performance, this research explores the gains achievable by optimally reallocating computational resources between model size and the number of inference passes—a paradigm shift from the traditional approach of scaling models.

The authors conduct a comparative study of LLMs of various sizes, evaluating their code-generation performance under identical budgets. The primary finding is that smaller models such as the 7B and 13B Code Llama variants can significantly outperform the larger 34B and 70B variants in fixed-budget scenarios. These evaluations were conducted on widely recognized benchmarks such as HumanEval, MBPP, and APPS, with performance improvements reaching up to 15%.
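
As a rough illustration of how such budget matching works (a sketch using the common approximation of about 2 inference FLOPs per parameter per generated token, not the paper's exact accounting), the number of smaller-model samples that fit into one larger-model pass can be estimated from the ratio of parameter counts, assuming comparable output lengths:

```python
# Back-of-the-envelope sketch (our assumption, not the paper's exact accounting):
# approximate decoder inference cost as ~2 FLOPs per parameter per generated
# token, and assume comparable output lengths for both models.

def flops_per_token(n_params: float) -> float:
    """Approximate inference FLOPs needed to generate one token."""
    return 2.0 * n_params

def samples_within_budget(large_params: float, small_params: float) -> int:
    """How many small-model generations fit in the budget of one large-model pass."""
    return int(flops_per_token(large_params) // flops_per_token(small_params))

print(samples_within_budget(70e9, 13e9))  # -> 5: one 70B pass ~ five 13B generations
```

Under this approximation the answer reduces to the ratio of parameter counts, consistent with the abstract's example of trading one 70B pass for roughly five 13B generations.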

The methodological cornerstone of the paper is the use of both FLOPs and wall-time as budgetary constraints, which allows multiple sampled generations from smaller models to be compared against single passes of larger counterparts. In this setup, outputs are produced and selected based on an adaptation of the pass@k metric, which traditionally evaluates model performance over multiple outputs but is applied here under computational budget constraints.
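
For reference, the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021), which the paper adapts to budget-constrained comparisons, can be computed as follows (a minimal sketch; the variable names are ours):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that at
    least one of k samples, drawn without replacement from n generations of
    which c pass the unit tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 100 generations, 15 of them correct, budget allows k = 5 picks.
print(round(pass_at_k(n=100, c=15, k=5), 3))  # -> 0.564
```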

Results across benchmarks consistently indicate that, given the same computational budget, smaller models generating multiple candidates not only match but exceed the performance of their larger alternatives. For example, on the competition split of APPS, the most demanding task in the evaluation, the 13B model exhibited superior performance across almost all computational thresholds.
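
In the unit-test setting, selecting among the smaller model's candidates reduces to executing each candidate against the task's tests and keeping one that passes. A minimal sketch is shown below; the `passes_tests` callback is a hypothetical stand-in for a sandboxed test runner, not an interface from the paper:

```python
from typing import Callable, Iterable, Optional

def select_by_unit_tests(
    candidates: Iterable[str],
    passes_tests: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate program that passes the task's unit tests.

    `passes_tests` is a hypothetical callback that would execute a candidate
    against the unit tests in a sandbox and report whether they all pass.
    """
    for program in candidates:
        if passes_tests(program):
            return program
    return None  # no candidate passed; a fallback (e.g. ranking) is needed
```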

The paper also addresses a compelling secondary question: what happens when no automatic evaluation signal, such as unit tests, is available? The authors explore ranking-based selection mechanisms that use NLL (negative log-likelihood) scores, and find that using larger models as rankers of the smaller model's generations improves outcomes. Nonetheless, these ranking-based approaches did not consistently outperform a single greedy output from a larger model, suggesting an ongoing trade-off between selection sophistication and inherent model capability.
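
A minimal sketch of such NLL-based ranking is given below, assuming a Hugging Face causal LM as the ranker; the model name and setup are our assumptions rather than the authors' exact configuration. Each candidate is scored by its average per-token negative log-likelihood under the ranker, and the lowest-scoring candidate is selected:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative ranker choice (placeholder assumption, not the paper's setup).
ranker_name = "codellama/CodeLlama-34b-hf"
tokenizer = AutoTokenizer.from_pretrained(ranker_name)
ranker = AutoModelForCausalLM.from_pretrained(ranker_name, torch_dtype=torch.float16)
ranker.eval()

@torch.no_grad()
def mean_nll(prompt: str, completion: str) -> float:
    """Average per-token NLL of the completion, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss
    return ranker(full_ids, labels=labels).loss.item()

def rank_candidates(prompt: str, candidates: list[str]) -> str:
    """Return the candidate with the lowest mean NLL under the ranker."""
    return min(candidates, key=lambda c: mean_nll(prompt, c))
```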

The implications of this research are significant both theoretically and practically. The findings point to a feasible alternative to large-scale models that improves computational efficiency without markedly compromising performance, which is particularly relevant given the rising deployment costs associated with model scaling. Furthermore, the paper highlights the need to optimize not just model architectures but also inference-time policies, taking into account different deployment constraints and application requirements.

Looking forward, the paper provides an empirical base for continued research into adaptive computation allocation within AI systems. With the release of substantial data comprising over a million outputs from smaller models, the authors facilitate future investigations into optimization across model scales and ranking strategies.

In conclusion, this paper contributes critical insights and operational strategies to the evolving field of efficient AI by questioning the hegemony of increasingly larger models, thereby offering an efficient alternative rooted in strategic computational deployments. As the AI community advances, embracing such nuanced explorations will be pivotal for informed decision-making in model training and deployment.

Authors (5)
  1. Michael Hassid (12 papers)
  2. Tal Remez (26 papers)
  3. Jonas Gehring (14 papers)
  4. Roy Schwartz (74 papers)
  5. Yossi Adi (96 papers)