
Will larger LLMs eventually surpass smaller ones with sufficient compute?

Determine whether, under fixed-budget evaluation for code-generation tasks with unit tests (e.g., HumanEval and MBPP), sufficiently increasing compute will cause larger models (such as Code Llama 34B or 70B) to overtake smaller models (such as Code Llama 7B or 13B), or whether models of all sizes eventually saturate at a similar performance level.


Background

The paper compares large and small Code Llama models under fixed compute budgets (FLOPs and wall-time) for code generation tasks, finding that multiple samples from smaller models can outperform single generations from larger models in many regimes.

While these results hold across HumanEval, MBPP, and APPS splits for the budgets explored, the authors note that compute constraints prevented them from exploring very large numbers of generations. It therefore remains uncertain whether the observed advantage of smaller models persists indefinitely or whether larger models eventually overtake them as compute increases.
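The budget-reallocation comparison described above can be sketched with the standard unbiased pass@k estimator (Chen et al., 2021). All numbers below are hypothetical, chosen only to illustrate the mechanism: under a fixed FLOP budget, a 7B model affords roughly 34/7 times as many samples as a 34B model, and many cheap samples can beat a few expensive ones even when the larger model has a higher per-sample pass rate.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples, drawn without replacement from n
    generations of which c pass the unit tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical relative per-sample costs, using the parameter-count
# ratio as a rough FLOP proxy (not the paper's exact accounting).
FLOPS_7B, FLOPS_34B = 1.0, 34.0 / 7.0

budget = 5 * FLOPS_34B             # budget = 5 generations from the 34B model
k_small = int(budget / FLOPS_7B)   # samples the 7B model affords under it

# Illustrative pass counts (not from the paper): suppose in 200 trial
# generations the 7B model solved a task 40 times, the 34B model 90 times.
p_small = pass_at_k(n=200, c=40, k=k_small)
p_large = pass_at_k(n=200, c=90, k=5)
print(f"7B with {k_small} samples: pass@{k_small} = {p_small:.3f}")
print(f"34B with 5 samples:  pass@5  = {p_large:.3f}")
```

With these made-up numbers the smaller model's many samples yield a higher success probability, mirroring the paper's finding for the explored regimes; the open question is whether this ordering flips once the per-task success rate of the smaller model saturates while larger models keep improving.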

References

An interesting question we do not fully address is whether, given enough compute, the larger models will overtake the smaller ones, or perhaps they will all saturate at a similar performance level at some point.

The Larger the Better? Improved LLM Code-Generation via Budget Reallocation (2404.00725 - Hassid et al., 31 Mar 2024) in Discussion and Limitations, first paragraph