Will larger LLMs eventually surpass smaller ones with sufficient compute?
Determine whether, under fixed-budget evaluation for code-generation tasks with unit tests (e.g., HumanEval and MBPP), sufficiently increasing the compute budget will cause larger models (such as Code Llama 34B or 70B) to overtake smaller models (such as Code Llama 7B or 13B), or whether models of all sizes eventually saturate at a similar performance level.
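One way to make the fixed-budget comparison concrete is the standard unbiased pass@k estimator (used for HumanEval-style evaluation): under a fixed compute budget, a cheaper small model affords a larger k than an expensive large model. The sketch below is illustrative only; the cost ratio and per-model correctness counts are hypothetical assumptions, not figures from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations (c of which
    pass the unit tests) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical fixed-budget comparison: suppose a small model is ~5x
# cheaper per generation than a large one, so the budget that buys the
# large model k=1 sample buys the small model k=5 samples.
small = pass_at_k(n=100, c=30, k=5)  # small model: 30/100 correct
large = pass_at_k(n=100, c=45, k=1)  # large model: 45/100 correct
print(f"small pass@5 = {small:.3f}, large pass@1 = {large:.3f}")
```

Under these made-up numbers the small model wins at this budget; the open question above is whether that ordering persists, flips, or converges as k (i.e., compute) grows for every model size.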
References
An interesting question we do not fully address is whether, given enough compute, the larger models will overtake the smaller ones, or perhaps they will all saturate at a similar performance level at some point.
                — The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
                (2404.00725 - Hassid et al., 31 Mar 2024), in Discussion and Limitations, first paragraph