Transcending Scaling Laws with 0.1% Extra Compute (2210.11399v2)

Published 20 Oct 2022 in cs.CL, cs.AI, and cs.LG

Abstract: Scaling LLMs improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing LLMs and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art LLM (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of LLMs on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B scale, we show an approximately 2x computational savings rate where U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving $\sim$4.4 million TPUv4 hours). We further show that this improved scaling curve leads to 'emergent abilities' on challenging BIG-Bench tasks -- for instance, U-PaLM does much better than PaLM on some tasks or demonstrates better quality at much smaller scale (62B as opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many few-shot setups, i.e., English NLP tasks (e.g., commonsense reasoning, question answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual tasks (MGSM, TydiQA), MMLU and challenging BIG-Bench tasks. Finally, we provide qualitative examples showing the new capabilities of U-PaLM for single and multi-span infilling.

Overview of "Transcending Scaling Laws with 0.1% Extra Compute"

The paper presents UL2R (UL2Restore), a method for improving the performance of existing LLMs such as PaLM with minimal computational overhead. The central hypothesis is that an LLM's scaling properties on downstream tasks can be improved considerably simply by continuing its training for a small number of steps with a more diverse set of objectives, namely UL2's mixture-of-denoisers. This continued training adds only roughly 0.1% to 1% of the original pretraining FLOPs while substantially improving the models' scaling efficiency.
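
To make the mixture-of-denoisers idea concrete, below is a minimal, illustrative sketch of how UL2-style training examples might be constructed; it is not the authors' implementation. The sentinel id, the denoiser hyperparameters, and the helper names (`corrupt_spans`, `make_ul2_example`) are assumptions for illustration, and UL2 additionally prepends a mode token (e.g., [R], [S], [X]) to each example.

```python
import random

# Hypothetical sentinel id range for mask tokens; real vocabularies differ.
SENTINEL_BASE = 32000

# Illustrative denoiser settings (mean span length, corruption rate);
# the actual UL2 mixture uses several such configurations.
DENOISERS = {
    "R": (3, 0.15),   # regular denoising: short spans, low corruption
    "X": (32, 0.5),   # extreme denoising: long spans / heavy corruption
    "S": None,        # sequential denoising: prefix-LM-style suffix prediction
}

def corrupt_spans(tokens, mean_span, rate, rng):
    """Mask random spans with sentinels and return (inputs, targets)."""
    n = len(tokens)
    num_spans = max(1, int(n * rate) // mean_span)
    starts = sorted(rng.sample(range(n), num_spans))
    inputs, targets, i, sid = [], [], 0, 0
    for s in starts:
        if s < i:            # skip spans that overlap a previous one
            continue
        e = min(n, s + mean_span)
        inputs += tokens[i:s] + [SENTINEL_BASE + sid]
        targets += [SENTINEL_BASE + sid] + tokens[s:e]
        i, sid = e, sid + 1
    inputs += tokens[i:]
    return inputs, targets

def make_ul2_example(tokens, rng):
    """Sample a denoiser and build one (inputs, targets) training pair."""
    mode = rng.choice(list(DENOISERS))
    if mode == "S":          # prefix LM: bidirectional prefix, causal suffix
        cut = rng.randint(1, len(tokens) - 1)
        return mode, tokens[:cut], tokens[cut:]
    mean_span, rate = DENOISERS[mode]
    return (mode, *corrupt_spans(tokens, mean_span, rate, rng))

if __name__ == "__main__":
    rng = random.Random(0)
    fake_doc = list(range(100, 164))   # stand-in token ids
    for _ in range(3):
        mode, x, y = make_ul2_example(fake_doc, rng)
        print(mode, "inputs:", len(x), "targets:", len(y))
```

Continuing to train a causal decoder on examples like these exposes it to infilling and prefix-conditioned prediction in addition to ordinary left-to-right language modeling, which is the essence of the UL2R recipe.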

Key Findings and Methodology

  1. UL2R Approach: The core idea behind UL2R is to continue training with a mixture of pretraining objectives drawn from UL2's framework, including prefix language modeling and span-corruption (denoising) tasks with both short and long spans. This exposes a standard left-to-right causal LLM to prefix-conditioned (bidirectionally attended) inputs and infilling-style prediction, which is central to the improved downstream task performance; the sketch above illustrates how such a mixture can be constructed.
  2. Reduced Computational Costs: UL2R requires only a small fraction of additional compute relative to the gains in model performance. At the 540B scale it yields roughly a 2x computational saving compared with continuing conventional causal-LM training: U-PaLM matches the performance of the final PaLM 540B model at about half its computational budget, saving an estimated 4.4 million TPUv4 hours (a back-of-the-envelope breakdown of these figures is sketched after this list).
  3. Emergent Abilities: A notable outcome of this approach is the emergence of new capabilities. On challenging BIG-Bench tasks, U-PaLM outperforms its PaLM counterpart, in some cases reaching comparable quality at a much smaller scale (62B rather than 540B).
  4. Few-shot and Zero-shot Learning: Across several downstream NLP tasks and benchmarks such as commonsense reasoning, closed-book QA, and multilingual tasks, U-PaLM consistently outperformed PaLM. The strong improvements observed on both reasoning challenges and chain-of-thought tasks emphasize the beneficial impact of UL2R on few-shot learning paradigms.
  5. Practical and Theoretical Implications: The successful application of UL2R demonstrates efficient use of compute when training large-scale models, with implications for both production AI systems and studies of scaling laws. Practically, improving an existing model with a brief additional training phase encourages the reuse of pretrained checkpoints and reduces the overhead of training new models from scratch.
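
As a rough check on the compute figures in point 2, the following back-of-the-envelope sketch uses the standard ~6ND FLOP approximation for training an N-parameter model on D tokens, together with PaLM 540B's reported ~780B-token pretraining corpus; these reference points, and the implied full TPUv4-hour budget, are assumptions for illustration rather than numbers reported in this paper (only the 4.4 million saved TPUv4 hours comes from the abstract).

```python
# Back-of-the-envelope estimate of the UL2R compute budget.
# Assumptions: the standard ~6*N*D training-FLOP rule and PaLM 540B's
# reported ~780B pretraining tokens (neither figure is from this paper).
N = 540e9                              # PaLM parameters
D = 780e9                              # PaLM pretraining tokens
pretrain_flops = 6 * N * D             # ~2.5e24 FLOPs
extra_flops = 0.001 * pretrain_flops   # ~0.1% UL2R budget

print(f"PaLM 540B pretraining: ~{pretrain_flops:.2e} FLOPs")
print(f"UL2R extra compute (0.1%): ~{extra_flops:.2e} FLOPs")

# The abstract reports ~4.4 million TPUv4 hours saved by matching the final
# PaLM 540B at roughly half its budget; doubling that saving gives an
# *implied* (not reported) full-budget figure.
saved_tpu_hours = 4.4e6
print(f"Implied full training budget: ~{2 * saved_tpu_hours:.1e} TPUv4 hours")
```

Running this prints roughly 2.5e24 FLOPs for pretraining and about 2.5e21 FLOPs for the UL2R phase, which makes the "0.1% extra compute" framing tangible.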

Speculation on Future Developments

This work offers an intriguing perspective on refining the training of large models without escalating computational resources. Future investigations could explore integrating UL2R more broadly across LLM frameworks and architectures. Such training-enhancement techniques might also stimulate new research on predicting downstream task performance and broaden our understanding of emergent properties in neural networks as they scale.

In conclusion, the UL2R framework offers a compelling way to leverage existing compute investments in AI research and production, and points toward compute-efficient strategies for improving large models.

Authors (16)
  1. Yi Tay (94 papers)
  2. Jason Wei (49 papers)
  3. Hyung Won Chung (30 papers)
  4. Vinh Q. Tran (19 papers)
  5. David R. So (11 papers)
  6. Siamak Shakeri (29 papers)
  7. Xavier Garcia (36 papers)
  8. Huaixiu Steven Zheng (11 papers)
  9. Jinfeng Rao (17 papers)
  10. Aakanksha Chowdhery (19 papers)
  11. Denny Zhou (65 papers)
  12. Donald Metzler (49 papers)
  13. Slav Petrov (19 papers)
  14. Neil Houlsby (62 papers)
  15. Quoc V. Le (128 papers)
  16. Mostafa Dehghani (64 papers)
Citations (64)