Rethinking Optimization and Architecture for Tiny Language Models (2402.02791v2)

Published 5 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: The power of LLMs has been demonstrated through numerous data and computing resources. However, the application of LLMs on mobile devices is facing huge challenge on the computation and memory costs, that is, tiny LLMs with high performance are urgently required. Limited by the highly complex training process, there are many details for optimizing LLMs that are seldom studied carefully. In this study, based on a tiny LLM with 1B parameters, we carefully design a series of empirical study to analyze the effect of each component. Three perspectives are mainly discussed, i.e., neural architecture, parameter initialization, and optimization strategy. Several design formulas are empirically proved especially effective for tiny LLMs, including tokenizer compression, architecture tweaking, parameter inheritance and multiple-round training. Then we train PanGu-π-1B Pro and PanGu-π-1.5B Pro on 1.6T multilingual corpora, following the established formulas. Experimental results demonstrate the improved optimization and architecture yield a notable average improvement of 8.87 on benchmark evaluation sets for PanGu-π-1B Pro. Besides, PanGu-π-1.5B Pro surpasses a range of SOTA models with larger model sizes, validating its superior performance. The code is available at https://github.com/YuchuanTian/RethinkTinyLM.

Authors (10)
  1. Yehui Tang (63 papers)
  2. Fangcheng Liu (7 papers)
  3. Yunsheng Ni (6 papers)
  4. Yuchuan Tian (11 papers)
  5. Zheyuan Bai (5 papers)
  6. Yi-Qi Hu (4 papers)
  7. Sichao Liu (3 papers)
  8. Shangling Jui (36 papers)
  9. Kai Han (184 papers)
  10. Yunhe Wang (145 papers)
Citations (8)

Summary

  • The paper demonstrates that optimizing tokenizer design, parameter initialization, and multi-round training significantly improves tiny language model performance.
  • The proposed recipe raised the average benchmark score from 42.41 to 51.28 while reducing model size by 16.67%, surpassing larger predecessors.
  • The study provides practical strategies for deploying efficient, high-performing language models on mobile devices.

Overview of Study Goals

This paper thoroughly examines the optimization and architecture of tiny language models (LMs), with the goal of improving their performance for deployment on mobile devices. The motivation stems from the urgent need for LMs that are both capable and computationally efficient, a demand that conventional training recipes and configurations for large LMs often fail to meet.

Analysis of Model Components

The paper is structured around an empirical analysis of various components involved in LM construction:

  • Neural Architecture: The paper begins by evaluating the impact of tokenizer size on the model's efficiency. Compact tokenizers, pruned of low-frequency vocabulary items, were found to significantly decrease parameter and computational overhead while maintaining representational adequacy (a vocabulary-pruning sketch follows this list).
  • Parameter Initialization: Next, the research assessed various parameter initialization strategies, underscoring the advantage of inheriting parameters from a larger trained model. The paper showed that this approach notably accelerates convergence and is especially impactful for tiny LMs because of their limited capacity (a parameter-inheritance sketch also appears below).
  • Optimization Strategy: Finally, the investigation examined optimization challenges that affect tiny LMs more severely than large models, such as data forgetting. A multi-round training methodology was proposed and shown to improve performance by focusing later rounds on difficult examples (a sketch of this re-sampling idea follows the results breakdown below).
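
The tokenizer compression above can be illustrated with a small sketch. This is not the paper's implementation: `prune_vocab` is a hypothetical helper, the whitespace tokenizer is a toy stand-in for a real subword tokenizer, and the idea is simply to keep the most frequent tokens (plus special tokens) so that the embedding and output-projection matrices can be sliced down accordingly.

```python
from collections import Counter

def prune_vocab(corpus, tokenize, vocab, keep_size,
                specials=("<pad>", "<unk>", "<bos>", "<eos>")):
    """Keep only the `keep_size` most frequent tokens (plus specials).

    Returns the new token->id mapping and an old-id -> new-id table that
    could be used to slice the embedding and output-projection matrices.
    """
    counts = Counter()
    for text in corpus:
        counts.update(tokenize(text))

    budget = keep_size - len(specials)
    kept = list(specials) + [tok for tok, _ in counts.most_common()
                             if tok not in specials][:budget]
    new_vocab = {tok: i for i, tok in enumerate(kept)}
    old_to_new = {vocab[tok]: new_vocab[tok] for tok in kept if tok in vocab}
    return new_vocab, old_to_new

# Toy usage with a whitespace "tokenizer"; a real setup would reuse the
# original subword tokenizer and re-map embedding rows via `old_to_new`.
corpus = ["the model compresses the tokenizer",
          "the tokenizer drops rare tokens"]
vocab = {tok: i for i, tok in enumerate(sorted({t for s in corpus for t in s.split()}))}
new_vocab, old_to_new = prune_vocab(corpus, str.split, vocab, keep_size=6)
print(len(vocab), "->", len(new_vocab))  # 7 -> 6
```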

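Parameter inheritance here means initializing the tiny model from a trained larger checkpoint rather than from scratch. The PyTorch sketch below is a rough illustration under an assumed `layers.<i>.` parameter-naming scheme: it maps each small-model layer to a chosen large-model layer and truncates wider tensors to fit. The paper's actual choice of which layers and dimensions to inherit is more deliberate than this.

```python
import re
import torch
from torch import nn

def inherit_parameters(small_model: nn.Module, large_state: dict, layer_map: dict):
    """Initialize `small_model` from a larger model's state dict (sketch).

    `layer_map` maps small-model transformer-layer indices to the large-model
    layers they inherit from, e.g. {0: 0, 1: 2, 2: 4}.  Each tensor is copied
    after truncating the (assumed wider) source tensor to the target shape.
    """
    small_state = small_model.state_dict()
    for name, target in small_state.items():
        match = re.search(r"layers\.(\d+)\.", name)
        if match:
            src_idx = layer_map.get(int(match.group(1)))
            if src_idx is None:
                continue  # this small-model layer keeps its own init
            src_name = name.replace(f"layers.{match.group(1)}.",
                                    f"layers.{src_idx}.", 1)
        else:
            src_name = name  # embeddings, final norm, output head, ...
        if src_name not in large_state:
            continue
        source = large_state[src_name]
        slices = tuple(slice(0, d) for d in target.shape)
        target.copy_(source[slices])  # crop larger tensors down to size
    small_model.load_state_dict(small_state)
```
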
Experimental Design and Results

The researchers conducted a series of controlled experiments that iteratively introduced improvements to a baseline 1B-parameter model, quantifying the gain from each modification:

  • Tokenizer compression raised the average benchmark score from 42.41 to 44.11.
  • Architectural tweaks further pushed this to 46.53.
  • Parameter inheritance led to a marked rise to 49.79.
  • The refined multi-round training method culminated in an average score of 51.28.
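
One way to read the multi-round training scheme is as loss-aware re-sampling: after each pass, examples the model still handles poorly are carried into the next round in full, while easy examples are sub-sampled. The sketch below assumes hypothetical `train_step` and `eval_loss` callables and is only meant to convey the idea, not the paper's exact schedule.

```python
import random

def multi_round_training(dataset, train_step, eval_loss, rounds=3, hard_ratio=0.5):
    """Multi-round training that revisits hard examples (illustrative only).

    After each pass, per-sample losses decide what the next round sees: the
    hardest `hard_ratio` fraction is kept in full, and only half of the easy
    examples are re-sampled, so training focuses on data the model forgets.
    """
    pool = list(dataset)
    for _ in range(rounds):
        random.shuffle(pool)
        for sample in pool:
            train_step(sample)                       # one optimizer update
        scored = sorted(pool, key=eval_loss, reverse=True)
        n_hard = int(len(scored) * hard_ratio)
        hard, easy = scored[:n_hard], scored[n_hard:]
        pool = hard + random.sample(easy, len(easy) // 2)
```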

This component-wise evaluation corroborated the effectiveness of the proposed methodologies. The final model, dubbed PanGu-π-1B Pro, exceeded its 1.5B-parameter predecessor on benchmark evaluations while employing 16.67% fewer parameters.

Conclusion

By providing detailed, methodical insight into the development and optimization of tiny LMs, the paper makes a compelling case for a tailored approach that differs significantly from the strategies used for large LMs. The researchers argue that addressing tokenizer efficiency, parameter initialization, and optimization strategy is central to building high-performing tiny LMs that can feasibly be deployed on mobile devices.

The documented empirical evidence for each component of model design makes this paper a valuable resource for further work in the field. It charts a clear path toward tiny LMs that balance the trade-off between capability and compactness rather than sacrificing one for the other.
