
52B to 1T: Lessons Learned via Tele-FLM Series (2407.02783v1)

Published 3 Jul 2024 in cs.CL and cs.AI

Abstract: LLMs represent a significant stride toward Artificial General Intelligence. As scaling laws underscore the potential of increasing model sizes, the academic community has intensified its investigations into LLMs with capacities exceeding 50 billion parameters. This technical report builds on our prior work with Tele-FLM (also known as FLM-2), a publicly available 52-billion-parameter model. We delve into two primary areas: we first discuss our observation of Supervised Fine-tuning (SFT) on Tele-FLM-52B, which supports the "less is more" approach for SFT data construction; second, we demonstrate our experiments and analyses on the best practices for progressively growing a model from 52 billion to 102 billion, and subsequently to 1 trillion parameters. We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research.

Insights from "52B to 1T: Lessons Learned via Tele-FLM Series"

The paper "52B to 1T: Lessons Learned via Tele-FLM Series" presents a robust analysis of the development of extremely large LLMs, focusing on the Tele-FLM series. The series originated with a 52-billion-parameter model, Tele-FLM-52B, which was progressively expanded to a 1-trillion-parameter model, Tele-FLM-1T. The authors explore two primary methodologies along the way: Supervised Fine-tuning (SFT) and progressive growth strategies, both integral to pushing parameter counts higher efficiently while maintaining model performance.

Supervised Fine-tuning (SFT) Strategies

The paper critically examines the impact of SFT data construction on LLM performance, emphasizing a "less is more" approach. In particular, the research suggests that a smaller, high-quality subset of instruction-following data can yield better generalization across standard language tasks. The authors curate a limited dataset of roughly 30k samples, focused mainly on mathematics and coding dialogues, and obtain strong performance relative to some of the most advanced models on language understanding and generation tasks: Tele-FLM-Chat reaches up to 91% of GPT-4's performance overall and up to 107% on Chinese-language tasks against GPT-4-0613. This supports the hypothesis that a well-selected, compact data corpus can elicit the inherent capabilities of a large pre-trained model.
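
As a concrete illustration of this curation philosophy, the sketch below filters a large instruction pool down to a compact, domain-focused subset. The file format (JSONL), the metadata fields (`quality_score`, `domain`), and the thresholds are assumptions made for illustration; the paper does not publish its filtering code.

```python
# Hypothetical "less is more" SFT curation step: keep only high-quality
# math/coding dialogues and cap the set at a small fixed budget.

import json
import random


def curate_sft_subset(path, target_size=30_000, min_quality=0.8, seed=0):
    """Select a small, high-quality, domain-focused instruction set from a JSONL pool."""
    with open(path, encoding="utf-8") as f:
        samples = [json.loads(line) for line in f]

    # Keep only high-quality math/coding dialogues (assumed metadata fields).
    kept = [
        s for s in samples
        if s.get("quality_score", 0.0) >= min_quality
        and s.get("domain") in {"math", "code"}
    ]

    # Downsample to the target budget rather than using everything available.
    random.seed(seed)
    random.shuffle(kept)
    return kept[:target_size]
```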

While performance on routine tasks is robust, the paper also underlines the need for a larger, more sophisticated data corpus for complex reasoning tasks. It finds performance gaps in mathematical reasoning, implying the need for enhanced SFT strategies with richer logical narratives and more comprehensive instructional data.

Progressive Growth Techniques

Using a structured, stage-wise growth approach, the development from Tele-FLM-52B to Tele-FLM-1T highlighted the viability of function-preserving growth strategies for scaling models beyond 100B parameters. Critical dimensions, including width and depth, were methodically expanded to increase model capacity without compromising training viability. Employing techniques such as Masked Structural Growth (MSG), the authors enlarged the model while preserving the knowledge learned in earlier stages, ensuring continuity in model effectiveness.
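
To make the function-preserving idea concrete, the PyTorch sketch below widens a linear layer in the spirit of MSG: the old weights are copied, the new units are randomly initialized, and a mask keeps the new units silent until it is annealed open during training. The function name, initialization scale, and masking convention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def grow_linear_width(old, new_out):
    """Widen a linear layer's output dimension; new units start fully masked."""
    assert new_out >= old.out_features
    new = nn.Linear(old.in_features, new_out, bias=old.bias is not None)
    with torch.no_grad():
        new.weight[: old.out_features].copy_(old.weight)   # reuse learned rows
        new.weight[old.out_features :].normal_(std=0.02)   # randomly init new rows
        if old.bias is not None:
            new.bias[: old.out_features].copy_(old.bias)
            new.bias[old.out_features :].zero_()
    # Mask is 1 for pre-existing units and 0 for the new ones; in MSG-style
    # training the zero entries are annealed toward 1, so immediately after
    # growth the widened layer looks identical to the old one downstream.
    mask = torch.cat([
        torch.ones(old.out_features),
        torch.zeros(new_out - old.out_features),
    ])
    return new, mask


# Usage: y = grown(x) * mask reproduces the old layer's output at growth time.
grown, mask = grow_linear_width(nn.Linear(1024, 1024), 2048)
```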

An intriguing aspect of this exploration is the depth- and width-growth strategies, which rely on a mask-based approach to gradually integrate newly added structure. The researchers emphasize maintaining appropriate ratios among model dimensions to preserve training stability. Furthermore, experimental insights inform the adaptation of hyperparameter configurations at each growth stage, influencing convergence and allowing efficient knowledge transfer.
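
A minimal sketch of the mask-based depth-growth idea follows, assuming a scalar gate per inserted block that is annealed from 0 to 1; the wrapper name, the linear schedule, and the example feed-forward block are hypothetical rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn


class MaskedResidualBlock(nn.Module):
    """Wraps a newly inserted block so it contributes nothing at insertion time."""

    def __init__(self, block):
        super().__init__()
        self.block = block
        # Scalar, non-trainable gate stored as a buffer so it moves with the model.
        self.register_buffer("mask", torch.zeros(()))

    def forward(self, x):
        # At mask == 0 the wrapper is the identity, so the deeper model computes
        # exactly the same function as before the growth step.
        return x + self.mask * self.block(x)

    def anneal(self, step, warmup_steps):
        # Linearly open the gate over `warmup_steps` optimizer steps.
        self.mask.fill_(min(1.0, step / warmup_steps))


# Example: a new feed-forward sublayer that is silent when first inserted.
ffn = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
new_layer = MaskedResidualBlock(ffn)
```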

Theoretical and Practical Implications

This research has significant practical implications for the advancement and cost-efficiency of training extensive LLMs. The ability to grow existing models to extreme scales in a function-preserving manner could substantially reduce computational requirements, democratizing access to high-end AI capabilities. The paper also contributes to the theoretical understanding of LLM scalability, illustrating pathways to optimizing model architecture and data utilization.

In future AI research, these insights could shift focus toward optimizing architectural growth strategies and refining SFT methods, both crucial for handling increasingly complex language tasks. Bridging performance gaps in high-reasoning tasks remains a priority, necessitating more refined approaches to data utilization and model tuning. The paper's contributions underscore methodologies that could shape the landscape of future LLM development, especially as models scale to a trillion parameters and beyond.

Authors (20)
  1. Xiang Li (1002 papers)
  2. Yiqun Yao (14 papers)
  3. Xin Jiang (242 papers)
  4. Xuezhi Fang (11 papers)
  5. Chao Wang (555 papers)
  6. Xinzhang Liu (3 papers)
  7. Zihan Wang (181 papers)
  8. Yu Zhao (207 papers)
  9. Xin Wang (1306 papers)
  10. Yuyao Huang (9 papers)
  11. Shuangyong Song (18 papers)
  12. Yongxiang Li (22 papers)
  13. Zheng Zhang (486 papers)
  14. Bo Zhao (242 papers)
  15. Aixin Sun (99 papers)
  16. Yequan Wang (44 papers)
  17. Zhongjiang He (11 papers)
  18. Zhongyuan Wang (105 papers)
  19. Xuelong Li (268 papers)
  20. Tiejun Huang (130 papers)
Citations (1)