- The paper demonstrates that a curated data pipeline and an incremental training curriculum enable effective pre-training on only 1.08 trillion tokens.
- The work employs a decoder-only transformer architecture with stabilization and architectural techniques such as Pre-RMSNorm, SwiGLU, and WeSaR re-parameterization.
- The paper shows that this data-efficient approach yields competitive results on tasks such as mathematical reasoning and code generation.
Overview of YuLan-Mini: An Open Data-efficient LLM
The paper "YuLan-Mini: An Open Data-efficient LLM" presents the development and evaluation of a 2.42 billion parameter LLM, named YuLan-Mini. The focus of the work lies in achieving competitive performance with LLMs through a data-efficient pre-training approach. This is significant given the demanding computational and data resources typically required for training LLMs. The authors detail a methodology that prioritizes pre-training efficacy via a meticulous data pipeline, stabilization of training, and an effective annealing approach for training across 1.08 trillion tokens.
Pre-training Strategy
YuLan-Mini combines several pre-training choices that optimize both the learning process and resource utilization. The architecture is a decoder-only transformer with 2.23 billion non-embedding parameters. Training stability and efficiency are supported by embedding tying, Pre-RMSNorm, and SwiGLU. The dataset pipeline is carefully curated and structured to cover English and Chinese text, coding and mathematical reasoning data, and synthetically generated reasoning sequences.
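To make the architecture concrete, the sketch below shows what a pre-norm decoder block with RMSNorm and a SwiGLU feed-forward typically looks like in PyTorch. It is a minimal illustration, not the authors' implementation: the hidden size, head count, and FFN width are placeholder values, and rotary position embeddings and other details are omitted.

```python
# Minimal PyTorch sketch of a pre-norm decoder block with RMSNorm and SwiGLU.
# Dimensions and the attention implementation are illustrative assumptions,
# not the exact YuLan-Mini configuration; positional encoding is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root mean square of the features, then rescale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: gated feed-forward with a SiLU (swish) gate.
        return self.down(F.silu(self.gate(x)) * self.up(x))


class DecoderBlock(nn.Module):
    def __init__(self, dim: int = 2048, n_heads: int = 16, ffn_hidden: int = 5632):
        super().__init__()
        self.attn_norm = RMSNorm(dim)   # Pre-RMSNorm: normalize before attention
        self.ffn_norm = RMSNorm(dim)    # Pre-RMSNorm: normalize before the FFN
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.ffn = SwiGLU(dim, ffn_hidden)
        self.n_heads = n_heads

    def forward(self, x):
        b, t, d = x.shape
        h = self.attn_norm(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim) for causal self-attention.
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        x = x + self.out(attn)              # residual around attention
        x = x + self.ffn(self.ffn_norm(x))  # residual around SwiGLU FFN
        return x
```

Embedding tying, also mentioned above, simply reuses the input embedding matrix as the output projection, which removes a large block of parameters from a small model's budget.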
The technical design of YuLan-Mini concentrates on three primary areas:
- Data Pipeline: The design integrates data cleaning and scheduling strategies. Dividing training into incremental curriculum phases allows data proportions to be adjusted in a controlled way, making the training trajectory more flexible and adaptive.
- Optimization and Stability: Systematic optimizations mitigate typical training instabilities such as loss spikes and gradient explosions. The combination of μP-like initialization and weight re-parametrization (WeSaR) plays a crucial role here; a sketch of the re-parametrization idea follows this list.
- Annealing Approach: The annealing phase, which incorporates longer contexts and carefully selected data, incrementally refines the model's capability and robustness, and the paper emphasizes its importance in the overall training process.
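The summary above names WeSaR only in passing. As a rough illustration of the underlying idea, the sketch below re-parameterizes each weight matrix as a learnable scalar gate times a matrix initialized with one common, small standard deviation, so the optimizer sees uniformly scaled parameters while the gate carries the target (μP-like) scale. The `ReparamLinear` wrapper, the gate placement, and the initialization values are assumptions made for illustration, not the paper's exact formulation.

```python
# Illustrative sketch of a WeSaR-style re-parametrization: the stored weight
# W_tilde uses a shared, small init std for every matrix, and a per-matrix
# learnable scalar gate alpha provides the effective scale (alpha * W_tilde).
# Gate placement and init values are assumptions, not the paper's formulation.
import math
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class ReparamLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int,
                 common_std: float = 0.02,
                 target_std: Optional[float] = None):
        super().__init__()
        # Underlying weight initialized with one std shared by every matrix.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * common_std)
        # Per-matrix gate initialized so the effective std matches a
        # muP-like target (here 1/sqrt(fan_in)); the gate is trained jointly.
        if target_std is None:
            target_std = 1.0 / math.sqrt(in_features)
        self.alpha = nn.Parameter(torch.tensor(target_std / common_std))

    def forward(self, x):
        # Effective weight is the gate times the commonly initialized matrix.
        return F.linear(x, self.alpha * self.weight)
```

Swapping such a wrapper in place of plain linear layers lets the optimizer update parameters that all share the same scale, while the gates absorb per-matrix magnitudes, which is the kind of stabilization effect the paper attributes to combining μP-like initialization with re-parametrization.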
Empirical comparisons show that YuLan-Mini is competitive with established models of similar scale across diverse benchmarks, particularly those involving mathematical reasoning and code generation. On the MATH-500 benchmark, for instance, YuLan-Mini outperforms several counterparts, evidence that the data-efficient training strategy carries over to downstream reasoning performance.
Implications and Prospective Directions
YuLan-Mini represents a step towards producing high-performing LLMs with substantially less training data than industry models typically require. The release of full pre-training details, together with the emphasis on data openness and efficiency, makes replication feasible in academic settings where resources are comparatively constrained.
The theoretical and practical implications of this work suggest potential avenues for further exploration. Future iterations could include extending context windows beyond current limits and adapting YuLan-Mini's methodologies to other LLM architectures or specialized domain tasks. The paper’s contribution also offers a foundation for investigating the developmental trajectory of LLM capabilities through intermediate checkpoint analyses, further enriching our understanding of large-scale model training dynamics.
In conclusion, the paper underscores the viability of data-efficient LLM training: YuLan-Mini balances breadth of capability against constrained compute and data budgets, and provides a reference point for future research on resource-conscious pre-training in natural language processing.