
YuLan-Mini: An Open Data-efficient Language Model (2412.17743v2)

Published 23 Dec 2024 in cs.CL

Abstract: Effective pre-training of LLMs has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline combines data cleaning with data schedule strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.

Summary

  • The paper demonstrates that a curated data pipeline and incremental curriculum enable efficient pre-training over 1.08 trillion tokens.
  • The work employs a decoder-only transformer architecture enhanced with techniques like Pre-RMSNorm, SwiGLU, and WeSaR to stabilize training.
  • The paper shows that using a data-efficient approach yields competitive results in tasks such as mathematical reasoning and code generation.

Overview of YuLan-Mini: An Open Data-efficient Language Model

The paper "YuLan-Mini: An Open Data-efficient LLM" presents the development and evaluation of a 2.42 billion parameter LLM, named YuLan-Mini. The focus of the work lies in achieving competitive performance with LLMs through a data-efficient pre-training approach. This is significant given the demanding computational and data resources typically required for training LLMs. The authors detail a methodology that prioritizes pre-training efficacy via a meticulous data pipeline, stabilization of training, and an effective annealing approach for training across 1.08 trillion tokens.

Pre-training Strategy

YuLan-Mini incorporates pre-training choices that optimize both the learning process and resource utilization. The model is a decoder-only transformer with 2.23 billion non-embedding parameters. To support training stability and efficiency, it employs embedding tying, Pre-RMSNorm, and SwiGLU feed-forward layers. The dataset pipeline is carefully curated to include English and Chinese corpora, coding and mathematical reasoning data, and synthetically generated reasoning sequences.
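To make these architectural ingredients concrete, below is a minimal PyTorch sketch of a pre-norm decoder block combining Pre-RMSNorm with a SwiGLU feed-forward layer. It is illustrative only: the attention module (`nn.MultiheadAttention` here), hidden sizes, and initialization are stand-ins and not YuLan-Mini's exact implementation.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean subtraction)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms


class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down(SiLU(gate(x)) * up(x))."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))


class DecoderBlock(nn.Module):
    """Pre-norm decoder block: RMSNorm before attention and before the MLP."""

    def __init__(self, dim: int, n_heads: int, ffn_hidden: int):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp_norm = RMSNorm(dim)
        self.mlp = SwiGLU(dim, ffn_hidden)

    def forward(self, x, attn_mask=None):
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out                      # residual around attention
        x = x + self.mlp(self.mlp_norm(x))    # residual around SwiGLU MLP
        return x
```

Placing the normalization before each sub-layer (the "Pre" in Pre-RMSNorm) is a common choice for keeping gradients well behaved in deep stacks, which is consistent with the paper's emphasis on training stability.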

Technical Contributions and Model Performance

Significant effort in the technical design of YuLan-Mini is apparent in three primary areas:

  1. Data Pipeline: The design integrates data cleaning with data scheduling strategies. Dividing training into incremental curriculum phases allows the proportions of each data source to be adjusted in a controlled way, so the data mixture can adapt over the course of training.
  2. Optimization and Stability: The model employs systematic optimizations to mitigate common training instabilities such as loss spikes and gradient explosions. The combination of μP-like initialization and weight re-parametrization (WeSaR) plays a crucial role here (see the sketch after this list).
  3. Annealing Approach: By incorporating long-context training and targeted data selection, the paper emphasizes the importance of the annealing phase, which incrementally refines the model's capability and robustness.
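The sketch below illustrates one plausible reading of the stabilization recipe in item 2: a linear layer whose weight is re-parameterized as a trainable scalar times a base matrix, with the scalar initialized in a μP-like, width-dependent way. The class name, initialization constants, and exact scaling rule are assumptions for illustration; the paper's precise WeSaR and μP settings are documented in its technical report.

```python
import math
import torch
import torch.nn as nn


class ReparamLinear(nn.Module):
    """Hypothetical WeSaR-style re-parameterized linear layer (illustrative).

    The effective weight is W = scale * base, where `scale` is a single
    trainable parameter per matrix and `base` is initialized with a common
    small standard deviation. Separating magnitude from direction keeps
    update scales more uniform across layers, which is intended to reduce
    loss spikes. Constants here are illustrative, not the paper's values.
    """

    def __init__(self, in_features: int, out_features: int,
                 init_std: float = 0.02):
        super().__init__()
        self.base = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.base, mean=0.0, std=init_std)
        # muP-like choice (assumed): shrink the effective weight of wider layers.
        self.scale = nn.Parameter(torch.tensor(1.0 / math.sqrt(in_features)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.scale * self.base)


# Usage: drop-in replacement for nn.Linear inside a transformer block.
proj = ReparamLinear(2048, 2048)
y = proj(torch.randn(4, 16, 2048))  # (batch, seq, hidden)
```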

Empirical results and comparisons illustrate YuLan-Mini's competitive edge against established models of similar scale across diverse benchmarks, particularly those involving mathematical reasoning and code generation. For instance, on the MATH-500 benchmark, YuLan-Mini outperforms several similarly sized counterparts despite its comparatively small training budget, consistent with the effectiveness of its training strategy.

Implications and Prospective Directions

YuLan-Mini represents a step toward producing high-performing LLMs with substantially less training data than industry models typically require. The release of the full pre-training details, together with the emphasis on data openness and efficiency, makes the work promising to replicate in academic settings where resources are comparatively constrained.

The theoretical and practical implications of this work suggest potential avenues for further exploration. Future iterations could include extending context windows beyond current limits and adapting YuLan-Mini's methodologies to other LLM architectures or specialized domain tasks. The paper’s contribution also offers a foundation for investigating the developmental trajectory of LLM capabilities through intermediate checkpoint analyses, further enriching our understanding of large-scale model training dynamics.

In conclusion, this paper underscores the viability of a data-efficient approach to training LLMs like YuLan-Mini, balancing breadth of capability against resource constraints and providing a reference point for future research in natural language processing.
