- The paper presents SmolLM2, a 1.7B parameter language model achieving state-of-the-art performance through extensive data-centric, multi-stage training on 11 trillion tokens.
- It introduces new datasets, FineMath, Stack-Edu, and SmolTalk, specifically created to enhance mathematical reasoning, coding, and instruction-following abilities.
- SmolLM2's base and instruction-tuned models demonstrate state-of-the-art performance relative to similarly sized competitors on key benchmarks.
The paper "SmoLLM2: When Smol Goes Big -- Data-Centric Training of a Small LLM" details the development of SmoLLM2, a 1.7 billion parameter LLM. The primary focus is on achieving high performance through extensive training on approximately 11 trillion tokens, employing a multi-stage training process that blends web text with specialized math, code, and instruction-following data. Additionally, the authors introduce new datasets, namely FineMath, Stack-Edu, and SmolTalk, to address perceived limitations in the size or quality of existing datasets.
The authors' contributions include:
- A detailed evaluation of available web, code, math, and instruction-following datasets.
- A multi-stage training strategy for SmolLM2 involving manual rebalancing of data sources to optimize performance. The compute cost for SmolLM2 was around $1\times10^{23}$ FLOPs, roughly \$250,000 of GPU compute (see the back-of-the-envelope estimate after this list).
- The creation of FineMath, Stack-Edu, and SmolTalk datasets to enhance mathematical, coding, and instruction-following capabilities, respectively.
- Demonstration that both the base and instruction-tuned variants of SmolLM2 achieve state-of-the-art performance among models of similar size.
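As a sanity check on the quoted compute budget, the standard $C \approx 6ND$ rule of thumb (a common approximation, not a figure taken from the paper) lands at the same order of magnitude:

$$ C \approx 6ND = 6 \times \left(1.7\times10^{9}\right) \times \left(11\times10^{12}\right) \approx 1.1\times10^{23}\ \text{FLOPs}. $$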
The training process incorporates "pretraining" on a large corpus of unstructured text to enable the model to learn language structure and factual knowledge. High-quality data curation pipelines are used to filter and reformat web texts, balancing data quantity with quality. The inclusion of specialized data, such as code and mathematics, aims to improve reasoning and world knowledge capabilities. Multi-stage pretraining incorporates specialized datasets later in the training process. Instruction tuning and preference learning are then employed to align the model for helpful and safe responses.
Ablation studies were conducted to evaluate and compare English web datasets, employing 1.7B parameter transformers with a sequence length of 2048 and a global batch size of approximately 2 million tokens. Evaluation was performed using the \href{https://github.com/huggingface/lighteval/}{lighteval} library on benchmarks such as MMLU, HellaSwag, OpenBookQA, PIQA, WinoGrande, ARC, and CommonSenseQA. Math and code datasets were evaluated using annealing, starting from a mid-training checkpoint of SmolLM2 at 3T tokens. Math ablation models were evaluated on GSM8K, MATH, and MMLU-STEM, while code ablation models were evaluated on HumanEval and MultiPL-E using the \href{https://github.com/bigcode-project/bigcode-evaluation-harness}{BigCode-Evaluation-Harness}.
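For orientation, the two ablation protocols can be summarized as plain configuration; this is an illustrative sketch, not the authors' code, and only the numbers and benchmark lists come from the paper.

```python
# Illustrative summary of the two ablation setups (field names are hypothetical).
web_ablation = {
    "model_size": "1.7B",
    "sequence_length": 2048,
    "global_batch_size_tokens": 2_000_000,  # ~2M tokens per step
    "eval_library": "lighteval",
    "benchmarks": ["MMLU", "HellaSwag", "OpenBookQA", "PIQA",
                   "WinoGrande", "ARC", "CommonSenseQA"],
}

annealing_ablation = {
    "start_from": "SmolLM2 mid-training checkpoint at 3T tokens",
    "math_benchmarks": ["GSM8K", "MATH", "MMLU-STEM"],
    "code_benchmarks": ["HumanEval", "MultiPL-E"],  # via bigcode-evaluation-harness
}
```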
Key findings from dataset ablations include:
- FineWeb-Edu excels on educational benchmarks such as MMLU, ARC, and OpenBookQA.
- DCLM demonstrates superior performance on HellaSwag and CommonsenseQA.
- A 60\% FineWeb-Edu and 40\% DCLM mix achieves a balance of performance across benchmarks.
- InfiMM-WebMath achieves a peak accuracy of 14\% on GSM8K, compared to 10\% for OpenWebMath (OWM).
- FineMath subsets consistently outperform OWM and InfiMM-WebMath on GSM8K, MATH, and MMLU-STEM.
The authors constructed Stack-Edu, a filtered variant of StarCoder2Data that focuses on educational code, by training language-specific classifiers using the StarEncoder model on synthetic annotations generated by Llama3-70B-Instruct. Filtering with a threshold of 3 generally improved performance across most languages.
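A minimal sketch of this filtering step is shown below, assuming a hypothetical `score_educational` helper that stands in for the trained StarEncoder-based classifiers; the threshold of 3 follows the text, everything else is illustrative.

```python
# Illustrative Stack-Edu-style filtering (not the authors' pipeline).
# `score_educational` stands in for a language-specific classifier trained on
# StarEncoder embeddings with Llama3-70B-Instruct synthetic annotations.
from typing import Callable, Dict, Iterable, Iterator

def filter_stack_edu(
    files: Iterable[Dict],                            # e.g. {"language": "python", "content": "..."}
    score_educational: Callable[[str, str], float],   # (language, content) -> educational score
    threshold: float = 3.0,                           # threshold of 3 reported to work well
) -> Iterator[Dict]:
    """Keep only code files whose educational score meets the threshold."""
    for f in files:
        if score_educational(f["language"], f["content"]) >= threshold:
            yield f
```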
The pretraining of SmolLM2 involved a multi-stage training approach guided by four key principles: performance-driven interventions, upsampling high-quality math and code during annealing, strategic introduction of medium-sized datasets mid-training, and avoiding excessive data repetition. The model was trained on 256 H100 GPUs using the \href{https://github.com/huggingface/nanotron/}{nanotron} framework and the AdamW optimizer with $(\beta_1, \beta_2) = (0.9, 0.95)$. A Warmup Stable Decay (WSD) learning rate schedule was employed, with a 2,000-step warmup phase and a peak learning rate of $5.0\times10^{-4}$.
$\beta_1$ is the exponential decay rate for the first moment estimates; $\beta_2$ is the exponential decay rate for the second moment estimates.
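A minimal sketch of the WSD schedule described above, using the reported warmup length and peak learning rate with a linear decay to 0; the stable-phase and decay-phase step counts are placeholders, since the paper describes those phases in tokens rather than optimizer steps.

```python
def wsd_lr(step: int,
           peak_lr: float = 5e-4,      # peak learning rate from the paper
           warmup_steps: int = 2000,   # warmup length from the paper
           decay_start: int = 90_000,  # placeholder: step where the final decay begins
           total_steps: int = 100_000  # placeholder: total optimizer steps
           ) -> float:
    """Warmup Stable Decay (WSD): linear warmup, constant plateau, linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                       # linear warmup
    if step < decay_start:
        return peak_lr                                             # stable plateau
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - decay_start)       # linear decay to 0
```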
The first phase of pretraining (0 to 6T tokens) used English web data mixed at the 60/40 FineWeb-Edu/DCLM ratio identified in the ablations, alongside 10\% StarCoderData. OWM was added in the second stage (6T to 8T tokens) at a 5\% ratio. In the third stage (8T to 10T tokens), the text-only English portion of InfiMM-WebMath was added, and StarCoderData was replaced with Stack-Edu. The final stage involved decaying the learning rate linearly to 0 and introducing the highest-quality mathematical datasets: InfiWebMath-3+ and FineMath4+.
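The staged curriculum can be summarized as data, as in this illustrative sketch; the token budgets and the datasets introduced at each stage follow the text, while per-dataset proportions (other than the 5\% OWM share) are deliberately left out.

```python
# Illustrative summary of the pretraining stages (not the authors' config).
stages = [
    {"tokens": "0-6T",   "web": ["FineWeb-Edu", "DCLM"], "code": ["StarCoderData"], "math": []},
    {"tokens": "6T-8T",  "web": ["FineWeb-Edu", "DCLM"], "code": ["StarCoderData"],
     "math": ["OpenWebMath (5%)"]},
    {"tokens": "8T-10T", "web": ["FineWeb-Edu", "DCLM"], "code": ["Stack-Edu"],
     "math": ["OpenWebMath", "InfiMM-WebMath (text-only English)"]},
    {"tokens": "final decay stage (~10T-11T)", "web": ["FineWeb-Edu", "DCLM"], "code": ["Stack-Edu"],
     "math": ["InfiWebMath-3+", "FineMath4+"]},
]
```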
To support long-context applications, the context length was extended from 2k to 8k tokens by continuing training with a data mixture that included 40\% long-context documents sourced from DCLM, FineWeb-Edu, and the books subset of Dolma.
The base SmolLM2 model outperforms the Qwen2.5-1.5B base model on HellaSwag and ARC. While SmolLM2 lags behind Qwen2.5-1.5B on math and coding benchmarks, it outperforms Llama3.2-1B on GSM8K, MATH, and HumanEval.
For post-training, the authors created SmolTalk, a new instruction tuning dataset. SmolTalk builds on MagPie-Ultra, a multi-turn dataset generated using Llama-3.1-405B-Instruct-FP8, which was further filtered using smaller Llama models and the ArmoRM reward model. Additional task-specific datasets were developed to enhance instruction-following with detailed constraints (Smol-Constraint), summarization (Smol-Summarization), and rewriting (Smol-Rewrite) capabilities. Public math instruction datasets, including NuminaMath-CoT and MetaMathQA, were incorporated into SmolTalk.
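A minimal sketch of reward-model-based filtering in the spirit of the MagPie-Ultra curation step, assuming a hypothetical `armo_rm_score` helper and a placeholder threshold; the exact filtering criteria are not detailed in this summary.

```python
# Illustrative reward-model filtering sketch (not the authors' pipeline).
# `armo_rm_score` is a hypothetical wrapper around an ArmoRM-style reward model
# that scores a multi-turn conversation; the threshold is a placeholder.
from typing import Callable, Dict, Iterable, List

Conversation = List[Dict[str, str]]  # e.g. [{"role": "user", "content": "..."}, ...]

def filter_by_reward(
    conversations: Iterable[Conversation],
    armo_rm_score: Callable[[Conversation], float],
    min_score: float = 0.5,  # placeholder threshold
) -> List[Conversation]:
    """Keep only conversations whose reward-model score clears the threshold."""
    return [c for c in conversations if armo_rm_score(c) >= min_score]
```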
Direct Preference Optimization (DPO) was used for preference learning, with UltraFeedback proving the most consistently effective preference dataset across benchmarks.
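For reference, a minimal PyTorch sketch of the standard DPO objective (the general formulation from the DPO literature, not the authors' training code); per-sequence log-probabilities are assumed to be summed over response tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,    # log pi_theta(y_chosen | x), summed over tokens
             policy_rejected_logps: torch.Tensor,  # log pi_theta(y_rejected | x)
             ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1                     # placeholder hyperparameter
             ) -> torch.Tensor:
    """Standard DPO objective: push the policy's preference margin above the reference's."""
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
```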
SmolLM2-Instruct exhibits strong instruction-following capabilities, outperforming Qwen2.5-1.5B-Instruct on IFEval.