SmolLM2 Family: High-Performance Compact LMs
- The SmolLM2 family is a collection of compact language models (e.g., 1.7B parameters) leveraging multi-stage training and data-centric strategies to bridge the gap with larger models.
- The 1.7B architecture follows the Llama 2 design with 24 transformer layers, tied embeddings, and a context length extended to 8,000 tokens, supporting strong reasoning and instruction following.
- Innovative optimization dynamics and alignment methods, including Direct Preference Optimization, drive measurable improvements in reasoning benchmarks and code generation.
The SmolLM2 family represents a class of highly optimized, data-centric “small” LLMs designed for strong performance at reduced parameter scales. Centered around the SmolLM2-1.7B model, the family employs careful architectural choices, multi-stage large-scale training over 11 trillion tokens, and novel datasets to extend the performance envelope of sub-2B-parameter transformer models in domains such as reasoning, code, and instruction following (Allal et al., 4 Feb 2025). The family further includes fine-tuned variants such as SmolTulu-1.7b-Instruct, which adapts post-training alignment techniques from large models and introduces advances in optimization dynamics (Alrashed, 11 Dec 2024). This comprehensive approach seeks to bridge the capability gap between small and large LLMs through a combination of architecture, training strategy, data engineering, and alignment methods.
1. Model Architecture and Parameterization
SmolLM2-1.7B is built atop the Llama 2 architecture but is sized for computational efficiency, featuring 1.7 billion parameters organized into 24 transformer layers. Each layer has a model dimension of 2,048, feed-forward blocks with 8,192 units, and 32 attention heads. The model uses tied embeddings, SwiGLU activation functions, and Rotary Position Embedding (RoPE) with θ = 10,000, supporting context lengths up to 8,000 tokens after long-context extension. This configuration enables competitive performance while remaining feasible for deployment in resource-constrained settings (Allal et al., 4 Feb 2025).
Hyperparameter | Value | Note
---|---|---
Layers | 24 | Transformer blocks
Hidden dimension | 2,048 | Model dimension per layer
FFN dimension | 8,192 | Feed-forward network
Attention heads | 32 | Multi-head attention
Embeddings | Tied | Shared input/output projections
Positional encoding | RoPE, θ = 10k | Rotary embedding
Context length | 2k → 8k | Extended during mid-training
A plausible implication is that this compact transformer setup, when paired with tailored data and optimization strategies, allows SmolLM2 to compete with larger models on selected tasks without their resource demands.
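For concreteness, the hyperparameters above map onto a standard Llama-style configuration. The following is a minimal sketch using the Hugging Face transformers `LlamaConfig`, assuming a Llama-style implementation; the vocabulary size is a placeholder, since it is not specified here, and in practice the released checkpoints would be loaded directly.

```python
# Minimal sketch: instantiating a Llama-style model with the SmolLM2-1.7B
# hyperparameters listed above. vocab_size is a placeholder (not given in the
# text); load the released checkpoints directly for actual use.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    num_hidden_layers=24,          # transformer blocks
    hidden_size=2048,              # model dimension
    intermediate_size=8192,        # SwiGLU feed-forward width
    num_attention_heads=32,        # multi-head attention
    hidden_act="silu",             # SwiGLU activation
    tie_word_embeddings=True,      # shared input/output embeddings
    rope_theta=10_000.0,           # RoPE base
    max_position_embeddings=8192,  # after long-context extension
    vocab_size=49_152,             # placeholder vocabulary size
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```

With tied embeddings and these dimensions, the parameter count comes out to roughly 1.7B, consistent with the model's designation.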
2. Data-Centric Multi-Stage Training
SmolLM2's training involves a multi-stage process across ~11T tokens, sequentially integrating diverse data sources. Each stage adjusts the dataset mixture to emphasize incremental specialization (a schematic mixture schedule is sketched at the end of this section):
- Stage 1 (0–6T tokens): Pretraining on English web data, using a 60/40 split of FineWeb-Edu and DCLM, with 10% code data (StarCoderData).
- Stage 2 (6–8T tokens): Introducing 5% math data (e.g., OWM), with code upsampling.
- Stage 3 (8–10T tokens): Rebalancing English split, replacing code data with Stack-Edu, adding InfiMM-WebMath and Jupyter Notebooks.
- Stage 4 (10–11T tokens): Decay phase with increased math content (up to 14%) via FineMath 4+, InfiWebMath-3+, and expanded Stack-Edu. Synthetic educational text from Cosmopedia v2 is included.
After the main pretraining phase, the context length is extended to 8k tokens by resuming from a checkpoint with a modified RoPE configuration. Final alignment consists of supervised instruction tuning on the SmolTalk dataset followed by Direct Preference Optimization (DPO).
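A plausible reading of the context-extension step is sketched below: reload the pretraining checkpoint with a larger RoPE base and an 8k position limit, then continue training on longer sequences. The `rope_theta` value and checkpoint path are placeholders, since the source states only that RoPE is modified.

```python
# Hypothetical sketch of the long-context extension step: resume from the
# pretraining checkpoint with a raised RoPE base and an 8k position limit,
# then continue training on longer sequences. rope_theta and the path are
# placeholders; the source states only that RoPE is "modified".
from transformers import AutoConfig, AutoModelForCausalLM

ckpt = "path/to/pretraining-checkpoint"        # placeholder path
config = AutoConfig.from_pretrained(ckpt)
config.rope_theta = 100_000.0                  # placeholder: larger RoPE base
config.max_position_embeddings = 8_192         # extended context window

model = AutoModelForCausalLM.from_pretrained(ckpt, config=config)
# ...continue training on 8k-token sequences before SFT (SmolTalk) and DPO.
```

The table below recaps the stage-wise mixtures.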
Stage | Web split (FineWeb-Edu/DCLM) | Code | Math | Special integration
---|---|---|---|---
1 | 60/40 | StarCoderData | None | Baseline language/data mix
2 | 60/40 | Upsampled | 5% OWM | Addresses observed gaps
3 | 40/60 | Stack-Edu | InfiMM-WebMath | Improved code/math datasets
4 | Decay phase | Stack-Edu (expanded) | Up to 14% (FineMath 4+, InfiWebMath-3+) | Cosmopedia v2 synthetic text
Manual refinement after each two-trillion-token increment enables rapid response to observed deficiencies in task performance.
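To make the staging concrete, the sketch below encodes the schedule as data. The sampling weights are illustrative reconstructions of the proportions described above, not the exact values used in training.

```python
# Schematic encoding of the four-stage mixture schedule described above.
# Sampling weights are illustrative reconstructions of the stated proportions,
# not the exact values used in training.
STAGES = [
    # (stage token budget in trillions, {data source: sampling weight})
    (6.0, {"fineweb-edu": 0.54, "dclm": 0.36, "starcoderdata": 0.10}),
    (2.0, {"fineweb-edu": 0.51, "dclm": 0.34, "code-upsampled": 0.10, "owm-math": 0.05}),
    (2.0, {"fineweb-edu": 0.32, "dclm": 0.48, "stack-edu": 0.12, "infimm-webmath": 0.08}),
    (1.0, {"web-decay-mix": 0.56, "stack-edu": 0.20,
           "finemath-4plus-infiwebmath": 0.14, "cosmopedia-v2": 0.10}),
]

def mixture_at(tokens_seen_trillions: float) -> dict:
    """Return the sampling weights active after a given number of training tokens."""
    boundary = 0.0
    for budget, weights in STAGES:
        boundary += budget
        if tokens_seen_trillions < boundary:
            return weights
    return STAGES[-1][1]  # decay-phase mixture beyond 11T tokens
```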
3. Specialized Datasets and Performance-Driven Ablations
Three novel datasets are central to SmolLM2's success:
- FineMath: Curated for detailed step-by-step mathematical reasoning.
- Stack-Edu: Filtered for high-quality, educational code snippets using classifier tags derived from Llama-3.1-70B-Instruct (a schematic filtering sketch follows this list).
- SmolTalk: A composite set containing conversational, math, code, and instruction-rich samples, incorporating MagPie-Ultra and custom instruction modules (Smol-Constraints, Smol-Rewrite, Smol-Summarization).
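The Stack-Edu construction can be pictured as classifier-based filtering. The sketch below is a hypothetical outline in which `edu_score` stands in for an educational-quality classifier distilled from Llama-3.1-70B-Instruct annotations; the threshold value is illustrative.

```python
# Hypothetical sketch of Stack-Edu-style filtering: keep only code samples whose
# educational-quality score clears a threshold. `edu_score` is a stand-in for a
# classifier distilled from Llama-3.1-70B-Instruct annotations; the threshold
# is illustrative.
from typing import Callable, Iterable, Iterator

def filter_educational(code_samples: Iterable[str],
                       edu_score: Callable[[str], float],
                       threshold: float = 3.0) -> Iterator[str]:
    """Yield code samples rated sufficiently educational by the classifier."""
    for sample in code_samples:
        if edu_score(sample) >= threshold:
            yield sample
```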
Performance ablation studies, including small-scale runs, direct dataset comparisons, and mix adjustments, inform all key decisions. Metrics are tracked for reasoning (ARC, GSM8K), code (HumanEval, MultiPL-E), general understanding (MMLU, PIQA, OpenBookQA), and instruction following (IFEval, MT-Bench).
A plausible implication is that this granular performance monitoring and tailored dataset creation underpin the observed advances in mathematical and instructional capabilities.
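A schematic of such an ablation loop is given below; `train_small` and `evaluate` are hypothetical callables standing in for the actual training and evaluation harness, which is not described in detail here.

```python
# Schematic ablation loop: train a small-scale model per candidate data mixture
# and score it on the benchmark suite tracked above. The training and evaluation
# callables are hypothetical stand-ins for the actual harness.
from typing import Any, Callable, Dict

BENCHMARKS = ["ARC", "GSM8K", "HumanEval", "MultiPL-E",
              "MMLU", "PIQA", "OpenBookQA", "IFEval"]

def run_ablation(candidate_mixtures: Dict[str, dict],
                 train_small: Callable[[dict], Any],
                 evaluate: Callable[[Any, str], float]) -> Dict[str, Dict[str, float]]:
    """Compare candidate mixtures on the tracked benchmarks."""
    results: Dict[str, Dict[str, float]] = {}
    for name, mixture in candidate_mixtures.items():
        model = train_small(mixture)
        results[name] = {bench: evaluate(model, bench) for bench in BENCHMARKS}
    return results
```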
4. Optimization Dynamics and Alignment in Small Models
SmolTulu, an instruction-tuned derivative of SmolLM2-1.7B, demonstrates that the learning-rate-to-batch-size (LR/BS) ratio exerts a profound task-dependent effect on model performance (Alrashed, 11 Dec 2024). Empirical analysis using both the 135M and 1.7B parameter models shows:
- High LR/BS ratios yield monotonic improvements on reasoning benchmarks (ARC—57.1%, GSM8K—51.6%), linked to larger per-example updates and efficient navigation of flatter loss landscape regions.
- Pattern recognition tasks (HellaSwag, IFEval) attain peak performance at lower LR/BS ratios due to stabilized gradient dynamics.
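As a hedged illustration of how such a ratio can be applied in practice, the sketch below derives a learning rate from a target LR/BS ratio; the ratio and batch-size values are placeholders rather than the settings reported for SmolTulu.

```python
# Illustrative only: derive the learning rate that realizes a chosen
# learning-rate-to-batch-size (LR/BS) ratio. Values are placeholders, not the
# settings reported for SmolTulu.
def lr_for_ratio(batch_size: int, lr_bs_ratio: float) -> float:
    """Learning rate implied by a target LR/BS ratio at a given batch size."""
    return lr_bs_ratio * batch_size

# Higher ratio for reasoning-oriented runs, lower ratio for pattern-recognition runs.
reasoning_lr = lr_for_ratio(batch_size=32, lr_bs_ratio=2.5e-7)   # -> 8.0e-6
pattern_lr = lr_for_ratio(batch_size=128, lr_bs_ratio=2.5e-8)    # -> 3.2e-6
```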
The report characterizes this trade-off in terms of the learning-rate-to-batch-size ratio, with reasoning-heavy and pattern-recognition tasks favoring different optimal ratios (Alrashed, 11 Dec 2024).

Direct Preference Optimization (DPO) is employed for final alignment, using the objective

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here, $\pi_{\mathrm{ref}}$ is the supervised fine-tuning baseline and $\beta$ modulates the strength of the KL-divergence penalty against it.
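A minimal PyTorch sketch of this objective follows, assuming per-sequence log-probabilities for the chosen ($y_w$) and rejected ($y_l$) responses have already been computed under both the policy and the frozen reference model.

```python
# Minimal sketch of the DPO objective above. Inputs are per-sequence log-probs
# of chosen (y_w) and rejected (y_l) responses under the policy and the frozen
# SFT reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Negative log-sigmoid of the beta-scaled preference margin, averaged over the batch."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()
```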
5. Comparative Performance and Benchmark Results
SmolLM2 outperforms contemporary small LMs, including Qwen2.5-1.5B and Llama3.2-1B, across a range of benchmarks, particularly in instruction following and mathematical reasoning (Allal et al., 4 Feb 2025). Reported metrics include:
- IFEval: 67.7% (Δ11%)
- GSM8K: 51.6% (Δ3.4%)
- ARC: 57.1% (Δ5.4%)
- Intermediate math scores improving from ~4–10 to >30 as training advances.
Instruction-following ability is verified through MT-Bench and dialogue datasets. HumanEval and MultiPL-E demonstrate competitive code reasoning. These results corroborate the effectiveness of the data-centric and optimization-driven training approach.
6. Release Strategy and Directions for Future Research
All primary SmolLM2 models and curated datasets (FineMath, Stack-Edu, SmolTalk) are publicly released (Allal et al., 4 Feb 2025). This open strategy is intended to advance efficient LLM training and encourage broader exploration of small models for diverse applications.
Future research is likely to elaborate on data curation protocols, investigate new formalisms for optimization dynamics in small-scale models, refine multi-stage training methodologies, and pursue capabilities in long-context reasoning and advanced alignment. This suggests a sustained trajectory for high-performance small LLMs in both academic and applied domains.
7. Significance and Task-Specific Model Scaling
The SmolLM2 family illustrates that effective architecture design, data-centric curation, and nuanced optimization scheduling are critical to bridging the gap between small and large transformer-based models. By leveraging detailed ablation studies, manual intervention in data mixing, and targeted instructional alignment, the family demonstrates state-of-the-art performance in the sub-2B-parameter regime. A plausible implication is that small LLMs, when engineered with such rigor, can efficiently address a growing spectrum of tasks previously considered exclusive to much larger systems.