
SmolTulu-1.7b-Instruct: Optimized Small LLM

Updated 28 November 2025
  • SmolTulu-1.7b-Instruct is an instruction-tuned small language model built from SmolLM2-1.7B using specialized post-training alignment techniques.
  • It employs a two-stage pipeline with Supervised Finetuning and Direct Preference Optimization, optimizing the learning-rate-to-batch-size ratio for reasoning and pattern recognition.
  • Empirically, it achieves state-of-the-art results among sub-2B models on benchmarks such as GSM8K and IFEval while maintaining balanced performance across diverse tasks.

SmolTulu-1.7b-Instruct, also referenced as SmolTulu-DPO-1130, is an instruction-tuned LLM that adapts AllenAI’s Tulu 3 post-training pipeline to Hugging Face’s SmolLM2-1.7B base model. Its design leverages empirical findings on the critical role of the learning-rate-to-batch-size ratio in optimizing reasoning and pattern-recognition performance in small language models (SLMs). SmolTulu-1.7b-Instruct achieves state-of-the-art results among sub-2B-parameter models on instruction-following and mathematical-reasoning tasks through meticulous adaptation of post-training procedures and optimization dynamics, without modifying the underlying model architecture (Alrashed, 11 Dec 2024).

1. Model Foundation and Architectural Characteristics

SmolTulu-1.7b-Instruct is built upon the SmolLM2-1.7B decoder-only Transformer architecture described by Allal et al. (2024). The key configuration parameters are:

  • Number of Transformer layers: 32
  • Model (hidden) dimensionality: 2,048
  • Feed-forward inner dimensionality: 8,192
  • Number of self-attention heads: 16
  • Total parameter count: approximately 1.7 billion

No architectural modifications are introduced by SmolTulu; all improvements are attributable to post-training and alignment strategies. The model’s enhancements derive exclusively from procedure-level adaptations, encompassing instruction tuning and preference modeling, rather than network structural changes.
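
To illustrate, the base configuration can be inspected directly with the Hugging Face transformers library. This is a minimal sketch assuming the published checkpoint id HuggingFaceTB/SmolLM2-1.7B and standard Llama-style config attribute names; it is not code from the paper.

```python
# Minimal sketch: inspect the SmolLM2-1.7B configuration with transformers.
# Assumes the checkpoint id "HuggingFaceTB/SmolLM2-1.7B" and Llama-style
# attribute names; not code from the paper.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

print("transformer layers:", config.num_hidden_layers)
print("hidden size:       ", config.hidden_size)
print("ffn inner size:    ", config.intermediate_size)
print("attention heads:   ", config.num_attention_heads)
```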

2. Instruction-Tuning and Alignment Pipeline

SmolTulu-1.7b-Instruct employs a post-training pipeline closely following the Tulu 3 recipe but calibrated for 1.7B parameters, comprising two principal stages:

A. Supervised Finetuning (SFT):

  • Data: allenai/tulu-3-sft-mixture (includes sources such as alpaca, BBH, GSM8K, IFEval).
  • Max sequence length: 2,048 tokens.
  • Training: Single epoch over the dataset.
  • Learning rate schedule: Linear decay with 10% warm-up.
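
As a concrete illustration of this schedule, the sketch below uses the standard transformers linear-warmup helper; the model stand-in, optimizer, and step counts are placeholders, not the paper's training script.

```python
# Illustrative sketch of linear LR decay with 10% warm-up, as used in both
# the SFT and DPO stages; model, optimizer, and step counts are placeholders.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)                                # stand-in for the LM
optimizer = torch.optim.AdamW(model.parameters(), lr=9e-5)   # SFT-1130 peak LR

total_steps = 10_000                          # placeholder number of update steps
warmup_steps = int(0.10 * total_steps)        # 10% linear warm-up

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)
# In the training loop, call optimizer.step() then scheduler.step() each update.
```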

B. Direct Preference Optimization (DPO):

  • Data: allenai/llama-3.1-tulu-3-8b-preference-mixture (UltraFeedback + Tulu3 synthetic preferences).
  • Loss function: Length-normalized DPO loss with KL penalty $\beta = 5$:

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_c,\,y_r)\sim\mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_c \mid x)}{\pi_{\mathrm{ref}}(y_c \mid x)} - \beta \log \frac{\pi_\theta(y_r \mid x)}{\pi_{\mathrm{ref}}(y_r \mid x)} \right) \right]
$$

where $\pi_{\mathrm{ref}}$ denotes the frozen SFT model and $y_c$, $y_r$ are the chosen and rejected responses, respectively.

  • Max sequence length: 2,048 tokens.
  • Training: One epoch, linear learning rate decay with 10% warm-up.

These procedures allow transfer of alignment strategies designed for larger LLMs to computationally efficient SLM baselines, preserving key alignment properties post-tuning.
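
To make the objective above concrete, the following is a minimal PyTorch sketch of a length-normalized DPO loss with $\beta = 5$. The tensor layout, masking, and the specific normalization (averaging per-token log-probabilities over response length) are assumptions for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def length_normalized_dpo_loss(
    policy_chosen_logps,    # (batch, T_c) per-token log-probs under pi_theta
    policy_rejected_logps,  # (batch, T_r)
    ref_chosen_logps,       # (batch, T_c) per-token log-probs under pi_ref (SFT model)
    ref_rejected_logps,     # (batch, T_r)
    chosen_mask,            # (batch, T_c) 1.0 on response tokens, 0.0 on padding
    rejected_mask,          # (batch, T_r)
    beta: float = 5.0,
):
    """Length-normalized DPO: per-token log-probs are averaged over response
    length before forming the log-ratios, so longer completions are not favored."""
    def seq_logp(token_logps, mask):
        return (token_logps * mask).sum(-1) / mask.sum(-1).clamp(min=1)

    chosen_logratio = seq_logp(policy_chosen_logps, chosen_mask) \
        - seq_logp(ref_chosen_logps, chosen_mask)
    rejected_logratio = seq_logp(policy_rejected_logps, rejected_mask) \
        - seq_logp(ref_rejected_logps, rejected_mask)

    # -log sigma(beta * (chosen - rejected)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```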

3. Optimization Dynamics and Task-Dependent Hyperparameter Effects

A central finding is the strong impact of the learning-rate-to-batch-size ratio $R = \eta / B$ (where $\eta$ is the peak learning rate and $B$ is the batch size) on downstream model performance. Empirical exploration of several $(\eta, B)$ configurations showed that the optimal $R$ is task dependent:

SFT Hyperparameters:

| Variant | $\eta$ | $B$ | $R = \eta/B$ |
|---|---|---|---|
| SmolTulu SFT-1130 | $9.0 \times 10^{-5}$ | 8 | $1.125 \times 10^{-5}$ |
| SmolTulu SFT-1207 | $3.1 \times 10^{-6}$ | 32 | $9.69 \times 10^{-8}$ |
| Tulu 3 (8B) | $5.0 \times 10^{-6}$ | 128 | $3.91 \times 10^{-8}$ |
| Tulu 3 (70B) | $2.0 \times 10^{-6}$ | 128 | $1.56 \times 10^{-8}$ |

DPO Hyperparameters:

| Variant | $\eta$ | $B$ | $R = \eta/B$ |
|---|---|---|---|
| SmolTulu DPO-1130 | $8.0 \times 10^{-7}$ | 12 | $6.67 \times 10^{-8}$ |
| SmolTulu DPO-1207 | $5.0 \times 10^{-7}$ | 32 | $1.56 \times 10^{-8}$ |
| Tulu 3 DPO 8B | $5.0 \times 10^{-7}$ | 128 | $3.91 \times 10^{-9}$ |
| Tulu 3 DPO 70B | $2.0 \times 10^{-7}$ | 128 | $1.56 \times 10^{-9}$ |

In all cases, reasoning-centric benchmarks (e.g., ARC, GSM8K) benefit from higher $R$, while pattern-recognition benchmarks (e.g., HellaSwag, IFEval) favor lower $R$. At 1.7B parameters, model capacity creates a regime in which no single $R$ optimizes all metrics uniformly, necessitating task-specific tuning.
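
The ratio itself is straightforward to compute; the short helper below (illustrative, not from the paper) reproduces the $R$ values tabulated above.

```python
# Illustrative helper reproducing the R = eta / B values in the tables above.
def lr_to_batch_ratio(peak_lr: float, batch_size: int) -> float:
    """Return R = eta / B, the learning-rate-to-batch-size ratio."""
    return peak_lr / batch_size

configs = {
    "SmolTulu SFT-1130": (9.0e-5, 8),
    "SmolTulu DPO-1130": (8.0e-7, 12),
    "Tulu 3 DPO 8B":     (5.0e-7, 128),
}

for name, (lr, bs) in configs.items():
    print(f"{name:18s} R = {lr_to_batch_ratio(lr, bs):.3e}")
# SmolTulu SFT-1130  R = 1.125e-05
# SmolTulu DPO-1130  R = 6.667e-08
# Tulu 3 DPO 8B      R = 3.906e-09
```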

4. Empirical Performance Evaluation

Post SFT and DPO, SmolTulu-1.7b-Instruct (DPO-1130) demonstrates substantial improvements on key tasks:

| Task | SmolLM2-1.7B | SmolTulu DPO-1130 | Δ |
|---|---|---|---|
| GSM8K (5-shot) | 48.2% | 51.6% | +3.4% |
| ARC (avg) | 51.7% | 51.5%* | –0.2%* |
| ARC (alt config) | 51.7% | 57.1% | +5.4% |
| HellaSwag | 66.1% | 61.1%* | –5.0%* |
| IFEval (avg) | 56.7% | 67.7% | +11.0% |

*DPO-1130 yields strong reasoning gains on GSM8K and IFEval, while alternate DPO runs with lower $R$ substantially restore ARC accuracy. The improvements on GSM8K and IFEval are well above typical benchmark noise (±1–2%); no formal statistical confidence intervals were reported.

5. Ablation Analyses and Theoretical Context

Comprehensive ablation studies on a 135M-parameter SmolLM2 model reveal near-monotonic improvement in ARC and GSM8K accuracy as $R$ increases, whereas HellaSwag and IFEval show non-monotonic behavior, peaking at intermediate $R$ values ($\approx 2\times10^{-6}$–$5\times10^{-6}$). At 1.7B parameters, the model exhibits a capacity-limited divergence: high $R$ benefits reasoning, while low $R$ benefits pattern-recognition tasks, limiting the gains available from any single configuration.

Varying training epochs (from 1 to 2) shifts absolute scores by approximately 1–2% without altering the shape of the $R$–performance curves. These trends substantiate the theoretical perspective articulated by Keskar et al. (2017) and Masters & Luschi (2018), suggesting that high-$R$ (small-batch, higher-LR) training encourages flatter minima conducive to reasoning, whereas low-$R$ (large-batch) training supports pattern matching.
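
A sweep of this kind can be laid out by fixing a small batch size and choosing peak learning rates that hit target $R$ values; the grid below is an illustrative sketch of such an ablation design, not the paper's actual experiment code.

```python
# Illustrative R-sweep grid: fix a small batch size and vary the peak learning
# rate to cover the R range discussed above (including the ~2e-6 to 5e-6 band).
batch_size = 8
target_ratios = [5e-7, 1e-6, 2e-6, 5e-6, 1e-5]   # candidate R = eta / B values

grid = [
    {"R": r, "learning_rate": r * batch_size, "batch_size": batch_size}
    for r in target_ratios
]

for cfg in grid:
    print(f"R = {cfg['R']:.0e}  ->  lr = {cfg['learning_rate']:.1e}, bs = {cfg['batch_size']}")
```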

A plausible implication is that in small model regimes, optimization hyperparameters deserve particular scrutiny, specifically with respect to the end-task distribution of interest.

6. Training Recipe, Best Practice, and Generalization Guidelines

The canonical SmolTulu-1.7b-Instruct (DPO-1130) was produced with:

  • SFT: LR $= 9\times10^{-5}$, BS $= 8$ ($R = 1.125\times10^{-5}$), 1 epoch, max seq. length 2,048, 10% warm-up.
  • DPO: LR $= 8\times10^{-7}$, BS $= 12$ ($R = 6.67\times10^{-8}$), 1 epoch, max seq. length 2,048, 10% warm-up, KL penalty $\beta = 5$.
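
For reference, the same recipe can be written out as plain configuration dictionaries; the key names below are illustrative and not tied to any particular training framework.

```python
# Canonical SmolTulu-1.7b-Instruct (DPO-1130) recipe as plain dictionaries;
# key names are illustrative, not tied to a specific framework.
SFT_CONFIG = {
    "learning_rate": 9e-5,
    "batch_size": 8,              # R = 9e-5 / 8 = 1.125e-5
    "num_epochs": 1,
    "max_seq_length": 2048,
    "warmup_ratio": 0.10,         # linear decay after 10% warm-up
    "dataset": "allenai/tulu-3-sft-mixture",
}

DPO_CONFIG = {
    "learning_rate": 8e-7,
    "batch_size": 12,             # R = 8e-7 / 12 ~= 6.67e-8
    "num_epochs": 1,
    "max_seq_length": 2048,
    "warmup_ratio": 0.10,
    "beta": 5.0,                  # KL penalty in the length-normalized DPO loss
    "dataset": "allenai/llama-3.1-tulu-3-8b-preference-mixture",
}
```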

Best practice guidelines for sub-2B parameter LMs are documented as:

  1. For SFT, use $R \approx 10^{-5}$ if $B \leq 16$.
  2. For DPO, employ $R$ values 5–10× larger than those for 8B+ models ($R \approx 10^{-7}$–$10^{-8}$).
  3. Favor higher $R$ for math and logic reasoning tasks.
  4. Favor lower $R$ for pattern-recognition/commonsense tasks.
  5. Only adjust $\eta$ and $B$; keep all other recipe elements (sequence length, warm-up, epochs) consistent with Tulu 3.
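
As a rough encoding of guidelines 1–4, the heuristic below is an illustrative sketch (not a procedure from the paper) for picking a starting $R$ by stage and dominant task type.

```python
# Rough heuristic encoding guidelines 1-4 above; an illustrative sketch only.
def suggest_ratio(stage: str, task_type: str) -> float:
    """Suggest a starting R = eta / B for a sub-2B parameter model."""
    if stage == "sft":
        base = 1e-5        # guideline 1: R ~ 1e-5 when B <= 16
    elif stage == "dpo":
        base = 5e-8        # guideline 2: within ~1e-7 to 1e-8, larger than 8B+ recipes
    else:
        raise ValueError(f"unknown stage: {stage}")

    if task_type == "reasoning":   # guideline 3: math/logic -> higher R
        return base * 2
    if task_type == "pattern":     # guideline 4: pattern/commonsense -> lower R
        return base / 2
    return base

print(suggest_ratio("sft", "reasoning"))   # 2e-05
print(suggest_ratio("dpo", "pattern"))     # 2.5e-08
```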

Careful calibration of $R$ enables SLMs to bridge much of the capability gap with multi-billion-parameter models on both instruction-following and reasoning benchmarks (Alrashed, 11 Dec 2024).
