
SmolTulu-1.7b-Instruct: Optimized Small LLM

Updated 28 November 2025
  • SmolTulu-1.7b-Instruct is an instruction-tuned small language model built from SmolLM2-1.7B using specialized post-training alignment techniques.
  • It employs a two-stage pipeline with Supervised Finetuning and Direct Preference Optimization, optimizing the learning-rate-to-batch-size ratio for reasoning and pattern recognition.
  • Empirically, it achieves state-of-the-art results among sub-2B models on benchmarks such as GSM8K and IFEval while maintaining balanced performance across diverse tasks.

SmolTulu-1.7b-Instruct, also referenced as SmolTulu-DPO-1130, is an instruction-tuned LLM that adapts AllenAI’s Tulu 3 post-training pipeline to Hugging Face’s SmolLM2-1.7B base model. Its design leverages empirical findings on the critical role of the learning-rate-to-batch-size ratio in optimizing reasoning and pattern-recognition performance in small language models (SLMs). SmolTulu-1.7b-Instruct achieves state-of-the-art results among sub-2B-parameter models on instruction-following and mathematical-reasoning tasks through meticulous adaptation of post-training procedures and optimization dynamics, without modifying the underlying model architecture (Alrashed, 11 Dec 2024).

1. Model Foundation and Architectural Characteristics

SmolTulu-1.7b-Instruct is built upon the SmolLM2-1.7B decoder-only Transformer architecture described by Allal et al. (2024). The key configuration parameters are:

  • Number of Transformer layers: 32
  • Model (hidden) dimensionality: 2,048
  • Feed-forward inner dimensionality: 8,192
  • Number of self-attention heads: 16
  • Total parameter count: approximately 1.7 billion

No architectural modifications are introduced by SmolTulu; all improvements are attributable to post-training and alignment strategies. The model’s enhancements derive exclusively from procedure-level adaptations, encompassing instruction tuning and preference modeling, rather than network structural changes.
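
To illustrate, the base configuration can be inspected directly with the Hugging Face transformers library. This is a minimal sketch assuming the published checkpoint id HuggingFaceTB/SmolLM2-1.7B and standard Llama-style config attribute names; it is not code from the paper.

```python
# Minimal sketch: inspect the SmolLM2-1.7B configuration with transformers.
# Assumes the checkpoint id "HuggingFaceTB/SmolLM2-1.7B" and Llama-style
# attribute names; not code from the paper.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

print("transformer layers:", config.num_hidden_layers)
print("hidden size:       ", config.hidden_size)
print("ffn inner size:    ", config.intermediate_size)
print("attention heads:   ", config.num_attention_heads)
```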

2. Instruction-Tuning and Alignment Pipeline

SmolTulu-1.7b-Instruct employs a post-training pipeline closely following the Tulu 3 recipe but calibrated for 1.7B parameters, comprising two principal stages:

A. Supervised Finetuning (SFT):

  • Data: allenai/tulu-3-sft-mixture (includes sources such as alpaca, BBH, GSM8K, IFEval).
  • Max sequence length: 2,048 tokens.
  • Training: Single epoch over the dataset.
  • Learning rate schedule: Linear decay with 10% warm-up.
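
As a concrete illustration of this schedule, the sketch below uses the standard transformers linear-warmup helper; the model stand-in, optimizer, and step counts are placeholders, not the paper's training script.

```python
# Illustrative sketch of linear LR decay with 10% warm-up, as used in both
# the SFT and DPO stages; model, optimizer, and step counts are placeholders.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)                                # stand-in for the LM
optimizer = torch.optim.AdamW(model.parameters(), lr=9e-5)   # SFT-1130 peak LR

total_steps = 10_000                          # placeholder number of update steps
warmup_steps = int(0.10 * total_steps)        # 10% linear warm-up

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)
# In the training loop, call optimizer.step() then scheduler.step() each update.
```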

B. Direct Preference Optimization (DPO):

  • Data: allenai/llama-3.1-tulu-3-8b-preference-mixture (UltraFeedback + Tulu3 synthetic preferences).
  • Loss function: Length-normalized DPO loss with KL penalty $\beta = 5$:

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_c,\,y_r)\sim\mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_c \mid x)}{\pi_{\mathrm{ref}}(y_c \mid x)} - \beta \log \frac{\pi_\theta(y_r \mid x)}{\pi_{\mathrm{ref}}(y_r \mid x)} \right) \right]
$$

where $\pi_{\mathrm{ref}}$ denotes the frozen SFT model and $y_c$, $y_r$ are the chosen and rejected responses, respectively.

  • Max sequence length: 2,048 tokens.
  • Training: One epoch, linear learning rate decay with 10% warm-up.

These procedures allow transfer of alignment strategies designed for larger LLMs to computationally efficient SLM baselines, preserving key alignment properties post-tuning.
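
To make the objective above concrete, the following is a minimal PyTorch sketch of a length-normalized DPO loss with $\beta = 5$. The tensor layout, masking, and the specific normalization (averaging per-token log-probabilities over response length) are assumptions for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def length_normalized_dpo_loss(
    policy_chosen_logps,    # (batch, T_c) per-token log-probs under pi_theta
    policy_rejected_logps,  # (batch, T_r)
    ref_chosen_logps,       # (batch, T_c) per-token log-probs under pi_ref (SFT model)
    ref_rejected_logps,     # (batch, T_r)
    chosen_mask,            # (batch, T_c) 1.0 on response tokens, 0.0 on padding
    rejected_mask,          # (batch, T_r)
    beta: float = 5.0,
):
    """Length-normalized DPO: per-token log-probs are averaged over response
    length before forming the log-ratios, so longer completions are not favored."""
    def seq_logp(token_logps, mask):
        return (token_logps * mask).sum(-1) / mask.sum(-1).clamp(min=1)

    chosen_logratio = seq_logp(policy_chosen_logps, chosen_mask) \
        - seq_logp(ref_chosen_logps, chosen_mask)
    rejected_logratio = seq_logp(policy_rejected_logps, rejected_mask) \
        - seq_logp(ref_rejected_logps, rejected_mask)

    # -log sigma(beta * (chosen - rejected)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```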

3. Optimization Dynamics and Task-Dependent Hyperparameter Effects

A central finding is the strong impact of the learning-rate-to-batch-size ratio $R = \eta / B$ (where $\eta$ is the peak learning rate and $B$ is the batch size) on downstream model performance. Empirical exploration of several $(\eta, B)$ configurations showed that the optimal $R$ is task dependent:

SFT Hyperparameters:

| Variant | $\eta$ | $B$ | $R = \eta/B$ |
|---|---|---|---|
| SmolTulu SFT-1130 | $9.0 \times 10^{-5}$ | 8 | $1.125 \times 10^{-5}$ |
| SmolTulu SFT-1207 | $3.1 \times 10^{-6}$ | 32 | $9.69 \times 10^{-8}$ |
| Tulu 3 (8B) | $5.0 \times 10^{-6}$ | 128 | $3.91 \times 10^{-8}$ |
| Tulu 3 (70B) | $2.0 \times 10^{-6}$ | 128 | $1.56 \times 10^{-8}$ |

DPO Hyperparameters:

| Variant | $\eta$ | $B$ | $R = \eta/B$ |
|---|---|---|---|
| SmolTulu DPO-1130 | $8.0 \times 10^{-7}$ | 12 | $6.67 \times 10^{-8}$ |
| SmolTulu DPO-1207 | $5.0 \times 10^{-7}$ | 32 | $1.56 \times 10^{-8}$ |
| Tulu 3 DPO 8B | $5.0 \times 10^{-7}$ | 128 | $3.91 \times 10^{-9}$ |
| Tulu 3 DPO 70B | $2.0 \times 10^{-7}$ | 128 | $1.56 \times 10^{-9}$ |

In all cases, reasoning-centric benchmarks (e.g., ARC, GSM8K) benefit from higher $R$, while pattern-recognition benchmarks (e.g., HellaSwag, IFEval) favor lower $R$. At 1.7B parameters, model capacity creates a regime in which no single $R$ optimizes all metrics uniformly, necessitating task-specific tuning.
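
The ratio itself is straightforward to compute; the short helper below (illustrative, not from the paper) reproduces the $R$ values tabulated above.

```python
# Illustrative helper reproducing the R = eta / B values in the tables above.
def lr_to_batch_ratio(peak_lr: float, batch_size: int) -> float:
    """Return R = eta / B, the learning-rate-to-batch-size ratio."""
    return peak_lr / batch_size

configs = {
    "SmolTulu SFT-1130": (9.0e-5, 8),
    "SmolTulu DPO-1130": (8.0e-7, 12),
    "Tulu 3 DPO 8B":     (5.0e-7, 128),
}

for name, (lr, bs) in configs.items():
    print(f"{name:18s} R = {lr_to_batch_ratio(lr, bs):.3e}")
# SmolTulu SFT-1130  R = 1.125e-05
# SmolTulu DPO-1130  R = 6.667e-08
# Tulu 3 DPO 8B      R = 3.906e-09
```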

4. Empirical Performance Evaluation

Post SFT and DPO, SmolTulu-1.7b-Instruct (DPO-1130) demonstrates substantial improvements on key tasks:

| Task | SmolLM2-1.7B | SmolTulu DPO-1130 | Δ |
|---|---|---|---|
| GSM8K (5-shot) | 48.2% | 51.6% | +3.4% |
| ARC (avg) | 51.7% | 51.5%* | –0.2%* |
| ARC (alt config) | 51.7% | 57.1% | +5.4% |
| HellaSwag | 66.1% | 61.1%* | –5.0%* |
| IFEval (avg) | 56.7% | 67.7% | +11.0% |

*DPO-1130 yields strong reasoning gains on GSM8K and IFEval, while alternate DPO runs with lower $R$ substantially restore ARC accuracy. The improvements on GSM8K and IFEval are well above typical benchmark noise (±1–2%); no formal statistical confidence intervals were reported.

5. Ablation Analyses and Theoretical Context

Comprehensive ablation studies on a 135M-parameter SmolLM2 model reveal near-monotonic improvement in ARC and GSM8K accuracy as $R$ increases, whereas HellaSwag and IFEval show non-monotonic behavior, peaking at intermediate $R$ values ($\approx 2\times10^{-6}$–$5\times10^{-6}$). At 1.7B parameters, the model exhibits a capacity-limited divergence: high $R$ benefits reasoning, while low $R$ benefits pattern-recognition tasks, limiting the gains available from any single configuration.

Varying training epochs (from 1 to 2) shifts absolute scores by approximately 1–2% without altering the shape of the $R$–performance curves. These trends substantiate the theoretical perspective articulated by Keskar et al. (2017) and Masters & Luschi (2018), suggesting that high-$R$ (small-batch, higher-LR) training encourages flatter minima conducive to reasoning, whereas low-$R$ (large-batch) training supports pattern matching.
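
A sweep of this kind can be laid out by fixing a small batch size and choosing peak learning rates that hit target $R$ values; the grid below is an illustrative sketch of such an ablation design, not the paper's actual experiment code.

```python
# Illustrative R-sweep grid: fix a small batch size and vary the peak learning
# rate to cover the R range discussed above (including the ~2e-6 to 5e-6 band).
batch_size = 8
target_ratios = [5e-7, 1e-6, 2e-6, 5e-6, 1e-5]   # candidate R = eta / B values

grid = [
    {"R": r, "learning_rate": r * batch_size, "batch_size": batch_size}
    for r in target_ratios
]

for cfg in grid:
    print(f"R = {cfg['R']:.0e}  ->  lr = {cfg['learning_rate']:.1e}, bs = {cfg['batch_size']}")
```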

A plausible implication is that in small model regimes, optimization hyperparameters deserve particular scrutiny, specifically with respect to the end-task distribution of interest.

6. Training Recipe, Best Practice, and Generalization Guidelines

The canonical SmolTulu-1.7b-Instruct (DPO-1130) was produced with:

  • SFT: LR $= 9\times10^{-5}$, BS $= 8$ ($R = 1.125\times10^{-5}$), 1 epoch, max seq. length 2,048, 10% warm-up.
  • DPO: LR $= 8\times10^{-7}$, BS $= 12$ ($R = 6.67\times10^{-8}$), 1 epoch, max seq. length 2,048, 10% warm-up, KL penalty $\beta = 5$.
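
For reference, the same recipe can be written out as plain configuration dictionaries; the key names below are illustrative and not tied to any particular training framework.

```python
# Canonical SmolTulu-1.7b-Instruct (DPO-1130) recipe as plain dictionaries;
# key names are illustrative, not tied to a specific framework.
SFT_CONFIG = {
    "learning_rate": 9e-5,
    "batch_size": 8,              # R = 9e-5 / 8 = 1.125e-5
    "num_epochs": 1,
    "max_seq_length": 2048,
    "warmup_ratio": 0.10,         # linear decay after 10% warm-up
    "dataset": "allenai/tulu-3-sft-mixture",
}

DPO_CONFIG = {
    "learning_rate": 8e-7,
    "batch_size": 12,             # R = 8e-7 / 12 ~= 6.67e-8
    "num_epochs": 1,
    "max_seq_length": 2048,
    "warmup_ratio": 0.10,
    "beta": 5.0,                  # KL penalty in the length-normalized DPO loss
    "dataset": "allenai/llama-3.1-tulu-3-8b-preference-mixture",
}
```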

Best practice guidelines for sub-2B parameter LMs are documented as:

  1. For SFT, use $R \approx 10^{-5}$ if $B \leq 16$.
  2. For DPO, employ $R$ values 5–10× larger than those for 8B+ models ($R \approx 10^{-7}$–$10^{-8}$).
  3. Favor higher $R$ for math and logic reasoning tasks.
  4. Favor lower $R$ for pattern-recognition/commonsense tasks.
  5. Only adjust $\eta$ and $B$; keep all other recipe elements (sequence length, warm-up, epochs) consistent with Tulu 3.
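
As a rough encoding of guidelines 1–4, the heuristic below is an illustrative sketch (not a procedure from the paper) for picking a starting $R$ by stage and dominant task type.

```python
# Rough heuristic encoding guidelines 1-4 above; an illustrative sketch only.
def suggest_ratio(stage: str, task_type: str) -> float:
    """Suggest a starting R = eta / B for a sub-2B parameter model."""
    if stage == "sft":
        base = 1e-5        # guideline 1: R ~ 1e-5 when B <= 16
    elif stage == "dpo":
        base = 5e-8        # guideline 2: within ~1e-7 to 1e-8, larger than 8B+ recipes
    else:
        raise ValueError(f"unknown stage: {stage}")

    if task_type == "reasoning":   # guideline 3: math/logic -> higher R
        return base * 2
    if task_type == "pattern":     # guideline 4: pattern/commonsense -> lower R
        return base / 2
    return base

print(suggest_ratio("sft", "reasoning"))   # 2e-05
print(suggest_ratio("dpo", "pattern"))     # 2.5e-08
```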

Careful calibration of $R$ enables SLMs to bridge much of the capability gap with multi-billion-parameter models on both instruction-following and reasoning benchmarks (Alrashed, 11 Dec 2024).
