
Supervised Fine-Tuning (SFT): Enhancing LLM Performance

Last updated: June 10, 2025



Supervised Fine-Tuning (SFT): Data Composition and Model Capability in LLMs

Supervised Fine-Tuning (SFT) is a critical methodology for enhancing the capabilities of LLMs beyond raw pretraining. By further tuning models on curated instruction–response datasets, SFT sharpens skills such as mathematical reasoning, code generation, and general conversational alignment.

This synthesis distills the empirical findings and practical guidance from a systematic study of how SFT data composition, specifically the balance and volume of domain-specific examples, affects LLM abilities, scaling behavior, and trade-offs (Dong et al., 2023).


1. Data Composition: Impacts on Model Abilities

  • Mathematical reasoning improves strongly and steadily as the amount of domain-specific math data increases.
  • Code generation performance scales nearly log-linearly with additional code data for large models, while smaller models show less consistent gains.
  • General instructional or chat ability (human alignment) emerges rapidly, plateauing after ~1,000 in-domain samples, such that further data yields diminishing returns.

Key empirical insight: When tuning data is scarce (i.e., low-resource SFT), mixing samples from math, code, and general domains yields synergistic improvements across all abilities. However, in high-resource regimes, combining domains can introduce domain conflicts: optimizing one skill may degrade another.

Practical implication: Optimal SFT should prioritize increasing the absolute number of in-domain samples for each target skill. Careful domain mixing is beneficial under low data, but strict separation is often better when sufficient data per domain is available.
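
A minimal sketch of such a composition policy is shown below, assuming per-domain pools of instruction–response pairs. The pool names, toy sample counts, and the low-resource threshold are illustrative assumptions, not values from the paper.

```python
import random

def compose_sft_mixture(domain_pools, low_resource_threshold=10_000, seed=0):
    """Compose an SFT training plan from per-domain pools.

    Follows the paper's qualitative guidance: mix domains when total data is
    scarce (mixing acts as a regularizer), keep domains separate when each
    pool is already large. The threshold value is an illustrative assumption.
    """
    rng = random.Random(seed)
    total = sum(len(pool) for pool in domain_pools.values())

    if total < low_resource_threshold:
        # Low-resource regime: pool everything and shuffle for a single mixed run.
        mixture = [example for pool in domain_pools.values() for example in pool]
        rng.shuffle(mixture)
        return {"mixed": mixture}

    # High-resource regime: keep domains separate to avoid cross-domain conflicts.
    return {name: list(pool) for name, pool in domain_pools.items()}

# Toy usage with placeholder examples (real pools would hold instruction-response pairs).
pools = {
    "math":    [{"q": f"math-{i}", "r": "..."} for i in range(800)],
    "code":    [{"q": f"code-{i}", "r": "..."} for i in range(600)],
    "general": [{"q": f"chat-{i}", "r": "..."} for i in range(400)],
}
plan = compose_sft_mixture(pools)
print({name: len(data) for name, data in plan.items()})  # {'mixed': 1800} in this low-resource case
```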


2. Scaling Properties: Data, Composition, and Model Size

  • Absolute data quantity per domain dominates performance scaling; the mixing ratio (e.g., math:code:general) is only a secondary factor except at extremes.
  • Larger models consistently outperform smaller models when trained on the same data volume, especially in specialized skills (math, code). For code, gains are close to log-linear with additional data.
  • Emergent behavior in general ability: General conversational (instruction/chat) skills reach high performance with relatively little data, and further increases offer limited benefit.

Table example (LLaMA-33B, summarized):

SFT Data         GSM8K (Math)   HumanEval (Code)   MT-Bench (General)
General SFT      26.06          24.30              High (emerges fast)
Math SFT only    57.91          24.74              —
Code SFT only    26.23          26.82              —

Takeaways:

  • For specialized domains, sufficient in-domain data is essential.
  • For general abilities, moderate data is usually enough; excess offers little gain.
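
The near log-linear scaling reported for code can be checked with a simple fit of benchmark score against the logarithm of the sample count. The data volumes and scores below are placeholder numbers, not measurements from the paper.

```python
import numpy as np

# Placeholder (data volume, benchmark score) pairs; replace with real evaluation results.
samples = np.array([1_000, 4_000, 16_000, 64_000, 256_000])
scores  = np.array([18.0, 21.5, 24.9, 28.2, 31.6])   # e.g. HumanEval pass@1

# Fit score ≈ a * log(samples) + b, the "log-linear" scaling form.
a, b = np.polyfit(np.log(samples), scores, deg=1)
print(f"slope per log-unit of data: {a:.2f}, intercept: {b:.2f}")

# Extrapolate (cautiously) to a larger data budget.
predicted = a * np.log(1_000_000) + b
print(f"predicted score at 1M samples: {predicted:.1f}")
```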

3. SFT Strategies: Catastrophic Forgetting & Skill Retention

Sequential SFT (tuning for different abilities in a series, e.g., code → math → general) is prone to catastrophic forgetting: acquiring new skills erases previous ones.

Multi-task SFT (mixing all data in a single run) better preserves diverse capabilities but can compromise peak performance per skill, especially with large data volumes.

Method                    GSM8K   HumanEval   MT-Bench
Multi-task                50.94   19.50       5.73
Sequential                39.12   20.12       5.93
Mixed Sequential          40.48   18.30       5.93
Dual-stage Mixed (DMT)    46.47   19.50       6.03

Dual-stage Mixed Fine-Tuning (DMT) Strategy:

  • Stage 1: Fully fine-tune on specialist data (e.g., all math, all code) to consolidate those skills.
  • Stage 2: Fine-tune primarily on general data, injecting a small fraction (e.g., k = 1/256) of specialist data to maintain those skills and prevent forgetting.

Recommendation: Adopt DMT or a similar strategy when multiple complex abilities must be balanced.
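
Below is a minimal sketch of how the Stage-2 mixture for DMT might be assembled. The k = 1/256 specialist fraction follows the paper, while the data layout, pool sizes, and helper function are illustrative assumptions.

```python
import random

def dmt_stage2_dataset(general_data, specialist_pools, k=1/256, seed=0):
    """Build the Stage-2 mixture for Dual-stage Mixed Fine-Tuning (DMT).

    Stage 1 is an ordinary fine-tuning pass over the full specialist pools
    (math, code, ...). Stage 2 then trains mainly on general data, with a
    small fraction k of each specialist pool mixed back in to limit
    catastrophic forgetting.
    """
    rng = random.Random(seed)
    mixture = list(general_data)
    for name, pool in specialist_pools.items():
        keep = max(1, int(len(pool) * k))      # small specialist slice per domain
        mixture.extend(rng.sample(pool, keep))
    rng.shuffle(mixture)
    return mixture

# Toy usage: 10k general samples, 8k math and 8k code specialist samples.
general = [{"domain": "general", "id": i} for i in range(10_000)]
special = {
    "math": [{"domain": "math", "id": i} for i in range(8_000)],
    "code": [{"domain": "code", "id": i} for i in range(8_000)],
}
stage2 = dmt_stage2_dataset(general, special)
print(len(stage2))  # 10,062: all general data plus 31 samples from each specialist pool
```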


4. Quantity vs. Ratio: What Drives Ability Improvement?

  • Data amount, not just ratio, is decisive: Improving a given skill is most strongly determined by the number of relevant samples in the SFT set, not their proportion relative to other domains—unless a skill is starved entirely.
  • Mixing data acts as a regularizer in low-resource settings, but becomes “noise” at scale, reducing peak specialization.
  • Ablation studies demonstrate that scaling the data of one domain (while holding others constant) has far more impact than tweaking ratios; see the sketch below.
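
One way to express such an ablation is as a grid of training configurations in which a single domain's sample count is scaled while the others stay fixed. The baseline counts and scale factors below are placeholders, not the paper's settings.

```python
from itertools import product

# Fixed baseline sample counts per domain (placeholder values).
baseline = {"math": 20_000, "code": 20_000, "general": 20_000}

# Scale factors applied to a single domain at a time, others held constant.
scales = [0.25, 0.5, 1, 2, 4]

ablation_runs = []
for domain, scale in product(baseline, scales):
    counts = dict(baseline)
    counts[domain] = int(baseline[domain] * scale)
    ablation_runs.append({"scaled_domain": domain, "scale": scale, "counts": counts})

print(len(ablation_runs))   # 15 configurations (3 domains x 5 scales)
print(ablation_runs[0])     # e.g. math scaled to 5,000 samples, code and general fixed
```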

5. Mathematical Foundations for SFT Planning

Datasets: Let $\{D_1, D_2, \ldots, D_k\}$ be domain-specific SFT datasets, with each $D_i = \{q_{i,j}, r_{i,j}\}_j$ a set of question–response pairs.

  • Performance metrics: Computed on the in-domain benchmark for each $D_i$.

FLOPs for training:

$$n_\text{ctx} = n_Q + n_R$$

$$C_\text{train} \approx 6\, N\, n_\text{ctx}\, N_s$$

where $n_Q$ and $n_R$ are the average question and response lengths, $N$ is the number of model parameters, and $N_s$ is the number of SFT samples.
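
As a quick sanity check, the formula can be evaluated for a hypothetical run; the model size, sequence lengths, and sample count in the sketch below are assumptions, not figures from the paper.

```python
def sft_training_flops(n_params, avg_question_len, avg_response_len, n_samples):
    """Approximate SFT training compute: C_train ≈ 6 * N * n_ctx * N_s."""
    n_ctx = avg_question_len + avg_response_len
    return 6 * n_params * n_ctx * n_samples

# Hypothetical run: 33B-parameter model, 512-token questions and responses, 100k samples.
flops = sft_training_flops(33e9, 512, 512, 100_000)
print(f"{flops:.2e} FLOPs")  # ≈ 2.03e+19
```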


6. Summary and Practical Guidelines

  • Specialized skills (math, code, etc.) require volume growth; general alignment emerges with moderate data.
  • Larger LLMs are notably more effective in both mixed and low-resource SFT.
  • Absolute sample size per domain governs performance; settle the per-domain sample count before tuning the mixing ratio.
  • Avoid catastrophic forgetting via multi-stage or dual-mixed SFT (e.g., DMT).
  • When resources are limited, use data mixing for balanced skill uplift; with ample data, keep domains separate or use staged approaches.

For practitioners: To train robust, multi-skill LLMs:

  • Secure a large, domain-targeted SFT dataset for each target ability.
  • Employ DMT or similar SFT strategies for maintaining multiple competencies.
  • Scale model size where feasible; larger models are more versatile, especially under low-resource or mixed-domain SFT.

References: All experimental details, metrics, and strategies are sourced from "How Abilities in LLMs are Affected by Supervised Fine-tuning Data Composition" (Dong et al., 2023). Benchmarks: GSM8K (math reasoning), HumanEval (code generation), MT-Bench (instruction/general chat). FLOPs and data ratio formulas follow the original notation.