Supervised Fine-Tuning (SFT): Enhancing LLM Performance
Last updated: June 10, 2025
Supervised Fine-Tuning (SFT): Data Composition and Model Capability in LLMs
Supervised Fine-Tuning (SFT) is a critical methodology for enhancing the capabilities of LLMs beyond raw pretraining. By further tuning models on curated instruction–response datasets, SFT sharpens skills such as mathematical reasoning, code generation, and general conversational alignment.
This synthesis distills the empirical findings and practical guidance from a systematic study of how SFT data composition—specifically the balance and volume of domain-specific examples—affects LLM abilities, scaling behavior, and trade-offs (Dong et al., 2023).
1. Data Composition: Impacts on Model Abilities
- Mathematical reasoning improves strongly and steadily as the amount of domain-specific math data increases.
- Code generation performance scales nearly log-linearly with additional code data for large models, while smaller models show less consistent gains.
- General instructional or chat ability (human-alignment) emerges rapidly—plateauing after ~1,000 in-domain samples—such that further data yields diminishing returns.
Key empirical insight: When tuning data is scarce (i.e., low-resource SFT), mixing samples from math, code, and general domains yields synergistic improvements across all abilities. However, in high-resource regimes, combining domains can introduce domain conflicts: optimizing one skill may degrade another.
Practical implication: Optimal SFT should prioritize increasing the absolute number of in-domain samples for each target skill. Careful domain mixing is beneficial under low data, but strict separation is often better when sufficient data per domain is available.
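As a concrete illustration of this guidance, here is a minimal Python sketch of assembling an SFT set under the two regimes; the function name `build_sft_mixture`, the domain keys, and the drawing policy are hypothetical, not from the paper.

```python
import random

def build_sft_mixture(domain_data: dict, budget: int, mix: bool) -> list:
    """Assemble an SFT training set from per-domain example pools.

    domain_data: maps a domain name ("math", "code", "general") to its examples.
    budget:      total number of samples to draw.
    mix:         True  -> draw evenly across domains (low-resource regime);
                 False -> draw everything from the largest single domain
                          (high-resource, specialized regime).
    """
    if mix:
        per_domain = budget // len(domain_data)
        samples = []
        for examples in domain_data.values():
            samples.extend(random.sample(examples, min(per_domain, len(examples))))
    else:
        # Keep domains separate: specialize on the best-resourced domain.
        largest = max(domain_data.values(), key=len)
        samples = random.sample(largest, min(budget, len(largest)))
    random.shuffle(samples)
    return samples
```

With only a few thousand samples in total, `mix=True` exploits the cross-domain synergy reported in the paper; with ample per-domain data, separate specialized runs avoid domain conflict.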
2. Scaling Properties: Data, Composition, and Model Size
- Absolute data quantity for each domain dominates performance scaling; the tuning-data ratio (e.g., math:code:general) is only a secondary factor except at extremes.
- Larger models consistently outperform smaller models when trained on the same data volume, especially in specialized skills (math, code). For code, gains are close to log-linear with additional data.
- Emergent behavior in general ability: General conversational (instruction/chat) skills reach high performance with relatively little data, and further increases offer limited benefit.
Table Example (LLaMA-33B, summarized):
| SFT Data | GSM8K (Math) | HumanEval (Code) | MT-Bench (General) |
|---|---|---|---|
| General SFT | 26.06 | 24.30 | High (emerges fast) |
| Math SFT only | 57.91 | 24.74 | – |
| Code SFT only | 26.23 | 26.82 | – |
Takeaways:
- For specialized domains, sufficient in-domain data is essential.
- For general abilities, moderate data is usually enough; excess offers little gain.
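To make the near log-linear scaling trend for code concrete, the sketch below fits performance as a linear function of the logarithm of sample count; the data points are hypothetical and for illustration only.

```python
import numpy as np

# Hypothetical points: number of SFT code samples vs. HumanEval pass@1 (%).
n_samples = np.array([1_000, 4_000, 16_000, 64_000])
pass_at_1 = np.array([18.0, 21.5, 24.8, 28.3])

# Fit performance ~ a * log(n) + b, the near log-linear trend reported for code.
a, b = np.polyfit(np.log(n_samples), pass_at_1, deg=1)

def predict(n: int) -> float:
    return a * np.log(n) + b

print(f"each doubling adds ~{a * np.log(2):.2f} points; pred(128k) = {predict(128_000):.1f}")
```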
3. SFT Strategies: Catastrophic Forgetting & Skill Retention
Sequential SFT (tuning for different abilities in a series, e.g., code → math → general) is prone to catastrophic forgetting: acquiring new skills erases previous ones.
Multi-task SFT (mixing all data in a single run) better preserves diverse capabilities but can compromise peak performance per skill, especially with large data volumes.
| Method | GSM8K | HumanEval | MT-Bench |
|---|---|---|---|
| Multi-task | 50.94 | 19.50 | 5.73 |
| Sequential | 39.12 | 20.12 | 5.93 |
| Mixed Sequential | 40.48 | 18.30 | 5.93 |
| Dual-stage Mixed (DMT) | 46.47 | 19.50 | 6.03 |
Dual-stage Mixed Fine-Tuning (DMT) Strategy:
- Stage 1: Fully fine-tune on specialist data (e.g., all math, all code) to consolidate those skills.
- Stage 2: Fine-tune primarily on general data, injecting a small fraction (e.g., k = 1/256) of specialist data to maintain those skills and prevent forgetting.
Recommendation: Adopt DMT or similar strategies when balancing multiple complex abilities is needed.
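A minimal sketch of the DMT data schedule described above; `dmt_schedule` and the commented `train_one_stage` helper are hypothetical, and the exact sampling mechanics are not the paper's code.

```python
import random

def dmt_schedule(math_data, code_data, general_data, k=1/256):
    """Dual-stage Mixed Fine-Tuning (DMT) data schedule.

    Stage 1: all specialist data (math + code) to consolidate those skills.
    Stage 2: general data plus a small fraction k of each specialist set,
             which counteracts catastrophic forgetting of stage-1 skills.
    """
    stage1 = math_data + code_data
    random.shuffle(stage1)

    keep_math = random.sample(math_data, max(1, int(k * len(math_data))))
    keep_code = random.sample(code_data, max(1, int(k * len(code_data))))
    stage2 = general_data + keep_math + keep_code
    random.shuffle(stage2)
    return stage1, stage2

# stage1, stage2 = dmt_schedule(math_data, code_data, general_data)
# model = train_one_stage(model, stage1)   # hypothetical training helper
# model = train_one_stage(model, stage2)
```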
4. Quantity vs. Ratio: What Drives Ability Improvement?
- Data amount, not just ratio, is decisive: Improving a given skill is most strongly determined by the number of relevant samples in the SFT set, not their proportion relative to other domains—unless a skill is starved entirely.
- Mixing data acts as a regularizer in low-resource settings, but becomes “noise” at scale, reducing peak specialization.
- Ablation studies demonstrated that scaling the data of one domain (while holding the others constant) has far more impact than tweaking ratios; the sketch below illustrates this design.
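A sketch of that ablation design, with illustrative (not reported) sample counts: sweep one domain's count while holding the others fixed.

```python
fixed = {"code": 10_000, "general": 10_000}      # sample counts held constant
math_scales = [2_000, 8_000, 32_000, 128_000]    # swept: math sample count

ablation_runs = [{"math": n, **fixed} for n in math_scales]
for cfg in ablation_runs:
    print(cfg)  # each config defines one independent fine-tuning run
```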
5. Mathematical Foundations for SFT Planning
- Data sets: Let $D_1, \dots, D_k$ denote the domain-specific SFT datasets, with each $D_i$ containing $n_i$ question–response pairs $(q, r)$.
- Performance metrics: Computed on the in-domain benchmark for each $D_i$ (GSM8K for math, HumanEval for code, MT-Bench for general ability).
- FLOPs for training (using the standard estimate of roughly 6 FLOPs per parameter per training token):

$$\text{FLOPs} \approx 6 \cdot L \cdot P \cdot N$$

where $L$ is the average combined question and response length in tokens, $P$ is the number of model parameters, and $N$ is the number of SFT samples.
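A direct translation of this estimate into code, assuming the standard approximation of roughly 6 FLOPs per parameter per training token; the example numbers are illustrative.

```python
def sft_training_flops(avg_len_tokens: float, n_params: float, n_samples: int,
                       epochs: int = 1) -> float:
    """Estimate training FLOPs: ~6 * params * tokens per forward+backward pass."""
    total_tokens = avg_len_tokens * n_samples * epochs
    return 6 * n_params * total_tokens

# Example: 33B-parameter model, 1,024-token samples, 100k SFT examples.
print(f"{sft_training_flops(1024, 33e9, 100_000):.2e} FLOPs")
```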
6. Summary and Practical Guidelines
- Specialized skills (math, code, etc.) require volume growth; general alignment emerges with moderate data.
- Larger LLMs are notably more effective in both mixed and low-resource SFT.
- Absolute sample size per domain governs performance; fix sample number before finetuning ratio.
- Avoid catastrophic forgetting via multi-stage or dual-mixed SFT (e.g., DMT).
- When resources are limited, use data mixing for balanced skill uplift; with ample data, keep domains separate or use staged approaches.
For practitioners: To train robust, multi-skill LLMs:
- Secure a large, domain-targeted SFT dataset for each target ability.
- Employ DMT or similar SFT strategies for maintaining multiple competencies.
- Scale model size where feasible; larger models are more versatile, especially under low-resource or mixed-domain SFT.
References: All experimental details, metrics, and strategies are sourced from "How Abilities in LLMs are Affected by Supervised Fine-tuning Data Composition" (Dong et al., 2023). Benchmarks: GSM8K (math reasoning), HumanEval (code generation), MT-Bench (instruction/general chat). FLOPs and data ratio formulas follow the original notation.