Supervised Fine-Tuning (SFT) Strategy
Supervised Fine-Tuning (SFT) is a central paradigm for aligning LLMs with specific skills and human expectations through additional post-pretraining optimization on curated datasets. With the emergence of LLMs demonstrating broad abilities in mathematics, coding, and general human communication, understanding how data composition, scaling, and fine-tuning strategy impact multi-ability acquisition is crucial. Notably, the distinction between improving single abilities and mastering multiple capabilities concurrently has deep ramifications for open-source and proprietary model training.
1. Influence of Data Composition on Ability Emergence and Transfer
SFT’s impact on LLM abilities is fundamentally shaped by the composition of the supervised training data. Three core domains—mathematical reasoning (GSM8K), code generation (Code Alpaca, HumanEval), and general human alignment (ShareGPT, MT-Bench)—serve as benchmarks for multi-ability acquisition.
- In low-resource regimes (little per-domain data), combining data from multiple domains during SFT enables substantial cross-domain transfer. For example, training jointly on math, code, and general alignment data in small quantities yields superior performance compared to any domain alone, with larger LLM architectures (e.g., LLaMA-33B) better capitalizing on this synergy.
- In high-resource settings, however, mixed-domain SFT gives way to performance conflicts (“interference”) as irrelevant domain data becomes noise, diluting the learning signal for high-volume domains. Thus, data mixing is beneficial at small scale but potentially deleterious as amounts increase.
Empirical evidence substantiates these claims. Performance metrics on GSM8K, HumanEval, and MT-Bench, as catalogued in Table 1, show gains or losses contingent on data mixture and model size. Scaling curves (Figures 1–2) demonstrate that mathematical and code abilities continue to improve as data grows, while general alignment scores plateau rapidly after only a modest amount of data.
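To make the mixing setup concrete, here is a minimal sketch of how a multi-domain SFT mixture might be assembled by sampling a fixed number of examples per domain. The file names and the `load_jsonl`/`mix_sft_data` helpers are hypothetical illustrations, not the paper's pipeline.

```python
import json
import random

def load_jsonl(path):
    """Read one JSON record (instruction/response pair) per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def mix_sft_data(pools, per_domain_n, seed=0):
    """Sample per_domain_n examples from each domain pool and shuffle.

    Small per_domain_n corresponds to the low-resource regime where
    cross-domain transfer was observed; large values correspond to the
    high-resource regime where mixing can introduce interference.
    """
    rng = random.Random(seed)
    mixture = []
    for pool in pools.values():
        mixture.extend(rng.sample(pool, min(per_domain_n, len(pool))))
    rng.shuffle(mixture)
    return mixture

# Hypothetical file names; the three domains mirror the paper's setup.
pools = {
    "math": load_jsonl("gsm8k_sft.jsonl"),
    "code": load_jsonl("code_alpaca.jsonl"),
    "general": load_jsonl("sharegpt.jsonl"),
}
low_resource_mix = mix_sft_data(pools, per_domain_n=1_000)
```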
Formal Data Scaling Patterns
Let $D = \bigcup_i D_i$ be the total SFT data, where $D_i$ is the dataset of domain $i$ and $n_i = |D_i|$ its sample count. Performance for math and code abilities strictly increases with more data,

$$\frac{\partial P_{\text{math}}}{\partial n_{\text{math}}} > 0, \qquad \frac{\partial P_{\text{code}}}{\partial n_{\text{code}}} > 0,$$

but for general alignment it saturates beyond a small threshold:

$$P_{\text{gen}}(n_{\text{gen}}) \approx P_{\text{gen}}(n_0) \quad \text{for } n_{\text{gen}} \ge n_0 \approx 10^3.$$
This relationship underscores the non-uniform scaling laws for model abilities.
2. Scaling Laws and Model Size Effects
Distinct abilities scale with SFT data differently:
- Mathematical reasoning benefits monotonically from more data at all scales—more math data always yields improvement.
- Code generation shows irregular scaling in small models and approximately log-linear uptick in larger models with more code data.
- General alignment can be induced with as little as ~1,000 high-quality samples, after which additional data has little effect.
- Model size modulates these trends: Larger LLMs amplify the benefits of data mixing at small scales, but, paradoxically, at very small data sizes, smaller models may outperform due to robustness and reduced overfitting.
Implication: Each ability’s emergence and improvement requires domain-specific data scaling. There is no universal optimal data split; rather, data allocation must align with the scaling properties of each ability domain.
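As a toy illustration of these scaling shapes, the snippet below encodes a log-linear gain for math/code and a saturating curve for general alignment, then allocates a fixed data budget accordingly. The constants are assumptions chosen for illustration, not values fitted from the paper.

```python
import math

def predicted_gain(domain: str, n: int) -> float:
    """Toy scaling curves (illustrative constants, not fitted values)."""
    if domain in ("math", "code"):
        # Roughly log-linear: every 10x more data adds a fixed increment.
        return 0.05 * math.log10(max(n, 1))
    if domain == "general":
        # Saturates once ~1k high-quality samples are seen.
        return 0.30 * min(n, 1_000) / 1_000
    raise ValueError(f"unknown domain: {domain}")

# Budgeting consequence: past ~1k, extra general data buys almost nothing,
# so the remaining budget is better spent on math and code.
budget = 100_000
allocation = {"general": 1_000}
remaining = budget - allocation["general"]
allocation["math"] = allocation["code"] = remaining // 2
```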
3. Dual-Stage Mixed Fine-Tuning (DMT): Mitigating Catastrophic Forgetting
Sequential SFT (training on one ability, then the next) catastrophically forgets earlier abilities. Multi-task SFT (simple mixture) can induce inter-domain interference, especially at higher data volumes. To address this, the paper proposes the Dual-Stage Mixed Fine-Tuning (DMT) strategy:
- Stage 1: SFT is performed on all specialized data (math and code) jointly, maximizing those abilities.
- Stage 2: SFT is run on general alignment data, mixed with a small fraction $k$ of the specialized data (e.g., $k = 1/256$), preserving prior abilities and reducing forgetting.
Diagrammatically:

[Stage 1: SFT on Math + Code] → [Stage 2: SFT on General + small subset of Math + Code]
Empirical results (see Table 1, DMT column) show that DMT recovers math/code ability otherwise lost after sequential or naive multi-task SFT, while maintaining general alignment ability. This approach exploits the rehearsal mechanism from continual learning, integrating a minimal amount of prior-domain data to inhibit parameter drift.
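A minimal sketch of the DMT schedule follows, assuming a generic `sft` training pass (a stand-in for whatever trainer is used) and a rehearsal fraction `k`; the function names are hypothetical, but the two-stage structure mirrors the recipe above.

```python
import random

def sft(model, examples):
    """Placeholder for one supervised fine-tuning pass over `examples`."""
    # ... run the actual trainer here and return the updated model ...
    return model

def dmt_finetune(model, math_data, code_data, general_data, k=1 / 256, seed=0):
    """Dual-Stage Mixed Fine-Tuning, per the two-stage recipe above."""
    rng = random.Random(seed)
    specialized = math_data + code_data

    # Stage 1: SFT on all specialized data to maximize math/code ability.
    model = sft(model, specialized)

    # Stage 2: SFT on general alignment data plus a k-fraction rehearsal
    # sample of the specialized data, inhibiting catastrophic forgetting.
    rehearsal = rng.sample(specialized, max(1, int(k * len(specialized))))
    return sft(model, general_data + rehearsal)
```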
4. Data Amount and Composition Ratio: Quantitative Effects
- Absolute data amount in each domain is the predominant driver of ability improvement, rather than their ratio within the total mix.
- For example, increasing the quantity of math data amidst a fixed general/code ratio directly raises math scores.
- Composition ratio (e.g., code:general at 1:1 or 1:256) is largely irrelevant unless the domains overlap semantically (such as code/general), at which point high overlap can induce modest interference.
- Performance stability: SFT outcomes are robust to a wide range of mixture ratios, except under extreme imbalance or overlap (see Figure 3 and the appendix table).
t-SNE visualizations further reveal that code and general alignment abilities can share representational space, explaining their marginally higher ratio sensitivity compared to math/general compositions.
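The amount-versus-ratio distinction can be made explicit with a small sketch: sweep (a) holds the ratio fixed while scaling absolute counts, which is what moves scores; sweep (b) holds one domain's amount fixed and varies the ratio, which is largely inert absent semantic overlap.

```python
def mixture(code_n: int, general_n: int) -> dict:
    """Describe a two-domain mixture by absolute counts and ratio."""
    return {"code": code_n, "general": general_n,
            "ratio": f"{code_n}:{general_n}"}

# (a) Fixed 1:1 ratio, growing absolute amounts: raises benchmark scores.
scaling_amount = [mixture(n, n) for n in (1_000, 10_000, 100_000)]

# (b) Fixed code amount, ratio swept from 1:1 up to 1:256: largely flat,
#     apart from mild interference where domains overlap semantically.
sweeping_ratio = [mixture(10_000, 10_000 * r) for r in (1, 4, 16, 64, 256)]
```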
5. Recommendations and Practical Guidelines
- Match SFT data allocation to ability-specific scaling: For hard domains like math/code, prioritize sample count; for general abilities, a small curated set (~1k) suffices.
- Use larger models for joint or multi-ability SFT, especially in low-resource regimes, to best leverage cross-domain transfer and minimize interference.
- Employ the DMT strategy to preserve and accumulate multiple skills with minimal catastrophic forgetting.
- Avoid excessive focus on data ratios; overall coverage and scale, tailored by domain, are more influential.
- When abilities overlap (e.g., code and conversational instruction), consider more nuanced mixing or modular architectures.
These findings generalize to other ability domains, such as language understanding and translation, as demonstrated in supplementary experiments.
6. Future Directions and Open Research Problems
The paper indicates several research avenues:
- Extending data composition analysis to additional skills (e.g., creative generation, factual reasoning).
- Investigating parameter-efficient fine-tuning techniques (e.g., adapters, LoRA) within the multi-ability SFT framework.
- Automating selection of the DMT mixing ratio ($k$) and monitoring performance online during SFT to dynamically tune strategy.
- Integrating these SFT strategies at compatible stages with RLHF or preference optimization for composite alignment objectives.
7. Summary Table: SFT Data Strategy Optimization
| Consideration | Empirical Finding | Practical Impact |
| --- | --- | --- |
| Data amount per domain | Strictly determines ability emergence | Prioritize scale for hard skills; a small curated set suffices for general alignment |
| Data ratio (mixing) | Less important than absolute amount | Ratio tuning is not critical in most regimes |
| Model size | Larger models amplify mixing benefits | Prefer large LLMs for composite SFT |
| DMT vs. sequential/mixed SFT | DMT best preserves all abilities | Use DMT in multi-ability SFT scenarios |
| Catastrophic forgetting | Major risk with sequential SFT | Mix in prior-task data to prevent loss |
Supervised Fine-Tuning (SFT) data composition exerts a profound influence on the development and retention of diverse LLM abilities. Domain-specific scaling rules dictate effective data allocation, while the DMT strategy offers a practical solution to learn, and preserve, multiple skills in modern LLMs, particularly at scale. Allocation of data, model size selection, and rehearsal-informed training design together form the cornerstone of effective SFT in broad-domain, multi-ability settings (Dong et al., 2023).