Supervised Fine-Tuning (SFT) Strategy
Supervised Fine-Tuning (SFT) is a central paradigm for aligning LLMs with specific skills and human expectations through additional post-pretraining optimization on curated datasets. With the emergence of LLMs demonstrating broad abilities in mathematics, coding, and general human communication, understanding how data composition, scaling, and fine-tuning strategy impact multi-ability acquisition is crucial. Notably, the distinction between improving single abilities and mastering multiple capabilities concurrently has deep ramifications for open-source and proprietary model training.
1. Influence of Data Composition on Ability Emergence and Transfer
SFT’s impact on LLM abilities is fundamentally shaped by the composition of the supervised training data. Three core domains—mathematical reasoning (GSM8K), code generation (Code Alpaca, HumanEval), and general human alignment (ShareGPT, MT-Bench)—serve as benchmarks for multi-ability acquisition.
- In low-resource regimes (little per-domain data), combining data from multiple domains during SFT enables substantial cross-domain transfer. For example, training jointly on math, code, and general alignment data in small quantities yields superior performance compared to any domain alone, with larger LLM architectures (e.g., LLaMA-33B) better capitalizing on this synergy.
- In high-resource settings, however, mixed-domain SFT gives way to performance conflicts (“interference”) as irrelevant domain data becomes noise, diluting the learning signal for high-volume domains. Thus, data mixing is beneficial at small scale but potentially deleterious as amounts increase.
Empirical evidence substantiates these claims. Performance metrics on GSM8K, HumanEval, and MT-Bench, as catalogued in Table 1, show gains or losses contingent on data mixture and model size. Scaling curves (Figures 1–2) demonstrate that mathematical and code abilities continue to improve as data grows, while general alignment scores plateau rapidly after only a modest amount of data.
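To make the mixing setup concrete, here is a minimal sketch of how a multi-domain SFT mixture might be assembled by sampling a fixed number of examples per domain. The file names and the `load_jsonl`/`mix_sft_data` helpers are hypothetical illustrations, not the paper's pipeline.

```python
import json
import random

def load_jsonl(path):
    """Read one JSON record (instruction/response pair) per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def mix_sft_data(pools, per_domain_n, seed=0):
    """Sample per_domain_n examples from each domain pool and shuffle.

    Small per_domain_n corresponds to the low-resource regime where
    cross-domain transfer was observed; large values correspond to the
    high-resource regime where mixing can introduce interference.
    """
    rng = random.Random(seed)
    mixture = []
    for pool in pools.values():
        mixture.extend(rng.sample(pool, min(per_domain_n, len(pool))))
    rng.shuffle(mixture)
    return mixture

# Hypothetical file names; the three domains mirror the paper's setup.
pools = {
    "math": load_jsonl("gsm8k_sft.jsonl"),
    "code": load_jsonl("code_alpaca.jsonl"),
    "general": load_jsonl("sharegpt.jsonl"),
}
low_resource_mix = mix_sft_data(pools, per_domain_n=1_000)
```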
Formal Data Scaling Patterns
Let $D = \bigcup_i D_i$ be the total SFT data, where $D_i$ is the dataset of domain $i$ and $n_i = |D_i|$ its sample count. Performance for math and code abilities strictly increases with more data,

$$\frac{\partial P_{\text{math}}}{\partial n_{\text{math}}} > 0, \qquad \frac{\partial P_{\text{code}}}{\partial n_{\text{code}}} > 0,$$

but for general alignment it saturates beyond a small threshold:

$$P_{\text{gen}}(n_{\text{gen}}) \approx P_{\text{gen}}(n_0) \quad \text{for } n_{\text{gen}} \ge n_0 \approx 10^3.$$
This relationship underscores the non-uniform scaling laws for model abilities.
2. Scaling Laws and Model Size Effects
Distinct abilities scale with SFT data differently:
- Mathematical reasoning benefits monotonically from more data at all scales—more math data always yields improvement.
- Code generation shows irregular scaling in small models and approximately log-linear uptick in larger models with more code data.
- General alignment can be induced with as little as ~1,000 high-quality samples, after which additional data has little effect.
- Model size modulates these trends: Larger LLMs amplify the benefits of data mixing at small scales, but, paradoxically, at very small data sizes, smaller models may outperform due to robustness and reduced overfitting.
Implication: Each ability’s emergence and improvement requires domain-specific data scaling. There is no universal optimal data split; rather, data allocation must align with the scaling properties of each ability domain.
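As a toy illustration of these scaling shapes, the snippet below encodes a log-linear gain for math/code and a saturating curve for general alignment, then allocates a fixed data budget accordingly. The constants are assumptions chosen for illustration, not values fitted from the paper.

```python
import math

def predicted_gain(domain: str, n: int) -> float:
    """Toy scaling curves (illustrative constants, not fitted values)."""
    if domain in ("math", "code"):
        # Roughly log-linear: every 10x more data adds a fixed increment.
        return 0.05 * math.log10(max(n, 1))
    if domain == "general":
        # Saturates once ~1k high-quality samples are seen.
        return 0.30 * min(n, 1_000) / 1_000
    raise ValueError(f"unknown domain: {domain}")

# Budgeting consequence: past ~1k, extra general data buys almost nothing,
# so the remaining budget is better spent on math and code.
budget = 100_000
allocation = {"general": 1_000}
remaining = budget - allocation["general"]
allocation["math"] = allocation["code"] = remaining // 2
```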
3. Dual-Stage Mixed Fine-Tuning (DMT): Mitigating Catastrophic Forgetting
Sequential SFT (training on one ability, then the next) catastrophically forgets earlier abilities. Multi-task SFT (simple mixture) can induce inter-domain interference, especially at higher data volumes. To address this, the paper proposes the Dual-Stage Mixed Fine-Tuning (DMT) strategy:
- Stage 1: SFT is performed on all specialized data (math and code) jointly, maximizing those abilities.
- Stage 2: SFT is run on general alignment data, mixed with a small fraction $k$ of the specialized data (e.g., $k = 1/256$), preserving prior abilities and reducing forgetting.
Diagrammatically:

[Stage 1: SFT on Math + Code] → [Stage 2: SFT on General + small subset of Math + Code]
Empirical results (see Table 1, DMT column) show that DMT recovers math/code ability otherwise lost after sequential or naive multi-task SFT, while maintaining general alignment ability. This approach exploits the rehearsal mechanism from continual learning, integrating a minimal amount of prior-domain data to inhibit parameter drift.
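A minimal sketch of the DMT schedule follows, assuming a generic `sft` training pass (a stand-in for whatever trainer is used) and a rehearsal fraction `k`; the function names are hypothetical, but the two-stage structure mirrors the recipe above.

```python
import random

def sft(model, examples):
    """Placeholder for one supervised fine-tuning pass over `examples`."""
    # ... run the actual trainer here and return the updated model ...
    return model

def dmt_finetune(model, math_data, code_data, general_data, k=1 / 256, seed=0):
    """Dual-Stage Mixed Fine-Tuning, per the two-stage recipe above."""
    rng = random.Random(seed)
    specialized = math_data + code_data

    # Stage 1: SFT on all specialized data to maximize math/code ability.
    model = sft(model, specialized)

    # Stage 2: SFT on general alignment data plus a k-fraction rehearsal
    # sample of the specialized data, inhibiting catastrophic forgetting.
    rehearsal = rng.sample(specialized, max(1, int(k * len(specialized))))
    return sft(model, general_data + rehearsal)
```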
4. Data Amount and Composition Ratio: Quantitative Effects
- Absolute data amount in each domain is the predominant driver of ability improvement, rather than their ratio within the total mix.
- For example, increasing the quantity of math data amidst a fixed general/code ratio directly raises math scores.
- Composition ratio (e.g., code:general at 1:1 or 1:256) is largely irrelevant unless the domains overlap semantically (such as code/general), at which point high overlap can induce modest interference.
- Performance stability: SFT outcomes are robust to a wide range of mixture ratios, except under extreme imbalance or overlap (see Figure 3 and the appendix table).
t-SNE visualizations further reveal that code and general alignment abilities can share representational space, explaining their marginally higher ratio sensitivity compared to math/general compositions.
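The amount-versus-ratio distinction can be made explicit with a small sketch: sweep (a) holds the ratio fixed while scaling absolute counts, which is what moves scores; sweep (b) holds one domain's amount fixed and varies the ratio, which is largely inert absent semantic overlap.

```python
def mixture(code_n: int, general_n: int) -> dict:
    """Describe a two-domain mixture by absolute counts and ratio."""
    return {"code": code_n, "general": general_n,
            "ratio": f"{code_n}:{general_n}"}

# (a) Fixed 1:1 ratio, growing absolute amounts: raises benchmark scores.
scaling_amount = [mixture(n, n) for n in (1_000, 10_000, 100_000)]

# (b) Fixed code amount, ratio swept from 1:1 up to 1:256: largely flat,
#     apart from mild interference where domains overlap semantically.
sweeping_ratio = [mixture(10_000, 10_000 * r) for r in (1, 4, 16, 64, 256)]
```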
5. Recommendations and Practical Guidelines
- Match SFT data allocation to ability-specific scaling: For hard domains like math/code, prioritize sample count; for general abilities, a small curated set (~1k) suffices.
- Use larger models for joint or multi-ability SFT, especially in low-resource regimes, to best leverage cross-domain transfer and minimize interference.
- Employ the DMT strategy to preserve and accumulate multiple skills with minimal catastrophic forgetting.
- Avoid excessive focus on data ratios; overall coverage and scale, tailored by domain, are more influential.
- When abilities overlap (e.g., code and conversational instruction), consider more nuanced mixing or modular architectures.
These findings generalize to other ability domains, such as language understanding and translation, as demonstrated in supplementary experiments.
6. Future Directions and Open Research Problems
The paper indicates several research avenues:
- Extending data composition analysis to additional skills (e.g., creative generation, factual reasoning).
- Investigating parameter-efficient fine-tuning techniques (e.g., adapters, LoRA) within the multi-ability SFT framework.
- Automating selection of the DMT mixing ratio ($k$) and monitoring performance online during SFT to dynamically tune strategy.
- Integrating these SFT strategies at compatible stages with RLHF or preference optimization for composite alignment objectives.
7. Summary Table: SFT Data Strategy Optimization
| Consideration | Empirical Finding | Practical Impact |
| --- | --- | --- |
| Data amount per domain | Strictly determines ability emergence | Prioritize scale for hard skills; a small curated set suffices for general alignment |
| Data ratio (mixing) | Less important than absolute amount | Ratio tuning is not critical in most regimes |
| Model size | Larger models amplify mixing benefits | Prefer large LLMs for composite SFT |
| DMT vs. sequential/mixed SFT | DMT best preserves all abilities | Use DMT in multi-ability SFT scenarios |
| Catastrophic forgetting | Major risk with sequential SFT | Mix in prior-task data to prevent loss |
Supervised Fine-Tuning (SFT) data composition exerts a profound influence on the development and retention of diverse LLM abilities. Domain-specific scaling rules dictate effective data allocation, while the DMT strategy offers a practical solution to learn, and preserve, multiple skills in modern LLMs, particularly at scale. Allocation of data, model size selection, and rehearsal-informed training design together form the cornerstone of effective SFT in broad-domain, multi-ability settings (Dong et al., 2023).