The paper "How Abilities in LLMs are Affected by Supervised Fine-tuning Data Composition" investigates the impact of supervised fine-tuning (SFT) data composition on the emergent abilities of LLMs. LLMs, with extensive pre-training, demonstrate a range of capabilities such as mathematical reasoning, code generation, and instruction following. As the open-source community continues to refine individual capabilities using ad-hoc SFT, proprietary models display a broad versatility across various skills.
The paper aims to understand how multiple abilities develop through SFT, focusing on the interplay of data composition across three areas: mathematical reasoning, code generation, and general human-aligning ability. The authors pose four research questions exploring how model performance correlates with data amount, composition ratio, model size, and SFT strategy.
Key findings from the experiments include:
- Scaling of Capabilities: Different abilities follow distinct scaling curves, and larger models outperform smaller ones given the same volume of fine-tuning data, indicating that model size strongly affects performance.
- Data Amount Influence: Mathematical reasoning and code generation improve steadily as data increases, whereas general abilities plateau after roughly a thousand samples, beyond which additional data yields little gain.
- Data Composition Effects: When data is limited, mixing data from different sources boosts multiple abilities at once; with abundant data, however, the abilities begin to conflict, suggesting an optimal balance must be struck.
- Impact of Composition Amount vs. Ratio: The absolute amount of data available for each ability influences performance far more than the ratio between data types, so ensuring sufficient data per ability is what matters most.
- SFT Strategies and Catastrophic Forgetting: Learning multiple skills sequentially can cause catastrophic forgetting, where previously learned skills degrade. To combat this, the authors propose the Dual-stage Mixed Fine-tuning (DMT) strategy: first fine-tune on the specialized data (code and math), then fine-tune on the general-ability data with a small proportion of the specialized data mixed back in, which accommodates the different scaling patterns while mitigating forgetting (a minimal sketch follows this list).
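To make the two-stage procedure concrete, here is a minimal Python sketch of the DMT data flow. It assumes abstract example lists and a caller-supplied `fine_tune` routine standing in for a real SFT loop; the function names, the `mix_in_specialized` helper, and the proportion `k = 1/32` are illustrative assumptions, not the paper's exact implementation.

```python
import random

def mix_in_specialized(general, specialized, k):
    """Stage-2 mixture: all general-ability data plus a fraction k of each
    specialized dataset, reshuffled into a single training set."""
    mixed = list(general)
    for examples in specialized.values():
        n = max(1, int(len(examples) * k))
        mixed.extend(random.sample(examples, n))
    random.shuffle(mixed)
    return mixed

def dmt(model, general, specialized, fine_tune, k=1 / 32):
    # Stage 1: fine-tune on the specialized sources (e.g. code + math) together.
    stage1 = [ex for examples in specialized.values() for ex in examples]
    random.shuffle(stage1)
    model = fine_tune(model, stage1)
    # Stage 2: fine-tune on general data with a small proportion k of
    # specialized data mixed back in, to mitigate catastrophic forgetting.
    return fine_tune(model, mix_in_specialized(general, specialized, k))

if __name__ == "__main__":
    general = [f"general-{i}" for i in range(1000)]
    specialized = {
        "math": [f"math-{i}" for i in range(2000)],
        "code": [f"code-{i}" for i in range(2000)],
    }

    def fine_tune(model, data):
        # Stand-in for an actual SFT loop (next-token cross-entropy updates).
        print(f"fine-tuning on {len(data)} examples")
        return model

    dmt(model=None, general=general, specialized=specialized, fine_tune=fine_tune)
```

The design point the sketch captures is the trade-off the paper reports: the stage-2 mixture retains just enough specialized data to preserve those skills without letting them crowd out the general-ability data.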
This paper provides valuable insight into optimizing SFT for diverse abilities in LLMs and suggests strategies for balancing competing abilities, particularly when data is limited.