How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition (2310.05492v4)

Published 9 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs with enormous pre-training token and parameter counts develop diverse abilities, including math reasoning, code generation, and instruction following. These abilities are further enhanced by supervised fine-tuning (SFT). While the open-source community has explored ad-hoc SFT for enhancing individual capabilities, proprietary LLMs exhibit versatility across various skills. Therefore, understanding how multiple abilities are facilitated via SFT is paramount. In this study, we focus specifically on the interplay of data composition between mathematical reasoning, code generation, and general human-aligning abilities during SFT. We propose four intriguing research questions to explore the association between model performance and various factors, including data amount, composition ratio, model size, and SFT strategies. Our experiments reveal that distinct capabilities scale differently and that larger models generally show superior performance with the same amount of data. Mathematical reasoning and code generation consistently improve with increasing data amount, whereas general abilities plateau after roughly a thousand samples. Moreover, we observe that data composition appears to enhance various abilities under limited data conditions, yet can lead to performance conflicts when data is plentiful. Our findings also suggest that the amount of composition data influences performance more than the composition ratio. In our analysis of SFT strategies, we find that sequentially learning multiple skills risks catastrophic forgetting. Our proposed Dual-stage Mixed Fine-tuning (DMT) strategy offers a promising solution for learning multiple abilities with different scaling patterns.

The paper "How Abilities in LLMs are Affected by Supervised Fine-tuning Data Composition" investigates the impact of supervised fine-tuning (SFT) data composition on the emergent abilities of LLMs. LLMs, with extensive pre-training, demonstrate a range of capabilities such as mathematical reasoning, code generation, and instruction following. As the open-source community continues to refine individual capabilities using ad-hoc SFT, proprietary models display a broad versatility across various skills.

The paper aims to understand how multiple abilities are facilitated through SFT by focusing on the interplay of data composition in three key areas: mathematical reasoning, code generation, and general human-aligning abilities. The authors propose four research questions that explore how model performance correlates with factors like data amount, composition ratio, model size, and SFT strategies.
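To make "composition ratio" concrete: before SFT, task-specific datasets can be mixed at chosen proportions into a single training set. The sketch below is illustrative only; the function and dataset names are hypothetical and not from the paper:

```python
import random

def compose_sft_data(datasets, ratios, total, seed=0):
    """Build a mixed SFT training set: `ratios` gives the fraction of
    `total` drawn from each source dataset (sampled with replacement)."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    rng = random.Random(seed)
    mixed = []
    for data, r in zip(datasets, ratios):
        mixed.extend(rng.choices(data, k=round(total * r)))
    rng.shuffle(mixed)
    return mixed

# Toy stand-ins for math, code, and general instruction data.
math_data = [{"task": "math", "id": i} for i in range(100)]
code_data = [{"task": "code", "id": i} for i in range(100)]
gen_data  = [{"task": "general", "id": i} for i in range(100)]

# A 40/40/20 composition ratio over 50 total samples.
mix = compose_sft_data([math_data, code_data, gen_data], [0.4, 0.4, 0.2], total=50)
```

Varying `ratios` while holding `total` fixed is the kind of knob the paper's composition-ratio experiments turn.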

Key findings from the experiments include:

  • Scaling of Capabilities: Different abilities scale distinctively. Larger models outperform smaller ones with the same data volume, highlighting that model size significantly affects performance.
  • Data Amount Influence: Mathematical reasoning and code generation capabilities improve consistently with increased data. However, general abilities tend to plateau after approximately a thousand samples, indicating a saturation point for enhancement.
  • Data Composition Effects: Under limited data scenarios, data composition appears to bolster multiple abilities. In contrast, abundant data may lead to conflicting performances among the skills being fine-tuned, suggesting an optimal balance is needed.
  • Impact of Composition Data vs. Ratio: The volume of composition data has a more substantial influence on performance than the specific ratio of different data types, indicating that ensuring enough data for each ability is crucial.
  • SFT Strategies and Catastrophic Forgetting: Sequential learning of multiple skills can cause catastrophic forgetting, where previously learned skills degrade. To combat this, the authors propose the Dual-stage Mixed Fine-tuning (DMT) strategy, which effectively accommodates different scaling patterns and mitigates forgetting by allowing simultaneous learning.
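The DMT idea can be sketched as a two-stage data schedule: first fine-tune on the specialized (math and code) data, then fine-tune on general-ability data mixed with a small fraction k of the specialized data to guard against forgetting. A minimal sketch of building that schedule, where the function name and the k value are illustrative rather than the paper's exact recipe:

```python
import random

def dmt_schedule(specialized, general, k=0.1, seed=0):
    """Return the two training stages of Dual-stage Mixed Fine-tuning:
    stage 1 is the specialized data alone; stage 2 is the general data
    plus a small replayed fraction k of the specialized data."""
    rng = random.Random(seed)
    stage1 = list(specialized)
    replay = rng.sample(specialized, k=max(1, int(len(specialized) * k)))
    stage2 = list(general) + replay
    rng.shuffle(stage2)
    return stage1, stage2

# Toy stand-ins: 200 specialized (math/code) samples, 100 general samples.
math_code = [("math/code", i) for i in range(200)]
general = [("general", i) for i in range(100)]
stage1, stage2 = dmt_schedule(math_code, general, k=0.1)
```

Replaying only a small specialized fraction in stage 2 is what lets the schedule preserve the stage-1 skills without drowning out the general-ability data.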

This paper provides valuable insights into optimizing SFT for enhancing diverse abilities in LLMs and suggests strategies to balance competing demands in data-limited environments.

Authors (10)
  1. Guanting Dong
  2. Hongyi Yuan
  3. Keming Lu
  4. Chengpeng Li
  5. Mingfeng Xue
  6. Dayiheng Liu
  7. Wei Wang
  8. Zheng Yuan
  9. Chang Zhou
  10. Jingren Zhou
Citations (91)