
SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe (2410.05248v1)

Published 7 Oct 2024 in cs.CL, cs.AI, and cs.LG

Abstract: To induce desired behaviors in LLMs for interaction-driven tasks, the instruction-tuning stage typically trains LLMs on instruction-response pairs using the next-token prediction (NTP) loss. Previous work aiming to improve instruction-tuning performance often emphasizes the need for higher-quality supervised fine-tuning (SFT) datasets, which typically involves expensive data filtering with proprietary LLMs or labor-intensive data generation by human annotators. However, these approaches do not fully leverage the datasets' intrinsic properties, resulting in high computational and labor costs, thereby limiting scalability and performance gains. In this paper, we propose SFTMix, a novel recipe that elevates instruction-tuning performance beyond the conventional NTP paradigm, without the need for well-curated datasets. Observing that LLMs exhibit uneven confidence across the semantic representation space, we argue that examples with different confidence levels should play distinct roles during the instruction-tuning process. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels, then applies a Mixup-based regularization to mitigate overfitting on confident examples while propagating supervision signals to improve learning on relatively unconfident ones. This approach enables SFTMix to significantly outperform NTP across a wide range of instruction-following and healthcare domain-specific SFT tasks, demonstrating its adaptability to diverse LLM families and scalability to datasets of any size. Comprehensive ablation studies further verify the robustness of SFTMix's design choices, underscoring its versatility in consistently enhancing performance across different LLMs and datasets in broader natural language processing applications.

Authors (5)
  1. Yuxin Xiao (12 papers)
  2. Shujian Zhang (28 papers)
  3. Wenxuan Zhou (61 papers)
  4. Marzyeh Ghassemi (96 papers)
  5. Sanqiang Zhao (9 papers)

Summary

SFTMix: Enhancing LLM Instruction Tuning Through Mixup Regularization

The paper introduces SFTMix, a method that advances LLM instruction tuning through a Mixup-based approach, improving performance without reliance on well-curated datasets. Conventional instruction tuning trains LLMs with the next-token prediction (NTP) loss on high-quality supervised fine-tuning (SFT) datasets, which often require expensive data filtering and preparation. SFTMix sidesteps these costs by exploiting the datasets' inherent characteristics and the model's training dynamics to improve fine-tuning efficiency and efficacy.

Methodology Overview

The novelty of SFTMix is rooted in the observation that LLM confidence varies across the semantic space during instruction tuning. By identifying data subsets based on confidence levels using perplexity metrics at multiple training checkpoints, SFTMix separates the SFT dataset into confident and relatively unconfident subsets. Mixup, traditionally used for regularization in deep learning, is adapted to this context to generate interpolated data instances from these subsets, acting as a regularization mechanism.
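The confidence split described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the use of mean perplexity across checkpoints, and the even half/half split are assumptions made for clarity.

```python
import math


def sequence_perplexity(token_logprobs):
    """Perplexity of a response from its per-token log-probabilities
    (natural log) under a given model checkpoint; lower = more confident."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


def split_by_confidence(examples, ppl_per_checkpoint):
    """Split SFT examples into confident / relatively unconfident subsets.

    examples: list of SFT examples.
    ppl_per_checkpoint: one perplexity list per training checkpoint,
        each aligned index-by-index with `examples`.
    Averages perplexity over checkpoints (a proxy for training dynamics),
    then sorts: the lower-perplexity half is the confident subset.
    """
    n = len(examples)
    mean_ppl = [
        sum(ppls[i] for ppls in ppl_per_checkpoint) / len(ppl_per_checkpoint)
        for i in range(n)
    ]
    order = sorted(range(n), key=lambda i: mean_ppl[i])
    half = n // 2
    confident = [examples[i] for i in order[:half]]
    unconfident = [examples[i] for i in order[half:]]
    return confident, unconfident
```

Measuring perplexity at multiple checkpoints, rather than once, is what makes the split reflect training dynamics instead of a single snapshot of the model.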

The Mixup-based regularization mitigates overfitting on confident examples and propagates supervisory signals to less confident ones. By integrating this regularization with the NTP loss, SFTMix enhances generalization across a range of tasks, exhibiting robustness across LLM architectures and dataset scales.

Experimental Findings

The empirical evaluation against baseline NTP instruction-tuning underscores the efficacy of SFTMix. Notable performance improvements are recorded across various instruction-following and healthcare domain-specific tasks:

  • Instruction-Following Tasks: SFTMix consistently outperformed NTP, with enhancements in multi-turn conversational contexts as evidenced by results on MT-Bench and AlpacaEval-2. Evaluations show significant gains in single-turn and multi-turn conversation metrics, with observable improvements in diverse task categories such as extraction and coding.
  • Healthcare Domain-Specific Tasks: In specialized domains, SFTMix demonstrated a 1.5% average increase in accuracy over NTP across medical benchmarks like MedQA and PubMedQA, outperforming existing domain-specific models.

Implications and Future Directions

From a theoretical perspective, SFTMix's ability to leverage model-specific training dynamics introduces a promising pathway to reduce reliance on costly dataset curation without sacrificing performance. This technique encourages a rethinking of data utilization strategies in LLM instruction tuning.

Practical implications of SFTMix include enhanced scalability and adaptability to varied tasks, paving the way for cost-effective and efficient deployment of LLMs in both general and domain-specific contexts. The reduced overfitting and improved generalization performance underscore its potential utility in real-world applications.

Future work could explore the integration of SFTMix with parameter-efficient training methods or apply it to larger models and diverse datasets. The potential for scaling SFTMix to pre-training stages or integrating it with emerging AI methodologies could further broaden its applicability and impact on advancing NLP technologies.

In conclusion, SFTMix represents a significant methodological advance in instruction tuning, offering a refined approach to managing and exploiting training data's intrinsic variability. It delivers consistent performance enhancements, establishing its value across the spectrum of NLP applications.
