Preserving Diversity in Supervised Fine-Tuning of Large Language Models (2408.16673v2)

Published 29 Aug 2024 in cs.LG and cs.AI

Abstract: LLMs typically rely on Supervised Fine-Tuning (SFT) to specialize in downstream tasks, with the Cross Entropy (CE) loss being the de facto choice. However, CE maximizes the likelihood of observed data without accounting for alternative possibilities. As such, CE usually leads to reduced diversity in the model's outputs, which hinders further development that requires sampling to explore better responses. To address this limitation, this paper introduces a new game-theoretic formulation for SFT. In this framework, an auxiliary variable is introduced to regulate the learning process. We prove that the proposed game-theoretic approach connects to the problem of reverse KL minimization with entropy regularization. This regularization prevents over-memorization of training data and promotes output diversity. To implement this framework, we develop GEM, a new training algorithm that is as computationally efficient as CE by leveraging some unique properties of LLMs. Empirical studies of pre-trained models from 3B to 70B parameters show that GEM achieves comparable downstream performance to CE while significantly enhancing output diversity. This increased diversity translates to performance gains in test-time compute scaling for chat and code generation tasks. Moreover, we observe that preserving output diversity has the added benefit of mitigating forgetting, as maintaining diverse outputs encourages models to retain pre-trained knowledge throughout the training process.

Summary

  • The paper introduces the GEM method, applying maximum entropy regularization with reverse KL divergence to reduce overfitting during supervised fine-tuning.
  • Experiments show that GEM lowers perplexity, improves instruction-following, and boosts diversity in tasks like creative writing, math reasoning, and code generation.
  • The findings imply that GEM can enhance model robustness and versatility, with potential applications in RLHF pipelines and synthetic data generation.

Entropic Distribution Matching in Supervised Fine-tuning of LLMs: Less Overfitting and Better Diversity

Introduction

LLMs [openai2023gpt4, touvron2023llama, team2024gemma] are prominent tools in various applications, achieving notable success through pre-training, where they develop a robust ability to predict the next token given a preceding text sequence. Despite their extensive pre-training, these models often underperform in specific tasks, requiring additional fine-tuning to enhance their ability to follow instructions and provide satisfactory responses. Supervised Fine-Tuning (SFT) is commonly employed to refine these models, typically using Cross Entropy (CE) loss to maximize the likelihood of labeled data. However, this method frequently leads to overfitting and reduced output diversity, limiting the models' practical applicability in generating diverse and creative outputs.

Contributions and Methodology

The paper introduces a novel distribution matching method named Generative Entropy-regularized Matching (GEM) to address the limitations of CE loss in SFT. GEM applies the maximum entropy principle to promote models that generate flatter, more generalized distributions, thereby mitigating overfitting and fostering better output diversity. The GEM approach is formulated as an optimization problem that minimizes reverse Kullback-Leibler (KL) divergence with an entropy regularization term.
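
In rough notation, with symbols assumed here for illustration rather than taken from the paper ($\pi_\theta$ the model being fine-tuned, $p_{\mathrm{data}}$ the data distribution, $\mathcal{H}$ the entropy, and $\beta > 0$ a regularization weight), the objective described above can be sketched as:

```latex
\min_{\theta}\;
\underbrace{\mathrm{KL}\!\left(\pi_\theta \,\|\, p_{\mathrm{data}}\right)}_{\text{reverse KL to the data}}
\;-\;
\beta\, \underbrace{\mathcal{H}(\pi_\theta)}_{\text{entropy regularizer}}
```

Whereas CE minimizes the forward KL $\mathrm{KL}(p_{\mathrm{data}} \,\|\, \pi_\theta)$, i.e., the negative log-likelihood of the observed responses, the entropy term here explicitly penalizes over-confident, low-entropy output distributions, which is what the paper credits with preventing over-memorization of the training data.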

Key Elements of GEM:

  • Generative Approach to Distribution Matching: Unlike the CE loss, which focuses solely on imitating supervised data, GEM encourages models to learn from both correct responses and their own generated mistakes.
  • Entropy Regularization: This term aims to prevent over-memorization of specific data samples, reducing overfitting and enhancing the diversity of generated outputs; an illustrative loss sketch follows this list.
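
As a concrete illustration only, the snippet below sketches an entropy-regularized token-level SFT loss in PyTorch. It is not the paper's actual GEM update, which is derived from the game-theoretic distribution-matching formulation; it merely shows how an entropy bonus can be attached to a standard fine-tuning loss to discourage over-confident next-token distributions. The function name and the `beta` weight are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def entropy_regularized_sft_loss(logits, labels, beta=0.1, ignore_index=-100):
    """Token-level CE plus an entropy bonus (illustrative, not the exact GEM loss).

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len), with
    positions to be ignored set to `ignore_index`.
    """
    vocab_size = logits.size(-1)
    ce = F.cross_entropy(
        logits.reshape(-1, vocab_size),
        labels.reshape(-1),
        ignore_index=ignore_index,
        reduction="mean",
    )
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    mask = (labels != ignore_index).float()
    mean_entropy = (token_entropy * mask).sum() / mask.sum().clamp_min(1.0)
    # Subtracting the entropy term rewards flatter (higher-entropy) predictions.
    return ce - beta * mean_entropy
```

In a training loop this would simply replace the plain CE loss on each batch of prompt-response pairs.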

Experiments and Results

GEM was evaluated across several benchmarks and metrics to substantiate its efficacy over the traditional CE loss, showing improvements in both general-purpose and specialized settings.

General-Instruction Following

When Llama-3-8B models were fine-tuned on the UltraFeedback dataset, GEM outperformed CE in several respects:

  • Reduced Perplexity: GEM-trained models exhibited lower evaluation perplexity, suggesting less overfitting (a minimal perplexity computation is sketched after this list).
  • Enhanced Instruction-Following Performance: When tested on the IFEval benchmark, GEM outperformed CE, showing better adherence to provided instructions.
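
For reference, evaluation perplexity is the exponential of the mean per-token negative log-likelihood on held-out data, so lower values indicate less overfitting to the training set. The sketch below assumes a Hugging Face-style causal LM interface (a `.logits` field on the model output); the function name and batch format are assumptions.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluation_perplexity(model, batches, ignore_index=-100):
    """Perplexity = exp(mean token-level NLL) over a held-out set.

    Assumes `model(input_ids).logits` has shape (batch, seq_len, vocab_size) and
    each batch dict provides 'input_ids' and 'labels' of shape (batch, seq_len).
    """
    total_nll, total_tokens = 0.0, 0
    for batch in batches:
        logits = model(batch["input_ids"]).logits
        labels = batch["labels"]
        # Shift so that position t predicts token t+1, as in causal LM training.
        nll = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            labels[:, 1:].reshape(-1),
            ignore_index=ignore_index,
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += (labels[:, 1:] != ignore_index).sum().item()
    return math.exp(total_nll / max(total_tokens, 1))
```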

Output Diversity and Creativity

In tasks requiring creative outputs, such as poem and story writing, GEM-trained models achieved significantly higher diversity. This was measured using:

  • N-Gram Diversity
  • Self-BLEU Diversity
  • Sentence-BERT Diversity

These metrics indicate a broader and more varied generation capability, enhancing the models' usefulness in applications where creativity and flexibility are paramount; a minimal sketch of an n-gram diversity computation follows below.
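
As a rough illustration of the first metric, the snippet below computes a distinct-n style n-gram diversity score over a set of sampled outputs. Whitespace tokenization and this particular normalization are assumptions of the sketch, not the paper's evaluation code.

```python
from collections import Counter

def distinct_n(texts, n=2):
    """Distinct-n: fraction of unique n-grams among all n-grams in the samples.

    Higher values indicate less repetitive, more diverse generations.
    """
    ngrams = Counter()
    for text in texts:
        tokens = text.split()  # assumption: simple whitespace tokenization
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Example: diversity of several sampled completions for the same prompt.
samples = ["the cat sat on the mat", "a dog ran in the park", "the cat sat on the mat"]
print(distinct_n(samples, n=2))
```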

Specialized Tasks: Math Reasoning and Code Generation

When models were fine-tuned for domain-specific tasks, GEM maintained its advantages:

  • Math Reasoning: On datasets such as GSM8K and MATH, GEM showed improved performance under Majority Voting (MV) and Best-of-N (BoN) sampling.
  • Code Generation: On benchmarks such as HumanEval and MBPP, GEM achieved higher pass rates across samples, demonstrating its effectiveness in generating correct and varied programming solutions; a standard pass@k estimator and a majority-vote helper are sketched below.
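
For context, Best-of-N and pass-rate metrics reward having at least one correct answer among multiple samples, which is where output diversity pays off under test-time compute scaling. The snippet below shows the unbiased pass@k estimator commonly used with HumanEval-style evaluation and a simple majority-vote helper; these are standard conventions, not necessarily the paper's exact evaluation code.

```python
from collections import Counter
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples, drawn without
    replacement from n generations of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def majority_vote(final_answers):
    """Majority voting over sampled final answers (ties broken arbitrarily)."""
    return Counter(final_answers).most_common(1)[0][0]

# Example: 100 samples per problem, 37 pass the unit tests -> estimated pass@10.
print(pass_at_k(n=100, c=37, k=10))
print(majority_vote(["42", "41", "42", "42", "7"]))
```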

Implications and Future Directions

The introduction of GEM suggests substantial improvements in the fine-tuning of LLMs, with direct implications for both theory and practice. The advancements in reducing overfitting and enhancing diversity can lead to more robust and versatile models, applicable in diverse fields from creative writing to technical problem-solving.

Future work may explore the integration of GEM-trained models into Reinforcement Learning from Human Feedback (RLHF) pipelines, potentially reducing the preference collapse issue and improving alignment with human values. Additionally, GEM's enhanced diversity may prove beneficial in self-distillation practices and synthetic data generation, paving the way for more sophisticated and autonomous AI systems.
