Data Mixing Can Induce Phase Transitions in Knowledge Acquisition (2505.18091v1)

Published 23 May 2025 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.

Summary

  • The paper reveals that mixing datasets during LLM training causes abrupt phase transitions in knowledge acquisition, dependent on model size and data ratio.
  • These transitions involve sudden shifts in the model's ability to acquire knowledge as model size or the ratio of knowledge-dense data cross critical thresholds.
  • Practical strategies like random subsampling and compact knowledge mixing can help mitigate these phase transitions and optimize LLM training.

Insights into Data Mixing and Phase Transitions in Knowledge Acquisition for LLMs

This paper, authored by Xinran Gu, Kaifeng Lyu, Jiazheng Li, and Jingzhao Zhang, examines phase transitions in knowledge acquisition when LLMs are trained on mixed datasets. Unlike settings where smooth scaling laws govern the training process, the paper shows that data mixing can produce abrupt, non-linear transitions that significantly affect how effectively knowledge is acquired.

Overview

LLMs are often trained on a combination of vast corpora scraped from the web and smaller, knowledge-dense datasets. The latter are strategically curated to enhance the model's proficiency in specific domains or tasks. This combination, while intuitively beneficial, raises concerns about the optimal balance between data types. Specifically, the paper investigates how varying the mixing ratio and model size influences knowledge retention and overall performance.

Key Findings

Two primary findings are highlighted: phase transitions with respect to model size and with respect to mixing ratio. Controlled experiments on a synthetic biography dataset mixed with web-scraped data demonstrate that:

  1. Model Size Transition: As model size increases beyond a critical threshold, the model abruptly shifts from memorizing minimal information to capturing most available knowledge within the dataset. For smaller ratios of knowledge-dense data, this threshold is noticeably higher.
  2. Mixing Ratio Transition: Similarly, below a critical mixing ratio, models retain almost no specialized knowledge even after prolonged training; once the ratio crosses this threshold, knowledge acquisition accelerates rapidly.

These transitions are attributed to a capacity allocation phenomenon: a model with bounded capacity must distribute that capacity across datasets to minimize the overall test loss, much like a solver for a knapsack problem, and the optimal allocation can change discontinuously as the model size or mixing ratio varies.
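
To make this intuition concrete, the following toy sketch (not the paper's formalism; the weights, values, and brute-force solver are illustrative choices) treats each block of knowledge as a knapsack item whose weight is the capacity needed to store it and whose value is the loss reduction it yields, scaled by its exposure in the mixture:

```python
from itertools import combinations

def best_subset(items, capacity):
    """Brute-force 0/1 knapsack: return the max-value subset within capacity."""
    best, best_value = (), 0.0
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            weight = sum(w for w, _ in subset)
            value = sum(v for _, v in subset)
            if weight <= capacity and value > best_value:
                best, best_value = subset, value
    return best

# One bulky "web" block and ten small "biography" facts; values scale
# with the (hypothetical) mixing ratio of biography data.
for ratio in (0.05, 0.10, 0.20):
    web = (8, (1 - ratio) * 10)       # web block: weight 8
    bios = [(1, ratio * 10)] * 10     # ten biography facts: weight 1 each
    chosen = best_subset([web] + bios, capacity=10)
    n_bios = sum(1 for w, _ in chosen if w == 1)
    print(f"ratio={ratio:.2f}: web stored={web in chosen}, bios stored={n_bios}")
```

Between ratios 0.10 and 0.20, the optimum jumps from storing the web block plus two facts to storing all ten facts and no web block: a small change in the mixture produces a discontinuous change in what the bounded-capacity model memorizes.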

Theoretical Contributions

The authors employ an information-theoretic framework to formalize the intuition behind these phase transitions. They introduce the notion of marginal value, the incremental reduction in test loss per additional unit of capacity devoted to a dataset, and show that the optimal allocation of capacity can shift discontinuously as the mixing ratio or model size changes. The framework makes these transitions predictable, with the critical mixing ratio following a power-law relationship with model size.
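
Concretely, writing N for the model size and r*(N) for the critical mixing ratio, the relationship takes a power-law form; the exponent and constant below are placeholders rather than values taken from the paper, and the negative sign assumes (plausibly) that larger models tolerate a smaller critical ratio:

```latex
% Hedged form of the scaling relationship; \alpha > 0 and c are
% placeholders, not values reported in the paper.
r^{*}(N) \propto N^{-\alpha}
\quad\Longleftrightarrow\quad
\log r^{*}(N) = -\alpha \log N + c
```

On a log-log plot, the critical mixing ratio would then trace a straight line in model size, making the transition point forecastable from smaller-scale runs.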

Implications

This work has significant implications for the design of training protocols, particularly in showing that data-mixing strategies effective for large models may not transfer to smaller ones. It cautions against applying a uniform mixing recipe across model sizes and argues for tailoring the mixture to the model being trained.

Practical Strategies

To mitigate the challenges posed by these transitions, the authors propose two practical strategies:

  • Random Subsampling: Reducing the size of the knowledge-dense dataset so that each fact is seen more frequently under a fixed token budget, thereby improving memorization (see the sketch after this list).
  • Compact Knowledge Mixing (CKM): Rephrasing information into more compact formats, which elevates the dataset's marginal value by increasing exposure frequency.
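
As a rough illustration of the first strategy (the names and numbers here are hypothetical, not the paper's implementation), subsampling the knowledge-dense set raises the expected exposures per retained fact in inverse proportion to the kept fraction:

```python
import random

def subsample_knowledge(facts, keep_fraction, seed=0):
    """Randomly keep roughly `keep_fraction` of the knowledge-dense examples."""
    rng = random.Random(seed)
    return [f for f in facts if rng.random() < keep_fraction]

# Hypothetical corpus of 10,000 biography facts.
facts = [f"biography_{i}" for i in range(10_000)]
kept = subsample_knowledge(facts, keep_fraction=0.25)

# With the mixing ratio and token budget held fixed, each retained fact is
# seen about 1 / keep_fraction times as often (here, roughly 4x), which can
# lift the dataset's marginal value past the transition threshold.
print(f"{len(kept)} facts kept; ~{1 / 0.25:.0f}x more exposures per fact")
```

The trade-off is explicit: fewer facts are available to learn, but each retained fact is far more likely to cross the memorization threshold.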

Future Directions

This research opens avenues for extending the study of phase transitions to complex reasoning tasks and more heterogeneous datasets, potentially offering nuanced insights for optimizing LLM training frameworks. The findings emphasize the need for empirical validation and adjustments to data-mixing laws, which could lead to more efficient and targeted LLM deployments across various applications.

In conclusion, the paper provides substantial evidence and theoretical underpinnings that challenge prevailing assumptions about scaling laws in the context of data mixing, advocating a more informed and nuanced approach to LLM pre-training. It contributes to the broader discourse on refining model training practices and allocating compute and data resources effectively.
