Understanding LLM Behaviors via Compression: Data Generation, Knowledge Acquisition and Scaling Laws (2504.09597v5)

Published 13 Apr 2025 in cs.AI, cs.IT, cs.LG, and math.IT

Abstract: LLMs have demonstrated remarkable capabilities across numerous tasks, yet principled explanations for their underlying mechanisms and several phenomena, such as scaling laws, hallucinations, and related behaviors, remain elusive. In this work, we revisit the classical relationship between compression and prediction, grounded in Kolmogorov complexity and Shannon information theory, to provide deeper insights into LLM behaviors. By leveraging the Kolmogorov Structure Function and interpreting LLM compression as a two-part coding process, we offer a detailed view of how LLMs acquire and store information across increasing model and data scales -- from pervasive syntactic patterns to progressively rarer knowledge elements. Motivated by this theoretical perspective and natural assumptions inspired by Heap's and Zipf's laws, we introduce a simplified yet representative hierarchical data-generation framework called the Syntax-Knowledge model. Under the Bayesian setting, we show that prediction and compression within this model naturally lead to diverse learning and scaling behaviors of LLMs. In particular, our theoretical analysis offers intuitive and principled explanations for both data and model scaling laws, the dynamics of knowledge acquisition during training and fine-tuning, factual knowledge hallucinations in LLMs. The experimental results validate our theoretical predictions.

Summary

  • The paper demonstrates that LLM training can be viewed as constructing a two-part code, showing how model capacity influences the sequential learning of syntax and factual knowledge.
  • It employs a hierarchical Syntax-Knowledge model that integrates finite-dimensional syntax with a nonparametric knowledge component using Pitman-Yor processes.
  • Empirical results validate that high-frequency patterns are learned early, while limited capacity leads to slower acquisition of rare facts and capacity-induced hallucinations.

This paper, "Understanding LLM Behaviors via Compression: Data Generation, Knowledge Acquisition and Scaling Laws" (2504.09597), offers a theoretical framework to explain several key behaviors of LLMs, including scaling laws, knowledge acquisition dynamics, and factual hallucinations. The core idea is to leverage the classical relationship between prediction and compression, interpreting LLM training through the lens of the Kolmogorov Structure Function (KSF).

Key Ideas and Contributions:

  1. Kolmogorov Structure Function (KSF) Perspective:

    The authors propose viewing LLM training as a process of constructing a two-part code for the training data, as described by KSF. The first part is the LLM itself (the model/compressor), which learns to encode regularities. The second part is the compressed data.

    • Practical Implication: This perspective suggests that LLMs prioritize learning more "compressible" patterns first. Simpler models (smaller $\alpha$ in the KSF) capture pervasive regularities like syntax. As model complexity ($\alpha$) increases, LLMs progressively learn common knowledge and then rarer knowledge elements. This directly mirrors observed model scaling laws where loss decreases with model size.
    • Minimal Sufficient Statistics: The irreducible part of the test loss (entropy) corresponds to the point where the model has captured all learnable structure (minimal sufficient statistics). Further training on this data might lead to memorizing noise.
  2. Syntax-Knowledge Model:

    Motivated by KSF and empirical observations like Heap's and Zipf's laws, the paper introduces a hierarchical data generation model.

    • Parametric Syntax Model: Captures syntactic structures (e.g., grammar rules) and is assumed to be finite-dimensional.
    • Nonparametric Knowledge Model: Represents factual world knowledge using a Pitman-Yor Chinese Restaurant Process (PYCRP). This choice is crucial because PYCRP naturally models:
      • The unbounded growth of knowledge (like Heap's Law).
      • The power-law distribution of knowledge element frequencies (like Zipf's Law), where some facts are very common and many are rare.
    • Generation Process: A sentence is generated by first sampling an abstract knowledge element from the knowledge model, which then influences the choice of a syntax encoder (from the syntax model) to produce the final sentence.
  3. Explaining Scaling Laws: The paper analyzes the Syntax-Knowledge model within a Bayesian sequential prediction framework, where optimal prediction is related to minimizing redundancy (KL divergence between true and model distributions).
    • Data Scaling Law:
      • Theoretical Result: The optimal Bayesian redundancy (per sentence) is shown to scale as $\widetilde{O}\left(\frac{C_{\text{knw}}}{N^{1-\alpha}} + \frac{C_{\text{syn}}}{N}\right)$, where $N$ is the data size, $\alpha$ is the PYP discount parameter, and $C_{\text{knw}}, C_{\text{syn}}$ are constants.
      • Practical Implication: Syntax redundancy ($C_{\text{syn}}/N$) decreases faster than knowledge redundancy ($C_{\text{knw}}/N^{1-\alpha}$). This explains why LLMs learn syntax quickly, while knowledge acquisition is slower and depends on the frequency of knowledge elements.
      • Experimental Validation: Experiments on power-law distributed data confirm this scaling and show that higher-frequency data is learned earlier (Figure 1 in the paper). Uniformly distributed data shows different scaling.

    • Model Scaling Law:
      • Theoretical Result: Focusing on the knowledge model (assuming syntax is learned quickly), the paper formulates redundancy minimization as an optimization problem constrained by model capacity $C$. The optimal redundancy $\mathrm{Red}(C)$ scales as $\Theta(C^{-1/\alpha+1})$. Furthermore, the contribution of the $k$-th knowledge cluster to redundancy is $\Theta(\min\{k^{-1/\alpha}, C^{-1/\alpha}\})$.
      • Practical Implication: This explains how model capacity dictates which knowledge elements are learned. Less frequent knowledge requires larger models. For a fixed capacity, if a knowledge element's frequency is below a certain threshold, the model may "choose" not to learn it, leading to hallucinations even if the fact was seen during training. This aligns with the "simplicity bias" phenomenon.
      • Experimental Validation: Empirical results (Figure 2 and Figure 3) show that smaller models only capture high-frequency knowledge, while larger models progressively acquire less frequent knowledge. Data with more skewed (power-law) distributions show faster loss decay with model size compared to uniform distributions.

  4. Explaining Fine-Tuning:

    The framework provides insights into fine-tuning for instruction-following and knowledge injection.

    • Instruction-Following: If fine-tuning uses knowledge already seen during pretraining but with a new syntax (e.g., Q&A format), the model primarily learns the new syntax quickly, retaining most of the pretrained knowledge. The redundancy for knowledge is small ($O(N^{\alpha-1})$ when the pretraining data size $N$ is large), while the syntax redundancy on the new format is $O(n^{-1})$ for $n$ fine-tuning samples.
    • Knowledge Injection: If new knowledge is injected using a drastically different syntax, the model has to learn both new syntax and new knowledge. This can be less effective and lead to more forgetting of pretrained knowledge, especially if model capacity is limited, due to the "perplexity shift" and competition for resources.
    • Practical Recommendations:
      • For knowledge injection, use formats similar to pretraining data to minimize syntactic overhead, especially with capacity-constrained models. Mix new knowledge with pretraining data.
      • For instruction fine-tuning, use knowledge distributions similar to pretraining to focus learning on the new syntax/style.
  5. Hallucinations: The model scaling law directly offers an explanation for capacity-related hallucinations: if a model's capacity $C$ is insufficient to store all knowledge elements it has been exposed to, it will prioritize more frequent ones. Less frequent knowledge, even if present in the training data, might not be encoded, leading the model to generate plausible but incorrect information (hallucinate) for queries related to these rarer facts (Figure 4). A toy numerical sketch of this capacity argument follows this list.
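
To make the capacity argument in items 3 and 5 concrete, here is a toy numerical sketch in Python. It is an illustration under simplifying assumptions, not the paper's actual optimization: knowledge-cluster frequencies follow a power law p_k proportional to k^(-1/alpha), each stored cluster costs a fixed number of bits (bits_per_cluster is a made-up constant), and a capacity-C model greedily stores the most frequent clusters first. Clusters that do not fit are simply not learned, so queries about them would be hallucinated.

import numpy as np

def knowledge_coverage(num_clusters, alpha, capacity_bits, bits_per_cluster=16):
    """Toy capacity model: store clusters in decreasing-frequency order until capacity runs out.

    Returns the frequency-weighted fraction of knowledge queries the model can answer,
    i.e., 1 minus the hallucination rate on knowledge queries.
    """
    k = np.arange(1, num_clusters + 1)
    freqs = k ** (-1.0 / alpha)              # p_k proportional to k^(-1/alpha)
    freqs /= freqs.sum()
    num_stored = min(num_clusters, capacity_bits // bits_per_cluster)
    return freqs[:num_stored].sum()

for capacity in [256, 1024, 4096, 16384]:
    covered = knowledge_coverage(num_clusters=100_000, alpha=0.5, capacity_bits=capacity)
    print(f"capacity={capacity:6d} bits  answered={covered:.3f}  hallucinated={1 - covered:.3f}")

Because the frequencies are heavy-tailed, coverage rises quickly at small capacities and then flattens: the remaining errors are concentrated on rare facts, which is exactly the capacity-induced hallucination regime described above.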

Implementation Considerations from the Paper:

  • Dataset Generation: The experimental validation relies on synthetically generated datasets where knowledge elements (attributes of individuals) and syntactic structures (sentence templates) can be controlled. This is crucial for testing the theoretical predictions about frequency and model capacity. Generating data with specific power-law distributions (Equation 7) is key to observing the predicted scaling; a generic sampling sketch follows this list.
  • Model Architecture: The experiments use standard GPT-like models with rotary position embeddings (RoPE). The theory itself is relatively model-agnostic but assumes an autoregressive predictive model.
  • Training: Standard LLM training procedures (cross-entropy loss, AdamW, learning rate schedules) are employed.
  • Evaluation Metrics: Perplexity/loss is the primary metric, aligned with the compression perspective. Accuracy on specific knowledge facts is used to analyze knowledge acquisition and hallucination.
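
As a companion to the dataset-generation point above, here is a generic Python sketch of power-law frequency control. It does not reproduce the paper's Equation 7; the Zipf-style weighting, the template strings, the placeholder attribute values, and the person_{i} naming are illustrative stand-ins.

import numpy as np

rng = np.random.default_rng(0)

def powerlaw_weights(num_individuals, exponent):
    # Zipf-like weights: the individual at rank r is sampled with probability ~ r^(-exponent).
    ranks = np.arange(1, num_individuals + 1)
    weights = ranks ** (-float(exponent))
    return weights / weights.sum()

def sample_training_sentences(num_sentences, num_individuals, exponent, templates):
    # Choose which individual each sentence is about (power-law distributed),
    # then fill in a randomly chosen sentence template.
    weights = powerlaw_weights(num_individuals, exponent)
    subjects = rng.choice(num_individuals, size=num_sentences, p=weights)
    return [templates[rng.integers(len(templates))].format(name=f"person_{s}")
            for s in subjects]

templates = ["{name} was born in 1990.",
             "{name} studied mathematics.",
             "{name} works in Berlin."]
print(sample_training_sentences(5, num_individuals=1000, exponent=2.0, templates=templates))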

Connecting Theory to Practice:

The paper translates abstract concepts like Kolmogorov complexity and PYCRP into concrete explanations for LLM behaviors:

  • KSF's $\alpha$ (model complexity) → LLM parameter count: Larger models can represent more complex patterns.
  • KSF's two-part code → LLM parameters + compressed data representation: Training minimizes the total "description length."
  • PYCRP's power law → Realistic data distributions: Explains why some knowledge is learned before others.
  • Redundancy minimization → Loss minimization: The training objective (spelled out below).
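
The last mapping can be spelled out. With $P_{\text{true}}$ the data distribution and $P_{\text{model}}$ the model's predictive distribution over a corpus $X_{1:N}$, the redundancy decomposes as (notation schematic, following this summary rather than the paper's exact symbols):

$$D_{\mathrm{KL}}\left(P_{\text{true}}^{N} \,\middle\|\, P_{\text{model}}\right) = \mathbb{E}_{P_{\text{true}}^{N}}\left[-\log P_{\text{model}}(X_{1:N})\right] - H\left(P_{\text{true}}^{N}\right)$$

that is, the expected code length under the model (the cross-entropy training loss, up to units) minus the irreducible entropy of the data. Since the entropy term does not depend on the model, minimizing redundancy and minimizing the usual training loss select the same model.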

Conceptual Sketch of Data Generation (Syntax-Knowledge Model):

The pseudocode in this summary is shown below as a minimal runnable Python sketch. The concrete choices (a truncated stick-breaking construction for the PYCRP weights, Dirichlet-distributed syntax and knowledge parameters, and a modulo mapping from knowledge item to syntax type) are illustrative assumptions, not the paper's exact construction.

import numpy as np

rng = np.random.default_rng(0)

def sample_pycrp_weights(discount, concentration, num_clusters):
    # Truncated stick-breaking approximation of Pitman-Yor cluster weights p_1, p_2, ...
    k = np.arange(1, num_clusters + 1)
    sticks = rng.beta(1.0 - discount, concentration + discount * k)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
    weights = sticks * remaining
    return weights / weights.sum()

def generate_sentence(discount=0.5, concentration=1.0,
                      num_clusters=100, num_syntax_types=5, vocab_size=50):
    # 1. Initialize models
    # Knowledge model: PYCRP weights p_i plus a per-cluster parameter phi_i
    # drawn from a base measure H (here: a random categorical over "facts").
    cluster_weights = sample_pycrp_weights(discount, concentration, num_clusters)
    cluster_phis = rng.dirichlet(np.ones(vocab_size), size=num_clusters)

    # Syntax model: one parameter vector theta_j per syntax type / template.
    syntax_thetas = rng.dirichlet(np.ones(vocab_size), size=num_syntax_types)

    # 2. Generate a knowledge element
    # First, pick a knowledge cluster according to the PYCRP weights p_i,
    # then sample an abstract knowledge item from that cluster's distribution phi_i.
    cluster_idx = rng.choice(num_clusters, p=cluster_weights)
    knowledge_item = int(rng.choice(vocab_size, p=cluster_phis[cluster_idx]))

    # 3. Determine which syntax encoder to use based on the knowledge item
    # (Simplified: the knowledge item deterministically maps to a syntax type.)
    syntax_idx = knowledge_item % num_syntax_types
    theta = syntax_thetas[syntax_idx]

    # 4. Generate sentence tokens using the syntax encoder and the knowledge item
    # (Toy encoder: emit the knowledge item followed by tokens drawn from theta.)
    sentence_tokens = [knowledge_item] + rng.choice(vocab_size, size=8, p=theta).tolist()
    return sentence_tokens

print(generate_sentence())

Limitations and Future Work from the Paper:

  • The model focuses on factual knowledge; extending it to other knowledge types (e.g., procedural, commonsense) is a future direction.
  • Integrating compositional reasoning and inference mechanisms.
  • Exploring how LLMs might approximate universal predictors (like Solomonoff's) under practical constraints.

In summary, the paper provides a principled, compression-based theoretical framework using the Syntax-Knowledge model to explain data and model scaling laws, the order of knowledge acquisition (syntax, then frequent knowledge, then rarer knowledge), and capacity-driven hallucinations in LLMs. Its experimental results on synthetic data support the theoretical predictions, offering valuable insights for understanding and potentially improving LLM training and fine-tuning.

Understanding LLM Behaviors via Compression: A Detailed Summary

The paper "Understanding LLM Behaviors via Compression: Data Generation, Knowledge Acquisition and Scaling Laws" (2504.09597) proposes a novel theoretical framework to explain several emergent behaviors of LLMs, including scaling laws, the dynamics of knowledge acquisition, and factual hallucinations. The central thesis is that LLM behaviors can be understood by viewing them as compressors, drawing on concepts from Kolmogorov complexity and Shannon information theory.

Core Idea: LLMs as Compressors via Kolmogorov Structure Function

The authors revisit the fundamental link between prediction and compression. Optimal prediction of a data sequence is intrinsically tied to its most efficient compression. LLM training is framed as constructing a two-part code for the training data, inspired by the Kolmogorov Structure Function (KSF).

  1. The Model Part: The LLM itself, with its parameters, learns to represent and compress regularities in the data.
  2. The Data Part: The residual information not captured by the model, encoded using the LLM as a compressor.

The KSF, $h_X(\alpha) = \min_{M} \{-\log P_M(X) : K(P_M) \le \alpha\}$, describes the minimum code length (or cross-entropy loss) for data $X$ achievable by a model $M$ with complexity (e.g., description length of parameters) at most $\alpha$.

  • Practical Implication: This function directly mirrors model scaling laws in LLMs, where loss decreases as model size (parameters) increases. Initially, small $\alpha$ (simple models) capture the most pervasive structures like syntax because they offer the largest compression gains. As $\alpha$ (model capacity) grows, LLMs learn common factual knowledge, then progressively rarer knowledge elements. Information exceeding capacity or inherently random is left in the "data part" of the code.
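
To make the two-part code view tangible, the following Python sketch scores a toy corpus under two hypothetical character-level "models" of different complexity: a uniform model (negligible model cost, poor compression) and a unigram frequency model (pays a per-parameter cost, compresses better). The 16-bits-per-parameter proxy for $K(P_M)$ and the toy corpus are assumptions for illustration only.

import math
from collections import Counter

def data_bits(text, probs):
    # Code length of the "data part": -sum of log2 P(char) under the model.
    return sum(-math.log2(probs[ch]) for ch in text)

def two_part_code_length(text, probs, bits_per_param):
    # Two-part code: (rough) model description length + compressed-data length.
    model_bits = len(probs) * bits_per_param
    return model_bits + data_bits(text, probs)

text = "the cat sat on the mat and the cat sat again " * 50
alphabet = sorted(set(text))

uniform = {ch: 1.0 / len(alphabet) for ch in alphabet}          # simple model
counts = Counter(text)
unigram = {ch: counts[ch] / len(text) for ch in alphabet}       # more complex model

for name, probs, bits_per_param in [("uniform", uniform, 0), ("unigram", unigram, 16)]:
    print(name, round(two_part_code_length(text, probs, bits_per_param), 1))

On such repetitive text the richer model wins overall despite its larger description length; for data with little exploitable structure, the extra model bits would not pay for themselves, which is the tradeoff the KSF curve captures.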

The Syntax-Knowledge Hierarchical Data Generation Model

Motivated by this compression perspective and empirical linguistic laws (Heap's Law for vocabulary growth, Zipf's Law for word frequency), the paper introduces a simplified data generation model:

  1. Parametric Syntax Model ($P_{\Theta_{\text{syn}}}$): Captures the syntactic structures of language (e.g., grammar, sentence templates). This component is assumed to be learnable with a finite set of parameters.
  2. Nonparametric Knowledge Model ($P_{\Theta_{\text{knw}}}$): Encodes factual world knowledge. This is modeled using a Pitman-Yor Chinese Restaurant Process (PYCRP).
    • Why PYCRP? It naturally accommodates the unbounded nature of human knowledge (new facts are always emerging, similar to Heap's Law) and the power-law distribution of factual occurrences (some facts are very common, many are rare, like Zipf's Law).
    • A sample from the PYCRP is a discrete distribution over infinitely many "knowledge clusters" (atoms $\phi_i$) with weights $p_i$.

Hierarchical Generation Process:

A sentence $X$ is generated by:

  1. Sampling latent parameters for the syntax and knowledge models.
  2. Sampling an abstract "knowledge element" $\omega$ from a knowledge cluster $\phi_i$ (selected with probability $p_i$).
  3. The knowledge element $\omega$ determines which syntax encoder/template $\theta_{\text{syn}}^{(j)}$ to use.
  4. The syntax encoder generates the sentence $X$ based on $\omega$ and $\theta_{\text{syn}}^{(j)}$.

(See Figure 3b in the paper for a visual)
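
Schematically, conditioning on the sampled parameters, the steps above correspond to the following factorization (a sketch using this summary's notation; the paper's exact formulation may differ):

$$P(X \mid \Theta) = \sum_{i} p_i \, \mathbb{E}_{\omega \sim P_{\phi_i}}\left[ P_{\theta_{\text{syn}}^{(j(\omega))}}(X \mid \omega) \right]$$

where $p_i$ are the PYCRP cluster weights, $P_{\phi_i}$ is the knowledge distribution of cluster $i$, and $j(\omega)$ selects the syntax encoder used for knowledge element $\omega$.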

Explaining Scaling Laws

The paper analyzes this model under a Bayesian sequential prediction framework, where the goal is to minimize redundancy, defined as $D_{\mathrm{KL}}(P_{\text{true}}^N \| P_{\text{model}})$. This optimal redundancy is equivalent to the mutual information $\mathbb{I}(X_{1:N}; \Theta)$ between the data and the true model parameters.

  1. Data Scaling Law (Theorem 4.2): The average optimal Bayesian redundancy (per sentence) for the Syntax-Knowledge model scales as:

    $$\frac{1}{N}\mathbb{I}(X_{1:N}; \Theta) = \widetilde{O}\left(\frac{C_{\text{knw}}}{N^{1-\alpha_{\text{PYP}}}} + \frac{C_{\text{syn}}}{N}\right)$$

    where $N$ is the training data size, $\alpha_{\text{PYP}}$ is the discount parameter of the Pitman-Yor process (related to the power-law exponent of knowledge frequency), and $C_{\text{knw}}, C_{\text{syn}}$ are constants.

    • Practical Implications:
      • The syntax component's redundancy ($C_{\text{syn}}/N$) decreases faster than the knowledge component's ($C_{\text{knw}}/N^{1-\alpha_{\text{PYP}}}$).
      • This explains why LLMs learn syntactic structures relatively quickly and with less data, while acquiring factual knowledge is a slower, more data-intensive process, with rarer facts learned later.
    • Experimental Validation: Figure 1 in the paper shows that validation loss follows a power law in data size when the data is power-law distributed. High-frequency data is learned earlier.

  2. Model Scaling Law (Theorem 4.5): Focusing on the knowledge model (assuming syntax is learned faster) and a power-law distribution of knowledge frequencies $p_k \propto k^{-1/\alpha_{\text{PYP}}}$, the paper analyzes the minimal redundancy $\mathrm{Red}(C)$ achievable with a model capacity constraint $C$.

    $$\mathrm{Red}(C) = \Theta\left(C^{-1/\alpha_{\text{PYP}}+1}\right)$$

    The contribution of the $k$-th knowledge cluster to this redundancy is:

    $$p_k D_k(m_k^*) = \Theta\left(\min\left\{k^{-1/\alpha_{\text{PYP}}}, C^{-1/\alpha_{\text{PYP}}}\right\}\right)$$

    where $m_k^*$ is the optimal memory allocated to the $k$-th cluster.

    • Practical Implications:
      • Model performance (lower redundancy/loss) improves with capacity $C$ following a power law.
      • Knowledge Acquisition Order: Models first learn high-frequency knowledge. Only with sufficient capacity can they learn less frequent knowledge.
      • Hallucinations: If a knowledge item's frequency is too low relative to the model's capacity (i.e., $k^{-1/\alpha_{\text{PYP}}}$ is small, or $C$ is small), the optimal strategy for the model may be to allocate zero bits ($m_k^* = 0$) to it, effectively not learning it. This leads to hallucinations even if the fact was seen multiple times during training. (See Figure 2 and Figure 4.)
    • Experimental Validation: Figure 2 and Figure 3 show validation loss decomposed by frequency class as model size increases, confirming that frequent knowledge is learned by smaller models, while rarer knowledge requires larger models.
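
As a back-of-the-envelope check (a sketch, not the paper's proof), the stated rate for $\mathrm{Red}(C)$ follows from summing the per-cluster contributions above, which switch from $C^{-1/\alpha_{\text{PYP}}}$ to $k^{-1/\alpha_{\text{PYP}}}$ around $k \approx C$:

$$\mathrm{Red}(C) \approx \sum_{k \lesssim C} C^{-1/\alpha_{\text{PYP}}} + \sum_{k \gtrsim C} k^{-1/\alpha_{\text{PYP}}} = \Theta\left(C^{1 - 1/\alpha_{\text{PYP}}}\right)$$

since the first sum contributes $C \cdot C^{-1/\alpha_{\text{PYP}}}$ and, for $\alpha_{\text{PYP}} \in (0, 1)$, the tail sum is of the same order.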

Explaining Fine-Tuning Dynamics (Section 5 & Appendix C)

The framework offers insights into fine-tuning:

  • Instruction Following (New Syntax, Old Knowledge): If fine-tuning aims to teach a new format (e.g., Q&A) for knowledge already learned during pretraining, the model primarily adapts to the new syntax. The redundancy related to syntax ($\mathrm{Red}_{\text{syn}}$) decreases rapidly ($O(n^{-1})$ with $n$ fine-tuning samples). Pretrained knowledge is largely retained because its redundancy term ($O(N^{\alpha_{\text{PYP}}-1})$ from the large pretraining data size $N$) is already small. A rough numeric illustration follows this list.
  • Knowledge Injection (New Syntax, New Knowledge): If fine-tuning involves new facts in a new format, the model must learn both. If the new syntax is very different and capacity is limited, learning the new syntax might "overwrite" or cause forgetting of pretrained knowledge due to competition for model parameters. The fine-tuning loss reflects learning both new syntax and new knowledge, which can be substantial if $n$ is small.
    • Practical Recommendations:
      • For knowledge injection, use syntax similar to pretraining or mix pretraining data to mitigate forgetting, especially with capacity-constrained models.
      • For instruction fine-tuning, use knowledge distributions similar to pretraining to focus learning on new syntactic styles.
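
A rough numeric illustration of the instruction-following case, using only the rates quoted above and ignoring constants (so these are orders of magnitude, not results from the paper): with $\alpha_{\text{PYP}} = 0.5$ and $N = 10^6$ pretraining sentences, the retained-knowledge redundancy term is of order

$$N^{\alpha_{\text{PYP}} - 1} = 10^{-3} \text{ per sentence,}$$

while the new-format syntax term $O(n^{-1})$ reaches the same order after only about $n \approx 10^3$ fine-tuning samples. In this regime instruction fine-tuning buys the new syntax cheaply while leaving the pretrained knowledge term essentially unchanged.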

Causes of Hallucination

The paper primarily focuses on hallucinations arising from limited model capacity. Even if a fact is present in the training data, if it's infrequent and the model lacks the capacity to store it after prioritizing more frequent knowledge, it may hallucinate. Other causes mentioned include biased/outdated data, instruction fine-tuning on unfamiliar data, and knowledge shadowing.

Experimental Validation

  • Datasets: Synthetically generated datasets based on individual profiles with attributes (e.g., birth date, university) and sentence templates, allowing control over knowledge frequency (power-law or uniform distributions).
  • Models: GPT-like models with rotary position embeddings (RoPE), of varying sizes.
  • Key Findings (beyond scaling laws):
    • Data Heterogeneity (Appendix D.2, Figure 6): Models learn different properties (e.g., major vs. employer city) at different rates, partly due to entropy of properties. Uniform data shows sharper phase transitions in learning properties compared to gradual learning with power-law data.
    • Fine-Tuning (Appendix D.2, Figure 7, Table 3): Continued Pretraining (CPT) on new knowledge (mixed with old data) leads to better retention of old knowledge compared to Supervised Fine-Tuning (SFT) on a new format, especially when model capacity is saturated. SFT's format difference can cause initial loss spikes and more forgetting.

Connections to Broader Concepts

  • Prediction and Compression: Core theme, linking LLMs to fundamental information theory.
  • Heap's and Zipf's Laws: Motivate the power-law assumptions in the data model.
  • Bayesian Inference & Universal Coding: The Bayesian framework connects to optimal prediction and redundancy.
  • Simplicity Bias: The model's tendency to learn simpler/more frequent patterns first.

Conclusion

The paper provides a theoretically grounded framework, the Syntax-Knowledge model, built upon the principle of compression. This model successfully offers qualitative and quantitative explanations for LLM scaling laws, the phased acquisition of syntax and knowledge, the impact of data/knowledge frequency on learning, and capacity-related hallucinations. The work highlights how LLMs, as sophisticated compressors, prioritize learning common syntactic patterns and then progressively incorporate factual knowledge based on frequency, constrained by their capacity.
