Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

Published 20 Jul 2024 in cs.CL, cs.AI, and cs.LG | (2407.14985v5)

Abstract: The impressive capabilities of LLMs have sparked debate over whether these models genuinely generalize to unseen tasks or predominantly rely on memorizing vast amounts of pretraining data. To explore this issue, we introduce an extended concept of memorization, distributional memorization, which measures the correlation between the LLM output probabilities and the pretraining data frequency. To effectively capture task-specific pretraining data frequency, we propose a novel task-gram language model, which is built by counting the co-occurrence of semantically related $n$-gram pairs from task inputs and outputs in the pretraining corpus. Using the Pythia models trained on the Pile dataset, we evaluate four distinct tasks: machine translation, factual question answering, world knowledge understanding, and math reasoning. Our findings reveal varying levels of memorization, with the strongest effect observed in factual question answering. Furthermore, while model performance improves across all tasks as LLM size increases, only factual question answering shows an increase in memorization, whereas machine translation and reasoning tasks exhibit greater generalization, producing more novel outputs. This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks, providing a scalable method for analyzing large pretraining corpora in greater depth.

Citations (11)

Summary

  • The paper demonstrates, via n-gram analysis and gradient cosine similarity, that pretraining data frequency correlates with task performance.
  • Experiments on Pythia and OLMo models reveal a performance boost once model size passes a threshold of roughly 400 million parameters.
  • Larger models generate more novel n-gram pairs, suggesting stronger generalization compared to smaller, more memorization-prone models.

Generalization vs. Memorization: Tracing LLMs' Capabilities Back to Pretraining Data

Introduction

This paper investigates the balance between generalization and memorization in LLMs during pretraining. By applying an n-gram analysis across different model sizes and pretraining corpora, the authors examine how pretraining data contributes to LLM capabilities on tasks such as translation, question answering, and multiple-choice reasoning (Figure 1).

Figure 1: An overview of our proposed analysis pipeline for tracing LLM capabilities back to pretraining data.

Experimental Methodology

The paper examines LLMs, specifically Pythia and OLMo, alongside their pretraining corpora, the Pile and Dolma. The authors propose a method to estimate task-specific pretraining data frequency through task-relevant n-gram pairs mined from task inputs and outputs.
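
The paper's mining procedure additionally filters for semantically related n-gram pairs; setting that filtering aside, a minimal sketch of the core counting step, with hypothetical helper names and naive whitespace tokenization, could look like this:

    from collections import Counter
    from itertools import islice

    def ngrams(tokens, n):
        """Yield all n-grams (as tuples) from a token list."""
        return zip(*(islice(tokens, i, None) for i in range(n)))

    def count_task_gram_pairs(task_pairs, corpus_docs, n=3):
        """Count how often each (input n-gram, output n-gram) pair
        co-occurs within the same pretraining document.

        task_pairs:  list of (input_text, output_text) task examples
        corpus_docs: iterable of pretraining documents (strings)
        """
        # Collect candidate n-gram pairs from task inputs and outputs.
        candidate_pairs = set()
        for src, tgt in task_pairs:
            src_grams = set(ngrams(src.split(), n))
            tgt_grams = set(ngrams(tgt.split(), n))
            candidate_pairs.update((s, t) for s in src_grams for t in tgt_grams)

        # Count co-occurrences of each candidate pair within single documents.
        counts = Counter()
        for doc in corpus_docs:
            doc_grams = set(ngrams(doc.split(), n))
            for pair in candidate_pairs:
                if pair[0] in doc_grams and pair[1] in doc_grams:
                    counts[pair] += 1
        return counts

At the scale of the Pile or Dolma, the corpus-side lookup would rely on a pre-built n-gram index rather than a full document scan, but the counting logic is the same.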

To understand pretraining's role, the authors devised a gradient-based analysis: they compute the similarity between gradients of the loss on pretraining data and on task examples, using cosine similarity as the metric (Figure 2).

Figure 2: Cosine similarity between n-gram task gradient and pretraining gradient for different tasks and model sizes.
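
A minimal sketch of this comparison, assuming a HuggingFace-style causal LM that returns a loss when given labels (the paper's exact implementation may differ), is shown below; flat_grad and gradient_cosine are hypothetical helper names.

    import torch
    import torch.nn.functional as F

    def flat_grad(model, input_ids):
        """Flattened gradient of the causal-LM loss for one tokenized example."""
        model.zero_grad()
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        grads = [p.grad.detach().flatten()
                 for p in model.parameters() if p.grad is not None]
        return torch.cat(grads)

    def gradient_cosine(model, pretrain_ids, task_ids):
        """Cosine similarity between a pretraining-example gradient
        and a task-example gradient."""
        g_pretrain = flat_grad(model, pretrain_ids)
        g_task = flat_grad(model, task_ids)
        return F.cosine_similarity(g_pretrain, g_task, dim=0).item()

A high similarity indicates that learning the pretraining example and solving the task example push the model parameters in similar directions.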

A key finding is the correlation between the frequency of task-related n-gram pairs in the pretraining corpus and task performance. There is a notable model-size threshold, around 400 million parameters, after which performance improves significantly (Figure 3).

Figure 3: BLEU and exact match scores vs. total n-gram pair count in the pretraining corpus and model parameters.
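
As an illustration of the kind of correlation being reported (the numbers below are toy values, not the paper's data), one can rank-correlate per-example task-gram counts with per-example task scores:

    import numpy as np
    from scipy.stats import spearmanr

    # Toy per-example statistics: how often each test example's task-gram
    # pairs occur in the pretraining corpus, and the model's score on it.
    pair_counts = np.array([12, 0, 87, 5, 240, 33, 1, 19])
    scores = np.array([0.4, 0.1, 0.8, 0.2, 0.9, 0.5, 0.0, 0.3])

    rho, p_value = spearmanr(pair_counts, scores)
    print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")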

The authors demonstrate that LLMs require both sufficient model size and relevant pretraining data to exhibit emergent abilities.

Memorization vs. Generalization

The study extends the definition of memorization beyond exact recall to distributional memorization: the degree to which the LM's output distribution tracks the pretraining data distribution. A linear regression from the pretraining data distribution to the LM distribution quantifies this relationship, with the R^2 score measuring the strength of the fit.
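
As a hedged sketch of such a regression (toy arrays, not the paper's measurements), one could fit pretraining log-frequencies to LM log-probabilities and report the resulting R^2:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    # Toy aligned arrays: log frequency of each task-gram pair in the
    # pretraining corpus, and the LM's log-probability of producing it.
    log_counts = np.log1p(np.array([3, 0, 150, 12, 980, 45, 7, 2])).reshape(-1, 1)
    lm_logprobs = np.array([-6.1, -9.3, -2.8, -5.0, -1.9, -4.2, -7.5, -8.8])

    reg = LinearRegression().fit(log_counts, lm_logprobs)
    r2 = r2_score(lm_logprobs, reg.predict(log_counts))
    print(f"R^2 from data distribution to LM distribution: {r2:.3f}")

A high R^2 means the LM's output probabilities are largely predictable from pretraining frequencies, i.e. stronger distributional memorization.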

The results illustrate that smaller models closely mirror the pretraining distribution, indicating higher memorization, while larger models show increased generalization (Figure 4).

Figure 4: R^2 score of linear regression from data distribution to LM distribution across different model sizes.

Data Distribution and LLM Novelty

The authors observe that larger models generate more novel n-gram pairs, indicating a shift from memorization toward generalization. Instruction tuning also enhances the utilization of pretraining data, resulting in improved performance and stronger alignment with pretraining distributions (Figure 5).

Figure 5: Number of unique n-gram pairs generated for different n values and model sizes.
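
Reusing the hypothetical count_task_gram_pairs helper from the earlier sketch, novelty can be estimated by checking which generated n-gram pairs never co-occur in the pretraining corpus:

    def count_novel_pairs(generated_pairs, corpus_pair_counts):
        """Count generated (input n-gram, output n-gram) pairs that never
        co-occur in any pretraining document.

        generated_pairs:    set of n-gram pairs mined from model outputs
        corpus_pair_counts: Counter returned by count_task_gram_pairs
        """
        return sum(1 for pair in generated_pairs if corpus_pair_counts[pair] == 0)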

Conclusion

This paper provides insights into LLMs' capabilities via n-gram pair analysis, highlighting the interplay between pretraining data, model size, and task performance. It emphasizes the roles of both memorization and generalization: memorization dominates in simpler, knowledge-intensive tasks, while generalization drives harder, reasoning-based tasks. Future research could refine methods for filtering and analyzing task-relevant n-gram pairs and leverage larger, more recent corpora for a deeper understanding.
