- Using n-gram analysis and gradient cosine similarity, the paper demonstrates that the frequency of task-relevant n-gram pairs in the pretraining corpus correlates with task performance.
- Experiments on models such as Pythia and OLMo reveal a marked performance boost once models exceed a threshold of roughly 400 million parameters.
- Results indicate larger models generate novel n-gram pairs, suggesting enhanced generalization compared to smaller, more memorization-prone models.
Generalization vs. Memorization: Tracing LLMs' Capabilities Back to Pretraining Data
Introduction
This paper investigates the intricate balance between generalization and memorization in LLMs during pretraining. By employing an n-gram analysis across different model sizes and pretraining data corpora, the authors aim to uncover how pretraining data contributes to LLM capabilities, examining tasks such as translation, question-answering, and multiple-choice reasoning.
Figure 1: An overview of our proposed analysis pipeline for tracing LLM capabilities back to pretraining data.
Experimental Methodology
The paper examines the LLM families Pythia and OLMo together with their pretraining corpora, the Pile and Dolma. The authors propose estimating the task-relevant data distribution by mining n-gram pairs from task examples and measuring how often those pairs appear in the pretraining corpus.
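The summary does not spell out the mining procedure, so the following is a minimal sketch of the idea, assuming whitespace tokenization and document-level co-occurrence counting; the function names and the cross-product pairing heuristic are illustrative, not the authors' exact method.

```python
from collections import Counter
from itertools import product

def ngrams(text, n):
    """Set of n-grams (tuples of tokens) in a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def mine_ngram_pairs(task_examples, n=1):
    """Candidate (input n-gram, output n-gram) pairs from task examples,
    e.g. (source sentence, reference translation) pairs for a translation task."""
    pairs = set()
    for src, tgt in task_examples:
        pairs.update(product(ngrams(src, n), ngrams(tgt, n)))
    return pairs

def count_pair_frequency(pairs, corpus_docs, n=1):
    """Number of pretraining documents in which both halves of a pair co-occur."""
    counts = Counter()
    for doc in corpus_docs:
        doc_grams = ngrams(doc, n)
        for src_gram, tgt_gram in pairs:
            if src_gram in doc_grams and tgt_gram in doc_grams:
                counts[(src_gram, tgt_gram)] += 1
    return counts
```

A real pipeline would additionally filter pairs for task relevance and use an index over the corpus rather than a linear scan; those engineering details are omitted here.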
To understand pretraining's role, the authors devised a gradient-based analysis that computes the cosine similarity between the model's gradients on pretraining data and on task examples.
Figure 2: Cosine similarity between n-gram task gradient and pretraining gradient for different tasks and model sizes.
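A minimal sketch of this comparison is below, assuming a Hugging Face-style causal LM whose forward pass returns a loss when labels are provided; batching, parameter subsetting, and any averaging over examples that the authors perform are omitted.

```python
import torch

def flat_loss_gradient(model, input_ids, labels):
    """Flattened gradient of the LM loss w.r.t. all model parameters for one batch."""
    model.zero_grad()
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    return torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])

def gradient_cosine(model, task_batch, pretrain_batch):
    """Cosine similarity between gradients from a task example and a pretraining example."""
    g_task = flat_loss_gradient(model, *task_batch)
    g_pretrain = flat_loss_gradient(model, *pretrain_batch)
    return torch.nn.functional.cosine_similarity(g_task, g_pretrain, dim=0).item()
```

Concatenating full-parameter gradients is memory-hungry for billion-parameter models; in practice one would restrict the comparison to a subset of layers or use random projections.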
A key finding is that task performance correlates with the frequency of task-related n-gram pairs in the pretraining corpus. There is also a notable model size threshold, around 400 million parameters, beyond which performance improves significantly.
Figure 3: BLEU and exact match scores vs. total n-gram pair count in the pretraining corpus and model parameters.
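Figure 3 plots scores against total pair counts; the paper's exact statistic is not restated in this summary, so the sketch below uses Spearman rank correlation on made-up numbers purely to illustrate the shape of the analysis.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-task measurements: total count of mined n-gram pairs found in
# the pretraining corpus, and the model's score on that task (e.g. BLEU or exact match).
pair_counts = np.array([1.2e4, 3.5e5, 8.9e5, 2.4e6, 7.1e6])
task_scores = np.array([2.1, 7.8, 12.4, 18.9, 23.5])

# Rank correlation sidesteps the heavy-tailed distribution of raw counts;
# Pearson correlation on log-counts is a common alternative.
rho, p_value = spearmanr(pair_counts, task_scores)
print(f"Spearman rho={rho:.3f}, p={p_value:.3g}")
```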
The authors demonstrate that LLMs require both sufficient model size and relevant pretraining data to exhibit emergent abilities.
Memorization vs. Generalization
The study extends the definition of memorization beyond exact recall, evaluating how closely generated text resembles the training data. A linear regression from the pretraining data distribution to the LM's output distribution measures this resemblance, with goodness of fit quantified by the R² score.
The results illustrate that smaller models closely mirror the pretraining distribution, indicating higher memorization, while larger models show increased generalization.
Figure 4: R² score of linear regression from data distribution to LM distribution across different model sizes.
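How the two distributions are estimated (normalized corpus counts, model likelihoods, any log scaling) follows the paper; the sketch below shows only the regression-and-R² step, with the input names assumed for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def distribution_fit_r2(corpus_pair_freqs, lm_pair_probs):
    """Fit a linear regression from pretraining-data n-gram pair frequencies to the
    LM's probabilities for the same pairs; return the R^2 goodness of fit."""
    X = np.asarray(corpus_pair_freqs, dtype=float).reshape(-1, 1)
    y = np.asarray(lm_pair_probs, dtype=float)
    reg = LinearRegression().fit(X, y)
    return r2_score(y, reg.predict(X))
```

Under this reading, a high R² means the LM's distribution is largely predictable from the data distribution, i.e. memorization-like behavior, while a low R² indicates divergence from it.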
Data Distribution and LLM Novelty
The authors observe that larger models generate more novel n-gram pairs, indicating a shift from memorization toward generalization. Instruction tuning also enhances the utilization of pretraining data, yielding improved performance and stronger alignment with the pretraining distribution.
Figure 5: Number of unique n-gram pairs generated for different n values and model sizes.
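A minimal sketch of how novel pairs could be counted under the same pairing heuristic as the mining sketch above; here corpus_pairs would come from the earlier counting step, and defining novelty as absence from that observed pair set is an assumption rather than necessarily the authors' exact criterion.

```python
from itertools import product

def ngrams(text, n):
    """Set of n-grams (tuples of tokens) in a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_pair_fraction(task_inputs, generations, corpus_pairs, n=1):
    """Fraction of (input n-gram, generated n-gram) pairs that never co-occur in the
    pretraining corpus; corpus_pairs is the set of pairs observed in the corpus."""
    generated_pairs = set()
    for src, gen in zip(task_inputs, generations):
        generated_pairs.update(product(ngrams(src, n), ngrams(gen, n)))
    novel = generated_pairs - corpus_pairs
    return len(novel) / max(len(generated_pairs), 1)
```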
Conclusion
This paper provides insight into LLMs' capabilities via n-gram pair analysis, highlighting the interplay between pretraining data, model size, and task performance. It emphasizes that both memorization and generalization contribute to strong LLM performance. Future work could refine the methods for filtering and analyzing task-relevant n-gram pairs and apply them to larger, more recent corpora.