How much do language models memorize? (2505.24832v3)

Published 30 May 2025 in cs.CL

Abstract: We propose a new method for estimating how much a model knows about a datapoint and use it to measure the capacity of modern LLMs. Prior studies of LLM memorization have struggled to disentangle memorization from generalization. We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter. We train LLMs on datasets of increasing size and observe that models memorize until their capacity fills, at which point "grokking" begins, and unintended memorization decreases as models begin to generalize. We train hundreds of transformer LLMs ranging from $500K$ to $1.5B$ parameters and produce a series of scaling laws relating model capacity and data size to membership inference.

This paper introduces a novel method for quantifying how much an LLM "memorizes" specific training data, providing a principled way to distinguish unintended memorization from generalization. The core idea is based on Kolmogorov complexity and its relationship to compression.

The authors define memorization as the reduction in description length of a data point $x$ when a trained model $\hat{\theta}$ is available as a reference, compared to when no such model is available. To isolate unintended memorization from generalization (information about the true data-generating process), they introduce a reference model $\theta$ that represents the underlying data distribution or a high-capacity model trained on broad data. Unintended memorization of $x$ in $\hat{\theta}$ with respect to $\theta$ is defined as the information about $x$ contained in $\hat{\theta}$ that is not already captured by the reference model $\theta$. Formally, it is the reduction in description length of $x$ given the reference model $\theta$ alone compared to the description length given both $\theta$ and $\hat{\theta}$: $\text{mem}^K_U(x, \theta, \hat{\theta}) = H^K(x \mid \theta) - H^K(x \mid (\theta, \hat{\theta}))$.

Since exact Kolmogorov complexity is uncomputable, the authors approximate it using model likelihoods, leveraging the connection between compression and probability: $H^K(x \mid \text{model}) \approx -\log_2 p(x \mid \text{model})$. They approximate $H^K(x \mid (\theta, \hat{\theta}))$ using the minimum description length provided by either model, i.e., the code length of whichever model assigns the higher likelihood, $\max\{p(x \mid \hat{\theta}),\, p(x \mid \theta)\}$. For synthetic uniform data, the base entropy $H^K(x)$ is known exactly. For real text, they use a large "oracle" model, or a model of the same size trained on the full dataset, as the reference $\theta$ to estimate $H^K(x \mid \theta)$.
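
A minimal sketch of this estimate, assuming Hugging Face-style causal LMs (a `model(**inputs, labels=...)` call that returns mean cross-entropy in nats); the function names are illustrative, not taken from the paper's code:

```python
import math

import torch

def nll_bits(model, tokenizer, text, device="cpu"):
    """Total negative log-likelihood of `text` in bits under a causal LM
    (the code-length proxy for H^K(x | model))."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    n_predicted = enc["input_ids"].shape[1] - 1          # HF loss averages over shifted targets
    return out.loss.item() * n_predicted / math.log(2)   # nats -> bits, summed over tokens

def unintended_memorization_bits(trained_model, reference_model, tokenizer, text):
    """mem_U(x) ~= H^K(x | theta) - H^K(x | (theta, theta_hat)), with the joint
    term taken as the shorter of the two code lengths."""
    h_ref = nll_bits(reference_model, tokenizer, text)   # H^K(x | theta)
    h_hat = nll_bits(trained_model, tokenizer, text)     # H^K(x | theta_hat)
    return h_ref - min(h_ref, h_hat)                     # non-negative by construction
```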

Measuring Model Capacity:

The paper uses unintended memorization to estimate the intrinsic capacity of LLMs. They train GPT-style transformer models (ranging from 500K to 1.5B parameters) on uniformly random bitstrings; because no generalization is possible on such data, all learning represents pure memorization. They observe that the total amount of memorization across the dataset plateaus as the dataset size increases, indicating the model has reached its capacity (Figure \ref{fig:model_capacity_synthetic}).
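
A sketch of this synthetic measurement under illustrative settings (the vocabulary size, sequence length, and the assumption that the model returns next-token logits are all placeholders). For uniform random sequences the baseline entropy is exact, so per-sample memorization is simply that entropy minus the model's code length for the sample:

```python
import math

import torch

VOCAB = 2048     # illustrative alphabet size for the synthetic sequences
SEQ_LEN = 64     # tokens per sample
BITS_PER_TOKEN = math.log2(VOCAB)

def make_uniform_dataset(n_samples, seed=0):
    """Uniform random token sequences: nothing to generalize, so all learning is memorization."""
    g = torch.Generator().manual_seed(seed)
    return torch.randint(0, VOCAB, (n_samples, SEQ_LEN), generator=g)

def total_memorization_bits(model, dataset):
    """Sum over the dataset of H^K(x) - (-log2 p(x | theta_hat)), scored on the
    SEQ_LEN - 1 positions a causal model actually predicts."""
    baseline = (SEQ_LEN - 1) * BITS_PER_TOKEN
    total = 0.0
    for x in dataset:
        logits = model(x.unsqueeze(0))                     # assumed shape (1, SEQ_LEN, VOCAB)
        logp = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for positions 1..SEQ_LEN-1
        targets = x.unsqueeze(0)[:, 1:].unsqueeze(-1)
        nll = -logp.gather(-1, targets).sum().item() / math.log(2)   # code length in bits
        total += max(0.0, baseline - nll)
    return total

# Sweeping the dataset size and plotting total_memorization_bits should plateau
# at the model's capacity (the paper's estimate: ~3.6 bits per parameter).
```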

Through experiments on these synthetic datasets, they estimate that GPT-style transformers can store approximately 3.6 bits of information per parameter when trained in bfloat16 precision, and 3.8 bits per parameter in fp32 precision (Figure \ref{fig:model_capacity_parameters}). This suggests that doubling precision does not double memorization capacity, implying that the extra bits in fp32 are not primarily used for raw storage.
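
As a back-of-the-envelope application of that number (the bits-per-parameter factor is the paper's estimate; the rest is simple arithmetic):

```python
def estimated_capacity_bits(n_params, bits_per_param=3.6):
    """Rule-of-thumb memorization capacity for a GPT-style model trained in bfloat16."""
    return bits_per_param * n_params

# A 1.5B-parameter model: ~5.4e9 bits, i.e. roughly 675 MB of raw storage equivalent.
print(estimated_capacity_bits(1.5e9) / 8 / 1e6, "MB")
```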

Memorization and Generalization on Text Data:

Applying the framework to real text data (FineWeb dataset), the authors use a reference model to disentangle unintended memorization from generalization. They find that unintended memorization per sample decreases as the dataset size increases relative to model capacity. The total unintended memorization across the dataset first increases as the model fills its capacity and then decreases as the model begins to generalize (Figure \ref{fig:text-kolmogorov-total-oracle}).

Double Descent Explanation:

The paper provides a new perspective on the double descent phenomenon. They show that double descent in evaluation loss occurs precisely when the dataset size (measured in bits) begins to exceed the model's memorization capacity (Figure \ref{fig:synth_double_descent} and \ref{fig:text-loss-train-val}). The proposed explanation is that once the model can no longer memorize all data points individually due to capacity constraints, it is "forced" to find more general patterns across data points, leading to improved generalization and decreased per-sample unintended memorization.

Memorization and Membership Inference:

The authors investigate the relationship between their memorization measure and the success of loss-based membership inference attacks. For both synthetic and text data, they show that membership inference performance decreases significantly as the dataset size grows relative to model capacity (Figure \ref{fig:model_capacity_membership} and \ref{fig:text-membership}). When comparing membership inference to extraction rates, they find that membership inference is generally easier than verbatim extraction, and that on large, deduplicated datasets, any successful extraction is often attributable to generalization (Figure \ref{fig:text-membership-vs-extraction}).
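
A minimal sketch of the kind of loss-threshold attack evaluated here (the median-based threshold and array inputs are assumptions; a stronger attack would calibrate the threshold on held-out data):

```python
import numpy as np

def loss_mia_f1(member_losses, nonmember_losses, threshold=None):
    """Loss-based membership inference: predict 'member' when the per-sample loss
    falls below a threshold, then score the attack with F1."""
    member_losses = np.asarray(member_losses, dtype=float)
    nonmember_losses = np.asarray(nonmember_losses, dtype=float)
    losses = np.concatenate([member_losses, nonmember_losses])
    labels = np.concatenate([np.ones(len(member_losses)), np.zeros(len(nonmember_losses))])
    if threshold is None:
        threshold = np.median(losses)             # simple heuristic threshold
    preds = (losses < threshold).astype(float)
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)
```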

Scaling Laws for Membership Inference:

Based on their empirical results, the authors propose a scaling law to predict the F1 score of a loss-based membership inference attack. The law relates the F1 score to the ratio of model capacity (estimated from the parameter count using the 3.6 bits-per-parameter factor) to the dataset size (in number of examples). The functional form is a sigmoid:

$$\text{Membership}_{F_1}(\theta, \mathcal{D}) = \frac{1}{2}\left(1 + c_1\,\sigma\!\left(c_2\left(\frac{\text{Capacity}(\theta)}{|\mathcal{D}|} + c_3\right)\right)\right)$$

where $\text{Capacity}(\theta)$ is estimated as $\alpha \times |\theta|$, with $\alpha \approx 3.6$ bits per parameter and $|\theta|$ the number of parameters, and $\sigma$ is the sigmoid function. They fit the constants $c_1, c_2, c_3$ from experiments on smaller models and validate the law on larger GPT-2 models (125M and 1.5B parameters) by predicting the dataset size needed for specific F1 scores (Table \ref{tab:scaling-law-dataset-sizes}). The predictions are found to be relatively accurate, suggesting that the ratio of model capacity to dataset size is a strong predictor of membership inference success. This law implies that for models trained on datasets much larger than their capacity, average membership inference is expected to be near random chance (F1 = 0.5).
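
The law is straightforward to apply once the constants are known; a sketch (the constants $c_1, c_2, c_3$ are placeholders to be fit empirically, not the paper's fitted values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predicted_mia_f1(n_params, n_examples, c1, c2, c3, bits_per_param=3.6):
    """Predicted F1 of a loss-based membership inference attack from the sigmoidal
    scaling law; c1, c2, c3 are fit on smaller models."""
    ratio = bits_per_param * n_params / n_examples   # Capacity(theta) / |D|
    return 0.5 * (1.0 + c1 * sigmoid(c2 * (ratio + c3)))

# When the dataset is much larger than capacity, ratio -> 0 and the prediction
# collapses toward chance level (F1 ~ 0.5) for fitted constants with
# c1 * sigmoid(c2 * c3) near zero.
```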

Implementation Considerations and Practical Implications:

  • Measuring Memorization: Implementing this approach requires calculating per-sample likelihoods under the trained model and a chosen reference model. For text, obtaining a good reference model is crucial. An oracle model (larger, trained on broader data) or a model of similar architecture trained on the full dataset can serve this purpose.
  • Model Capacity Estimation: The 3.6 bits/parameter estimate provides a practical rule of thumb for the memorization capacity of GPT-style models. This is distinct from task-specific performance or knowledge storage related to generalization.
  • Dataset Size vs. Model Size: The findings reinforce the idea that the ratio of dataset size to model capacity dictates the training dynamics. Training on data significantly exceeding capacity pushes models towards generalization rather than rote memorization of individual samples.
  • Membership Inference Risk: The scaling law offers a way for practitioners to estimate the privacy risk related to average training data points based on their model and dataset sizes. If your dataset is sufficiently large compared to your model's estimated memorization capacity, loss-based MIA on average samples is unlikely to be effective.
  • Identifying Highly Memorized Data: The paper shows that even on large, deduplicated text datasets, certain data points are highly memorized. These often contain rare tokens or content from non-English languages (Figure \ref{fig:text-mem-tfidf}, Table \ref{tab:examples-table}). This highlights that memorization risk is not uniform across the training set and is higher for unique or statistically unusual examples. Practitioners concerned about privacy might focus on identifying and handling such outliers in their training data (a small ranking sketch follows this list).
  • Computational Requirements: Measuring memorization across large datasets requires computing likelihoods for many samples under multiple models, which can be computationally intensive, though more feasible than adversarial extraction attempts for large-scale analysis.
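
Following up on the "Identifying Highly Memorized Data" point, a small sketch for surfacing the most-memorized samples; the inputs are assumed to be per-sample unintended-memorization estimates in bits, computed as in the earlier sketch:

```python
import numpy as np

def flag_memorized_outliers(texts, mem_bits, top_k=20):
    """Rank training samples by estimated unintended memorization (in bits) and
    return the most-memorized ones for manual review."""
    mem_bits = np.asarray(mem_bits, dtype=float)
    order = np.argsort(mem_bits)[::-1]   # largest memorization first
    return [(float(mem_bits[i]), texts[i]) for i in order[:top_k]]

# Typical findings among top-ranked samples: rare tokens, non-English content,
# or near-duplicates -- candidates for filtering or deduplication.
```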

The paper concludes that their framework provides a valuable tool for understanding and measuring LLM memorization, capacity, and their relationship to generalization and privacy risks like membership inference.

Authors (8)
  1. John X. Morris (24 papers)
  2. Chawin Sitawarin (26 papers)
  3. Chuan Guo (77 papers)
  4. Narine Kokhlikyan (15 papers)
  5. G. Edward Suh (30 papers)
  6. Alexander M. Rush (115 papers)
  7. Kamalika Chaudhuri (121 papers)
  8. Saeed Mahloujifar (43 papers)