This paper, "Elephants Never Forget: Memorization and Learning of Tabular Data in LLMs" (Bordt et al., 9 Apr 2024 ), investigates the extent to which LLMs, specifically GPT-3.5 and GPT-4, have memorized popular tabular datasets and how this memorization affects their few-shot learning performance. The authors address the critical issue of data contamination in LLM evaluation benchmarks.
The core contributions and findings are:
- Detection of Tabular Data Memorization: The authors introduce four tests to detect whether an LLM has seen a specific tabular dataset during training, leveraging the canonical CSV format. These tests extract identifying information (canaries) from the LLM (a minimal sketch of the row completion test follows the list):
- Header Test: Prompting with initial rows and asking for verbatim completion.
- Row Completion Test: Prompting with contiguous rows from a random position and asking for the next row verbatim.
- Feature Completion Test: Prompting with all but one unique feature value from a random row and asking for the missing value verbatim.
- First Token Test: Prompting with contiguous rows and asking for the first token of the next row.
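A minimal, from-scratch sketch of the Row Completion Test, assuming an OpenAI-style chat API and a local CSV file. The function name `row_completion_test`, the prompt wording, and the trial counts are illustrative; this is not the authors' `tabmemcheck` implementation.

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def row_completion_test(csv_path: str, model: str = "gpt-3.5-turbo",
                        n_prefix_rows: int = 10, n_trials: int = 25) -> float:
    """Fraction of trials where the model reproduces the next CSV row verbatim."""
    with open(csv_path) as f:
        rows = [line.strip() for line in f if line.strip()]

    hits = 0
    for _ in range(n_trials):
        # choose a block of contiguous rows at a random position (skipping the header)
        start = random.randint(1, len(rows) - n_prefix_rows - 2)
        prefix = "\n".join(rows[start:start + n_prefix_rows])
        target = rows[start + n_prefix_rows]

        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "You complete CSV files. Respond only with the next row, verbatim."},
                {"role": "user", "content": prefix},
            ],
        )
        completion = response.choices[0].message.content.strip()
        hits += completion == target

    return hits / n_trials

# A high verbatim completion rate on random positions is evidence of memorization.
# print(row_completion_test("iris.csv", model="gpt-4"))
```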
- Extensive Memorization Found in Popular Datasets: Applying these tests, the authors find strong evidence that GPT-3.5 and GPT-4 have memorized many popular tabular datasets verbatim, including Iris, Wine, Adult, Diabetes, Housing, and Titanic. The Header Test shows memorization of initial rows for all tested publicly available datasets. Row and Feature Completion tests also show memorization of random rows for some of these datasets, indicating memorization beyond just introductory samples often found online. Novel datasets released after the models' training cutoffs show no such evidence of memorization. GPT-4 generally shows stronger evidence of memorization than GPT-3.5.
- Impact of Memorization on Few-Shot Learning Performance: The paper compares the few-shot learning classification performance of GPT-3.5 and GPT-4 on memorized datasets versus novel (unseen) datasets.
- Few-shot setup: Models are prompted with tabular data formatted as "Feature Name = Feature Value" pairs for classification. Twenty random, label-stratified few-shot examples are provided, and queries are issued at temperature 0 (a prompt-construction sketch follows the list of formats below).
- Dataset Transformations: To assess the robustness and reliance on memorized exact values/formats, datasets are presented in four formats:
- Original: As found in the CSV.
- Perturbed: Original data with slightly changed individual digits and defaced unique identifiers.
- Task: Perturbed data with re-named features, re-coded categories, and rounded numerical values.
- Statistical: Task data with numerical features standardized (zero mean, unit variance) and categorical features numerically encoded, with feature names replaced by generic ones (X1, ...).
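For concreteness, a minimal sketch of the "Feature Name = Feature Value" serialization used in the few-shot setup. The helper names and the placement of the target label are assumptions, not the paper's verbatim prompt template.

```python
import pandas as pd

def serialize_row(row: pd.Series, target: str) -> str:
    """Render one example as 'Feature Name = Feature Value' lines."""
    return "\n".join(f"{name} = {value}" for name, value in row.items() if name != target)

def build_few_shot_prompt(shots: pd.DataFrame, query: pd.Series, target: str) -> str:
    """Concatenate labeled few-shot examples followed by the unlabeled query row."""
    blocks = [serialize_row(row, target) + f"\n{target} = {row[target]}"
              for _, row in shots.iterrows()]
    blocks.append(serialize_row(query, target) + f"\n{target} = ?")
    return "\n\n".join(blocks)

# Usage with 20 label-stratified examples, as in the paper's setup (column names assumed):
# df = pd.read_csv("adult.csv")
# shots = df.groupby("Income", group_keys=False).apply(lambda g: g.sample(10, random_state=0))
# print(build_few_shot_prompt(shots, df.iloc[123], target="Income"))
```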
- Findings: On memorized datasets (Panel A, Table 3), the LLMs (especially GPT-4) achieve impressive performance, sometimes outperforming traditional models like Logistic Regression (LR) and Gradient Boosted Trees (GBT) in the `Original` format. However, performance consistently drops as the format changes from `Original` to `Perturbed`, `Task`, and `Statistical`. For GPT-4, the average accuracy drop from the `Original` to the `Task` format on memorized datasets is 6 percentage points. This sensitivity to formatting strongly suggests that performance on these datasets is inflated due to memorization and subsequent overfitting.
- On novel datasets (Panel B, Table 3), the LLMs perform reasonably well but generally do not outperform LR or GBT. Crucially, performance on novel datasets is much more robust to format changes between the `Original`, `Perturbed`, and `Task` formats. This indicates that LLMs are not simply pattern-matching exact strings from their training data on novel tasks.
- Role of World Knowledge vs. Statistical Learning: The significant performance drop observed between the `Task` format (which retains meaningful feature names and scales) and the `Statistical` format (which uses generic names and standardized values) on most novel datasets indicates that LLMs rely heavily on the "world knowledge" associated with feature names and value scales for effective few-shot learning. This effect is less pronounced on synthetic datasets like Spaceship Titanic.
- Limited In-Context Statistical Learning Ability: The authors specifically test the LLMs' ability to act as pure statistical predictors on a synthetic linear classification task with standardized numerical features (the `Statistical` format); a sketch of such a probe follows these bullets.
- GPT-3.5 and GPT-4 perform better than random but struggle as the feature dimension increases, deteriorating faster than LR or 1-Nearest Neighbor.
- GPT-4's performance improves monotonically with the number of few-shot examples, showing some ability to learn statistical patterns in-context, though it remains less efficient than LR. GPT-3.5 shows weaker scaling with few-shot examples. This contrasts with fine-tuning approaches like TabLLM, which scale well with data size.
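A sketch of how such a synthetic probe can be constructed: standardized features, labels from a random linear rule, serialization with generic names (X1, X2, ...), and logistic regression fit on the same examples as the classical reference point. The dimension, sample counts, and prompt wording are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_linear_task(n_shots: int, dim: int):
    """Standardized features with labels from a random linear decision rule."""
    X = rng.standard_normal((n_shots + 1, dim))  # zero mean, unit variance; last row is the query
    w = rng.standard_normal(dim)
    y = (X @ w > 0).astype(int)
    return X[:-1], y[:-1], X[-1], y[-1]

def to_statistical_prompt(X, y, x_query):
    """Serialize examples with generic feature names X1, X2, ... (the Statistical format)."""
    def row(x):
        return ", ".join(f"X{i + 1} = {v:.2f}" for i, v in enumerate(x))
    shots = "\n".join(f"{row(x)}, Y = {label}" for x, label in zip(X, y))
    return f"{shots}\n{row(x_query)}, Y = ?"

# The prompt would be sent to the LLM at temperature 0; logistic regression on the
# same 20 examples serves as the in-context learning reference point.
X, y, x_query, y_query = make_linear_task(n_shots=20, dim=8)
prompt = to_statistical_prompt(X, y, x_query)
lr_prediction = LogisticRegression().fit(X, y).predict(x_query.reshape(1, -1))[0]
```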
- Drawing Samples from Memorized Data: The paper demonstrates that LLMs can draw random samples from datasets they have seen during training without specific fine-tuning. By conditioning GPT-3.5 with the dataset name and feature names, it can generate samples that follow the overall data distribution and feature correlations (e.g., California Housing), even if the samples are not direct copies. The diversity of samples can be influenced by the temperature parameter.
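A hedged sketch of this conditional sampling procedure, assuming an OpenAI-style chat API; the prompt wording and the `sample_record` helper are illustrative, not the paper's exact protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sample_record(dataset: str, feature_names: list[str],
                  model: str = "gpt-3.5-turbo", temperature: float = 0.7) -> str:
    """Ask the model for one plausible record from a dataset it may have memorized."""
    prompt = (
        f"Provide a random sample from the {dataset} dataset as a single CSV row "
        f"with the columns: {', '.join(feature_names)}. Respond with the row only."
    )
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,  # a higher temperature yields more diverse samples
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# e.g. sample_record("California Housing", ["longitude", "latitude", "median_income", "median_house_value"])
```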
Practical Implications and Implementation:
- Evaluation Rigor: The findings underscore the critical importance of checking for dataset contamination when evaluating LLMs on specific tasks, especially for popular datasets. Evaluating on memorized data can lead to highly inflated performance estimates due to overfitting, not true few-shot generalization.
- `tabmemcheck` Tool: The authors provide an open-source Python package, `tabmemcheck`, implementing the developed memorization tests. This tool allows practitioners and researchers to check whether an LLM has memorized a specific tabular dataset before using it for evaluation or deployment on that data (a usage sketch follows the next bullet).
- Prompt Engineering: The "Feature Name = Feature Value" format is a demonstrated effective way to prompt LLMs for tabular tasks. However, the paper highlights that the specific wording (feature names, categorical value encoding) significantly impacts performance on novel data, suggesting the need for careful prompt design when domain knowledge is involved.
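A brief usage sketch of the `tabmemcheck` package mentioned above. The `run_all_tests` entry point and argument order are assumptions based on the package's documentation and may differ across versions; consult the package README for the exact API.

```python
# Assumed entry point: run_all_tests runs the header, row completion, feature
# completion, and first token tests against the given CSV file and model and
# prints a memorization report; verify against the tabmemcheck README.
import tabmemcheck

tabmemcheck.run_all_tests("titanic-train.csv", "gpt-3.5-turbo")
```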
- Performance Considerations: The few-shot learning experiments, particularly with GPT-4, can be computationally expensive, requiring many queries to the LLM API. This is a practical consideration for extensive evaluations.
- Scaling Limitations: While LLMs can perform statistical tasks in-context, their ability to learn complex statistical relationships purely from examples seems limited compared to traditional methods, especially in higher dimensions. Fine-tuning or combining LLMs with classical statistical models might be more effective for many real-world tabular tasks.
In summary, the paper provides strong evidence of significant verbatim memorization of popular tabular datasets by state-of-the-art LLMs, demonstrates how this leads to inflated few-shot performance estimates due to overfitting, and highlights that LLMs rely heavily on world knowledge encoded in variable names and formats for novel tabular tasks, with limited pure in-context statistical learning abilities. The release of `tabmemcheck` offers a practical tool to mitigate contamination issues in future research and applications.