This paper, "Elephants Never Forget: Memorization and Learning of Tabular Data in LLMs" (Bordt et al., 9 Apr 2024 ), investigates the extent to which LLMs, specifically GPT-3.5 and GPT-4, have memorized popular tabular datasets and how this memorization affects their few-shot learning performance. The authors address the critical issue of data contamination in LLM evaluation benchmarks.
The core contributions and findings are:
- Detection of Tabular Data Memorization: The authors introduce four tests to detect whether an LLM has seen a specific tabular dataset during training, leveraging the canonical CSV format. These tests extract identifying information (canaries) from the LLM (a minimal sketch of the row completion test follows the list):
- Header Test: Prompting with initial rows and asking for verbatim completion.
- Row Completion Test: Prompting with contiguous rows from a random position and asking for the next row verbatim.
- Feature Completion Test: Prompting with all but one unique feature value from a random row and asking for the missing value verbatim.
- First Token Test: Prompting with contiguous rows and asking for the first token of the next row.
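A minimal, from-scratch sketch of the Row Completion Test, assuming an OpenAI-style chat API and a local CSV file. The function name `row_completion_test`, the prompt wording, and the trial counts are illustrative; this is not the authors' `tabmemcheck` implementation.

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def row_completion_test(csv_path: str, model: str = "gpt-3.5-turbo",
                        n_prefix_rows: int = 10, n_trials: int = 25) -> float:
    """Fraction of trials where the model reproduces the next CSV row verbatim."""
    with open(csv_path) as f:
        rows = [line.strip() for line in f if line.strip()]

    hits = 0
    for _ in range(n_trials):
        # choose a block of contiguous rows at a random position (skipping the header)
        start = random.randint(1, len(rows) - n_prefix_rows - 2)
        prefix = "\n".join(rows[start:start + n_prefix_rows])
        target = rows[start + n_prefix_rows]

        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "You complete CSV files. Respond only with the next row, verbatim."},
                {"role": "user", "content": prefix},
            ],
        )
        completion = response.choices[0].message.content.strip()
        hits += completion == target

    return hits / n_trials

# A high verbatim completion rate on random positions is evidence of memorization.
# print(row_completion_test("iris.csv", model="gpt-4"))
```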
- Extensive Memorization Found in Popular Datasets: Applying these tests, the authors find strong evidence that GPT-3.5 and GPT-4 have memorized many popular tabular datasets verbatim, including Iris, Wine, Adult, Diabetes, Housing, and Titanic. The Header Test shows memorization of initial rows for all tested publicly available datasets. Row and Feature Completion tests also show memorization of random rows for some of these datasets, indicating memorization beyond just introductory samples often found online. Novel datasets released after the models' training cutoffs show no such evidence of memorization. GPT-4 generally shows stronger evidence of memorization than GPT-3.5.
- Impact of Memorization on Few-Shot Learning Performance: The paper compares the few-shot learning classification performance of GPT-3.5 and GPT-4 on memorized datasets versus novel (unseen) datasets.
- Few-shot setup: Models are prompted with tabular data formatted as "Feature Name = Feature Value" pairs for classification. Twenty random, label-stratified few-shot examples are provided, and queries are issued at temperature 0 (a prompt-construction sketch follows the list of formats below).
- Dataset Transformations: To assess the robustness and reliance on memorized exact values/formats, datasets are presented in four formats:
- Original: As found in the CSV.
- Perturbed: Original data with slightly changed individual digits and defaced unique identifiers.
- Task: Perturbed data with re-named features, re-coded categories, and rounded numerical values.
- Statistical: Task data with numerical features standardized (zero mean, unit variance) and categorical features numerically encoded, with feature names replaced by generic ones (X1, ...).
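For concreteness, a minimal sketch of the "Feature Name = Feature Value" serialization used in the few-shot setup. The helper names and the placement of the target label are assumptions, not the paper's verbatim prompt template.

```python
import pandas as pd

def serialize_row(row: pd.Series, target: str) -> str:
    """Render one example as 'Feature Name = Feature Value' lines."""
    return "\n".join(f"{name} = {value}" for name, value in row.items() if name != target)

def build_few_shot_prompt(shots: pd.DataFrame, query: pd.Series, target: str) -> str:
    """Concatenate labeled few-shot examples followed by the unlabeled query row."""
    blocks = [serialize_row(row, target) + f"\n{target} = {row[target]}"
              for _, row in shots.iterrows()]
    blocks.append(serialize_row(query, target) + f"\n{target} = ?")
    return "\n\n".join(blocks)

# Usage with 20 label-stratified examples, as in the paper's setup (column names assumed):
# df = pd.read_csv("adult.csv")
# shots = df.groupby("Income", group_keys=False).apply(lambda g: g.sample(10, random_state=0))
# print(build_few_shot_prompt(shots, df.iloc[123], target="Income"))
```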
- Findings: On memorized datasets (Panel A, Table 3), the LLMs (especially GPT-4) achieve impressive performance, sometimes outperforming traditional models like Logistic Regression (LR) and Gradient Boosted Trees (GBT) in the `Original` format. However, performance consistently drops as the format changes from `Original` to `Perturbed`, `Task`, and `Statistical`. For GPT-4, the average accuracy drop from the `Original` to the `Task` format on memorized datasets is 6 percentage points. This sensitivity to formatting strongly suggests that performance on these datasets is inflated due to memorization and subsequent overfitting.
- On novel datasets (Panel B, Table 3), the LLMs perform reasonably well but generally do not outperform LR or GBT. Crucially, performance on novel datasets is much more robust to format changes between the `Original`, `Perturbed`, and `Task` formats. This indicates that LLMs are not simply pattern-matching exact strings from their training data on novel tasks.
- Role of World Knowledge vs. Statistical Learning: The significant performance drop observed between the `Task` format (which retains meaningful feature names and scales) and the `Statistical` format (which uses generic names and standardized values) on most novel datasets indicates that LLMs rely heavily on the "world knowledge" associated with feature names and value scales for effective few-shot learning. This effect is less pronounced on synthetic datasets like Spaceship Titanic.
- Limited In-Context Statistical Learning Ability: The authors specifically test the LLMs' ability to act as pure statistical predictors on a synthetic linear classification task with standardized numerical features (the `Statistical` format); a sketch of such a probe follows these bullets.
- GPT-3.5 and GPT-4 perform better than random but struggle as the feature dimension increases, deteriorating faster than LR or 1-Nearest Neighbor.
- GPT-4's performance improves monotonically with the number of few-shot examples, showing some ability to learn statistical patterns in-context, though it remains less efficient than LR. GPT-3.5 shows weaker scaling with few-shot examples. This contrasts with fine-tuning approaches like TabLLM, which scale well with data size.
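A sketch of how such a synthetic probe can be constructed: standardized features, labels from a random linear rule, serialization with generic names (X1, X2, ...), and logistic regression fit on the same examples as the classical reference point. The dimension, sample counts, and prompt wording are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_linear_task(n_shots: int, dim: int):
    """Standardized features with labels from a random linear decision rule."""
    X = rng.standard_normal((n_shots + 1, dim))  # zero mean, unit variance; last row is the query
    w = rng.standard_normal(dim)
    y = (X @ w > 0).astype(int)
    return X[:-1], y[:-1], X[-1], y[-1]

def to_statistical_prompt(X, y, x_query):
    """Serialize examples with generic feature names X1, X2, ... (the Statistical format)."""
    def row(x):
        return ", ".join(f"X{i + 1} = {v:.2f}" for i, v in enumerate(x))
    shots = "\n".join(f"{row(x)}, Y = {label}" for x, label in zip(X, y))
    return f"{shots}\n{row(x_query)}, Y = ?"

# The prompt would be sent to the LLM at temperature 0; logistic regression on the
# same 20 examples serves as the in-context learning reference point.
X, y, x_query, y_query = make_linear_task(n_shots=20, dim=8)
prompt = to_statistical_prompt(X, y, x_query)
lr_prediction = LogisticRegression().fit(X, y).predict(x_query.reshape(1, -1))[0]
```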
- Drawing Samples from Memorized Data: The paper demonstrates that LLMs can draw random samples from datasets they have seen during training without specific fine-tuning. By conditioning GPT-3.5 with the dataset name and feature names, it can generate samples that follow the overall data distribution and feature correlations (e.g., California Housing), even if the samples are not direct copies. The diversity of samples can be influenced by the temperature parameter.
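A hedged sketch of this conditional sampling procedure, assuming an OpenAI-style chat API; the prompt wording and the `sample_record` helper are illustrative, not the paper's exact protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sample_record(dataset: str, feature_names: list[str],
                  model: str = "gpt-3.5-turbo", temperature: float = 0.7) -> str:
    """Ask the model for one plausible record from a dataset it may have memorized."""
    prompt = (
        f"Provide a random sample from the {dataset} dataset as a single CSV row "
        f"with the columns: {', '.join(feature_names)}. Respond with the row only."
    )
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,  # a higher temperature yields more diverse samples
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# e.g. sample_record("California Housing", ["longitude", "latitude", "median_income", "median_house_value"])
```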
Practical Implications and Implementation:
- Evaluation Rigor: The findings underscore the critical importance of checking for dataset contamination when evaluating LLMs on specific tasks, especially for popular datasets. Evaluating on memorized data can lead to highly inflated performance estimates due to overfitting, not true few-shot generalization.
- `tabmemcheck` Tool: The authors provide an open-source Python package, `tabmemcheck`, implementing the developed memorization tests. This tool allows practitioners and researchers to check whether an LLM has memorized a specific tabular dataset before using it for evaluation or deployment on that data (a usage sketch follows the next bullet).
- Prompt Engineering: The "Feature Name = Feature Value" format is a demonstrated effective way to prompt LLMs for tabular tasks. However, the paper highlights that the specific wording (feature names, categorical value encoding) significantly impacts performance on novel data, suggesting the need for careful prompt design when domain knowledge is involved.
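A brief usage sketch of the `tabmemcheck` package mentioned above. The `run_all_tests` entry point and argument order are assumptions based on the package's documentation and may differ across versions; consult the package README for the exact API.

```python
# Assumed entry point: run_all_tests runs the header, row completion, feature
# completion, and first token tests against the given CSV file and model and
# prints a memorization report; verify against the tabmemcheck README.
import tabmemcheck

tabmemcheck.run_all_tests("titanic-train.csv", "gpt-3.5-turbo")
```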
- Performance Considerations: The few-shot learning experiments, particularly with GPT-4, can be computationally expensive, requiring many queries to the LLM API. This is a practical consideration for extensive evaluations.
- Scaling Limitations: While LLMs can perform statistical tasks in-context, their ability to learn complex statistical relationships purely from examples seems limited compared to traditional methods, especially in higher dimensions. Fine-tuning or combining LLMs with classical statistical models might be more effective for many real-world tabular tasks.
In summary, the paper provides strong evidence of significant verbatim memorization of popular tabular datasets by state-of-the-art LLMs, demonstrates how this leads to inflated few-shot performance estimates due to overfitting, and highlights that LLMs rely heavily on world knowledge encoded in variable names and formats for novel tabular tasks, with limited pure in-context statistical learning abilities. The release of `tabmemcheck` offers a practical tool to mitigate contamination issues in future research and applications.