Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models (2404.06209v3)

Published 9 Apr 2024 in cs.LG, cs.AI, and cs.CL

Abstract: While many have shown how LLMs can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of different techniques to assess whether an LLM has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the in-context statistical learning abilities of LLMs. While LLMs are significantly better than random at solving statistical classification problems, the sample efficiency of few-shot learning lags behind traditional statistical learning algorithms, especially as the dimension of the problem increases. This suggests that much of the observed few-shot performance on novel real-world datasets is due to the LLM's world knowledge. Overall, our results highlight the importance of testing whether an LLM has seen an evaluation dataset during pre-training. We release a Python package (https://github.com/interpretml/LLM-Tabular-Memorization-Checker) to test LLMs for memorization of tabular datasets.

This paper, "Elephants Never Forget: Memorization and Learning of Tabular Data in LLMs" (Bordt et al., 9 Apr 2024 ), investigates the extent to which LLMs, specifically GPT-3.5 and GPT-4, have memorized popular tabular datasets and how this memorization affects their few-shot learning performance. The authors address the critical issue of data contamination in LLM evaluation benchmarks.

The core contributions and findings are:

  1. Detection of Tabular Data Memorization: The authors introduce four tests to detect whether an LLM has seen a specific tabular dataset during training, leveraging the canonical CSV format. These tests extract identifying information (canaries) from the LLM (a sketch of the row completion test appears after this list):
    • Header Test: Prompting with initial rows and asking for verbatim completion.
    • Row Completion Test: Prompting with contiguous rows from a random position and asking for the next row verbatim.
    • Feature Completion Test: Prompting with all but one unique feature value from a random row and asking for the missing value verbatim.
    • First Token Test: Prompting with contiguous rows and asking for the first token of the next row.
  2. Extensive Memorization Found in Popular Datasets: Applying these tests, the authors find strong evidence that GPT-3.5 and GPT-4 have memorized many popular tabular datasets verbatim, including Iris, Wine, Adult, Diabetes, Housing, and Titanic. The Header Test shows memorization of initial rows for all tested publicly available datasets. Row and Feature Completion tests also show memorization of random rows for some of these datasets, indicating memorization beyond just introductory samples often found online. Novel datasets released after the models' training cutoffs show no such evidence of memorization. GPT-4 generally shows stronger evidence of memorization than GPT-3.5.
  3. Impact of Memorization on Few-Shot Learning Performance: The paper compares the few-shot learning classification performance of GPT-3.5 and GPT-4 on memorized datasets versus novel (unseen) datasets.
    • Few-shot setup: Models are prompted with tabular records serialized as "Feature Name = Feature Value" pairs and asked to predict the label. Twenty random, label-stratified few-shot examples are provided, and completions are requested at temperature 0 (a serialization sketch also appears after this list).
    • Dataset Transformations: To assess the robustness and reliance on memorized exact values/formats, datasets are presented in four formats:
      • Original: As found in the CSV.
      • Perturbed: Original data with slightly changed individual digits and defaced unique identifiers.
      • Task: Perturbed data with re-named features, re-coded categories, and rounded numerical values.
      • Statistical: Task data with numerical features standardized (zero mean, unit variance) and categorical features numerically encoded, with feature names replaced by generic ones (X1, ...).
    • Findings: On memorized datasets (Panel A, Table 3), LLMs (especially GPT-4) achieve impressive performance, sometimes outperforming traditional models like Logistic Regression (LR) and Gradient Boosted Trees (GBT) in the Original format. However, performance consistently drops as the format changes from Original to Perturbed, Task, and Statistical. For GPT-4, the average accuracy drop from Original to Task format on memorized datasets is 6 percentage points. This sensitivity to formatting strongly suggests that performance on these datasets is inflated due to memorization and subsequent overfitting.
    • On novel datasets (Panel B, Table 3), LLMs perform reasonably well but generally do not outperform LR or GBT. Crucially, the performance on novel datasets is much more robust to format changes between Original, Perturbed, and Task formats. This indicates that LLMs are not simply pattern-matching exact strings from their training data on novel tasks.
  4. Role of World Knowledge vs. Statistical Learning: The significant performance drop observed between the Task format (which retains meaningful feature names and scales) and the Statistical format (which uses generic names and standardized values) on most novel datasets indicates that LLMs heavily rely on their "world knowledge" associated with feature names and value scales for effective few-shot learning. This effect is less pronounced on synthetic datasets like Spaceship Titanic.
  5. Limited In-Context Statistical Learning Ability: The authors specifically test the LLMs' ability to act as pure statistical predictors on a synthetic linear classification task with standardized numerical features (the Statistical format); a sketch of such a synthetic task appears after the practical implications below.
    • GPT-3.5 and GPT-4 perform better than random but struggle as the feature dimension increases, deteriorating faster than LR or 1-Nearest Neighbor.
    • GPT-4's performance improves monotonically with the number of few-shot examples, showing some ability to learn statistical patterns in-context, though it remains less efficient than LR. GPT-3.5 shows weaker scaling with few-shot examples. This contrasts with fine-tuning approaches like TabLLM, which scale well with data size.
  6. Drawing Samples from Memorized Data: The paper demonstrates that LLMs can draw random samples from datasets they have seen during training without specific fine-tuning. By conditioning GPT-3.5 on the dataset name and feature names alone, the model can generate samples that follow the overall data distribution and feature correlations (e.g., California Housing), even when the samples are not direct copies of training rows. The diversity of the samples can be controlled with the temperature parameter (an illustrative prompt sketch appears just before the closing summary).
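
A minimal sketch of how the row completion test could be implemented is given below. It assumes a generic query_llm(prompt) helper that wraps whatever chat-completion API is under test; the helper, the prompt wording, and the default parameters are illustrative, not the paper's exact setup.

    import random

    def row_completion_test(csv_path, query_llm, num_prefix_rows=10, num_trials=25):
        # Heuristic memorization check: show the model a block of contiguous rows
        # from a random position and ask for the next row verbatim. A non-trivial
        # exact-match rate is evidence the file was seen during training.
        with open(csv_path) as f:
            rows = [line.rstrip("\n") for line in f if line.strip()]
        header, body = rows[0], rows[1:]
        exact_matches = 0
        for _ in range(num_trials):
            start = random.randint(0, len(body) - num_prefix_rows - 1)
            prefix = body[start:start + num_prefix_rows]
            target = body[start + num_prefix_rows]
            prompt = ("Complete the next row of this CSV file verbatim.\n"
                      + header + "\n" + "\n".join(prefix) + "\n")
            completion = (query_llm(prompt).strip().splitlines() or [""])[0]
            exact_matches += int(completion == target)
        return exact_matches / num_trials

The header, feature completion, and first token tests follow the same pattern, differing only in which part of the file is withheld and what counts as a match.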

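The few-shot classification prompts described in item 3 serialize each record as "Feature Name = Feature Value" lines. A rough sketch of that serialization follows; the function names and the absence of surrounding instruction text are simplifications, not the paper's exact prompts.

    import pandas as pd

    def serialize_example(row: pd.Series, target: str) -> str:
        # Render one record as "Feature Name = Feature Value" lines, label last.
        lines = [f"{name} = {value}" for name, value in row.items() if name != target]
        lines.append(f"{target} = {row[target]}")
        return "\n".join(lines)

    def build_few_shot_prompt(shots: pd.DataFrame, query: pd.Series, target: str) -> str:
        # Labeled few-shot examples first, then the query record with the label left blank.
        examples = "\n\n".join(serialize_example(r, target) for _, r in shots.iterrows())
        query_lines = [f"{name} = {value}" for name, value in query.items() if name != target]
        return examples + "\n\n" + "\n".join(query_lines) + f"\n{target} ="

In the paper's setup, shots would hold 20 label-stratified training rows and the completion would be requested at temperature 0.
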
Practical Implications and Implementation:

  • Evaluation Rigor: The findings underscore the critical importance of checking for dataset contamination when evaluating LLMs on specific tasks, especially for popular datasets. Evaluating on memorized data can lead to highly inflated performance estimates due to overfitting, not true few-shot generalization.
  • tabmemcheck Tool: The authors provide an open-source Python package, tabmemcheck, implementing the developed memorization tests. This tool allows practitioners and researchers to check if an LLM has memorized a specific tabular dataset before using it for evaluation or deployment on that data.
  • Prompt Engineering: The "Feature Name = Feature Value" format is a demonstrated effective way to prompt LLMs for tabular tasks. However, the paper highlights that the specific wording (feature names, categorical value encoding) significantly impacts performance on novel data, suggesting the need for careful prompt design when domain knowledge is involved.
  • Performance Considerations: The few-shot learning experiments, particularly with GPT-4, can be computationally expensive, requiring many queries to the LLM API. This is a practical consideration for extensive evaluations.
  • Scaling Limitations: While LLMs can perform statistical tasks in-context, their ability to learn complex statistical relationships purely from examples seems limited compared to traditional methods, especially in higher dimensions. Fine-tuning or combining LLMs with classical statistical models might be more effective for many real-world tabular tasks.
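
To get a feel for the in-context statistical learning experiment in item 5, one can generate a synthetic linear classification problem with standardized features, serialize it in the Statistical format (generic names X1, ..., Xd), and compare the LLM against logistic regression fit on the same handful of examples. The snippet below is a sketch under those assumptions, not the paper's exact protocol.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    d, n_shots, n_test = 8, 20, 100             # feature dimension, in-context examples, test points

    w = rng.normal(size=d)                      # random linear decision boundary
    X = rng.normal(size=(n_shots + n_test, d))  # standardized features (zero mean, unit variance)
    y = (X @ w > 0).astype(int)
    X_train, y_train = X[:n_shots], y[:n_shots]
    X_test, y_test = X[n_shots:], y[n_shots:]

    # Traditional baseline fit on the same few examples.
    lr_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

    # "Statistical" serialization with generic feature names.
    def serialize(x, label=None):
        lines = [f"X{i + 1} = {v:.2f}" for i, v in enumerate(x)]
        lines.append("Y =" if label is None else f"Y = {label}")
        return "\n".join(lines)

    prompt = "\n\n".join(serialize(x, int(l)) for x, l in zip(X_train, y_train))
    prompt += "\n\n" + serialize(X_test[0])     # query the LLM at temperature 0 and parse the predicted label

Sweeping d and the number of in-context examples mirrors the paper's comparison, in which LLM accuracy deteriorated with dimension faster than the logistic regression baseline.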

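The conditional-sampling experiment in item 6 can be probed with a prompt along the following lines; this is an illustrative guess at the setup, again assuming a generic query_llm helper (here accepting a temperature argument), not the paper's exact prompt.

    def draw_samples(dataset_name, feature_names, query_llm, n=5, temperature=0.7):
        # Ask the model for rows conditioned only on the dataset name and header.
        prompt = (f"You know the {dataset_name} dataset. Its columns are: "
                  + ", ".join(feature_names) + ".\n"
                  + f"Generate {n} random samples from the dataset as comma-separated "
                  + "values, one sample per line, in the same column order.")
        return query_llm(prompt, temperature=temperature).strip().splitlines()[:n]

    # e.g. draw_samples("California Housing", california_housing_columns, query_llm)

Raising the temperature increases the diversity of the generated rows, in line with the observation above.
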
In summary, the paper provides strong evidence of significant verbatim memorization of popular tabular datasets by state-of-the-art LLMs, demonstrates how this leads to inflated few-shot performance estimates due to overfitting, and highlights that LLMs rely heavily on world knowledge encoded in variable names and formats for novel tabular tasks, with limited pure in-context statistical learning abilities. The release of tabmemcheck offers a practical tool to mitigate contamination issues in future research and applications.

Authors (5)
  1. Sebastian Bordt
  2. Harsha Nori
  3. Vanessa Rodrigues
  4. Besmira Nushi
  5. Rich Caruana