Elephants Never Forget: Testing Language Models for Memorization of Tabular Data (2403.06644v1)

Published 11 Mar 2024 in cs.LG and cs.CL

Abstract: While many have shown how LLMs can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Starting with simple qualitative tests for whether an LLM knows the names and values of features, we introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization. Our investigation reveals that LLMs are pre-trained on many popular tabular datasets. This exposure can lead to invalid performance evaluation on downstream tasks because the LLMs have, in effect, been fit to the test set. Interestingly, we also identify a regime where the LLM reproduces important statistics of the data but fails to reproduce the dataset verbatim. Although these datasets were seen during training, good performance on downstream tasks might not be due to overfitting. Our findings underscore the need for ensuring data integrity in machine learning tasks with LLMs. To facilitate future research, we release an open-source tool that can perform various tests for memorization: https://github.com/interpretml/LLM-Tabular-Memorization-Checker
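One of the memorization tests described in the abstract checks whether a model can complete dataset rows verbatim. The following is a minimal sketch of the scoring side of such a row-completion check; `query_model` would be an LLM call and is purely hypothetical here, so only the string-similarity logic is shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def row_completion_score(true_suffix: str, model_suffix: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 indicates verbatim reproduction
    of the held-out portion of a dataset row."""
    denom = max(len(true_suffix), len(model_suffix), 1)
    return 1.0 - levenshtein(true_suffix, model_suffix) / denom

# Usage sketch: prompt the model with the leading fields of a row and score
# its completion against the true remainder. A score near 1.0 across many
# rows suggests the dataset was memorized during pre-training.
# completion = query_model(prompt=row_prefix)   # hypothetical LLM call
# score = row_completion_score(row_suffix, completion)
```

Approximate rather than exact matching matters here because tokenization and formatting can perturb an otherwise memorized row; the paper's released checker tool implements the full battery of tests.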

Authors (3)
  1. Sebastian Bordt
  2. Harsha Nori
  3. Rich Caruana
Citations (10)