Emergent Mind

Data Portraits: Recording Foundation Model Training Data

Published Mar 6, 2023 in cs.LG and cs.CL


Foundation models are trained on increasingly immense and opaque datasets. Although these models are now central to building AI systems, it can be difficult to answer a straightforward question: has the model already encountered a given example during training? We therefore propose widespread adoption of Data Portraits: artifacts that record training data and allow for downstream inspection. First, we outline the properties of such an artifact and discuss how existing solutions can be used to increase transparency. We then propose and implement a solution based on data sketching, emphasizing fast, space-efficient querying. Using our tools, we document a popular language modeling corpus (The Pile) and a recently released code modeling dataset (The Stack). We show that our solution enables answering questions about test set leakage and model plagiarism. Our tool is lightweight and fast, costing only 3% of the dataset size in overhead. We release a live interface of our tools at https://dataportraits.org/ and call on dataset and model creators to release Data Portraits as a complement to current documentation practices.
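To make the idea of a sketch-based membership query concrete, here is a minimal illustration in Python. It is not the authors' implementation: the abstract says only that the solution is "based on data sketching", so the choice of a Bloom filter over fixed-width character n-grams, and every name and parameter below (`BloomFilter`, `build_portrait`, `overlap_fraction`, `width`, `num_bits`, `num_hashes`), is an assumption made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(text: str, width: int):
    """Yield every character n-gram of the given width."""
    for start in range(len(text) - width + 1):
        yield text[start:start + width]


def build_portrait(documents, width=20, num_bits=1 << 20, num_hashes=5):
    """Record every n-gram of the corpus in a sketch (hypothetical helper)."""
    sketch = BloomFilter(num_bits, num_hashes)
    for doc in documents:
        for gram in ngrams(doc, width):
            sketch.add(gram)
    return sketch


def overlap_fraction(sketch, query, width=20):
    """Fraction of the query's n-grams that the sketch reports as seen."""
    grams = list(ngrams(query, width))
    if not grams:
        return 0.0
    return sum(gram in sketch for gram in grams) / len(grams)


corpus = ["The quick brown fox jumps over the lazy dog near the river bank."]
portrait = build_portrait(corpus)
seen = overlap_fraction(portrait, "brown fox jumps over the lazy dog")
unseen = overlap_fraction(portrait, "An entirely different sentence about sketches.")
```

In this sketch, a query copied verbatim from the corpus always scores 1.0, because a Bloom filter never produces false negatives. Unrelated text scores near zero, with the occasional false positive governed by the assumed `num_bits` and `num_hashes`. The filter stores only bits rather than the text itself, which is the kind of trade-off that keeps such an artifact small relative to the dataset.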


