
Data Portraits: Recording Foundation Model Training Data (2303.03919v2)

Published 6 Mar 2023 in cs.LG and cs.CL

Abstract: Foundation models are trained on increasingly immense and opaque datasets. Even while these models are now key in AI system building, it can be difficult to answer the straightforward question: has the model already encountered a given example during training? We therefore propose the widespread adoption of Data Portraits: artifacts that record training data and allow for downstream inspection. First we outline the properties of such an artifact and discuss how existing solutions can be used to increase transparency. We then propose and implement a solution based on data sketching, stressing fast and space efficient querying. Using our tools, we document a popular language modeling corpus (The Pile) and a recently released code modeling dataset (The Stack). We show that our solution enables answering questions about test set leakage and model plagiarism. Our tool is lightweight and fast, costing only 3% of the dataset size in overhead. We release a live interface of our tools at https://dataportraits.org/ and call on dataset and model creators to release Data Portraits as a complement to current documentation practices.

Authors (2)
  1. Marc Marone
  2. Benjamin Van Durme

Summary

  • The paper introduces Data Portraits, a framework using Bloom filters to efficiently record and query model training data membership.
  • It demonstrates that the method adds only about 3% of the dataset size in storage overhead while enabling the detection of test set leakage and potential plagiarism.
  • The study advocates integrating Data Portraits into standard documentation practices to enhance transparency and accountability in AI research.

An Evaluation of Data Portraits in Foundation Model Transparency

The paper "Data Portraits: Recording Foundation Model Training Data" by Marc Marone and Benjamin Van Durme proposes a methodological advancement aimed at enhancing transparency in foundation models by recording their training data. With the increasing scale and opacity of datasets used in training state-of-the-art AI models, understanding whether a given example was part of the training data becomes non-trivial. This paper introduces Data Portraits as a viable solution to this problem.

Summary of Contributions

The research delineates the concept of Data Portraits, a framework designed for the inspection and recording of training data. It discusses the critical feature of membership inference, whereby one can determine if a specific example is included in the training set of a model. Data Portraits can serve various stakeholders, including content creators, scientists, and consumers, each of whom may have different motivations for understanding the composition of a dataset used by a foundation model.

A significant contribution of the paper is the implementation of a data sketching technique based on Bloom filters. This technique enables fast, space-efficient querying of training data at a modest cost of roughly 3% of the dataset size in storage overhead. Furthermore, the authors present an empirical analysis of their tool on two extensive corpora, The Pile and The Stack, demonstrating its utility in detecting issues such as test set leakage and potential plagiarism by models. The authors also advocate for the integration of Data Portraits into standard documentation practices for datasets and models.
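
To make the mechanism concrete, here is a minimal sketch of the kind of Bloom filter that could underpin a Data Portrait. It is illustrative only, not the authors' released implementation: the class name, sizing formulas, and double-hashing scheme are standard textbook choices rather than details taken from the paper.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size bit array supporting probabilistic set membership."""

    def __init__(self, n_items: int, fp_rate: float = 0.01):
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.m = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # force odd step
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return false positives, never false negatives.
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))
```

Because a Bloom filter can return false positives but never false negatives, a portrait can always safely report "definitely unseen," which is the useful direction for leakage checks.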

Analytical Insights

This paper provides an insightful examination of how Data Portraits could improve dataset transparency within AI research and applications. Data sketching using Bloom filters stands out as a notable approach due to its space efficiency and speed for membership queries. The authors detail how Bloom filters record strided n-grams, allowing efficient membership inference with minimal latency.
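
The following sketch illustrates the strided-recording, stride-1-querying pattern described above, building on the BloomFilter sketch earlier. The width and stride values are assumptions for illustration, not the parameters used in the paper.

```python
WIDTH = 50   # characters per recorded gram (assumed, for illustration)
STRIDE = 50  # offset between recorded grams (assumed, for illustration)


def record_document(portrait: BloomFilter, text: str) -> None:
    # Store grams starting at every multiple of STRIDE; texts shorter
    # than WIDTH contribute nothing.
    for start in range(0, len(text) - WIDTH + 1, STRIDE):
        portrait.add(text[start:start + WIDTH])


def query_overlap(portrait: BloomFilter, text: str) -> float:
    # Slide a stride-1 window so any recorded gram inside `text` is hit,
    # and report the fraction of windows flagged as seen during training.
    grams = [text[i:i + WIDTH] for i in range(len(text) - WIDTH + 1)]
    if not grams:
        return 0.0
    return sum(g in portrait for g in grams) / len(grams)
```

With stride equal to width, only query substrings long enough to span a full recorded gram are guaranteed to be detected; smaller strides increase sensitivity at the cost of storing more grams.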

A noteworthy element is the exploration of how Data Portraits can address the perennial issues of dataset contamination, test set leakage, and memorization in models like GPT-3. Specifically, the paper's findings underscore how easily test set integrity can be compromised, exemplified by the overlap between WMT test sets and The Pile. The implications of such overlap are profound, opening discussions about the reliability and credibility of trained foundation models in real-world applications.
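
One way such a contamination check might be wired up is sketched below, reusing the helpers above. Here load_pile_sample and wmt_test_sentences are hypothetical stand-ins for real data loaders, and the 0.5 overlap threshold is an arbitrary illustrative choice, not a value from the paper.

```python
# Build a portrait from (a sample of) the training corpus, then flag
# test sentences whose character grams heavily overlap it.
portrait = BloomFilter(n_items=10_000_000, fp_rate=0.01)

for doc in load_pile_sample():              # hypothetical corpus iterator
    record_document(portrait, doc)

flagged = [s for s in wmt_test_sentences()  # hypothetical test-set loader
           if query_overlap(portrait, s) > 0.5]
print(f"{len(flagged)} candidate leaked test sentences")
```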

Practical Impact and Theoretical Implications

Practically, Data Portraits give model developers and users a tangible mechanism to check whether a foundation model relies on unauthorized or proprietary content or reproduces previously seen examples verbatim. This capability supports more robust and ethically sound AI models and helps demonstrate adherence to data protection regulations.

Theoretically, Data Portraits propose a framework that can stimulate further research into developing transparent, accountable, and privacy-preserving machine learning artifacts. While Bloom filters provide an effective balance of space efficiency and processing speed, future work could explore alternative data structures that might offer enhanced functionalities such as fuzzy matching or direct context retrieval without compromising privacy.
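
To illustrate one such alternative, the sketch below computes MinHash signatures (Broder, 1997), which trade exact membership for approximate Jaccard-similarity estimates between documents. All names and parameters are illustrative assumptions, and it presumes inputs longer than the shingle width.

```python
import hashlib


def minhash_signature(text: str, width: int = 5, num_hashes: int = 64) -> list[int]:
    # Shingle the text into character n-grams, then keep the minimum hash
    # value per seeded hash function.
    grams = {text[i:i + width] for i in range(len(text) - width + 1)}
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(16, "little")  # BLAKE2b salts are <= 16 bytes
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8,
                                salt=salt).digest(),
                "little")
            for g in grams))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of agreeing minima estimates the Jaccard similarity
    # of the underlying shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Comparing two signatures with estimated_jaccard yields a fuzzy overlap score, at the cost of storing a per-document signature rather than a single shared bit array.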

Conclusion

In conclusion, the paper introduces Data Portraits as a critical tool for increasing transparency in the era of large-scale model training. By enabling stakeholders to query training datasets efficiently, it offers a path forward in documenting and understanding the intricate datasets that fuel today's AI advancements. Moving forward, such tools can inform discussions around model development, deployment transparency, and ethical AI practices. While preliminary, the focus on Data Portraits is a meaningful step towards a more transparent ecosystem in AI research and its applications.
