
An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets (2403.15230v1)

Published 22 Mar 2024 in cs.SE and cs.LG

Abstract: Does the training of LLMs potentially infringe upon code licenses? Furthermore, are there any datasets available that can be safely used for training these models without violating such licenses? In our study, we assess the current trends in the field and the importance of incorporating code into the training of LLMs. Additionally, we examine publicly available datasets to see whether these models can be trained on them without the risk of legal issues in the future. To accomplish this, we compiled a list of 53 LLMs trained on file-level code. We then extracted their datasets and analyzed how much they overlap with a dataset we created, consisting exclusively of strong copyleft code. Our analysis revealed that every dataset we examined contained license inconsistencies, despite being selected based on their associated repository licenses. We analyzed a total of 514 million code files, discovering 38 million exact duplicates present in our strong copyleft dataset. Additionally, we examined 171 million file-leading comments, identifying 16 million with strong copyleft licenses and another 11 million comments that discouraged copying without explicitly mentioning a license. Based on the findings of our study, which highlights the pervasive issue of license inconsistencies in LLMs trained on code, our recommendation for both researchers and the community is to prioritize the development and adoption of best practices for dataset creation and management.
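The abstract describes two measurements: exact-duplicate detection between LLM training corpora and a strong copyleft dataset, and license classification of file-leading comments. The paper does not give its implementation details here, so the following Python sketch only illustrates the general idea under assumed details: exact duplicates found by hashing file contents, and leading comments matched against a few hypothetical copyleft and "do not copy" patterns. The marker patterns, helper names, and toy corpora are all illustrative, not the authors' method.

```python
import hashlib
import re

# Illustrative patterns only; the paper does not spell out its matching rules
# here, so these license markers are assumptions made for this sketch.
STRONG_COPYLEFT_PATTERNS = [
    r"GNU (Affero )?General Public License",
    r"\bA?GPL(v[23])?\b",
]
COPY_DISCOURAGING_PATTERNS = [
    r"do not (copy|distribute|redistribute)",
    r"all rights reserved",
]

def content_hash(source: str) -> str:
    """Hash the raw file contents so exact duplicates collide."""
    return hashlib.sha256(source.encode("utf-8", errors="replace")).hexdigest()

def leading_comment(source: str, max_lines: int = 30) -> str:
    """Return the run of comment (or blank) lines at the top of a file."""
    kept = []
    for line in source.splitlines()[:max_lines]:
        stripped = line.strip()
        if not stripped or stripped.startswith(("//", "#", "/*", "*", '"""', "'''")):
            kept.append(stripped)
        else:
            break
    return "\n".join(kept)

def classify_header(comment: str) -> str:
    """Label a file-leading comment by the licensing signal it carries."""
    if any(re.search(p, comment, re.IGNORECASE) for p in STRONG_COPYLEFT_PATTERNS):
        return "strong_copyleft"
    if any(re.search(p, comment, re.IGNORECASE) for p in COPY_DISCOURAGING_PATTERNS):
        return "discourages_copying"
    return "no_license_signal"

# Toy stand-ins for the strong copyleft dataset and an LLM training corpus.
copyleft_file = "/* GNU General Public License v3 */\nint main(void) { return 0; }\n"
training_file = copyleft_file  # an exact duplicate that slipped into training data

copyleft_hashes = {content_hash(copyleft_file)}
print(content_hash(training_file) in copyleft_hashes)    # True -> exact duplicate
print(classify_header(leading_comment(training_file)))   # "strong_copyleft"
```

At the paper's scale (514 million files, 171 million leading comments), the same idea would be run as a distributed hash join plus comment scan rather than an in-memory loop, but the classification logic is the part the abstract's counts depend on.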
