Not All Similarities Are Created Equal: Leveraging Data-Driven Biases to Inform GenAI Copyright Disputes (2403.17691v2)
Abstract: The advent of Generative Artificial Intelligence (GenAI) models, including GitHub Copilot, OpenAI GPT, and Stable Diffusion, has revolutionized content creation, enabling non-professionals to produce high-quality content across various domains. This transformative technology has led to a surge of synthetic content and sparked legal disputes over copyright infringement. To address these challenges, this paper introduces a novel approach that leverages the learning capacity of GenAI models for copyright legal analysis, demonstrated with GPT-2 and Stable Diffusion models. Copyright law distinguishes between original expressions and generic ones (scènes à faire), protecting the former and permitting reproduction of the latter. However, this distinction has historically been challenging to make consistently, leading to over-protection of copyrighted works. GenAI offers an unprecedented opportunity to enhance this legal analysis by revealing shared patterns in preexisting works. We propose a data-driven approach to identifying the genericity of works created by GenAI, employing "data-driven bias" to assess the genericity of expressive compositions. This approach aids in copyright scope determination by utilizing the capabilities of GenAI to identify and prioritize expressive elements and rank them according to their frequency in the model's dataset. The potential implications of measuring expressive genericity for copyright law are profound. Such scoring could assist courts in determining copyright scope during litigation, inform the registration practices of Copyright Offices by allowing registration of only highly original synthetic works, and help copyright owners signal the value of their works and facilitate fairer licensing deals. More generally, this approach offers valuable insights to policymakers grappling with adapting copyright law to the challenges posed by the era of GenAI.
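The abstract proposes ranking expressive elements by how common they are in a model's training distribution. As an illustration only, not the authors' pipeline, the minimal sketch below assumes one proxy for such a genericity score: the average per-token log-likelihood a passage receives under an off-the-shelf GPT-2, where higher (less negative) values indicate phrasing the model finds more predictable and, under this assumption, closer to scènes à faire. The helper name `genericity_score` and the choice of proxy are hypothetical.

```python
# Illustrative sketch only: a hypothetical genericity proxy based on GPT-2
# log-likelihood, not the paper's actual methodology.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def genericity_score(text: str) -> float:
    """Mean per-token log-likelihood of `text` under GPT-2.

    Higher (less negative) values mean the passage is more predictable to
    the model, which this sketch treats as a rough proxy for more generic
    expression.
    """
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels == input_ids makes the model return the mean
        # cross-entropy over predicted tokens; negate it to obtain the
        # mean log-likelihood.
        out = model(**enc, labels=enc["input_ids"])
    return -out.loss.item()

# A stock phrase would typically score as more "generic" than a
# distinctive one under this proxy.
print(genericity_score("It was a dark and stormy night."))
print(genericity_score("The moon unspooled itself across the marmalade harbor."))
```

Comparing such scores across candidate elements of a work could yield the kind of genericity ranking the abstract envisions, though any court-facing metric would require calibration well beyond a raw language-model likelihood.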
Authors: Uri Hacohen, Adi Haviv, Shahar Sarfaty, Bruria Friedman, Niva Elkin-Koren, Roi Livni, Amit H. Bermano