Determine whether training LLMs on copyrighted material is fair use in the U.S.

Determine whether training large language models on copyrighted material constitutes fair use under U.S. copyright law. Resolving this would clarify the legal status of using copyrighted works during pretraining and establish the conditions under which such training is lawful or infringing.

Background

The paper surveys memorization risks across domains and highlights legal uncertainty surrounding the use of copyrighted materials in pretraining LLMs. This uncertainty affects developers seeking to address copyright risks, as well as policymakers considering safe harbors and standards for fair learning practices.

By explicitly noting that the legality of training on copyrighted works is unsettled and subject to ongoing litigation, the authors flag a concrete open question whose resolution would have broad implications for dataset construction, memorization measurement, and mitigation strategies in LLM development.

References

In the U.S., whether training LLMs on copyrighted material is fair use remains uncertain and its legality will be determined by ongoing litigation.

— Hubble: a Model Suite to Advance the Study of LLM Memorization (2510.19811 - Wei et al., 22 Oct 2025) in Section 2.1 (Copyright)
