Optimize DE-COP Efficiency and Mitigate Selection Bias

Develop methods to improve the computational efficiency of the DE-COP framework ("Detecting Copyrighted Content in Language Models' Training Data") and to reduce selection bias in its evaluation procedure, enabling scalable, unbiased detection of memorized copyrighted content in large language models.

Background

DE-COP is a black-box-compatible framework that detects whether an LLM has memorized copyrighted content by asking the model to identify the verbatim original passage among paraphrased alternatives in a multiple-choice test. While it improves detection accuracy over statistical baselines, it is computationally intensive and has dataset issues noted in prior evaluations, including high per-book runtime and inconsistent paraphrase quality.

Selection bias can also arise from the ordering and construction of the multiple-choice options, potentially skewing results. Although the paper proposes mitigations such as permutation handling and structured evaluation prompts, achieving both high computational efficiency and minimal selection bias in DE-COP remains unresolved, motivating further research.
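The permutation-handling idea above can be sketched concretely. The snippet below is a minimal illustration, not the authors' implementation: `build_prompt` and the `ask_model` callable are hypothetical stand-ins for however the LLM is actually queried. It scores a multiple-choice question under every ordering of the options and averages, so that a positionally biased model cannot inflate (or deflate) the detection score. It also makes the efficiency problem visible: exhaustive permutation costs n! queries per passage.

```python
import itertools


def build_prompt(passages, labels):
    """Format a multiple-choice prompt from candidate passages (hypothetical format)."""
    lines = ["Which of the following passages is the verbatim original?"]
    lines += [f"{label}. {passage}" for label, passage in zip(labels, passages)]
    return "\n".join(lines)


def debiased_accuracy(ask_model, original, paraphrases):
    """Average the model's hit rate over every ordering of the options.

    Positional (selection) bias cancels out because each passage appears in
    each answer slot equally often. Cost: n! model queries per passage.
    """
    candidates = [original] + list(paraphrases)
    labels = "ABCDEFGH"[: len(candidates)]
    hits = total = 0
    for perm in itertools.permutations(candidates):
        answer = ask_model(build_prompt(perm, labels))  # returns a label, e.g. "B"
        correct_label = labels[perm.index(original)]
        hits += answer == correct_label
        total += 1
    return hits / total


if __name__ == "__main__":
    # Toy model that always answers "A" -- a maximally position-biased guesser.
    always_a = lambda prompt: "A"
    acc = debiased_accuracy(always_a, "original text", ["p1", "p2", "p3"])
    print(acc)  # 6 of 24 orderings place the original first -> 0.25, chance level
```

The toy run shows why the averaging matters: without it, an always-"A" guesser would score 100% whenever the original happened to be listed first. Reducing the n! cost (e.g., by sampling a subset of permutations) is exactly the efficiency/bias trade-off this problem targets.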

References

"However, optimizing its computational efficiency and reducing selection biases remains an open challenge for future work."

Copyright Detection in Large Language Models: An Ethical Approach to Generative AI Development (Szczecina et al., arXiv:2511.20623, 25 Nov 2025), Related Works subsection.