- The paper reveals that ChatGPT and GPT-4 memorize a wide range of copyrighted literature, with memorization linked to the frequency of online exposure.
- It probes 571 works of fiction with a name cloze membership inference query to measure memorization, highlighting measurement-validity challenges in cultural analytics.
- The research advocates for transparent training data to mitigate biases and ensure ethical, comprehensive representation in language models.
An Archaeological Study of Memorized Literature in LLMs
The paper "Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4" investigates the memorized content within two prominent LLMs, ChatGPT and GPT-4, specifically focusing on their knowledge of various literary works. This study, conducted by researchers from the University of California, Berkeley, and Emory University, employs a "data archaeology" approach using a name cloze membership inference query to probe a range of 571 fiction works, scrutinizing the presence and memorization extent of these texts in the models.
The researchers find that OpenAI's models have memorized an extensive array of copyrighted books, and that the degree of memorization tracks how frequently passages from a book appear online. This memorization undermines measurement validity in cultural analytics: when a model is evaluated on texts it has memorized, it is hard to tell whether it is reasoning about the text or simply recalling it. The effect shows up as a pronounced performance gap, with OpenAI's models markedly more accurate on memorized books than on non-memorized ones in downstream literary tasks.
A notable aspect of this research is its advocacy for open models whose training data is transparent and known. Such an approach, the authors argue, would ease problems of bias and validity in cultural analytics by making clear exactly which texts a model was exposed to during training.
Key findings show that GPT-4 in particular has memorized a diverse array of copyrighted material, skewed toward widely popular and frequently reproduced works: science fiction and fantasy novels, public domain classics, and contemporary bestsellers. This skew in model knowledge matches how often passages from these books are replicated on the web, as estimated from Google and Bing search results and from occurrence counts in datasets like C4, confirming that heavier duplication facilitates memorization.
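A rough version of such a duplication estimate could stream a public corpus and count verbatim passage hits, as sketched below. The use of the `allenai/c4` dataset on Hugging Face, the document cap, and the helper name are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: estimate how often distinctive passages from a book appear
# in C4 by streaming the corpus and counting exact substring matches.
# Assumes the Hugging Face `datasets` library and the `allenai/c4`
# dataset; the passages would come from the book being probed.
from datasets import load_dataset

def count_passage_hits(passages: list[str], max_docs: int = 1_000_000) -> int:
    """Count streamed C4 documents containing any passage verbatim.

    Streaming avoids downloading all of C4; capping at `max_docs`
    keeps the scan tractable for a rough duplication estimate.
    """
    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
    hits = 0
    for i, doc in enumerate(c4):
        if i >= max_docs:
            break
        if any(p in doc["text"] for p in passages):
            hits += 1
    return hits
```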
The authors further investigate the effect of this memorization on downstream tasks by testing the models' ability to predict attributes such as the year of first publication and the passage of narrative time. These analyses show that the models perform best on texts they have memorized, raising concerns about their reliability on literature that is underrepresented in their training data.
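A simple way to surface this disparity, sketched below under assumed record fields, is to bucket books by a memorization flag (for instance, name cloze accuracy above some threshold) and score a downstream prediction such as year of first publication within each bucket; the field names and tolerance here are hypothetical, not the paper's.

```python
# Sketch: compare downstream accuracy on memorized vs. non-memorized
# books, using year-of-first-publication prediction as the task.
# Record fields ('true_year', 'predicted_year', 'memorized') are
# assumptions for illustration.
from statistics import mean

def evaluate_by_memorization(books: list[dict], tolerance: int = 10) -> dict:
    """Score year predictions separately for each memorization group.

    'predicted_year' would come from prompting the model (e.g., "In
    what year was <title> first published?"); a prediction counts as
    correct if it lands within `tolerance` years of the truth.
    """
    def correct(b: dict) -> bool:
        return abs(b["predicted_year"] - b["true_year"]) <= tolerance

    groups = {
        "memorized": [b for b in books if b["memorized"]],
        "non_memorized": [b for b in books if not b["memorized"]],
    }
    # mean() over booleans gives the fraction of correct predictions.
    return {
        name: mean(correct(b) for b in group) if group else None
        for name, group in groups.items()
    }
```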
The paper also examines copyrighted works in BERT's training data, finding that BookCorpus, one of BERT's training sources, contains books that are still in copyright despite being widely recognized. Using such works without explicit permission raises questions of legality and ethics, and points to broader problems in how datasets are compiled for machine learning.
In addressing the implications of these findings, the authors stress the importance of open and transparent LLMs for cultural analytics and beyond. Future AI development may accordingly prioritize datasets that give fair and comprehensive representation to diverse narratives, mitigating the biases ingrained in historically dominant and popular content.
This research paves the way for further empirical work mapping the information landscape inhabited by LLMs, which is crucial for accurately evaluating their broader applications and potential biases. Future work could extend the methodology to non-English datasets, deepening our understanding of LLM capabilities across cultural domains.