
Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG (2406.13069v3)

Published 18 Jun 2024 in cs.CL and cs.AI

Abstract: How novel are texts generated by LLMs (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily large $n$). To enable arbitrary-length $n$-gram search over a corpus in constant time w.r.t. corpus size, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for $n > 4$, LM-generated text is less novel than human-written text, though it is more novel for smaller $n$. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete $n$-grams with lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.

Evaluating $n$-Gram Novelty of LLMs Using Rusty-DAWG

The paper "Evaluating $n$-Gram Novelty of LLMs Using Rusty-DAWG" introduces a structured approach to analyzing how novel the text generated by modern LLMs (LMs) is relative to their training data. The analysis is enabled by a novel search tool, Rusty-DAWG, designed for efficient $n$-gram search over extensive corpora.

Summary of Findings

The primary objective of this paper is to quantify how novel the text generated by LMs is relative to human-written text and to the training dataset. The research explores several pivotal aspects, with the main findings outlined as follows:

  1. Efficiency of Rusty-DAWG: The paper introduces Rusty-DAWG, an efficient data structure based on the Compacted Directed Acyclic Word Graph (CDAWG) automaton, enabling unbounded-length $n$-gram searches over large pretraining datasets in constant time with respect to corpus size and linear time in query length.
  2. Novelty Comparison: In generated text, large $n$-grams ($n > 4$) were found to be less novel compared to human-written text from validation sets. For smaller $n$-grams ($n \leq 4$), generated text was slightly more novel. Specifically, for Pythia-12B, generated bigrams were 8% novel versus 5% for Dolma’s human-written text, whereas 10-grams were 93% novel in generated text versus 98% in Dolma.
  3. Influence of Model Size: The paper shows that larger LMs (e.g., Pythia-12B compared to smaller models) tend to generate less novel text across all $n$-gram sizes. This indicates a trend where increasing model size potentially leads to increased memorization or copying from the training data.
  4. Decoding Strategies: The choice of decoding strategy significantly impacts the novelty of the generated text. More constrained decoding methods, such as beam search and low-temperature sampling, resulted in less novel text. For instance, a temperature of 0.5 led to much lower novelty in generated 100-grams compared to more stochastic decoding choices.
  5. Prompting Effects: Including prompts from the training data marginally decreased the novelty of the generated text, particularly influencing the longest non-novel $n$-grams. The increase in the mean non-novel suffix length (nnsl) from 6.19 to 7.56 when prompting with 100 tokens substantiates this observation.
  6. Completion Loss: The paper also evaluated how LMs complete $n$-grams, finding that $n$-grams that appear in the training set are completed with significantly lower loss across various values of $n$, and that loss decreases further with the frequency of the $n$-gram in the training data, indicating a sensitivity of LMs to training data frequency.
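To make the metrics in points 2 and 5 concrete, the sketch below computes $n$-novelty and mean non-novel suffix length (nnsl) with a naive quadratic search. This is an illustrative reconstruction of the definitions, not the paper's code: Rusty-DAWG exists precisely to answer these queries in time independent of corpus size.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def occurs(span, corpus):
    """Naive check: does `span` occur contiguously in `corpus`?"""
    k = len(span)
    return any(tuple(corpus[i:i + k]) == span for i in range(len(corpus) - k + 1))

def n_novelty(generated, corpus, n):
    """Fraction of generated n-grams that never appear in the corpus."""
    seen = set(ngrams(corpus, n))
    gen = ngrams(generated, n)
    return sum(g not in seen for g in gen) / len(gen) if gen else 0.0

def mean_nnsl(generated, corpus):
    """Mean non-novel suffix length: for each position in the generated
    text, the length of the longest suffix ending there that occurs
    verbatim in the corpus, averaged over positions."""
    lengths = []
    for i in range(len(generated)):
        L = 0
        while L < i + 1 and occurs(tuple(generated[i - L:i + 1]), corpus):
            L += 1
        lengths.append(L)
    return sum(lengths) / len(lengths)
```

For example, with corpus `"the cat sat on the mat"` and generation `"the cat ran"`, the bigram novelty is 0.5 (only `("cat", "ran")` is new), and longer non-novel suffixes drive the nnsl up toward the copied span length.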

Implications and Future Developments

Practical Implications:

  • Legal Considerations: The novelty of LM-generated text may play a crucial role in legal settings, such as the ongoing discussions about the fair use of copyrighted materials in training data. By quantifying the overlap between generated and training text, this paper supports informed debates on intellectual property issues.
  • Model Evaluation and Improvement: Understanding how different variables, such as model size and decoding strategies, influence generated text novelty can inform best practices in LM deployment and tuning.

Theoretical Implications:

  • Memorization vs. Generalization: This research positions itself within broader discussions on the balance between memorization and generalization in LMs. The findings—particularly that larger models and more constrained decoding produce less novel text—add to the evidence that LMs may rely heavily on memorization when trained on large datasets.
  • Data Efficiency: Rusty-DAWG introduces a new level of efficiency in analyzing large-scale text data, which can inspire further development of similar tools for large corpus analysis in computational linguistics and other AI fields.
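As a rough illustration of this class of index, a plain suffix automaton (of which the CDAWG is a compacted variant) already answers substring-membership queries in time proportional to the query length, independent of corpus size. The sketch below is a standard textbook construction, not Rusty-DAWG itself, which is implemented in Rust and adds the compaction and count-tracking needed at pretraining scale.

```python
class SuffixAutomaton:
    """Minimal suffix automaton over a sequence. After O(corpus) build
    time, `contains(q)` runs in O(len(q)), independent of corpus size."""

    def __init__(self, tokens):
        self.trans = [{}]    # per-state transition maps
        self.link = [-1]     # suffix links
        self.length = [0]    # length of longest string reaching each state
        self.last = 0
        for t in tokens:
            self._extend(t)

    def _extend(self, t):
        cur = len(self.trans)
        self.trans.append({})
        self.length.append(self.length[self.last] + 1)
        self.link.append(-1)
        p = self.last
        # Add the new transition along the suffix-link chain.
        while p != -1 and t not in self.trans[p]:
            self.trans[p][t] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.trans[p][t]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Clone q so suffix-link lengths stay consistent.
                clone = len(self.trans)
                self.trans.append(dict(self.trans[q]))
                self.length.append(self.length[p] + 1)
                self.link.append(self.link[q])
                while p != -1 and self.trans[p].get(t) == q:
                    self.trans[p][t] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def contains(self, query):
        """True iff `query` occurs contiguously in the indexed sequence."""
        state = 0
        for t in query:
            if t not in self.trans[state]:
                return False
            state = self.trans[state][t]
        return True
```

The same structure works over characters or token IDs; the compacted (CDAWG) form collapses chains of single-child states, which is what keeps memory manageable for corpora with hundreds of billions of tokens.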

Future Research Directions:

  • Scaling Across Languages: Extending this approach to analyze LMs trained on non-English corpora could reveal language-specific patterns in memorization and novelty.
  • Semantic Novelty: Future work could explore methodologies to assess the semantic and syntactic novelty of generated text, complementing the verbatim novelty metrics discussed in this paper.
  • Algorithmic Optimizations: Further improvements in the memory efficiency of indices like the CDAWG could enhance their applicability to even larger datasets or more resource-constrained environments.

In conclusion, this paper provides a nuanced understanding of the novelty in LM-generated text and presents Rusty-DAWG as a potent tool for such analysis. The findings have substantial implications for both the deployment and ethical considerations surrounding modern LMs, setting a strong foundation for future advancements in the domain of AI and natural language processing.

Authors (3)
  1. William Merrill (36 papers)
  2. Noah A. Smith (224 papers)
  3. Yanai Elazar (44 papers)
Citations (6)