
Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG (2406.13069v3)

Published 18 Jun 2024 in cs.CL and cs.AI

Abstract: How novel are texts generated by LLMs (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily large $n$). To enable arbitrary-length $n$-gram search over a corpus in constant time w.r.t. corpus size, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for $n > 4$, LM-generated text is less novel than human-written text, though it is more novel for smaller $n$. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete $n$-grams with lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.

Evaluating $n$-Gram Novelty of LLMs Using Rusty-DAWG

The paper "Evaluating $n$-Gram Novelty of LLMs Using Rusty-DAWG" introduces a structured approach to analyzing how novel the text generated by modern LLMs (LMs) is relative to their training data. The analysis is enabled by a novel search tool, Rusty-DAWG, designed for efficient $n$-gram search over extensive corpora.

Summary of Findings

The primary objective of this paper is to quantify how novel the text generated by LMs is relative to human-written text and to the training dataset. The research explores several pivotal aspects, with the main findings outlined as follows:

  1. Efficiency of Rusty-DAWG: The paper introduces Rusty-DAWG, an efficient data structure based on the Compacted Directed Acyclic Word Graph (CDAWG) automaton, enabling unbounded-length $n$-gram searches over large pretraining datasets in constant time with respect to corpus size and linear time in query length.
  2. Novelty Comparison: In generated text, large $n$-grams ($n > 4$) were found to be less novel compared to human-written text from validation sets. For smaller $n$-grams ($n \leq 4$), generated text was slightly more novel. Specifically, for Pythia-12B, generated bigrams were 8% novel versus 5% for Dolma’s human-written text, whereas 10-grams were 93% novel in generated text versus 98% in Dolma.
  3. Influence of Model Size: The paper shows that larger LMs (e.g., Pythia-12B compared to smaller models) tend to generate less novel text across all $n$-gram sizes. This indicates a trend where increasing model size potentially leads to increased memorization or copying from the training data.
  4. Decoding Strategies: The choice of decoding strategy significantly impacts the novelty of the generated text. More constrained decoding methods, such as beam search and low-temperature sampling, resulted in less novel text. For instance, a temperature of 0.5 led to much lower novelty in generated 100-grams compared to more stochastic decoding choices.
  5. Prompting Effects: Including prompts from the training data marginally decreased the novelty of the generated text, particularly influencing the longest non-novel $n$-grams. The increase in the mean non-novel suffix length (nnsl) from 6.19 to 7.56 when prompting with 100 tokens substantiates this observation.
  6. Completion Loss: The paper also evaluated how LMs complete $n$-grams, finding that $n$-grams that appear in the training set are completed with significantly lower loss across various values of $n$, and that loss decreases further with the frequency of the $n$-gram in the training data, indicating a sensitivity of LMs to training data frequency.
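To make the metrics in points 2 and 5 concrete, the sketch below computes $n$-novelty and mean non-novel suffix length (nnsl) with a naive quadratic search. This is an illustrative reconstruction of the definitions, not the paper's code: Rusty-DAWG exists precisely to answer these queries in time independent of corpus size.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def occurs(span, corpus):
    """Naive check: does `span` occur contiguously in `corpus`?"""
    k = len(span)
    return any(tuple(corpus[i:i + k]) == span for i in range(len(corpus) - k + 1))

def n_novelty(generated, corpus, n):
    """Fraction of generated n-grams that never appear in the corpus."""
    seen = set(ngrams(corpus, n))
    gen = ngrams(generated, n)
    return sum(g not in seen for g in gen) / len(gen) if gen else 0.0

def mean_nnsl(generated, corpus):
    """Mean non-novel suffix length: for each position in the generated
    text, the length of the longest suffix ending there that occurs
    verbatim in the corpus, averaged over positions."""
    lengths = []
    for i in range(len(generated)):
        L = 0
        while L < i + 1 and occurs(tuple(generated[i - L:i + 1]), corpus):
            L += 1
        lengths.append(L)
    return sum(lengths) / len(lengths)
```

For example, with corpus `"the cat sat on the mat"` and generation `"the cat ran"`, the bigram novelty is 0.5 (only `("cat", "ran")` is new), and longer non-novel suffixes drive the nnsl up toward the copied span length.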

Implications and Future Developments

Practical Implications:

  • Legal Considerations: The novelty of LM-generated text may play a crucial role in legal settings, such as the ongoing discussions about the fair use of copyrighted materials in training data. By quantifying the overlap between generated and training text, this paper supports informed debates on intellectual property issues.
  • Model Evaluation and Improvement: Understanding how different variables, such as model size and decoding strategies, influence generated text novelty can inform best practices in LM deployment and tuning.

Theoretical Implications:

  • Memorization vs. Generalization: This research positions itself within broader discussions on the balance between memorization and generalization in LMs. The findings—particularly that larger models and more constrained decoding produce less novel text—add to the evidence that LMs may rely heavily on memorization when trained on large datasets.
  • Data Efficiency: Rusty-DAWG introduces a new level of efficiency in analyzing large-scale text data, which can inspire further development of similar tools for large corpus analysis in computational linguistics and other AI fields.
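As a rough illustration of this class of index, a plain suffix automaton (of which the CDAWG is a compacted variant) already answers substring-membership queries in time proportional to the query length, independent of corpus size. The sketch below is a standard textbook construction, not Rusty-DAWG itself, which is implemented in Rust and adds the compaction and count-tracking needed at pretraining scale.

```python
class SuffixAutomaton:
    """Minimal suffix automaton over a sequence. After O(corpus) build
    time, `contains(q)` runs in O(len(q)), independent of corpus size."""

    def __init__(self, tokens):
        self.trans = [{}]    # per-state transition maps
        self.link = [-1]     # suffix links
        self.length = [0]    # length of longest string reaching each state
        self.last = 0
        for t in tokens:
            self._extend(t)

    def _extend(self, t):
        cur = len(self.trans)
        self.trans.append({})
        self.length.append(self.length[self.last] + 1)
        self.link.append(-1)
        p = self.last
        # Add the new transition along the suffix-link chain.
        while p != -1 and t not in self.trans[p]:
            self.trans[p][t] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.trans[p][t]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Clone q so suffix-link lengths stay consistent.
                clone = len(self.trans)
                self.trans.append(dict(self.trans[q]))
                self.length.append(self.length[p] + 1)
                self.link.append(self.link[q])
                while p != -1 and self.trans[p].get(t) == q:
                    self.trans[p][t] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def contains(self, query):
        """True iff `query` occurs contiguously in the indexed sequence."""
        state = 0
        for t in query:
            if t not in self.trans[state]:
                return False
            state = self.trans[state][t]
        return True
```

The same structure works over characters or token IDs; the compacted (CDAWG) form collapses chains of single-child states, which is what keeps memory manageable for corpora with hundreds of billions of tokens.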

Future Research Directions:

  • Scaling Across Languages: Extending this approach to analyze LMs trained on non-English corpora could reveal language-specific patterns in memorization and novelty.
  • Semantic Novelty: Future work could explore methodologies to assess the semantic and syntactic novelty of generated text, complementing the verbatim novelty metrics discussed in this paper.
  • Algorithmic Optimizations: Further improvements in the memory efficiency of indices like the CDAWG could enhance their applicability to even larger datasets or more resource-constrained environments.

In conclusion, this paper provides a nuanced understanding of the novelty in LM-generated text and presents Rusty-DAWG as a potent tool for such analysis. The findings have substantial implications for both the deployment and ethical considerations surrounding modern LMs, setting a strong foundation for future advancements in the domain of AI and natural language processing.

Authors (3)
  1. William Merrill (36 papers)
  2. Noah A. Smith (224 papers)
  3. Yanai Elazar (44 papers)
Citations (6)