Deduplicating Training Data Makes Language Models Better (2107.06499v2)

Published 14 Jul 2021 in cs.CL and cs.LG

Abstract: We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets -- for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.

Authors (7)
  1. Katherine Lee (34 papers)
  2. Daphne Ippolito (47 papers)
  3. Andrew Nystrom (4 papers)
  4. Chiyuan Zhang (57 papers)
  5. Douglas Eck (24 papers)
  6. Chris Callison-Burch (102 papers)
  7. Nicholas Carlini (101 papers)
Citations (505)

Summary

Deduplicating Training Data Makes Language Models Better

The paper "Deduplicating Training Data Makes LLMs Better" by Katherine Lee et al. addresses the issue of duplicate content in training datasets for LLMs (LMs). The paper asserts that prevalent duplication not only affects efficiency but also biases performance evaluations. The authors propose novel methodologies to systematically identify and remove these duplicates, yielding several improvements in model training and evaluation.

Methodologies

Two main deduplication techniques are developed:

  1. Exact Substring Matching: Uses a suffix array over the tokenized corpus to find verbatim duplicates, flagging any substring of at least 50 tokens that occurs more than once. This scalable method removes redundant text so that model outputs rely less on memorized training data (a toy sketch appears after this list).
  2. Approximate Matching with MinHash: Flags whole documents with high n-gram similarity. Rather than extending the suffix-array method, it works at the document level, catching examples that are largely identical except for small variations, a pattern common in web-crawled data (a toy sketch also follows this list).
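To make the two approaches concrete, the Python sketches below illustrate the underlying ideas under simplifying assumptions. They are toy versions, not the scalable implementations in the released repository (which builds suffix arrays over the full byte-level corpus and pairs MinHash with locality-sensitive hashing and an edit-similarity check). Names such as `find_repeated_spans`, `MIN_MATCH`, the 128-hash signature, and the 5-word shingle size are illustrative choices, not identifiers or settings taken from the paper's code.

```python
# Toy sketch of exact-substring duplicate detection: build a suffix array over
# the tokenized corpus and report lexicographically adjacent suffixes whose
# common prefix is at least 50 tokens long. The naive sort below is only
# suitable for small inputs; the released tool builds the suffix array scalably.

MIN_MATCH = 50  # duplicate-length threshold from the paper, in tokens

def find_repeated_spans(tokens, min_match=MIN_MATCH):
    """Yield (pos_a, pos_b, length) for spans of >= min_match tokens that occur twice."""
    n = len(tokens)
    # Naive suffix array: sort suffix start positions by suffix content.
    suffix_array = sorted(range(n), key=lambda i: tokens[i:])
    for a, b in zip(suffix_array, suffix_array[1:]):
        # Longest common prefix of two lexicographically adjacent suffixes.
        lcp = 0
        while a + lcp < n and b + lcp < n and tokens[a + lcp] == tokens[b + lcp]:
            lcp += 1
        if lcp >= min_match:
            yield a, b, lcp
```

The approximate method can be sketched in a similarly reduced form: documents become sets of word 5-gram shingles, and agreement between MinHash signatures estimates their Jaccard similarity.

```python
# Toy MinHash sketch for near-duplicate documents. The paper additionally uses
# locality-sensitive hashing to find candidate pairs at scale; that step and the
# final duplicate-confirmation check are omitted here.

import random

NUM_HASHES = 128     # illustrative signature length
SHINGLE_SIZE = 5     # word n-gram size used for shingling

random.seed(0)
_SALTS = [random.getrandbits(64) for _ in range(NUM_HASHES)]  # stand-ins for hash functions

def shingles(text):
    words = text.split()
    if len(words) < SHINGLE_SIZE:
        return {" ".join(words)}
    return {" ".join(words[i:i + SHINGLE_SIZE])
            for i in range(len(words) - SHINGLE_SIZE + 1)}

def minhash_signature(text):
    grams = shingles(text)
    return [min(hash((salt, g)) for g in grams) for salt in _SALTS]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching slots approximates Jaccard similarity of the shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated similarity exceeds a chosen threshold would then be clustered, with all but one representative removed from the training set.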

Key Findings

The research reveals that even widely used datasets that have already undergone some deduplication, such as C4 and RealNews, still contain significant amounts of duplicated content:

  • The RealNews dataset sees 13.63% of its training examples identified as duplicates through approximate matching.
  • C4 has 3.04% near-duplicate examples exposed using the same methodology.

Both methods reduced memorization: models trained on deduplicated data reproduced far less training text verbatim, emitting memorized content roughly 0.1% of the time compared with over 1% for models trained on the original datasets.
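One plausible way to quantify such memorization, offered here as an assumption rather than the paper's exact evaluation pipeline, is to hash every 50-token window of the training data and flag a generation as memorized if any of its own 50-token windows appears in that set; the released tools would instead answer these membership queries with the suffix array.

```python
# Hypothetical memorization check: a generated sequence counts as memorized if it
# shares a 50-token window with the training data. Hashing all training windows
# is a memory-hungry stand-in for a suffix-array lookup and can in principle
# produce hash-collision false positives; it is a sketch only.

WINDOW = 50

def window_hashes(tokens, window=WINDOW):
    return {hash(tuple(tokens[i:i + window]))
            for i in range(len(tokens) - window + 1)}

def memorized_fraction(train_docs, generations):
    """Fraction of generations containing a 50-token span seen verbatim in training."""
    train_windows = set()
    for doc in train_docs:                      # each doc is a list of tokens
        train_windows |= window_hashes(doc)
    flagged = sum(
        any(h in train_windows for h in window_hashes(gen))
        for gen in generations
        if len(gen) >= WINDOW
    )
    return flagged / max(1, len(generations))
```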

Implications and Results

The advantages of deduplication extend beyond reducing memorization. Models trained on deduplicated data require fewer training steps to achieve equivalent or superior accuracy levels across tasks. Importantly, deduplication also mitigates train-test overlap, leading to more reliable evaluations of model capabilities:

  • Over 4% of the C4 validation set overlaps with its training set. Models evaluated on overlapping data can be unfairly advantaged by memorization rather than true generalization, inflating benchmark scores.
  • Training on deduplicated data reduced evaluation perplexity, in some cases by up to 10%, highlighting the improved quality and diversity of the deduplicated datasets.

Future Directions

The findings suggest that data curation practices for future dataset collections need to improve. Ensuring data diversity and minimizing memorization are paramount, particularly for LMs intended for general use across diverse applications. Improvements in deduplication algorithms will continue to play a critical role in optimizing LM performance and reliability.

As LMs are increasingly used in sensitive contexts, understanding and mitigating memorization risks becomes crucial. The paper's findings prompt further exploration of the balance between necessary memorization of factual data and inadvertent leakage of sensitive information. Additionally, tuning deduplication thresholds and methods to the characteristics of specific datasets may yield even more refined models.

Ultimately, this work serves as a call to action for researchers and practitioners to thoughtfully consider the composition and quality of datasets, ensuring robust and generalizable machine learning models. The tools and findings shared by the authors provide valuable resources for researchers aiming to enhance their LM training workflows.
