Data Mixture Inference: What Do BPE Tokenizers Reveal About Their Training Data?
The paper "Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?" explores the task of uncovering the distributional makeup of the pretraining datasets used for LLMs (LMs). By leveraging information embedded within byte-pair encoding (BPE) tokenizers, the authors propose an innovative attack method they term "data mixture inference."
Key Insights and Methodology
The foundational insight driving this work is the observation that the ordered list of merge rules learned by a BPE tokenizer inherently reflects token frequency information from the training data. Each step of BPE training merges the pair of bytes or tokens that is most frequent in the current corpus, so the resulting merge list records, in order, which pairs were most frequent at each step. The sequence of merges therefore encodes information about the distribution of different data categories within the underlying dataset.
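To make this concrete, here is a minimal character-level sketch of BPE training on a toy corpus. Production tokenizers operate on bytes and far larger corpora, but the key property is the same: the merge list comes out ordered by how frequent each pair was at the moment it was merged.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn an ordered BPE merge list from a toy corpus.

    At every step the most frequent adjacent symbol pair is merged, so
    earlier entries in the returned list correspond to pairs that were
    more frequent in the training data at the time they were merged.
    """
    words = [list(w) for w in corpus]  # start from individual characters
    merges = []

    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for w in words:
            pair_counts.update(zip(w, w[1:]))
        if not pair_counts:
            break

        # The next merge rule is the currently most frequent pair.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Apply the new merge everywhere before counting again.
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    merged.append(w[i] + w[i + 1])
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words

    return merges

print(train_bpe(["low", "lower", "lowest", "newer", "wider"], num_merges=5))
```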
The authors formalize this as a linear program (LP). The unknowns are the proportions of different categories (e.g., natural languages, programming languages, or data domains) that make up the tokenizer's training set. By replaying the merge list step by step on data samples from each category and counting pair frequencies, they derive constraints stating that, under the true mixture, the pair merged at each step must have been at least as frequent as any competing pair. Solving the LP yields the proportions that best satisfy these constraints.
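The sketch below shows one way such an LP could be assembled with scipy. It is an illustrative sketch, not the authors' exact formulation: the function name `infer_mixture`, the constraint construction, and the count vectors in the toy call are all assumptions. The idea it encodes is the one described above: for each observed merge, the chosen pair must have been at least as frequent under the true mixture as a competing pair, with slack variables absorbing violations.

```python
import numpy as np
from scipy.optimize import linprog

def infer_mixture(constraints, num_categories):
    """Estimate mixture weights alpha from merge-order constraints.

    `constraints` is a list of (v, u) pairs of length-n count vectors:
    v[i] = count, in category i's sample, of the pair that was merged,
    u[i] = count of some competitor pair that was NOT merged at that step.
    If the tokenizer was trained on mixture alpha, then alpha @ v >= alpha @ u
    should hold; slack variables absorb noise and the LP minimizes total slack.
    """
    n, m = num_categories, len(constraints)

    # Variables: [alpha_1..alpha_n, s_1..s_m]; objective: minimize total slack.
    c = np.concatenate([np.zeros(n), np.ones(m)])

    # Each constraint alpha @ (v - u) + s_j >= 0 is rewritten for linprog
    # as  -(v - u) @ alpha - s_j <= 0.
    A_ub = np.zeros((m, n + m))
    for j, (v, u) in enumerate(constraints):
        A_ub[j, :n] = -(np.asarray(v, float) - np.asarray(u, float))
        A_ub[j, n + j] = -1.0
    b_ub = np.zeros(m)

    # Mixture weights sum to one; all variables are nonnegative.
    A_eq = np.concatenate([np.ones(n), np.zeros(m)])[None, :]
    b_eq = np.array([1.0])

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (n + m), method="highs")
    return res.x[:n]

# Toy example with two categories and made-up counts, where the merged pair
# is much more frequent in category 0 than in category 1.
print(infer_mixture([([120, 5], [30, 90]), ([80, 10], [20, 60])], 2))
```

In this toy call, two constraints leave many feasible mixtures; in the real attack, thousands of merge steps each contribute constraints, which is what pins the estimated proportions down tightly.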
Controlled Experiments
In controlled experiments, tokenizers were trained on known mixtures of natural languages, programming languages, and data sources; a minimal setup sketch follows the list below. The findings are compelling:
- The attack recovers mixture proportions three to six orders of magnitude more accurately than random guessing.
- Mixtures of natural languages, whose vocabularies are largely distinct, are recovered most accurately.
- Mixtures of English-language domains are harder because vocabulary differences are subtler, yet accuracy remains far better than random.
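A rough sketch of how such a controlled mixture could be produced, assuming the Hugging Face tokenizers library and user-supplied per-category corpora (both assumptions, not details taken from the paper):

```python
import random
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_mixture_tokenizer(corpora, weights, vocab_size=5000):
    """Train a byte-level BPE tokenizer on a known mixture of categories.

    `corpora` maps category names (e.g. language codes) to lists of documents;
    `weights` gives the ground-truth mixture proportions that the attack
    later tries to recover from the tokenizer's ordered merge list.
    """
    # Sample documents in proportion to the target mixture.
    cats = list(corpora)
    sampled = random.choices(cats, weights=[weights[c] for c in cats], k=10_000)
    docs = (random.choice(corpora[c]) for c in sampled)

    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["<unk>"])
    tokenizer.train_from_iterator(docs, trainer=trainer)
    return tokenizer
```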
Application to Commercial Tokenizers
Applying their method to commercial tokenizers (a sketch for extracting a released merge list follows the list below), the authors uncover several quantitative details:
- GPT-4o's tokenizer is notably multilingual, trained on roughly 39% non-English data.
- GPT-3.5's tokenizer includes substantial code data, with code making up approximately 60% of its training set.
- Llama 3 extends GPT-3.5's tokenizer primarily to enhance multilingual coverage, with 48% non-English data.
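For tokenizers released in the Hugging Face tokenizer.json format, the ordered merge list that the attack consumes can be read straight from the file. The snippet below is a small sketch under that assumption; the example path is hypothetical, and GPT-family tokenizers are distributed through tiktoken instead, where the ranked merge table is exposed differently.

```python
import json

def load_merge_list(tokenizer_json_path):
    """Read the ordered BPE merge list from a Hugging Face tokenizer.json file.

    The `model.merges` field lists merge rules in the order they were learned,
    which is exactly the frequency-ranked signal the attack exploits.
    """
    with open(tokenizer_json_path, encoding="utf-8") as f:
        spec = json.load(f)
    merges = spec["model"]["merges"]
    # Merges are stored either as "a b" strings or as [a, b] pairs,
    # depending on the tokenizers version that produced the file.
    return [tuple(m.split(" ", 1)) if isinstance(m, str) else tuple(m)
            for m in merges]

# Example (hypothetical local path to a downloaded tokenizer):
# print(load_merge_list("Llama-3-8B/tokenizer.json")[:10])
```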
Practical and Theoretical Implications
This research has several implications:
- Security and Privacy: Revealing the distributional properties of training data can leak proprietary information or enable targeted poisoning attacks.
- Auditing and Fairness: Understanding data mixtures aids in auditing models for biases, highlighting over- or under-represented languages and domains.
- Technical Insight: This method offers a tool for indirectly inferring pretraining data distribution when direct access to data is restricted.
Future Developments
Future research directions include:
- Extending the method to tokenization schemes beyond BPE.
- Enhancing robustness to distribution shift between the tokenizer's training data and the model's actual pretraining corpus.
- Exploring practical defenses that model producers could employ to mitigate such inference attacks.
Conclusion
The method proposed in this paper provides a powerful tool for inferring the distribution of training data from the properties of BPE tokenizers. By reverse engineering the merge lists, the paper illuminates otherwise opaque aspects of model training data, contributing significantly to the discourse on model transparency, security, and fairness. As AI systems continue to integrate into critical societal functions, such research is pivotal in ensuring equitable and secure AI deployments.