
On the Effect of (Near) Duplicate Subwords in Language Modelling (2404.06508v3)

Published 9 Apr 2024 in cs.CL and cs.LG

Abstract: Tokenisation is a core part of LMs. It involves splitting a character sequence into subwords which are assigned arbitrary indices before being served to the LM. While typically lossless, however, this process may lead to less sample efficient LM training: as it removes character-level information, it could make it harder for LMs to generalise across similar subwords, such as now and Now. We refer to such subwords as near duplicates. In this paper, we study the impact of near duplicate subwords on LM training efficiency. First, we design an experiment that gives us an upper bound to how much we should expect a model to improve if we could perfectly generalise across near duplicates. We do this by duplicating each subword in our LM's vocabulary, creating perfectly equivalent classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in a fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we see that merging them considerably hurts LM performance. Therefore, although subword duplication negatively impacts LM training efficiency, naturally occurring near duplicates may not be as similar as anticipated, limiting the potential for performance improvements.


Summary

  • The paper demonstrates that near duplicate subwords reduce training efficiency: models need roughly 17% more data when trained with fully duplicated vocabularies.
  • It introduces a novel experimental design using projected perplexity to assess the impact of perfect and natural duplication on model performance.
  • The findings imply that refining vocabulary construction and subword processing can significantly enhance language model generalization and efficiency.

Assessing the Impact of Near Duplicate Subwords on Language Model Efficiency

Introduction

Language models (LMs) have advanced significantly, yet the efficiency of their training remains a critical target for improvement. This paper explores how near duplicate subwords, minimal pairs that differ only in aspects such as whitespace, capitalization, or a plural suffix, affect the training efficiency of LMs. Near duplicates constitute over 40% of modern LMs' vocabularies and potentially hinder models from generalizing effectively across similar subwords, hurting sample efficiency. The paper introduces a novel experimental design to quantify this issue, providing insight into the mechanisms underlying subword duplication and its effect on LM performance.
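
To make the notion of near duplicates concrete, the sketch below groups a subword vocabulary into near-duplicate classes by folding whitespace markers, capitalization, and a plural suffix into a canonical form. The "Ġ" whitespace marker (GPT-2-style) and the folding heuristics are illustrative assumptions, not the paper's definition.

```python
# Hypothetical sketch: group a subword vocabulary into near-duplicate
# classes via a canonical form. The marker "Ġ" and the heuristics below
# are assumptions for illustration.
from collections import defaultdict

def canonical(subword: str) -> str:
    s = subword.removeprefix("Ġ")       # fold GPT-2-style whitespace marker
    s = s.lower()                       # fold capitalization
    if len(s) > 2 and s.endswith("s"):  # crude plural-suffix folding
        s = s[:-1]
    return s

def near_duplicate_classes(vocab: list[str]) -> dict[str, list[str]]:
    classes = defaultdict(list)
    for subword in vocab:
        classes[canonical(subword)].append(subword)
    # keep only classes with more than one member: the near duplicates
    return {k: v for k, v in classes.items() if len(v) > 1}

vocab = ["now", "Now", "Ġnow", "ĠNow", "cat", "cats", "Ġthe"]
print(near_duplicate_classes(vocab))
# {'now': ['now', 'Now', 'Ġnow', 'ĠNow'], 'cat': ['cat', 'cats']}
```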

Language Modelling and Subword Duplication

The research outlines a formal framework for language modelling, emphasizing the role of subword duplication. It proposes a novel methodology to assess the impact of both perfect and natural duplication, demonstrating through a synthetic duplicate setting that LMs are approximately 17% less data efficient when trained on a fully duplicated vocabulary. This efficiency loss scales linearly with the fraction of the vocabulary that is duplicated, suggesting a ceiling on the performance improvement available if models could generalize across duplicates perfectly.
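
The fully duplicated setting can be sketched in a few lines: every subword id is given a "twin" id, and each occurrence in the training stream is independently mapped to the original or the twin, so the two ids form a perfectly equivalent class. The function name and coin-flip assignment below are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of full vocabulary duplication, assuming a corpus of
# subword ids in [0, vocab_size): each id v gets a twin v + vocab_size,
# and every occurrence is independently mapped to v or its twin.
import random

def fully_duplicate(token_ids: list[int], vocab_size: int,
                    seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    # each occurrence lands on the original id or its duplicate at random,
    # so the two ids carry identical information
    return [t + vocab_size * rng.randint(0, 1) for t in token_ids]

ids = [5, 12, 5, 7]
print(fully_duplicate(ids, vocab_size=100))
# the two occurrences of id 5 may be mapped to different twins
```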

Experimental Approach

The paper compares models trained on original and artificially duplicated vocabularies, introducing projected perplexity as a metric for comparing models fairly across the two settings. The experiments reveal that while LMs can generalize across duplicated subwords, this capability comes at a cost to training efficiency. For perfect duplicates, the alignment of duplicate embeddings and the high mutual information between duplicates play a significant role in enabling generalization. When naturally occurring near-duplicate subwords are merged, however, model performance generally deteriorates, indicating that natural near duplicates are not as semantically equivalent as one might assume.
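
As a hedged illustration of how such a projection could work, the sketch below sums the duplicated model's probability mass over a token's duplicate class before taking the negative log, so both models are scored over the same effective event space. This is one plausible reading of the metric; the paper's exact definition may differ.

```python
# Hedged sketch: score the duplicated model by the total probability it
# assigns to a duplicate class, rather than to a single member id.
import math

def projected_nll(log_probs: dict[int, float],
                  duplicate_class: list[int]) -> float:
    """Negative log of the total probability assigned to a duplicate class."""
    total = sum(math.exp(log_probs[i]) for i in duplicate_class)
    return -math.log(total)

# toy example: ids 5 and 105 stand for the same underlying subword
log_probs = {5: math.log(0.10), 105: math.log(0.05), 12: math.log(0.85)}
print(projected_nll(log_probs, [5, 105]))  # -log(0.15) ~= 1.897
```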

Implications and Future Directions

The findings have both theoretical and practical implications for the design and optimization of LMs. The demonstrated inefficiency introduced by duplicates suggests that better handling or preprocessing of near duplicates could meaningfully improve LM training efficiency. Moreover, the contrast between perfect and natural duplication highlights the complexity of subword semantics in language modelling, challenging current assumptions about how vocabulary construction affects model performance. Future research could explore more sophisticated methods for identifying and exploiting near duplicates, potentially leveraging character- or byte-level information to further improve LM generalization.

In closing, this paper sheds light on a previously underexplored aspect of language model training, offering a foundational understanding of how near duplicate subwords influence LMs. Its findings prompt a reevaluation of vocabulary construction techniques and open avenues for further work on more efficient and semantically aware tokenization methods.
