
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling (2310.11628v1)

Published 17 Oct 2023 in cs.CL and cs.AI

Abstract: Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the limitations of such a tokenization strategy, particularly for documents not written in English and for representing numbers. On the other extreme, byte/character-level language models are much less restricted but suffer from increased sequence description lengths and a subsequent quadratic expansion in self-attention computation. Recent attempts to compress and limit these context lengths with fixed-size convolutions are helpful but completely ignore the word boundary. This paper considers an alternative 'learn your tokens' scheme which utilizes the word boundary to pool bytes/characters into word representations, which are fed to the primary language model, before again decoding individual characters/bytes per word in parallel. We find that our moderately expressive and moderately fast end-to-end tokenizer outperforms both subwords and byte/character models by over 300% on the intrinsic language modeling metric of next-word prediction across datasets. It particularly outshines on rare words, outperforming by a factor of 30! We extensively study the language modeling setup for all three categories of tokenizers and theoretically analyze how our end-to-end models can also be a strong trade-off in efficiency and robustness.

Word-Pooled Tokenization for Language Modeling: An Examination

The paper "Learn Your Tokens: Word-Pooled Tokenization for LLMing" introduces a novel approach to tokenization in NLP that aims to balance expressivity and efficiency. Current tokenization strategies, such as subword-based methods and byte/character-level tokenization, present inherent limitations. Subword tokenizers, while providing a compromise between compressing information and representing rare words, are often hand-engineered and static, leading to inefficiencies across different languages and numeric representations. Byte or character-level models allow for broader applicability but at a significant computational cost due to increased sequence length, which is proportional to the size of the input text.

The proposed alternative, termed the 'learn your tokens' scheme, capitalizes on word boundaries to pool characters into word-level representations. This pooling precedes the pass through the language model and is followed by decoding characters/bytes in parallel for each word. The approach outperforms existing methods by over 300% on next-word prediction across datasets, and it particularly excels on rare words, improving over traditional tokenizers by a factor of 30.
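
To make the contrast concrete, the short Python sketch below (illustrative only; the helper names are not from the paper) shows how the same sentence looks as a flat character/byte-level sequence versus a word-pooled view in which bytes are grouped by whitespace boundaries before reaching the model.

```python
# Illustrative sketch (not the authors' code): contrast a flat byte-level
# view of a sentence with a word-pooled view that groups bytes by whitespace.

def char_level(text: str) -> list[int]:
    """Flat sequence of byte IDs; length grows with every character."""
    return list(text.encode("utf-8"))

def word_pooled(text: str) -> list[list[int]]:
    """One byte-ID chunk per whitespace-delimited word; each chunk is later
    pooled into a single vector before entering the main language model."""
    return [list(word.encode("utf-8")) for word in text.split()]

sentence = "Learning rare tokens"
print(char_level(sentence))   # 20 byte IDs -> 20 positions of self-attention
print(word_pooled(sentence))  # 3 chunks    -> 3 positions in the main model
```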

Methodology

The central methodology revolves around a tokenization strategy that uses word boundaries to compress base units (characters or bytes) into word representations. This is analogous to the CLS (classification) token in BERT-like models, but applied on a per-word basis. The architecture comprises three steps: pooling base units into a fixed embedding per word, passing these embeddings into the main language model, and then decoding the predictions at the character/byte level.
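
A minimal PyTorch-style sketch of this three-step pipeline is shown below. It is an assumption-laden illustration rather than the authors' implementation: the module sizes, mean pooling (in place of the paper's CLS-style pooling), and the omission of teacher-forced target shifting are all simplifications.

```python
# Hedged sketch of the word-pooled pipeline (not the authors' implementation):
# 1) encode the bytes of each word and pool them into one word vector,
# 2) run the main language model over the much shorter word sequence,
# 3) decode the bytes of each following word in parallel from its context.
import torch
import torch.nn as nn

def causal_mask(n):
    # Standard upper-triangular mask: position i may not attend to positions > i.
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

class WordPooledLM(nn.Module):
    def __init__(self, n_bytes=259, d=128, n_heads=4):
        super().__init__()
        self.byte_emb = nn.Embedding(n_bytes, d)
        enc = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.word_encoder = nn.TransformerEncoder(enc, num_layers=1)   # shallow, intra-word only
        main = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.main_lm = nn.TransformerEncoder(main, num_layers=4)       # runs over word vectors
        dec = nn.TransformerDecoderLayer(d, n_heads, batch_first=True)
        self.word_decoder = nn.TransformerDecoder(dec, num_layers=1)   # shallow, per-word
        self.to_bytes = nn.Linear(d, n_bytes)

    def forward(self, words):
        # words: list of 1-D LongTensors, one tensor of byte IDs per word.
        # Step 1: pool each word's bytes into a single vector (mean pooling here;
        # the paper instead reads off a CLS-like position).
        word_vecs = torch.stack([
            self.word_encoder(self.byte_emb(w).unsqueeze(0)).mean(dim=1).squeeze(0)
            for w in words
        ])                                                      # (num_words, d)
        # Step 2: contextualize the short, causally masked word-level sequence.
        context = self.main_lm(word_vecs.unsqueeze(0),
                               mask=causal_mask(len(words)))    # (1, num_words, d)
        # Step 3: decode the bytes of word t+1 from the contextual state of word t.
        memory = context[:, :-1, :]
        logits = []
        for i, w in enumerate(words[1:]):
            tgt = self.byte_emb(w).unsqueeze(0)                 # (1, L, d); target shifting
            dec_out = self.word_decoder(tgt, memory[:, i:i+1, :],   # elided for brevity
                                        tgt_mask=causal_mask(tgt.size(1)))
            logits.append(self.to_bytes(dec_out))               # (1, L, n_bytes)
        return logits

model = WordPooledLM()
byte_words = [torch.tensor(list(w.encode("utf-8"))) for w in "learn your tokens".split()]
print([t.shape for t in model(byte_words)])   # one (1, word_len, 259) tensor per predicted word
```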

The transformer-based architecture employs a shallow word-encoder transformer and a shallow word-decoder transformer interspersed with the primary language model, aligned to the input's word boundaries. This structure reduces computational requirements by initially limiting self-attention to within each word, thereby improving efficiency.
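
The efficiency argument can be made concrete with a back-of-the-envelope count of self-attention pairs; the figures below assume roughly five characters per word and are illustrative, not numbers from the paper.

```python
# Rough comparison of self-attention cost (attention-pair counts), assuming
# ~5 characters per word on average; numbers are illustrative only.
chars_per_word = 5
num_words = 200                          # a ~1000-character document

char_level_cost = (num_words * chars_per_word) ** 2      # one flat sequence
word_pooled_cost = (
    2 * num_words * chars_per_word ** 2  # shallow intra-word encoder + decoder
    + num_words ** 2                     # main model over word vectors only
)

print(f"character-level attention pairs: {char_level_cost:,}")   # 1,000,000
print(f"word-pooled attention pairs:     {word_pooled_cost:,}")  # 50,000
```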

Experimental Evaluation

The paper evaluates the effectiveness of the tokenization strategies across datasets spanning multiple languages (English, French, Russian) and a numeracy dataset that emphasizes the model's capability to predict numbers. Results indicate substantial improvements in word prediction accuracy, particularly for rare words, where the proposed method outstripped the standard subword and byte-level models by large margins.

Implications and Future Prospects

This method presents a compelling case for the refinement of tokenization strategies in NLP. By successfully incorporating word boundaries into tokenization, it balances expressiveness with computational cost, offering a viable middle ground between subword and character-level models. The results imply potential for further optimization, particularly in reducing memory overhead and enhancing computational speeds during training and inference phases.

The paper suggests broader implications for language modeling, where tokenization can directly dictate model efficiency and accuracy. Future developments may focus on optimizing and integrating such dynamic tokenization schemes into large-scale language models, potentially leveraging adaptive mechanisms that alter token density based on data complexity or task-specific requirements.

In conclusion, this research opens pathways for more nuanced and adaptable tokenization strategies that can push the current boundaries of NLP model performance and broaden applicability across languages and computation-intensive tasks. It also prompts a reevaluation of how fundamental a role word boundaries can play in tokenization, to better serve diverse languages and contexts in AI applications.

Authors (4)
  1. Avijit Thawani (3 papers)
  2. Saurabh Ghanekar (1 paper)
  3. Xiaoyuan Zhu (5 papers)
  4. Jay Pujara (44 papers)