T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings (2406.19223v2)

Published 27 Jun 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Tokenizers are crucial for encoding information in LLMs, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.

Summary

  • The paper presents a tokenizer-free LLM approach that leverages sparse trigram representations to reduce embedding parameters by up to 87.5% compared to traditional tokenizers.
  • The technique encodes words through character trigram extraction and mapping into a shared embedding space, resulting in better parameter efficiency and improved training stability.
  • Experiments demonstrate competitive performance on English benchmarks and faster cross-lingual adaptation, validating the method's memory efficiency and robustness.

The paper "T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings" (2406.19223) introduces a novel approach to text encoding and decoding for LLMs that aims to replace traditional tokenizers. The authors argue that existing tokenization methods like BPE and Unigram have significant drawbacks, including computational overhead, inefficient vocabulary usage (due to duplicates and rarely used tokens), large embedding/head layers, and performance bias towards the languages in their training corpus. T-Free proposes a tokenizer-free method that directly processes words via sparse activation patterns derived from character trigrams.

The core idea behind T-Free is to represent each word not as a single token ID mapped to one dense embedding vector, but as a multi-label sparse activation pattern over a much smaller shared embedding space. This allows the model to inherently capture morphological similarities between words, which the authors argue leads to better parameter efficiency and cross-lingual transfer capabilities.

Here's a breakdown of the T-Free implementation:

  1. Word Splitting: The input text is first split into words, digits, and special characters. This is primarily done by splitting on digits and non-alphanumeric characters. Special "whitespace" and "non-whitespace" tokens are introduced to handle specific concatenation rules, reducing reliance on explicit whitespace characters.
  2. Trigram Encoding: Each word is treated as a sequence of characters with padded whitespaces at the beginning and end (e.g., "Hello" becomes "_Hello_"). Character triplets, or trigrams, are extracted using a convolution-like operation with a stride of 1. For "Hello", the trigrams are {_He, Hel, ell, llo, lo_}.
  3. Sparse Activation Pattern: Each trigram is mapped to a fixed number ($m$) of numerical identifiers (hashes). These hash values are then taken modulo the vocabulary size ($v$) to determine the indices in a shared embedding matrix that should be "activated" for this trigram. A parameter $k \in [0, m)$ allows a subset of these hashes to be computed from the lowercase version of the trigram, explicitly modeling capitalization similarities.
  4. Word Embedding: The final embedding for a word is computed by summing the embedding vectors from the shared embedding matrix at all indices activated by all of its constituent trigrams. If a trigram hash modulo $v$ yields index $i$, the embedding row $E_L[i]$ is added to the word's embedding vector (see the encoding sketch after this list).
  5. Training Objective: Unlike traditional LLMs which use a single-label objective (predicting the next token ID) over the full vocabulary, T-Free uses a multi-label Binary Cross-Entropy loss. The target for predicting the next word is the sparse activation pattern corresponding to that word's trigrams over the shared embedding space. The LM head is a projection layer into the shared embedding dimension ($v$).
  6. Decoding: To predict the next word, the model outputs logits over the shared embedding dimension ($v$). These logits are scored against a pre-compiled dictionary. This dictionary is a sparse matrix where each row corresponds to a candidate next word and contains its pre-computed sparse activation pattern (a binary vector of size $v$). The dot product of the model's sigmoid-activated logits and each row of the dictionary matrix gives a score for each candidate word. These scores are then typically normalized and potentially softmaxed to pick the most likely next word (see the decoding sketch after this list).
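
To make the encoding path (steps 1-4) concrete, here is a minimal, self-contained Python sketch. The salted SHA-256 hashing, the values of $m$, $v$, and the hidden size $H$, and the omission of the lowercase-hash fraction $k$ are illustrative assumptions rather than the authors' exact implementation.

```python
import hashlib

import numpy as np

V = 8192   # shared embedding size v (paper experiments use 8k-16k)
M = 10     # number of hash-based activations per trigram; illustrative choice
H = 256    # hidden dimension; illustrative choice

rng = np.random.default_rng(0)
E_L = rng.normal(size=(V, H)).astype(np.float32)  # shared embedding matrix


def trigrams(word: str) -> list[str]:
    """Pad the word with whitespace and slide a width-3 window (stride 1)."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def trigram_indices(trigram: str) -> list[int]:
    """Map one trigram to M indices in [0, V) via salted hashes."""
    return [
        int(hashlib.sha256(f"{salt}:{trigram}".encode()).hexdigest(), 16) % V
        for salt in range(M)
    ]


def word_activation(word: str) -> np.ndarray:
    """Binary multi-label activation pattern of a word over the shared space."""
    pattern = np.zeros(V, dtype=np.float32)
    for tri in trigrams(word):
        pattern[trigram_indices(tri)] = 1.0
    return pattern


def word_embedding(word: str) -> np.ndarray:
    """Sum the embedding rows activated by all trigrams of the word."""
    emb = np.zeros(H, dtype=np.float32)
    for tri in trigrams(word):
        for i in trigram_indices(tri):
            emb += E_L[i]
    return emb


print(trigrams("Hello"))                    # [' He', 'Hel', 'ell', 'llo', 'lo ']
print(int(word_activation("Hello").sum()))  # at most 5 * M active indices
```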
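
Steps 5 and 6 can be illustrated in the same spirit: the multi-label BCE target for the next word is simply that word's activation pattern, and decoding scores the sigmoid-activated head output against a pre-compiled dictionary of candidate patterns. The candidate word list, hash scheme, and normalization below are assumptions made for the sake of a runnable example, not the paper's exact procedure.

```python
import hashlib

import numpy as np

V, M = 8192, 10  # shared embedding size and hashes per trigram (as in the sketch above)


def word_activation(word: str) -> np.ndarray:
    """Binary trigram-activation pattern of a word over the shared dimension V."""
    padded = f" {word} "
    pattern = np.zeros(V, dtype=np.float32)
    for i in range(len(padded) - 2):
        tri = padded[i:i + 3]
        for salt in range(M):
            h = int(hashlib.sha256(f"{salt}:{tri}".encode()).hexdigest(), 16)
            pattern[h % V] = 1.0
    return pattern


# Pre-compiled dictionary: one row per candidate next word (illustrative word list).
candidates = ["Hello", "world", "token", "free"]
dictionary = np.stack([word_activation(w) for w in candidates])  # shape (num_words, V)

# Multi-label BCE training target when the ground-truth next word is "world".
target = word_activation("world")

# At inference the LM head produces logits over the shared dimension V;
# random numbers stand in for a real model output here.
logits = np.random.default_rng(0).normal(size=V).astype(np.float32)
probs = 1.0 / (1.0 + np.exp(-logits))      # sigmoid activation

scores = dictionary @ probs                # dot product with each candidate's pattern
scores /= dictionary.sum(axis=1)           # normalize by number of active indices
print(candidates[int(np.argmax(scores))])  # highest-scoring candidate word
```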

Implementation Considerations and Benefits:

  • Memory Efficiency: T-Free drastically reduces the size of the embedding and LM head layers. Instead of a matrix of size $V \times H$, where $V$ is the large tokenizer vocabulary size (e.g., 32k-256k), T-Free uses a shared embedding matrix of size $v \times H$, where $v$ can be significantly smaller (e.g., 8k-16k in experiments). This led to an 87.5% reduction in embedding parameters compared to a 64k Unigram baseline. This makes models smaller and potentially allows for larger micro-batch sizes during training, improving throughput (a back-of-the-envelope parameter count follows this list).
  • No Duplicate Tokens by Design: Traditional tokenizers often have duplicate tokens differing only in capitalization or leading whitespace (e.g., " Word" and "word"). T-Free avoids this by processing words and handling these variations through the shared trigram-based representation. The authors show that common tokenizers have 15-35% duplicate tokens, while T-Free has none.
  • Cross-lingual Transfer: Since the trigram-based encoding is language-agnostic and captures character-level similarities, T-Free demonstrates superior performance in zero-shot and continual pre-training settings on languages different from the main training corpus (English to German in the paper's experiments).
  • Training Stability: The paper notes more stable training loss curves with T-Free, potentially due to explicit modeling of word similarities and a more uniform distribution of gradients across the shared embedding space.
  • Computational Cost: Pre-processing involves simple word splitting and trigram hashing, which is generally faster than complex BPE or Unigram decoding algorithms, especially for large vocabularies. Inference decoding requires scoring against a dictionary, which can be efficient if the dictionary is sparse.
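
As a back-of-the-envelope check of the memory-efficiency figure above, the sketch below compares embedding plus LM-head parameter counts for a 64k Unigram vocabulary against a T-Free shared embedding of size 8k. The hidden size is a hypothetical placeholder and untied embedding and head matrices are assumed; the relative reduction does not depend on either choice.

```python
H = 3072                    # hypothetical hidden dimension
unigram_vocab = 64_000      # 64k Unigram baseline vocabulary
tfree_v = 8_000             # T-Free shared embedding size from the ablations

baseline_params = 2 * unigram_vocab * H  # embedding matrix + LM head (untied)
tfree_params = 2 * tfree_v * H

print(f"reduction: {1 - tfree_params / baseline_params:.1%}")  # 87.5%
```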

Experimental Results:

  • Hyperparameter Ablations (1B models): T-Free achieved competitive or superior downstream performance on English benchmarks compared to 64k Unigram baselines, even with $v$ = 8k, demonstrating significant parameter reduction without performance degradation.
  • Fertility: T-Free showed much lower fertility (tokens per word closer to 1.0) and less variance across diverse languages (English, German, Russian, Vietnamese, Arabic) compared to standard tokenizers.
  • Language Transfer (3B models): Continual pre-training an English-trained 3B model on German data showed T-Free adapting significantly faster and achieving much better German benchmark scores than the Unigram baseline, highlighting its better cross-lingual capabilities.

Limitations:

The authors note that evaluation was primarily done on models up to 3B parameters. Potential issues with very long words (where summation of many trigram embeddings might cause numerical instability) and repetitive trigrams within a word are mentioned but deemed statistically insignificant for common datasets. Performance on code could be further improved by explicitly modeling code patterns. Evaluation on languages that rely entirely on Unicode byte encodings (such as Chinese) was not performed.

In summary, T-Free presents a compelling alternative to traditional tokenization for LLMs, offering significant memory savings, improved cross-lingual performance, and better parameter utilization by directly embedding words based on character trigrams and leveraging a shared, sparse embedding space.
