From Tokens to Words: On the Inner Lexicon of LLMs (2410.05864v4)

Published 8 Oct 2024 in cs.CL and cs.AI

Abstract: Natural language is composed of words, but modern LLMs process sub-words as input. A natural question raised by this discrepancy is whether LLMs encode words internally, and if so, how. We present evidence that LLMs engage in an intrinsic detokenization process, where sub-word sequences are combined into coherent whole-word representations at their last token. Our experiments show that this process primarily takes place within the early and middle layers of the model. We further demonstrate its robustness to arbitrary splits (e.g., "cats" to "ca" and "ts"), typos, and, importantly, to out-of-vocabulary words: when feeding the last token internal representations of such words to the model as input, it can "understand" them as the complete word despite never seeing such representations as input during training. Our findings suggest that LLMs maintain a latent vocabulary beyond the tokenizer's scope. These insights provide a practical, finetuning-free application for expanding the vocabulary of pre-trained models. By enabling the addition of new vocabulary words, we reduce input length and inference iterations, which reduces both space and model latency, with little to no loss in model accuracy.

Summary

  • The paper demonstrates that early to mid layers of LLMs reconstruct coherent word representations from sub-word tokens via a latent inner lexicon.
  • Experiments show LLMs achieve 89% accuracy distinguishing meaningful words from gibberish and a 64% retrieval rate for unseen multi-token words.
  • The analysis reveals that feedforward networks and attention layers jointly drive robust detokenization, suggesting efficient vocabulary expansion.

Analyzing Detokenization in LLMs

The paper "From Tokens to Words: On the Inner Lexicon of LLMs" presents an analysis of the detokenization process within LLMs, addressing how these models internally reconstruct word-level representations from sub-word tokens. This work investigates whether LLMs possess an intrinsic mechanism for mapping these sub-word sequences into coherent word representations, suggesting a latent vocabulary that extends beyond the limitations of conventional tokenization methods like Byte-Pair Encoding (BPE).

Key Insights and Experiments

The authors design several experiments to probe the detokenization process, focusing on two primary scenarios: words not directly represented in the model's BPE vocabulary (multi-token words) and single-token words artificially split apart. The research shows that detokenization occurs primarily in the early to middle layers of the models.

  1. Words vs. Nonwords: The paper first asks whether LLMs distinguish meaningful word sequences from gibberish. A k-nearest neighbors classifier trained on hidden states achieves 89% accuracy, indicating that the model's internal representations separate recognized words from nonwords and pointing to a word-detection mechanism underlying detokenization (see the kNN sketch after this list).
  2. Single-Token Word Splitting: For words artificially split into multiple tokens or corrupted with typos, the model progressively reconstructs the original word representation as processing advances through the layers. Using the logit lens, the authors observe that the last token's hidden state aligns increasingly with the original word, confirming robust detokenization despite the perturbations (see the logit-lens sketch after this list).
  3. Multi-Token Word Processing: For inherently multi-token words, the Patchscopes method is used to show that the LLM can regenerate the complete word from the last token's hidden representation, even though such fused representations never appear as inputs during training. A 64% retrieval rate supports the presence of a latent inner lexicon (see the Patchscopes sketch after this list).
  4. Mechanism Analysis: The authors examine the mechanisms behind detokenization, particularly the roles of feedforward networks (FFNs) and attention. They find that FFNs retrieve word-level information associated with sub-word sequences, while early attention layers aggregate information across tokens; together these components drive the formation of coherent word representations.
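
A minimal sketch of the word/nonword probe, assuming last-token hidden states for real words and gibberish sequences have already been extracted from one middle layer; the file names and the 5-neighbor setting are illustrative placeholders, not the authors' exact protocol.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical inputs: (n, d) arrays of last-token hidden states from one
# middle layer, for real words and for gibberish sub-word sequences.
word_states = np.load("word_states.npy")
nonword_states = np.load("nonword_states.npy")

X = np.concatenate([word_states, nonword_states])
y = np.concatenate([np.ones(len(word_states)), np.zeros(len(nonword_states))])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# If words and nonwords separate in hidden-state space, a simple kNN probe
# should score well above the 50% chance level.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(f"word/nonword accuracy: {clf.score(X_test, y_test):.1%}")
```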
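
The logit-lens readout is straightforward to reproduce. A sketch using a HuggingFace GPT-2 model as a stand-in for the models studied, with the "cats" → "ca" + "ts" split from the abstract:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Force an arbitrary split of a single-token word: " cats" -> " ca" + "ts".
ids = torch.cat([
    tokenizer(" ca", return_tensors="pt").input_ids,
    tokenizer("ts", return_tensors="pt").input_ids,
], dim=1)

with torch.no_grad():
    hidden = model(ids, output_hidden_states=True).hidden_states

ln_f = model.transformer.ln_f               # final layer norm
W_U = model.get_output_embeddings().weight  # unembedding matrix

# Project each layer's last-token state through the unembedding; if the word
# is reassembled in the early/middle layers, the top token converges to " cats".
for layer, h in enumerate(hidden):
    top_id = (ln_f(h[0, -1]) @ W_U.T).argmax().item()
    print(f"layer {layer:2d}: {tokenizer.decode(top_id)!r}")
```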
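
A simplified Patchscopes-style probe: grab the last-token hidden state of a multi-token word in one pass, then patch it into a repeat-style identity prompt in a second pass. The layer index, prompt, and target word are illustrative, and GPT-2 again stands in for the models studied.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
LAYER = 6  # illustrative early/middle layer

# Source pass: hidden state of the last sub-word token of a multi-token word.
# Note: hidden_states[LAYER] is the output of transformer block LAYER - 1.
word_ids = tokenizer(" interpretability", return_tensors="pt").input_ids
with torch.no_grad():
    states = model(word_ids, output_hidden_states=True).hidden_states
word_vec = states[LAYER][0, -1].clone()

# Target pass: an identity prompt whose final "x" slot we overwrite.
prompt_ids = tokenizer("cat cat cat x", return_tensors="pt").input_ids
patch_pos = prompt_ids.shape[1] - 1

def patch(module, inputs, output):
    h = output[0]
    if h.shape[1] > patch_pos:      # only on the first, uncached forward pass
        h[0, patch_pos] = word_vec  # inject the fused word representation
    return (h,) + output[1:]

handle = model.transformer.h[LAYER - 1].register_forward_hook(patch)
with torch.no_grad():
    out = model.generate(prompt_ids, max_new_tokens=5, do_sample=False)
handle.remove()

# If the model holds an inner lexicon, the continuation should echo the word.
print(tokenizer.decode(out[0, prompt_ids.shape[1]:]))
```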

Practical Implications and Future Directions

The findings suggest that understanding and leveraging this detokenization process can substantially enhance practical applications:

  • Vocabulary Expansion: The paper explores expanding the LLM vocabulary without finetuning. By adding fused representations of multi-token words to the embedding space, the authors report maintained or improved language modeling performance. This approach significantly reduces input length and model latency, with particular benefits for languages with high word-to-token ratios (see the sketch after this list).
  • Computational Efficiency: Fusing multi-token words into single representations shortens input sequences, reducing both memory footprint and inference cost, which matters when scaling applications.
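
A minimal sketch of the expansion idea, again with GPT-2 and its tied input/output embeddings; using a mid-layer hidden state directly as the new embedding row is an illustrative simplification, not necessarily the authors' exact fusion procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
FUSE_LAYER = 6  # illustrative layer where detokenization has taken place

def fused_vector(word: str) -> torch.Tensor:
    """Last-token hidden state of a multi-token word at FUSE_LAYER."""
    ids = tokenizer(word, return_tensors="pt").input_ids
    with torch.no_grad():
        states = model(ids, output_hidden_states=True).hidden_states
    return states[FUSE_LAYER][0, -1]

# Register the word as a single token and plug in the fused representation,
# so future prompts spend one position on it instead of several.
new_word = "interpretability"
vec = fused_vector(" " + new_word)
tokenizer.add_tokens([new_word])
model.resize_token_embeddings(len(tokenizer))
with torch.no_grad():
    # GPT-2 ties input and output embeddings, so this row serves both roles.
    model.get_input_embeddings().weight[-1] = vec
```

Each word promoted this way saves input positions and decoding iterations, which is where the reported space and latency gains come from.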

The identification of a latent vocabulary within LLMs that handles out-of-vocabulary words and typographical variations without additional training opens opportunities for future work. Making these latent vocabularies an explicit, optimizable part of transformer architectures could improve performance on complex linguistic tasks and provide greater adaptability across diverse languages and dialects.

In summary, this paper offers a detailed investigation into the internal workings of LLMs, focusing on the critical step of detokenization. These insights point to concrete opportunities for improving LLM design, fostering more versatile and efficient computational language tools.
