- The paper demonstrates that early to mid layers of LLMs reconstruct coherent word representations from sub-word tokens via a latent inner lexicon.
- Experiments show LLMs achieve 89% accuracy distinguishing meaningful words from gibberish and a 64% retrieval rate for unseen multi-token words.
- The analysis reveals that feedforward networks and attention layers jointly drive robust detokenization, suggesting efficient vocabulary expansion.
Analyzing Detokenization in LLMs
The paper "From Tokens to Words: On the Inner Lexicon of LLMs" analyzes the detokenization process within LLMs: how these models internally reconstruct word-level representations from sub-word tokens. It asks whether LLMs possess an intrinsic mechanism for fusing sub-word sequences into coherent word representations, which would imply a latent vocabulary extending beyond the limits of conventional tokenization methods such as Byte-Pair Encoding (BPE).
Key Insights and Experiments
The authors introduce several experiments to probe the LLM detokenization process, focusing on two primary scenarios: words not directly represented in the model's BPE vocabulary (multi-token words) and single-token words artificially broken apart. The research shows that detokenization occurs primarily in the early to middle layers of the models.
- Words vs. Nonwords: The paper begins by examining whether LLMs distinguish between meaningful word sequences and gibberish. A k-nearest neighbors classifier achieves 89% accuracy, indicating that LLMs differentiate these sequences, demonstrating an underlying mechanism for detecting recognized words during detokenization.
- Single-Token Word Splitting: For words artificially separated into multiple tokens or containing typos, the model progressively reconstructs the original word representation as processing advances through the layers. Using the logit lens method, they observe that the last token's hidden state aligns strongly with the original word, confirming robust detokenization despite initial perturbations.
- Multi-Token Word Processing: For inherently multi-token words, the authors use Patchscopes to show that the model can regenerate the complete word from its internal representation, even for words that never appear as a single token in the vocabulary. A retrieval rate of 64% supports the presence of a latent inner lexicon.
- Mechanism Analysis: The authors analyze the key mechanisms involved in detokenization, particularly the role of feedforward networks (FFNs) and attention mechanisms. They find that FFNs retrieve word-level information from sub-words, and early attention layers aggregate token information, which collaboratively drives coherent word representation formation.
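The word-vs-nonword probe described above can be sketched as a k-nearest-neighbors vote over hidden states. The 2-d points and labels below are illustrative stand-ins, not the paper's actual features, which come from real LLM hidden states:

```python
# Toy sketch of the word-vs-nonword probe: a k-nearest-neighbors
# classifier over last-token hidden states. In the paper these states
# come from an LLM; the 2-d points here are hypothetical stand-ins.
from collections import Counter

def knn_predict(query, examples, k=3):
    """Label `query` by majority vote among its k nearest labeled points."""
    by_dist = sorted(examples,
                     key=lambda ex: sum((q - x) ** 2
                                        for q, x in zip(query, ex[0])))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Hypothetical hidden states: real words cluster apart from gibberish.
train = [([0.9, 0.8], "word"), ([1.0, 0.9], "word"), ([0.8, 1.0], "word"),
         ([0.1, 0.2], "nonword"), ([0.0, 0.1], "nonword"), ([0.2, 0.0], "nonword")]

print(knn_predict([0.85, 0.9], train))  # → word
print(knn_predict([0.05, 0.1], train))  # → nonword
```

With clean clusters like these the vote is unanimous; the paper's 89% accuracy reflects the same idea applied to real, noisier hidden states.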
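The logit-lens reading used for the single-token splitting experiment can likewise be sketched: project an intermediate hidden state onto each row of the unembedding matrix and read off the highest-scoring vocabulary item. The 3-d vectors and three-word vocabulary below are hypothetical:

```python
# Toy sketch of the logit-lens idea: score a hidden state against each
# vocabulary row of the unembedding matrix and return the argmax entry.
# All vectors and vocabulary items here are illustrative.

def logit_lens(hidden_state, unembedding, vocab):
    """Return the vocab entry whose unembedding row scores highest."""
    logits = [sum(h * w for h, w in zip(hidden_state, row))
              for row in unembedding]
    return vocab[max(range(len(logits)), key=logits.__getitem__)]

# Hypothetical 3-d hidden state for the last sub-word token of a split
# word, plus a tiny unembedding matrix (one row per vocabulary word).
vocab = ["cats", "dog", "car"]
unembedding = [
    [0.9, 0.1, 0.0],  # "cats"
    [0.0, 1.0, 0.2],  # "dog"
    [0.1, 0.0, 1.0],  # "car"
]
hidden = [1.0, 0.2, 0.1]  # points mostly along the "cats" direction

print(logit_lens(hidden, unembedding, vocab))  # → cats
```

In the paper's setting, the hidden state of the last sub-word token increasingly selects the original, unsplit word as layers progress.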
Practical Implications and Future Directions
The findings suggest that understanding and leveraging this detokenization process can substantially enhance practical applications:
- Vocabulary Expansion: The paper explores expanding the LLM vocabulary without finetuning. By inserting fused representations of multi-token words into the embedding space, the authors report maintained or improved language-modeling performance. This approach shortens input sequences and reduces model latency, with efficiency gains that are especially relevant for languages with high word-to-token ratios.
- Computational Efficiency: Representing words with fewer tokens lowers memory and compute demands, which is useful for scaling applications.
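The vocabulary-expansion idea above can be sketched as appending one new embedding per multi-token word. This is a minimal sketch: a simple average of the sub-word embeddings stands in for the paper's model-internal fused representation, and all names and numbers are illustrative.

```python
# Sketch of training-free vocabulary expansion: give a multi-token word
# a single new embedding row. The average below is only a stand-in for
# the fused, model-internal representation used in the paper.

def expand_vocab(embeddings, token2id, new_word, piece_ids):
    """Append a fused embedding for `new_word` built from its sub-word pieces."""
    dim = len(embeddings[0])
    fused = [sum(embeddings[i][d] for i in piece_ids) / len(piece_ids)
             for d in range(dim)]
    token2id[new_word] = len(embeddings)
    embeddings.append(fused)
    return token2id[new_word]

# Hypothetical 2-d embeddings for two sub-word pieces.
embeddings = [[1.0, 0.0], [0.0, 1.0]]   # "deto", "kenize"
token2id = {"deto": 0, "kenize": 1}

new_id = expand_vocab(embeddings, token2id, "detokenize", [0, 1])
print(new_id, embeddings[new_id])  # → 2 [0.5, 0.5]
```

After expansion, the word is encoded as one token instead of two, which is where the sequence-length and latency savings come from.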
The identification of a latent vocabulary within LLMs that handles out-of-vocabulary words and typographical variations without additional training opens opportunities for future work. Enhancing transformer architectures to exploit these latent vocabularies more fully could improve performance on complex linguistic tasks and provide greater adaptability across diverse languages and dialects.
In summary, this paper offers a detailed investigation into the internal workings of LLMs, particularly focusing on the critical aspect of detokenization. These insights underscore the potential for inherent improvements in LLM design, thereby fostering more versatile, efficient computational linguistic tools.