Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling (2501.16975v2)
Abstract: Tokenization is a fundamental component of LLMs, yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
Summary
- The paper introduces an over-tokenized transformer that decouples input and output vocabularies to enhance scalability and efficiency across model sizes.
- The paper identifies a log-linear relationship between input vocabulary size and training loss, achieving up to a 2.5-fold improvement in scale efficiency.
- The paper provides practical guidelines for tokenizer design via hierarchical n-gram embeddings and integrated multi-token prediction to capture long-range dependencies.
Overview of Over-Tokenized Transformer Research
The paper "Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling" presents a comprehensive paper on the impact of tokenization within LLMs. It addresses the role of tokenization in enhancing model performance, with a particular focus on the scaling laws which dictate how model efficacy can increase with larger input vocabularies while maintaining constant computational costs. This research introduces the concept of Over-Tokenized Transformers, which decouple input and output vocabularies, providing a structured approach to optimizing LLM design.
Key Contributions
- Over-Tokenized Transformers Framework: The authors propose a novel framework that expands the input vocabulary to improve performance across model sizes. This decoupled design allows much larger vocabularies to be used for encoding than for decoding, keeping computational overhead in check while enhancing model scalability and capability.
- Scaling Laws and Tokenization: Through extensive experimentation, the authors identify a log-linear relationship between input vocabulary size and training loss. Larger input vocabularies consistently yield performance improvements across different model scales, highlighting a new dimension in the scaling laws for LLMs.
- Practical Implications for Tokenizer Design: The paper provides substantial insights into tokenizer design by demonstrating that scaling up input vocabularies positively affects model performance while expanding the output vocabularies might negatively impact smaller models. These findings underline tokenization's pivotal role in advancing LLMs and inform the design choices for future tokenizer implementations.
- Hierarchical n-Gram Embeddings: The authors introduce an efficient hierarchical n-gram encoding strategy that significantly improves model performance by incorporating multi-granularity token embeddings, allowing models to capture longer-range dependencies more effectively (a minimal embedding sketch follows this list).
- Integration with Multi-Token Prediction: The research further integrates over-encoding with multi-token prediction, resulting in the full Over-Tokenized Transformer. This integration helps leverage the expanded vocabulary's full potential, particularly for larger models (a sketch of such prediction heads also appears below).
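To make the hierarchical n-gram idea concrete, the following is a minimal PyTorch sketch of an over-encoded input embedding: each position sums a standard 1-gram token embedding with hashed 2-gram and 3-gram embeddings drawn from much larger tables, so the effective input vocabulary far exceeds the output vocabulary. The class name, table sizes, and modulo-hashing scheme are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class HierarchicalNGramEmbedding(nn.Module):
    """Sketch of an over-encoded input embedding: each position sums the
    embedding of its 1-gram token with hashed embeddings of the 2-gram and
    3-gram formed with the preceding tokens. Names, table sizes, and the
    hashing scheme are illustrative, not the paper's exact design."""

    def __init__(self, vocab_size: int, ngram_table_size: int, d_model: int):
        super().__init__()
        self.vocab_size = vocab_size
        self.table_size = ngram_table_size
        self.unigram = nn.Embedding(vocab_size, d_model)        # standard token embedding
        self.bigram = nn.Embedding(ngram_table_size, d_model)   # large hashed 2-gram table
        self.trigram = nn.Embedding(ngram_table_size, d_model)  # large hashed 3-gram table

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq_len) token ids
        prev1 = torch.roll(ids, shifts=1, dims=1)
        prev1[:, 0] = 0                      # pad at the sequence start
        prev2 = torch.roll(ids, shifts=2, dims=1)
        prev2[:, :2] = 0
        # Compose n-gram ids, then fold them into the table via modulo hashing.
        bigram_ids = (prev1 * self.vocab_size + ids) % self.table_size
        trigram_ids = ((prev2 * self.vocab_size + prev1) * self.vocab_size + ids) % self.table_size
        return self.unigram(ids) + self.bigram(bigram_ids) + self.trigram(trigram_ids)

# Example: the effective input vocabulary is far larger than the 32k output vocabulary.
emb = HierarchicalNGramEmbedding(vocab_size=32_000, ngram_table_size=100_003, d_model=64)
x = torch.randint(0, 32_000, (2, 16))
print(emb(x).shape)  # torch.Size([2, 16, 64])
```

Because only embedding lookups change, the per-step compute of the transformer trunk is essentially unaffected, which is why the input side can be scaled aggressively.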
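On the output side, the multi-token prediction component can be pictured as a set of lightweight heads over the shared hidden states, each trained against a different future offset. The sketch below is an illustrative layout under that assumption, not the paper's exact over-decoding design; the head count and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class MultiTokenPredictionHeads(nn.Module):
    """Sketch of multi-token prediction: several linear heads share the
    transformer's hidden states, with head k trained against tokens shifted
    by (k + 1) positions. Illustrative only."""

    def __init__(self, d_model: int, output_vocab: int, num_future: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, output_vocab) for _ in range(num_future)
        )

    def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
        # hidden: (batch, seq_len, d_model) from the transformer trunk.
        return [head(hidden) for head in self.heads]

heads = MultiTokenPredictionHeads(d_model=64, output_vocab=32_000, num_future=2)
logits = heads(torch.randn(2, 16, 64))
print([tuple(l.shape) for l in logits])  # [(2, 16, 32000), (2, 16, 32000)]
```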
Experimental Insights
The authors conduct extensive experiments across dense and sparse models at various scales (e.g., OLMo and OLMoE), demonstrating that over-encoding consistently improves performance over baseline counterparts. Particularly notable is a roughly 2.5-fold gain in model-scale efficiency with over-encoding: similar performance is reached with substantially fewer parameters. Larger input vocabularies also help maintain accuracy even at reduced training cost.
For MoE architectures, over-encoding shows comparable improvements in training loss reduction and downstream task performance, although the magnitude of improvement decreases as model size increases. This suggests potential overlapping benefits from the sparse parameters typically employed in MoE models.
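As one way to read the reported log-linear trend, a simple fit of training loss against the logarithm of input vocabulary size shows how each multiplicative increase in vocabulary buys a roughly constant loss reduction. The numbers below are hypothetical placeholders for illustration, not the paper's measurements.

```python
import numpy as np

# Hypothetical (input vocabulary size, final training loss) pairs, used purely
# to illustrate the reported log-linear trend; these are not the paper's numbers.
vocab_sizes = np.array([3.2e4, 1.28e5, 5.12e5, 2.048e6, 8.192e6])
train_loss = np.array([2.95, 2.91, 2.87, 2.83, 2.79])

# Fit loss ~ a + b * log(V); a negative slope b means larger input
# vocabularies keep lowering training loss.
b, a = np.polyfit(np.log(vocab_sizes), train_loss, deg=1)
print(f"fitted slope per e-fold of vocabulary growth: {b:.4f}")

# Predicted loss reduction from growing the input vocabulary 4x:
print(f"predicted gain from a 4x larger vocabulary: {-b * np.log(4):.4f}")
```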
Future Implications and Speculations
This work suggests several directions for future research and development in AI:
- Broader Application of Tokenization Principles: The insights into tokenizer design and scalability can be applied beyond language tasks, potentially benefiting other AI applications where input granularity and efficient encoding are crucial.
- Integration with AI Hardware Advances: The separation of vocabulary into input and output components can parallel developments in AI hardware, where specialized hardware could optimize different stages of model processing.
- Adaptive Tokenization Strategies: Future LLMs could incorporate dynamic tokenization strategies based on context or task-specific requirements, further leveraging the performance gains observed from larger input vocabularies.
In conclusion, "Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling" provides significant contributions to understanding and optimizing the impact of tokenization within LLMs. Its findings not only emphasize the importance of tokenization in the context of scaling laws but also pave the way for new research and practical applications aimed at enhancing the efficiency and capability of next-generation LLMs.
Follow-up Questions
- How does decoupling the input and output vocabularies impact the learning dynamics and generalization of large language models?
- What are the potential limitations or downsides of aggressively expanding input vocabularies beyond the ranges tested in this paper?
- How might hierarchical n-gram embeddings affect the interpretability of token representations within Transformer models?
- In what ways could over-tokenization and dynamic tokenization strategies be adapted for modalities beyond language, such as vision or multimodal models?
Related Papers
- Getting the most out of your tokenizer for pre-training and domain adaptation (2024)
- Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance (2024)
- Toward a Theory of Tokenization in LLMs (2024)
- Large Vocabulary Size Improves Large Language Models (2024)
- Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies (2024)
- Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles (2024)
- Counting Ability of Large Language Models and Impact of Tokenization (2024)
- Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages (2024)
- Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More (2025)
- Beyond Text Compression: Evaluating Tokenizers Across Scales (2025)