Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
In the field of LLMs, where getting the best performance out of a fixed computational budget is a central concern, research has predominantly focused on optimizing the number of model parameters and the amount of training data. The paper "Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies" by Chaofan Tao et al. draws attention to an often-overlooked factor: vocabulary size. It rigorously investigates how vocabulary size affects LLM scaling laws and proposes methods to predict the compute-optimal vocabulary size for efficient LLM scaling.
Background and Objective
LLMs have achieved remarkable success by leveraging vast text corpora and extensive computational resources, a trend captured by scaling laws that predict how performance improves with parameter count and training data. Despite these advances, the role of vocabulary size has remained underexplored. Vocabulary size affects both tokenization efficiency and the model's representational capacity, yet an excessively large vocabulary can lead to under-fitting, since infrequent tokens receive too few training examples to learn good representations.
The primary objective of this paper is to quantify the effect of vocabulary size on LLM performance and develop predictive methods for determining the optimal vocabulary size given a compute budget. By addressing this gap, the paper provides a framework to enhance the efficiency of LLM scaling.
Methodology
The research introduces three complementary approaches to predict the compute-optimal vocabulary size:
- IsoFLOPs Analysis: Pre-train groups of models that share the same computational budget (measured in FLOPs) and the same non-vocabulary parameter count but differ in vocabulary size, then fit power laws relating FLOPs, non-vocabulary parameters, vocabulary parameters, and training data. These fits reveal how the quantities should scale together to minimize loss (a sketch of this kind of fit appears after this list).
- Derivative-Based Estimation: Compute the derivative of FLOPs with respect to vocabulary size and solve for the point at which it vanishes, i.e., where FLOPs are minimized for a fixed loss. Solving this equation yields an estimate of the optimal vocabulary size for a given non-vocabulary parameter count (a numerical illustration follows the list).
- Parametric Fit of Loss Formula: Modifying the classical Chinchilla scaling laws, this approach incorporates vocabulary parameters into the loss function. It predicts the loss based on non-vocabulary parameters, vocabulary parameters, and the number of training characters, providing flexibility to determine optimal vocabulary sizes even when model parameters and training data are not scaled equally.
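To make the IsoFLOPs-style power-law fitting concrete, here is a minimal sketch that fits a relation of the form N_v_opt = k · C^a to (compute budget, loss-minimizing vocabulary parameters) pairs, the kind of data an isoFLOP sweep would produce. The data points, the helper name fit_power_law, and the fitted coefficients are all invented for illustration; only the log-log fitting procedure itself is the point.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y = k * x**a via linear regression in log-log space; returns (k, a)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return np.exp(intercept), slope

# Hypothetical isoFLOP sweep results: for each compute budget C (in FLOPs),
# the vocabulary-parameter count that minimized loss within that sweep.
# All numbers below are made up purely for illustration.
compute_budgets = np.array([1e19, 1e20, 1e21, 1e22])
best_vocab_params = np.array([2.0e7, 4.5e7, 1.0e8, 2.3e8])

k, a = fit_power_law(compute_budgets, best_vocab_params)
print(f"Fitted power law: N_v_opt ~ {k:.3e} * C^{a:.3f}")

# Extrapolating the fit to a new budget (again, purely illustrative).
new_budget = 2.3e21
print(f"Predicted optimal vocab params at {new_budget:.1e} FLOPs: {k * new_budget**a:.3e}")
```

In the paper, analogous fits are carried out jointly for non-vocabulary parameters, vocabulary parameters, and training data as functions of the FLOPs budget.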
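The derivative-based estimation can be sketched numerically as well. The snippet below assumes a hypothetical Chinchilla-style loss with an added vocabulary-parameter term; the functional form, every coefficient, and the resulting optimum are invented for illustration and are not the paper's fitted values. For a fixed non-vocabulary parameter count and a fixed target loss, it computes the training data needed as a function of vocabulary size, converts that into FLOPs, and locates the vocabulary size at which the FLOPs curve bottoms out, i.e., where its derivative with respect to vocabulary size changes sign.

```python
import numpy as np

# Hypothetical loss model L = E + A/N_nv**ALPHA + BV/N_v**BETA + B/D**GAMMA.
# All constants below are invented for illustration only.
E, A, ALPHA = 1.7, 400.0, 0.34   # irreducible loss + non-vocabulary term
BV, BETA = 1.0e5, 0.8            # vocabulary-parameter term
B, GAMMA = 4.0e3, 0.28           # training-data term

N_NV = 1.0e9       # fixed non-vocabulary parameter count
D_EMB = 2048       # embedding dimension, so vocab params N_v = V * D_EMB
TARGET_LOSS = 2.6  # fixed loss level at which compute is compared

def data_needed(vocab_size):
    """Training data required to reach TARGET_LOSS at this vocabulary size."""
    residual = TARGET_LOSS - E - A / N_NV**ALPHA - BV / (vocab_size * D_EMB)**BETA
    if residual <= 0:
        return np.inf  # target loss unreachable with this vocabulary
    return (B / residual) ** (1.0 / GAMMA)

def flops(vocab_size):
    """Approximate training FLOPs: 6 * total parameters * training data."""
    return 6.0 * (N_NV + vocab_size * D_EMB) * data_needed(vocab_size)

# Scan vocabulary sizes; the minimizer is where d(FLOPs)/d(vocab size) = 0.
vocab_grid = np.arange(8_000, 256_000, 1_000)
flops_curve = np.array([flops(v) for v in vocab_grid])
print(f"Compute-optimal vocabulary size under this toy model: {vocab_grid[np.argmin(flops_curve)]}")
```

A closed-form version of the same idea sets the derivative of FLOPs with respect to vocabulary size to zero and solves it analytically, which is what the paper's second approach does.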
Results and Findings
The predictions from all three approaches converge on the key finding that larger models require larger vocabularies. Notably, the optimal vocabulary parameters follow a power law in the non-vocabulary parameters, with vocabulary parameters growing more slowly than non-vocabulary parameters.
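To illustrate what such a sub-linear power law implies, the toy calculation below uses a purely hypothetical exponent and constant (the paper fits its own values): with an exponent of 0.8, multiplying non-vocabulary parameters by 10 grows the optimal vocabulary parameters by only about 10^0.8 ≈ 6.3x.

```python
# Illustrative only: optimal vocab params N_v ~ K * N_nv**GAMMA with GAMMA < 1.
# K and GAMMA are hypothetical, not the paper's fitted values.
K, GAMMA = 0.5, 0.8

for n_nv in (1e9, 1e10, 1e11):
    n_v = K * n_nv**GAMMA
    print(f"non-vocab params {n_nv:.0e} -> optimal vocab params ~ {n_v:.2e}")
```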
Empirical validation confirms the practicality and accuracy of the proposed approaches. For example, models trained with the optimal vocabulary sizes predicted by these approaches consistently outperformed models with commonly used, smaller vocabularies across various downstream tasks. Specifically, increasing the vocabulary size from 32K to 43K improved ARC-Challenge performance from 29.1 to 32.0 under the same 2.3e21 FLOPs budget.
Implications and Future Directions
This research underscores the significant yet previously underestimated role of vocabulary size in scaling LLMs. The implications of these findings are multifaceted:
- Practical Implications: Practitioners can achieve better performance by jointly optimizing model parameters, training data, and vocabulary size, leading to more efficient and effective LLMs.
- Theoretical Implications: The paper extends existing scaling laws to include vocabulary size, offering a more comprehensive framework for understanding and predicting LLM performance.
Future research may explore the application of these findings to multilingual and multimodal models, where vocabulary considerations can be even more complex. Additionally, investigating the trade-offs between vocabulary size and computational efficiency during inference could further optimize LLM deployment in real-world applications.
Conclusion
The paper by Tao et al. makes a compelling case for the necessity of considering vocabulary size in LLM scaling laws. By developing robust methodologies to predict the optimal vocabulary sizes, this research enriches the scaling law framework, ensuring that larger models can fully leverage their capacities with appropriately sized vocabularies. This work paves the way for more efficient, powerful, and versatile LLMs, highlighting the critical interplay between computational resources and linguistic representation.