Large Vocabulary Size Improves Large Language Models (2406.16508v1)

Published 24 Jun 2024 in cs.CL

Abstract: This paper empirically investigates the relationship between subword vocabulary size and the performance of LLMs to provide insights on how to define the vocabulary size. Experimental results show that larger vocabulary sizes lead to better performance in LLMs. Moreover, we consider a continual training scenario where a pre-trained LLM is trained on a different target language. We introduce a simple method to use a new vocabulary instead of the pre-defined one. We show that using the new vocabulary outperforms the model with the vocabulary used in pre-training.

Authors (4)
  1. Sho Takase (25 papers)
  2. Ryokan Ri (15 papers)
  3. Shun Kiyono (18 papers)
  4. Takuya Kato (8 papers)
Citations (1)

Summary

Large Vocabulary Size and Its Impact on LLMs

The paper by Takase et al. investigates an often-overlooked aspect of improving LLMs: the size of the subword vocabulary. Whereas prior work has focused largely on scaling model and data size and on architectural changes, this paper empirically studies how vocabulary size affects LLM performance, both in monolingual settings and in continual-training scenarios.

Empirical Investigations and Results

The core of the work examines how subword vocabulary size affects LLM performance in English and Japanese. The researchers train Transformer-based models with vocabulary sizes of 5k, 10k, 50k, 100k, and 500k. Their experiments cover two settings, one with a fixed total number of training tokens and one with a fixed number of training epochs, a prudent way to account for differences in computational cost and data consumption across vocabulary sizes.
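As an illustration of this kind of vocabulary-size sweep (a minimal sketch, not the authors' actual pipeline), one could train subword tokenizers at each size with a library such as SentencePiece and compare how many tokens each needs for the same text; the corpus path, model prefixes, and unigram model type below are assumptions.

```python
# Sketch: train subword tokenizers at several vocabulary sizes, then compare
# how many tokens each one needs to encode the same text. Paths are placeholders.
import sentencepiece as spm

VOCAB_SIZES = [5_000, 10_000, 50_000, 100_000, 500_000]

for vocab_size in VOCAB_SIZES:
    spm.SentencePieceTrainer.train(
        input="corpus.txt",               # placeholder training corpus
        model_prefix=f"sp_{vocab_size}",  # writes sp_<size>.model / sp_<size>.vocab
        vocab_size=vocab_size,
        model_type="unigram",             # assumption; BPE works the same way
        character_coverage=0.9995,
    )

sample = "Larger vocabularies encode the same text in fewer tokens."
for vocab_size in VOCAB_SIZES:
    sp = spm.SentencePieceProcessor(model_file=f"sp_{vocab_size}.model")
    print(vocab_size, len(sp.encode(sample)))
```

The token counts printed at the end make the efficiency angle concrete: the larger the vocabulary, the fewer subword tokens are typically needed for the same raw text.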

Notably, larger vocabulary sizes consistently improve performance on a range of commonsense reasoning tasks in both languages. The numbers support this: the largest English vocabulary (500k) yields an average score about 2.5 points higher than the smallest (5k) on tasks such as WinoGrande and ARC. In Japanese, the 500k-vocabulary model shows a marked improvement, underscoring the utility of large vocabularies for languages with rich character sets.

Continual Training and Vocabulary Adaptation

The paper further explores a simple yet effective strategy for adapting LLMs to new languages by reconstructing subword vocabularies during the model's continual training phase. Starting with a pre-trained Llama2 model, the researchers swap out the pre-defined vocabulary in favor of one more suited to the target language. They demonstrate that incorporating a new vocabulary can outperform merely adapting with the original pre-trained vocabulary, especially when aligning embeddings of shared subwords in both vocabularies. This adaptation perspective offers a pathway for enhancing multilingual LLMs, particularly when extending capabilities to typologically diverse languages.
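One concrete way to realize such a vocabulary swap (a sketch under our own assumptions rather than the paper's exact procedure, with an illustrative helper name) is to initialize the new embedding matrix by copying rows for subwords that appear in both the old and new vocabularies and using small random vectors elsewhere:

```python
# Sketch: build an embedding matrix for a new vocabulary, copying rows for
# subwords that also exist in the pre-training vocabulary. The two vocabularies
# and the pre-trained embedding tensor are assumed to be given.
import torch

def init_new_embeddings(old_vocab: dict[str, int],
                        new_vocab: dict[str, int],
                        old_embeddings: torch.Tensor) -> torch.Tensor:
    """old_vocab/new_vocab map subword strings to ids; old_embeddings is the
    pre-trained (|V_old|, d) embedding matrix."""
    d_model = old_embeddings.size(1)
    # Small random init for subwords that were never seen during pre-training.
    new_embeddings = torch.randn(len(new_vocab), d_model) * 0.02
    shared = 0
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            # Reuse the pre-trained representation for shared subwords.
            new_embeddings[new_id] = old_embeddings[old_id]
            shared += 1
    print(f"copied {shared}/{len(new_vocab)} rows from the old vocabulary")
    return new_embeddings
```

The shared rows give continual training a warm start, while the newly introduced subwords are learned from scratch on the target-language data.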

Theoretical and Practical Implications

From a theoretical standpoint, the findings suggest that subword vocabulary size is a critical hyperparameter that directly influences both the monolingual and cross-lingual effectiveness of LLMs, prompting a reconsideration of the often arbitrary vocabulary-size choices made in existing systems. Practically, larger vocabularies encode the same text in fewer tokens, so sequences are shorter and training is more compute-efficient, and they can also open up better fine-tuning opportunities across languages.

By examining configurations across multiple languages and tasks, the paper strengthens the argument that vocabulary choice deserves the same attention as scaling model parameters or datasets. This research could influence future approaches to building and extending LLMs by encouraging more dynamic, context-specific vocabulary adaptation.

Speculation on Future Developments

Moving forward, these findings may fuel further research into automatic or adaptive vocabulary selection mechanisms that account for task-specific and language-specific requirements. Training pipelines might also adopt methods for expanding or shrinking a model's vocabulary over time, making LLMs more adaptable and resource-efficient.

Overall, the paper presents a comprehensive analysis and offers compelling evidence that challenges existing preconceptions about vocabulary size, making a strong case for its optimization as part of LLM training and deployment strategies.
