Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size (2506.15025v1)

Published 17 Jun 2025 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Pretraining LLMs is a costly process. To make this process more efficient, several methods have been proposed to optimize model architecture/parametrization and hardware use. On the parametrization side, $\mu P$ (Maximal Update Parametrization) parametrizes model weights and learning rate (LR) in a way that makes hyperparameters (HPs) transferable with width (embedding dimension): HPs can be tuned for a small model and used for larger models without additional tuning. While $\mu$P showed impressive results in practice, recent empirical studies have reported conflicting observations when applied to LLMs. One limitation of the theory behind $\mu$P is the fact that input dimension (vocabulary size in LLMs) is considered fixed when taking the width to infinity. This is unrealistic since vocabulary size is generally much larger than width in practice. In this work, we provide a theoretical analysis of the effect of vocabulary size on training dynamics, and subsequently show that as vocabulary size increases, the training dynamics \emph{interpolate between the $\mu$P regime and another regime that we call Large Vocab (LV) Regime}, where optimal scaling rules are different from those predicted by $\mu$P. Our analysis reveals that in the LV regime, the optimal embedding LR to hidden LR ratio should roughly scale as $\Theta(\sqrt{width})$, surprisingly close to the empirical findings previously reported in the literature, and different from the $\Theta(width)$ ratio predicted by $\mu$P. We conduct several experiments to validate our theory, and pretrain a 1B model from scratch to show the benefit of our suggested scaling rule for the embedding LR.

Pretraining LLMs is a computationally intensive process, and effectively tuning hyperparameters such as the learning rate (LR) across different model scales is a significant challenge. Maximal Update Parametrization ($\mu$P) (Yang et al., 2022) was proposed to address this by providing scaling rules for initialization and LR that theoretically enable hyperparameter transfer with model width (embedding dimension $d$). However, empirical studies applying $\mu$P to LLMs have shown conflicting results, particularly regarding the optimal embedding layer LR.

This paper, "Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size" (Hayou et al., 17 Jun 2025 ), investigates why μ\muP's predictions for LLMs might be inaccurate. The authors identify a key limitation in the standard μ\muP theory: it assumes a fixed input dimension (vocabulary size mm) while only scaling model width dd. In practice, LLM vocabulary sizes are often much larger than model width, and the relationship between mm and dd is not fixed across all scales and datasets. Furthermore, the embedding layer acts as a lookup table, meaning updates are heavily influenced by token frequencies, a factor not fully captured by traditional μ\muP analysis.

The paper provides a theoretical analysis using a simplified linear model consisting only of embedding and projection layers, trained with an Adam-like optimizer (specifically, SignSGD for tractability); a minimal sketch of this kind of setup follows the list below. The authors analyze how the magnitude of updates to the embedding and projection weights scales as both model width $d$ and vocabulary size $m$ become large. This analysis reveals two distinct regimes:

  1. $\mu$P Regime: When vocabulary size $m$ is fixed relative to $d$, the update magnitudes for both the embedding and projection layers scale with $d$. This aligns with the conditions under which $\mu$P was derived, suggesting optimal LRs scaling as $\eta_E=\Theta(1)$ and $\eta_W=\Theta(d^{-1})$ (for hidden/projection layers) to achieve $\Theta(1)$ updates.
  2. Large Vocabulary (LV) Regime: When vocabulary size $m$ scales significantly with $d$ (e.g., $m \propto d$) or is much larger ($m \gg d$), the analysis shows that the update magnitude for the embedding layer scales approximately as $\Theta(\sqrt{d})$, while the projection layer update still scales as $\Theta(d)$. This difference arises from the effect of large $m$ and token frequency on the element-wise normalization used in Adam-like optimizers.
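To make the analysis setup concrete, here is a minimal NumPy sketch of a toy model of this kind: an embedding table and a projection layer trained with SignSGD on next-token cross-entropy. The dimensions, Zipf-like token distribution, learning rates, and batch size are illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 256, 4096           # width and vocabulary size (toy values)
eta_E, eta_W = 1e-2, 1e-2  # embedding and projection learning rates

# SP-like initialization: std d^{-1/2}, i.e. variance 1/d.
E = rng.normal(0.0, d ** -0.5, size=(m, d))  # embedding table (vocab x width)
W = rng.normal(0.0, d ** -0.5, size=(d, m))  # projection back to vocab logits

def signsgd_step(tokens, targets):
    """One SignSGD step on cross-entropy for a batch of (input, next) token ids."""
    global E, W
    h = E[tokens]                                   # (B, d) embedding lookup
    logits = h @ W                                  # (B, m)
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits); p /= p.sum(axis=1, keepdims=True)
    g_logits = p
    g_logits[np.arange(len(targets)), targets] -= 1.0   # dL/dlogits
    g_W = h.T @ g_logits / len(tokens)              # (d, m) projection gradient
    g_h = g_logits @ W.T / len(tokens)              # (B, d) gradient w.r.t. features
    # Embedding rows receive gradient only for tokens present in the batch,
    # which is how token frequency enters the embedding update statistics.
    g_E = np.zeros_like(E)
    np.add.at(g_E, tokens, g_h)
    W -= eta_W * np.sign(g_W)
    E -= eta_E * np.sign(g_E)   # sign(0) = 0, so unseen rows are untouched

# Zipf-like token frequencies to mimic natural-language vocabulary statistics.
freqs = 1.0 / np.arange(1, m + 1); freqs /= freqs.sum()
for _ in range(100):
    tokens = rng.choice(m, size=64, p=freqs)
    targets = rng.choice(m, size=64, p=freqs)
    signsgd_step(tokens, targets)
```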

This theoretical finding suggests that in the LV regime, which the authors argue is more representative of modern LLMs, the optimal scaling for the embedding LR should differ from $\mu$P's $\Theta(1)$ prediction. To maintain balanced feature-learning updates across layers, the embedding LR ($\eta_E$) should be scaled relative to the hidden/projection LR ($\eta_W$) such that $\eta_E/\eta_W \approx \Theta(\sqrt{d})$. This contrasts with $\mu$P's suggested ratio of $\Theta(d)$ and with standard practice (Standard Parametrization, SP), which often uses a ratio of $\Theta(1)$.
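Collecting the three prescriptions in one place (with $d$ the model width), the embedding-to-hidden LR ratio is:

$$
\frac{\eta_E}{\eta_W} \;\approx\;
\begin{cases}
\Theta(1) & \text{Standard Parametrization (same LR for all layers)}\\
\Theta(d) & \mu\text{P prediction}\\
\Theta(\sqrt{d}) & \text{Large Vocabulary regime (this paper's } \sqrt{d}\text{-rule)}
\end{cases}
$$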

Based on this, the authors propose a Large Vocabulary Parametrization (LVP). While the theoretical analysis used a simplified model and optimizer, the authors hypothesize that the core finding regarding the embedding layer's sensitivity to vocabulary size carries over to full transformer architectures trained with Adam. LVP uses Standard Parametrization-like initialization ($\sigma_E=\sigma_W=\sigma_H=d^{-1/2}$) and $\mu$P-like LR scaling for hidden and output layers ($\eta_W = \eta_H = \eta d^{-1}$), but incorporates the $\sqrt{d}$-rule for the embedding layer LR ($\eta_E = \eta d^{-1/2}$). This yields the desired $\eta_E/\eta_W = \Theta(\sqrt{d})$ ratio.
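As a concrete reading of these rules, the following Python sketch maps a base LR $\eta$ and width $d$ to LVP-style initialization and per-layer-group LRs. The function name and dictionary keys are ours, not the paper's.

```python
def lvp_hyperparameters(eta: float, d: int) -> dict:
    """Hyperparameters under the Large Vocabulary Parametrization (LVP).

    Initialization is SP-like (std d^{-1/2} for embedding, hidden, and output
    weights); hidden/output LRs follow the muP-like d^{-1} rule, and the
    embedding LR follows the sqrt(d)-rule, eta_E = eta * d^{-1/2},
    so that eta_E / eta_W = sqrt(d).
    """
    return {
        "init_std": d ** -0.5,            # sigma_E = sigma_W = sigma_H = d^{-1/2}
        "lr_embedding": eta * d ** -0.5,  # eta_E = eta / sqrt(d)
        "lr_hidden": eta / d,             # eta_W = eta / d
        "lr_output": eta / d,             # eta_H = eta / d
    }

# Example: at d = 2048 the embedding/hidden LR ratio is sqrt(2048) ~ 45.3.
hp = lvp_hyperparameters(eta=1.0, d=2048)
print(hp["lr_embedding"] / hp["lr_hidden"])  # ~45.25
```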

The paper validates these theoretical findings with experiments:

  1. Small Model Scaling with Vocabulary: They trained a small transformer model while scaling both width $d$ and vocabulary size $m$ such that $m$ grows linearly with $d$. By sweeping the embedding LR ($\eta_E$) while fixing the hidden/projection LRs according to LVP ($\eta_H=\eta_W \propto d^{-1}$), they observed that the optimal $\eta_E$ indeed decreases sublinearly with $d$, aligning more closely with the $\sqrt{d}$ behavior predicted by their theory than with $\mu$P's constant prediction.
  2. Production-Scale 1B Model Pretraining: To assess the practical benefit, they trained a 1B-parameter dense transformer model (with $d=2048$, so $\sqrt{d} \approx 45.3$) on a large, production-scale dataset (1.75T tokens, used for Phi-3 (Munkhdalai et al., 10 Apr 2024)). A baseline model used the conventional practice of applying the same LR across all layers ($\eta_E/\eta_W \approx 1$), while their experimental model used $\eta_E/\eta_W \approx \sqrt{d}$. The model trained with the $\sqrt{d}$ ratio for the embedding LR achieved consistently lower training loss and better perplexity on the Wikitext test set than the baseline. Experiments with other ratios confirmed that $\sqrt{d}$ was near-optimal.

Practical Implementation:

  • The key takeaway for practitioners is that when pretraining LLMs with large vocabularies, the embedding layer's learning rate should likely be higher than that of the hidden and projection layers.
  • Specifically, the paper suggests setting the ratio of the embedding LR to the hidden/projection LR to approximately $\sqrt{d}$, where $d$ is the embedding dimension (model width). If using an Adam-like optimizer with global LR $\eta$ and the standard $d^{-1}$ scaling for hidden layers, the embedding LR could be set to $\eta \times d^{-1/2}$ and the hidden/output LRs to $\eta \times d^{-1}$ (a minimal optimizer setup along these lines is sketched after this list).
  • The authors' LVP parametrization suggests combining SP-like initialization variance ($d^{-1}$) with $\mu$P-like LR scaling ($d^{-1}$ for hidden/output) and the $\sqrt{d}$-rule for the embedding LR ($d^{-1/2}$). However, the empirical results for the 1B model primarily focus on the ratio of LRs, implying that adjusting the embedding LR relative to the others is the most critical aspect.
  • Experimentation with different LR ratios around $\sqrt{d}$ may still be necessary to find the absolute optimum for a specific model size, architecture, and dataset.
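For a PyTorch-style Adam optimizer, one way to wire this up is with parameter groups. The sketch below uses a toy stand-in model and illustrative optimizer settings (sizes, base LR, betas, and weight decay are assumptions); only the relative LRs follow the paper's $\sqrt{d}$ recommendation.

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size = 2048, 100_352   # illustrative sizes only

# Toy stand-in for a transformer: embedding, one hidden block, and an output head.
embedding = nn.Embedding(vocab_size, d_model)
hidden = nn.Linear(d_model, d_model)
output = nn.Linear(d_model, vocab_size, bias=False)

base_lr = 1e-3  # treat this as the hidden/output LR you would normally use

optimizer = torch.optim.AdamW(
    [
        # Embedding LR is sqrt(d) times the hidden/output LR (~45x for d=2048).
        {"params": embedding.parameters(), "lr": base_lr * math.sqrt(d_model)},
        {"params": hidden.parameters(), "lr": base_lr},
        {"params": output.parameters(), "lr": base_lr},
    ],
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
```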

Limitations and Future Work:

  • The theoretical analysis uses a simplified linear model and SignSGD, and extending it to full transformers with Adam is complex.
  • Optimal scaling rules might depend on the training step $t$, not just on $d$ and $m$.
  • The analysis highlights that optimal LR scaling is sensitive to token frequencies, suggesting that more advanced parametrizations might benefit from explicitly incorporating this information.
  • While the $\sqrt{d}$-rule improves training efficiency, it is not proven to guarantee perfect hyperparameter transfer across all scales and datasets, unlike the theoretical claim of $\mu$P (though $\mu$P itself shows limitations in practice for LLMs).

In conclusion, the paper provides both theoretical evidence and empirical validation showing that vocabulary size is a critical factor influencing the optimal embedding learning rate in LLMs. It challenges the universality of $\mu$P scaling in the large-vocabulary setting and proposes a practical heuristic, setting the embedding LR roughly $\sqrt{d}$ times higher than the hidden/projection LRs, that demonstrably improves training performance for large LLMs.

Authors (2)
  1. Soufiane Hayou
  2. Liyuan Liu