Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size (2506.15025v1)

Published 17 Jun 2025 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Pretraining LLMs is a costly process. To make this process more efficient, several methods have been proposed to optimize model architecture/parametrization and hardware use. On the parametrization side, $\mu P$ (Maximal Update Parametrization) parametrizes model weights and learning rate (LR) in a way that makes hyperparameters (HPs) transferable with width (embedding dimension): HPs can be tuned for a small model and used for larger models without additional tuning. While $\mu$P showed impressive results in practice, recent empirical studies have reported conflicting observations when applied to LLMs. One limitation of the theory behind $\mu$P is the fact that input dimension (vocabulary size in LLMs) is considered fixed when taking the width to infinity. This is unrealistic since vocabulary size is generally much larger than width in practice. In this work, we provide a theoretical analysis of the effect of vocabulary size on training dynamics, and subsequently show that as vocabulary size increases, the training dynamics \emph{interpolate between the $\mu$P regime and another regime that we call Large Vocab (LV) Regime}, where optimal scaling rules are different from those predicted by $\mu$P. Our analysis reveals that in the LV regime, the optimal embedding LR to hidden LR ratio should roughly scale as $\Theta(\sqrt{width})$, surprisingly close to the empirical findings previously reported in the literature, and different from the $\Theta(width)$ ratio predicted by $\mu$P. We conduct several experiments to validate our theory, and pretrain a 1B model from scratch to show the benefit of our suggested scaling rule for the embedding LR.

Pretraining LLMs is a computationally intensive process, and effectively tuning hyperparameters such as the learning rate (LR) across different model scales is a significant challenge. Maximal Update Parametrization ($\mu$P) (Yang et al., 2022) was proposed to address this by providing scaling rules for initialization and LR that theoretically enable hyperparameter transfer with model width (embedding dimension $d$). However, empirical studies applying $\mu$P to LLMs have shown conflicting results, particularly regarding the optimal embedding layer LR.

This paper, "Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size" (Hayou et al., 17 Jun 2025 ), investigates why μ\muP's predictions for LLMs might be inaccurate. The authors identify a key limitation in the standard μ\muP theory: it assumes a fixed input dimension (vocabulary size mm) while only scaling model width dd. In practice, LLM vocabulary sizes are often much larger than model width, and the relationship between mm and dd is not fixed across all scales and datasets. Furthermore, the embedding layer acts as a lookup table, meaning updates are heavily influenced by token frequencies, a factor not fully captured by traditional μ\muP analysis.

The paper provides a theoretical analysis using a simplified linear model consisting only of embedding and projection layers, trained with an Adam-like optimizer (specifically, SignSGD for tractability); a minimal sketch of this kind of setup follows the list below. The authors analyze how the magnitude of updates to the embedding and projection weights scales as both model width $d$ and vocabulary size $m$ become large. This analysis reveals two distinct regimes:

  1. $\mu$P Regime: When vocabulary size $m$ is fixed relative to $d$, the update magnitudes for both the embedding and projection layers scale with $d$. This aligns with the conditions under which $\mu$P was derived, suggesting optimal LRs scaling as $\eta_E=\Theta(1)$ and $\eta_W=\Theta(d^{-1})$ (for hidden/projection layers) to achieve $\Theta(1)$ updates.
  2. Large Vocabulary (LV) Regime: When vocabulary size $m$ scales significantly with $d$ (e.g., $m \propto d$) or is much larger ($m \gg d$), the analysis shows that the update magnitude for the embedding layer scales approximately as $\Theta(\sqrt{d})$, while the projection layer update still scales as $\Theta(d)$. This difference arises from the effect of large $m$ and token frequency on the element-wise normalization used in Adam-like optimizers.
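To make the analysis setup concrete, here is a minimal NumPy sketch of a toy model of this kind: an embedding table and a projection layer trained with SignSGD on next-token cross-entropy. The dimensions, Zipf-like token distribution, learning rates, and batch size are illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 256, 4096           # width and vocabulary size (toy values)
eta_E, eta_W = 1e-2, 1e-2  # embedding and projection learning rates

# SP-like initialization: std d^{-1/2}, i.e. variance 1/d.
E = rng.normal(0.0, d ** -0.5, size=(m, d))  # embedding table (vocab x width)
W = rng.normal(0.0, d ** -0.5, size=(d, m))  # projection back to vocab logits

def signsgd_step(tokens, targets):
    """One SignSGD step on cross-entropy for a batch of (input, next) token ids."""
    global E, W
    h = E[tokens]                                   # (B, d) embedding lookup
    logits = h @ W                                  # (B, m)
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits); p /= p.sum(axis=1, keepdims=True)
    g_logits = p
    g_logits[np.arange(len(targets)), targets] -= 1.0   # dL/dlogits
    g_W = h.T @ g_logits / len(tokens)              # (d, m) projection gradient
    g_h = g_logits @ W.T / len(tokens)              # (B, d) gradient w.r.t. features
    # Embedding rows receive gradient only for tokens present in the batch,
    # which is how token frequency enters the embedding update statistics.
    g_E = np.zeros_like(E)
    np.add.at(g_E, tokens, g_h)
    W -= eta_W * np.sign(g_W)
    E -= eta_E * np.sign(g_E)   # sign(0) = 0, so unseen rows are untouched

# Zipf-like token frequencies to mimic natural-language vocabulary statistics.
freqs = 1.0 / np.arange(1, m + 1); freqs /= freqs.sum()
for _ in range(100):
    tokens = rng.choice(m, size=64, p=freqs)
    targets = rng.choice(m, size=64, p=freqs)
    signsgd_step(tokens, targets)
```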

This theoretical finding suggests that in the LV regime, which the authors argue is more representative of modern LLMs, the optimal scaling for the embedding LR should differ from $\mu$P's $\Theta(1)$ prediction. To maintain balanced feature-learning updates across layers, the embedding LR ($\eta_E$) should be scaled relative to the hidden/projection LR ($\eta_W$) such that $\eta_E/\eta_W \approx \Theta(\sqrt{d})$. This contrasts with $\mu$P's suggested ratio of $\Theta(d)$ and with standard practice (Standard Parametrization, SP), which often uses a ratio of $\Theta(1)$.
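Collecting the three prescriptions in one place (with $d$ the model width), the embedding-to-hidden LR ratio is:

$$
\frac{\eta_E}{\eta_W} \;\approx\;
\begin{cases}
\Theta(1) & \text{Standard Parametrization (same LR for all layers)}\\
\Theta(d) & \mu\text{P prediction}\\
\Theta(\sqrt{d}) & \text{Large Vocabulary regime (this paper's } \sqrt{d}\text{-rule)}
\end{cases}
$$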

Based on this, the authors propose a Large Vocabulary Parametrization (LVP). While the theoretical analysis used a simplified model and optimizer, the authors hypothesize that the core finding regarding the embedding layer's sensitivity to vocabulary size carries over to full transformer architectures trained with Adam. LVP uses Standard Parametrization-like initialization ($\sigma_E=\sigma_W=\sigma_H=d^{-1/2}$) and $\mu$P-like LR scaling for hidden and output layers ($\eta_W = \eta_H = \eta d^{-1}$), but incorporates the $\sqrt{d}$-rule for the embedding layer LR ($\eta_E = \eta d^{-1/2}$). This yields the desired $\eta_E/\eta_W = \Theta(\sqrt{d})$ ratio.
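As a concrete reading of these rules, the following Python sketch maps a base LR $\eta$ and width $d$ to LVP-style initialization and per-layer-group LRs. The function name and dictionary keys are ours, not the paper's.

```python
def lvp_hyperparameters(eta: float, d: int) -> dict:
    """Hyperparameters under the Large Vocabulary Parametrization (LVP).

    Initialization is SP-like (std d^{-1/2} for embedding, hidden, and output
    weights); hidden/output LRs follow the muP-like d^{-1} rule, and the
    embedding LR follows the sqrt(d)-rule, eta_E = eta * d^{-1/2},
    so that eta_E / eta_W = sqrt(d).
    """
    return {
        "init_std": d ** -0.5,            # sigma_E = sigma_W = sigma_H = d^{-1/2}
        "lr_embedding": eta * d ** -0.5,  # eta_E = eta / sqrt(d)
        "lr_hidden": eta / d,             # eta_W = eta / d
        "lr_output": eta / d,             # eta_H = eta / d
    }

# Example: at d = 2048 the embedding/hidden LR ratio is sqrt(2048) ~ 45.3.
hp = lvp_hyperparameters(eta=1.0, d=2048)
print(hp["lr_embedding"] / hp["lr_hidden"])  # ~45.25
```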

The paper validates these theoretical findings with experiments:

  1. Small Model Scaling with Vocabulary: They trained a small transformer model while scaling both width $d$ and vocabulary size $m$ such that $m$ grows linearly with $d$. By sweeping the embedding LR ($\eta_E$) while fixing the hidden/projection LRs according to LVP ($\eta_H=\eta_W \propto d^{-1}$), they observed that the optimal $\eta_E$ indeed decreases sublinearly with $d$, aligning more closely with the $\sqrt{d}$ behavior predicted by their theory than with $\mu$P's constant prediction.
  2. Production-Scale 1B Model Pretraining: To assess the practical benefit, they trained a 1B-parameter dense transformer model (with $d=2048$, so $\sqrt{d} \approx 45.3$) on a large, production-scale dataset (1.75T tokens, used for Phi-3 (Munkhdalai et al., 10 Apr 2024)). A baseline model used the conventional practice of applying the same LR across all layers ($\eta_E/\eta_W \approx 1$), while their experimental model used $\eta_E/\eta_W \approx \sqrt{d}$. The model trained with the $\sqrt{d}$ ratio for the embedding LR achieved consistently lower training loss and better perplexity on the Wikitext test set than the baseline. Experiments with other ratios confirmed that $\sqrt{d}$ was near-optimal.

Practical Implementation:

  • The key takeaway for practitioners is that when pretraining LLMs with large vocabularies, the embedding layer's learning rate should likely be higher than that of the hidden and projection layers.
  • Specifically, the paper suggests setting the ratio of the embedding LR to the hidden/projection LR to approximately $\sqrt{d}$, where $d$ is the embedding dimension (model width). If using an Adam-like optimizer with global LR $\eta$ and the standard $d^{-1}$ scaling for hidden layers, the embedding LR could be set to $\eta \times d^{-1/2}$ and the hidden/output LRs to $\eta \times d^{-1}$ (a minimal optimizer setup along these lines is sketched after this list).
  • The authors' LVP parametrization suggests combining SP-like initialization variance ($d^{-1}$) with $\mu$P-like LR scaling ($d^{-1}$ for hidden/output) and the $\sqrt{d}$-rule for the embedding LR ($d^{-1/2}$). However, the empirical results for the 1B model primarily focus on the ratio of LRs, implying that adjusting the embedding LR relative to the others is the most critical aspect.
  • Experimentation with different LR ratios around $\sqrt{d}$ may still be necessary to find the absolute optimum for a specific model size, architecture, and dataset.
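For a PyTorch-style Adam optimizer, one way to wire this up is with parameter groups. The sketch below uses a toy stand-in model and illustrative optimizer settings (sizes, base LR, betas, and weight decay are assumptions); only the relative LRs follow the paper's $\sqrt{d}$ recommendation.

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size = 2048, 100_352   # illustrative sizes only

# Toy stand-in for a transformer: embedding, one hidden block, and an output head.
embedding = nn.Embedding(vocab_size, d_model)
hidden = nn.Linear(d_model, d_model)
output = nn.Linear(d_model, vocab_size, bias=False)

base_lr = 1e-3  # treat this as the hidden/output LR you would normally use

optimizer = torch.optim.AdamW(
    [
        # Embedding LR is sqrt(d) times the hidden/output LR (~45x for d=2048).
        {"params": embedding.parameters(), "lr": base_lr * math.sqrt(d_model)},
        {"params": hidden.parameters(), "lr": base_lr},
        {"params": output.parameters(), "lr": base_lr},
    ],
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
```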

Limitations and Future Work:

  • The theoretical analysis uses a simplified linear model and SignSGD, and extending it to full transformers with Adam is complex.
  • Optimal scaling rules might depend on the training step $t$, not just on $d$ and $m$.
  • The analysis highlights that optimal LR scaling is sensitive to token frequencies, suggesting that more advanced parametrizations might benefit from explicitly incorporating this information.
  • While the $\sqrt{d}$-rule improves training efficiency, it is not proven to guarantee perfect hyperparameter transfer across all scales and datasets, unlike the theoretical claim of $\mu$P (though $\mu$P itself shows limitations in practice for LLMs).

In conclusion, the paper provides both theoretical evidence and empirical validation showing that vocabulary size is a critical factor influencing the optimal embedding learning rate in LLMs. It challenges the universality of $\mu$P scaling in the large-vocabulary setting and proposes a practical heuristic, setting the embedding LR roughly $\sqrt{d}$ times higher than the hidden/projection LRs, that demonstrably improves training performance for large LLMs.

Authors (2)
  1. Soufiane Hayou
  2. Liyuan Liu