Characterize factors governing the effectiveness and stability of embedding scaling

Determine how the total parameter budget allocated to embeddings, the n-gram vocabulary size, the initialization schemes for embedding tables and projection matrices, and the trade-off between transformer width and depth jointly influence the effectiveness and training stability of embedding scaling in large language models (e.g., via N-gram Embedding).

Background

The paper positions embedding scaling as a sparsity axis orthogonal to Mixture-of-Experts (MoE), emphasizing approaches such as vocabulary expansion via N-gram Embedding and structural expansion via Per-Layer Embedding. Despite promising empirical gains, the authors note that the constraints of scaling embeddings have not yet been systematically characterized.
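To make the "n-gram vocabulary size" factor concrete, the sketch below shows one common way such a vocabulary can be kept to a fixed size: hashing each n-gram of token ids into a bounded embedding table. The function name, the mix hash, and all sizes are illustrative assumptions, not the paper's actual N-gram Embedding design.

```python
import random

def hashed_ngram_ids(tokens, n, table_size, seed=0):
    """Map each n-gram of token ids to a row index of a fixed-size embedding
    table via hashing, so the effective n-gram vocabulary is capped at
    table_size. Illustrative sketch only; real designs may differ.
    """
    rng = random.Random(seed)
    mult = rng.randrange(1, 1 << 31) | 1  # odd multiplier for the mix hash
    ids = []
    for i in range(len(tokens) - n + 1):
        h = 0
        for t in tokens[i : i + n]:
            h = (h * mult + t) & 0xFFFFFFFF  # accumulate a 32-bit rolling hash
        ids.append(h % table_size)  # fold into the table's row range
    return ids
```

Under this scheme, table_size directly controls the extra embedding parameter count (table_size x d_model), which is exactly the kind of budget knob the problem statement asks to characterize; hash collisions are the price paid for capping the table.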

Understanding how design choices—parameter budgeting, vocabulary size, initialization, and architectural width/depth—interact is critical for stable and effective training. This problem seeks a principled characterization to guide capacity allocation and configuration of embedding-heavy architectures.
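The capacity-allocation question above can be made concrete with rough parameter accounting. The sketch below assumes untied input/output embeddings and approximates a standard transformer block as ~12 * d_model^2 parameters (attention plus MLP, biases ignored); these are illustrative assumptions, not figures from the paper.

```python
def embedding_param_split(vocab_size, ngram_vocab_size, d_model, n_layers):
    """Rough split of a model's parameter budget between embedding tables and
    the transformer body, to show how vocabulary size and width/depth trade
    off. All modeling assumptions are illustrative.
    """
    token_emb = vocab_size * d_model        # input embedding table
    ngram_emb = ngram_vocab_size * d_model  # extra n-gram embedding table
    unembed = vocab_size * d_model          # output projection (untied)
    body = n_layers * 12 * d_model ** 2     # approx. transformer block params
    embedding = token_emb + ngram_emb + unembed
    total = embedding + body
    return {
        "embedding": embedding,
        "body": body,
        "embedding_fraction": embedding / total,
    }
```

For example, with a 32k token vocabulary, a 1M-entry n-gram table, d_model = 1024, and 24 layers, roughly 78% of the parameters sit in embeddings, illustrating why budgeting and width/depth trade-offs dominate the design space for embedding-heavy architectures.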

References

Second, the constraints of scaling embeddings are still not systematically characterized: it remains unclear how factors such as the total parameter budget, vocabulary size, initialization schemes, and the trade-offs between model width and depth jointly influence the effectiveness and stability of embedding scaling.

Scaling Embeddings Outperforms Scaling Experts in Language Models  (2601.21204 - Liu et al., 29 Jan 2026) in Section 1 Introduction