Scaling behavior of token-level dynamic grouping to billion-parameter models
Determine how token-level dynamic grouping, implemented by defining patches via BPE token character sequences with explicit end-of-patch markers followed by a second-stage hierarchical BPE compression, scales to language models with billions of parameters. In particular, characterize its effects on training dynamics, memory requirements, and optimization challenges at that scale.
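To make the mechanism concrete, the following is a minimal sketch (not the authors' code) of how character-level patches could be aligned with BPE token boundaries using an explicit end-of-patch marker, with a toy second-stage pass as a stand-in for hierarchical BPE compression. The marker symbol, merge table, and function names are illustrative assumptions.

```python
# Sketch only: patch construction from BPE tokens with an end-of-patch marker,
# plus a toy second-stage compression pass. All names are hypothetical.
from typing import Iterable

EOP = "<eop>"  # hypothetical end-of-patch marker closing each BPE token's characters


def tokens_to_char_patches(bpe_tokens: Iterable[str]) -> list[list[str]]:
    """Expand each BPE token into its character sequence and close it with EOP,
    so every patch boundary coincides with a BPE token boundary."""
    return [list(tok) + [EOP] for tok in bpe_tokens]


def second_stage_compress(patches: list[list[str]],
                          merges: dict[tuple[str, str], str]) -> list[list[str]]:
    """Toy second-stage compression: repeatedly apply a fixed set of pair merges
    inside each patch (a stand-in for a hierarchical BPE pass over patches)."""
    compressed = []
    for patch in patches:
        seq = patch[:]
        changed = True
        while changed:
            changed = False
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) in merges:
                    out.append(merges[(seq[i], seq[i + 1])])
                    i += 2
                    changed = True
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        compressed.append(seq)
    return compressed


if __name__ == "__main__":
    # Pretend BPE output for the string "scaling laws"
    bpe_tokens = ["scal", "ing", " laws"]
    patches = tokens_to_char_patches(bpe_tokens)
    merges = {("s", "c"): "sc", ("sc", "a"): "sca",
              ("i", "n"): "in", ("in", "g"): "ing"}
    print(second_stage_compress(patches, merges))
```

The open question is how this per-patch bookkeeping (extra marker positions, a second compression stage) interacts with the training dynamics, memory footprint, and optimization behavior of billion-parameter models, none of which are evaluated in the paper.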
References
First, we do not evaluate our approach in large-scale regimes involving models with billions of parameters. It remains an open question how well token-level dynamic grouping scales in such settings, where training dynamics, memory constraints, and optimization challenges may differ substantially.
— From Characters to Tokens: Dynamic Grouping with Hierarchical BPE
(arXiv:2510.15517, Dolga et al., 17 Oct 2025), Limitations section