Over-Tokenized Transformers Framework
- Over-Tokenized Transformers are a design framework that expands token usage to enhance representation, scalability, and performance across various domains.
- The framework employs multi-scale tokenization spanning input, intermediate, and parameter tokens, improving efficiency and reducing training loss through attention- and hashing-based techniques.
- Its modular design supports applications in language, vision, and graphs, enabling cost-effective training and dynamic model scaling with improved task outcomes.
The Over-Tokenized Transformers Framework designates a family of architectural approaches and theoretical insights in which the role, number, and structure of transformer tokens—whether representing input data, intermediate feature distributions, or even model parameters—are deliberately amplified and diversified relative to classical transformer practice. Over-tokenization is employed to improve information representation, scalability, training efficiency, and task performance across domains such as language modeling, vision, and graph learning.
1. Conceptual Foundations and Taxonomy
The central tenet of the Over-Tokenized Transformers Framework is to decouple, diversify, or intentionally expand the set of tokens participating in a transformer’s computation, surpassing earlier conventions where token count and structure closely mirror raw input granularity or sequence length. This generalized expansion can take multiple forms:
- Input over-tokenization: Increasing input vocabulary (e.g., via multi-gram or hierarchical token schemes) so each input segment is represented by multi-scale or overlapping tokens (Huang et al., 28 Jan 2025).
- Intermediate over-tokenization: Decomposing latent representations (e.g., node neighborhoods in graphs (Chen et al., 2022), or image regions (Qian et al., 2022)) into structured token sequences encoding richer facets.
- Parameter tokenization: Representing fixed linear weights as mutable sets of tokens engaged in parameter-token attention, allowing architectures to scale flexibly (Wang et al., 30 Oct 2024, Zhang et al., 2 Aug 2025).
- Domain decoupling: Distinct tokenization strategies for encoder and decoder steps, enabling richer input spaces without ballooning output vocabularies (Huang et al., 28 Jan 2025).
This framework has been formalized and instantiated in language (e.g., scaling LLM input vocabularies with multi-grams (Huang et al., 28 Jan 2025), hash-based embedding for unlimited vocabulary (Xue et al., 2022)), graph (e.g., hop-wise node sequence encoding (Chen et al., 2022), composite token views (Chen et al., 27 Jun 2024), RVQ-quantized node tokens (Wang et al., 17 Oct 2024)), and vision (e.g., patch/token modulation, expansion, and fusion methods (Qian et al., 2022, Kim et al., 2023, Huang et al., 31 Mar 2024, Zeng et al., 6 Jun 2025)) domains.
2. Input Over-Tokenization and Vocabulary Scaling
A primary instantiation is the decoupling and upscaling of the input vocabulary relative to the output, as seen in the Over-Tokenized Transformer (OT) model (Huang et al., 28 Jan 2025). Here, the input sequence is over-encoded using $n$-gram tokens drawn from a conceptual vocabulary of size $V^n$, which is folded into embedding tables of size $m$, where $m$ is typically equal to the base vocabulary size $V$. Embedding tables are parameterized using modulo-based hashing and tiling (e.g., mapping an $n$-gram index $i$ to table row $i \bmod m$), enabling the use of extremely large input vocabularies without linear parameter growth.
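A minimal sketch of the tiling-and-hashing idea in PyTorch; the $n$-gram construction, tile count, and offset-based hashing below are illustrative choices, not the exact OT parameterization:

```python
import torch
import torch.nn as nn

class OverEncodedEmbedding(nn.Module):
    """Toy n-gram over-encoding: a huge conceptual n-gram vocabulary is
    folded into a few small embedding tables via modulo hashing, and the
    tile outputs are summed to form the input representation."""

    def __init__(self, base_vocab: int, d_model: int, n: int = 2, num_tiles: int = 4):
        super().__init__()
        self.base_vocab = base_vocab
        self.n = n
        # Each tile has only base_vocab rows, so parameters grow with the
        # number of tiles, not with the (base_vocab ** n) conceptual vocabulary.
        self.tiles = nn.ModuleList(
            nn.Embedding(base_vocab, d_model) for _ in range(num_tiles)
        )

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq_len) token ids in [0, base_vocab)
        # Build conceptual n-gram indices by mixing each token with its predecessors.
        ngram = ids.clone()
        for k in range(1, self.n):
            prev = torch.roll(ids, shifts=k, dims=1)
            prev[:, :k] = 0                          # pad positions before sequence start
            ngram = ngram * self.base_vocab + prev   # n-gram id (may be astronomically large)
        out = 0
        for t, tile in enumerate(self.tiles):
            # A per-tile offset changes the collision pattern of the modulo hash.
            out = out + tile((ngram + t) % self.base_vocab)
        return out

emb = OverEncodedEmbedding(base_vocab=32000, d_model=64)
x = emb(torch.randint(0, 32000, (2, 16)))            # -> (2, 16, 64)
```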
Empirically, scaling the input vocabulary produces a log-linear relationship with language modeling loss, $\mathcal{L}(m) \approx \mathcal{L}_0 - \alpha \log m$, where $m$ is the effective vocabulary scaling parameter. Thus, increasing vocabulary size along this scalable "sparse dimension" yields near-linear improvements in training loss, allowing smaller models (in parameter count) to achieve parity with much larger baselines for the same training budget. Importantly, this extension is most beneficial for input encoding and can marginally benefit or even harm performance when applied to output decoding, especially in small models.
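Concretely, the log-linear trend implies that each fixed multiplicative increase in the effective input vocabulary buys a roughly fixed loss reduction; with a fitted coefficient $\alpha$ (whose value is not reproduced here),

$$\mathcal{L}(m) \approx \mathcal{L}_0 - \alpha \log m \quad\Longrightarrow\quad \mathcal{L}(64m) - \mathcal{L}(m) \approx -\alpha \log 64 \approx -4.16\,\alpha .$$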
3. Token-Parameter Attention: Model Parameter Tokenization
Parameter tokenization replaces fixed weight matrices with sets of learnable tokens, allowing input tokens to interact with parameters through a cross-attention mechanism rather than static projections (Wang et al., 30 Oct 2024, Zhang et al., 2 Aug 2025). For example, in TokenFormer (Wang et al., 30 Oct 2024):

$$\mathrm{Pattention}(X, K_P, V_P) = \Theta\!\left(X K_P^{\top}\right) V_P,$$

where $X$ are input tokens, $K_P$ and $V_P$ are parameter tokens (keys and values), and $\Theta$ is typically an L2-normalized GeLU or related function. All projections (query, key, value, and output) are performed through such token-parameter cross-attention, decoupling network width from parameter growth. Model scaling is achieved by appending new parameter tokens, possibly initialized to zero, and does not require full retraining. This architecture supports efficient progressive expansion while preserving existing capabilities, as demonstrated by maintaining baseline perplexities when scaling from 124M to 1.4B parameters at a fraction of the training cost.
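A minimal sketch of token-parameter cross-attention in the spirit of TokenFormer, assuming a GeLU-based, L2-normalized score function in place of softmax; the dimensions, initialization, and the grow() helper are illustrative rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pattention(nn.Module):
    """Token-parameter attention: inputs attend over learnable key/value
    parameter tokens instead of being multiplied by a fixed weight matrix."""

    def __init__(self, d_in: int, d_out: int, num_param_tokens: int):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_param_tokens, d_in) * 0.02)
        self.values = nn.Parameter(torch.randn(num_param_tokens, d_out) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); scores: (batch, seq, num_param_tokens)
        scores = x @ self.keys.t()
        # L2-normalized GeLU instead of softmax: zero-scoring (new) parameter
        # tokens contribute nothing and do not rescale existing weights.
        scores = F.normalize(F.gelu(scores), p=2, dim=-1)
        return scores @ self.values

    @torch.no_grad()
    def grow(self, extra: int):
        """Scale the layer by appending zero-initialized parameter tokens,
        leaving the function computed by the existing tokens unchanged."""
        d_in, d_out = self.keys.shape[1], self.values.shape[1]
        self.keys = nn.Parameter(torch.cat([self.keys, torch.zeros(extra, d_in)]))
        self.values = nn.Parameter(torch.cat([self.values, torch.zeros(extra, d_out)]))

layer = Pattention(d_in=64, d_out=64, num_param_tokens=256)
y = layer(torch.randn(2, 10, 64))    # -> (2, 10, 64)
layer.grow(128)                       # expand capacity without retraining from scratch
y2 = layer(torch.randn(2, 10, 64))    # same interface, larger parameter token pool
```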
The T2S framework (Zhang et al., 2 Aug 2025) extends this to lifelong imitation learning, where new skills are added by growing the parameter token pool; language-guided selection mechanisms enable knowledge transfer and prevent catastrophic forgetting.
4. Structured and Multi-View Tokenization in Non-Text Domains
In vision and graph neural architectures, over-tokenization manifests as multi-scale or multi-view representations:
- Vision: Token Expansion (Huang et al., 31 Mar 2024) introduces an initialization–expansion–merging pipeline, starting with seed tokens and adding tokens that maximize feature distribution diversity, then merging redundant representations. Token Transforming (Zeng et al., 6 Jun 2025) generalizes token pruning/merging as a many-to-many matrix transformation, $\tilde{X} = A X$, where the transformation matrix $A$ is neither block-diagonal nor exclusive, preserving more information and enabling training-free token compression (a minimal sketch appears at the end of this section).
- Graph learning: In NAGphormer (Chen et al., 2022), each node is represented by a sequence of Hop2Token-generated hop-wise aggregations over $K$ hops, yielding a length-$(K{+}1)$ input sequence per node (as opposed to a single node-per-token pattern). NTFormer (Chen et al., 27 Jun 2024) generates composite token sequences per node (via neighborhood aggregation and similarity-based sampling in topology and attribute views), further improving representation completeness. GQT (Wang et al., 17 Oct 2024) uses a GNN encoder and hierarchical residual vector quantization for compact, task-independent graph tokens, enhancing memory efficiency while supporting state-of-the-art classification.
In all these cases, over-tokenization yields better expressivity and empirical accuracy, even compared to methods that build in graph- or vision-specific inductive biases.
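A minimal sketch of the many-to-many token transformation idea referenced in the vision bullet above, assuming a toy similarity-based assignment; the choice of target tokens and the weighting rule are placeholders for whatever a specific method prescribes:

```python
import torch

def transform_tokens(x: torch.Tensor, assign: torch.Tensor) -> torch.Tensor:
    """Many-to-many token transformation: x (B, N, D), assign (B, M, N).
    Rows of `assign` are normalized so each output token is a convex
    combination of (potentially all) input tokens, generalizing hard
    pruning (0/1 rows) and exclusive merging (block-diagonal rows)."""
    weights = assign / assign.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    return weights @ x                                    # (B, M, D)

# Toy usage: soft assignment from similarity to M selected "center" tokens.
B, N, M, D = 2, 196, 98, 64
x = torch.randn(B, N, D)
centers = x[:, torch.randperm(N)[:M], :]                  # stand-in for a real selection rule
assign = torch.softmax(centers @ x.transpose(1, 2) / D**0.5, dim=-1)   # (B, M, N)
compressed = transform_tokens(x, assign)                  # (B, 98, 64), training-free reduction
```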
5. Memory and Computation Efficiency
Over-tokenization, when coupled with sparse, modular, or hashed representations, can be highly cost-effective:
- Compression and Hashing: HashFormers (Xue et al., 2022) use message-digest and locality-sensitive hashes to map unbounded, fine-grained vocabularies into small, fixed embedding tables (down to ~0.3% of the parameter count of BERT), with minimal degradation (<0.5% on GLUE); a minimal sketch follows this list.
- Efficient Training and Inference: Token Expansion (Huang et al., 31 Mar 2024) achieves 1.3× faster training in ViTs by controlling token growth, with maintained or improved accuracy. Token Transforming (Zeng et al., 6 Jun 2025) reduces FLOPs by ~40% and supports 1.5× inference speedups with negligible (∼0.1%) accuracy drop, generalizing to object detection, segmentation, and depth tasks via token recovery modules.
- Scalability: In graph transformers, per-node over-tokenization enables mini-batch training that scales to graphs with millions of nodes (Chen et al., 2022, Chen et al., 27 Jun 2024). NAGphormer's attention complexity reduces from $O(n^2)$ over all $n$ nodes to $O\!\big(n\,(K{+}1)^2\big)$ by representing each node as a short sequence of $K{+}1$ hop tokens.
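A minimal sketch of hash-bucketed embeddings in the spirit of HashFormers, using Python's built-in MD5 as a stand-in for the paper's hashing choices; the bucket count is illustrative:

```python
import hashlib
import torch
import torch.nn as nn

class HashedEmbedding(nn.Module):
    """Map an unbounded string vocabulary onto a small fixed embedding table
    by hashing each token into one of `num_buckets` rows; the parameter count
    is independent of how many distinct tokens the corpus contains."""

    def __init__(self, num_buckets: int, d_model: int):
        super().__init__()
        self.num_buckets = num_buckets
        self.table = nn.Embedding(num_buckets, d_model)

    def bucket(self, token: str) -> int:
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_buckets

    def forward(self, tokens: list[str]) -> torch.Tensor:
        ids = torch.tensor([self.bucket(t) for t in tokens])
        return self.table(ids)                     # (len(tokens), d_model)

emb = HashedEmbedding(num_buckets=50_000, d_model=64)
vecs = emb(["over", "##token", "##ized", "transformers"])   # arbitrary subwords or words
```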
6. Mathematical and Theoretical Underpinnings
Several lines of theoretical justification underlie the framework:
- Expressiveness: Treating fine-grained graph structure as tokens (e.g., nodes and edges) and augmenting tokens with node identifiers and type embeddings, a pure transformer becomes as expressive as second-order invariant graph networks (2-IGN), strictly exceeding message-passing GNNs (Kim et al., 2022).
- Readout Adaptivity: In NAGphormer, the attention-based readout
$$Z_v = H_v^{(0)} + \sum_{k=1}^{K} \alpha_{v,k}\, H_v^{(k)}, \qquad \alpha_{v,k} = \frac{\exp\!\big(\big[H_v^{(0)} \,\Vert\, H_v^{(k)}\big] W_a\big)}{\sum_{j=1}^{K} \exp\!\big(\big[H_v^{(0)} \,\Vert\, H_v^{(j)}\big] W_a\big)}$$
gives a node-specific, learnable fusion across hop-wise tokens; in contrast, GCN-like models apply fixed weights (a simplified sketch follows this list).
- Optimization and Conditioning: Modifying the initial token matrix $X$ to reduce its condition number $\kappa(X) = \sigma_{\max}(X)/\sigma_{\min}(X)$, as in (Saratchandran et al., 19 May 2025), provably reduces the ill-conditioning of the attention operation and results in more stable gradients and improved convergence: attention operates on a corrected matrix $\tilde{X} = X + \Delta$, where $\Delta$ is a correction term constructed to bound the original token matrix's conditioning.
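A simplified sketch of hop-wise tokenization with an attention-based readout in the style of NAGphormer; the propagation operator, scoring function, and shapes are stand-ins rather than the paper's exact formulation:

```python
import torch
import torch.nn as nn

def hop2token(feat: torch.Tensor, adj_norm: torch.Tensor, K: int) -> torch.Tensor:
    """Build a (K+1)-token sequence per node: token k holds the k-hop
    propagated features (a simplified stand-in for Hop2Token)."""
    hops, h = [feat], feat
    for _ in range(K):
        h = adj_norm @ h                               # one more hop of propagation
        hops.append(h)
    return torch.stack(hops, dim=1)                    # (num_nodes, K+1, d)

class AttentionReadout(nn.Module):
    """Node-specific fusion of hop tokens: each hop token is weighted by a
    learned attention score against the hop-0 (ego) token."""

    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Linear(2 * d, 1)

    def forward(self, hop_tokens: torch.Tensor) -> torch.Tensor:
        ego = hop_tokens[:, :1, :]                                           # (N, 1, d)
        rest = hop_tokens[:, 1:, :]                                          # (N, K, d)
        pairs = torch.cat([ego.expand_as(rest), rest], dim=-1)               # (N, K, 2d)
        alpha = torch.softmax(self.score(pairs), dim=1)                      # (N, K, 1)
        return ego.squeeze(1) + (alpha * rest).sum(dim=1)                    # (N, d)

N, d, K = 100, 32, 3
adj = torch.rand(N, N)
adj_norm = adj / adj.sum(dim=-1, keepdim=True)                               # toy row-normalization
z = AttentionReadout(d)(hop2token(torch.randn(N, d), adj_norm, K))           # (N, 32)
```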
Additionally, the Token Space (Pan, 11 Apr 2024) applies category theory, elevating tokens to objects in a categorical structure with products, coproducts, and topos-theoretic constructions. This formalism supports theoretically grounded operations (merging, flattening, scattering) for dynamic token manipulation, especially relevant in over-tokenized (or over-fragmented) settings.
7. Implications, Limitations, and Future Directions
The Over-Tokenized Transformers Framework provides an extensible blueprint for model scaling and representation:
- Tokenizer Design: Deliberate expansion of input representation—through multi-gram, multi-view, or quantized tokenization—can yield linear improvements in modeling loss for fixed parameter counts (Huang et al., 28 Jan 2025).
- Efficient Scaling and Knowledge Transfer: Parameter tokenization enables dynamic, sublinear growth in lifelong and multi-task settings (Wang et al., 30 Oct 2024, Zhang et al., 2 Aug 2025). Language-guided selection of parameter tokens fosters efficient skill transfer while preventing catastrophic forgetting.
- Domain Generality: Over-tokenization benefits language, graph, and vision tasks; modular tokenizer pipelines (separable tokenization, compression, and recovery units) can be flexibly combined for further optimization.
- Memory/Latency Budgeting: Compression via hashing, tiling, or quantization allows over-tokenization at modest resource cost, and dynamic expansion/merging can be orchestrated per budget or device limitation.
- Theoretical Integration: Category-theoretic tools and formal scaling laws provide a deeper mathematical foundation for future token manipulation strategies and their integration into efficient transformers.
Limitations include modest increases in memory or lookup cost for extreme vocabulary scaling, the necessity for careful output-side design to prevent performance drops in smaller models (Huang et al., 28 Jan 2025), and potential configurational overhead in composite or hybrid tokenization (multi-view, per-hop, adaptive merging) settings. Further research directions include advanced decoding supervision, ultra-large embedding decompositions, cross-modal over-tokenization, and mathematically principled fusion/recovery mechanisms to close any remaining gaps in performance, interpretability, or hardware compatibility.
The Over-Tokenized Transformers Framework recasts tokenization as a central axis of transformer design, underpinned by empirical scaling laws, attention-based dynamism, and formal theoretical insights, with demonstrated benefits in model scaling, efficiency, transfer, and task performance across application domains (Chen et al., 2022, Kim et al., 2022, Xue et al., 2022, Qian et al., 2022, Kim et al., 2023, Huang et al., 31 Mar 2024, Pan, 11 Apr 2024, Chen et al., 27 Jun 2024, Phan et al., 11 Oct 2024, Wang et al., 17 Oct 2024, Wang et al., 30 Oct 2024, Wu et al., 23 Dec 2024, Huang et al., 28 Jan 2025, Saratchandran et al., 19 May 2025, Zeng et al., 6 Jun 2025, Zhang et al., 2 Aug 2025).