
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (2410.23168v2)

Published 30 Oct 2024 in cs.LG

Abstract: Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce TokenFormer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at https://github.com/Haiyang-W/TokenFormer.


Summary

  • The paper presents Tokenformer, a novel framework that treats model parameters as tokens to enable incremental model scaling.
  • It replaces traditional linear projections with a token-parameter attention mechanism using GeLU activation and L2-normalization to ensure training stability.
  • Empirical results show that Tokenformer achieves performance parity with models trained from scratch while significantly reducing training costs.

Overview of Tokenformer: Rethinking Transformer Scaling with Tokenized Model Parameters

In this research, the authors address the significant computational costs associated with scaling Transformer models, a limitation that arises from the need to retrain the entire model whenever architectural modifications are introduced. As Transformers become increasingly ubiquitous across various domains such as NLP, visual modeling, and more, their inflexible scaling mechanism poses a noteworthy challenge. This paper introduces Tokenformer, a novel framework designed to enhance the scalability and flexibility of Transformers without necessitating a complete model retraining.

Key Contributions

Tokenformer fundamentally reimagines the traditional Transformer architecture by treating model parameters as tokens, thereby integrating a token-parameter attention layer into the framework. Specifically, the model parameter tokens act as keys and values within the attention mechanism, with input tokens serving as queries. This reformulation allows for progressively scalable models ranging from 124M to 1.4B parameters by merely adding new key-value pairs. The empirical results demonstrate that Tokenformer matches the performance of Transformers trained from scratch while significantly reducing training costs.
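
The sketch below illustrates this reformulation: a single token-parameter attention layer in which the input tokens act as queries and a set of learnable parameter tokens act as keys and values. This is a minimal sketch, not the authors' code; the class and argument names are hypothetical, and the softmax is only a placeholder for the modified normalization discussed under Technical Innovations. The reference implementation is at https://github.com/Haiyang-W/TokenFormer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenParameterAttention(nn.Module):
    """Input tokens are queries; learnable parameter tokens are keys/values."""

    def __init__(self, dim_in: int, dim_out: int, num_param_tokens: int):
        super().__init__()
        # Each row is one learnable "parameter token".
        self.key_params = nn.Parameter(0.02 * torch.randn(num_param_tokens, dim_in))
        self.value_params = nn.Parameter(0.02 * torch.randn(num_param_tokens, dim_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim_in)
        scores = x @ self.key_params.t()        # (batch, seq_len, num_param_tokens)
        # Softmax is used here only as a placeholder; the paper substitutes a
        # GeLU-based normalization (sketched further below).
        weights = F.softmax(scores, dim=-1)
        return weights @ self.value_params      # (batch, seq_len, dim_out)
```

Used in place of a linear projection, such a layer can grow along the num_param_tokens axis without changing the input or output dimensionality, which is what makes incremental scaling possible.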

Technical Innovations

Tokenformer departs from the traditional architecture by replacing all linear projections with a novel token-parameter attention mechanism, which provides the flexibility to scale model parameters incrementally. By representing model parameters as tokens, Tokenformer uses attention for both token-token and token-parameter interactions. This unification of computations removes the rigid dependency on fixed-size parameter matrices and allows the model to be scaled while preserving pre-trained weights: newly added parameters are zero-initialized so that the expanded model initially reproduces the behavior of the original.
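
As a rough illustration of this scaling step, assuming the TokenParameterAttention sketch above (the function name here is hypothetical, not the authors' API), new key-value parameter tokens can be appended with zero initialization:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def grow_parameter_tokens(layer: "TokenParameterAttention", num_new: int) -> None:
    """Append zero-initialized key/value parameter tokens to an existing layer."""
    dim_in = layer.key_params.shape[1]
    dim_out = layer.value_params.shape[1]
    # Zero initialization: under the paper's GeLU-based scoring (see below),
    # zero keys produce zero attention weights and leave the existing rows'
    # weights unchanged, so the expanded layer initially preserves the
    # pre-trained model's outputs.
    new_keys = torch.zeros(num_new, dim_in)
    new_values = torch.zeros(num_new, dim_out)
    layer.key_params = nn.Parameter(torch.cat([layer.key_params.data, new_keys], dim=0))
    layer.value_params = nn.Parameter(torch.cat([layer.value_params.data, new_values], dim=0))
```

Because only the number of parameter tokens grows, the rest of the network and the already-trained rows can continue training from their current state rather than from scratch.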

Additionally, the token-parameter attention replaces the standard softmax with a GeLU activation applied to L2-normalized scores, mitigating the small-gradient issues that can cause training instability and yielding a more stable and effective learning process.
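
A minimal sketch of this modified scoring function, based on the description above: raw query-key scores are L2-normalized and passed through GeLU rather than softmax. The scale argument is an assumption here (in the paper it depends on the number of parameter tokens); consult the paper and repository for the exact formulation.

```python
import torch
import torch.nn.functional as F

def gelu_l2_scores(scores: torch.Tensor, scale: float) -> torch.Tensor:
    """GeLU over L2-normalized scores, used in place of softmax normalization."""
    # scores: (..., num_param_tokens) raw query-key similarities.
    normed = F.normalize(scores, p=2, dim=-1)   # L2-normalize each score row
    return F.gelu(scale * normed)               # GeLU nonlinearity instead of softmax
```

In the earlier TokenParameterAttention sketch, this function would replace the softmax placeholder in forward.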

Implications and Future Work

Practically, Tokenformer presents a cost-effective method for the continual expansion of large-scale models crucial for real-world applications where data volume is continuously increasing. By maintaining performance parity with models trained from scratch, Tokenformer offers an economically viable alternative which could be particularly beneficial for resource-constrained environments.

On a theoretical level, the paper hints at a broader applicability of the tokenization of model parameters, potentially influencing other neural network architectures. By decoupling the scaling of compute-intensive token-token interactions from the parameter size, Tokenformer could enhance models requiring long context processing, such as those employing Chain-of-Thought reasoning.

Future research directions may involve extending Tokenformer to Mixture-of-Experts architectures or exploring its application in parameter-efficient fine-tuning strategies. Further work could apply Tokenformer to multimodal tasks, fostering seamless integration across domains, or use it as a basis for device-cloud collaboration strategies in on-device language modeling.

Conclusion

Tokenformer introduces an innovative approach to Transformer scalability by treating model parameters as tokens and employing an entirely attention-based mechanism for their interactions. This research presents concrete solutions to the challenges of model scaling in ever-growing Transformer architectures, potentially setting a new course for efficient neural network design and deployment across diverse computational fields.
