Exploring the Benefit of Activation Sparsity in Pre-training (2410.03440v1)

Published 4 Oct 2024 in cs.CL and cs.AI

Abstract: Pre-trained Transformers inherently possess the characteristic of sparse activation, where only a small fraction of the neurons are activated for each token. While sparse activation has been explored through post-training methods, its potential in pre-training remains untapped. In this work, we first study how activation properties change during pre-training. Our examination reveals that Transformers exhibit sparse activation throughout the majority of the pre-training process while the activation correlation keeps evolving as training progresses. Leveraging this observation, we propose Switchable Sparse-Dense Learning (SSD). SSD adaptively switches between the Mixtures-of-Experts (MoE) based sparse training and the conventional dense training during the pre-training process, leveraging the efficiency of sparse training and avoiding the static activation correlation of sparse training. Compared to dense training, SSD achieves comparable performance with identical model size and reduces pre-training costs. Moreover, the models trained with SSD can be directly used as MoE models for sparse inference and achieve the same performance as dense models with up to $2\times$ faster inference speed. Codes are available at https://github.com/thunlp/moefication.

The Implications of Activation Sparsity in Transformer Pre-training

The paper "Exploring the Benefit of Activation Sparsity in Pre-training" presents a nuanced exploration of activation sparsity within the pre-training phase of Transformer architectures. It investigates the dynamic nature of activated neurons during Transformer pre-training and leverages this insight to propose a novel training methodology, termed Switchable Sparse-Dense Learning (SSD). The methodology's core innovation lies in dynamically transitioning between sparse and dense training, utilizing observed stabilization in activation patterns over the course of pre-training.

Key Contributions

  1. Dynamic Activation Sparsity Analysis: The research demonstrates that sparse activation is prevalent throughout the pre-training of various Transformer models (GPT, BERT, T5) and that the sparsity ratio stabilizes quickly after training begins. Despite this rapid stabilization, the activation correlations among neurons keep evolving, prompting the need for adaptive training strategies that accommodate these changing patterns.
  2. Switchable Sparse-Dense Learning (SSD): The proposed SSD framework combines dense and sparse training to improve computational efficiency without degrading model performance. During pre-training, SSD toggles between Mixture-of-Experts (MoE) based sparse training, which optimizes parameters efficiently, and dense training when the activation patterns need to evolve; a minimal sketch of such a switching criterion follows this list. This approach mitigates the risk of representation collapse observed in previous purely sparse models.
  3. Empirical Results: Compared to traditional dense training, SSD significantly reduces pre-training costs, cutting floating point operations (FLOPs) by up to 1.44x, while maintaining model performance across key metrics. Furthermore, models pre-trained with SSD can be used directly as MoE models for sparse inference and achieve up to 2x faster inference than dense models of the same size, offering compelling evidence of its efficiency without compromising functional capacity.
  4. Evaluation Across Architectures and Tasks: SSD's effectiveness was tested across various architectures (GPT, BERT, T5) and downstream tasks, including natural language understanding and instruction tuning. The consistent performance outcomes across diverse tasks underscore SSD's robustness and adaptability to different model architectures and application scenarios.

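The switching decision at the heart of SSD can be illustrated with a small, self-contained sketch: track how much the neuron co-activation correlation has drifted since the sparse (MoE) mode was last constructed, and fall back to dense training once the drift grows too large. The drift measure, the threshold, and the helper names below are assumptions for illustration, not the paper's exact rule.

```python
# Illustrative sketch of a sparse-to-dense switching criterion in the spirit of SSD.
# The drift metric and threshold are assumptions, not the authors' exact schedule.
import torch

def activation_correlation(hidden):
    """Neuron-neuron correlation of post-ReLU activations; returns (d_ff, d_ff)."""
    flat = hidden.reshape(-1, hidden.shape[-1])          # (num_tokens, d_ff)
    return torch.corrcoef(flat.T)

def should_switch_to_dense(current_hidden, snapshot_corr, threshold=0.1):
    """Fall back to dense training when correlations drift too far from the
    snapshot that was used to build the current MoE experts."""
    drift = (activation_correlation(current_hidden) - snapshot_corr).abs().mean()
    return drift.item() > threshold

# Toy usage on random activations standing in for profiled FFN activations.
d_ff = 64
snapshot = activation_correlation(torch.relu(torch.randn(32, 16, d_ff)))
later = torch.relu(torch.randn(32, 16, d_ff))
print("switch back to dense:", should_switch_to_dense(later, snapshot))
```
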
Implications for Future Research

The insights and approaches delineated in this research open several pathways for advancing Transformer models:

  • Scalable Sparse Training Techniques: The SSD framework exemplifies how adaptive training approaches can harness sparsity to achieve greater computational efficiency. This work encourages exploration into additional methods for harnessing activation sparsity, potentially incorporating real-time monitoring of activation patterns for even more responsive switching mechanisms.
  • Integration with Existing Acceleration Methods: The modular nature of SSD implies potential for integration with other model acceleration techniques, such as parameter-efficient training frameworks and architectural modifications, to further enhance training efficiency and scalability.
  • Applicability to Current Large Models: While the primary focus is on ReLU-based FFNs, the authors briefly mention the applicability of SSD to Gated Linear Units (GLUs), which are typical in current large-scale models like LLaMA. This discussion prompts further research into adapting neuron clustering and sparse-mode identification to such gated, non-ReLU architectures; a rough sketch of expert construction via co-activation clustering follows this list.
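
As a rough illustration of the clustering step mentioned above, the sketch below groups an FFN's intermediate neurons into experts by clustering their co-activation correlations, loosely in the spirit of MoEfication. The feature choice, expert count, and plain k-means are illustrative assumptions rather than the paper's exact recipe; for GLU-based FFNs, the gated activations gate(x) * up(x) would be profiled instead of post-ReLU outputs.

```python
# Hypothetical sketch: grouping FFN neurons into experts by co-activation clustering.
# Random activations stand in for profiled data; the recipe is illustrative only.
import torch
from sklearn.cluster import KMeans

d_ff, n_experts = 1024, 16
activations = torch.relu(torch.randn(4096, d_ff))   # (tokens, d_ff) profiled activations
corr = torch.corrcoef(activations.T)                 # (d_ff, d_ff) co-activation correlations

# Neurons whose activation patterns correlate are assigned to the same expert.
labels = KMeans(n_clusters=n_experts, n_init=10).fit_predict(corr.numpy())
labels = torch.from_numpy(labels)
experts = [torch.where(labels == e)[0] for e in range(n_experts)]
print([len(idx) for idx in experts])                 # neuron count per expert
```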

Conclusion

The paper provides a thorough dissection of activation sparsity phenomena in Transformer pre-training, introducing a pragmatic training framework that effectively toggles between dense and sparse execution modes. Its implications are far-reaching, promising substantial reductions in computational costs while preserving the functional efficacy of large pre-trained models. As AI models continue to scale, methodologies such as SSD that capitalize on inherent model efficiencies will become essential in managing the intensifying demands for computational resources in model development and deployment.

Authors (10)
  1. Zhengyan Zhang (46 papers)
  2. Chaojun Xiao (39 papers)
  3. Qiujieli Qin (1 paper)
  4. Yankai Lin (125 papers)
  5. Zhiyuan Zeng (23 papers)
  6. Xu Han (270 papers)
  7. Zhiyuan Liu (433 papers)
  8. Ruobing Xie (97 papers)
  9. Maosong Sun (337 papers)
  10. Jie Zhou (687 papers)
Citations (3)