- The paper demonstrates that large transformers consistently exhibit high activation sparsity across layers and data types.
- It uses extensive empirical analysis and hypothesis testing to show that data properties alone do not explain sparsity, and it suggests that training dynamics are the likely cause.
- Enforced sparsity reduces computation and improves model robustness to noise and input corruptions.
Introduction
Machine learning models, particularly those based on transformer architectures, have recently been shown to exhibit a phenomenon termed "activation sparsity": the activation maps, which are the outputs of the model's intermediate layers, contain a high proportion of zero entries. Sparse neural activity has been widely studied in biological brains but has not been examined as thoroughly in artificial deep neural networks (DNNs).
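To make the definition concrete, the sketch below measures activation sparsity as the fraction of exactly-zero entries in an activation map. It is only an illustration under stated assumptions: the activation map is taken to be the output of a ReLU applied to a single randomly initialized linear layer, and the sizes `d_model`, `d_ff`, and `n_tokens`, as well as the helper name `activation_sparsity`, are hypothetical and not taken from the paper.

```python
import numpy as np


def relu(x):
    """Element-wise ReLU: negative entries become exactly zero."""
    return np.maximum(x, 0.0)


def activation_sparsity(activations):
    """Return the fraction of entries in the activation map that are exactly zero."""
    return float(np.mean(activations == 0.0))


# Hypothetical sizes chosen only for illustration.
d_model, d_ff, n_tokens = 16, 64, 8

rng = np.random.default_rng(0)
x = rng.standard_normal((n_tokens, d_model))                   # stand-in token representations
w1 = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)   # untrained first-layer weights

hidden = relu(x @ w1)  # the activation map whose zeros are counted
print(f"fraction of zero activations: {activation_sparsity(hidden):.2f}")
```

Because these weights are untrained, roughly half of the entries are zero here. The same measurement could be applied to the activation maps of a trained model to compare against the much higher sparsity the paper reports.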
Prevalence of Activation Sparsity
Researchers from Google have demonstrated through extensive empirical analysis that activation sparsity is not an isolated occurrence but a prevalent feature across transformer configurations, scales, and model types. It appears regardless of the depth of the transformer layer, the type of training data (including language and vision tasks), or the evaluation data. The experiments further show that larger transformers, meaning those with more layers and wider MLPs, exhibit higher degrees of sparsity.
Understanding Activation Sparsity
The paper seeks to uncover the origin of activation sparsity in transformers and tests three hypotheses: that sparsity arises from structured labels, from intrinsic low-dimensional structure in the data, or from the model's ample capacity to fit the training data. Experiments with random labels, random inputs, and an effectively infinite stream of random training data show that none of these factors alone can fully account for the emergence of sparsity. The paper instead speculates that sparsity stems from the training dynamics of the optimization process, supported by a theoretical analysis showing that the training direction inherently favors decreasing activation values.
Implications of Activation Sparsity
Activation sparsity is not just an academic curiosity; it significantly affects the computational efficiency and robustness of transformer models. Sparser activation maps imply fewer computations during inference. Enforcing sparsity through Top-k thresholding, which keeps only the k largest activations and sets the rest to zero, also makes models less sensitive to noisy training data, more robust to input corruptions, and better calibrated in prediction confidence. This suggests that sparsity is not only a byproduct of training but also an asset for model performance and efficiency. Although current hardware offers limited support for sparse operations, the paper presents empirical evidence that models with enforced sparsity promise benefits in wall-time reduction and reliability.
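Top-k thresholding can be sketched in a few lines. The example below is a minimal illustration, not the paper's implementation: it assumes the thresholding is applied independently to each row (for example, each token's activation vector), and the function name `top_k_threshold` is hypothetical.

```python
import numpy as np


def top_k_threshold(activations, k):
    """Keep the k largest entries along the last axis and set all others to zero."""
    width = activations.shape[-1]
    if not 0 < k <= width:
        raise ValueError(f"k must be in 1..{width}, got {k}")
    # Indices of the k largest entries in each row (order within the k is arbitrary).
    top_idx = np.argpartition(activations, -k, axis=-1)[..., -k:]
    # Boolean mask that is True only at the selected positions.
    mask = np.zeros(activations.shape, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=-1)
    return np.where(mask, activations, 0.0)


acts = np.array([[1.0, 5.0, 3.0, 2.0],
                 [4.0, 0.5, 6.0, 0.1]])
print(top_k_threshold(acts, k=2))
```

Each row retains at most k nonzero entries, so downstream computation that depends on those activations only needs to touch k values per row, provided the hardware or kernel can exploit the zeros.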
Conclusion and Future Direction
The unexpected emergence of activation sparsity in transformer models underscores the importance of rethinking the design and deployment of future DNN architectures. The connection between sparse activations and the observed parsimonious use of model parameters, together with the implications of these findings for hardware design, calls for further research into exploiting sparsity. The finding that sparsity can improve the computational efficiency and robustness of DNNs opens a promising avenue toward more energy-efficient and reliable AI systems.