The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers (2210.06313v2)

Published 12 Oct 2022 in cs.LG, cs.CL, cs.CV, and stat.ML

Abstract: This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by sparse we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels, as well as for other architectures including MLP-mixers and 2-layer MLPs. We show that sparsity also emerges using training datasets with random labels, or with random inputs, or with infinite amount of data, demonstrating that sparsity is not a result of a specific family of datasets. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate perhaps surprisingly that enforcing an even sparser activation via Top-k thresholding with a small value of k brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration for their prediction confidence.

Summary

The paper demonstrates that large transformers consistently exhibit high activation sparsity across layers and data types.
It uses extensive experiments, including controls with random labels, random inputs, and unlimited training data, to argue that sparsity is not caused by properties of the dataset and more plausibly arises from training dynamics.
Enforced sparsity reduces computations and enhances model robustness against noise and input corruptions.

Introduction

Machine learning models based on Transformer architectures exhibit a phenomenon termed "activation sparsity": their activation maps, meaning the intermediate outputs of the MLP blocks after the ReLU activation function, consist mostly of zero entries. Sparse activity has been widely studied in biological brains, but it has not been examined as thoroughly in artificial deep neural networks (DNNs).

Prevalence of Activation Sparsity

Researchers from Google demonstrate through extensive empirical analysis that activation sparsity is not an isolated occurrence but a prevalent feature across configurations, scales, and types of Transformer models. On average only a small fraction of the entries in each activation map is nonzero (for example, 3.0% for T5-Base and 6.3% for ViT-B16). The sparsity appears at layers of all depths, for both language and vision tasks, and on both training and evaluation data; it also shows up in other architectures, including MLP-mixers and 2-layer MLPs. Larger Transformers, with more layers and wider MLP hidden dimensions, are sparser as measured by the percentage of nonzero entries.
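The sparsity measure described above (the average percentage of nonzero entries per input after the ReLU) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the helper names `relu_mlp_activations` and `nonzero_fraction`, the layer sizes, and the random weights are all assumptions made for the example.

```python
import numpy as np

def relu_mlp_activations(x, w1, b1):
    """Hidden activations of an MLP block: ReLU(x @ w1 + b1)."""
    return np.maximum(x @ w1 + b1, 0.0)

def nonzero_fraction(activations):
    """Average fraction of nonzero entries per input (one input per row)."""
    per_input = np.count_nonzero(activations, axis=-1) / activations.shape[-1]
    return float(np.mean(per_input))

# Illustrative sizes and random weights (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 64, 256, 128
x = rng.standard_normal((n_tokens, d_model))
w1 = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
b1 = np.zeros(d_ff)

a = relu_mlp_activations(x, w1, b1)
frac = nonzero_fraction(a)
print(f"nonzero fraction with random weights: {frac:.3f}")
```

With random, untrained weights and zero bias, roughly half of the entries are expected to be nonzero, since each pre-activation is symmetric around zero. The figures quoted above for trained models are far lower, which is the gap the paper sets out to explain.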

Understanding Activation Sparsity

The paper seeks to uncover the origin of activation sparsity in Transformers and considers three hypotheses: that sparsity arises from structured labels, from intrinsic low-dimensional structure in the data, or from the model having ample capacity to fit the training data. Experiments with random labels, random inputs, and an unlimited supply of randomly generated training data show that sparsity still emerges in each case, so none of these factors can account for it. The paper instead speculates that sparsity stems from the training dynamics of the optimization process, supported by a theoretical analysis indicating that the direction of the training updates favors decreasing activation values.

Implications of Activation Sparsity

Activation sparsity is not just an academic curiosity; it bears directly on the computational efficiency and robustness of Transformer models. A sparser activation map means fewer computations at inference time, because zero entries contribute nothing to the next layer. Enforcing an even sparser activation through Top-k thresholding with a small k makes models less sensitive to noisy training data, more robust to input corruptions, and better calibrated in their prediction confidence. Sparsity is therefore not only a byproduct of training but also an asset for performance and efficiency. The paper presents empirical evidence that enforced sparsity promises wall-time reductions and better reliability, even though current hardware offers limited support for sparse operations.
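The two ideas in this section, Top-k thresholding and skipping work for zero activations, can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes the threshold is applied to the post-ReLU activation map of a single MLP block, it covers only the forward pass, and the names `top_k_threshold` and `sparse_second_layer`, as well as all sizes, are hypothetical.

```python
import numpy as np

def top_k_threshold(a, k):
    """Keep the k largest entries in each row of `a`; set the rest to zero."""
    if not 0 < k <= a.shape[-1]:
        raise ValueError("k must be between 1 and the hidden width")
    # Indices of the k largest entries per row (order within the top k is arbitrary).
    idx = np.argpartition(a, -k, axis=-1)[..., -k:]
    out = np.zeros_like(a)
    np.put_along_axis(out, idx, np.take_along_axis(a, idx, axis=-1), axis=-1)
    return out

def sparse_second_layer(a_row, w2):
    """Compute a_row @ w2 using only the rows of w2 with a nonzero activation."""
    active = np.flatnonzero(a_row)
    return a_row[active] @ w2[active]

# Illustrative sizes and random weights (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
d_model, d_ff, k = 16, 64, 4
x = rng.standard_normal((8, d_model))
w1 = rng.standard_normal((d_model, d_ff))
w2 = rng.standard_normal((d_ff, d_model))

a = np.maximum(x @ w1, 0.0)        # post-ReLU activation map
a_k = top_k_threshold(a, k)        # at most k nonzeros per input
dense_out = a_k @ w2
sparse_out = np.stack([sparse_second_layer(row, w2) for row in a_k])
print(np.allclose(dense_out, sparse_out))
```

In this sketch the second-layer product for one input touches only the rows of `w2` that correspond to nonzero activations, so its multiply-add count scales with the number of nonzeros rather than with the full hidden width `d_ff`. Whether that translates into wall-time savings depends on hardware and kernel support for sparse operations.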

Conclusion and Future Direction

The unexpected emergence of activation sparsity in Transformer models underscores the importance of rethinking the design and deployment of future DNN architectures. The connection between sparse activations and the observed parsimonious use of model parameters, as well as the implications of these findings for hardware design, calls for further research into exploiting sparsity. The evidence that sparsity can improve the computational efficiency and robustness of DNNs opens a promising avenue toward more energy-efficient and reliable AI systems.
