Sparsing Law: Towards Large Language Models with Greater Activation Sparsity (2411.02335v1)

Published 4 Nov 2024 in cs.LG, cs.CL, and stat.ML

Abstract: Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with LLMs. Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing power-law and decreasing logspace power-law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These demonstrate that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio linearly increases with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws towards LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.

Activation Sparsity in LLMs: An Analytical Study

The paper studies activation sparsity in decoder-only Transformer-based LLMs: the phenomenon where many entries in a layer's activation outputs are zero or near-zero and therefore contribute negligibly to the model's output. This intrinsic sparsity can be exploited to accelerate computation and to improve model interpretability.
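
As a concrete illustration, the sketch below measures sparsity as the fraction of activation entries whose magnitude falls below a fixed cutoff. This is only an approximation of the idea: the fixed threshold `eps` is an assumption for illustration, whereas the paper's PPL-p% metric chooses thresholds adaptively per layer.

```python
import torch

def activation_sparsity(x: torch.Tensor, eps: float = 1e-3) -> float:
    """Fraction of activation entries whose magnitude is below `eps`.

    `x` is the output of an MLP's activation function with shape
    (batch, seq_len, d_ff); `eps` is an illustrative fixed cutoff, not
    the adaptive threshold used by the paper's PPL-p% metric.
    """
    return (x.abs() < eps).float().mean().item()

# ReLU outputs exact zeros for roughly half of Gaussian inputs, while
# SiLU outputs are rarely this close to zero under a hard cutoff.
x = torch.randn(4, 128, 4096)
print(activation_sparsity(torch.relu(x)))                 # ~0.50
print(activation_sparsity(torch.nn.functional.silu(x)))   # ~0.00
```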

Key Findings

The authors engage in a comprehensive analysis of activation sparsity, investigating its scaling properties and the influential architectural factors impacting it. Their investigations yield several crucial insights:

  1. Activation Functions and Training Data Relationship: Different activation functions exhibit distinct sparsity behaviors during training. ReLU-activated models show a decreasing logspace power-law relationship between the activation ratio and the volume of training data, converging towards a limit sparsity ratio. Conversely, SiLU-activated models follow an increasing vanilla power-law, suggesting that ReLU is the more efficient activation function and can leverage additional training data to improve sparsity (a curve-fitting sketch follows this list). Interestingly, despite these opposite sparsity trends, ReLU- and SiLU-activated models achieve comparable task performance.
  2. Width-Depth Ratio Effects: At a fixed parameter scale, the activation ratio increases roughly linearly with the width-depth ratio up to a certain bottleneck point, beyond which it stabilizes; in other words, deeper models are sparser. However, there is an optimal range of width-depth ratios, and making a model deeper than this range allows can degrade performance despite the additional sparsity.
  3. Parameter Scale Independence: Surprisingly, activation patterns within the LLMs appear to be largely insensitive to model size under similar width-depth ratios, with sparsity converging more rapidly in smaller models. This suggests an inherent organizational structure in neuron specialization that remains consistent across scales.
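
To make the first finding concrete, the sketch below fits a convergent power law of the form A(D) = A_limit - c * D**(-alpha) to activation-ratio measurements taken at successive training checkpoints. Both the functional form and the numbers are illustrative assumptions, not the paper's exact parameterization or data.

```python
import numpy as np
from scipy.optimize import curve_fit

def convergent_power_law(D, A_limit, c, alpha):
    """Illustrative convergent form A(D) = A_limit - c * D**(-alpha);
    the paper's exact parameterization may differ."""
    return A_limit - c * D ** (-alpha)

# Hypothetical checkpoints: training data in billions of tokens and the
# activation ratio (1 - sparsity) measured at each checkpoint for a
# SiLU-activated model (numbers are made up for illustration).
D = np.array([1.0, 2.0, 5.0, 10.0, 20.0, 50.0])
A = np.array([0.62, 0.66, 0.69, 0.70, 0.71, 0.715])

(A_limit, c, alpha), _ = curve_fit(convergent_power_law, D, A,
                                   p0=[0.72, 0.10, 0.5])
print(f"estimated limit activation ratio: {A_limit:.3f}, exponent: {alpha:.2f}")
```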

Methodological Approach

The authors introduce a novel metric termed PPL-p% sparsity, a versatile, performance-aware measure that underpins their analysis. The metric applies to any activation function, enables precise comparisons across model architectures, and identifies weakly-contributed neurons through adaptive layer-wise thresholds. Compared with conventional sparsity metrics, PPL-p% attains the highest sparsity without significant performance degradation, which the authors use to demonstrate its utility.
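
The sketch below outlines one way such a performance-aware search could look: binary-search the largest pruning threshold whose perplexity stays within a p% budget of the dense baseline, then report the resulting sparsity. The helpers `evaluate_ppl` and `measure_sparsity` are hypothetical stand-ins, and the paper's actual layer-wise threshold selection may differ in detail.

```python
def ppl_p_sparsity(evaluate_ppl, measure_sparsity, p=0.01,
                   lo=0.0, hi=1.0, iters=20):
    """Binary-search the largest pruning threshold whose perplexity stays
    within a (1 + p) factor of the dense baseline, then report the
    activation sparsity achieved at that threshold.

    `evaluate_ppl(threshold=t)` and `measure_sparsity(threshold=t)` are
    hypothetical helpers that run the model with activations below `t`
    zeroed out; perplexity is assumed to be non-decreasing in `t`.
    """
    budget = evaluate_ppl(threshold=0.0) * (1.0 + p)  # dense baseline + p%
    best = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if evaluate_ppl(threshold=mid) <= budget:
            best, lo = mid, mid        # still within budget: prune harder
        else:
            hi = mid                   # too much degradation: back off
    return measure_sparsity(threshold=best)
```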

Implications and Future Directions

The findings carry practical implications for the design and pre-training of more efficient and interpretable LLMs. The research paves the way for models with controllable activation sparsity, for example by adopting ReLU activations and by tuning the width-depth ratio toward deeper architectures within its optimal range. Moreover, predicting sparsity levels later in training could improve resource allocation and offer insight into neuron specialization dynamics; a small extrapolation sketch follows.
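
For instance, a fitted law of this kind could be extrapolated to forecast the activation ratio at future data scales and inform decisions about when further training stops paying off in sparsity. The parameters below are hypothetical, in the spirit of the fit sketched earlier.

```python
# Hypothetical parameters from a fit like the one sketched above.
A_limit, c, alpha = 0.72, 0.10, 0.7

def predicted_activation_ratio(d_billion_tokens: float) -> float:
    """Extrapolate the fitted convergent power law to a future data scale."""
    return A_limit - c * d_billion_tokens ** (-alpha)

for d in (100, 500, 1000):  # billions of training tokens
    print(d, round(predicted_activation_ratio(d), 4))
```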

As future work, experiments with larger models could reveal whether these scaling laws persist beyond the scales studied here, while accounting for the computational cost of such sparsity measurements. Exploring more diverse datasets could also test the robustness of the observed laws and deepen our understanding of activation sparsity across different training settings.

Authors (7)
  1. Yuqi Luo (2 papers)
  2. Chenyang Song (7 papers)
  3. Xu Han (270 papers)
  4. Yingfa Chen (11 papers)
  5. Chaojun Xiao (39 papers)
  6. Zhiyuan Liu (433 papers)
  7. Maosong Sun (337 papers)