Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity (2504.18929v1)

Published 26 Apr 2025 in cs.LG and cs.AI

Abstract: Compression has been a critical lens to understand the success of Transformers. In the past, we have typically taken the target distribution as a criterion to evaluate a model's compression performance. Nevertheless, it often remains challenging to precisely assess how well the model achieves compression and to compare the information content of the learned distribution with that of the target distribution during compression, as the target distribution is typically unknown and entropy computation often incurs exponential cost. In this work, we explore these issues under a controlled experimental setup. We find that Transformers exhibit a unique inductive bias in data compression: beyond approaching the target distribution, they tend to favor learning lower-entropy distributions, with this tendency becoming more pronounced as the model size increases. This preference prevents Transformers from perfectly aligning with the target distribution, instead further compressing its information content. Furthermore, we show that the FFN module plays a critical role in driving this bias. In addition, while models remove informational redundancy from data during compression, they also exhibit redundancy within their parameters, which enables compression and can be characterized through dynamic sparsity. However, the dynamic sparsity patterns in Transformers, particularly in attention and FFN modules, demand further exploration. As for this, we show that larger Transformers show stronger preferences for bypassing attention computations via residual connections and have lower proportion of active neurons. Interestingly, we also find that training instability in larger models strongly correlates with sudden increases in dead neurons. Our work contributes to a deeper understanding of Transformers from the lens of entropy and dynamic sparsity.

Summary

  • The paper analyzes intrinsic properties of Transformers, finding that they favor low-entropy distributions and exhibit dynamic sparsity, particularly as model size increases.
  • The Feed-Forward Network (FFN) module is identified as primarily responsible for guiding Transformers towards lower-entropy solutions and exhibiting significant dynamic sparsity.
  • Larger Transformer models demonstrate dynamic sparsity in both attention (favoring residual connections) and FFN modules (activating fewer neurons), with sparsity increases correlating with training loss spikes.

Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity

Transformers have become a cornerstone in the field of artificial intelligence, especially in tasks related to natural language processing. The paper "Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity" by Ruifeng Ren and Yong Liu offers an in-depth exploration of the intrinsic properties of Transformers that contribute to their success, focusing on entropy and dynamic sparsity. By setting up controlled experiments, the authors investigate how these models handle data compression and express preferences for certain computational pathways.

Insights into Data Compression and Entropy

One of the central findings of this paper is that Transformers tend to favor lower-entropy distributions during training. Rather than merely approximating the target distribution, they exhibit an implicit bias toward learning distributions whose entropy falls below that of the target, and this tendency becomes more pronounced as model size increases. The behavior contrasts with RNN architectures such as GRU and LSTM, which track the entropy of the target distribution more closely as they scale. The authors' hypothesis is that Transformers compress data not only by learning the target distribution but also by further condensing its information content. This finding has important implications for understanding compression as a mechanism underlying intelligence in AI systems.
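
To make the entropy comparison concrete, the sketch below is a hypothetical toy illustration rather than the paper's actual protocol: it takes a known categorical target distribution, builds a "learned" distribution that is slightly sharpened toward high-probability outcomes (a temperature below 1 is purely an assumption used to mimic the reported low-entropy bias), and compares their Shannon entropies.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in bits) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p + eps)))

# Hypothetical target distribution over a tiny vocabulary (an assumption for
# illustration; the paper uses its own controlled data-generating process).
rng = np.random.default_rng(0)
target = rng.dirichlet(alpha=np.ones(8))

# A "learned" distribution sharpened toward high-probability tokens, mimicking
# the reported bias toward lower entropy (temperature < 1 is an assumption).
temperature = 0.8
learned = target ** (1.0 / temperature)
learned /= learned.sum()

print(f"target entropy : {entropy(target):.4f} bits")
print(f"learned entropy: {entropy(learned):.4f} bits")  # lower than the target's
```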

The Role of the FFN Module

The researchers dissect the structural components of Transformers to identify the drivers of this entropy bias, isolating and examining the attention and feed-forward network (FFN) modules. They find that the FFN module is chiefly responsible for guiding the model toward lower-entropy solutions. This insight underscores the FFN module's role in shaping how Transformers model data and suggests new directions for architectural optimization aimed at improving compression.
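
For readers less familiar with the module in question, the sketch below shows a standard position-wise FFN block in PyTorch. The dimensions, 4x expansion, and ReLU activation are conventional defaults assumed here for illustration, not values taken from the paper.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Position-wise feed-forward block of a Transformer layer.

    d_model=512, the 4x expansion, and ReLU are common defaults (assumptions),
    not details reported in the paper.
    """
    def __init__(self, d_model: int = 512, expansion: int = 4):
        super().__init__()
        self.up = nn.Linear(d_model, expansion * d_model)
        self.act = nn.ReLU()
        self.down = nn.Linear(expansion * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the residual add and layer norm are
        # typically applied by the surrounding layer, so they are omitted here.
        return self.down(self.act(self.up(x)))

x = torch.randn(2, 16, 512)
print(FFN()(x).shape)  # torch.Size([2, 16, 512])
```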

Dynamic Sparsity: Attention and FFN Modules

Beyond data compression, the paper examines the redundancy and sparsity inherent in Transformers' parameters, highlighting dynamic sparsity patterns that are particularly pronounced in larger models. In the attention module, larger models show a stronger preference for routing computation through residual connections rather than through attention, demonstrating a trend toward sparse computation. This structural choice can be read as an efficiency mechanism: the model bypasses costly attention computations when outputs can be predicted correctly through shorter pathways.
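
One simple way to probe this tendency, offered here as a hypothetical proxy rather than the paper's exact measurement, is to compare the norm of an attention sublayer's output with the norm of the residual stream it is added to; when the ratio is small, the block's output is dominated by the residual path and the attention contribution is effectively bypassed.

```python
import torch

def attention_bypass_ratio(residual: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
    """Per-token ratio ||attn_out|| / ||residual||.

    Small values indicate the attention contribution is negligible relative to
    the residual stream, i.e. the block is effectively "bypassed". This metric
    is an illustrative assumption, not the paper's definition.
    """
    attn_norm = attn_out.norm(dim=-1)       # (batch, seq_len)
    resid_norm = residual.norm(dim=-1)      # (batch, seq_len)
    return attn_norm / (resid_norm + 1e-8)

residual = torch.randn(2, 16, 512)
attn_out = 0.05 * torch.randn(2, 16, 512)  # a weak attention contribution (toy data)
ratios = attention_bypass_ratio(residual, attn_out)
print(f"mean bypass ratio: {ratios.mean().item():.4f}")
```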

Similarly, the FFN module exhibits dynamic sparsity, with larger models activating a smaller fraction of neurons on average. Notably, this sparsity tends to increase abruptly during training, coinciding with loss spikes, which suggests a correlation between training instability and sudden shifts in neuronal activity, including sharp rises in dead neurons. The paper further argues that second-order gradient information plays a critical role in the formation of this sparsity, pointing to an intricate relationship between optimization dynamics and model behavior.
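
A minimal sketch of how this kind of sparsity can be estimated is given below, under stated assumptions (a ReLU hidden layer and a strictly-positive activation criterion, which may differ from the paper's exact setup): it records the fraction of hidden units active per token and the fraction of units that never fire over a batch ("dead neurons").

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_hidden = 64, 256
up = nn.Linear(d_model, d_hidden)
act = nn.ReLU()

x = torch.randn(8, 32, d_model)           # (batch, seq_len, d_model), toy input
hidden = act(up(x))                       # (batch, seq_len, d_hidden)

active = hidden > 0                       # "active" criterion is an assumption
active_frac_per_token = active.float().mean(dim=-1)      # fraction of active units per token
dead_mask = ~active.flatten(0, 1).any(dim=0)              # units never active in this batch
dead_frac = dead_mask.float().mean()

print(f"mean active fraction per token: {active_frac_per_token.mean().item():.3f}")
print(f"dead-neuron fraction:           {dead_frac.item():.3f}")
```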

Practical Implications and Future Directions

These findings suggest several computational and architectural optimizations that could leverage Transformers' intrinsic preference for low-entropy solutions and sparse pathways. Understanding these biases and patterns could inform the design of more efficient models, particularly as AI systems continue to scale to increasingly complex tasks. As the authors suggest, further exploration of the theoretical underpinnings of these observations could lead to refined training regimes that optimize for both stability and efficiency, minimizing loss spikes and improving model reliability.

In conclusion, the insights gathered from this paper contribute to a nuanced understanding of Transformers, revealing the deep-seated computational preferences that underlie their success. By examining entropy and dynamic sparsity, this research highlights key properties that could shape future developments in model design and training strategies.
