Entropy Attention Maps in Neural Systems
- Entropy Attention Maps (EAM) are methodologies that use Shannon entropy to distinguish redundant from dynamic regions in neural attention models.
- They analyze attention distributions via entropy metrics to enable sparsity masking, quantization, and significant inference cost reductions.
- Empirical results show that EAM can reduce FLOPs and memory usage with negligible accuracy loss across transformer models like ViT, DeiT, and Swin.
Entropy Attention Maps (EAM) denote two closely related but distinct methodologies that utilize entropy-based metrics to identify and localize regions of low information or high dissipation in neural attention mechanisms. The concept has emerged in both transformer compression for computer vision and in the unsupervised inference of entropy production in stochastic physical systems. In both contexts, EAMs serve as diagnostic and operational tools for reducing redundancy, extracting physically meaningful structure, or improving computational tractability.
1. Shannon Entropy in Attention-Based Models
In transformer models for vision, multi-head self-attention generates an attention map $A^{(n,\ell,h)} \in \mathbb{R}^{T \times T}$ for each sample $n$, layer $\ell$, and head $h$. Each entry $A^{(\ell,h)}_{ij}$, collected across the input dataset of $N$ samples, is interpreted as an empirical distribution. The Shannon entropy per entry is estimated via a $b$-bit histogram with bin probabilities

$$p_k = \frac{1}{N}\,\#\{\, n : A^{(n,\ell,h)}_{ij} \in \mathrm{bin}_k \,\}, \qquad k = 1, \dots, 2^b,$$

and the entropy itself as

$$H^{(\ell,h)}_{ij} = -\sum_{k=1}^{2^b} p_k \log_2 p_k.$$

The full set $\{H^{(\ell,h)}_{ij}\}$ forms the entropy attention map for head $h$ in layer $\ell$, quantifying spatial and headwise heterogeneity in information content (Maisonnave et al., 22 Aug 2025).
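As an illustrative sketch (not the authors' code; the array shapes, an 8-bit histogram depth, and the assumption that attention values lie in [0, 1] are all hypothetical choices), the per-entry entropy for one (layer, head) pair could be estimated as:

```python
import numpy as np

def entropy_attention_map(attn, bits=8):
    """Estimate per-entry Shannon entropy of attention maps.

    attn: array of shape (N, T, T) -- attention maps for one (layer, head)
          pair across N calibration samples, entries assumed in [0, 1].
    Returns a (T, T) map of entropies in bits.
    """
    n_bins = 2 ** bits
    N, T, _ = attn.shape
    H = np.zeros((T, T))
    for i in range(T):
        for j in range(T):
            counts, _ = np.histogram(attn[:, i, j], bins=n_bins, range=(0.0, 1.0))
            p = counts / N
            p = p[p > 0]                       # drop empty bins (0 log 0 = 0)
            H[i, j] = -(p * np.log2(p)).sum()
    return H

# A head that attends identically on every input has zero entropy everywhere;
# a head whose entries vary per input has strictly positive entropy.
rng = np.random.default_rng(0)
static = np.repeat(rng.random((1, 4, 4)), 100, axis=0)   # identical across samples
dynamic = rng.random((100, 4, 4))                        # varies per sample
```

A fully static entry lands in a single histogram bin ($p_k = 1$), giving exactly zero entropy, which is what makes the map usable as a redundancy detector.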
Analogously, in neural estimators of entropy production from time-series image data, such as Brownian movies, a spatial dissipation map is obtained from the last-layer feature activations of a convolutional neural network trained to distinguish forward from time-reversed movie trajectories. Each pixel of this map reflects the local entropy production per image transition (Bae et al., 2021).
2. Entropy Distribution Analysis and Redundancy Detection
Computing these entropy maps over a subset of ImageNet-1K reveals a wide distribution of entropy values across attention heads and map positions. Typically, a subset of heads exhibits low mean entropy with minimal input dependence, indicating that these heads contribute redundant or deterministic attention patterns. In contrast, high-entropy heads encode greater context sensitivity.
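A minimal way to flag such low-entropy heads, assuming per-head entropy maps have already been computed (the dict layout, function name, and threshold value here are illustrative, not the paper's interface):

```python
import numpy as np

def redundant_heads(head_entropy_maps, threshold):
    """Return (layer, head) keys whose mean entropy falls below threshold.

    head_entropy_maps: dict mapping (layer, head) -> (T, T) entropy map (bits).
    threshold: mean-entropy cutoff below which a head is considered redundant.
    """
    return sorted(
        key for key, H in head_entropy_maps.items()
        if float(np.mean(H)) < threshold
    )

# Toy example: head (0, 1) attends almost deterministically.
maps = {
    (0, 0): np.full((4, 4), 3.2),   # context-sensitive head (high entropy)
    (0, 1): np.full((4, 4), 0.1),   # near-static head (low entropy)
}
```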
In stochastic physical systems, the CNN-based estimator produces spatial maps that resolve the local structure of dissipative processes—distinguishing, for instance, which bead-spring connections or network edges disproportionately generate entropy. These maps can be quantitatively validated against analytic or simulated ground-truth exchange rates (Bae et al., 2021).
3. Sparsity Masking, Quantization, and Freezing Protocols
In transformer EAM, a global sparsity level $s$ is set, and the entropy value $\tau$ at the corresponding percentile is selected as a threshold. A binary mask

$$M^{(\ell,h)}_{ij} = \begin{cases} 1, & H^{(\ell,h)}_{ij} > \tau \quad \text{(dynamic)},\\ 0, & H^{(\ell,h)}_{ij} \le \tau \quad \text{(static)},\end{cases}$$

partitions the attention-map entries into "dynamic" (recomputed per input) and "static" (fixed and quantized). At inference, dynamic entries are computed as in standard attention, while static entries are replaced by their dataset means $\bar{A}^{(\ell,h)}_{ij}$.
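The percentile-based partition can be sketched as follows (a minimal illustration, not the paper's implementation; the function name and the use of `np.quantile` as the percentile rule are assumptions):

```python
import numpy as np

def dynamic_mask(entropy_map, sparsity):
    """True = dynamic entry (recomputed per input); False = static (frozen).

    entropy_map: (T, T) per-entry Shannon entropies for one head.
    sparsity: fraction of entries to freeze, in [0, 1].
    """
    tau = np.quantile(entropy_map, sparsity)   # entropy percentile threshold
    return entropy_map > tau

H = np.array([[0.0, 2.5],
              [0.1, 3.0]])
mask = dynamic_mask(H, 0.5)   # freeze the lowest-entropy half of the entries
# entries (0, 0) and (1, 0) fall below the threshold and are frozen
```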
Low-entropy static entries are quantized to 4 bits using uniform symmetric quantization per head,

$$\hat{A}_{ij} = \Delta \cdot \operatorname{round}\!\left(\frac{\bar{A}_{ij}}{\Delta}\right), \qquad \Delta = \frac{\max_{i,j} |\bar{A}_{ij}|}{2^{4-1} - 1}.$$

Dynamic entries and the rest of the model (weights and activations) are simultaneously quantized using a post-training quantizer such as RepQ-ViT (Maisonnave et al., 22 Aug 2025).
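A self-contained sketch of uniform symmetric quantization at 4 bits (the function names and the int8 storage type are illustrative choices, not the paper's code):

```python
import numpy as np

def quantize_symmetric(x, bits=4):
    """Uniform symmetric quantization of a per-head static map.

    Returns integer codes and the scale needed to dequantize.
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    peak = np.max(np.abs(x))
    scale = peak / qmax if peak > 0 else 1.0
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

A_static = np.array([0.00, 0.35, 0.70, 1.00])
codes, scale = quantize_symmetric(A_static, bits=4)
A_hat = dequantize(codes, scale)
# reconstruction error per entry is bounded by scale / 2
```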
4. Inference Integration and Computational Implications
The modified inference pipeline loads the precomputed, quantized static maps, computes $QK^\top$ and the softmax only for dynamic positions, and fills static entries by dequantization. The resulting attention matrix is then used as usual for value aggregation. This hybrid mechanism omits a fraction of attention-map computations equal to the chosen sparsity level, reducing FLOPs and memory reads; 4-bit frozen maps need only one-eighth the on-chip storage per static entry compared with FP32 (Maisonnave et al., 22 Aug 2025).
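One plausible reading of this hybrid step, sketched for a single head (the function signature, the row normalization over mixed dynamic and static entries, and the dense score computation are assumptions for clarity; real hardware would skip the frozen positions entirely):

```python
import numpy as np

def hybrid_attention(q, k, v, static, static_map):
    """Single-head hybrid static/dynamic attention (illustrative sketch).

    q, k, v: (T, d) query/key/value projections for the current input.
    static: (T, T) boolean mask, True = frozen entry (not recomputed).
    static_map: (T, T) dequantized dataset-mean attention values.
    """
    d = q.shape[1]
    scores = (q @ k.T) / np.sqrt(d)        # computed densely here for clarity
    weights = np.exp(scores)
    weights[static] = static_map[static]   # fill frozen entries by dequantization
    weights /= weights.sum(axis=-1, keepdims=True)   # normalize the mixed rows
    return weights @ v
```

With an all-`False` mask this reduces to standard softmax attention, which makes the sparsity-zero case a useful sanity check.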
In the unsupervised entropy production regime, the CNN estimator’s output explicitly decomposes the scalar entropy-production estimate into a sum of local pixelwise contributions, permitting direct visualization and quantitative assessment of spatial dissipation heterogeneity (Bae et al., 2021).
5. Empirical Performance and Validation
Extensive evaluation on ImageNet-1K for ViT, DeiT, and Swin Transformer variants post-trained to 4-bit quantization demonstrates:
- Negligible accuracy drop (<0.2%) up to moderate sparsity levels.
- At moderate sparsity, EAM often matches or surpasses the RepQ-ViT baseline (e.g., DeiT-Base Top-1 75.31% → 75.71%).
- Even at aggressive sparsity, most models lose <2% accuracy; Swin-Small retains nearly all of its baseline performance.
- Random freezing produces significant accuracy degradation relative to entropy-guided selection (5–50% gap in Top-1), confirming entropy’s role as a precise redundancy indicator.
| Model | RepQ-ViT Top-1 (%) | EAM, moderate sparsity (%) | EAM, higher sparsity (%) |
|---|---|---|---|
| ViT-S | 64.92 | 65.09 | 65.19 |
| ViT-B | 68.46 | 68.18 | 68.16 |
| DeiT-T | 57.91 | 58.03 | 58.02 |
| DeiT-S | 68.58 | 68.74 | 68.53 |
| DeiT-B | 75.31 | 75.71 | 75.64 |
| Swin-T | 70.67 | 70.65 | 70.55 |
| Swin-S | 79.45 | 79.79 | 79.63 |
In physical systems, the correlation between CNN-predicted and true entropy production is high in bead–spring models; local dissipation maps reliably recover the theoretically predicted structure and are notably robust to noise, finite resolution, and partial observation (Bae et al., 2021).
6. Limitations and Applicability Conditions
The entropy-masking approach in transformer models requires an initial calibration pass on a representative data sample (e.g., 5% of ImageNet-1K) and is most effective when the attention mechanism exhibits clear redundancy. Tasks or models with uniformly high entropy across attention heads are not amenable to substantial inference-cost reduction by this method. The current EAM pipeline computes softmax normalization over mixed dynamic and static entries; a more aggressive (e.g., row-wise) normalization could further increase sparsity and efficiency.
For unsupervised entropy production inference, estimator performance degrades with high noise and very limited temporal context, though much of this loss is mitigated by increasing frame concatenation depth. Generalization to other types of experimental or simulated video data depends on the similarity of input statistics to those seen in the bead-spring or filament network examples (Bae et al., 2021).
7. Deployment and Broader Implications
EAM is highly suited for deployment scenarios emphasizing low compute and memory footprints, such as mobile or embedded vision accelerators. Static, quantized maps are ideal for SRAM storage, and the pattern of sparsity may be exploited by hardware primitives for sparse matrix computation, enabling reductions in latency and energy consumption.
The methodology of quantifying and exploiting informational redundancy via local entropy analysis in neural attention mechanisms may have broader application in model compression, interpretability, and the design of hybrid static-dynamic inference pipelines in both machine learning and physical sciences (Maisonnave et al., 22 Aug 2025, Bae et al., 2021).
Key References:
- "Exploiting Information Redundancy in Attention Maps for Extreme Quantization of Vision Transformers" (Maisonnave et al., 22 Aug 2025).
- "Attaining entropy production and dissipation maps from Brownian movies via neural networks" (Bae et al., 2021).