Entropy-Based Pruning in Neural Networks

Updated 30 March 2026
  • Entropy-based pruning is a model compression technique that uses information-theoretic measures like Shannon entropy to identify and remove redundant neural network parameters.
  • It employs methods such as activation entropy, spatial entropy, and weight-number entropy to guide structured and unstructured pruning across various architectures.
  • Empirical studies report high sparsity and compression, usually with a small accuracy drop, across models ranging from CNNs to transformers.

Entropy-based pruning refers to a broad class of model compression techniques that leverage information-theoretic quantities, especially entropy, as objective functions or filter/node/connection selection criteria during the pruning of neural networks and other statistical models. By quantifying the information content, uncertainty, or redundancy of parameters, hidden representations, or dataset samples, entropy-based methods provide both theoretical rigor and empirical effectiveness across diverse architectures and data modalities.

1. Theoretical Foundations: Minimum Description Length and Entropy Quantification

The core theoretical underpinning of entropy-based pruning is the Minimum Description Length (MDL) principle. In the neural network context, learning is framed as a two-part code: transmitting the model parameters ($W$) and then the prediction errors given $W$. The total code length is decomposed as

$$L(W) = L_{\text{data}}(W) + \alpha L_{\text{model}}(W)$$

where $L_{\text{data}}(W) = -\log_2 p(Y \mid X, W)$ is the data (likelihood) loss and $L_{\text{model}}(W)$ is the information-theoretic complexity of the weights. The latter can be quantified in terms of entropy: for $n$ weights $W = \{w_1, \dots, w_n\}$ taking values in a discrete alphabet $\Omega = \{\omega_0, \dots, \omega_{K-1}\}$, the empirical discrete probability mass function is $\hat\mu_k = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[w_i = \omega_k]$, yielding a Shannon entropy

$$H(\hat\mu) = -\sum_{k=0}^{K-1} \hat\mu_k \log_2 \hat\mu_k$$

and hence $L_{\text{model}}(W) = n H(\hat\mu)$ bits to encode $W$ optimally (Wiedemann et al., 2018).
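The entropy and code-length formulas above can be checked numerically. The following is a minimal NumPy sketch, not the implementation of Wiedemann et al.: the function name `weight_entropy_bits`, the nearest-value quantizer, and the toy alphabet are all assumptions made for illustration.

```python
import numpy as np

def weight_entropy_bits(weights, alphabet):
    """Return (H, n*H): empirical entropy in bits per weight and total model code length."""
    w = np.asarray(weights, dtype=float).ravel()
    omega = np.asarray(alphabet, dtype=float)
    # Assumed quantizer: map every weight to its nearest alphabet value.
    idx = np.argmin(np.abs(w[:, None] - omega[None, :]), axis=1)
    counts = np.bincount(idx, minlength=omega.size)
    mu = counts / w.size                      # empirical pmf over the alphabet
    nz = mu[mu > 0]                           # 0 * log 0 is treated as 0
    h = float(-(nz * np.log2(nz)).sum())
    return h, w.size * h

# Toy alphabet with omega_0 = 0 (hypothetical values).
alphabet = [0.0, -0.5, 0.5, 1.0]
dense = np.array([0.0, -0.5, 0.5, 1.0] * 2)     # uniform over the 4 values
sparse = np.array([0.0] * 6 + [0.5, -0.5])      # mass concentrated on zero

h_dense, bits_dense = weight_entropy_bits(dense, alphabet)
h_sparse, bits_sparse = weight_entropy_bits(sparse, alphabet)
print(h_dense, bits_dense, h_sparse, bits_sparse)
```

The uniform layer costs 2 bits per weight, while the mostly-zero layer costs fewer, which is the quantity the model term rewards.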

This leads to a regularized objective: $\mathcal{L}(W) = \mathcal{L}_{\text{data}}(W) + \lambda n H(\hat\mu)$, where $\lambda$ (the counterpart of $\alpha$ in the code-length decomposition) controls the tradeoff between accuracy and compression (in bits).

Entropic penalization unifies pruning (increased mass on $\omega_0 = 0$ yields explicit sparsity) and quantization (concentration on a few discrete values), providing a principled way to induce structured or unstructured sparsity and low-precision parameterizations (Wiedemann et al., 2018).
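A hypothetical worked example shows why shifting mass onto zero lowers the model cost. The numbers ($n = 1000$ weights, $K = 4$ alphabet values, 90% zeros) are illustrative assumptions, not results from the cited papers.

```latex
\begin{align*}
\text{Uniform: } \hat\mu &= (0.25,\ 0.25,\ 0.25,\ 0.25), &
H(\hat\mu) &= \log_2 4 = 2 \text{ bits}, &
n H(\hat\mu) &= 2000 \text{ bits}, \\
\text{Sparse: } \hat\mu &= \left(0.9,\ \tfrac{0.1}{3},\ \tfrac{0.1}{3},\ \tfrac{0.1}{3}\right), &
H(\hat\mu) &= 0.9 \log_2 \tfrac{1}{0.9} + 0.1 \log_2 30 \approx 0.1368 + 0.4907 \approx 0.627 \text{ bits}, &
n H(\hat\mu) &\approx 627 \text{ bits}.
\end{align*}
```

Under these assumptions the sparse configuration needs roughly a third of the bits of the uniform one, even though both use the same four-value alphabet.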

2. Methodologies: Entropy-Based Pruning Criteria and Algorithms

Multiple algorithmic instantiations of entropy-based pruning have been proposed:

  • Activation Entropy Pruning: Filters or channels whose activations over a dataset yield low Shannon entropy (i.e., low variability and information) are deemed uninformative and pruned accordingly. For a filter $j$ with activation histogram $\{p_i\}_{i=1}^m$ over $n$ samples, the entropy is $H_j = -\sum_{i=1}^m p_i \log p_i$ (Luo et al., 2017).
  • Spatial Entropy for CNNs: Local entropy, such as spatial aura entropy (AME), quantifies the disorder among neighboring activations for 2D feature maps, capturing more structural information than global activation entropy. This is critical in structured RL-based filter pruning (Musat et al., 2023, Musat et al., 2023).
  • Weight-Number Entropy: The entropy of the weight value distribution (over all parameters in a block/layer) reflects how dispersed the parameter values are. High-entropy blocks are considered more important; Gardener prunes the lowest-entropy blocks in vision transformers (Xiang et al., 3 Feb 2026).
  • Matrix Entropy (Von Neumann Entropy) for Tokens: Pruning visual tokens in MLLMs is guided by the matrix (quantum/Von Neumann) entropy of trace-normalized covariance matrices of token features; tokens with lower entropy are less informative and pruned (Wang et al., 19 Feb 2026).
  • Conditional Entropy and Mutual Information: Conditional entropy $H(\ell \mid Z)$ of the loss given filter activation, or mutual information $I(\mathcal{X}_\ell; \mathcal{X}_{\ell+1})$ between adjacent layer activations, directly quantifies task-relevant information, guiding pruning to preserve predictive capacity (Min et al., 2018, Westphal et al., 2024).
  • Entropy in Pruning Schedules: Adaptive allocation of pruning across layers uses normalized filter or layer entropy to determine where pruning is least likely to damage model capacity or informativeness (Chen et al., 2024, Lu et al., 2022, Liao et al., 2024, Liao et al., 2023).
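The activation entropy criterion from the first item of the list can be sketched as follows. This is an illustrative NumPy version rather than the code of Luo et al.: it assumes each filter has already been reduced to one scalar per sample (for example by global average pooling), and the helper names, bin count, and prune ratio are hypothetical.

```python
import numpy as np

def activation_entropy(acts, num_bins=16):
    """Per-filter Shannon entropy (nats) of a histogram of activations.

    acts: array of shape (n_samples, n_filters), one scalar per sample and filter
    (assumed to come from pooled feature maps).
    """
    acts = np.asarray(acts, dtype=float)
    scores = np.empty(acts.shape[1])
    for j in range(acts.shape[1]):
        counts, _ = np.histogram(acts[:, j], bins=num_bins)
        p = counts / counts.sum()
        p = p[p > 0]                          # skip empty bins
        scores[j] = -(p * np.log(p)).sum()
    return scores

def lowest_entropy_filters(scores, prune_ratio):
    """Indices of the filters with the lowest entropy, i.e. the pruning candidates."""
    k = int(len(scores) * prune_ratio)
    return np.sort(np.argsort(scores)[:k])

rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 4))
acts[:, 1] = 3.0                              # a filter that always fires identically
scores = activation_entropy(acts)
pruned = lowest_entropy_filters(scores, prune_ratio=0.25)
print(scores, pruned)
```

The constant filter lands in a single histogram bin, gets zero entropy, and is the one selected for removal.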

Continuous relaxations of non-differentiable objectives allow end-to-end gradient-based minimization, using parametric discrete distributions to softly assign weights to possible discrete values: $\mathbb{E}_{W \sim P_\theta}[-\log_2 p(Y \mid X, W)] + \lambda n H(P)$, where $P_{ik}$ models the probability that $w_i = \omega_k$, enabling SGD optimization of the entropy-regularized variational loss (Wiedemann et al., 2018).
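The soft-assignment idea can be illustrated with a small NumPy sketch. It makes several assumptions that the text does not confirm: $P_{ik}$ is parameterized by a softmax over per-weight logits, and $H(P)$ is read as the entropy of the average assignment distribution. The function names are hypothetical, and a real training setup would use an autodiff framework to obtain gradients.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def soft_entropy_bits(logits):
    """Entropy (bits) of the mean assignment distribution; smooth in the logits."""
    P = softmax(logits)                       # P[i, k]: probability that w_i = omega_k
    mu = np.clip(P.mean(axis=0), 1e-12, 1.0)  # soft counterpart of the empirical pmf
    return float(-(mu * np.log2(mu)).sum())

def expected_weights(logits, alphabet):
    """Mean weight value under the soft assignment."""
    return softmax(logits) @ np.asarray(alphabet, dtype=float)

alphabet = [0.0, -0.5, 0.5, 1.0]              # hypothetical alphabet, omega_0 = 0
uniform_logits = np.zeros((6, 4))             # no preference among values
peaked_logits = np.zeros((6, 4))
peaked_logits[:, 0] = 20.0                    # strong preference for omega_0 = 0

h_uniform = soft_entropy_bits(uniform_logits)
h_peaked = soft_entropy_bits(peaked_logits)
w_peaked = expected_weights(peaked_logits, alphabet)
print(h_uniform, h_peaked, w_peaked)
```

As the logits concentrate on $\omega_0 = 0$, the soft entropy falls from 2 bits toward 0 and the expected weights approach zero, which is the behavior the entropy penalty encourages.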

3. Empirical Results and Compression-Accuracy Tradeoffs

Reported evaluations indicate that entropy-based pruning can reach high compression with small accuracy loss across diverse architectures and domains, although the size of the drop varies by model. Representative results include:

| Model / Domain | Accuracy drop | Sparsity / Compression | Source |
|---|---|---|---|
| VGG-16 (ImageNet) | ~1% Top-5 | 3.3x speedup, 16.6x parameter compression | (Luo et al., 2017) |
| LeNet-5 (MNIST) | -0.03% | 98.1% zeros, 235x compression | (Wiedemann et al., 2018) |
| Llama3.1-8B (LLM) | -5% average accuracy | 37.5% of blocks pruned, 20%+ speedup | (Yang et al., 4 Apr 2025) |
| LLaVA-1.5-7B (MLLM) | 4% relative | 68% of tokens pruned, 1.6x+ speedup | (Wang et al., 19 Feb 2026) |
| VideoMAE-B (ViT) | <1% WAR | 91.7% of blocks pruned | (Xiang et al., 3 Feb 2026) |
| ResNet50 (CIFAR-10) | 0.0% | 1.2x-1.3x speedup, SOTA FLOPs | (Musat et al., 2023) |

A consistent empirical phenomenon is that up to moderate sparsity (often 50–80%), entropy-driven criteria prune mostly redundant or low-utility parameters, with little or no degradation in test accuracy. At high sparsity, block/layer removal enables depth reduction (in overparameterized regimes) (Liao et al., 2023, Liao et al., 2024).

For channel, filter, head, and block pruning, importance measured by information-theoretic scores (activation entropy, weight-number entropy, matrix entropy, MI) has been reported to correlate strongly with oracle sensitivity, i.e., the accuracy drop observed when the unit is removed (Xiang et al., 3 Feb 2026, Li et al., 26 Nov 2025, Choi et al., 10 Oct 2025).

4. Extensions: Domains, Modalities, and Pruning Granularities

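Entropy-based pruning is not limited to classic image classification CNNs. The results in Section 3 already span block pruning in LLMs (Llama3.1-8B), visual token pruning in MLLMs (LLaVA-1.5-7B), and block pruning in video vision transformers (VideoMAE-B).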

5. Limitations, Practical Considerations, and Future Directions

Real-world deployment of entropy-based pruning involves addressing several practical and theoretical challenges:

  • Entropy Estimation Cost: Calculating spatial, mutual, matrix, or frequency-adjusted entropy in high dimensions is nontrivial. Techniques such as bucketization, AME, and spectral acceleration (dual Gram matrices) mitigate compute overhead (Musat et al., 2023, Wang et al., 19 Feb 2026).
  • Non-differentiability: Binary entropy scores for ReLU/GELU can be non-differentiable, necessitating alternating prune/retrain loops or surrogate relaxations for gradient-based training (Wiedemann et al., 2018, Liao et al., 2024).
  • No Universal Thresholds: The relationship between entropy and redundancy depends on task, architecture, and overparameterization regime; thresholds for pruning or layer removal are typically heuristic or dataset-specific (Liao et al., 2023, Liao et al., 2024).
  • Loss of Fine-Grained Information: Static entropy-based scores can become “stale” in dynamic or heavily iterative pruning (e.g., head pruning in transformers), motivating dynamic, gradient-informed alternatives (Guo et al., 4 Feb 2026).
  • Data-Free vs. Data-Dependent Criteria: While block-level weight-number entropy offers a data-free criterion, activation- or mutual-information based methods require data pass-through for accurate estimation; the choice impacts transfer and retraining scenarios (Xiang et al., 3 Feb 2026, Choi et al., 10 Oct 2025).
  • Hardness in Efficient Models: In compact or underparameterized regimes, achievable safe layer reduction via entropy approaches is limited—most effectiveness is observed in overparameterized networks (Liao et al., 2024).
  • Potential Extensions: The development of differentiable or train-time entropy surrogates, adaptive and per-sample entropy budgets, and integration with energy- or latency-aware objectives are active research frontiers (Liao et al., 2024, Westphal et al., 2024).
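The estimation-cost item above mentions spectral acceleration through dual Gram matrices. One way to read this, offered here as an assumption rather than the cited authors' method, is that $X^\top X$ and $X X^\top$ share their nonzero eigenvalues, so the matrix (Von Neumann) entropy can be computed from whichever of the two is smaller. The function name and the random test data are hypothetical.

```python
import numpy as np

def matrix_entropy(features, use_gram=None):
    """Von Neumann entropy (nats) of the trace-normalized covariance of token features.

    features: array of shape (n_tokens, dim), assumed not to be all identical rows.
    use_gram: if True use the (n_tokens x n_tokens) Gram matrix, if False the
    (dim x dim) scatter matrix; by default pick the smaller one.
    """
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)     # center so X^T X is a covariance up to scale
    n, d = X.shape
    if use_gram is None:
        use_gram = n < d
    M = X @ X.T if use_gram else X.T @ X      # both share the same nonzero spectrum
    eig = np.clip(np.linalg.eigvalsh(M), 0.0, None)
    lam = eig / eig.sum()                     # trace normalization: eigenvalues sum to 1
    lam = lam[lam > 1e-12]                    # drop numerically zero eigenvalues
    return float(-(lam * np.log(lam)).sum())

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 64))             # few tokens, many feature dimensions
h_gram = matrix_entropy(tokens, use_gram=True)    # 8 x 8 eigenproblem
h_cov = matrix_entropy(tokens, use_gram=False)    # 64 x 64 eigenproblem
print(h_gram, h_cov)
```

Both routes give the same entropy up to floating-point error, but the Gram route solves an 8 x 8 eigenproblem instead of a 64 x 64 one in this toy setting.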

6. Representative Algorithms and Pseudocode: Generic Framework

A canonical entropy-based pruning pipeline proceeds as follows (Wiedemann et al., 2018, Liao et al., 2023, Liao et al., 2024):

for epoch in range(num_epochs):
    for minibatch in data_loader:
        # Forward pass, possible noise injection for relaxation
        outputs = model(minibatch["input"])
        data_loss = compute_data_loss(outputs, minibatch["label"])
        # Estimate relevant entropy:
        #   - For weight entropy: use empirical pmf of discrete weights
        #   - For activation/channel/block entropy: compute histogram or covariance entropy over batch/feature
        entropy_loss = compute_entropy(model, minibatch)
        total_loss = data_loss + lambda_ * entropy_loss
        # Backward pass and update
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

In practice, the main implementation choice is how to adapt the compute_entropy function to the target granularity (weight, channel, activation, block, sample) and to the information-theoretic surrogate (Shannon entropy, AME, conditional entropy, mutual information, or matrix/Von Neumann entropy).

7. Impact and Interpretability

Entropy-based pruning methods bring information-theoretic interpretability to model compression, going beyond simple norm-based or heuristic structural pruning. By aligning with MDL and information bottleneck perspectives, they are reported to improve sparsity–accuracy–efficiency trade-offs and to give insight into where and how modern neural networks concentrate or dissipate information across architecture and data (Wiedemann et al., 2018, Li et al., 26 Nov 2025, Xiang et al., 3 Feb 2026).

Ongoing work, including train-time entropy regularization, integration with reinforcement learning (spatial entropy rewards), and mutual information preservation algorithms, continues to extend entropy-based pruning toward scalable, resource-efficient, and interpretable deep learning.
