Entropy-Based Pruning in Neural Networks

Updated 30 March 2026

Entropy-based pruning is a model compression technique that uses information-theoretic measures like Shannon entropy to identify and remove redundant neural network parameters.
It employs methods such as activation entropy, spatial entropy, and weight-number entropy to guide structured and unstructured pruning across various architectures.
Empirical studies show that entropy-based pruning can achieve significant sparsity and compression—with minimal accuracy drop—across models ranging from CNNs to transformers.

Entropy-based pruning refers to a broad class of model compression techniques that leverage information-theoretic quantities, especially entropy, as objective functions or filter/node/connection selection criteria during the pruning of neural networks and other statistical models. By quantifying the information content, uncertainty, or redundancy of parameters, hidden representations, or dataset samples, entropy-based methods provide both theoretical rigor and empirical effectiveness across diverse architectures and data modalities.

1. Theoretical Foundations: Minimum Description Length and Entropy Quantification

The core theoretical underpinning of entropy-based pruning is the Minimum Description Length (MDL) principle. In the neural network context, learning is framed as a two-part code: transmitting the model parameters ( $W$ ) and then the prediction errors given $W$ . The total code length is decomposed as

$L(W) = L_{\text{data}}(W) + \alpha L_{\text{model}}(W)$

where $L_{\text{data}}(W) = -\log_2 p(Y|X, W)$ is the data (likelihood) loss and $L_{\text{model}}(W)$ is the information-theoretic complexity of the weights. The latter can be quantified in terms of entropy: for $n$ weights $W = \{w_1, ..., w_n\}$ taking values in a discrete alphabet $\Omega = \{\omega_0, ..., \omega_{K-1}\}$ , the empirical discrete probability mass function is $\hat\mu_k = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[w_i = \omega_k]$ , yielding a Shannon entropy

$H(\hat\mu) = -\sum_{k=0}^{K-1} \hat\mu_k \log_2 \hat\mu_k$

and hence $L_{\text{model}}(W) = n H(\hat\mu)$ bits to encode $W$ optimally (Wiedemann et al., 2018).

This leads to a regularized objective, $\mathcal{L}(W) = \mathcal{L}_{\text{data}}(W) + \lambda n H(\hat\mu)$, where $\lambda$ plays the role of $\alpha$ in the code-length decomposition above and controls the tradeoff between accuracy and compressed size (in bits).

Entropic penalization unifies pruning (increased mass on $\omega_0=0$ yields explicit sparsity) and quantization (concentration on a few discrete values), providing a principled way to induce structured or unstructured sparsity and low-precision parameterizations (Wiedemann et al., 2018).
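As a small numerical illustration of the quantities above, the following sketch computes the empirical pmf $\hat\mu$, the entropy $H(\hat\mu)$, and the code length $n H(\hat\mu)$ for two toy weight vectors. The function name `weight_entropy_bits`, the four-value alphabet, and the example weights are illustrative assumptions, not taken from the cited work; the sketch also assumes every weight already lies exactly on the alphabet.

```python
import numpy as np

def weight_entropy_bits(weights, alphabet):
    """Return (H, n * H): entropy in bits per weight and total code length in bits.

    Assumes every weight equals one of the alphabet values exactly.
    """
    weights = np.asarray(weights).ravel()
    alphabet = np.asarray(alphabet)
    # Empirical pmf: mu_k is the fraction of weights equal to omega_k.
    counts = np.array([(weights == omega).sum() for omega in alphabet])
    mu = counts / weights.size
    nonzero = mu[mu > 0]                      # convention: 0 * log 0 = 0
    entropy = float(-(nonzero * np.log2(nonzero)).sum())
    return entropy, weights.size * entropy

# Hypothetical alphabet with omega_0 = 0, as in the text.
alphabet = np.array([0.0, -0.5, 0.5, 1.0])
dense = np.array([0.5, -0.5, 1.0, 0.0, 0.5, -0.5, 1.0, 0.0])   # uniform over 4 values
sparse = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, -0.5])   # mass mostly on zero

h_dense, bits_dense = weight_entropy_bits(dense, alphabet)      # 2.0 bits per weight, 16.0 bits total
h_sparse, bits_sparse = weight_entropy_bits(sparse, alphabet)   # about 1.06 bits per weight
print(h_dense, bits_dense)
print(h_sparse, bits_sparse)
```

Moving mass onto $\omega_0 = 0$ lowers the entropy and therefore the code length, which is the sense in which the entropy penalty rewards sparsity.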

2. Methodologies: Entropy-Based Pruning Criteria and Algorithms

Multiple algorithmic instantiations of entropy-based pruning have been proposed:

Activation Entropy Pruning: Filters or channels whose activations over a dataset yield low Shannon entropy (i.e., low variability and information) are deemed uninformative and pruned accordingly. For a filter $j$ with activation histogram $\{p_i\}_{i=1}^m$ over $n$ samples, the entropy is $H_j = -\sum_{i=1}^m p_i \log p_i$ (Luo et al., 2017).
Spatial Entropy for CNNs: Local entropy, such as spatial aura entropy (AME), quantifies the disorder among neighboring activations for 2D feature maps, capturing more structural information than global activation entropy. This is critical in structured RL-based filter pruning (Musat et al., 2023, Musat et al., 2023).
Weight-Number Entropy: The entropy of the weight value distribution (over all parameters in a block/layer) reflects how dispersed the parameter values are. High-entropy blocks are considered more important; Gardener prunes the lowest-entropy blocks in vision transformers (Xiang et al., 3 Feb 2026).
Matrix Entropy (von Neumann Entropy) for Tokens: Pruning visual tokens in MLLMs is guided by the matrix (quantum/von Neumann) entropy of trace-normalized covariance matrices of token features; tokens with lower entropy are treated as less informative and are pruned (Wang et al., 19 Feb 2026).
Conditional Entropy and Mutual Information: Conditional entropy $H(\ell|Z)$ of the loss given filter activation, or mutual information $I(\mathcal{X}_\ell; \mathcal{X}_{\ell+1})$ between adjacent layer activations, directly quantifies task-relevant information, guiding pruning to preserve predictive capacity (Min et al., 2018, Westphal et al., 2024).
Entropy in Pruning Schedules: Adaptive allocation of pruning across layers uses normalized filter or layer entropy to determine where pruning is least likely to damage model capacity or informativeness (Chen et al., 2024, Lu et al., 2022, Liao et al., 2024, Liao et al., 2023).
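The activation entropy criterion from the list above can be sketched as follows. This is a minimal illustration, assuming each filter's response has already been reduced to one scalar per sample (for example by global average pooling); the function names, the bin count, and the pruning ratio are hypothetical choices rather than values from the cited papers.

```python
import numpy as np

def activation_entropy(acts, num_bins=16):
    """Histogram entropy (in nats) per filter.

    acts: array of shape (n_samples, n_filters), one scalar response
    per sample and filter (an assumption of this sketch).
    """
    acts = np.asarray(acts, dtype=float)
    entropies = np.empty(acts.shape[1])
    for j in range(acts.shape[1]):
        counts, _ = np.histogram(acts[:, j], bins=num_bins)
        p = counts / counts.sum()
        p = p[p > 0]                          # convention: 0 * log 0 = 0
        entropies[j] = -(p * np.log(p)).sum()
    return entropies

def filters_to_prune(acts, prune_ratio=0.25, num_bins=16):
    """Return indices of the lowest-entropy filters, plus all entropies."""
    entropies = activation_entropy(acts, num_bins)
    k = int(prune_ratio * entropies.size)
    return np.argsort(entropies)[:k], entropies

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 4))
acts[:, 2] = 1.0                              # a filter that always fires identically
pruned, H = filters_to_prune(acts, prune_ratio=0.25)
print(pruned, H)
```

The constant filter falls into a single histogram bin, has zero entropy, and is the one selected for removal. In practice the scores depend on the binning, so the bin count is a tuning choice rather than a fixed part of the criterion.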

Because the empirical pmf is not differentiable in $W$, continuous relaxations are used to allow end-to-end gradient-based minimization. A parametric discrete distribution $P_\theta$ softly assigns each weight to the possible discrete values, giving the objective $\mathbb{E}_{W \sim P_\theta}[-\log_2 p(Y|X,W)] + \lambda n H(P_\theta)$, where $P_{ik}$ models the probability that $w_i = \omega_k$. This enables SGD optimization of the entropy-regularized variational loss (Wiedemann et al., 2018).
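The entropy term of this relaxed objective can be illustrated in a few lines. The sketch below makes two assumptions that are not stated in the text: each row of probabilities $P_{ik}$ is parameterized as a softmax over logits, and the entropy is taken over the mean of the per-weight pmfs, as the soft counterpart of the empirical pmf $\hat\mu$. The data term, which needs sampling and automatic differentiation, is omitted, and all names are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def relaxed_entropy_term(logits):
    """logits: (n, K) array of parameters theta.

    Returns (P, n * H) where P[i, k] approximates P(w_i = omega_k) and
    H is the entropy (bits) of the mean pmf over all n weights.
    """
    P = softmax(logits)
    mean_pmf = P.mean(axis=0)                 # soft version of the empirical pmf
    nonzero = mean_pmf[mean_pmf > 0]
    entropy = -(nonzero * np.log2(nonzero)).sum()
    return P, logits.shape[0] * entropy

alphabet = np.array([0.0, -0.5, 0.5])         # hypothetical alphabet, omega_0 = 0
uniform_logits = np.zeros((6, 3))             # no preference among values
peaked_logits = np.tile(np.array([8.0, 0.0, 0.0]), (6, 1))   # strong preference for 0

P_u, bits_u = relaxed_entropy_term(uniform_logits)
P_p, bits_p = relaxed_entropy_term(peaked_logits)
expected_w = P_p @ alphabet                   # expected value of each weight under P
print(bits_u, bits_p, expected_w)
```

Because the term is a smooth function of the logits, it can be minimized with gradient descent in any autodiff framework; pushing the logits toward the zero value drives the penalty toward zero bits.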

3. Empirical Results and Compression-Accuracy Tradeoffs

Reported evaluations show that entropy-based pruning can reach high compression with small accuracy loss across diverse architectures and domains. Representative results include:

Model/Domain	Accuracy Drop	Sparsity/Compression	Source
VGG-16 (ImageNet)	~1% Top-5	3.3x speedup, 16.6x parameter compression	(Luo et al., 2017)
LeNet-5 (MNIST)	-0.03%	98.1% zeros, 235x compression	(Wiedemann et al., 2018)
Llama3.1-8B (LLM)	-5% average accuracy	37.5% of blocks pruned, 20%+ speedup	(Yang et al., 4 Apr 2025)
MLLM (LLaVA-1.5-7B)	4% relative	68% of tokens pruned, 1.6x+ speedup	(Wang et al., 19 Feb 2026)
VideoMAE-B (ViT)	<1% WAR	91.7% blocks pruned	(Xiang et al., 3 Feb 2026)
ResNet50 (CIFAR-10)	0.0%	1.2x-1.3x speedup, SOTA FLOPs	(Musat et al., 2023)

A consistent empirical phenomenon is that up to moderate sparsity (often 50–80%), entropy-driven criteria prune mostly redundant or low-utility parameters, with little or no degradation in test accuracy. At high sparsity, block/layer removal enables depth reduction (in overparameterized regimes) (Liao et al., 2023, Liao et al., 2024).

For channel, filter, head, and block pruning, importance as measured by information-theoretic scores (activation entropy, weight-number entropy, matrix entropy, MI) has been reported to correlate strongly with oracle sensitivity, i.e., the accuracy drop observed when the unit is removed (Xiang et al., 3 Feb 2026, Li et al., 26 Nov 2025, Choi et al., 10 Oct 2025).

4. Extensions: Domains, Modalities, and Pruning Granularities

Entropy-based pruning is not limited to classic image classification CNNs:

Transformers and LLMs: Entropy-based block pruning (EntroDrop, E-Sparse) and head pruning (HIES, AE-based pruning) are reported to outperform norm-only or gradient-only baselines when reducing transformer depth and width or imposing N:M sparsity (Yang et al., 4 Apr 2025, Li et al., 2023, Choi et al., 10 Oct 2025).
Generative and Diffusion Models: Entropy-guided block importance (Conditional Entropy Deviation) is used to prune transformer/UNet blocks in diffusion/flow models, preserving output diversity and fidelity (Li et al., 26 Nov 2025).
Dataset/Data Pruning: Sample-level informativeness measured by negative log-likelihood under a probe model enables entropy-driven corpus pruning that can reduce LM training cost (and even improve generalization) (Kim et al., 2024).
Self-Supervised and Multimodal Models: Weight-number entropy (Gardener) and matrix entropy (EntropyPrune) enable one-shot, data-free block and token pruning in masked autoencoders and MLLMs, matching or exceeding sensitivity-based and attention-weight methods at a small computational cost (Xiang et al., 3 Feb 2026, Wang et al., 19 Feb 2026).
Bayesian Networks/Graphical Models: In structure learning, conditional entropy bounds guarantee safe pruning in candidate parent sets, yielding significant search space and runtime reductions with no loss in optimality (Campos et al., 2017).
Fine-Grained Granularity: Node-wise, neuron-wise, or even pixel-wise entropy and mutual information have been used to drive unstructured as well as structured pruning strategies, with dynamic pruning schedules and entropy-weighted per-layer allocation (Westphal et al., 2024, Liao et al., 2023, Liao et al., 2024).
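Entropy-weighted per-layer allocation, mentioned in the list above, can be illustrated with a simple rule. The text does not specify an allocation formula, so the following is only one plausible sketch under stated assumptions: each layer's pruning ratio is made proportional to one minus its normalized entropy, then rescaled so that the size-weighted average ratio matches a global target. The function name, the `max_ratio` cap, and the example numbers are all hypothetical.

```python
import numpy as np

def allocate_pruning(layer_entropy, layer_sizes, target_ratio, max_ratio=0.9):
    """Assign higher pruning ratios to lower-entropy layers (illustrative rule).

    Note: if the max_ratio cap becomes active, the achieved global ratio
    falls below target_ratio; this sketch does not redistribute the excess.
    """
    h = np.asarray(layer_entropy, dtype=float)
    sizes = np.asarray(layer_sizes, dtype=float)
    h_norm = h / h.max()                      # normalized entropy in (0, 1]
    redundancy = 1.0 - h_norm + 1e-8          # low entropy -> treated as more redundant
    # Rescale so that sum(ratio * size) equals target_ratio * sum(size).
    scale = target_ratio * sizes.sum() / (redundancy * sizes).sum()
    return np.clip(scale * redundancy, 0.0, max_ratio)

layer_entropy = [2.0, 1.0, 0.5]               # hypothetical per-layer entropy scores
layer_sizes = [100, 100, 100]                 # hypothetical number of filters per layer
ratios = allocate_pruning(layer_entropy, layer_sizes, target_ratio=0.3)
achieved = (ratios * np.asarray(layer_sizes)).sum() / sum(layer_sizes)
print(ratios, achieved)
```

In this example the highest-entropy layer is left almost untouched while the lowest-entropy layer absorbs most of the budget. Published methods may use different normalizations or per-layer floors.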

5. Limitations, Practical Considerations, and Future Directions

Real-world deployment of entropy-based pruning involves addressing several practical and theoretical challenges:

Entropy Estimation Cost: Calculating spatial, mutual, matrix, or frequency-adjusted entropy in high dimensions is nontrivial. Techniques such as bucketization, AME, and spectral acceleration (dual Gram matrices) mitigate compute overhead (Musat et al., 2023, Wang et al., 19 Feb 2026).
Non-differentiability: Binary entropy scores for ReLU/GELU can be non-differentiable, necessitating alternating prune/retrain loops or surrogate relaxations for gradient-based training (Wiedemann et al., 2018, Liao et al., 2024).
No Universal Thresholds: The relationship between entropy and redundancy depends on task, architecture, and overparameterization regime; thresholds for pruning or layer removal are typically heuristic or dataset-specific (Liao et al., 2023, Liao et al., 2024).
Loss of Fine-Grained Information: Static entropy-based scores can become “stale” in dynamic or heavily iterative pruning (e.g., head pruning in transformers), motivating dynamic, gradient-informed alternatives (Guo et al., 4 Feb 2026).
Data-Free vs. Data-Dependent Criteria: While block-level weight-number entropy offers a data-free criterion, activation- or mutual-information based methods require data pass-through for accurate estimation; the choice impacts transfer and retraining scenarios (Xiang et al., 3 Feb 2026, Choi et al., 10 Oct 2025).
Hardness in Efficient Models: In compact or underparameterized regimes, achievable safe layer reduction via entropy approaches is limited—most effectiveness is observed in overparameterized networks (Liao et al., 2024).
Potential Extensions: The development of differentiable or train-time entropy surrogates, adaptive and per-sample entropy budgets, and integration with energy- or latency-aware objectives are active research frontiers (Liao et al., 2024, Westphal et al., 2024).
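The estimation-cost point above can be made concrete for matrix entropy. The sketch below computes the von Neumann entropy of a trace-normalized covariance matrix of token features and uses the fact that $X^\top X$ and $X X^\top$ share their nonzero eigenvalues, so the smaller of the two Gram matrices can be decomposed. Reading "dual Gram matrices" as this choice is an assumption, as are the feature centering, the function name, and the toy data; the sketch scores a whole token set and does not reproduce any specific token-selection rule.

```python
import numpy as np

def matrix_entropy(X, use_dual=None):
    """Von Neumann entropy (nats) of the trace-normalized covariance of X.

    X: (n_tokens, d) feature matrix. If use_dual is None, the smaller of
    the n x n and d x d Gram matrices is eigendecomposed.
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)     # assumption: features are centered
    n, d = X.shape
    if use_dual is None:
        use_dual = n < d
    # X X^T (n x n) and X^T X (d x d) have the same nonzero eigenvalues.
    gram = X @ X.T if use_dual else X.T @ X
    eig = np.clip(np.linalg.eigvalsh(gram), 0.0, None)
    total = eig.sum()
    if total <= 0.0:                          # all tokens identical
        return 0.0
    lam = eig / total                         # trace normalization
    lam = lam[lam > 1e-12]                    # convention: 0 * log 0 = 0
    return float(-(lam * np.log(lam)).sum())

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 64))             # 8 tokens, 64-dimensional features
h_small = matrix_entropy(tokens, use_dual=True)    # 8 x 8 eigenproblem
h_large = matrix_entropy(tokens, use_dual=False)   # 64 x 64 eigenproblem
redundant = np.outer(np.arange(1.0, 9.0), rng.normal(size=64))  # tokens are multiples of one vector
h_redundant = matrix_entropy(redundant)
print(h_small, h_large, h_redundant)
```

Both routes give the same entropy up to floating-point error, while the set of collinear tokens scores close to zero. With few tokens and wide features, the dual form reduces the eigendecomposition from the feature dimension to the token count.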

6. Representative Algorithms and Pseudocode: Generic Framework

A canonical entropy-based pruning pipeline proceeds as follows (Wiedemann et al., 2018, Liao et al., 2023, Liao et al., 2024):

def train_with_entropy_penalty(model, data_loader, optimizer,
                               compute_data_loss, compute_entropy,
                               lambda_, num_epochs):
    """Generic loop; assumes a PyTorch-style model, optimizer, and loss tensors."""
    for epoch in range(num_epochs):
        for minibatch in data_loader:
            # Forward pass (noise injection may be added here for relaxation)
            outputs = model(minibatch["input"])
            data_loss = compute_data_loss(outputs, minibatch["label"])
            # Estimate the relevant entropy with a differentiable surrogate:
            #   - weight entropy: empirical pmf of discrete weights
            #   - activation/channel/block entropy: histogram or covariance
            #     entropy over the batch/features
            entropy_loss = compute_entropy(model, minibatch)
            total_loss = data_loss + lambda_ * entropy_loss
            # Backward pass and parameter update
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()
    return model

The key implementation step is adapting the compute_entropy function to the target granularity (weight, channel, activation, block, or sample) and to the chosen information-theoretic surrogate: Shannon entropy, AME, conditional entropy, mutual information, or matrix (von Neumann) entropy.

7. Impact and Interpretability

Entropy-based pruning methods bring principled, information-theoretic interpretability to model compression—a step beyond simple norm-based or heuristic structural pruning. By aligning with MDL and information bottleneck perspectives, they not only yield improved sparsity–accuracy–efficiency trade-offs but also uncover deeper insight into where, and how, modern neural networks concentrate or dissipate information across architecture and data (Wiedemann et al., 2018, Li et al., 26 Nov 2025, Xiang et al., 3 Feb 2026).

Ongoing work, including train-time entropy regularization, integration with reinforcement learning (spatial entropy rewards), and mutual information preservation algorithms, continues to extend entropy-based pruning toward scalable, resource-efficient, and interpretable deep learning.