EntroDrop: Entropy-Based Pruning for Transformers
- EntroDrop is a framework that uses entropy to quantify the information richness of Transformer blocks and guide efficient pruning.
- It leverages statistical uncertainty rather than geometric similarity to identify and remove redundant components, reducing computational cost.
- Empirical evaluations on models like Llama3.1-8B show near-linear inference speed improvements with minimal accuracy loss after pruning.
The EntroDrop Framework is a methodology for efficient block-level pruning in large Transformer-based language models, using entropy as a quantitative measure of "information richness" within computational blocks. Unlike conventional criteria based on geometric similarity (e.g., cosine similarity), EntroDrop leverages statistical uncertainty to identify and remove redundant components, thereby reducing model size and computational demands while maintaining predictive performance.
1. Entropy as a Measure of Information Content
EntroDrop evaluates the hidden states computed by each Transformer block through entropy estimation. Let $H_\ell$ denote the output of block $\ell$, computed as $H_\ell = f_\ell(H_{\ell-1})$, where $f_\ell$ is the block's transformation. The entropy $S(H_\ell)$ quantifies the uncertainty over the block's activation values, providing a direct indicator of the diversity and informativeness of the representations produced at each stage.
Several entropy estimation techniques are considered, including discrete bucket-based (histogram) methods, K-nearest neighbors (KNN), and Rényi entropy. Each method models a probability distribution $p_i$ over the activation values (where $i$ indexes a bin or neighborhood within the activation space), allowing the computation of

$$S(H_\ell) = -\sum_i p_i \log p_i.$$
In practice, approximate entropy is computed efficiently by mapping activations to bins or estimating local densities, making the methodology scalable across large models and datasets.
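The bucket-based estimator described above can be sketched in a few lines of NumPy. The function names and the default bin count of 256 are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def histogram_entropy(activations, num_bins=256):
    """Approximate Shannon entropy of a block's hidden states.

    Activations are flattened and bucketed into `num_bins` equal-width
    bins; the normalized bin counts serve as the probability estimate
    p_i, and entropy is S = -sum_i p_i * log(p_i).
    """
    values = np.asarray(activations, dtype=np.float64).ravel()
    counts, _ = np.histogram(values, bins=num_bins)
    p = counts / counts.sum()
    p = p[p > 0]  # drop empty bins so log is defined
    return float(-np.sum(p * np.log(p)))

def renyi_entropy(activations, alpha=2.0, num_bins=256):
    """Rényi entropy of order `alpha` (alpha != 1) over the same histogram."""
    values = np.asarray(activations, dtype=np.float64).ravel()
    counts, _ = np.histogram(values, bins=num_bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))
```

The histogram estimator costs a single pass over the activations, which is what makes per-block entropy measurement tractable at the scale of billion-parameter models; KNN-based estimators trade this speed for better behavior in high dimensions.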
Unlike cosine similarity, which characterizes geometric alignment but not content uncertainty, entropy directly reflects the statistical diversity and information flow within network layers, providing a more robust criterion for pruning.
2. Block Pruning Strategy via EntroDrop
The entropy-based pruning strategy is formulated as a two-stage process, informed by empirical observation of entropy trends throughout model depth:
- Stage 1 (Early Layers): Entropy decreases across the first blocks, as these layers compress and refine input features. These layers are deemed essential and retained.
- Stage 2 (Later Layers): Entropy progressively increases, indicating that subsequent blocks add features with similar uncertainty, often redundantly.
For each block $\ell$, the change in entropy is computed as

$$\Delta S_\ell = S(H_\ell) - S(H_{\ell-1}).$$

After identifying the transition point $\ell^*$ (the boundary between the two stages), only blocks $\ell > \ell^*$ with $\Delta S_\ell > 0$ are considered for pruning. These candidates' entropy-change values are ranked in ascending order, and the $k$ blocks with the smallest $\Delta S_\ell$ are selected for removal, with $k$ set by the desired pruning ratio.
This approach targets layers that contribute least to new feature enrichment while protecting the functionality of foundational layers.
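The two-stage selection above can be sketched as follows. Taking the entropy minimum as the transition point $\ell^*$ and using 0-based indices into the entropy profile are assumptions made for illustration:

```python
import numpy as np

def select_blocks_to_prune(entropies, num_prune):
    """Two-stage EntroDrop-style selection sketch.

    `entropies` is the profile [S(H_0), ..., S(H_{L-1})] measured on a
    calibration set. The transition point is taken as the block with
    minimum entropy (end of the compression stage); only later blocks
    with a positive entropy change are pruning candidates, and the
    `num_prune` candidates with the smallest increase are removed.
    Returns 0-based indices into `entropies`.
    """
    s = np.asarray(entropies, dtype=np.float64)
    delta = np.diff(s)             # delta[l-1] = S(H_l) - S(H_{l-1})
    t = int(np.argmin(s))          # transition point: entropy minimum
    # Candidate blocks: strictly after the transition, with rising entropy.
    candidates = [l for l in range(t + 1, len(s)) if delta[l - 1] > 0]
    # Rank ascending by entropy increase and keep the smallest contributors.
    candidates.sort(key=lambda l: delta[l - 1])
    return sorted(candidates[:num_prune])
```

On a synthetic profile that first falls and then rises, e.g. `[5.0, 4.0, 3.0, 2.0, 2.5, 3.5, 3.6, 4.8]`, the function skips the compressing early blocks entirely and removes the later blocks whose entropy increase is smallest.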
3. Empirical Evidence and Entropy Dynamics
Extensive experiments on models such as Llama3.1-8B and Mistral-7B-v0.3, across datasets including C4, Law, Medicine, and Wikitext2, validate the empirical trend: early blocks compress information (decreasing entropy), while subsequent blocks introduce higher entropy at a nearly constant rate.
This pattern suggests that later blocks are less critical, as they often represent redundant expansion rather than unique enrichment of hidden states. Such redundancy is confirmed on both general and domain-specific datasets, consistently supporting the rationale for pruning.
Empirical analysis corroborates that pruning blocks based on minimal entropy increases effectively reduces redundancy with negligible impact on task accuracy.
4. Performance Metrics and Comparative Benchmarks
The effectiveness of EntroDrop is evaluated using a suite of benchmarks: PIQA, HellaSwag, WSC273, CSQA, Winogrande, ARC, OBQA, MMLU, CMMLU, and RACE. Performance metrics include:
- Task accuracy (per dataset and aggregate)
- Model size reduction (number of blocks/pruned parameters)
- Inference speed improvements
Compared with cosine similarity–based pruning (LLMDrop) and layer-pruning methods (LaCo, ShortGPT), EntroDrop demonstrates superior accuracy retention under aggressive pruning. Specifically, when later-stage blocks are pruned incrementally (up to 12 pruned layers), inference time decreases nearly linearly with the number of pruned blocks while model accuracy remains high.
These results underscore the ability of entropy-based selection to optimize both computational efficiency and predictive validity beyond geometric metrics.
| Method | Pruning Criterion | Accuracy Retention | Speedup Trend |
|---|---|---|---|
| EntroDrop | Entropy Increase | High | Nearly linear (per block) |
| LLMDrop | Cosine Similarity | Moderate | Nonlinear |
| LaCo, ShortGPT | Layer Pruning | Variable | Model-specific |
5. Implications for Model Deployment
By pruning blocks with minimal entropy increase, EntroDrop-supported models exhibit:
- Reduced parameter and computational size
- Lower latency during inference
- Enhanced suitability for deployment on edge/mobile devices and in large-scale cloud environments
Calibration datasets, both general and domain-specific, yield consistently robust entropy estimates, indicating that EntroDrop maintains generalizability and flexibility across model architectures and applications.
This suggests that entropy-based pruning is widely applicable for practical large model deployment, particularly when resource constraints or low-latency response are critical.
6. Technical Significance and Future Considerations
The use of entropy as a block-level pruning criterion represents a shift toward statistical, information-theoretic assessment of network layer utility, aligning model compactness with preserved information flow. Unlike static signature-based or purely geometric methods, this framework foregrounds probabilistic measures of uncertainty, better capturing the relevance of neural computations.
A plausible implication is that future developments may explore dynamic, task-conditioned entropy metrics or joint entropy–diversity criteria, further optimizing pruning decisions as models and datasets evolve.
The EntroDrop Framework establishes a principled, empirically validated approach to block pruning that aligns model efficiency with high performance, facilitating scalable and deployable LLMs in diverse computational environments.