
Entropy-Based Block Pruning for Efficient Large Language Models (2504.03794v1)

Published 4 Apr 2025 in cs.CL and cs.AI

Abstract: As LLMs continue to scale, their growing computational and storage demands pose significant challenges for real-world deployment. In this work, we investigate redundancy within Transformer-based models and propose an entropy-based pruning strategy to enhance efficiency while maintaining performance. Empirical analysis reveals that the entropy of hidden representations decreases in the early blocks but progressively increases across most subsequent blocks. This trend suggests that entropy serves as a more effective measure of information richness within computation blocks. Unlike cosine similarity, which primarily captures geometric relationships, entropy directly quantifies uncertainty and information content, making it a more reliable criterion for pruning. Extensive experiments demonstrate that our entropy-based pruning approach surpasses cosine similarity-based methods in reducing model size while preserving accuracy, offering a promising direction for efficient model deployment.

Summary

  • The paper introduces EntroDrop, an entropy-based pruning strategy that leverages entropy metrics to identify and remove less informative Transformer blocks.
  • It details a two-stage framework that preserves early layers while pruning blocks in later layers with minimal entropy gain.
  • Empirical evaluations using Bucket and KNN estimations reveal significant inference speedups and maintained accuracy across diverse datasets.

Entropy-Based Block Pruning for Efficient LLMs

Introduction

The paper "Entropy-Based Block Pruning for Efficient LLMs" focuses on optimizing the deployment of LLMs by reducing their computational and storage burden through an innovative pruning strategy. The proposed method, EntroDrop, introduces entropy as a metric to identify and prune redundant blocks in Transformer architectures, surpassing traditional methods that utilize cosine similarity. Empirical analyses reveal that entropy in hidden representations decreases initially but increases across later Transformer blocks, underscoring entropy's potential as an indicator of information richness.

Entropy Dynamics in LLMs

The analysis begins by examining the entropy dynamics of Transformer models, such as Llama3.1-8B and Mistral-7B-v0.3, during inference. The findings indicate an entropy decrease in the initial layers, followed by an increase as the process continues. This biphasic behavior suggests that while early layers prioritize compressing and structuring input representations, later layers focus on expanding contextual understanding.

Figure 1: Entropy Dynamics among Layers during Inference.

These observations inform the strategic pruning of layers, leveraging entropy as a more telling criterion than geometric alignment metrics like cosine similarity.
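The layer-wise entropy trend above can be probed with a simple histogram ("bucket") estimate over a layer's hidden activations. The sketch below is illustrative rather than the paper's implementation; the bucket count and clipping range are arbitrary choices.

```python
import numpy as np

def bucket_entropy(values, n_buckets=64, value_range=(-4.0, 4.0)):
    """Histogram ("bucket") estimate of Shannon entropy in nats.

    `values` is a flat array of hidden activations; the bucket count
    and clipping range are illustrative hyperparameters.
    """
    counts, _ = np.histogram(values, bins=n_buckets, range=value_range)
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # drop empty buckets to avoid log(0)
    return float(-np.sum(probs * np.log(probs)))

# Toy check: a tightly concentrated distribution (akin to the early
# compression phase) scores lower entropy than a spread-out one.
rng = np.random.default_rng(0)
narrow = rng.normal(0.0, 0.1, size=10_000)
wide = rng.normal(0.0, 1.0, size=10_000)
```

In an actual measurement, `values` would be the flattened hidden states collected at each block on calibration data, yielding one entropy value per layer.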

EntroDrop Framework

EntroDrop adopts a two-stage pruning framework. During the initial entropy-decrease stage, no blocks are pruned. The focus shifts to the entropy-increase stage, where blocks with minimal entropy gain are pruned.

Figure 2: Overview of the EntroDrop framework. Stage 1 is kept intact, while Stage 2 exhibits increasing entropy. Blocks in Stage 2 are ranked by their entropy increase, and those with the lowest increase are pruned.

This entropy-based approach enables more strategic pruning by targeting blocks that contribute less to the overall information capacity of the model, thus preserving core functionalities while reducing resource requirements.
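Under this two-stage scheme, block selection reduces to ranking Stage 2 blocks by their per-block entropy gain. A minimal sketch, assuming one calibration-set entropy estimate per block (function and argument names are ours, not the paper's API):

```python
def select_blocks_to_prune(block_entropies, stage1_end, n_prune):
    """Return indices of the n_prune Stage 2 blocks with the smallest
    entropy gain over their predecessor. Blocks [0, stage1_end) form
    Stage 1 (the entropy-decrease phase) and are never pruned."""
    gains = {
        i: block_entropies[i] - block_entropies[i - 1]
        for i in range(max(stage1_end, 1), len(block_entropies))
    }
    # Rank Stage 2 blocks by entropy gain; prune those with the lowest.
    return sorted(sorted(gains, key=gains.get)[:n_prune])

# Toy profile: entropy falls over blocks 0-2, then rises unevenly.
profile = [3.0, 2.5, 2.2, 2.3, 2.9, 3.0, 3.4]
pruned = select_blocks_to_prune(profile, stage1_end=3, n_prune=2)
# Blocks 3 and 5 add the least entropy, so they are selected: [3, 5]
```

Keeping Stage 1 out of the candidate set encodes the observation that the early, entropy-decreasing blocks carry the compression work the rest of the model depends on.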

Experimental Evaluation

Comprehensive experiments demonstrate EntroDrop's ability to maintain accuracy while achieving significant reductions in model size. The studies involve multiple entropy estimation methods: Bucket, KNN, and Renyi. Bucket-based and KNN estimations are especially effective in maintaining model performance, unlike Renyi, which introduces performance degradation.

Figure 3: Impact of Entropy Estimation Methods.

The experiments show a consistent pattern of redundancy, particularly in attention layers, validating the potential of structured pruning methodologies.
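For the KNN estimate, a standard choice is the Kozachenko-Leonenko k-nearest-neighbor estimator; the paper's exact variant is not spelled out here, so the sketch below uses the textbook form with a brute-force distance computation.

```python
import math
import numpy as np

def _digamma(x):
    """Digamma via recurrence plus an asymptotic series (stdlib-only)."""
    r = 0.0
    while x < 6:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f / 252))

def knn_entropy(x, k=3):
    """Kozachenko-Leonenko k-NN entropy estimate (nats) for samples
    x of shape (n, d). Brute-force O(n^2) distances; fine for a sketch."""
    n, d = x.shape
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # exclude self-distances
    eps = np.sort(dists, axis=1)[:, k - 1]   # k-th nearest neighbor
    # log-volume of the unit d-ball: pi^{d/2} / Gamma(d/2 + 1)
    log_c_d = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    return _digamma(n) - _digamma(k) + log_c_d + d * float(np.mean(np.log(eps)))
```

As a sanity check, a standard 1-D Gaussian has true differential entropy 0.5·log(2πe) ≈ 1.42 nats, which this estimator recovers closely at moderate sample sizes.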

Considerations and Results

Robustness across Calibration Datasets: EntroDrop exhibits consistent performance across various calibration datasets, indicating its flexible applicability in different contexts.

Figure 4: Heatmap of Calibration Datasets.

Figure 5: Impact of Calibration Datasets.

Speedup Metrics: The pruning significantly boosts inference speed, with the initial stages of pruning contributing the most drastic reductions in computational time.

Figure 6: Speedup Experiments.

These results suggest that EntroDrop not only aids in resource-efficient deployment but also ensures that performance remains largely unaffected by calibration datasets.

Conclusion

This paper underscores the efficacy of EntroDrop, emphasizing its ability to leverage entropy to prune redundant blocks within LLMs effectively. Unlike traditional metrics, entropy provides a nuanced understanding of information distribution across layers, offering a reliable basis for pruning decisions. The method demonstrates substantial improvements in computational efficiency without compromising accuracy, highlighting its practicality for real-world deployment of LLMs. Future explorations might focus on refining entropy estimation techniques and extending the approach to domain-specific settings.
