ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations
The paper "ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations" presents a novel method for compressing LLMs by pruning redundant layers from transformer architectures without requiring retraining or fine-tuning. This method, ReplaceMe, substitutes selected transformer blocks with a single linear transformation, estimated using a small calibration dataset, thereby maintaining the model's original performance even under compression.
Overview of ReplaceMe
ReplaceMe is a training-free approach to depth pruning that approximates the pruned transformer blocks with a linear transformation. Unlike traditional pruning methods, which require retraining to recover performance, ReplaceMe achieves effective compression through three steps (sketched in code after this list):
- Layer Selection: Identifying a contiguous set of transformer blocks to prune, using a layer-importance metric based on the cosine distance between the activations entering and leaving each candidate set.
- Linear Transformation Estimation: Estimating a linear mapping that compensates for the contributions of the pruned layers using least squares (LS) or cosine distance optimization techniques.
- Integration: Merging the estimated transformation into existing model parameters without adding new ones, preserving the architecture and keeping computational overhead minimal.
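As a concrete illustration of these three steps, the sketch below scores candidate windows with cosine distance, estimates the replacement transform with ordinary least squares, and folds it into an existing linear layer. It is a minimal sketch under stated assumptions, not the authors' implementation: `hidden_states` is assumed to hold calibration activations collected per block (e.g., via `output_hidden_states=True` in Hugging Face Transformers), and the residual-stream bookkeeping and the exact merge target are simplified.

```python
# Minimal sketch of selection, LS estimation, and merging (assumptions noted above).
# `hidden_states[i]`: calibration activations entering block i, shape (tokens, d).
import torch


@torch.no_grad()
def select_prune_window(hidden_states: list[torch.Tensor], n_prune: int) -> int:
    """Score each contiguous window of `n_prune` blocks by the mean cosine
    distance between the activations entering and leaving it; return the start
    index of the most redundant (lowest-distance) window."""
    scores = []
    for start in range(len(hidden_states) - n_prune):
        x = hidden_states[start]                # activations entering the window
        y = hidden_states[start + n_prune]      # activations leaving the window
        cos = torch.nn.functional.cosine_similarity(x, y, dim=-1)
        scores.append((1.0 - cos).mean().item())
    return int(torch.tensor(scores).argmin())


@torch.no_grad()
def estimate_lt(x_in: torch.Tensor, y_out: torch.Tensor) -> torch.Tensor:
    """Least-squares estimate of T such that x_in @ T ≈ y_out, i.e. a single
    linear map standing in for the pruned window on the calibration set."""
    return torch.linalg.lstsq(x_in.float(), y_out.float()).solution  # (d, d)


@torch.no_grad()
def merge_lt(linear: torch.nn.Linear, T: torch.Tensor) -> None:
    """Fold T into an existing nn.Linear feeding the pruned window, so no new
    parameters are added: y = x @ W.T becomes y @ T = x @ (T.T @ W).T."""
    T = T.to(linear.weight.dtype)
    linear.weight.copy_(T.T @ linear.weight)
    if linear.bias is not None:
        linear.bias.copy_(linear.bias @ T)
```

With the calibration activations entering and leaving the chosen window as `x_in` and `y_out`, `estimate_lt` reduces to the closed-form solution T = (XᵀX)⁻¹XᵀY; the cosine-distance objective mentioned above would replace this closed form with a numerical minimization.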
Experimental Results and Comparative Analysis
The authors validate ReplaceMe across several LLM architectures, notably LLaMA and Falcon models, pruning up to 25% of the layers while retaining approximately 90% of the original model's accuracy on standard benchmarks, without any retraining. ReplaceMe also compares favorably with traditional pruning methods such as UIDL in both computational cost and environmental footprint, as the comparative metrics show:
- Compression Cost and Environmental Metrics: ReplaceMe requires less compression time, consumes less energy, and produces lower emissions than UIDL.
- Accuracy Retention: Even at high compression ratios, ReplaceMe maintains competitive scores across a range of tasks, demonstrating its effectiveness relative to methods that require retraining.
Implications and Future Work
The implications of ReplaceMe extend to practical deployment of LLMs in resource-constrained environments, offering a feasible way to reduce model size while maintaining accuracy. The approach can make advanced LLMs accessible to users with limited hardware, helping democratize the use of large AI models.
Future research directions may involve:
- Exploration of Structured LT Matrix Constraints: Investigating structured variants such as diagonal or orthonormal transformations to further improve post-pruning performance (an orthonormal variant is sketched after this list).
- Application to Other Transformer-Based Models: Extending ReplaceMe to encompass models beyond LLMs, including vision transformers, as preliminary experiments with CLIP models suggest promising results.
- Dynamic Layer Pruning: Developing strategies to adaptively prune layers during deployment based on real-time performance metrics, which could further optimize resource usage.
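For the first of these directions, an orthonormal constraint on the LT matrix admits a closed-form solution through the orthogonal Procrustes problem. The sketch below is purely illustrative and not from the paper; it reuses the same calibration activations `x_in` and `y_out` as in the earlier sketch.

```python
# Illustrative only: orthonormal variant of the replacement transform, solving
# min over Q with QᵀQ = I of ||x_in @ Q - y_out||_F (orthogonal Procrustes).
import torch


@torch.no_grad()
def estimate_orthonormal_lt(x_in: torch.Tensor, y_out: torch.Tensor) -> torch.Tensor:
    # SVD of the cross-covariance XᵀY gives the minimizer Q = U Vᵀ.
    u, _, vh = torch.linalg.svd(x_in.float().T @ y_out.float(), full_matrices=False)
    return u @ vh  # orthonormal (d, d) drop-in for the unconstrained T
```

Because an orthonormal T preserves activation norms, such a constraint might reduce distribution shift in the layers following the pruned window, though whether it helps in practice is one of the open questions left to future work.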
In conclusion, ReplaceMe is an efficient pruning method that narrows the gap between model compression and performance retention, with practical implications for deploying large models in resource-constrained settings. Its training-free nature is also a step toward more sustainable AI practice, reducing the compute, energy, and emissions that compression would otherwise require.