IteRABRe: Iterative Recovery-Aided Block Reduction (2503.06291v1)

Published 8 Mar 2025 in cs.CL

Abstract: LLMs have grown increasingly expensive to deploy, driving the need for effective model compression techniques. While block pruning offers a straightforward approach to reducing model size, existing methods often struggle to maintain performance or require substantial computational resources for recovery. We present IteRABRe, a simple yet effective iterative pruning method that achieves superior compression results while requiring minimal computational resources. Using only 2.5M tokens for recovery, our method outperforms baseline approaches by ~3% on average when compressing the Llama3.1-8B and Qwen2.5-7B models. IteRABRe demonstrates particular strength in the preservation of linguistic capabilities, showing an improvement of 5% over the baselines in language-related tasks. Our analysis reveals distinct pruning characteristics between these models, while also demonstrating preservation of multilingual capabilities.

Authors (6)
  1. Haryo Akbarianto Wibowo (9 papers)
  2. Haiyue Song (18 papers)
  3. Hideki Tanaka (6 papers)
  4. Masao Utiyama (39 papers)
  5. Alham Fikri Aji (94 papers)
  6. Raj Dabre (65 papers)

Summary

Overview of IteRABRe: Iterative Recovery-Aided Block Reduction for LLM Compression

The paper "IteRABRe: Iterative Recovery-Aided Block Reduction" addresses the significant challenge of deploying LLMs, which are computationally expensive due to their substantial size. To tackle this issue, the authors propose a novel model compression technique called IteRABRe, designed to reduce the size of LLMs effectively while maintaining performance, using minimal computational resources. This methodology leverages iterative pruning with recovery phases to achieve superior compression results compared to existing methods.

Methodology

IteRABRe focuses on block pruning, a technique informed by the layer redundancy inherent in LLMs. This approach involves identifying and removing less significant blocks within a model to decrease its size. The process is structured to alternate between pruning and recovery phases until a target model size is reached.
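
To make the structure of this loop concrete, the sketch below shows one way the alternating prune/recover procedure could look in PyTorch on a toy stack of blocks. The function names (score_block_importance, iterative_block_reduction, recover_fn) and the toy dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def score_block_importance(blocks, hidden, idx):
    """Importance of block `idx`: how much the final hidden states change
    (1 - cosine similarity) when that block is skipped."""
    with torch.no_grad():
        h_full, h_skip = hidden, hidden
        for i, blk in enumerate(blocks):
            h_full = blk(h_full)
            if i != idx:
                h_skip = blk(h_skip)
        sim = F.cosine_similarity(h_full, h_skip, dim=-1).mean()
    return 1.0 - sim.item()

def iterative_block_reduction(blocks, calib_hidden, target_depth, recover_fn=None):
    """Alternate pruning and recovery until the target depth is reached."""
    blocks = nn.ModuleList(list(blocks))
    while len(blocks) > target_depth:
        # Pruning phase: drop the block whose removal changes the hidden states least.
        scores = [score_block_importance(blocks, calib_hidden, i)
                  for i in range(len(blocks))]
        drop = min(range(len(blocks)), key=scores.__getitem__)
        blocks = nn.ModuleList([b for i, b in enumerate(blocks) if i != drop])
        # Recovery phase: briefly fine-tune the pruned stack before the next round
        # (e.g. by distillation; see the sketch after the list below).
        if recover_fn is not None:
            recover_fn(blocks)
    return blocks

# Toy usage: 8 tiny stand-in "blocks", keep 6 of them.
toy_blocks = nn.ModuleList([nn.Sequential(nn.Linear(16, 16), nn.GELU())
                            for _ in range(8)])
calib_hidden = torch.randn(2, 5, 16)   # (batch, seq_len, hidden)
pruned = iterative_block_reduction(toy_blocks, calib_hidden, target_depth=6)
print(len(pruned))                     # -> 6
```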

  • Pruning Phase: Blocks are ranked by their contribution to the model's output quality, quantified by the cosine similarity between the hidden states of the original model and those of the model with a given block removed. The block whose removal perturbs the hidden states least (i.e., the lowest-importance block) is pruned, minimizing performance loss.
  • Recovery Phase: To mitigate the damage from pruning, the method applies knowledge distillation, fine-tuning the pruned model against the original so that it adapts to its reduced structure (a sketch of this step follows the list). Using only 2.5M tokens for recovery, IteRABRe achieves an efficient yet effective recovery that redistributes the retained knowledge across the remaining layers.
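
The recovery phase can be pictured as a short distillation loop in which the pruned student is trained to match the unpruned teacher's output distribution on a small recovery corpus. The sketch below is a minimal approximation under that assumption; the loss (temperature-scaled KL on logits), optimizer, and hyperparameters are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_recovery(student: nn.Module, teacher: nn.Module,
                     batches, lr: float = 1e-4, temperature: float = 2.0):
    """Run a short knowledge-distillation pass so the pruned model adapts
    to its reduced depth."""
    teacher.eval()
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    for x in batches:                              # e.g. ~2.5M tokens in total
        with torch.no_grad():
            t_logits = teacher(x) / temperature    # soft targets from the teacher
        s_logits = student(x) / temperature
        loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                        F.softmax(t_logits, dim=-1),
                        reduction="batchmean") * temperature ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()

# Toy usage with stand-in "models" (linear maps over a 16-dim hidden space).
teacher = nn.Linear(16, 32)                        # unpruned reference model
student = nn.Linear(16, 32)                        # pruned model being recovered
batches = [torch.randn(4, 16) for _ in range(3)]   # tiny recovery "corpus"
distill_recovery(student, teacher, batches)
```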

Experimental Results

The experiments demonstrate that IteRABRe consistently outperforms baseline approaches such as LaCO and ShortGPT when compressing Llama3.1-8B and Qwen2.5-7B. The results show an average performance improvement of approximately 3% over baselines on general tasks and 5% on language-related tasks, highlighting its effectiveness in preserving linguistic capabilities. Furthermore, even though the recovery data is English-only, the pruned models retain performance on non-English languages, indicating zero-shot preservation of multilingual capabilities.

Insights and Implications

IteRABRe's results suggest promising directions for reducing deployment costs associated with LLMs. By maintaining model performance through efficient compression techniques, it provides practical benefits to organizations requiring large-scale language processing capabilities but facing resource constraints. Additionally, the observed capability of zero-shot cross-lingual transfer suggests theoretical insights into the inherent multilingual capacities of LLMs that can be preserved through intelligent recovery strategies.

Future Developments

The iterative and minimal-resource requirements of IteRABRe open avenues for enhancing AI model deployment efficiency across various applications. Future work could explore extending this methodology to other model architectures and tasks, incorporating diverse datasets to optimize performance further. Moreover, integrating advanced techniques into the recovery phase could potentially enhance knowledge retention and transfer across different linguistic contexts, thereby expanding the applicability of compressed models.

In summary, IteRABRe offers a significant contribution to the domain of LLM compression, balancing size reduction with strong performance preservation and providing a foundation for addressing the practical and theoretical challenges of deploying large-scale LLMs efficiently.