
Streamlining Redundant Layers to Compress Large Language Models (2403.19135v4)

Published 28 Mar 2024 in cs.CL and cs.AI

Abstract: This paper introduces LLM-Streamline, a pioneering work on layer pruning for LLMs. It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less important layers to be pruned. LLM-Streamline comprises two parts: layer pruning, which removes consecutive layers with the lowest importance based on a target sparsity, and layer replacement, a novel module that trains a lightweight network to replace the pruned layers and mitigate performance loss. Additionally, a new metric called stability is proposed to address the limitations of the widely used accuracy metric in evaluating model compression. Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency.
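The layer-pruning step described in the abstract can be illustrated with a minimal sketch. The sketch below assumes that a layer's importance is measured by how much it changes the hidden state (here, one minus the cosine similarity between a layer's input and output hidden states, averaged over calibration data), and that the least important contiguous block of layers is selected for removal. All names (hidden_in, hidden_out, target count) are illustrative assumptions, not the paper's actual API.

```python
# Hedged sketch of LLM-Streamline-style layer pruning (assumed scoring rule).
import torch
import torch.nn.functional as F

def layer_importance(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Score a layer by how much it changes the hidden state.

    hidden_in / hidden_out: (batch, seq_len, dim) activations collected on
    calibration data. A layer whose output is nearly identical to its input
    (high cosine similarity) is considered less important.
    """
    cos = F.cosine_similarity(hidden_in.flatten(1), hidden_out.flatten(1), dim=-1)
    return 1.0 - cos.mean().item()

def choose_layers_to_prune(importances: list[float], num_prune: int) -> range:
    """Pick the contiguous block of `num_prune` layers with the lowest total importance."""
    best_start, best_score = 0, float("inf")
    for start in range(len(importances) - num_prune + 1):
        score = sum(importances[start:start + num_prune])
        if score < best_score:
            best_start, best_score = start, score
    return range(best_start, best_start + num_prune)
```

In the layer-replacement step, the removed block would then be substituted by a lightweight network (for example, a small MLP) trained to map the block's input hidden states to its original output hidden states, which is what the sketch's calibration activations would be reused for.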
