- The paper introduces SLEB, which prunes LLMs by eliminating redundant transformer blocks to speed up inference without retraining.
- It verifies block-level redundancy via the cosine similarity between transformer block outputs and removes blocks using a metric chosen to minimize the impact on perplexity, preserving language quality.
- Experiments show significant inference speedups and robust performance, and the method is compatible with quantization techniques for further optimization.
The paper presents SLEB, a novel methodology for pruning LLMs by removing redundant transformer blocks. This approach aims to enhance the inference speed of LLMs while maintaining their linguistic capabilities.
Introduction to SLEB
The exponential growth in LLM parameters poses challenges in deploying these models for real-world applications due to substantial memory and computational demands. Pruning, a common technique to reduce model size, often struggles to accelerate end-to-end inference due to the complexities of managing sparsity, especially on GPU architectures optimized for dense matrices.
SLEB streamlines LLMs by eliminating entire transformer blocks, exploiting the block-level redundancy these models often exhibit. This speeds up LLM inference without compromising linguistic capability (Figure 1).
Figure 1: Typical LLM architecture.
Motivations and Challenges
Pruning Techniques
Existing pruning techniques fall into two categories: unstructured and structured. Unstructured pruning requires extremely high sparsity to yield real speedups, while structured pruning contends with hardware inefficiencies, since the achievable speed gains depend on batch size and matrix dimensions (Figure 2).
Figure 2: The speedup achieved through 2:4 pruning on matrix multiplication.
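To make the 2:4 pattern in Figure 2 concrete, the minimal PyTorch sketch below (not from the paper) zeroes the two smallest-magnitude weights in every group of four, producing the semi-structured sparsity that dedicated tensor hardware can exploit; the function name and tensor shapes are illustrative.

```python
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in every group of 4 along the last dim."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity expects the reduction dim to be a multiple of 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    _, drop_idx = groups.abs().topk(2, dim=-1, largest=False)  # two smallest per group
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)
    return (groups * mask).reshape(out_features, in_features)

w = torch.randn(8, 16)
w_24 = apply_2_4_sparsity(w)
print((w_24 == 0).float().mean().item())  # ~0.5: half the weights are zero, in a 2:4 pattern
```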
Challenges with Early Exit
Earlier approaches such as early exit suffer from limitations including variable memory requirements, resource-intensive training, and inefficiency in multi-batch settings. Because consistent model performance can only be maintained through extensive dynamic skipping decisions, these methods pose significant deployment challenges (Figure 3).
Figure 3: Perplexity comparison on WikiText-2 for LLMs after removing consecutive transformer blocks.
Proposed Approach: SLEB
Redundancy Verification
SLEB assesses redundancy by measuring the cosine similarity between the outputs of transformer blocks, which reveals a high degree of redundancy among neighboring blocks. Unlike early exit methods, SLEB makes static pruning decisions rather than relying on heuristic-based dynamic skipping, which better preserves model integrity (Figure 4).
Figure 4: Cosine similarity between the outputs of two transformer blocks.
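The sketch below shows one way such a redundancy check could be run with Hugging Face Transformers; it is not the authors' code, and the model name and calibration sentence are placeholders. It collects the hidden states after every transformer block and prints the average token-wise cosine similarity between neighboring block outputs; values near 1 suggest block-level redundancy.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder; the paper targets larger OPT and LLaMA models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Large language models are increasingly deployed in real-world applications."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # hidden_states holds the embedding output plus the output of every transformer block.
    hidden = model(**inputs, output_hidden_states=True).hidden_states

for i in range(1, len(hidden) - 1):
    # Token-wise cosine similarity between the outputs of neighboring blocks, averaged over tokens.
    sim = torch.nn.functional.cosine_similarity(hidden[i], hidden[i + 1], dim=-1).mean()
    print(f"block {i} -> block {i + 1}: {sim.item():.4f}")
```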
Implementation Details
SLEB removes redundant blocks iteratively, using a metric designed to minimize the impact on model perplexity. The metric is re-evaluated after each removal, so it accounts for the cumulative effect of previously pruned blocks, and no additional training is needed to maintain linguistic capability. Repeating this identify-and-prune step yields significant end-to-end speed improvements.
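A simplified greedy sketch of this iterative procedure is shown below. It is not the paper's implementation: plain calibration perplexity stands in for the paper's selection metric, `calib_perplexity(model, calib_data)` is a hypothetical helper, and the `model.model.decoder.layers` access assumes an OPT-style Hugging Face model.

```python
import torch

def prune_blocks_greedy(model, calib_data, num_blocks_to_remove, calib_perplexity):
    """Greedily remove the transformer block whose deletion hurts calibration perplexity least."""
    layers = model.model.decoder.layers  # OPT-style layout; LLaMA models use model.model.layers
    removed = []
    for _ in range(num_blocks_to_remove):
        best_idx, best_ppl = None, float("inf")
        for idx in range(len(layers)):
            # Temporarily drop block `idx` and score the pruned model on calibration data.
            model.model.decoder.layers = torch.nn.ModuleList(
                [layer for i, layer in enumerate(layers) if i != idx]
            )
            ppl = calib_perplexity(model, calib_data)  # hypothetical helper
            if ppl < best_ppl:
                best_idx, best_ppl = idx, ppl
        # Permanently remove the best candidate and continue from the already-pruned model,
        # so each decision reflects the cumulative effect of earlier removals.
        layers = torch.nn.ModuleList([layer for i, layer in enumerate(layers) if i != best_idx])
        model.model.decoder.layers = layers
        removed.append(best_idx)  # index relative to the layer list at this iteration
    return model, removed
```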
Experimental Results
Language Modeling
SLEB's pruning achieved considerable performance retention across tested models, maintaining perplexity scores competitive with non-pruned baselines (Table 1). This indicates robustness at the transformer block level, effectively preserving language capabilities.
| Model | Sparsity | Perplexity (C4) |
|---|---|---|
| OPT-6.7B | 10% | 13.84 |
| OPT-13B | 20% | 12.54 |
| LLaMA-70B | 20% | 7.31 |
Table 1: Perplexity results on C4 dataset for models pruned with SLEB.
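For context, perplexity figures like those in Table 1 are typically measured by chunking held-out text into fixed-length segments and exponentiating the mean next-token cross-entropy. The sketch below shows this standard recipe in simplified form; it is not the paper's exact evaluation code, and the chunk length and dataset handling are simplified.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text, seq_len=2048):
    """Perplexity of `model` on `text`, evaluated over fixed-length chunks."""
    device = next(model.parameters()).device
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    nlls = []
    for i in range(ids.size(1) // seq_len):
        chunk = ids[:, i * seq_len : (i + 1) * seq_len]
        # Passing labels makes the model return the mean next-token cross-entropy for the chunk.
        nlls.append(model(chunk, labels=chunk).loss)
    return math.exp(torch.stack(nlls).mean().item())
```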
Deployment Speedup
SLEB consistently improved inference latency and throughput across various LLMs, demonstrating scalability and effectiveness in real-world deployment scenarios. It achieved superior speedups compared to state-of-the-art pruning methods, whose gains often hinge on specific matrix sizes and tensor hardware support.
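A rough way to verify such end-to-end gains is to time autoregressive generation before and after pruning, as in the sketch below; `dense_model` and `pruned_model` are placeholders, and the measurement omits the warm-up runs and averaging a careful benchmark would include.

```python
import time
import torch

@torch.no_grad()
def seconds_per_token(model, tokenizer, prompt, new_tokens=128):
    """Average per-token generation latency for a single prompt."""
    device = next(model.parameters()).device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / new_tokens

# speedup = seconds_per_token(dense_model, tok, prompt) / seconds_per_token(pruned_model, tok, prompt)
```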
Compatibility with Quantization
The approach also complements post-training quantization techniques such as AWQ, enabling further compression while preserving language quality. Combining SLEB with 4-bit weight quantization yields additional memory savings and speedups.
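To illustrate why the two techniques compose (pruning removes whole blocks, while quantization compresses the weights of the blocks that remain), the sketch below applies plain round-to-nearest group-wise 4-bit fake quantization to a weight matrix; this is a stand-in illustration, not AWQ itself.

```python
import torch

def fake_quantize_4bit(weight: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Round-to-nearest 4-bit fake quantization with one scale per group of weights."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.reshape(out_features, in_features // group_size, group_size)
    scale = (w.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)  # symmetric int4 range [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return (q * scale).reshape(out_features, in_features)

# After SLEB pruning, the weights of each remaining linear layer could be quantized in place:
# layer.weight.data = fake_quantize_4bit(layer.weight.data)
```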
Conclusion
SLEB presents a significant advance in LLM optimization by systematically identifying and removing redundant transformer blocks. By addressing the limitations of previous methods, SLEB improves model deployment in practical settings, delivering higher inference speed and lower computational resource requirements without compromising language understanding. The methodology holds promise for applications requiring efficient deployment of large-scale LLMs.