What Matters in Transformers? Not All Attention is Needed (2406.15786v6)

Published 22 Jun 2024 in cs.LG, cs.AI, and cs.CL

Abstract: While scaling Transformer-based LLMs has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different architectures in transformers, such as MLP and Attention layers, is under-explored. In this work, we investigate redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. Surprisingly, despite the critical role of attention layers in distinguishing transformers from other architectures, we found that a large portion of these layers exhibit excessively high similarity and can be pruned without degrading performance. For instance, Llama-2-70B achieved a 48.4% speedup with only a 2.4% performance drop by pruning half of the attention layers. Furthermore, by tracing model checkpoints throughout the training process, we observed that attention layer redundancy is inherent and consistent across training stages. Additionally, we further propose a method that jointly drops Attention and MLP layers, allowing us to more aggressively drop additional layers. For instance, when dropping 31 layers (Attention + MLP), Llama-2-13B still retains 90% of the performance on the MMLU task. Our work provides valuable insights for future network architecture design. The code is released at: https://github.com/Shwai-He/LLM-Drop.

Analysis of Redundancy in Transformer Architectures: An Examination of Attention and MLP Layer Pruning

This paper examines redundancy in Transformer-based LLMs, focusing on the attention and MLP layers. It systematically analyzes which components of the Transformer architecture can be pruned without significant performance degradation, improving efficiency and making real-world deployment more practical.

Redundancy in Transformer Structures

Transformer models are built from stacked blocks, each containing an attention layer and an MLP layer that serve distinct roles. The paper identifies and evaluates redundancy at each of these granularities (blocks, MLP layers, and attention layers) using a similarity-based metric: the similarity between a module's input and output serves as an estimate of its redundancy, since a module whose output closely matches its input contributes little transformation.
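As a rough illustration of this metric, the sketch below gives a minimal PyTorch formulation, not the paper's released implementation: it computes the mean cosine similarity between a sublayer's input and output hidden states. The function name and the assumption that the sublayer is a plain callable over hidden states (with the residual connection already included in its output) are ours.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def redundancy_score(sublayer, hidden_states):
    """Mean cosine similarity between a sublayer's input and output.

    sublayer:      callable mapping hidden states -> hidden states
                   (assumed to already include the residual connection)
    hidden_states: tensor of shape (batch, seq_len, hidden_dim)

    A score near 1.0 means the sublayer barely transforms its input,
    flagging it as a candidate for dropping.
    """
    output = sublayer(hidden_states)
    sim = F.cosine_similarity(hidden_states, output, dim=-1)  # (batch, seq_len)
    return sim.mean().item()
```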

Key Findings and Numerical Results

A particularly striking result is the high redundancy observed in attention layers. The paper provides robust quantitative evidence that a substantial portion of these layers can be pruned: for instance, Llama-2-70B achieved a 48.4% speedup with only a 2.4% drop in performance after half of its attention layers were pruned. These findings suggest that attention layers, the defining component of the Transformer architecture, can be pruned extensively while maintaining model efficacy.

By contrast, pruning MLP layers tends to degrade performance more significantly, indicating lower redundancy than in attention layers. Dropping MLP layers alone resulted in notable performance declines, affirming that attention layers offer more pruning headroom.
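To illustrate how such a ranking could translate into pruning, the following sketch bypasses the attention sublayers with the highest similarity scores. The `self_attn` attribute name and the zero-output stand-in module are illustrative assumptions rather than the paper's actual code, and they presuppose that the enclosing block adds the attention output to a residual stream.

```python
import torch
import torch.nn as nn

class ZeroAttention(nn.Module):
    """Stand-in for a dropped attention sublayer.

    Returning zeros means the block's residual connection passes the
    input through unchanged, so only the MLP path remains active.
    """
    def forward(self, hidden_states, *args, **kwargs):
        return torch.zeros_like(hidden_states)

def drop_redundant_attention(layers, scores, drop_ratio=0.5):
    """Bypass the attention sublayers with the highest similarity scores.

    layers: list of decoder blocks, each assumed to expose a `.self_attn`
            attribute (the attribute name is illustrative).
    scores: per-layer scores from `redundancy_score`; higher = more redundant.
    """
    n_drop = int(len(layers) * drop_ratio)
    ranked = sorted(range(len(layers)), key=lambda i: scores[i], reverse=True)
    for i in ranked[:n_drop]:
        layers[i].self_attn = ZeroAttention()
    return sorted(ranked[:n_drop])  # indices of the dropped sublayers
```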

Implications on Model Efficiency

The investigation also shows that attention layer redundancy is consistent throughout training: tracing checkpoints across training stages reveals the same redundant layers, indicating that the redundancy is an intrinsic characteristic of the architecture rather than an artifact of a particular training stage, and offering insights for future network design. The researchers additionally present a combined pruning approach, Joint Layer Drop, which removes both attention and MLP layers ranked by importance, allowing layers to be dropped more aggressively. When 31 layers (attention plus MLP) were dropped from Llama-2-13B, the model still preserved 90% of its performance on the MMLU task, underscoring the method's effectiveness.
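A minimal sketch of the joint selection step is shown below, under the assumption that attention and MLP modules are ranked in a single pool by the same similarity score; the function name and return format are ours, not taken from the released code.

```python
def joint_layer_drop(attn_scores, mlp_scores, n_drop):
    """Rank attention and MLP sublayers together by redundancy and pick
    the n_drop most redundant modules across both types.

    attn_scores, mlp_scores: per-layer similarity scores (higher = more
    redundant), e.g. computed with `redundancy_score` above.
    Returns (layer_index, module_type) pairs sorted by layer index.
    """
    pool = [(score, i, "attn") for i, score in enumerate(attn_scores)]
    pool += [(score, i, "mlp") for i, score in enumerate(mlp_scores)]
    pool.sort(reverse=True)  # most redundant (highest similarity) first
    chosen = [(i, kind) for _, i, kind in pool[:n_drop]]
    return sorted(chosen)
```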

Theoretical and Practical Implications

Theoretically, these findings advance our understanding of model sparsity and pruning strategies, underlining that not all layers contribute equally to a model's performance. Practically, such insights are instrumental in designing more efficient, cost-effective models suitable for deployment under constrained resources. The trade-offs offered by Joint Layer Drop open the door to significant computational and memory savings without incurring substantial performance costs.

Future Directions

Given these promising results, further research could explore Transformer variants that inherently use fewer attention layers or employ alternative attention mechanisms designed to avoid this redundancy from the outset. Additionally, combining pruning with post-pruning retraining could further refine model outputs, leveraging the identified redundancies for additional efficiency gains.

In conclusion, this paper not only identifies substantial redundancy within existing Transformer architectures but also provides effective pruning strategies that maintain model integrity and performance. Such advances are pivotal as the field seeks to balance model accuracy with efficiency, particularly in deploying large-scale models in practical settings.

Authors (4)
  1. Shwai He (23 papers)
  2. Guoheng Sun (15 papers)
  3. Zheyu Shen (13 papers)
  4. Ang Li (472 papers)
Citations (6)