FFN-SkipLLM: Adaptive Feed-Forward Skipping Strategy for Enhanced Autoregressive Decoding in LLMs
Introduction
The exponential growth in the capabilities of autoregressive LLMs has been met with increasing challenges in deployment, owing to the substantial computational demands these models entail. While several strategies based on early exit and layer dropping have been proposed to mitigate this, they often suffer from generation collapse and hallucination caused by ineffective handling of the Key-Value (KV) cache. This paper introduces FFN-SkipLLM, a novel strategy that targets the computationally expensive Feed-Forward Network (FFN) blocks within an LLM's layers. By skipping approximately 25-30% of FFN blocks in a fine-grained, input-adaptive manner, FFN-SkipLLM incurs only a marginal change in performance on knowledge-intensive generation tasks while avoiding the KV cache issues that hamper existing approaches.
Motivation
The observation that motivates this work is two-fold. First, there is significant redundancy in the computation performed by FFN blocks within LLMs, particularly in the middle layers. Second, the "attention sink" phenomenon, whereby the earliest tokens in a sequence attract a disproportionate share of attention and thereby anchor subsequent generation, suggests that once these tokens have been computed in full, a portion of the model's later computation can be bypassed without substantially degrading performance. This approach departs from traditional layer-skipping methods by skipping only FFN blocks, thereby circumventing the complexities of KV cache handling.
FFN-SkipLLM: An Approach to FFN Block Skipping
Preliminaries
Analysis reveals that FFN blocks, which account for roughly two-thirds of the parameters in a given layer (as observed in LLaMa-7B), exhibit a high degree of computational redundancy. This redundancy is concentrated in the middle layers of LLMs, with cosine similarity analyses indicating that tensors before and after FFN blocks undergo minimal change there. Consequently, FFN blocks in these "non-cold" regions emerge as prime candidates for skipping, promising substantial computational savings with negligible impact on output quality.
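This redundancy analysis can be reproduced with forward hooks. The sketch below is a minimal illustration, assuming a Hugging Face-style LLaMa model whose decoder layers expose their FFN as layer.mlp and apply post_attention_layernorm to the residual stream just before it; the function name ffn_cosine_similarity and the layer-wise bookkeeping are illustrative, not part of the paper's released code.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ffn_cosine_similarity(model, input_ids):
    # Per-layer cosine similarity between the residual stream entering the
    # FFN block and the residual stream after the FFN output is added back.
    pre, sims, hooks = {}, {}, []

    def save_pre(idx):
        def hook(module, inputs, output):
            # post_attention_layernorm receives the residual stream right
            # before the FFN sub-block, so its input is the "before" tensor.
            pre[idx] = inputs[0]
        return hook

    def save_sim(idx):
        def hook(module, inputs, output):
            after = pre[idx] + output  # residual addition around the FFN
            sims[idx] = F.cosine_similarity(
                pre[idx].flatten(0, 1), after.flatten(0, 1), dim=-1
            ).mean().item()
        return hook

    # Assumes a LLaMa-style module layout: model.model.layers[i]
    for idx, layer in enumerate(model.model.layers):
        hooks.append(layer.post_attention_layernorm.register_forward_hook(save_pre(idx)))
        hooks.append(layer.mlp.register_forward_hook(save_sim(idx)))

    model(input_ids)
    for h in hooks:
        h.remove()
    return sims  # {layer_index: mean cosine similarity}; values near 1.0 indicate redundancy

Layers whose mean similarity stays close to 1.0 across representative prompts correspond to the middle-layer, non-cold candidates described above.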
Methodology
FFN-SkipLLM employs a dynamic strategy that adapts FFN block skipping according to input-specific characteristics. This strategy is detailed in an algorithm that selectively bypasses FFN blocks within non-cold regions based on the cosine similarity between input and output tensors of these blocks. By maintaining the computation in the initial and final layers (cold regions) and employing a warm-up mechanism that temporarily foregoes skipping for the initial tokens, FFN-SkipLLM preserves the integrity of the KV cache and ensures a stable generation process.
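The sketch below illustrates one way such a skipping decision could be organized during decoding. It is an assumption-laden illustration rather than the paper's exact algorithm: the class name FFNSkipController, the cold-region widths, the warm-up length, the 0.90 threshold, and the exponential moving average over similarities observed for earlier tokens are all placeholder choices.

import torch.nn.functional as F

class FFNSkipController:
    # Layers in the cold regions (first and last few) always run their FFN.
    # During a warm-up of the first generated tokens, every FFN runs and its
    # input/output cosine similarity is recorded. Afterwards, non-cold FFN
    # blocks whose running similarity exceeds the threshold are skipped.

    def __init__(self, num_layers, cold_start=4, cold_end=4,
                 warmup_tokens=16, threshold=0.90):
        self.cold = set(range(cold_start)) | set(range(num_layers - cold_end, num_layers))
        self.warmup_tokens = warmup_tokens
        self.threshold = threshold
        self.similarity = {}   # layer index -> running cosine similarity
        self.tokens_seen = 0

    def new_token(self):
        # Call once per decoding step, before the forward pass.
        self.tokens_seen += 1

    def should_skip(self, layer_idx):
        if layer_idx in self.cold or self.tokens_seen <= self.warmup_tokens:
            return False
        return self.similarity.get(layer_idx, 0.0) >= self.threshold

    def record(self, layer_idx, ffn_in, ffn_out):
        # ffn_in: residual stream before the FFN; ffn_out: after the residual add.
        sim = F.cosine_similarity(
            ffn_in.flatten(0, 1), ffn_out.flatten(0, 1), dim=-1).mean().item()
        prev = self.similarity.get(layer_idx, sim)
        self.similarity[layer_idx] = 0.9 * prev + 0.1 * sim  # exponential moving average

Inside each decoder layer's forward pass, should_skip would be consulted before running the FFN; when a block is skipped, the attention sub-block's output passes straight through the residual connection. Because attention is always computed, the KV cache is populated for every token, which is why FFN skipping avoids the cache-corruption issues that afflict whole-layer skipping.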
Experimental Evaluation
Extensive experiments across benchmarks such as MT-Bench, Factoid-QA, and variable-length text summarization demonstrate the efficacy of FFN-SkipLLM. Notably, the model can skip roughly 25-30% of its FFN blocks while retaining nearly full model performance across a range of knowledge-intensive tasks. This stands in stark contrast to the performance drops and factual inaccuracies observed with existing layer-skipping approaches, affirming FFN-SkipLLM as a more robust and efficient alternative.
Implications and Future Directions
The introduction of FFN-SkipLLM opens up new avenues for improving the efficiency of autoregressive LLMs. By sidestepping the KV cache management challenges inherent in layer-skipping strategies, this approach paves the way for more sustainable and accessible deployment of LLMs across applications. Moving forward, integrating FFN-SkipLLM with other model compression techniques, such as sparsity and quantization, may yield further gains in computational efficiency. In addition, pushing skip ratios beyond 35% without performance degradation remains an open problem for future research.
Conclusion
FFN-SkipLLM represents a significant stride toward mitigating the computational demands of deploying state-of-the-art autoregressive LLMs. By leveraging insights into the redundancy of FFN blocks and the strategic skipping of these components, this approach achieves a delicate balance between computational efficiency and model performance, heralding a new era of more accessible and performant LLMs.