FFN Fusion: Rethinking Sequential Computation in LLMs
This paper presents FFN Fusion, an architectural optimization technique that improves the efficiency of large language models (LLMs) by reducing sequential computation. The method exploits natural opportunities for parallelization within Feed-Forward Network (FFN) layers, especially after certain attention layers are removed. The central claim is that many consecutive FFN layers can be executed in parallel with negligible impact on model accuracy, substantially reducing inference latency.
The research introduces a systematic approach for identifying sequences of FFN layers suited for fusion and transforming them into operations that exploit parallel processing. Applied to Llama-3.1-405B-Instruct, this yields Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base), which achieves a 1.71× speedup in inference latency and a 35× lower per-token cost while maintaining strong performance across diverse benchmarks.
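In essence, a run of consecutive blocks of the form x → x + FFN_k(x) is replaced by a single step x → x + Σ_k FFN_k(x); that sum can be computed by one wider FFN whose weight matrices are the concatenation of the individual layers' weights. The sketch below is a minimal PyTorch illustration of this weight-concatenation view, assuming Llama-style gated (SwiGLU) FFNs; `SwiGLUFFN` and `fuse_ffns` are illustrative names, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Llama-style gated FFN: down(silu(gate(x)) * up(x)). Illustrative only."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

def fuse_ffns(ffns):
    """Fuse FFNs that all read the same input into one wider FFN whose
    output equals the sum of the individual FFN outputs."""
    d_model = ffns[0].gate_proj.in_features
    d_hidden_total = sum(f.gate_proj.out_features for f in ffns)
    fused = SwiGLUFFN(d_model, d_hidden_total)
    with torch.no_grad():
        # Stack the gate/up projections along the hidden dimension...
        fused.gate_proj.weight.copy_(torch.cat([f.gate_proj.weight for f in ffns], dim=0))
        fused.up_proj.weight.copy_(torch.cat([f.up_proj.weight for f in ffns], dim=0))
        # ...and the down projection along its input dimension, so its
        # matmul sums the contributions of the original layers.
        fused.down_proj.weight.copy_(torch.cat([f.down_proj.weight for f in ffns], dim=1))
    return fused

# Equivalence check: the fused FFN reproduces the sum of the individual FFNs.
ffns = [SwiGLUFFN(64, 256) for _ in range(3)]
fused = fuse_ffns(ffns)
x = torch.randn(2, 5, 64)
assert torch.allclose(fused(x), sum(f(x) for f in ffns), atol=1e-5)
```

Replacing the original sequential chain with this parallel sum is an approximation of the model's computation; it works well precisely when the fused layers depend only weakly on one another.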
Core Contributions
- FFN Fusion Methodology: Sequences of FFN layers are systematically transformed into a single, wider computation executed in one parallel step (as illustrated in the sketch above). This is particularly effective on modern multi-GPU deployments, where it reduces inter-layer synchronization and cross-device communication overhead.
- Practical Impact: Demonstrated through the Ultra-253B-Base model, which not only exploits FFN Fusion for efficiency but also remains competitive with its larger parent model, Llama-3.1-405B-Instruct. Ultra-253B-Base delivers strong results on benchmarks such as Arena Hard (84.92%), HumanEval (86.58%), and MT-Bench (9.19).
- Analysis of Structural Redundancies: The research highlights significant structural redundancy within LLM architectures, particularly in the attention mechanisms. Removing attention layers identified as redundant leaves sequences of consecutive FFN layers with low inter-dependency that are well suited to parallel execution (a simple dependency heuristic is sketched after this list).
- Scalability: Extensive experiments indicate that the benefits of FFN Fusion grow with model scale, with efficacy demonstrated on models from 49B to 253B parameters. This suggests the technique is complementary to existing optimizations such as quantization and pruning.
- Initial Exploration of Full Transformer Block Parallelization: Preliminary results suggest that entire Transformer blocks could also be parallelized, opening avenues for new LLM architecture designs (a speculative sketch follows the dependency heuristic below).
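As a rough illustration of the dependency analysis mentioned above, the snippet below estimates how much one FFN layer depends on the output of the layer before it by comparing its contribution in the sequential order against its contribution when run in parallel on the same input. This cosine-distance heuristic is an assumption for illustration (it simplifies whatever metric the paper actually uses), and it reuses the hypothetical `SwiGLUFFN` class from the earlier sketch.

```python
import torch
import torch.nn.functional as F

def ffn_dependency(ffn_a, ffn_b, x):
    """Rough proxy for how strongly ffn_b depends on ffn_a's output.

    Compares ffn_b's contribution after ffn_a's residual update (sequential
    order) against its contribution on the raw input x (parallel order).
    A small cosine distance suggests the pair can run in parallel, and
    hence be fused, with little accuracy loss.
    """
    with torch.no_grad():
        h_seq = ffn_b(x + ffn_a(x))   # contribution in the sequential order
        h_par = ffn_b(x)              # contribution if run in parallel
        cos = F.cosine_similarity(h_seq.flatten(1), h_par.flatten(1), dim=-1)
    return (1.0 - cos).mean().item()  # ~0 means nearly independent

# Runs of layers whose pairwise dependency stays below a chosen threshold
# are candidates for fusion into a single parallel FFN.
ffn_a, ffn_b = SwiGLUFFN(64, 256), SwiGLUFFN(64, 256)  # from the sketch above
x = torch.randn(8, 16, 64)
print(ffn_dependency(ffn_a, ffn_b, x))
```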
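The same idea can in principle extend from FFNs to whole blocks: rather than threading the hidden state through a group of blocks one at a time, every block in the group reads the same input and their residual contributions are summed. The sketch below is only one plausible reading of this preliminary direction, not the paper's formulation; `ToyBlock` is a hypothetical stand-in for a full attention-plus-FFN block.

```python
import torch

class ToyBlock(torch.nn.Module):
    """Stand-in for a Transformer block with a residual update: x -> x + f(x)."""
    def __init__(self, d_model):
        super().__init__()
        self.f = torch.nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.f(x)

def parallel_block_group(blocks, x):
    """Apply a group of residual blocks to the same input and sum their
    contributions, removing the sequential dependency within the group."""
    return x + sum(block(x) - x for block in blocks)

blocks = [ToyBlock(64) for _ in range(3)]
x = torch.randn(2, 64)
y = parallel_block_group(blocks, x)  # one wide step instead of three sequential ones
```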
Implications and Future Directions
The introduction of FFN Fusion has several theoretical and practical implications. From a theoretical standpoint, it challenges the conventional understanding that sequential processing is imperative for FFN layers in Transformer architectures. The findings suggest a new perspective on model design, encouraging architectures that inherently support parallel computations.
Practically, the reduced inference latency and cost efficiency achieved through FFN Fusion can significantly enhance the accessibility and scalability of LLM deployments, particularly in environments with stringent resource constraints.
Looking forward, the exploration of full Transformer block parallelization hints at deeper opportunities for architectural redesigns that could further optimize LLM performance. Additionally, FFN Fusion may combine synergistically with other optimization strategies such as Mixture-of-Experts (MoE) techniques, provided the associated computational overheads are handled effectively.
In summary, the innovations presented in this paper lay a robust foundation for future endeavours in model efficiency, proposing both a practical methodology and a conceptual shift towards more parallelizable model architectures.