- The paper introduces UltraFastBERT, which activates only 0.3% of its neurons during inference to dramatically reduce computational cost.
- It replaces dense matrix multiplication with conditional matrix multiplication, yielding a measured 78x speedup from a simple CPU implementation, against a theoretical maximum of 341x for a BERT-base-sized model.
- Evaluation on the GLUE benchmark shows UltraFastBERT retains at least 96% of BERT-base's downstream performance, underscoring its practical efficiency in NLP tasks.
Introduction
The field of natural language processing has witnessed significant advances with the introduction of large language models (LLMs), which have dramatically improved comprehension and generation abilities. However, these models carry a high computational cost due to their large parameter counts, especially at inference time. To address this, research has been directed toward optimizing the efficiency of such models while maintaining their performance.
Model Architecture
UltraFastBERT offers an emerging approach to efficient language modelling. It is built on the architecture of BERT (Bidirectional Encoder Representations from Transformers) but replaces the conventional feedforward layers with fast feedforward networks (FFFs). This structure drastically reduces the number of neurons evaluated during inference: only 0.3% of the model's neurons are engaged. Specifically, each feedforward layer activates just 12 of its 4095 neurons for a given input, as illustrated in the sketch below. Despite this massive reduction in active neurons, UltraFastBERT shows no loss in performance compared to BERT-like models of similar size and training regimen.
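To make the "12 of 4095" figure concrete, here is a minimal sketch of what such a tree-conditional feedforward pass can look like, assuming a balanced binary tree of depth 11 in which each visited node is one neuron and routing follows the sign of its pre-activation. The names (`fff_infer`, `w_in`, `w_out`) and the exact activation and routing details are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a fast feedforward (FFF) inference pass, assuming a
# balanced binary tree of neurons with sign-based hard routing.
import numpy as np

def gelu(x):
    # Tanh approximation of the GELU activation used in BERT-style models.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def fff_infer(x, w_in, w_out, depth):
    """x: (hidden,); w_in, w_out: (n_nodes, hidden) with n_nodes = 2**(depth+1) - 1.
    Only depth + 1 of the n_nodes neurons are evaluated for this input."""
    y = np.zeros(w_out.shape[1])
    node = 0  # start at the root of the neuron tree
    for _ in range(depth + 1):
        c = w_in[node] @ x                      # this neuron's pre-activation
        y += gelu(c) * w_out[node]              # accumulate its contribution to the output
        node = 2 * node + (2 if c > 0 else 1)   # descend to the left/right child by the sign of c
    return y

# Depth 11 gives 2**12 - 1 = 4095 neurons per layer, of which only 12 are touched.
depth, hidden = 11, 768
n_nodes = 2 ** (depth + 1) - 1
rng = np.random.default_rng(0)
w_in = rng.standard_normal((n_nodes, hidden)) / np.sqrt(hidden)
w_out = rng.standard_normal((n_nodes, hidden)) / np.sqrt(hidden)
y = fff_infer(rng.standard_normal(hidden), w_in, w_out, depth)
```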
Downstream Performance
To validate the efficacy of UltraFastBERT, comprehensive evaluations were conducted on the GLUE benchmark, a widely used suite of natural language understanding tasks. The reported results indicate that UltraFastBERT, despite its drastically reduced number of active neurons, retained at least 96% of BERT-base's downstream predictive performance. Notably, the performance drop attributable to the model's sparse activation was concentrated in a single GLUE task (CoLA), suggesting that the overall approach is sound. For those interested in replicating or extending this research, the model weights have been made public.
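For reference, loading a published checkpoint with the Hugging Face transformers library might look like the sketch below. The repository name shown is a placeholder assumption rather than a confirmed identifier, and because the FFF layers are a custom architecture, trusting the authors' remote code is likely required; consult the official release for the exact identifier and loading instructions.

```python
# Hedged sketch of loading a released UltraFastBERT checkpoint with the
# Hugging Face transformers library. The repository name is a placeholder,
# not a confirmed identifier; check the authors' release notes.
from transformers import AutoModel, AutoTokenizer

repo = "authors-org/ultrafastbert-checkpoint"  # placeholder repository name
tokenizer = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True is typically needed for custom architectures such as
# FFF layers, since they are not part of the stock transformers library.
model = AutoModel.from_pretrained(repo, trust_remote_code=True)

inputs = tokenizer("UltraFastBERT activates 12 of 4095 neurons per layer.",
                   return_tensors="pt")
outputs = model(**inputs)  # last_hidden_state etc., as with any BERT encoder
```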
Inference Acceleration and Compatibility
UltraFastBERT introduces conditional matrix multiplication (CMM) as the core of its efficiency gains. CMM departs from the dense matrix multiplication (DMM) traditionally used in feedforward networks: it computes dot products conditionally on the input, so there is no need to engage all neurons for every token. Remarkably, even a simple CPU implementation of CMM on standard hardware already yields a 78x speedup over DMM. An analysis of CPU and GPU compatibility suggests that, with optimized device-specific programming, the actual speedup could approach the theoretical maximum, which for a BERT-base-sized model is 341x (each layer evaluates 12 of 4095 neurons, and 4095 / 12 ≈ 341).
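The sketch below contrasts a dense feedforward pass with a row-wise conditional one, reusing the `gelu` and `fff_infer` helpers from the FFF sketch above. It illustrates only the operation-count gap; the 78x figure comes from the authors' optimized CPU code, not from a naive Python loop like this.

```python
# Naive contrast between the dense feedforward pass (DMM) and its conditional
# counterpart (CMM), reusing gelu and fff_infer from the FFF sketch above.
import numpy as np

def dmm_ffn(X, w_in, w_out):
    # Dense baseline: every one of the 4095 neurons is evaluated for every row of X.
    return gelu(X @ w_in.T) @ w_out

def cmm_ffn(X, w_in, w_out, depth=11):
    # Conditional variant: each row follows its own root-to-leaf path and
    # evaluates only depth + 1 = 12 neurons, hence the 4095 / 12 ≈ 341x ceiling.
    return np.stack([fff_infer(x, w_in, w_out, depth) for x in X])
```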
Conclusion and Future Outlook
UltraFastBERT demonstrates the potential of LLMs that leverage conditional neural execution. This work paves the way for substantial gains in inference speed, potentially making large-scale LLMs usable on devices with far less computational power. A key takeaway is that refined implementations of FFFs could deliver major efficiency improvements in LLMs while preserving their language understanding and generation capabilities. The next step is to integrate native support for CMM into deep learning frameworks and hardware firmware to fully realize the speedup potential demonstrated by UltraFastBERT.