- The paper introduces UltraFastBERT, which activates only 0.3% of its neurons during inference to dramatically reduce computational cost.
- It replaces dense matrix multiplication with conditional matrix multiplication, yielding a measured 78x speedup from a simple CPU implementation, against a theoretical maximum of 341x for a BERT-base-sized model.
- Evaluation on the GLUE benchmark shows UltraFastBERT retains at least 96% of BERT-base's downstream performance, underscoring its practical efficiency in NLP tasks.
Introduction
The field of natural language processing has witnessed significant advances with the introduction of large language models (LLMs), which have dramatically improved comprehension and generation abilities. However, these models carry a high computational cost due to their large parameter counts, especially at inference time. To address this, research has been directed toward optimizing the efficiency of such models while maintaining their performance.
Model Architecture
UltraFastBERT offers an emerging approach to efficient language modelling. It is built on the architecture of BERT (Bidirectional Encoder Representations from Transformers) but replaces the conventional feedforward layers with fast feedforward networks (FFFs). This structure drastically reduces the number of neurons evaluated during inference: only 0.3% of the model's neurons are engaged. Specifically, each feedforward layer activates just 12 of its 4095 neurons for a given input, as illustrated in the sketch below. Despite this massive reduction in active neurons, UltraFastBERT shows no loss in performance compared to BERT-like models of similar size and training regimen.
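To make the "12 of 4095" figure concrete, here is a minimal sketch of what such a tree-conditional feedforward pass can look like, assuming a balanced binary tree of depth 11 in which each visited node is one neuron and routing follows the sign of its pre-activation. The names (`fff_infer`, `w_in`, `w_out`) and the exact activation and routing details are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a fast feedforward (FFF) inference pass, assuming a
# balanced binary tree of neurons with sign-based hard routing.
import numpy as np

def gelu(x):
    # Tanh approximation of the GELU activation used in BERT-style models.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def fff_infer(x, w_in, w_out, depth):
    """x: (hidden,); w_in, w_out: (n_nodes, hidden) with n_nodes = 2**(depth+1) - 1.
    Only depth + 1 of the n_nodes neurons are evaluated for this input."""
    y = np.zeros(w_out.shape[1])
    node = 0  # start at the root of the neuron tree
    for _ in range(depth + 1):
        c = w_in[node] @ x                      # this neuron's pre-activation
        y += gelu(c) * w_out[node]              # accumulate its contribution to the output
        node = 2 * node + (2 if c > 0 else 1)   # descend to the left/right child by the sign of c
    return y

# Depth 11 gives 2**12 - 1 = 4095 neurons per layer, of which only 12 are touched.
depth, hidden = 11, 768
n_nodes = 2 ** (depth + 1) - 1
rng = np.random.default_rng(0)
w_in = rng.standard_normal((n_nodes, hidden)) / np.sqrt(hidden)
w_out = rng.standard_normal((n_nodes, hidden)) / np.sqrt(hidden)
y = fff_infer(rng.standard_normal(hidden), w_in, w_out, depth)
```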
Downstream Performance
To validate the efficacy of UltraFastBERT, comprehensive evaluations were conducted on the GLUE benchmark, a widely used suite of natural language understanding tasks. The reported results indicate that UltraFastBERT, despite its drastically reduced number of active neurons, retained at least 96% of BERT-base's downstream predictive performance. Notably, the performance drop attributable to the model's sparse activation was concentrated in a single GLUE task (CoLA), suggesting that the overall approach is sound. For those interested in replicating or extending this research, the model weights have been made public.
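For reference, loading a published checkpoint with the Hugging Face transformers library might look like the sketch below. The repository name shown is a placeholder assumption rather than a confirmed identifier, and because the FFF layers are a custom architecture, trusting the authors' remote code is likely required; consult the official release for the exact identifier and loading instructions.

```python
# Hedged sketch of loading a released UltraFastBERT checkpoint with the
# Hugging Face transformers library. The repository name is a placeholder,
# not a confirmed identifier; check the authors' release notes.
from transformers import AutoModel, AutoTokenizer

repo = "authors-org/ultrafastbert-checkpoint"  # placeholder repository name
tokenizer = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True is typically needed for custom architectures such as
# FFF layers, since they are not part of the stock transformers library.
model = AutoModel.from_pretrained(repo, trust_remote_code=True)

inputs = tokenizer("UltraFastBERT activates 12 of 4095 neurons per layer.",
                   return_tensors="pt")
outputs = model(**inputs)  # last_hidden_state etc., as with any BERT encoder
```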
Inference Acceleration and Compatibility
UltraFastBERT introduces conditional matrix multiplication (CMM) as the core of its efficiency gains. CMM departs from the dense matrix multiplication (DMM) traditionally used in feedforward networks: it computes dot products conditionally on the input, so there is no need to engage all neurons for every token. Remarkably, even a simple CPU implementation of CMM on standard hardware already yields a 78x speedup over DMM. An analysis of CPU and GPU compatibility suggests that, with optimized device-specific programming, the actual speedup could approach the theoretical maximum, which for a BERT-base-sized model is 341x (each layer evaluates 12 of 4095 neurons, and 4095 / 12 ≈ 341).
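The sketch below contrasts a dense feedforward pass with a row-wise conditional one, reusing the `gelu` and `fff_infer` helpers from the FFF sketch above. It illustrates only the operation-count gap; the 78x figure comes from the authors' optimized CPU code, not from a naive Python loop like this.

```python
# Naive contrast between the dense feedforward pass (DMM) and its conditional
# counterpart (CMM), reusing gelu and fff_infer from the FFF sketch above.
import numpy as np

def dmm_ffn(X, w_in, w_out):
    # Dense baseline: every one of the 4095 neurons is evaluated for every row of X.
    return gelu(X @ w_in.T) @ w_out

def cmm_ffn(X, w_in, w_out, depth=11):
    # Conditional variant: each row follows its own root-to-leaf path and
    # evaluates only depth + 1 = 12 neurons, hence the 4095 / 12 ≈ 341x ceiling.
    return np.stack([fff_infer(x, w_in, w_out, depth) for x in X])
```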
Conclusion and Future Outlook
UltraFastBERT demonstrates the potential of LLMs that leverage conditional neural execution. This work paves the way for substantial gains in inference speed, potentially making large-scale LLMs usable on devices with far less computational power. A key takeaway is that refined implementations of FFFs could deliver major efficiency improvements in LLMs while preserving their language understanding and generation capabilities. The next step is to integrate native support for CMM into deep learning frameworks and hardware firmware to fully realize the speedup potential demonstrated by UltraFastBERT.