
FastFormers: Highly Efficient Transformer Models for Natural Language Understanding (2010.13382v1)

Published 26 Oct 2020 in cs.CL

Abstract: Transformer-based models are the state-of-the-art for Natural Language Understanding (NLU) applications. Models are getting bigger and better on various tasks. However, Transformer models remain computationally challenging since they are not efficient at inference-time compared to traditional approaches. In this paper, we present FastFormers, a set of recipes to achieve efficient inference-time performance for Transformer-based models on various NLU tasks. We show how carefully utilizing knowledge distillation, structured pruning and numerical optimization can lead to drastic improvements on inference efficiency. We provide effective recipes that can guide practitioners to choose the best settings for various NLU tasks and pretrained models. Applying the proposed recipes to the SuperGLUE benchmark, we achieve from 9.8x up to 233.9x speed-up compared to out-of-the-box models on CPU. On GPU, we also achieve up to 12.4x speed-up with the presented methods. We show that FastFormers can drastically reduce cost of serving 100 million requests from 4,223 USD to just 18 USD on an Azure F16s_v2 instance. This translates to a sustainable runtime by reducing energy consumption 6.9x - 125.8x according to the metrics used in the SustaiNLP 2020 shared task.

Citations (39)

Summary

  • The paper presents a novel framework combining knowledge distillation, structured pruning, and quantization to drastically reduce Transformer inference time.
  • It demonstrates significant efficiency gains with speed improvements up to 233.9x on CPUs and 12.4x on GPUs while largely preserving accuracy.
  • The approach offers scalable solutions for production, cutting serving costs from $4,223 to $18 for 100 million requests and significantly lowering energy consumption.

FastFormers: Highly Efficient Transformer Models for Natural Language Understanding

The paper "FastFormers: Highly Efficient Transformer Models for Natural Language Understanding" addresses a significant challenge in deploying Transformer-based models: computational inefficiency at inference time. While Transformer models such as BERT and RoBERTa have achieved state-of-the-art results in Natural Language Understanding (NLU), their complex architectures often incur prohibitive inference costs, making them less viable for large-scale production use. The paper introduces FastFormers, a set of recipes combining knowledge distillation, structured pruning, and numerical optimization to significantly improve inference efficiency.

Key Techniques

  1. Knowledge Distillation (KD): FastFormers uses knowledge distillation to transfer knowledge from a large teacher model to a smaller student model. Combining task-agnostic and task-specific distillation, it achieves substantial reductions in model size with little loss of accuracy; for instance, distilled BERT and RoBERTa students with reduced architectures retain comparable accuracy on SuperGLUE tasks (a minimal loss sketch follows this list).
  2. Structured Pruning: Unlike unstructured pruning, structured pruning removes whole computational blocks, such as multi-head attention (MHA) heads and feed-forward hidden dimensions. By systematically shrinking these dimensions while preserving model quality, FastFormers achieves significant speed-ups, particularly on demanding tasks like MultiRC and ReCoRD (see the pruning sketch after this list).
  3. Model Quantization: By applying 8-bit quantization on CPUs and casting model parameters to 16-bit on GPUs, FastFormers exploits modern hardware capabilities for faster execution. On CPU in particular, these optimizations yield speed-ups of up to 3x with minimal accuracy degradation.
  4. Runtime Optimizations: Additional gains come from multi-processing (better core utilization) and computational-graph optimizations (operator fusion), which further reduce inference time and energy consumption; the deployment sketch after this list illustrates both the quantization and runtime settings.
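
To make the distillation step concrete, here is a minimal PyTorch sketch of a standard task-specific distillation loss of the kind FastFormers builds on: a temperature-scaled KL term against the teacher's soft targets blended with cross-entropy on the hard labels. The temperature and weighting values are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.9):
    """Blend a soft-target KL term (teacher knowledge) with the usual
    hard-label cross-entropy. temperature and alpha are illustrative."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Usage (shapes only): student_logits and teacher_logits are [batch, num_classes],
# labels is [batch]; the teacher runs in eval mode under torch.no_grad().
```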
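
For structured pruning, the paper ranks attention heads by importance and removes the least useful ones. The sketch below is a simplified stand-in: it uses a weight-norm proxy for head importance (the paper uses gradient-based importance scores) together with the Hugging Face prune_heads API, and the choice of 8 retained heads per layer is hypothetical.

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
num_heads = model.config.num_attention_heads
heads_to_keep = 8  # hypothetical target; FastFormers tunes this per task

heads_to_prune = {}
for layer_idx, layer in enumerate(model.bert.encoder.layer):
    attn = layer.attention.self
    head_dim = attn.attention_head_size
    # Proxy importance: L2 norm of each head's slice of the query projection.
    # (The paper scores heads with gradient-based importance instead.)
    w = attn.query.weight.view(num_heads, head_dim, -1)
    importance = w.norm(dim=(1, 2))
    heads_to_prune[layer_idx] = importance.argsort()[: num_heads - heads_to_keep].tolist()

model.prune_heads(heads_to_prune)  # drops the lowest-scoring heads in every layer
```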
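
Finally, a deployment-side sketch under the assumption of a PyTorch/onnxruntime stack: dynamic 8-bit quantization of linear layers for CPU, a 16-bit cast for GPU, and an onnxruntime session configured for graph-level operator fusion with a single intra-op thread (throughput then comes from serving requests in parallel across cores). The "model.onnx" path is a placeholder for an exported model.

```python
import torch
import onnxruntime as ort
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# CPU path: dynamic int8 quantization of the Linear layers.
int8_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# GPU path: cast parameters to fp16 (requires a CUDA device).
# fp16_model = model.half().cuda()

# Runtime optimizations: fuse the computation graph and pin each session to a
# single intra-op thread, parallelizing across requests instead.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 1
session = ort.InferenceSession("model.onnx", opts)  # placeholder path to an ONNX export
```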

Numerical and Experimental Results

The paper's experimental evaluations demonstrate substantial efficiency gains across the NLU tasks of the SuperGLUE benchmark. On CPU, FastFormers speeds up inference by 9.8x to 233.9x over out-of-the-box models; on GPU, by up to 12.4x. These optimizations cut the cost of serving 100 million requests from $4,223 to $18 on an Azure F16s_v2 instance, underscoring the approach's scalability in practical settings. Energy consumption drops by factors of 6.9x to 125.8x, measured with the metrics established in the SustaiNLP 2020 shared task.

Implications and Future Work

The implications of FastFormers are twofold. Practically, it allows for broader deployment of Transformer models in cost-sensitive applications by significantly reducing inference costs and energy usage. Theoretically, it presents a framework that encourages future exploration into combining model compression and optimization strategies for enhanced efficiency.

Future research directions could consider integrating emerging techniques, such as early exiting and linear complexity self-attention, to push efficiency boundaries further. The comprehensive open-source release of FastFormers offers a foundation for continued innovation in sustainable AI.

In conclusion, FastFormers provides a comprehensive set of recipes to tackle the computational challenges inherent in Transformer models, significantly advancing AI deployment capabilities in resource-constrained environments.
