- The paper presents a novel framework combining knowledge distillation, structured pruning, and quantization to drastically reduce Transformer inference time.
- It demonstrates significant efficiency gains with speed improvements up to 233.9x on CPUs and 12.4x on GPUs while largely preserving accuracy.
- The approach offers scalable solutions for production, cutting serving costs from $4,223 to $18 for 100 million requests and significantly lowering energy consumption.
FastFormers: Highly Efficient Transformer Models for Natural Language Understanding
The paper "FastFormers: Highly Efficient Transformer Models for Natural Language Understanding" addresses a significant challenge in deploying Transformer-based models: computational inefficiency at inference time. While Transformer models such as BERT and RoBERTa have achieved state-of-the-art results in Natural Language Understanding (NLU), their complex architectures often lead to prohibitive inference costs, rendering them less viable for large-scale production use. The paper introduces FastFormers, a set of recipes combining knowledge distillation, structured pruning, and model quantization, complemented by runtime optimizations, to significantly improve inference efficiency.
Key Techniques
- Knowledge Distillation (KD): FastFormers uses knowledge distillation to transfer knowledge from a large teacher model to a smaller student model. Combining task-specific and task-agnostic distillation, it achieves substantial reductions in model size without compromising accuracy; for instance, distilling BERT and RoBERTa into smaller student architectures yields models that retain comparable accuracy on SuperGLUE tasks (a minimal distillation-loss sketch follows this list).
- Structured Pruning: Unlike unstructured pruning, which removes individual weights, structured pruning removes whole computational blocks, targeting multi-head attention (MHA) heads and the hidden dimensions of the feed-forward networks. By systematically reducing these dimensions while preserving model quality, FastFormers achieves significant speedups, particularly on demanding tasks like MultiRC and ReCoRD (see the head-pruning sketch below).
- Model Quantization: By applying 8-bit integer quantization on CPUs and casting model parameters to 16-bit floating point on GPUs, FastFormers exploits modern hardware capabilities for faster execution. These optimizations, especially on CPUs, yield speedups of up to 3x with minimal accuracy degradation (see the quantization sketch below).
- Runtime Optimizations: Additional enhancements include multi-processing optimizations for better core utilization and computational graph optimizations for operator fusion. These further reduce inference time and energy consumption (see the runtime-configuration sketch below).
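To make the distillation step concrete, below is a minimal sketch of a task-specific distillation loss in PyTorch. The function name, the temperature, and the interpolation weight `alpha` are illustrative choices, not values taken from the FastFormers recipes.

```python
# Minimal task-specific knowledge-distillation loss (PyTorch).
# Names and hyperparameter values are illustrative, not from FastFormers.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: match the student's softened distribution to the teacher's.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Interpolate between imitating the teacher and fitting the labels.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```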
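Structured pruning of whole attention heads can be illustrated with the `prune_heads` method that Hugging Face Transformers exposes on BERT-style models; it removes the selected heads and shrinks the corresponding attention projections. The model name and head indices below are placeholders; FastFormers chooses which heads and feed-forward dimensions to drop by an importance measure, which this sketch does not reproduce.

```python
# Structured pruning sketch: drop entire attention heads from a BERT-style
# model via Hugging Face Transformers. Head indices are arbitrary
# placeholders, not the heads FastFormers would actually select.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# {layer index: [head indices to remove]}; pruning heads shrinks the
# query/key/value/output projection matrices, so the remaining
# computation runs faster.
heads_to_prune = {0: [2, 5], 1: [0, 7], 2: [3]}
model.prune_heads(heads_to_prune)
```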
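The quantization step can be sketched with standard PyTorch utilities: dynamic 8-bit quantization of the Linear layers for the CPU path and half-precision parameters for the GPU path. This stands in for the paper's optimized quantized runtime rather than reproducing it.

```python
# Quantization sketch: int8 Linear layers for CPU inference and fp16
# parameters for GPU inference, using stock PyTorch as an illustration.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# CPU path: store Linear weights as int8 and quantize activations on the fly.
cpu_int8_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# GPU path: cast parameters and buffers to 16-bit floating point.
gpu_fp16_model = model.half()
if torch.cuda.is_available():
    gpu_fp16_model = gpu_fp16_model.to("cuda")
```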
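Lastly, the runtime optimizations can be sketched as an inference-session configuration, assuming the student model has been exported to ONNX and is served with onnxruntime: full graph-level optimization enables operator fusion, and one intra-op thread per session lets multiple worker processes share the available cores. The model path is hypothetical.

```python
# Runtime-optimization sketch: operator fusion via graph optimization and
# one intra-op thread per session so several serving processes can run in
# parallel without oversubscribing the CPU. The ONNX path is a placeholder.
import onnxruntime as ort

options = ort.SessionOptions()
# Apply the full set of graph rewrites, including operator fusion.
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Pin each session to a single core; parallelism comes from running
# multiple such sessions in separate processes.
options.intra_op_num_threads = 1

session = ort.InferenceSession("student_model.onnx", sess_options=options)
```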
Numerical and Experimental Results
The paper's experimental evaluations demonstrate substantial efficiency gains across the NLU tasks of the SuperGLUE benchmark. On CPUs, FastFormers delivers inference speedups ranging from 9.8x to 233.9x, and on GPUs up to 12.4x. These optimizations cut the cost of serving 100 million requests from $4,223 to $18, underscoring the approach's scalability in practical scenarios. Energy consumption drops by factors of 6.9x to 125.8x, evaluated with the metrics established for SustaiNLP 2020.
Implications and Future Work
The implications of FastFormers are twofold. Practically, it allows for broader deployment of Transformer models in cost-sensitive applications by significantly reducing inference costs and energy usage. Theoretically, it presents a framework that encourages future exploration into combining model compression and optimization strategies for enhanced efficiency.
Future research directions could consider integrating emerging techniques, such as early exiting and linear complexity self-attention, to push efficiency boundaries further. The comprehensive open-source release of FastFormers offers a foundation for continued innovation in sustainable AI.
In conclusion, FastFormers provides a comprehensive set of recipes to tackle the computational challenges inherent in Transformer models, significantly advancing AI deployment capabilities in resource-constrained environments.