
HAT: Hardware-Aware Transformers for Efficient Natural Language Processing (2005.14187v1)

Published 28 May 2020 in cs.CL, cs.LG, and cs.NE

Abstract: Transformers are ubiquitous in NLP tasks, but they are difficult to be deployed on hardware due to the intensive computation. To enable low-latency inference on resource-constrained hardware platforms, we propose to design Hardware-Aware Transformers (HAT) with neural architecture search. We first construct a large design space with $\textit{arbitrary encoder-decoder attention}$ and $\textit{heterogeneous layers}$. Then we train a $\textit{SuperTransformer}$ that covers all candidates in the design space, and efficiently produces many $\textit{SubTransformers}$ with weight sharing. Finally, we perform an evolutionary search with a hardware latency constraint to find a specialized $\textit{SubTransformer}$ dedicated to run fast on the target hardware. Extensive experiments on four machine translation tasks demonstrate that HAT can discover efficient models for different hardware (CPU, GPU, IoT device). When running WMT'14 translation task on Raspberry Pi-4, HAT can achieve $\textbf{3}\times$ speedup, $\textbf{3.7}\times$ smaller size over baseline Transformer; $\textbf{2.7}\times$ speedup, $\textbf{3.6}\times$ smaller size over Evolved Transformer with $\textbf{12,041}\times$ less search cost and no performance loss. HAT code is https://github.com/mit-han-lab/hardware-aware-transformers.git

Authors (7)
  1. Hanrui Wang (49 papers)
  2. Zhanghao Wu (7 papers)
  3. Zhijian Liu (41 papers)
  4. Han Cai (79 papers)
  5. Ligeng Zhu (22 papers)
  6. Chuang Gan (195 papers)
  7. Song Han (155 papers)
Citations (250)

Summary

Analysis of "HAT: Hardware-Aware Transformers for Efficient Natural Language Processing"

The paper "HAT: Hardware-Aware Transformers for Efficient Natural Language Processing" outlines a novel approach to designing Transformer models that are optimized for various hardware platforms, leveraging neural architecture search (NAS) to create efficient models without sacrificing performance. These models, called Hardware-Aware Transformers (HAT), are specifically constructed to minimize latency on diverse hardware configurations such as CPUs, GPUs, and IoT devices.

Overview of the HAT Framework

Transformers have become a cornerstone of NLP due to their effectiveness on tasks involving sequential data. However, their deployment on resource-constrained hardware is hindered by their high computational demand. To address this, the paper introduces HAT, which searches over a large design space featuring arbitrary encoder-decoder attention and heterogeneous Transformer layers.
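To make the design space concrete, the sketch below encodes one candidate architecture as a plain configuration. The value ranges and key names are illustrative assumptions based on the paper's description (per-layer FFN dimensions and head counts, variable decoder depth, and how many of the last encoder layers each decoder layer attends to); the released repository uses its own naming and grid.

```python
# Hypothetical encoding of the HAT-style design space; values are illustrative.
import random

DESIGN_SPACE = {
    "encoder_layers": [6],                 # encoder depth (fixed in this sketch)
    "decoder_layers": [1, 2, 3, 4, 5, 6],  # variable decoder depth
    "embed_dim": [512, 640],
    "ffn_dim": [1024, 2048, 3072],         # per-layer FFN hidden dimension
    "num_heads": [4, 8],                   # per-layer attention heads
    "arbitrary_attn": [1, 2, 3],           # how many of the last encoder layers
                                           # each decoder layer attends to
}

def sample_subtransformer(space=DESIGN_SPACE):
    """Sample one SubTransformer configuration with heterogeneous layers."""
    enc_layers = random.choice(space["encoder_layers"])
    dec_layers = random.choice(space["decoder_layers"])
    return {
        "embed_dim": random.choice(space["embed_dim"]),
        "encoder": [
            {"ffn_dim": random.choice(space["ffn_dim"]),
             "num_heads": random.choice(space["num_heads"])}
            for _ in range(enc_layers)
        ],
        "decoder": [
            {"ffn_dim": random.choice(space["ffn_dim"]),
             "num_heads": random.choice(space["num_heads"]),
             "arbitrary_attn": random.choice(space["arbitrary_attn"])}
            for _ in range(dec_layers)
        ],
    }

if __name__ == "__main__":
    print(sample_subtransformer())
```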

The approach involves training a SuperTransformer that encompasses the entire design space and shares weights among all candidates, so that numerous SubTransformers can be extracted efficiently without training each from scratch. An evolutionary search, guided by a hardware latency predictor rather than FLOPs, then identifies the SubTransformer best suited to the target hardware's latency constraint.
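The following is a minimal sketch of the two mechanisms described above, under stated assumptions: weight sharing is approximated by slicing the leading rows and columns of a SuperTransformer weight matrix, and the evolutionary search keeps only candidates whose predicted latency fits the hardware budget. `predict_latency`, `validation_loss`, `mutate`, `crossover`, and `sample_config` are hypothetical stand-ins for the paper's components; the actual implementation is in the released repository.

```python
import copy
import random

import torch

def slice_weight(super_weight: torch.Tensor, out_dim: int, in_dim: int) -> torch.Tensor:
    """Weight sharing: a SubTransformer reuses the front part of the
    SuperTransformer's weight matrix instead of training from scratch."""
    return super_weight[:out_dim, :in_dim]

def evolutionary_search(population_size, generations, latency_budget_ms,
                        predict_latency, validation_loss, mutate, crossover,
                        sample_config):
    """Latency-constrained evolutionary search over SubTransformer configs.

    predict_latency(cfg) -> float : latency predictor for the target hardware
    validation_loss(cfg) -> float : loss of the weight-shared SubTransformer
    mutate(cfg), crossover(a, b)  : produce new candidate configs
    sample_config()               : random config from the design space
    """
    # Seed the population with random candidates that satisfy the budget.
    population = []
    while len(population) < population_size:
        cfg = sample_config()
        if predict_latency(cfg) <= latency_budget_ms:
            population.append(cfg)

    for _ in range(generations):
        # Rank by validation loss of the weight-shared SubTransformer.
        population.sort(key=validation_loss)
        parents = population[: population_size // 2]

        children = []
        while len(children) < population_size - len(parents):
            if random.random() < 0.5:
                child = mutate(copy.deepcopy(random.choice(parents)))
            else:
                child = crossover(random.choice(parents), random.choice(parents))
            # Hard constraint: discard children predicted to be too slow.
            if predict_latency(child) <= latency_budget_ms:
                children.append(child)

        population = parents + children

    return min(population, key=validation_loss)
```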

Experimental Results

HAT's performance was evaluated across multiple machine translation tasks, demonstrating its versatility. On the WMT'14 En-De task running on a Raspberry Pi-4, HAT achieved a 3× speedup and a 3.7× reduction in model size over the baseline Transformer with no loss in accuracy. Compared to the Evolved Transformer, HAT matched the BLEU score while delivering a 2.7× speedup and a 3.6× smaller model at 12,041× lower search cost, highlighting the effectiveness of hardware-specific optimization.

The paper also presents detailed comparisons across hardware platforms, emphasizing that distinct hardware environments call for specialized architectures: GPU-optimized models tend to be wide and shallow, whereas models optimized for ARM CPUs are deep and narrow, reflecting the stark differences in performance characteristics across hardware.

Implications and Future Prospects

The implications of this work are multifaceted. Practically, HAT enables the deployment of Transformer models on edge devices and other hardware with stringent computational limits, broadening the scope of NLP applications. Theoretically, it underscores the significance of incorporating hardware-specific feedback in the design of deep learning models, challenging the conventional reliance on FLOPs for efficiency assessment.
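As an illustration of replacing FLOP counts with direct hardware feedback, the sketch below fits a small regressor that maps an encoded architecture to latency measured on the target device. The feature encoding, network size, and training setup are assumptions for illustration; the paper describes a predictor of this kind trained on profiled (architecture, latency) pairs.

```python
import torch
import torch.nn as nn

class LatencyPredictor(nn.Module):
    """Small MLP mapping an architecture feature vector to predicted latency (ms)."""
    def __init__(self, feature_dim: int, hidden: int = 400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train_predictor(features: torch.Tensor, latencies_ms: torch.Tensor,
                    epochs: int = 200, lr: float = 1e-3) -> LatencyPredictor:
    """Fit the predictor on (architecture features, measured latency) pairs
    profiled once per target hardware platform."""
    model = LatencyPredictor(features.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), latencies_ms)
        loss.backward()
        opt.step()
    return model
```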

The introduction of arbitrary encoder-decoder attention and heterogeneous layer architectures into the design space is a key contribution that enhances the adaptability of models to varying hardware constraints. The low cost of the proposed search process, attributable to weight sharing in the SuperTransformer, opens up avenues for efficient model specialization in domains beyond NLP.
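A hedged sketch of arbitrary encoder-decoder attention: each decoder layer draws its keys and values from the outputs of its last `attend_k` encoder layers rather than only the final one. Concatenating those outputs along the sequence dimension is one plausible realization used here for illustration, and `nn.MultiheadAttention` stands in for the actual cross-attention module; the paper's exact fusion may differ.

```python
import torch
import torch.nn as nn

class ArbitraryEncDecAttention(nn.Module):
    """Cross-attention whose keys/values come from the last `attend_k`
    encoder layers (an illustrative realization, not the paper's exact code)."""
    def __init__(self, embed_dim: int, num_heads: int, attend_k: int):
        super().__init__()
        self.attend_k = attend_k
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, decoder_states, encoder_layer_outputs):
        # encoder_layer_outputs: list of [batch, src_len, embed_dim], one per encoder layer.
        # Assumption: outputs of the chosen layers are concatenated along the
        # sequence dimension so the decoder attends to all of them jointly.
        memory = torch.cat(encoder_layer_outputs[-self.attend_k:], dim=1)
        out, _ = self.attn(decoder_states, memory, memory)
        return out

# Usage: a decoder layer with attend_k=3 attends to the last three encoder layers.
if __name__ == "__main__":
    enc_outs = [torch.randn(2, 10, 512) for _ in range(6)]
    dec = torch.randn(2, 7, 512)
    layer = ArbitraryEncDecAttention(embed_dim=512, num_heads=8, attend_k=3)
    print(layer(dec, enc_outs).shape)  # torch.Size([2, 7, 512])
```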

Future developments could extend the HAT framework to integrate additional optimization techniques, such as knowledge distillation and quantization, more tightly, further improving model efficiency and performance. Combining HAT with hardware accelerators designed specifically for Transformer architectures is another promising direction that could yield even greater efficiency gains.

In summary, the HAT framework represents a significant step toward democratizing the deployment of high-performance NLP models across diverse computational environments, offering a scalable, hardware-conscious approach to deep learning model design.
