
ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers (2307.03493v2)

Published 7 Jul 2023 in cs.AR and cs.LG

Abstract: Transformer networks have emerged as the state-of-the-art approach for natural language processing tasks and are gaining popularity in other domains such as computer vision and audio processing. However, the efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensities, large memory requirements, and complex dataflow dependencies. In this work, we propose ITA, a novel accelerator architecture for transformers and related models that targets efficient inference on embedded systems by exploiting 8-bit quantization and an innovative softmax implementation that operates exclusively on integer values. By computing on-the-fly in streaming mode, our softmax implementation minimizes data movement and energy consumption. ITA achieves competitive energy efficiency with respect to state-of-the-art transformer accelerators with 16.9 TOPS/W, while outperforming them in area efficiency with 5.93 TOPS/mm$^2$ in 22 nm fully-depleted silicon-on-insulator technology at 0.8 V.

ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers

The paper presents a novel architecture, named ITA (Integer Transformer Accelerator), which aims to optimize the performance and energy efficiency of transformer model inference on embedded systems. Leveraging 8-bit quantization and a unique integer-only softmax implementation, ITA addresses the intrinsic challenges of transformer networks, such as high arithmetic intensity and substantial memory requirements.

Architectural Design

ITA integrates several design innovations to enhance energy efficiency and area efficiency. Key elements of its architecture include:

  1. 8-bit Quantization: ITA executes transformer models with 8-bit integer quantization, maintaining performance on par with floating-point counterparts while significantly reducing the complexity and size of the computational units.
  2. Softmax Implementation: A central innovation of ITA is that the softmax operation is performed directly on quantized integer values, in a streaming fashion as the attention scores are produced. This minimizes data movement and energy consumption by avoiding the repeated memory accesses a floating-point softmax would otherwise require (a minimal sketch follows this list).
  3. Weight-Stationary Dataflow: ITA employs a weight-stationary dataflow, in which weights are held fixed in the processing elements (PEs) while inputs stream through them. This enhances power efficiency by minimizing data movement, and a double-buffered weight buffer prefetches the next set of weights to hide weight-loading latency (a second sketch, combining the quantized matrix multiply with this dataflow, also follows the list).
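
To make the streaming softmax idea concrete, the sketch below normalizes a row of integer attention scores using only additions, shifts, and a final integer division, maintaining a running maximum and a running sum as the scores arrive. The exponential is approximated by a power of two controlled by a hypothetical `scale_shift` parameter; this illustrates the on-the-fly, integer-only principle and is not the exact approximation implemented in ITA's hardware.

```python
def streaming_int_softmax(scores, scale_shift=4, out_bits=8):
    """Integer-only, streaming softmax sketch (two passes over the scores).

    scores      : sequence of integer attention scores (e.g. int8 logits).
    scale_shift : hypothetical shift folding the quantization scale and
                  1/ln(2) into a power of two, so exp(x) ~ 2**(x >> scale_shift).
    out_bits    : bit-width of the normalized output probabilities.
    """
    ONE = 1 << 16                    # fixed-point representation of 1.0
    q_max = (1 << out_bits) - 1      # largest representable output value

    # Pass 1: stream over the scores, keeping a running maximum and a running
    # sum of approximate exponentials; rescale the sum whenever the max grows.
    run_max, run_sum = None, 0
    for x in scores:
        x = int(x)
        if run_max is None:
            run_max = x
        elif x > run_max:
            run_sum >>= (x - run_max) >> scale_shift   # renormalize old terms
            run_max = x
        run_sum += ONE >> ((run_max - x) >> scale_shift)

    # Pass 2: emit quantized probabilities that sum to roughly q_max.
    return [(ONE >> ((run_max - int(x)) >> scale_shift)) * q_max // run_sum
            for x in scores]

print(streaming_int_softmax([12, 40, 40, -7, 25]))   # [34, 68, 68, 17, 68]
```

Because only the running maximum and running sum are carried between scores, normalization can happen as the attention scores are produced, so no intermediate higher-precision tensor has to be written back to memory; this is the property the accelerator exploits to cut data movement.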

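The quantization and dataflow points can be illustrated together. The following sketch is a software analogue, under assumed shapes and a hypothetical `tile` parameter, of an int8, weight-stationary matrix multiply: one weight tile stays resident while activation rows stream past it and partial sums accumulate at int32 precision. Requantization scales and the double-buffered weight prefetch are left out for brevity.

```python
import numpy as np

def weight_stationary_matmul_int8(x_q, w_q, tile=16):
    """Sketch of an int8 weight-stationary matrix multiply.

    x_q  : (T, K) int8 activations, streamed row by row.
    w_q  : (K, N) int8 weights, kept "stationary" one column tile at a time.
    tile : hypothetical width of the PE array, i.e. how many weight columns
           are resident at once.
    """
    T, K = x_q.shape
    _, N = w_q.shape
    acc = np.zeros((T, N), dtype=np.int32)      # wide accumulators

    for n0 in range(0, N, tile):                # load one weight tile ...
        w_tile = w_q[:, n0:n0 + tile].astype(np.int32)
        for t in range(T):                      # ... and stream every input row past it
            acc[t, n0:n0 + tile] = x_q[t].astype(np.int32) @ w_tile
    return acc

# Example: 4 tokens, 8-dimensional embeddings, 32 output channels.
rng = np.random.default_rng(0)
x_q = rng.integers(-128, 128, size=(4, 8), dtype=np.int8)
w_q = rng.integers(-128, 128, size=(8, 32), dtype=np.int8)
print(weight_stationary_matmul_int8(x_q, w_q).shape)    # (4, 32)
```

In this schedule each weight value is fetched from memory only once per output tile, while the much smaller activations are re-read from a local buffer, which is the data-movement pattern a weight-stationary dataflow is designed to exploit.
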
Performance and Results

In detailed evaluations, ITA demonstrates competitive energy and area efficiency compared with contemporary transformer accelerators. Implemented in 22 nm fully-depleted silicon-on-insulator (FD-SOI) technology and operated at 0.8 V, ITA achieves an energy efficiency of 16.9 tera operations per second per watt (TOPS/W) and an area efficiency of 5.93 tera operations per second per square millimeter (TOPS/mm²). These figures reflect ITA's focus on efficient execution within the tight energy and area budgets typical of embedded environments.
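
As a rough sanity check on what the headline energy figure implies (simple arithmetic on the reported number, not a result from the paper), inverting 16.9 TOPS/W gives the average energy per operation:

```python
# Energy per operation implied by 16.9 TOPS/W: 1 W / 16.9e12 ops/s, in femtojoules.
energy_per_op_fJ = 1.0 / 16.9e12 * 1e15
print(f"~{energy_per_op_fJ:.0f} fJ per operation")   # ~59 fJ
```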

Comparative Analysis

Compared with other accelerators, such as those proposed by Keller et al. and Wang et al., ITA achieves competitive energy efficiency and superior area efficiency despite being implemented in a less advanced technology node. Its integer-only softmax, executed directly on quantized values, is particularly notable because it eliminates the need for costly floating-point units.

Implications and Future Directions

ITA's efficient use of quantization and other lightweight computational strategies positions it as a significant advancement in the deployment of transformer models on low-power devices. By addressing data movement challenges and minimizing power consumption, ITA opens avenues for wider application of powerful NLP models in resource-constrained scenarios.

The paper suggests potential improvements, such as further reducing the bit-width of operations or extending quantization to other components of transformer models, offering a roadmap for future research on making such models more efficient and applicable across more domains.

Conclusion

ITA sets a benchmark for energy-efficient transformer accelerators through its integer-only computation, in particular its streaming softmax, and its data-movement-conscious architecture. The paper details a practical pathway toward running advanced AI models in contexts where power and computational resources are limited, contributing valuable insights to the field of hardware accelerators for artificial intelligence.

Authors (7)
  1. Gamze İslamoğlu (3 papers)
  2. Moritz Scherer (12 papers)
  3. Gianna Paulin (8 papers)
  4. Tim Fischer (17 papers)
  5. Victor J. B. Jung (7 papers)
  6. Angelo Garofalo (33 papers)
  7. Luca Benini (362 papers)