Optimizing Tiny Transformers Deployment on Low-Power Microcontrollers
Introduction to the Framework
The recent surge in deploying Transformer models for edge computing underscores the need for efficient implementation strategies, especially on low-power microcontroller units (MCUs). This paper introduces a comprehensive framework for deploying encoder-based Tiny Transformers across multiple commercial MCUs. Its key contributions include a novel library of optimized kernels for the efficient execution of Multi-Head Self-Attention (MHSA), the mechanism fundamental to Transformer architectures, together with a Fused-Weight Self-Attention (FWSA) inference schedule and a Depth-First Tiling (DFT) scheme aimed at minimizing the memory footprint and computational overhead of MHSA operations.
Attention on Edge
The efficient execution of Transformer models on MCUs faces unique challenges, primarily due to the demanding memory and computation requirements of the attention mechanism. This paper's approach modifies traditional attention computations by introducing fused-weight and depth-first tiling strategies to mitigate these challenges.
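To make this bottleneck concrete, the following minimal NumPy sketch shows a single reference attention head in which the full S x S score matrix is materialized; the function name and shapes are illustrative and do not correspond to the paper's C kernels.

    import numpy as np

    def single_head_attention(x, w_q, w_k, w_v):
        """Reference single-head attention for a sequence of S tokens:
        the (S, S) score matrix is built in full, which dominates memory
        for longer sequences."""
        q = x @ w_q                                   # (S, d) queries
        k = x @ w_k                                   # (S, d) keys
        v = x @ w_v                                   # (S, d) values
        scores = q @ k.T / np.sqrt(q.shape[-1])       # (S, S) attention map
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        return probs @ v                              # (S, d) head output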
The proposed Fused-Weight Self-Attention (FWSA) method reduces computational cost by fusing the linear projection weights for queries and keys into a single matrix, cutting both the parameter count and the number of operations required. This approach is particularly beneficial for models with a small embedding size (E), where it yields clear reductions in both latency and memory.
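As a rough illustration of the weight-fusion idea, the query and key projections can be folded into a single matrix offline, since (X Wq)(X Wk)^T = X (Wq Wk^T) X^T. The NumPy sketch below assumes per-head projections of size E x E purely for simplicity; it demonstrates the algebraic equivalence rather than the paper's actual kernel implementation.

    import numpy as np

    def fwsa_scores(x, w_q, w_k):
        """Fused-weight attention scores: the E x E product Wq @ Wk.T is
        computed once offline, so inference performs a single projection
        instead of separate query and key projections."""
        w_fused = w_q @ w_k.T          # (E, E), folded ahead of time
        return (x @ w_fused) @ x.T     # (S, S) attention scores

    # Equivalence check against the unfused formulation (illustrative shapes).
    rng = np.random.default_rng(0)
    S, E = 16, 8                       # sequence length, embedding size
    x = rng.standard_normal((S, E))
    w_q = rng.standard_normal((E, E))
    w_k = rng.standard_normal((E, E))
    assert np.allclose(fwsa_scores(x, w_q, w_k), (x @ w_q) @ (x @ w_k).T)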
The Depth-First Tiling (DFT) method addresses the high memory footprint of attention-map computation by executing it piecewise, so the full matrix is never materialized in memory. This technique reduces peak memory usage by up to 6.19 times in some instances, making it especially effective for cache-less MCU devices.
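A simplified sketch of the tiling idea follows, again in NumPy and with an arbitrary tile size rather than the library's actual scheduling: the attention map is computed and consumed one block of rows at a time, so only a (tile_rows, S) slice is ever resident.

    import numpy as np

    def tiled_attention(q, k, v, tile_rows=8):
        """Depth-first-style tiling: each row block of the attention map is
        computed, normalized, and multiplied with V before the next block,
        avoiding the full (S, S) matrix."""
        S, d = q.shape
        out = np.empty_like(v)
        for start in range(0, S, tile_rows):
            stop = min(start + tile_rows, S)
            scores = q[start:stop] @ k.T / np.sqrt(d)      # (tile_rows, S) tile
            scores -= scores.max(axis=-1, keepdims=True)   # stable softmax per row
            probs = np.exp(scores)
            probs /= probs.sum(axis=-1, keepdims=True)
            out[start:stop] = probs @ v                    # consume tile immediately
        return out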
Qualitative and Quantitative Enhancements
The paper reports a comprehensive evaluation of the proposed framework on a range of MCUs based on the ARM and RISC-V Instruction Set Architectures (ISAs), showing substantial improvements over state-of-the-art (SotA) libraries: on average, 4.79 times lower latency than ARM's CMSIS-NN library and 2 times lower latency than the RISC-V PULP-NN library.
A series of micro-benchmarks on the MHSA and FWSA operations highlights how performance scales across input dimensions and how efficiently the kernels parallelize on multi-core platforms. An ablation study isolates the individual contributions of the FWSA and DFT optimizations to the runtime and memory savings.
Practical Implications and Future Directions
In practical terms, this research enhances the deployment flexibility and efficiency of Tiny Transformers across a spectrum of IoT endpoints. By mitigating memory and computational bottlenecks, the framework enables more advanced on-device inference tasks within strict power and performance constraints.
This paper lays the groundwork for future research in the optimization of Transformer models for edge computing. Future work may explore the extension of these optimizations to other Transformer variants and the automatic generation of optimized tiling strategies based on model and hardware profiles.
The open-source availability of this framework encourages further community engagement and development, potentially expanding its applicability and improving the robustness of Tiny Transformers on low-power MCUs.