Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers

Published 8 Mar 2025 in cs.LG, cs.AI, cs.DC, and cs.PF | arXiv:2503.06183v2

Abstract: The acceleration of pruned Deep Neural Networks (DNNs) on edge devices such as Microcontrollers (MCUs) is a challenging task, given the tight area- and power-constraints of these devices. In this work, we propose a three-fold contribution to address this problem. First, we design a set of optimized software kernels for N:M pruned layers, targeting ultra-low-power, multicore RISC-V MCUs, which are up to 2.1x and 3.4x faster than their dense counterparts at 1:8 and 1:16 sparsity, respectively. Then, we implement a lightweight Instruction-Set Architecture (ISA) extension to accelerate the indirect load and non-zero indices decompression operations required by our kernels, obtaining up to 1.9x extra speedup, at the cost of a 5% area overhead. Lastly, we extend an open-source DNN compiler to utilize our sparse kernels for complete networks, showing speedups of 3.21x and 1.81x on a ResNet18 and a Vision Transformer (ViT), with less than 1.5% accuracy drop compared to a dense baseline.

Summary

The paper "Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers" proposes a combined software-hardware approach to accelerating sparse deep neural networks (DNNs) on resource-constrained microcontrollers (MCUs). With the growing demand for local execution of DNNs on IoT devices, optimizing these models to run within tight power and memory budgets is essential.

Overview and Contributions

The authors introduce a multifaceted strategy to address the challenges of executing pruned DNNs on MCUs. The contributions can be delineated as follows:

  1. Optimized Software Kernels: The paper details efficient software kernels tailored to N:M pruned layers on ultra-low-power, multicore RISC-V MCUs, considering sparsity levels of 1:4, 1:8, and 1:16. These kernels achieve speedups from 1.1x to 3.4x over their dense counterparts, depending on the layer type and sparsity level.

  2. ISA Extensions: A key contribution is a lightweight Instruction-Set Architecture (ISA) extension that accelerates the non-zero index decompression and indirect load operations required by the sparse kernels. The proposed xDecimate instruction yields up to 1.9x additional speedup at the cost of only a 5% area overhead (a software-only sketch of these operations follows this list).

  3. Integration with DNN Compiler: The optimized kernels are integrated into an open-source DNN compiler extended to support sparse layers, enabling end-to-end deployment. Speedups of 3.21x and 1.81x are observed on a ResNet18 and a Vision Transformer (ViT), respectively, with less than a 1.5% accuracy drop relative to the dense baseline.
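
To make the role of the kernels and the ISA extension concrete, below is a minimal, illustrative C sketch of a 1:16 sparse dot product operating on a compressed weight stream: non-zero weights are stored contiguously, and a packed stream of 4-bit indices records each non-zero's position within its group of 16 activations. The function name, data layout, and packing are assumptions made for illustration, not the paper's actual kernel code; the index unpacking and the indirect activation load in the loop body are the operations the xDecimate extension is reported to accelerate.

    #include <stdint.h>

    /* Illustrative 1:16 sparse dot product (int8 weights/activations, int32 accumulator).
     * - wval: one non-zero int8 weight per group of 16 activations
     * - widx: packed 4-bit indices, two per byte, giving each non-zero's
     *         position inside its group of 16
     * - x:    dense activation vector of length 16 * n_groups
     * Names and packing are assumptions for illustration only. */
    static int32_t sparse_dot_1_16(const int8_t *wval, const uint8_t *widx,
                                   const int8_t *x, int n_groups)
    {
        int32_t acc = 0;
        for (int g = 0; g < n_groups; g++) {
            /* Software index decompression: unpack the 4-bit index of group g. */
            uint8_t packed = widx[g >> 1];
            uint8_t idx = (g & 1) ? (uint8_t)(packed >> 4) : (uint8_t)(packed & 0x0F);

            /* Indirect load of the matching activation, then multiply-accumulate.
             * These are the steps a dedicated instruction such as the paper's
             * xDecimate extension can accelerate in hardware. */
            acc += (int32_t)wval[g] * (int32_t)x[g * 16 + idx];
        }
        return acc;
    }

In a pure-software kernel, this unpack-then-gather pattern costs several instructions per non-zero weight, which is why offloading it to a dedicated instruction can yield the reported additional speedup.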

Numerical Results and Implications

The reported speedups and reduced memory footprints underscore the viability of executing sparse DNNs on MCUs. The practical implications of this research are manifold:

  • Energy Efficiency: By reducing latency and memory usage, the proposed methods can significantly improve the energy efficiency of DNN execution on edge devices.
  • Scalability: The small footprint of the proposed ISA extension (about 5% area overhead) allows it to be adopted across various MCU designs without substantial modification of existing architectures.
  • Deployment: Integration with a popular open-source compiler framework such as Apache TVM eases deployment into existing neural-network pipelines.

Theoretical Implications and Future Directions

Beyond the practical implications, the research contributes to the theoretical understanding of sparse DNN execution:

  • Sparse Format Efficiency: The demonstrated efficacy of N:M pruning as a middle ground between structured and unstructured sparsity opens avenues for further exploring the trade-off between sparsity level and computational benefit (see the encoding sketch after this list).
  • Instruction Design: The design of the xDecimate instruction offers insights into building efficient ISA extensions for specialized operations, which could inform future extensions targeting other sparse or irregular data-processing workloads.
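
As a rough illustration of why the N:M format is attractive on memory-constrained devices, the sketch below encodes a weight row that has already been pruned to 1:16 into the value and packed-index streams assumed in the earlier kernel sketch. With int8 weights, each group of 16 positions shrinks from 16 bytes to one value byte plus a 4-bit index, roughly a 10x reduction in weight storage. The format and names are illustrative assumptions, not the paper's actual encoding.

    #include <stdint.h>

    /* Illustrative encoder for a row already pruned to 1:16 sparsity:
     * each group of 16 int8 weights contains exactly one non-zero.
     * Emits one value byte per group and one 4-bit index per group
     * (two indices packed per byte), i.e. 1.5 bytes instead of 16.
     * Format and names are assumptions for illustration only. */
    static void encode_1_16(const int8_t *w_dense, int n_groups,
                            int8_t *wval, uint8_t *widx)
    {
        for (int g = 0; g < n_groups; g++) {
            uint8_t idx = 0;
            int8_t val = 0;
            for (int j = 0; j < 16; j++) {
                if (w_dense[g * 16 + j] != 0) {
                    idx = (uint8_t)j;               /* position of the non-zero */
                    val = w_dense[g * 16 + j];      /* its value */
                }
            }
            wval[g] = val;
            if (g & 1)
                widx[g >> 1] |= (uint8_t)(idx << 4); /* odd group: high nibble */
            else
                widx[g >> 1] = (uint8_t)(idx & 0x0F); /* even group: low nibble */
        }
    }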

Looking forward, this research opens several directions. It could spur further innovation in MCU architecture design aimed at supporting more complex DNN operations, and exploring even more aggressive sparsity formats and their impact on execution efficiency would be a natural extension.

In conclusion, the paper presents a significant step toward enabling energy-efficient and expedited execution of sparse DNNs on microcontrollers, contributing both practical solutions and theoretical insights into the deployment of TinyML on the edge.
