- The paper introduces Deeploy, a novel deep neural network compiler that generates optimized C code for deploying small language models on constrained microcontroller systems.
- It leverages a bottom-up compilation approach with pattern matching and constraint programming to manage memory and computation trade-offs effectively.
- Empirical results on the Siracusa platform show a 23x speedup and a 26x gain in energy efficiency over a non-KV-cached baseline, highlighting its practical impact on tinyML applications.
Deeploy: Enabling Energy-Efficient Deployment of Small Language Models On Heterogeneous Microcontrollers
The paper "Deeploy: Enabling Energy-Efficient Deployment of Small Language Models On Heterogeneous Microcontrollers" by Moritz Scherer et al. presents Deeploy, a novel Deep Neural Network (DNN) compiler designed for the efficient deployment of Small Language Models (SLMs) on heterogeneous microcontroller (MCU)-class systems. This work addresses the significant challenge of executing SLMs on constrained devices without relying on high-bandwidth off-chip memory.
Overview of Deeploy
Deeploy introduces a customizable, domain-specific framework that generates highly optimized C code suited for tinyML platforms with extreme memory and computation constraints. The compiler stands out by leveraging a bottom-up compilation approach, which contrasts with the traditional top-down methods used in other DNN compilers. This approach facilitates the integration of hand-optimized kernel libraries and supports the precise configuration required for heterogeneous systems with multiple accelerators.
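To make the bottom-up flow concrete, the sketch below shows how operator-to-kernel binding can work: each graph node is matched against a registry of hand-optimized kernel templates rather than being lowered into generic loop nests. All names here (registry entries, C symbols, the matching rule) are illustrative assumptions, not Deeploy's actual API.

```python
# Illustrative sketch of bottom-up kernel binding: operators are matched
# against a registry of hand-optimized kernel templates instead of being
# lowered into generic loop nests. All names (registry entries, C symbols)
# are hypothetical, not Deeploy's real API.
from dataclasses import dataclass

@dataclass
class KernelTemplate:
    op: str       # ONNX op type the template implements
    dtype: str    # operand type the hand-written kernel expects
    c_call: str   # C call emitted when the pattern matches

REGISTRY = [
    KernelTemplate("MatMul", "int8", "cluster_matmul_i8({A}, {B}, {C});"),
    KernelTemplate("Gelu",   "int8", "npu_gelu_i8({X}, {Y});"),
]

def bind_kernel(op: str, dtype: str) -> KernelTemplate:
    # Pattern matching: pick the first template whose signature fits.
    for template in REGISTRY:
        if template.op == op and template.dtype == dtype:
            return template
    raise LookupError(f"no kernel template for {op}/{dtype}")

print(bind_kernel("MatMul", "int8").c_call)
```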
Key Contributions
- Compiler Architecture:
- Frontend: Transforms input ONNX graphs and assigns kernel templates, incorporating platform-specific optimizations through pattern matching and type inference.
- Midend: Optimizes the execution schedule by computing geometric constraints for tensor tiling and solving static memory allocation through a Constraint Programming (CP) approach; a minimal sketch of CP-based tiling follows this list.
- Backend: Generates the final C code, leveraging code generation passes and a detailed memory hierarchy model to maximize the efficiency of data movement and computation.
- Handling Complexity of Transformers: Deeploy is particularly effective for transformers, managing the intricate memory and computation trade-offs inherent in these models, and supports KV caching, which is crucial for efficient autoregressive inference.
- Deployment on Siracusa:
- Demonstrated deployment on Siracusa, a state-of-the-art MCU with an octa-core RISC-V cluster and a neural processing unit (NPU).
- Achieved a throughput of 340 tokens per second with an energy cost of 490 µJ per token, marking a significant advancement in energy efficiency for tinyML applications.
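To illustrate the CP-based tiling mentioned in the Midend bullet above, here is a minimal sketch using Google OR-Tools' CP-SAT solver: tile sizes for a single matrix multiplication are chosen so that all tiles fit a fixed on-chip budget while maximizing the work done per tile. The shapes, the 128 KiB budget, and the objective are illustrative assumptions, not the paper's exact formulation, which spans whole networks and multiple memory levels.

```python
# Minimal sketch of tiling as a constraint program, in the spirit of
# Deeploy's midend. Matmul shapes, L1 budget, and objective are
# illustrative assumptions, not the paper's exact formulation.
from ortools.sat.python import cp_model

def tile_matmul(M, N, K, l1_budget_bytes, elem_bytes=1):
    model = cp_model.CpModel()
    # Decision variables: tile sizes along each matmul dimension.
    tm = model.NewIntVar(1, M, "tile_m")
    tn = model.NewIntVar(1, N, "tile_n")
    tk = model.NewIntVar(1, K, "tile_k")
    # Tile footprints in elements: A is tm x tk, B is tk x tn, C is tm x tn.
    a = model.NewIntVar(1, M * K, "a_elems")
    b = model.NewIntVar(1, K * N, "b_elems")
    c = model.NewIntVar(1, M * N, "c_elems")
    model.AddMultiplicationEquality(a, [tm, tk])
    model.AddMultiplicationEquality(b, [tk, tn])
    model.AddMultiplicationEquality(c, [tm, tn])
    # Geometric constraint: all three tiles must fit the on-chip budget.
    model.Add(a + b + c <= l1_budget_bytes // elem_bytes)
    # Objective: maximize MACs per tile to amortize DMA transfers.
    work = model.NewIntVar(1, M * N * K, "work")
    model.AddMultiplicationEquality(work, [a, tn])  # tm * tk * tn
    model.Maximize(work)
    solver = cp_model.CpSolver()
    assert solver.Solve(model) == cp_model.OPTIMAL
    return solver.Value(tm), solver.Value(tn), solver.Value(tk)

# Example: one int8 projection layer, 128 KiB of scratchpad.
print(tile_matmul(M=64, N=768, K=768, l1_budget_bytes=128 * 1024))
```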
Detailed Analysis
The paper provides numerical results highlighting Deeploy's efficiency. A noteworthy achievement is reducing data marshaling overhead to just 9% for certain transformer layers, even when the RISC-V cluster and the NPU are used concurrently. The compiled code fits the stringent memory constraints of MCUs by balancing tiling against static memory allocation; a toy memory-planning sketch follows.
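As a companion to the tiling sketch above, the toy planner below illustrates the static-allocation side: tensors with disjoint lifetimes can share on-chip addresses, which is what keeps the peak footprint within the MCU's budget. Deeploy addresses allocation with constraint programming; the greedy first-fit policy here is a simplified stand-in for illustration.

```python
# Toy static memory planner illustrating lifetime-based buffer sharing.
# Deeploy solves allocation with constraint programming; this greedy
# first-fit stand-in only conveys the idea that tensors with disjoint
# lifetimes can share the same on-chip addresses.
def plan_memory(buffers):
    """buffers: iterable of (name, size_bytes, first_use, last_use)."""
    placed = []   # (offset, size, first_use, last_use)
    offsets = {}
    # Place large buffers first: a common first-fit heuristic.
    for name, size, start, end in sorted(buffers, key=lambda b: -b[1]):
        # Address ranges already taken by lifetime-overlapping buffers.
        busy = sorted((off, off + sz) for off, sz, s, e in placed
                      if not (end < s or e < start))
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:
                break           # found a gap big enough
            offset = max(offset, hi)
        placed.append((offset, size, start, end))
        offsets[name] = offset
    peak = max((off + sz for off, sz, _, _ in placed), default=0)
    return offsets, peak

# Two activations alive at different steps can overlap in memory.
offsets, peak = plan_memory([
    ("act0", 64 * 1024, 0, 1),
    ("act1", 64 * 1024, 1, 2),   # overlaps act0's lifetime at step 1
    ("act2", 32 * 1024, 3, 4),   # disjoint: may reuse act0's offset
])
print(offsets, peak)
```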
End-to-End Deployment
The practical deployment results on the Siracusa platform are compelling:
- Autoregressive Inference: Uses on-chip KV caching to avoid recomputing keys and values for previously generated tokens. The 23x speedup and 26x energy efficiency improvement over conventional parallel inference reflect Deeploy's efficacy in real-world applications (see the sketch after this list).
- Benchmarking: Comparison with state-of-the-art systems shows superior energy efficiency and throughput; for instance, Deeploy outperforms MobileLLM deployed on an Apple A15 Bionic chip in both throughput and energy efficiency.
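For intuition on why KV caching yields such large gains, here is a minimal single-head sketch in NumPy: each decoding step appends one key/value pair and attends over the cache, avoiding the recomputation of keys and values for all previous tokens at every step. Shapes, dtypes, and names are illustrative; Deeploy realizes this with statically allocated on-chip buffers and generated C kernels.

```python
# Minimal single-head sketch of KV-cached autoregressive attention.
# Shapes, dtypes, and names are illustrative; Deeploy realizes this with
# statically allocated on-chip buffers and generated C kernels.
import numpy as np

class KVCache:
    def __init__(self, max_seq_len, d_head):
        # Statically sized cache, mirroring a fixed on-chip allocation.
        self.k = np.zeros((max_seq_len, d_head), dtype=np.float32)
        self.v = np.zeros((max_seq_len, d_head), dtype=np.float32)
        self.used = 0

    def step(self, q, k_new, v_new):
        # Append this token's key/value, then attend over the cache:
        # past keys/values are reused, never recomputed.
        self.k[self.used] = k_new
        self.v[self.used] = v_new
        self.used += 1
        scores = self.k[:self.used] @ q / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.v[:self.used]

# One decoding step with random projections for illustration.
rng = np.random.default_rng(0)
cache = KVCache(max_seq_len=128, d_head=64)
out = cache.step(*(rng.standard_normal(64, dtype=np.float32) for _ in range(3)))
print(out.shape)  # (64,)
```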
Implications for Future AI Developments
Deeploy's contributions go beyond immediate application to small LLMs. The methodological advancements in handling complex memory hierarchies and computation constraints provide a robust foundation for future research in deploying advanced AI models on edge devices. As SLMs and other efficient neural network architectures gain traction, tools like Deeploy will be essential in bridging the gap between high computational demands and the capabilities of low-power edge devices.
Future developments could extend Deeploy's applicability to emerging architectures, such as Compute-In-Memory (CIM) systems, and to other relevant AI workloads beyond SLMs. The compiler's flexible design makes it a valuable tool for a range of embedded-AI applications, paving the way for more intelligent, autonomous, and efficient devices.
In conclusion, Deeploy represents a significant step forward in the energy-efficient deployment of neural networks on constrained hardware, combining technical sophistication with practical efficiency. Its successful deployment on the Siracusa MCU validates its utility and sets a precedent for future innovations in tinyML and edge AI deployments.