
Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration (1911.09925v3)

Published 22 Nov 2019 in cs.DC, cs.AR, cs.LG, and cs.PF

Abstract: DNN accelerators are often developed and evaluated in isolation without considering the cross-stack, system-level effects in real-world environments. This makes it difficult to appreciate the impact of System-on-Chip (SoC) resource contention, OS overheads, and programming-stack inefficiencies on overall performance/energy-efficiency. To address this challenge, we present Gemmini, an open-source*, full-stack DNN accelerator generator. Gemmini generates a wide design-space of efficient ASIC accelerators from a flexible architectural template, together with flexible programming stacks and full SoCs with shared resources that capture system-level effects. Gemmini-generated accelerators have also been fabricated, delivering up to three orders-of-magnitude speedups over high-performance CPUs on various DNN benchmarks. * https://github.com/ucb-bar/gemmini

Citations (171)


Summary

  • The paper introduces Gemmini, a full-stack DNN accelerator generator that integrates hardware templates, programming interfaces, and OS-level support for realistic evaluations.
  • The paper employs a two-level hierarchical spatial design and automatic mapping from ONNX models to accelerate deep learning workloads, achieving significant speedups over baseline CPUs.
  • The paper’s case studies reveal that tailored memory partitioning and virtual address translation can optimize resource usage and improve overall system performance in multi-core environments.

Overview of Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration

Introduction

The paper presents Gemmini, an open-source, full-stack DNN accelerator generator designed to confront the challenges of evaluating deep learning accelerators in real-world systems. By integrating the entire stack, Gemmini captures system-level impacts, often overlooked in isolated evaluations of accelerators. The platform offers a comprehensive solution including hardware templates, programming interfaces, and full System-on-Chip (SoC) integration, leveraging the RISC-V ecosystem.

Architectural Design

Gemmini provides a flexible architectural template that lets users explore configurations across the performance, efficiency, and extensibility spectrum. The central compute unit is a two-level hierarchical spatial architecture composed of processing elements (PEs), supporting both vector- and systolic-based designs. This flexibility enables quantitative comparison of architectural variations, so researchers can evaluate trade-offs between pipelined (systolic) and parallel (vector) execution strategies directly.
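To make the tiled, output-stationary execution of such a spatial array concrete, here is a small behavioral sketch in Python. This is an illustrative model of how a dim x dim PE grid computes a matmul tile by tile, not the actual Chisel generator, and the dim parameter and output-stationary dataflow are chosen for illustration:

```python
import numpy as np

def pe_array_matmul(A, B, dim=4):
    """Compute C = A @ B by tiling onto a dim x dim grid of
    multiply-accumulate processing elements (PEs).

    Each dim x dim output tile is held stationary in the PE grid
    (output-stationary dataflow) while input tiles stream through.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % dim == 0 and N % dim == 0 and K % dim == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, dim):              # which output tile (row)
        for j in range(0, N, dim):          # which output tile (col)
            acc = np.zeros((dim, dim), dtype=A.dtype)  # PE accumulators
            for k in range(0, K, dim):      # stream input tiles through
                acc += A[i:i+dim, k:k+dim] @ B[k:k+dim, j:j+dim]
            C[i:i+dim, j:j+dim] = acc       # write tile to accumulator SRAM
    return C
```

The inner `k` loop models the reduction each PE performs locally before its accumulated result is written back, which is the key property that lets the array avoid round-trips to memory between partial sums.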

Programming and System Support

Gemmini addresses programming challenges with a multi-level software stack. High-level DNN descriptions in ONNX format can be mapped onto accelerators automatically, while low-level APIs serve developers who need granular control. Notably, Gemmini supports virtual memory, a feature rarely provided in accelerator environments. Because address translation is handled in hardware, programs can pass virtual addresses directly to the accelerator without special drivers, simplifying accelerator programming.
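The gap between the high- and low-level interfaces can be illustrated with a sketch of how one large matrix multiplication is lowered into a sequence of accelerator-level commands. The command names here (`mvin`, `compute`, `mvout`) are illustrative stand-ins for a data-move/compute/write-back style ISA, not a faithful rendering of Gemmini's actual instruction set:

```python
def lower_matmul(M, N, K, dim=16):
    """Hypothetical lowering: expand a (M,K)x(K,N) matmul into a
    flat program of tile-granularity accelerator commands for a
    dim x dim spatial array. Command names are illustrative."""
    prog = []
    for i in range(0, M, dim):
        for j in range(0, N, dim):
            for k in range(0, K, dim):
                prog.append(("mvin_A", i, k))      # DMA A tile to scratchpad
                prog.append(("mvin_B", k, j))      # DMA B tile to scratchpad
                prog.append(("compute", i, j, k))  # MAC tile into accumulator
            prog.append(("mvout_C", i, j))         # write result tile to DRAM
    return prog
```

A high-level flow (e.g. from an ONNX graph) would emit programs like this automatically per layer, while the low-level API exposes the individual commands so a developer can hand-tune tiling and data movement.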

At the system level, Gemmini promotes full SoC integration to explore impacts such as resource contention and OS overhead. Configurations range from single-core systems to complex multi-core environments, all capable of running Linux. This enables evaluation and debugging within realistic software stacks, unearthing potential inefficiencies often masked in bare-metal environments.

Performance Evaluation

Gemmini-generated accelerators have been fabricated in advanced process technologies, achieving substantial speedups over baseline CPUs. The evaluation shows that architecture and configuration adjustments yield significant performance gains. For instance, equipping accelerators with dedicated computation blocks such as im2col units offloads work that would otherwise bottleneck the host CPU, underscoring the value of CPU-accelerator co-design.
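The im2col transformation mentioned above unrolls convolution input windows into matrix columns so that convolution becomes a single matmul the spatial array can execute directly; when done on the host CPU it can dominate runtime, which is why moving it into hardware helps. A minimal reference implementation (single image, no padding) looks like this:

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unroll each kh x kw input window of x (shape C x H x W) into a
    column, so conv(x, w) == w_matrix @ im2col(x, kh, kw)."""
    C, H, W = x.shape
    oh = (H - kh) // stride + 1     # output height
    ow = (W - kw) // stride + 1     # output width
    cols = np.empty((C * kh * kw, oh * ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[:, i * ow + j] = patch.ravel()
    return cols
```

The nested loops and strided copies make the data-movement cost visible: every input pixel is duplicated up to kh*kw times, which is cheap for a dedicated hardware block streaming from scratchpad but expensive for a host CPU doing it through the cache hierarchy.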

Case Studies

The paper extends its contribution through practical case studies showcasing Gemmini's potential in full system co-design:

  1. Virtual Address Translation: Gemmini allows exploration and customization of virtual address translation schemes. A notable innovation is a small filter register that caches the most recent translation, reducing TLB access latency and improving performance with minimal hardware overhead.
  2. System-Level Resource Partitioning: By evaluating different memory partition configurations, the studies show the importance of balancing private scratchpad against shared cache capacity. Memory design choices can substantially affect performance, particularly in multi-core environments with high resource contention.
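The filter-register idea in the first case study can be sketched in a few lines. This is a behavioral model under assumed parameters (4 KiB pages, a single cached entry), not the hardware design itself:

```python
class FilterRegister:
    """Sketch of a translation filter register: cache the most recent
    virtual-to-physical page mapping so repeated accesses to the same
    page skip the (slower) TLB lookup entirely."""
    PAGE_SHIFT = 12                    # 4 KiB pages (assumption)

    def __init__(self, tlb_lookup):
        self.tlb_lookup = tlb_lookup   # fallback translation function
        self.vpn = None                # cached virtual page number
        self.ppn = None                # cached physical page number
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> self.PAGE_SHIFT
        offset = vaddr & ((1 << self.PAGE_SHIFT) - 1)
        if vpn == self.vpn:            # filter hit: no TLB access needed
            self.hits += 1
        else:                          # filter miss: consult the TLB
            self.misses += 1
            self.vpn, self.ppn = vpn, self.tlb_lookup(vpn)
        return (self.ppn << self.PAGE_SHIFT) | offset
```

Because an accelerator's DMA traffic is highly sequential, consecutive accesses usually fall on the same page, so almost all translations hit the filter register; this locality is what lets a single register deliver most of the latency benefit at negligible area cost.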

Implications and Future Developments

Gemmini's systematic approach opens new avenues in the evaluation and design of deep-learning systems, emphasizing the importance of integrated solutions over isolated component assessments. The platform's flexibility and openness also lay the groundwork for future research into co-design practices, exploring the interplay between hardware, software, and system architecture.

Future AI systems could benefit substantially from Gemmini's framework, particularly in domains requiring adaptable architectures and resource-efficient deployment. The full-stack perspective Gemmini encourages marks a shift toward holistic design principles in AI hardware research, grounding innovation in realistic operational constraints.

Conclusion

Overall, Gemmini provides a robust infrastructure for the systematic evaluation and development of DNN accelerator architectures. By integrating all system components, from hardware to software to operating systems, researchers can comprehensively assess and optimize deep learning architectures for current and future applications.
