- The paper introduces Gemmini, a full-stack DNN accelerator generator that integrates hardware templates, programming interfaces, and OS-level support for realistic evaluations.
- The paper employs a two-level hierarchical spatial design and automatic mapping from ONNX models to accelerate deep learning workloads, achieving significant speedups over baseline CPUs.
- The paper’s case studies reveal that tailored memory partitioning and virtual address translation can optimize resource usage and improve overall system performance in multi-core environments.
Overview of Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration
Introduction
The paper presents Gemmini, an open-source, full-stack DNN accelerator generator designed to address the challenges of evaluating deep-learning accelerators in real-world systems. By integrating the entire stack, Gemmini captures system-level effects that are often overlooked when accelerators are evaluated in isolation. The platform offers a comprehensive solution, including hardware templates, programming interfaces, and full system-on-chip (SoC) integration, built on the RISC-V ecosystem.
Architectural Design
Gemmini provides a flexible architectural template that lets users explore configurations across the spectrum of performance, efficiency, and extensibility. Its central compute unit is a two-level hierarchical spatial array: a mesh of tiles, each composed of processing elements (PEs), which can be parameterized to realize both vector-like and systolic designs. This flexibility enables quantitative comparison of architectural variants, allowing researchers to evaluate trade-offs between pipelined (systolic) and parallel (vector) execution strategies.
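The two-level hierarchy can be illustrated with a minimal Python model (the real generator is written in Chisel, and the parameter names below are hypothetical): a mesh of tiles, each a small grid of multiply-accumulate PEs, together forming an output-stationary matrix-multiply array.

```python
# Illustrative model of a two-level spatial array (hypothetical parameter
# names; Gemmini's actual template is a parameterized Chisel generator).
MESH_ROWS, MESH_COLS = 2, 2   # grid of tiles (pipelined in hardware)
TILE_ROWS, TILE_COLS = 2, 2   # PEs inside each tile (combinational)

DIM = MESH_ROWS * TILE_ROWS   # total PE rows = array dimension

def pe_mac(acc, a, b):
    """One processing element: a single multiply-accumulate."""
    return acc + a * b

def spatial_matmul(A, B):
    """DIM x DIM output-stationary matmul: each PE 'owns' one C element,
    while operands are streamed through the array over time (the k loop)."""
    C = [[0] * DIM for _ in range(DIM)]
    for k in range(DIM):              # temporal: values streamed in
        for i in range(DIM):          # spatial: PE row (mesh row * tile row)
            for j in range(DIM):      # spatial: PE column
                C[i][j] = pe_mac(C[i][j], A[i][k], B[k][j])
    return C
```

Varying how much of the structure is pipelined (between tiles) versus combinational (within tiles) is what moves a design along the systolic-to-vector spectrum.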
Programming and System Support
Gemmini addresses programmability through a multi-level software stack. High-level DNN descriptions in ONNX format can be mapped onto generated accelerators automatically, while low-level APIs give developers fine-grained control. Notably, Gemmini supports virtual memory, a feature often left unexplored in accelerator environments; with address translation handled in hardware, accelerators can be programmed without special drivers.
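The kind of fine-grained control the low-level API exposes can be sketched in Python (this is a conceptual model, not Gemmini's actual C intrinsics): the programmer explicitly blocks a large matrix multiplication into scratchpad-sized tiles, moves each tile in, computes on it, and accumulates the result.

```python
TILE = 4  # scratchpad tile dimension (illustrative, not a real Gemmini default)

def tiled_matmul(A, B, n):
    """Blocked n x n matmul mirroring the shape of a low-level accelerator
    program: for each tile, (1) move operands into the scratchpad,
    (2) compute on the spatial array, (3) accumulate partial sums."""
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, TILE):
        for j0 in range(0, n, TILE):
            for k0 in range(0, n, TILE):
                # Steps (1)-(3) for one (i0, j0, k0) tile triple:
                for i in range(i0, min(i0 + TILE, n)):
                    for j in range(j0, min(j0 + TILE, n)):
                        acc = 0
                        for k in range(k0, min(k0 + TILE, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] += acc   # accumulate partial products
    return C
```

The high-level ONNX path automates exactly this kind of tiling and data movement, so most users never write the loop nest by hand.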
At the system level, Gemmini promotes full SoC integration to explore impacts such as resource contention and OS overhead. Configurations range from single-core systems to complex multi-core environments, all capable of running Linux. This enables evaluation and debugging within realistic software stacks, unearthing potential inefficiencies often masked in bare-metal environments.
Evaluation
Gemmini-generated accelerators have been fabricated in advanced process technologies and achieve substantial speedups over baseline CPUs. The evaluation shows that adjustments to the architecture and configuration can yield significant performance gains. For instance, adding dedicated computation blocks for operations such as im2col relieves a host-CPU bottleneck, underscoring the value of CPU-accelerator co-design.
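The im2col operation mentioned above lowers a convolution to a matrix multiplication, which is why it maps naturally onto a matmul-centric accelerator. A minimal self-contained Python sketch (stride 1, no padding, single channel):

```python
def im2col(img, kh, kw):
    """Lower a 2D input into a matrix whose rows are flattened kh x kw
    receptive fields (stride 1, no padding, single channel)."""
    H, W = len(img), len(img[0])
    return [
        [img[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(H - kh + 1)
        for j in range(W - kw + 1)
    ]

def conv2d_via_matmul(img, kernel):
    """Convolution as (im2col matrix) x (flattened kernel) - the form a
    matmul accelerator consumes directly."""
    kh, kw = len(kernel), len(kernel[0])
    flat_k = [kernel[di][dj] for di in range(kh) for dj in range(kw)]
    patches = im2col(img, kh, kw)
    out_w = len(img[0]) - kw + 1
    flat = [sum(p * k for p, k in zip(row, flat_k)) for row in patches]
    return [flat[r * out_w:(r + 1) * out_w] for r in range(len(flat) // out_w)]
```

Performing this lowering on the host CPU can dominate end-to-end runtime, which is the bottleneck that a dedicated im2col block in the accelerator removes.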
Case Studies
The paper extends its contributions through practical case studies that showcase Gemmini's potential for full-system co-design:
- Virtual Address Translation: Gemmini enables exploration and customization of virtual-address-translation schemes. One notable design adds a small filter register that caches the most recent translation, substantially reducing translation latency and yielding measurable performance improvements with minimal hardware overhead.
- System-Level Resource Partitioning: Evaluating different memory-partitioning configurations reveals the importance of balancing private scratchpad capacity against shared cache capacity. These experiments show how memory-system design choices can substantially affect performance, particularly in multi-core environments with heavy resource contention.
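The intuition behind the filter register in the first case study can be modeled in a few lines of Python (a hypothetical simplification, not the paper's RTL): because accelerator DMA streams are highly sequential, a single register caching the last translation absorbs most lookups before they reach the slower TLB.

```python
PAGE_SHIFT = 12  # 4 KiB pages

class TranslationUnit:
    """Toy model of a TLB fronted by a one-entry 'filter' register that
    caches the most recent virtual-to-physical translation."""
    def __init__(self, page_table):
        self.page_table = page_table   # vpn -> ppn (stand-in for TLB + walk)
        self.filter = None             # (vpn, ppn) of the last translation
        self.tlb_lookups = 0           # lookups that actually reached the TLB

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        off = vaddr & ((1 << PAGE_SHIFT) - 1)
        if self.filter and self.filter[0] == vpn:
            ppn = self.filter[1]       # filter hit: skip the TLB entirely
        else:
            self.tlb_lookups += 1      # filter miss: pay the TLB access cost
            ppn = self.page_table[vpn]
            self.filter = (vpn, ppn)
        return (ppn << PAGE_SHIFT) | off
```

For a sequential access stream, only the first access to each page reaches the TLB; every other access hits the filter register, which is the source of the latency savings the case study reports.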
Implications and Future Developments
Gemmini's systematic approach opens new avenues in the evaluation and design of deep-learning systems, emphasizing the importance of integrated solutions over isolated component assessments. The platform's flexibility and openness also lay the groundwork for future research into co-design practices, exploring the interplay between hardware, software, and system architecture.
Future AI systems could benefit greatly from Gemmini's framework, particularly in domains requiring dynamically adaptable architectures and resource-efficient deployment. The full-stack perspective Gemmini encourages marks a shift toward holistic design principles in AI hardware research, grounded in realistic operational constraints.
Conclusion
Overall, Gemmini provides a robust infrastructure for the systematic evaluation and development of DNN accelerator architectures. By integrating all system components, from hardware to software to operating systems, researchers can comprehensively assess and optimize deep learning architectures for current and future applications.