Maximizing CNN Accelerator Efficiency Through Resource Partitioning (1607.00064v2)

Published 30 Jun 2016 in cs.AR

Abstract: Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection of layers is computed. However, this approach leads to inefficient designs because the same processor structure is used to compute CNN layers of radically varying dimensions. We present a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers. Using the same FPGA resources as a single large processor, multiple smaller specialized processors increase computational efficiency and lead to a higher overall throughput. Our design methodology achieves 3.8x higher throughput than the state-of-the-art approach when evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x.

Authors (3)
  1. Yongming Shen
  2. Michael Ferdman
  3. Peter Milder
Citations (302)

Summary

Resource Partitioning to Maximize CNN Accelerator Efficiency

The paper, "Maximizing CNN Accelerator Efficiency Through Resource Partitioning," explores a novel approach to improving the computational efficiency of FPGA-based CNN accelerators. The predominant paradigm for leveraging FPGAs in CNN acceleration has involved designing a single Convolutional Layer Processor (CLP) that processes CNN layers iteratively. However, this approach often results in underutilization of FPGA resources due to the diverse dimensional requirements of different CNN layers.

Problem Statement and Approach

The authors identify a key inefficiency in the conventional design methodology: a single hardware configuration must process CNN layers of radically different dimensions, so its fixed tile sizes inevitably fit some layers poorly and leave arithmetic units idle. The paper reports a dynamic utilization rate of only 24% for existing Single-CLP designs, meaning that a large share of the available computational resources sits unused.
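
To make the inefficiency concrete, here is a minimal sketch, assuming the standard tiling model in which a CLP performs a fixed Tn x Tm grid of multiply-accumulate (MACC) operations per cycle. The layer shapes and tile sizes below are illustrative stand-ins, not the paper's measured design points.

```python
import math

def layer_cycles(N, M, R, C, K, Tn, Tm):
    """Cycles for one conv layer on a CLP with a Tn x Tm MACC grid: the
    N input and M output feature maps are covered in ceil-divided tiles,
    each sweeping the R x C output pixels and K x K kernel window."""
    return math.ceil(N / Tn) * math.ceil(M / Tm) * R * C * K * K

def dynamic_utilization(layers, Tn, Tm):
    """Fraction of issued MACC slots that do useful work across all layers."""
    useful = sum(N * M * R * C * K * K for (N, M, R, C, K) in layers)
    issued = sum(layer_cycles(*l, Tn=Tn, Tm=Tm) * Tn * Tm for l in layers)
    return useful / issued

# (N_in, M_out, R, C, K): rough AlexNet-style conv shapes (illustrative)
layers = [(3, 96, 55, 55, 11), (48, 256, 27, 27, 5), (256, 384, 13, 13, 3),
          (192, 384, 13, 13, 3), (192, 256, 13, 13, 3)]

# Tiles sized for the large layers fit conv1's 3 input channels poorly,
# so utilization lands below 1.0.
print(dynamic_utilization(layers, Tn=7, Tm=64))
```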

To address this, the authors propose a Multi-CLP paradigm. This architecture partitions the FPGA's resources into several smaller, specialized processors, each tailored to the dimensions of a subset of the CNN's convolutional layers. Because the CLPs operate concurrently, with different images in flight on different processors, the same resources yield higher overall throughput.
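
Continuing the sketch above (same hypothetical shapes; layer_cycles as defined there), the steady-state rate of a Multi-CLP design can be modeled as the cycle count of its slowest CLP, since independent images occupy the different CLPs concurrently:

```python
def clp_cycles_per_image(layer_subset, Tn, Tm):
    # One CLP processes its assigned layers back-to-back for each image.
    return sum(layer_cycles(*l, Tn=Tn, Tm=Tm) for l in layer_subset)

def multi_clp_cycles_per_image(design):
    # design: list of (layer_subset, Tn, Tm), one entry per CLP. With
    # images pipelined through the CLPs, the slowest CLP sets the rate.
    return max(clp_cycles_per_image(ls, Tn, Tm) for (ls, Tn, Tm) in design)

# Example partition: a small CLP matched to conv1, a larger one for the rest.
design = [([layers[0]], 3, 64), (layers[1:], 48, 64)]
print(multi_clp_cycles_per_image(design))
```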

Optimization Methodology

The authors introduce an optimization algorithm that automatically derives Multi-CLP designs. Given the FPGA's DSP and BRAM budgets and the CNN's per-layer compute and memory demands, the algorithm searches for the resource partitioning and layer-to-CLP assignment that maximize throughput, tailoring each CLP's size to its layers' dimensions and thereby raising the dynamic utilization of resources.
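
As a toy stand-in for that tool, the brute-force search below reuses the helpers from the previous sketches, enumerating contiguous two-CLP partitions and candidate tile sizes under an assumed DSP budget; the paper's actual algorithm additionally models BRAM capacity and memory bandwidth, which are omitted here.

```python
from itertools import product

DSP_BUDGET = 2048          # assumption: ~1 DSP-equivalent per MACC unit
TILES = [1, 2, 3, 4, 8, 16, 32, 48, 64]  # candidate tile sizes (illustrative)

def best_two_clp_design(layers):
    best = None
    for split in range(1, len(layers)):        # contiguous 2-way partitions
        groups = (layers[:split], layers[split:])
        for Tn1, Tm1, Tn2, Tm2 in product(TILES, repeat=4):
            if Tn1 * Tm1 + Tn2 * Tm2 > DSP_BUDGET:
                continue                       # exceeds the arithmetic budget
            cycles = max(clp_cycles_per_image(groups[0], Tn1, Tm1),
                         clp_cycles_per_image(groups[1], Tn2, Tm2))
            if best is None or cycles < best[0]:
                best = (cycles, split, (Tn1, Tm1, Tn2, Tm2))
    return best  # (bottleneck cycles/image, split point, per-CLP tiles)

print(best_two_clp_design(layers))
```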

The Multi-CLP approach is evaluated on several deep learning networks, including AlexNet, SqueezeNet, and GoogLeNet. The numerical results show substantial gains over state-of-the-art Single-CLP configurations: 3.8x higher throughput for AlexNet, and speedups of 2.2x and 2.0x for SqueezeNet and GoogLeNet, respectively.

Theoretical and Practical Implications

The authors substantiate their claims through a thorough evaluation, demonstrating enhanced arithmetic unit utilization—up to 99%—and systematically analyzing resource consumption and performance metrics. Practically, using Multi-CLP designs leads to a more balanced and efficient utilization of FPGA resources, which directly translates into accelerated processing times for various CNN tasks.

Theoretically, this paper challenges the status quo of CNN accelerator design by advocating for a more adaptable resource allocation that aligns with the heterogeneous nature of layer dimensions in CNN architectures. The implications reach beyond CNN acceleration into areas requiring domain-specific architectures that could benefit from similar partitioning strategies.

Future Research Directions

The paper opens avenues for further research into adaptive FPGA designs that can dynamically reallocate resources based on real-time computational demands. Investigating the interplay between power consumption and increased throughput in larger and more complex neural networks can offer insights into designing resource-effective deep learning accelerators.

In conclusion, through methodical resource partitioning, this paper presents a compelling argument for Multi-CLP architectures, providing both a robust design framework and practical benchmarking results that demonstrate notable improvements in the efficiency of FPGA-based CNN accelerators.