
Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks (1805.03718v1)

Published 9 May 2018 in cs.AR and cs.NE

Abstract: This paper presents the Neural Cache architecture, which re-purposes cache structures to transform them into massively parallel compute units capable of running inferences for Deep Neural Networks. Techniques to do in-situ arithmetic in SRAM arrays, create efficient data mappings, and reduce data movement are proposed. The Neural Cache architecture is capable of fully executing convolutional, fully connected, and pooling layers in-cache. The proposed architecture also supports quantization in-cache. Our experimental results show that the proposed architecture can improve inference latency by 18.3x over a state-of-the-art multi-core CPU (Xeon E5) and 7.7x over a server-class GPU (Titan Xp) for the Inception v3 model. Neural Cache improves inference throughput by 12.4x over the CPU (2.2x over the GPU), while reducing power consumption by 50% over the CPU (53% over the GPU).

Citations (313)

Summary

  • The paper presents a novel in-cache acceleration method by repurposing SRAM arrays for bit-serial arithmetic in deep neural networks.
  • It enables efficient execution of convolutional, fully connected, and pooling layers through massively parallel computation within the cache.
  • Experimental results show up to an 18.3x reduction in inference latency over a multi-core CPU (7.7x over a GPU) and roughly 50% lower power consumption than both baselines.

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

The paper "Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks" presents an innovative approach to improving the efficiency of Deep Neural Network (DNN) inferences by utilizing cache structures as computation units rather than merely storage. This paper highlights a significant departure from traditional Processor-In-Memory (PIM) designs, offering a refined solution to mitigate the memory wall challenge pervasive in data-intensive applications.

The proposed architecture, Neural Cache, re-purposes the existing SRAM arrays in the cache to support massively parallel computation directly within the memory. Arithmetic operations such as addition and multiplication are executed bit-serially: multiple word-lines are activated simultaneously and the resulting bit-line values are sensed to produce logic primitives, from which wider operations are composed. The approach executes convolutional, fully connected, and pooling layers fully in-cache and supports quantization in situ. Operands are stored in transposed form, so that all bits of a data element occupy a single bit-line column; this layout lets bit-serial operations proceed element-parallel across every column of the array, as the sketch below illustrates.
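
To make the transposed, bit-serial execution model concrete, the following minimal Python/NumPy sketch simulates it functionally. It is an illustration of the idea under assumed 8-bit unsigned operands, not the paper's circuit-level implementation, and the helper names (transpose_to_bit_planes, bit_serial_add) are illustrative rather than taken from the paper. Each element occupies a column, each bit position a row, and addition walks from LSB to MSB with a carry row, updating all columns in each step.

```python
import numpy as np

def transpose_to_bit_planes(values, n_bits):
    """Store a vector column-wise: plane[b][i] is bit b of element i.
    In Neural Cache terms, each element occupies one bit-line column and
    each bit position occupies one word-line row (illustrative model)."""
    v = np.asarray(values, dtype=np.uint32)
    return [(v >> b) & 1 for b in range(n_bits)]

def bit_serial_add(a_planes, b_planes):
    """Element-parallel, bit-serial addition over transposed operands.
    Each loop iteration models one step in which a row of A, a row of B,
    and the carry row are combined across all columns at once."""
    n_bits = len(a_planes)
    carry = np.zeros_like(a_planes[0])
    out = []
    for b in range(n_bits):
        s = a_planes[b] ^ b_planes[b] ^ carry                          # sum bit, all columns
        carry = (a_planes[b] & b_planes[b]) | (carry & (a_planes[b] ^ b_planes[b]))
        out.append(s)
    out.append(carry)                                                   # final carry becomes the extra MSB
    return out

def bit_planes_to_ints(planes):
    """Collapse bit planes back to ordinary integers (for checking only)."""
    acc = np.zeros_like(planes[0], dtype=np.uint64)
    for b, p in enumerate(planes):
        acc |= p.astype(np.uint64) << b
    return acc

a = transpose_to_bit_planes([3, 7, 200, 41], 8)
b = transpose_to_bit_planes([5, 9, 100, 14], 8)
print(bit_planes_to_ints(bit_serial_add(a, b)))   # [  8  16 300  55]
```

The key point the sketch captures is that the cost of an addition scales with the bit-width, not with the number of elements: every column (element) is updated in the same step, which is where the massive parallelism of the cache arrays comes from.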

The experimental results show that the proposed architecture reduces inference latency for the Inception v3 model by 18.3x over a state-of-the-art multi-core CPU (Xeon E5) and by 7.7x over a server-class GPU (Titan Xp). Throughput improves by 12.4x over the CPU and 2.2x over the GPU, while power consumption drops by roughly 50% relative to the CPU (53% relative to the GPU).

Key Contributions:

  1. In-Cache Arithmetic: Computational capability is added to the cache structures themselves, leveraging the large silicon area already devoted to caches in modern processors.
  2. Massive Parallel Compute Surface: The repurposed cache becomes a vast vector computation resource, substantially wider than the vector units found in contemporary GPUs.
  3. Transposed Data Layout and Execution Model: Efficient data layouts and execution models match the underlying array geometry and expose DNN parallelism; a multiplication sketch building on this layout follows the list.
  4. Energy Efficiency: Substantial energy savings come from drastically reduced on-chip data movement, a dominant contributor to processing energy consumption.
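
Building on the addition sketch above (and reusing transpose_to_bit_planes, bit_serial_add, and bit_planes_to_ints from it), the following hedged sketch shows how element-parallel multiplication can be composed from the same primitives via predicated shift-and-add. The function name bit_serial_multiply and the exact predication scheme are illustrative assumptions, approximating rather than reproducing the paper's micro-op sequence.

```python
def bit_serial_multiply(a_planes, b_planes):
    """Element-parallel, bit-serial multiplication via predicated shift-and-add.
    For each bit b of the multiplier, the multiplicand (shifted left by b) is
    added into a running accumulator only in columns where that multiplier bit
    is 1 -- modelling per-column predication of the add."""
    n_bits = len(a_planes)
    zero = np.zeros_like(a_planes[0])
    acc = [zero.copy() for _ in range(2 * n_bits)]       # product needs 2n bits
    for b in range(n_bits):
        pred = b_planes[b]                                # per-column predicate
        # Multiplicand shifted left by b, masked by the predicate in each column.
        shifted = [zero] * b + [p & pred for p in a_planes] + [zero] * (n_bits - b)
        acc = bit_serial_add(acc, shifted)[:2 * n_bits]   # drop the overflow carry
    return acc

x = transpose_to_bit_planes([3, 7, 12, 250], 8)
y = transpose_to_bit_planes([5, 9, 11, 4], 8)
print(bit_planes_to_ints(bit_serial_multiply(x, y)))      # [  15   63  132 1000]
```

As with addition, the number of steps depends on operand bit-width (here roughly n^2 single-bit row operations per multiply) while all columns advance together, which is why the transposed layout in contribution 3 is what unlocks the parallel compute surface of contribution 2.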

Implications and Future Directions:

The Neural Cache architecture has profound implications for the design of future processors, suggesting an integrated compute-storage approach that extends beyond neural network acceleration to a range of data-parallel computational tasks. The ability to perform computation directly within the cache has the potential to influence low-power and edge computing devices where energy efficiency is paramount.

From a theoretical standpoint, this approach blurs the traditional separation between logic and memory, loosely echoing the structure of human cognition and representing a step toward neuromorphic computing architectures. Practically, it can yield more responsive, lower-latency AI systems, broadening the reach of AI in real-time applications.

Future research could explore optimization of the bit-serial operations themselves or the incorporation of sparsity-aware computation to handle recent trends in sparse neural networks efficiently. Integration with other emerging memory technologies, such as non-volatile memory, may also offer opportunities for combining speed, reliability, and density in future cache architectures.

In conclusion, Neural Cache represents a forward-thinking paradigm that pushes the boundaries of existing computing architectures by merging memory with computation. As compute workloads continue to grow in complexity and size, such architectural innovations provide valuable pathways to meet the ever-increasing demand for efficient computation.