pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables (2104.07699v5)

Published 15 Apr 2021 in cs.AR and cs.DC

Abstract: Data movement between the main memory and the processor is a key contributor to execution time and energy consumption in memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM). One category of PiM is Processing-using-Memory (PuM), in which computation takes place inside the memory array by exploiting intrinsic analog properties of the memory device. PuM yields high performance and energy efficiency, but existing PuM techniques support a limited range of operations. As a result, current PuM architectures cannot efficiently perform some complex operations (e.g., multiplication, division, exponentiation) without large increases in chip area and design complexity. To overcome these limitations of existing PuM architectures, we introduce pLUTo (processing-using-memory with lookup table (LUT) operations), a DRAM-based PuM architecture that leverages the high storage density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs). The key idea of pLUTo is to replace complex operations with low-cost, bulk memory reads (i.e., LUT queries) instead of relying on complex extra logic. We evaluate pLUTo across 11 real-world workloads that showcase the limitations of prior PuM approaches and show that our solution outperforms optimized CPU and GPU baselines by an average of 713$\times$ and 1.2$\times$, respectively, while simultaneously reducing energy consumption by an average of 1855$\times$ and 39.5$\times$. Across these workloads, pLUTo outperforms state-of-the-art PiM architectures by an average of 18.3$\times$. We also show that different versions of pLUTo provide different levels of flexibility and performance at different additional DRAM area overheads (between 10.2% and 23.1%). pLUTo's source code is openly and fully available at https://github.com/CMU-SAFARI/pLUTo.


Summary

The paper presents a novel DRAM-based Processing-using-Memory architecture that replaces complex computations with efficient in-DRAM lookup table queries.
It proposes three design variants (pLUTo-BSA, pLUTo-GSA, and pLUTo-GMC) to optimize trade-offs between performance, area, and energy efficiency.
Simulations show speedups of up to 1413x over CPUs, with average energy reductions of 1855x and 39.5x relative to optimized CPU and GPU baselines.

Overview of "pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables"

The paper presents pLUTo, a DRAM-based Processing-using-Memory (PuM) architecture that executes complex operations through in-DRAM lookup table (LUT) queries. Existing PuM architectures support only a limited range of operations and struggle with complex ones such as multiplication, division, and exponentiation. pLUTo addresses this by storing LUTs in DRAM and querying them in place, so that an intricate computation becomes a simple, bulk memory read. The approach exploits DRAM's high storage density and parallelism to improve performance and energy efficiency across a variety of workloads.
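The core idea of replacing a computation with a table read can be illustrated in software. The following is a minimal sketch, not the pLUTo hardware mechanism: the function names (`build_lut`, `bulk_query`, `lut_multiply`) and the choice of packing two 4-bit operands into one 8-bit index are assumptions made purely for illustration.

```python
def build_lut(func, input_bits):
    """Precompute func for every possible input of the given bit width."""
    return [func(x) for x in range(1 << input_bits)]


def bulk_query(lut, inputs):
    """Answer each input with a single table read instead of a computation."""
    return [lut[x] for x in inputs]


def packed_multiply(index):
    """Hypothetical encoding: high 4 bits hold a, low 4 bits hold b."""
    a, b = index >> 4, index & 0xF
    return a * b


# 256-entry table covering every 4-bit x 4-bit product.
MUL_LUT = build_lut(packed_multiply, input_bits=8)


def lut_multiply(a_values, b_values):
    """Multiply element-wise using only index packing and table reads."""
    indices = [(a << 4) | b for a, b in zip(a_values, b_values)]
    return bulk_query(MUL_LUT, indices)
```

Once the table exists, every multiplication costs one read regardless of how expensive the original function is, which is the property the paper exploits at DRAM scale.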

Key Contributions

pLUTo demonstrates several notable contributions:

Architectural Innovation: The architecture leverages DRAM’s intrinsic properties to store and query LUTs massively in parallel. This innovation sidesteps the need for costly and complex logic to perform operations typically burdensome for conventional PuM solutions, such as multiplication and division.
Multiple Design Options: Three variants of the pLUTo architecture are proposed: pLUTo-BSA, pLUTo-GSA, and pLUTo-GMC. Each design offers a different trade-off among performance, area overhead, and energy efficiency, catering to diverse application requirements and hardware constraints.
Performance Evaluation: In simulation, pLUTo outperforms CPU and GPU systems by up to 1413x and 2.3x, respectively, in execution speed for DDR4-based implementations, while also reducing energy consumption. The abstract reports average speedups of 713x and 1.2x and average energy reductions of 1855x and 39.5x over optimized CPU and GPU baselines.
Open Access Tools: The authors provide comprehensive access to pLUTo's source code, along with scripts necessary for reproducing the research results, enabling further exploration and verification by the research community.
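Storing LUTs in DRAM is practical only while the tables fit, and a full table needs one entry per possible input value. The sketch below is a back-of-the-envelope sizing helper under that assumption; the function names are hypothetical and the calculation does not model any pLUTo-specific layout.

```python
def lut_entries(input_bits):
    """A full LUT needs one entry for every possible input value."""
    return 1 << input_bits


def lut_size_bytes(input_bits, output_bits):
    """Total storage for a full LUT, rounded up to whole bytes."""
    total_bits = lut_entries(input_bits) * output_bits
    return (total_bits + 7) // 8


# 4-bit x 4-bit multiply: 8 input bits, 8 output bits.
small = lut_size_bytes(input_bits=8, output_bits=8)
# A 16-bit input with a 16-bit result needs a far larger table.
large = lut_size_bytes(input_bits=16, output_bits=16)
```

Table size doubles with every additional input bit, which is why storage density matters for a LUT-based design.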

Evaluation and Results

The authors evaluate pLUTo against CPU, GPU, and prior PiM baselines across 11 real-world workloads, including vector arithmetic and neural network operations. pLUTo delivers higher performance and lower energy consumption than processor-centric systems, and it outperforms state-of-the-art PiM architectures by an average of 18.3x. These gains come at an additional DRAM area overhead of between 10.2% and 23.1%, depending on the pLUTo version.

Performance and Area Analysis

Scalability: LUT-based computation in pLUTo scales through subarray-level parallelism, in which a dozen or more subarrays operate concurrently to increase throughput.
Energy Efficiency: The pLUTo-GMC architecture achieves the highest energy efficiency of the three variants. Its optimized in-DRAM LUT queries reduce unnecessary data movement and rely on energy-efficient DRAM operations.
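Subarray-level parallelism can be pictured as splitting the input across independent units that each hold a copy of the LUT. The sketch below is a simplified software model under stated assumptions: `partition`, `parallel_query`, and `query_rounds` are hypothetical names, and the parameter `elements_per_query` (how many inputs one subarray serves per round) is an illustrative stand-in rather than a figure from the paper.

```python
import math


def partition(inputs, num_subarrays):
    """Split inputs into at most num_subarrays contiguous chunks."""
    if not inputs:
        return []
    chunk = math.ceil(len(inputs) / num_subarrays)
    return [inputs[i:i + chunk] for i in range(0, len(inputs), chunk)]


def parallel_query(lut, inputs, num_subarrays):
    """Model each subarray querying its own LUT copy on its own chunk."""
    per_subarray = [[lut[x] for x in chunk]
                    for chunk in partition(inputs, num_subarrays)]
    # Concatenate the per-subarray results back into input order.
    return [y for chunk in per_subarray for y in chunk]


def query_rounds(num_inputs, num_subarrays, elements_per_query):
    """Rounds needed if every subarray serves elements_per_query inputs per round."""
    return math.ceil(num_inputs / (num_subarrays * elements_per_query))
```

In this model, doubling the number of subarrays roughly halves the number of rounds, which is the sense in which concurrent subarrays raise throughput.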

Practical and Theoretical Implications

Practically, pLUTo advances the state of the art in memory-centric computation and supports broader PiM adoption in data-intensive applications. By lowering computational complexity and energy cost, it enables more efficient processing for workloads such as real-time data analytics and AI inference, which matters most in energy-constrained settings like edge computing.

Theoretically, pLUTo contributes to the architectural exploration of DRAM as more than a passive storage medium, driving innovation toward highly integrated computing environments. It underscores the potential for redefining traditional computing hierarchies, which could inspire future research into synergistic combinations of memory and compute capabilities.

Future Directions

Given pLUTo's promising results, future research may explore hybrid architectures that combine it with near-bank processing and conventional CPU/GPU processing. Extending support to more complex operations through enhanced LUT schemes, and exploring alternative memory technologies for PiM, could further improve performance and broaden the range of applications.


GitHub

GitHub - CMU-SAFARI/pLUTo: pLUTo is a DRAM-based Processing-using-Memory architecture that leverages the high density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs) (17 stars)
