RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing
The paper presents RecNMP, a near-memory processing solution for accelerating personalized recommendation inference. Recommendation models account for a large share of the AI cycles in data centers, and much of that time goes to sparse embedding operations whose irregular memory access patterns make them memory-bound and difficult to accelerate with conventional hardware. RecNMP tackles this by deploying near-memory processing (NMP) within a commodity DRAM environment, targeting the embedding stage of recommendation inference.
Characterization of Recommendation Models
The paper characterizes production-grade recommendation models and identifies a major bottleneck: memory bandwidth saturation caused by embedding operations. These operations expose abundant parallelism, yet their many small, irregular reads saturate memory bandwidth and stall inference. At Facebook, recommendation models consume over 70% of AI inference cycles; despite this significance, they have received far less optimization effort than CNNs and RNNs.
Recommendation models combine dense (continuous) and sparse (categorical) features. Sparse features are represented by large embedding tables accessed through SparseLengthsSum (SLS), which gathers a small, data-dependent set of rows from a large table and sums (pools) them. The paper highlights two challenges unique to SLS: the table indices are irregular and input-dependent, so accesses exhibit little predictability or locality, and the tables are far larger than on-chip memory, so conventional caching offers limited relief. A sketch of the SLS computation follows.
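For concreteness, here is a minimal NumPy sketch of the SLS computation; the function name, table size, and pooling factors are illustrative rather than taken from the paper. Each pooled output sums a small, data-dependent set of rows gathered from a very large table, which is exactly what produces scattered DRAM traffic.

```python
import numpy as np

def sparse_lengths_sum(table, indices, lengths):
    """Reference SLS: output row i is the sum of the next lengths[i]
    table rows selected by the corresponding slice of `indices`."""
    out = np.zeros((len(lengths), table.shape[1]), dtype=table.dtype)
    pos = 0
    for i, n in enumerate(lengths):
        out[i] = table[indices[pos:pos + n]].sum(axis=0)
        pos += n
    return out

# Illustrative sizes: a 1M-row, 64-dim table; three pooled outputs,
# each gathering 40 irregular, input-dependent rows.
table = np.random.randn(1_000_000, 64).astype(np.float32)
indices = np.random.randint(0, 1_000_000, size=120)
lengths = np.array([40, 40, 40])
pooled = sparse_lengths_sum(table, indices, lengths)   # shape (3, 64)
```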
RecNMP Architecture
RecNMP introduces a near-memory processing architecture embedded in the buffer chips of standard DIMMs. It executes bandwidth-intensive embedding operations locally and exploits rank-level parallelism, exposing up to 8× more bandwidth than the memory channel provides off-chip and thereby relieving the off-chip bottleneck. The design is DDR4-compatible and uses lightweight function units tailored to the SLS family of operators; a simplified model of the rank-level partitioning is sketched below.
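The sketch below is a simplified software model of that idea, under the assumption that embedding rows are interleaved across ranks by row index (the mapping and the rank count of 8 are illustrative). Each rank's function unit accumulates only the rows resident in that rank, so the host only has to add a handful of partial sums instead of streaming every embedding row over the channel.

```python
import numpy as np

NUM_RANKS = 8  # illustrative; rank-level parallelism is the source of the ~8x bandwidth

def rank_of(row_idx):
    # Assumed address mapping: embedding rows interleaved across ranks by index.
    return row_idx % NUM_RANKS

def nmp_sls(table, indices):
    """Model of one pooled lookup: each rank accumulates its own rows locally;
    the host merely reduces NUM_RANKS partial sums at the end."""
    partial = np.zeros((NUM_RANKS, table.shape[1]), dtype=table.dtype)
    for idx in indices:
        partial[rank_of(idx)] += table[idx]   # rank-local gather-and-add
    return partial.sum(axis=0)                # host-side reduction
```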
NMP instructions are compressed into the DDR command format, so many concurrent lookups can be issued without exhausting command/address (C/A) bus bandwidth; this compression is key to sustaining the irregular access pattern of sparse embeddings. RecNMP's programming model follows a heterogeneous host-device style similar to OpenCL, with the host coordinating which embedding work is offloaded to the NMP units.
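A hypothetical host-side sketch of that coordination is given below. The packet and instruction fields are assumptions made for illustration, not the paper's actual NMP-Inst encoding: one pooling operation becomes one packet whose per-lookup instructions are spread across ranks so they can execute concurrently.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NMPInst:
    rank: int          # which rank's function unit executes the lookup
    local_addr: int    # byte address of the embedding row within that rank
    vec_len: int       # bytes to read and accumulate
    accumulate: bool   # add into the rank-local partial-sum register

@dataclass
class NMPPacket:
    tag: int               # identifies which pooled output the partial sums belong to
    insts: List[NMPInst]

def build_packet(tag, row_indices, row_bytes, num_ranks=8):
    """Translate one pooling's row indices into per-rank instructions,
    assuming the same interleaved row-to-rank mapping as above."""
    insts = [NMPInst(rank=i % num_ranks,
                     local_addr=(i // num_ranks) * row_bytes,
                     vec_len=row_bytes,
                     accumulate=True)
             for i in row_indices]
    return NMPPacket(tag=tag, insts=insts)
```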
Performance Benefits and Evaluation
The headline result is a 9.8× memory latency speedup over an optimized baseline. Memory-side caching improves performance further, while table-aware packet scheduling and hot-entry profiling add locality-based optimizations on top. Evaluated with production embedding traces rather than purely randomized access patterns, RecNMP improves throughput by 4.2× and reduces memory energy by 45.8%. A rough illustration of hot-entry profiling appears below.
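As a sketch of what hot-entry profiling might look like in software (the function and the threshold knob are hypothetical, not the paper's mechanism), an offline pass can count how often each embedding row appears in a production trace and flag the most frequent rows as candidates for the memory-side cache.

```python
from collections import Counter

def profile_hot_entries(trace, hot_fraction=0.01):
    """trace: iterable of (table_id, row_id) lookups from production logs.
    Returns the set of entries whose access counts fall in the top
    `hot_fraction` -- an illustrative knob, not a value from the paper."""
    counts = Counter(trace)
    k = max(1, int(len(counts) * hot_fraction))
    return {entry for entry, _ in counts.most_common(k)}

# Example: hot = profile_hot_entries([(0, 17), (0, 17), (3, 42), (0, 17)])
```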
RecNMP scales across both single and co-located models. Offloading SLS to memory reduces cache contention from embedding traffic, which in turn speeds up non-SLS layers such as fully connected (FC) operators by up to 30%. Despite the extra hardware in the DIMM buffer chip, RecNMP stays within the area and power budgets typical of commodity DRAM components.
Conclusion
RecNMP offers a practical answer to the memory bottlenecks that dominate personalized recommendation inference, making better use of scarce bandwidth and on-chip resources in data center deployments. The work lays groundwork for further exploration of NMP-driven architectures as AI models grow in complexity and as demand rises for high-throughput production systems. Future work may refine the co-optimization strategies and further simplify the instruction set to broaden the applicability and efficiency of near-memory architectures in emerging AI applications.