
RecNMP: Near-Memory Processing for Recommendations

Updated 16 June 2026
  • RecNMP is a lightweight, DRAM-compliant near-memory processing architecture that integrates programmable accelerators within DDR4 buffer chips to boost recommendation inference.
  • It employs on-chip RankCache and table-aware packet scheduling to reduce memory-access latency by up to 9.8× and enhance throughput across multiple DRAM ranks.
  • Through hardware/software co-optimization and hot-entry profiling, RecNMP achieves significant energy efficiency improvements, reducing memory energy by up to 45.8%.

RecNMP (Near-Memory Processing for Recommendation)

RecNMP denotes a lightweight, DRAM-compliant near-memory processing (NMP) architecture specifically engineered to accelerate personalized recommendation inference workloads in data centers. The approach targets the memory-bound sparse embedding operations that dominate deep learning–based recommendation systems, such as the embedding lookups in models like Facebook’s DLRM. RecNMP integrates simple, programmable accelerators within DDR4 DIMM buffer chips, exploiting memory-side caching and software–hardware co-optimization to achieve significant improvements in system throughput, memory-access latency, and energy efficiency (Ke et al., 2019).

1. Motivation: Personalized Recommendation and Memory Boundness

Modern deep learning–based personalized recommendation models are dominated by sparse embedding lookups rather than traditional compute-heavy matrix–matrix or convolutional operations. Each lookup typically fetches a small (64–256 B) vector from massive, multi-GB embedding tables, followed by a low-intensity reduction (sum or weighted sum). This results in extremely low arithmetic intensity for SparseLengthsSum (SLS) operations:

$$\text{AI}_{\rm SLS} = \frac{\#\,\text{FLOPs}}{\text{Bytes moved}} \approx \frac{2d}{8d} = 0.25~\text{FLOP/byte}$$

where $d$ is the embedding vector dimension. Such operations are strictly memory-bound, sitting well below the compute roofline. Moreover, the access patterns are highly irregular:

  • Temporal reuse is modest (20–60% hit rate in a small LRU cache)
  • Spatial locality is negligible (row buffer hit rate decreases with larger line sizes)
  • Standard prefetching/tiling strategies yield little benefit
  • On a typical DDR4-2400 interface, ∼120 threads saturate the bus, after which latency sharply increases but throughput plateaus

This context demonstrates that further gains for recommendation inference require fundamentally increasing available memory bandwidth and reducing locality-insensitive DRAM latency (Ke et al., 2019).
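
To make the SLS operation and its arithmetic intensity concrete, the following is a minimal pure-Python sketch. The function names `sparse_lengths_sum` and `sls_arithmetic_intensity` are illustrative, not taken from the paper, and the per-element FLOP and byte counts simply mirror the $2d/8d$ estimate above.

```python
def sparse_lengths_sum(table, indices, lengths):
    """Reference SLS: split `indices` into segments of the given lengths,
    gather the corresponding rows of `table`, and sum each segment."""
    dim = len(table[0])
    out = []
    offset = 0
    for n in lengths:
        rows = [table[i] for i in indices[offset:offset + n]]
        out.append([sum(row[k] for row in rows) for k in range(dim)])
        offset += n
    return out


def sls_arithmetic_intensity(d, flops_per_elem=2, bytes_per_elem=8):
    """FLOPs per byte moved, using the 2d / 8d estimate from the text."""
    return (flops_per_elem * d) / (bytes_per_elem * d)


if __name__ == "__main__":
    table = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
    pooled = sparse_lengths_sum(table, indices=[0, 2, 1, 3, 3], lengths=[2, 3])
    print(pooled)
    print(sls_arithmetic_intensity(d=64))
```

The intensity is independent of $d$: each gathered element contributes a fixed amount of work per byte, so wider embeddings do not move SLS any closer to the compute roofline.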

2. Architecture: RecNMP Design and Operation

RecNMP extends the buffer chip of commodity DDR4 DIMMs to embed low-overhead SLS accelerators (“Processing Units,” PUs) at each DRAM rank. Each PU is capable of:

  • Decoding compressed NMP instructions (NMP-Inst)
  • Generating appropriate ACT/RD/PRE controls for its DRAM devices
  • Executing local gather–reduce (e.g., vector sum) on the DRAM rank
  • Optionally caching recent embedding vectors (RankCache)

Partial sums across ranks are aggregated in the buffer chip (DIMM Adder). Crucially, RecNMP does not require any modification of the DRAM chips themselves and is fully compliant with the standard DDR4 protocol, leveraging only buffer-chip logic. All communication uses standard C/A and DQ pins, respecting timing constraints such as $t_{RCD}$, $t_{CL}$, and $t_{RP}$.

Conceptual Block Diagram

[Figure: conceptual block diagram of RecNMP]

This structure enables scalable, rank-parallel execution of memory-bound embedding operations directly on the memory infrastructure.
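
The rank-parallel gather-reduce followed by aggregation in the DIMM Adder can be sketched functionally as below. This is a behavioral model only: the modulo row-to-rank mapping in `rank_of` is an assumption made for illustration (real placement depends on address mapping and page coloring), and all names are hypothetical.

```python
def rank_of(row_id, num_ranks):
    """Assumed mapping: embedding rows interleaved across ranks by modulo."""
    return row_id % num_ranks


def rank_parallel_sls(table, indices, num_ranks):
    """Each rank's PU sums only the rows it owns; the DIMM adder then
    combines the per-rank partial sums into the final pooled vector."""
    dim = len(table[0])
    partial = [[0.0] * dim for _ in range(num_ranks)]
    for idx in indices:
        r = rank_of(idx, num_ranks)
        for k in range(dim):
            partial[r][k] += table[idx][k]  # local gather-reduce on rank r
    # DIMM adder: element-wise sum of the per-rank partial sums
    total = [sum(p[k] for p in partial) for k in range(dim)]
    return partial, total


if __name__ == "__main__":
    table = [[1, 2], [3, 4], [5, 6], [7, 8]]
    partial, total = rank_parallel_sls(table, [0, 1, 2, 3, 1], num_ranks=2)
    print(partial, total)
```

Because each rank only returns one partial-sum vector instead of every gathered row, the traffic crossing the channel shrinks with the pooling factor.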

3. Memory-Side Caching, Packet Scheduling, and Hot-Entry Profiling

Memory-Side Caching (RankCache)

Each PU incorporates a small SRAM cache (optimum: 128 KB, 4-way set-associative LRU, write-once) that exploits modest temporal locality. RankCache can capture up to 59% locality for colocated tables; with further co-optimizations, hit rates approach 75%, yielding a $\approx 2\times$ reduction in median memory access time per rank. Requests can explicitly bypass the cache via a 1-bit LocalityBit carried with each NMP-Inst.

The effective latency model is

$$T_{\rm eff} = H\,T_{\rm hit} + (1 - H)\,T_{\rm miss}$$

where $H$ is the hit rate observed in RankCache.
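
A small behavioral model of a set-associative LRU cache with a bypass bit helps show how $H$ feeds the latency model. This is a sketch under stated assumptions: the class name `RankCache` and its methods are illustrative, the default of 512 sets assumes 64-byte lines in a 128 KB, 4-way cache, and a request with `locality_bit=0` is assumed to skip both lookup and fill.

```python
from collections import OrderedDict


class RankCache:
    """Behavioral model of a set-associative LRU cache with bypass."""

    def __init__(self, num_sets=512, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.accesses = 0

    def access(self, row_id, locality_bit=1):
        """Return True on a hit. Bypassed requests always go to DRAM."""
        self.accesses += 1
        if not locality_bit:
            return False  # assumed bypass semantics: no lookup, no fill
        lines = self.sets[row_id % self.num_sets]
        if row_id in lines:
            lines.move_to_end(row_id)  # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict least recently used line
        lines[row_id] = True
        return False

    def hit_rate(self):
        return self.hits / self.accesses if self.accesses else 0.0


def effective_latency(hit_rate, t_hit, t_miss):
    """T_eff = H * T_hit + (1 - H) * T_miss."""
    return hit_rate * t_hit + (1 - hit_rate) * t_miss


if __name__ == "__main__":
    cache = RankCache(num_sets=1, ways=2)
    print([cache.access(r) for r in [1, 2, 1, 3, 2]])
    print(effective_latency(cache.hit_rate(), t_hit=10, t_miss=50))
```

The latency values passed to `effective_latency` are placeholders in arbitrary units, not measurements from the paper.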

Table-Aware Packet Scheduling

A host-side packet scheduler gathers NMP-Insts from all CPU threads and reorders them to maximize row buffer hits and cache locality by grouping accesses by table ID. This reordering reduces DRAM thrashing and cache conflicts.
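
The grouping step can be sketched as follows. This is a minimal illustration, not the scheduler from the paper: `NmpInst`, its fields, and `table_aware_schedule` are hypothetical names, and ordering groups by ascending table ID is an arbitrary choice.

```python
from collections import defaultdict, namedtuple

# Hypothetical, simplified view of an NMP instruction for scheduling purposes.
NmpInst = namedtuple("NmpInst", ["thread_id", "table_id", "row_id"])


def table_aware_schedule(insts, packet_size):
    """Group instructions from all threads by table ID, then cut each group
    into packets so that a packet only touches a single embedding table."""
    by_table = defaultdict(list)
    for inst in insts:
        by_table[inst.table_id].append(inst)  # arrival order kept per table
    packets = []
    for table_id in sorted(by_table):
        group = by_table[table_id]
        for start in range(0, len(group), packet_size):
            packets.append(group[start:start + packet_size])
    return packets


if __name__ == "__main__":
    insts = [
        NmpInst(0, 1, 5), NmpInst(1, 0, 3), NmpInst(0, 0, 7),
        NmpInst(1, 1, 2), NmpInst(0, 1, 9),
    ]
    for packet in table_aware_schedule(insts, packet_size=2):
        print(packet)
```

Keeping one table per packet means consecutive requests share a working set, which is what lets the row buffer and RankCache see reuse instead of interleaved traffic from unrelated tables.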


Hot-Entry Profiling

Hot-entry profiling distinguishes between "hot" embedding IDs (frequently accessed within a batch) and "cold" ones. IDs whose access frequency exceeds a threshold $t$ receive LocalityBit=1; all others receive 0. A threshold of $t = 2$–$3$ is optimal. This policy further improves RankCache efficiency by reducing pollution from single-use accesses.


Bypassing cold entries yields an additional ∼7% latency reduction.
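
A per-batch profiling pass might look like the sketch below. The function name `assign_locality_bits` is illustrative, and reading "above the threshold" as a strict comparison is an assumption.

```python
from collections import Counter


def assign_locality_bits(batch_indices, threshold=2):
    """Return one LocalityBit per lookup: 1 if the embedding ID occurs more
    than `threshold` times in the batch (hot), otherwise 0 (cold, bypass)."""
    freq = Counter(batch_indices)
    # Assumption: "above threshold" is interpreted as strictly greater than.
    return [1 if freq[i] > threshold else 0 for i in batch_indices]


if __name__ == "__main__":
    print(assign_locality_bits([7, 7, 7, 3, 3, 9], threshold=2))
```

Lookups tagged 0 never enter the cache, so single-use IDs cannot evict entries that will be reused later in the batch.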

4. Hardware/Software Co-Optimization Interface

Software API

A specialized API (e.g., nmp_sls(emb, idx, lens, out)) compiles high-level SLS calls into NMP-Inst bundles, tagging instructions with DDR commands, DRAM addresses, burst lengths, partial sum tags, and the LocalityBit.
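
The lowering from an SLS call to tagged instructions could be sketched as below. Everything here is hypothetical: the `NmpInst` field layout, the `compile_sls` helper, and the choice to express burst length as a count of 64-byte bursts are assumptions for illustration and do not reproduce the actual NMP-Inst encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NmpInst:
    """Hypothetical, uncompressed view of the fields named in the text."""
    ddr_cmd: str
    dram_addr: int
    burst_len: int
    psum_tag: int
    locality_bit: int


def compile_sls(base_addr, row_bytes, indices, lengths, locality_bits,
                burst_bytes=64):
    """Emit one read instruction per lookup. All lookups in the same pooling
    segment share a partial-sum tag so they are reduced together."""
    bursts = -(-row_bytes // burst_bytes)  # ceiling division
    insts = []
    offset = 0
    for tag, n in enumerate(lengths):
        for j in range(offset, offset + n):
            addr = base_addr + indices[j] * row_bytes
            insts.append(NmpInst("RD", addr, bursts, tag, locality_bits[j]))
        offset += n
    return insts


if __name__ == "__main__":
    for inst in compile_sls(0x1000, 128, [2, 0, 5], [1, 2], [1, 0, 1]):
        print(inst)
```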

System/Driver Adaptations

  • Embedding arrays are allocated as non-cacheable NMP pages.
  • Optional page coloring ensures physical contiguity and optimal mapping of each embedding table to a physical rank.
  • The driver profiles SLS requests for hotness, constructs packets, and issues them to the memory controller via a FIFO.
  • Exposed tuning parameters include RankCacheSize, PacketSize, PsumTagWidth, and the profiling threshold $t$.

These co-optimizations permit RecNMP to be adopted with minimal kernel and application modifications, facilitating end-to-end optimization and deployment.

5. Evaluation Methodology and Quantitative Results

Experimental Setup

  • Benchmarks: Facebook DLRM variants (RM1-small, RM1-large, RM2-small, RM2-large), production embedding traces (T1–T8).
  • Hardware: Intel Skylake-EP, 18 cores @ 1.6 GHz, DDR4-2400 (4 channels, 1 DIMM/channel, 2 ranks/DIMM).
  • Metrics: Throughput (queries/s), latency percentiles (p50, p95, p99), memory energy (pJ/bit).
  • Simulation: MLC for bandwidth, Ramulator for cycle-level DRAM timing.

Core Results

  • Operator-Level Speedup: Up to 9.8× memory-access latency reduction for SLS.
  • End-to-End Throughput: The peak improvement is reported for RM2-large at batch size 256.
  • Memory Energy: Up to 45.8% reduction compared to baseline.

Table: End-to-End Speedup by Model and Rank Count (Batch=128)

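| Model     | RecNMP (2 ranks) | RecNMP (4 ranks) | RecNMP (8 ranks) |
|-----------|------------------|------------------|------------------|
| RM1-small | 2.2×             | 3.8×             | 4.2×             |
| RM1-large | 2.6×             | 4.4×             | 4.9×             |
| RM2-small | 3.5×             | 6.1×             | 7.9×             |
| RM2-large | 3.8×             | 6.8×             | 8.4×             |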

  • Parallelism Decomposition: The operator-level gain comes from SLS; fully connected (FC) layers are unaffected. Benefits increase linearly with model-level parallelism and with data-level parallelism (batch size).
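
Because FC layers are unaffected, end-to-end gains are bounded by the share of inference time spent in SLS. One rough way to reason about this is Amdahl's law, used here as an illustrative model rather than a formula from the source. The symbols $f$ (fraction of baseline inference time spent in SLS) and $S_{\rm SLS}$ (operator-level speedup) are assumptions, and $f = 0.8$ is a hypothetical value chosen only for the worked example.

```latex
S_{\rm e2e} = \frac{1}{(1 - f) + \dfrac{f}{S_{\rm SLS}}},
\qquad
f = 0.8,\; S_{\rm SLS} = 9.8
\;\Rightarrow\;
S_{\rm e2e} = \frac{1}{0.2 + 0.0816} \approx 3.55
```

Under this model, configurations in which embedding lookups take a larger share of baseline time see end-to-end speedups closer to the operator-level figure.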

6. Deployment Considerations, Limitations, and Implications

Limitations

  • Benefits diminish if embedding tables fit entirely in last-level cache.
  • High spatial locality workloads (e.g., range queries) are better served by standard burst DRAM.
  • Compute-bound recommender components (wide FC layers) require traditional CPU/GPU acceleration.

Integration Guidance

  1. Co-location: Group inference tasks by model table size to optimize DRAM channel utilization.
  2. Page Placement: Use OS page coloring or hinting to bind specific tables to single DRAM ranks for balanced load and improved locality.
  3. Scaling: Add more RecNMP-enabled DIMMs or channels to scale nearly linearly.
  4. Parameter Tuning: Adjust packet and batch sizes, as well as the hot-entry threshold $t$, according to workload characteristics.

RecNMP demonstrates that commodity DRAM-compliant, buffer-chip–centric NMP can accelerate state-of-the-art recommendation systems by exploiting available rank-level parallelism, temporal reuse, and co-optimizations in request scheduling and caching. The architecture provides a practical, scalable path for lifting the memory bandwidth constraint in real-world inference serving systems (Ke et al., 2019).
