
Near Data Processing (NDP)

Updated 1 July 2025
  • Near Data Processing (NDP) is an architectural paradigm that co-locates processing elements with data storage or memory to minimize data movement and improve efficiency.
  • NDP accelerates data-intensive applications like similarity search, machine learning, database analytics, and key-value stores, showing significant performance and energy efficiency gains over traditional CPU-centric architectures.
  • NDP encompasses various implementations, including processing near memory (NMP), processing-in-memory (PIM), and in-storage processing (ISP), leveraging technologies like 3D-stacked memory and computational SSDs.

Near Data Processing (NDP) is an architectural paradigm in which computation is co-located with data storage or memory modules, enabling data-intensive operations to execute close to where the data physically resides. The primary motivation for NDP arises from the inefficiencies and scalability bottlenecks of modern compute-centric architectures, which incur significant performance and energy penalties due to the repeated movement of large data volumes across the memory hierarchy. NDP encompasses a range of implementations—from in-memory accelerators embedded in 3D-stacked DRAM, to programmable logic within SSDs, DIMMs, and storage-class memory. This article synthesizes foundational concepts, representative system architectures, methodologies, and quantitative benefits drawn directly from academic literature and hardware evaluation, providing a comprehensive reference for advanced research and deployment of NDP.

1. Architectural Foundations and Principles

NDP is premised on co-locating, or even embedding, processing elements in or near memory/storage devices to reduce data movement, improve bandwidth utilization, and enhance energy efficiency. Architecturally, NDP can be categorized by its integration level:

  • Near-Memory Processing (NMP): Compute logic is placed near memory, typically in a dedicated logic layer of a 3D-stacked device such as the Hybrid Memory Cube (HMC). An exemplar is the Similarity Search Associative Memory (SSAM), where programmable accelerators are tightly coupled to DRAM vaults in the HMC logic layer, maximizing bandwidth for parallel memory-bound tasks and returning only compact results to the host (1606.03742).
  • Processing-in-Memory (PIM): Computational resources are physically embedded in memory arrays (e.g., DRAM, SRAM, ReRAM). This enables massively parallel bitwise or analog operations such as matrix-vector multiplication (for DNNs) or in-storage bitmap scans (1905.04767, 2112.12630).
  • In-Storage Processing (ISP): Programmable compute units (CPUs, FPGAs) are integrated within SSDs or persistent memory controllers. This arrangement supports workload offloading such as in-storage ML training (1610.02273), time series analysis (2010.02079, 2206.00938), and log compaction for key-value stores (1807.04151).
  • CXL Memory Expanders: NDP is increasingly relevant for memory disaggregation via Compute Express Link (CXL) interfaces. Contemporary designs propose low-overhead, general-purpose NDP (e.g., M²NDP) to address the bandwidth/latency imbalance of CXL-attached memory (2404.19381).

These approaches fundamentally restructure the compute/memory interface, shifting from a rigid von Neumann bottleneck to a hierarchy in which memory and storage devices serve as both data hosts and computation accelerators.

2. System-Level Implementations and Programming Models

System integration of NDP requires careful co-design of hardware acceleration, memory organization, system software, and application APIs:

  • Host-System Integration: NDP modules can replace or supplement standard DRAM in a system. For example, SSAM HMC devices are accessible via the memory bus and expose a minimal programming interface supporting memory allocation and NDP operation invocation with APIs analogous to those found in GPU offloading frameworks (1606.03742).
  • Virtual Memory and Address Translation: The high translation overheads of irregular NDP workloads have led to tailored page table designs (e.g., NDPage), which merge lower-level page tables and introduce L1 cache bypassing for page table entry (PTE) accesses, accommodating the shallow cache hierarchies and irregular access patterns of NDP cores (2502.14220).
  • Kernel Partitioning and Scheduling: Efficient utilization of NDP systems in heterogeneous architectures requires static or dynamic partitioning of computational kernels according to bandwidth and arithmetic intensity. The NDFT framework assigns memory-bound LR-TDDFT kernels to NDP, while compute-heavy kernels remain on CPUs, leveraging a cost-aware scheduling algorithm and SPM-based shared memory for data sharing (2504.03451).
  • Cooperation with General-Purpose CPUs/GPUs: NDP is often designed to complement, rather than replace, existing host CPUs/GPUs. This is evident in collaborative ML acceleration using in-storage processing (ISP) platforms (1610.02273) and in LLM inference acceleration by partitioning and offloading cold neuron computation to NDP-enhanced DIMMs while reserving hot neuron computation for GPUs (2502.16963).
  • Programming Models: NDP programming abstractions borrow from accelerator paradigms: functions are offloaded to the memory or storage side using high-level APIs, kernel descriptors, or JIT-compiled candidate code (e.g., LLVM IR for predicate pushdown in Taurus) (2506.20010). For general-purpose NDP in CXL memory, memory-mapped functions and lightweight μthreads provide a fine-grained, user-space-compatible API and efficient concurrent execution (2404.19381). A minimal host-side sketch of this allocate-and-launch offload style appears after this list.
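
To make the offload style concrete, the sketch below simulates the allocate-and-launch pattern described above. It is a minimal illustration in Python: the NearDataDevice class and its method names are invented for this example (not an API from the cited papers), and the "device" is simply host memory standing in for HMC vaults or CXL-attached pools.

```python
# Minimal sketch of an NDP offload interface (hypothetical names).
# Data is written to device-resident memory once, kernels run memory-side,
# and only compact results cross the host link.
import numpy as np


class NearDataDevice:
    """Simulates an NDP module: buffers live 'device-side', kernels run there."""

    def __init__(self):
        self._buffers = {}

    def alloc_and_write(self, name, array):
        # In a real system this would place data in HMC vaults or CXL memory.
        self._buffers[name] = np.asarray(array)

    def launch(self, kernel, name, *args):
        # The kernel executes next to the data; only its small result returns.
        return kernel(self._buffers[name], *args)


def knn_kernel(database, query, k):
    """Memory-side k-NN: scans the full database locally, returns k indices."""
    dists = np.linalg.norm(database - query, axis=1)
    return np.argsort(dists)[:k]


if __name__ == "__main__":
    dev = NearDataDevice()
    dev.alloc_and_write("vectors", np.random.rand(100_000, 64).astype(np.float32))
    query = np.random.rand(64).astype(np.float32)
    top8 = dev.launch(knn_kernel, "vectors", query, 8)  # only 8 indices move back
    print(top8)
```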

3. Application Domains and Quantitative Impact

NDP has been demonstrated to accelerate a wide range of data-intensive workloads. Representative evaluations include:

| Application Domain | NDP Mechanism | Reported Gains over Baseline |
| --- | --- | --- |
| Similarity Search (1606.03742) | SSAM: HMC-based, vectorized kNN | 426× area-normalized throughput, 934× energy efficiency (over CPU) |
| Machine Learning (1610.02273) | In-storage parallel SGD on SSD | 3× convergence improvement (EASGD); memory-constrained hosts outperform IHP |
| Data Analytics (DBMS) (2506.20010) | Predicate/projection/aggregation pushdown in Taurus page stores | 63% less network traffic, 50% less CPU, 28% lower query time (18/22 TPC-H queries) |
| Key-Value Stores (1807.04151) | Collaborative offload of LSM compaction to NDP in SSD | 2.0× throughput, 36% reduced write amplification, 43–54% lower latency (vs LevelDB) |
| Time Series Analysis (2010.02079, 2206.00938) | NDP in 3D HBM (NATSA) | Up to 14.2× performance, 27.2× energy reduction (vs multi-core baseline) |
| DNN Inference (FCL) (2208.05294) | 3D-stacked NDP vs CHA, PIM | 10.6× faster than conventional, 39.9× faster than PIM for FCL layers |
| LLM Inference (2502.16963) | GPU + NDP-DIMM hybrid inference | 75.24× speedup (Hermes vs offloading), server-scale speed at commodity cost |
| OLAP / Graph / LLM (CXL NDP) (2404.19381) | General-purpose NDP in CXL memory | Up to 128× speedup, 80% energy reduction, linear scaling with device count |

Performance and energy improvements are typically achieved by maximizing memory-side bandwidth, minimizing host-to-device data movement, and matching the computation model to the available local memory bandwidth.
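
As a rough illustration of where such gains come from, the short model below compares moving an entire scan over a narrow host link against filtering at memory-side bandwidth and returning only the selected fraction. All numbers are illustrative assumptions, not figures from the cited evaluations.

```python
# Back-of-the-envelope model of data-movement savings (illustrative numbers):
# scanning data where it lives and returning only matches avoids shipping the
# full dataset over the comparatively narrow host link.
data_bytes  = 1 * 2**30   # 1 GiB scanned
selectivity = 0.01        # 1% of rows survive the predicate
link_bw     = 16e9        # host link bandwidth, ~16 GB/s (assumed)
internal_bw = 320e9       # memory-side bandwidth, 3D-stacked DRAM class (assumed)

host_centric = data_bytes / link_bw                                     # move all, filter on host
ndp = data_bytes / internal_bw + (selectivity * data_bytes) / link_bw   # filter near data

print(f"host-centric scan:          {host_centric * 1e3:.1f} ms")
print(f"NDP scan + result transfer: {ndp * 1e3:.1f} ms")
print(f"speedup on data movement:   {host_centric / ndp:.1f}x")
```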

4. Memory and Storage Technologies

NDP leverages advances in emerging and mature memory/storage technologies:

  • 3D-Stacked Memory: Devices such as HMC and HBM integrate a logic layer beneath multiple DRAM dies interconnected via TSVs, providing >320 GB/s internal bandwidth (1606.03742). This architecture is particularly well-suited for memory-bound workloads (e.g., matrix profile, DNN inference).
  • NAND Flash SSDs: Modern SSD controllers incorporate multicore processors and, in CSDs (Computational Storage Devices), FPGAs or programmable logic. These platforms are used for in-storage analytics, erasure coding, ML training, and graph search, efficiently utilizing the inherent parallelism in NAND organization (channels, LUNs, planes) (1610.02273, 2312.03141).
  • ReRAM and Emerging NVM: ReRAM-based NDP enables analog matrix-vector multiplication, supporting high-density DNN acceleration within memory (2112.12630). However, the analog-digital interface and device reliability remain key challenges.
  • CXL Memory Expanders: CXL-attached memory pools extend DRAM capacity with general-purpose or application-driven NDP capabilities, necessitating highly efficient host-device communication interfaces and light hardware threading to saturate internal bandwidth (2404.19381).

5. NDP for Database, Storage, and Scientific Workloads

NDP is increasingly integrated into database/storage systems and scientific applications:

  • Database Systems: In Taurus, NDP pushes selection, projection, and (partial) aggregation into storage servers at the page store level, freeing compute resources and reducing network transfer. Predicate evaluation uses JIT-compiled LLVM IR shipped from MySQL compute nodes. Per-query, per-scan caches minimize repeated predicate compilation overhead (2506.20010). A simplified sketch of this pushdown pattern appears after this list.
  • KV Stores and Compaction: NDP-enabled SSDs (e.g., Co-KV) partition LSM-tree compaction tasks between host and device using key-range aware schemes. The collaborative approach delivers lower write amplification and reduced host CPU utilization, substantially accelerating update-intensive workloads (1807.04151).
  • Persistent Memory with Crash Consistency: In NearPM, NDP in persistent memory controllers accelerates data-intensive crash consistency primitives (logging, checkpointing, shadow paging) by exposing standardized offloadable primitives, with correctness maintained via partitioned persist ordering (PPO) (2210.10094).
  • Scientific Computation: The NDFT system exploits NDP for memory-bound kernels (e.g., FFT in LR-TDDFT), scheduling based on application profiling and implementing intra-stack shared scratchpad memory to avoid memory bloat during pseudopotential calculations. This approach enables quantum chemistry for system sizes previously unreachable due to memory constraints (2504.03451).
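
The pushdown pattern described above for Taurus can be illustrated with a much-simplified sketch. A plain Python callable stands in for the JIT-compiled LLVM IR predicate, and the page layout and column names are invented for the example.

```python
# Simplified predicate/projection pushdown: the storage side evaluates the
# predicate page by page, and only qualifying, projected rows travel back.

def page_store_scan(pages, predicate, projection):
    """Runs near the data: filter and project before anything leaves storage."""
    for page in pages:                 # each page is a list of row dicts
        for row in page:
            if predicate(row):
                yield tuple(row[col] for col in projection)


# Compute-node side: only the pushed-down result crosses the network.
pages = [
    [{"o_id": 1, "price": 120.0, "region": "EU"},
     {"o_id": 2, "price": 40.0,  "region": "US"}],
    [{"o_id": 3, "price": 300.0, "region": "EU"}],
]
result = list(page_store_scan(
    pages,
    predicate=lambda r: r["region"] == "EU" and r["price"] > 100,
    projection=("o_id", "price"),
))
print(result)   # [(1, 120.0), (3, 300.0)]
```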

6. Technical Innovations, Design Tradeoffs, and Open Challenges

NDP system design involves tradeoffs among programmability, hardware cost, area/power, bandwidth utilization, and workload flexibility:

  • Resource Allocation & Scheduling: Effective resource utilization requires matching kernel types to hardware based on memory- versus compute-intensity, employing cost models for function-level partitioning (2504.03451); a roofline-style sketch of this placement rule appears after this list.
  • Address Translation & Virtual Memory: High address translation overheads from frequent, irregular access patterns are addressed by merged multi-level page tables and metadata-aware cache bypassing (2502.14220).
  • Synchronization: Synchronization for highly parallel NDP systems is challenging due to the absence of cache coherence and shared caches. Dedicated, lightweight hardware engines (e.g., SynCron) employ hierarchical, message-passing protocols among NDP units and hardware-only overflow handling (2101.07557, 2211.05908).
  • Data Placement and Computation Affinity: NDP effectiveness is sensitive to the physical colocation of data and computing units. Hardware/software co-designs (e.g., CODA) manage memory placement and steer computation to maintain locality, using dual-mode mapping and fine-grained affinity scheduling for workloads in GPU/NDP clusters (1710.09517).
  • Data Movement Optimization: Quantization-aware algorithms (e.g., logarithmic activation quantization in QeiHaN) can enable memory-side bit manipulation, reducing memory accesses at a fine granularity via tailored storage layouts and computation logic (2310.18181).
  • Programmability and Ecosystem Integration: Compatibility with existing ISAs, user-space memory mapping for offload commands, and application-transparency for kernel scheduling are critical for practical NDP deployment, as exemplified in general-purpose CXL NDP designs (2404.19381).
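
The kernel-placement idea behind such cost models can be shown with a simple roofline-style rule. The sketch below is illustrative only; the thresholds, numbers, and function names are assumptions and do not reproduce the NDFT cost model.

```python
# Roofline-style partitioning rule: kernels whose arithmetic intensity falls
# below the host's flops/byte balance point are memory-bound and therefore
# candidates for NDP offload, where internal bandwidth is higher.

def assign(kernels, host_peak_flops, host_mem_bw, ndp_mem_bw):
    """kernels: list of (name, flops, bytes_moved). Returns name -> target."""
    placement = {}
    balance = host_peak_flops / host_mem_bw   # flops/byte where the host saturates
    for name, flops, bytes_moved in kernels:
        intensity = flops / bytes_moved
        if intensity < balance:
            # Memory-bound: run where bandwidth is highest.
            placement[name] = "NDP" if ndp_mem_bw > host_mem_bw else "host"
        else:
            placement[name] = "host"          # compute-bound: keep on the CPU
    return placement


print(assign(
    kernels=[("fft", 2e9, 4e9), ("dense_gemm", 5e12, 2e9)],
    host_peak_flops=1e12, host_mem_bw=100e9, ndp_mem_bw=320e9))
# {'fft': 'NDP', 'dense_gemm': 'host'}
```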

7. Implications and Future Directions

NDP is maturing into a core architectural principle for data-centric systems, with demonstrated impact across scientific computing, cloud database systems, ML inference/training, and high-throughput analytics. Open research and engineering directions include:

  • Programming models and APIs that unify heterogeneous NDP targets and provide application transparency while exposing optimization opportunities.
  • OS/hardware collaboration strategies for virtual memory, address translation, and resource management tailored for NDP characteristics.
  • Integration of high-capacity, high-bandwidth emerging memory/storage technologies (e.g., ReRAM, XPoint) with NDP for a broader class of workloads.
  • Hardware/software co-design for next-generation security, data integrity, and resilience, leveraging embedded compute, as seen in CSD-based solutions (2504.15293).
  • Scalable synchronization and work distribution mechanisms for exascale, irregular, and graph-centric applications.

NDP continues to evolve as new memory, storage, and interconnect technologies make it possible to design systems with computation ever closer to where large data resides, fundamentally restructuring the relationship between data, memory, and compute in modern and future computing platforms.
