In-Storage CDPU: Architecture & Performance

Updated 22 May 2026

CDPUs are integrated computational storage devices that combine heterogeneous compute elements with nonvolatile flash memory to perform in situ data processing.
They support versatile programming models from fixed accelerators to containerized arbitrary code execution, enabling efficient offload of data-centric tasks.
Empirical evaluations indicate throughput improvements up to 4× with reduced host CPU load and energy consumption, although challenges remain in scheduling and standardization.

In-storage Computational Data Processing Units (CDPUs) are a class of computational storage devices (CSDs) that integrate programmable, heterogeneous compute elements directly within or immediately adjacent to nonvolatile storage (primarily SSD-class NAND flash). Unlike classical storage, which passively shuttles data to host CPUs, CDPUs execute offloaded user-defined functions (UDFs), security primitives, and analytics kernels near or in situ with the data, which reduces host-device data movement, improves energy efficiency, and enables new system-level services. CDPU architectures span FPGAs, ASICs, embedded ARM/RISC-V clusters, storage-coupled associative arrays, and container-virtualized cores. SNIA and the recent research literature frame CDPUs as a foundational component in the evolution of data-centric computing platforms (Lukken et al., 2021).

1. Architectural Taxonomy and System Models

CDPUs implement a multi-layered architecture, encapsulating the following key components:

Compute substrates: Embedded processors (e.g., ARM Cortex-A53, Cortex-R5/R8), reconfigurable FPGAs, custom ASIC accelerators (e.g., LZ77/Huffman engines), or associative RCAM arrays. Architectures often combine several substrates to serve heterogeneous workloads, as in Conduit’s tri-modal ISP/PuD-SSD/IFP design, which spans the controller cores, on-drive DRAM, and flash (Nadig et al., 24 Jan 2026).
Memory hierarchy: On-drive DRAM (used for FTL metadata, kernel buffers, or runtime) and SRAM/BRAM on accelerator die; persistent NAND flash channels; optional register and buffer pools for fine-grained offload.
Control and operating environments: Firmware-driven runtimes (CSDGuard, lightweight OS, virtual firmware), hardware-managed schedulers, or embedded container environments (DockerSSD).
Host-device interfaces: NVMe and PCIe (standard block I/O or extended with vendor-specific queues, peer-to-peer DMA, or Ethernet-over-NVMe tunnels), ZNS (zoned namespaces), and, in some instances, user-level APIs (OpenCL, eBPF, POSIX filesystems, or custom stubs for SQL/analytics).
Execution models: Instruction-granularity offload (e.g., SIMD vector ops in Conduit), application-level sandboxing (e.g., eBPF or JIT-ed ELF binaries), or container-driven orchestration (mini-docker engines with syscall emulation).

The SNIA model and research literature further classify CDPUs by their position in the system topology:

Drive-type (CSD): Embedded within single SSDs (e.g., Samsung SmartSSD, NGD Newport, ScaleFlux, Solana).
Array-type (CSA): Rack-level or OSD-aggregated nodes, often orchestrating multiple CSDs with parallelized analytics or federated compute flows.
Processor-type (CSP): PCIe-attached FPGAs or SmartNICs logically coupled to block storage (Lukken et al., 2021).

2. CDPU Programming Models and Offload Protocols

CDPU programmability ranges from fixed-function accelerators to flexible UDF-capable engines. The degree of programmability is captured by a hierarchy:

Fixed-transparency: Inline compression (DPZip), cryptography, regular-expression search.
Event-driven programmable dataflow: Directive-based streaming kernels (Biscuit).
Query offloading: SQL/NoSQL operators, aggregation, selection (Ibex, YourSQL).
Arbitrary code execution: Containerized workloads, OpenCL kernel launches, eBPF/JIT code blobs, or static ELF binaries (as in Solana or ZCSD).

Programming models and host offload stacks expose a variety of interfaces:

API bindings and kernel dispatch: CSDGuard and CSDGuard-like runtimes manage buffer lifecycles, I/O interception, kernel launches, and completion events, often abstracted behind a minimal OpenCL, POSIX, or block-I/O interface (Shi et al., 12 Apr 2025).
Compilation and orchestration: LLVM-based vectorization and metadata embedding enable instruction-level granularity scheduling, as in Conduit (Nadig et al., 24 Jan 2026).
Containerization: mini-docker engines running in virtual firmware partition workloads securely, with REST APIs for orchestration and Ether-oN for network integration (Kwon et al., 7 Jun 2025).
eBPF as offload ISA: Verified, bounded, and portable bytecode, with on-device JIT support (ZCSD) (Lukken et al., 2021).

Offload protocols build on NVMe extensions, vendor-specific admin commands, or ZNS semantics to launch user-programmed functions on the device and return their results to the host.
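The "verified, bounded" property attributed to eBPF-style offload above can be illustrated with a toy example. The sketch below is not the eBPF instruction set or the ZCSD runtime: the three-opcode bytecode (`load`, `jlt`, `ret`), the `Insn` type, and the `MAX_INSNS` limit are all hypothetical names introduced for illustration. It shows only the general idea that a device can statically reject programs that might loop forever or read outside the page before it runs them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Insn:
    op: str       # hypothetical opcodes: "load", "jlt", "ret"
    arg: int = 0  # page offset (load), immediate (jlt), or return value (ret)
    off: int = 0  # forward jump distance, used by "jlt" only

MAX_INSNS = 64    # assumed program-size limit, not taken from any real device

def verify(prog, page_size):
    """Accept only programs that must terminate and stay inside the page."""
    if not prog or len(prog) > MAX_INSNS or prog[-1].op != "ret":
        return False
    for pc, insn in enumerate(prog):
        if insn.op == "load":
            if not 0 <= insn.arg < page_size:
                return False          # out-of-page read
        elif insn.op == "jlt":
            target = pc + 1 + insn.off
            if insn.off < 0 or target >= len(prog):
                return False          # backward or out-of-range jump
        elif insn.op != "ret":
            return False              # unknown opcode
    return True

def run(prog, page):
    """Interpret a verified program against one page; returns the 'ret' value."""
    acc, pc = 0, 0
    while True:
        insn = prog[pc]
        if insn.op == "load":
            acc = page[insn.arg]
            pc += 1
        elif insn.op == "jlt":
            pc += 1 + (insn.off if acc < insn.arg else 0)
        else:
            return insn.arg

# Keep a page (return 1) only if its first byte is at least 128.
keep_if_high = [
    Insn("load", arg=0),
    Insn("jlt", arg=128, off=1),
    Insn("ret", arg=1),
    Insn("ret", arg=0),
]
# A backward jump could loop forever, so the verifier must reject it.
looping = [Insn("jlt", arg=0, off=-1), Insn("ret")]
```

Because every jump is forward and the last instruction is a return, the program counter strictly increases and execution is bounded by the program length. Production verifiers perform far more analysis than this.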

3. Performance, Energy, and Latency Evaluation

CDPUs have been empirically shown to deliver significant improvements in throughput, energy efficiency, and end-to-end latency across diverse use cases:

Platform	Throughput speedup (or latency)	Energy reduction (or efficiency / host CPU load)	Data moved off-drive (or CPU offload)	Workloads
SmartSSD/CDPU (Shi et al., 12 Apr 2025)	up to 3×	—	up to 70%	Matrix-mult, erasure coding, security
Solana CDPU (HeydariGorji et al., 2021)	up to 3.1×	67%	68%	Speech-to-text, sentiment, recommender
DockerSSD (Kwon et al., 7 Jun 2025)	up to 2.0×	host CPU -40–60%	—	LLM, database, web services
DTLP (DDLP) (Wei et al., 2024)	1.2–1.4× (ImageNet)	15–20%	25–35% CPU offload	Deep learning preprocessing
DPZip (ASIC) (Lu et al., 28 Sep 2025)	up to 4.7 µs latency	up to 288 MB/J	—	Lossless compression/decompression
STANNIS (HeydariGorji et al., 2020)	up to 2.7×	69%	—	DNN distributed training
Conduit (Nadig et al., 24 Jan 2026)	up to 4.2×	up to 78%	—	Mixed HPC, LLM, analytics kernels

Performance gains are maximized for data-intensive, bandwidth-bound, or filter/reduce-heavy workloads. Applications with tight data locality, chunk-level parallelism, or decomposable queries (Skytether's cost models, Conduit's dynamic offload) realize the highest benefit (Montana et al., 2022, Nadig et al., 24 Jan 2026).
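Why filter/reduce-heavy workloads benefit most can be seen from a back-of-the-envelope model. The sketch below is an assumed, deliberately simple serial model (no overlap of transfer and compute), not the cost model of Skytether or Conduit. All parameter names and the example numbers are illustrative and not measurements from the cited papers.

```python
def host_time(nbytes, link_bw, host_rate):
    """Seconds to ship all data over the host link, then process it on the host."""
    return nbytes / link_bw + nbytes / host_rate

def offload_time(nbytes, link_bw, device_rate, selectivity):
    """Seconds to process in the drive, then ship only the surviving fraction."""
    return nbytes / device_rate + selectivity * nbytes / link_bw

def speedup(nbytes, link_bw, host_rate, device_rate, selectivity):
    """Host-only time divided by offloaded time; values above 1 favor offload."""
    return host_time(nbytes, link_bw, host_rate) / offload_time(
        nbytes, link_bw, device_rate, selectivity
    )

# Illustrative filter scan: 64 GB, 4 GB/s link, only 5% of the data survives.
scan = speedup(64e9, link_bw=4e9, host_rate=16e9, device_rate=8e9, selectivity=0.05)

# Illustrative compute-bound kernel: nothing is filtered and the device is slower.
dense = speedup(64e9, link_bw=4e9, host_rate=2e9, device_rate=0.5e9, selectivity=1.0)

print(f"filter scan speedup: {scan:.2f}")
print(f"compute-bound speedup: {dense:.2f}")
```

Under these assumptions the scan wins because the link term shrinks with selectivity, while the compute-bound kernel loses because the slower device rate dominates.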

End-to-end latency is typically dominated by kernel execution rather than data transfer (>90% of time in SmartSSD experiments), and in-storage engines often fit within flash-program or read latency windows (e.g., DPZip). Efficient multi-tenant scheduling, queue partitioning, and fair arbitration are foundational for sustained throughput under virtualization and multi-user loads (Lu et al., 28 Sep 2025).

4. Security, Data Integrity, and Isolation

CDPUs are increasingly leveraged to support data-integrity, privacy, and resilience workloads in-storage:

Data-integrity enforcement: Local erasure coding and decoding using GF(2^w) matrix multiplication on the accelerator. Locally repairable codes minimize cross-device I/O and reduce decoding latency (Shi et al., 12 Apr 2025).
Ransomware detection and instant recovery: Monitors for entropy spikes and suspicious I/O behavior; supports internal region snapshotting or rollback as an atomic operation.
Fault injection and testing: CDPU interceptors corrupt or drop blocks to validate data-path integrity and recovery protocols.
Hardware-level isolation: MPU-enforced DRAM partitioning, strict namespace and file-system (e.g., λFS) policies, and sandboxed container or JIT-code execution protect multi-tenant deployments against adversarial workloads (Kwon et al., 7 Jun 2025).
Data privacy in federated/distributed training: In-storage DNN frameworks (e.g., STANNIS) ensure local-only access to private data shards, with only public or gradient artifacts traversing the PCIe fabric (HeydariGorji et al., 2020).
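The GF(2^w) matrix multiplication behind the erasure-coding item above can be sketched for w = 8. This is a minimal illustration, not the cited accelerator kernel. It uses the reduction polynomial 0x11D and a RAID-6-style two-row matrix as assumptions, rather than a locally repairable code, and the function names are hypothetical.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8) by shift-and-add with modular reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def gf_inv(a):
    """Inverse of a nonzero byte: a^(2^8 - 2), computed by repeated multiply."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def encode(matrix, blocks):
    """Parity blocks = coding matrix times data blocks, element-wise over GF(2^8)."""
    parity = []
    for row in matrix:
        out = bytearray(len(blocks[0]))
        for coeff, block in zip(row, blocks):
            for i, byte in enumerate(block):
                out[i] ^= gf_mul(coeff, byte)
        parity.append(bytes(out))
    return parity

def recover(row, parity_block, blocks, lost):
    """Rebuild blocks[lost] from one parity block and the surviving data blocks."""
    acc = bytearray(parity_block)
    for j, block in enumerate(blocks):
        if j == lost:
            continue
        for i, byte in enumerate(block):
            acc[i] ^= gf_mul(row[j], byte)
    inv = gf_inv(row[lost])
    return bytes(gf_mul(inv, byte) for byte in acc)

data = [b"flash", b"pages", b"here!"]
matrix = [[1, 1, 1], [1, 2, 4]]          # row 0 is plain XOR parity
p, q = encode(matrix, data)
damaged = [data[0], None, data[2]]       # pretend block 1 was lost
rebuilt = recover(matrix[1], q, damaged, lost=1)
```

A real in-storage kernel would vectorize the inner loops or use lookup tables; the byte-at-a-time form is kept here for readability.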

No formal threat models or correctness proofs are supplied in the surveyed design literature, but mechanisms for isolation and atomicity are sketched. Further research into formally verified on-device OS and hardware TCB minimization remains an open direction.

5. Algorithmic and System-level Programming Techniques

CDPU-exploiting frameworks implement several system-level techniques, including:

Decomposable query planning: Super-plan/sub-plan partitioning pushes portions of logical plans (e.g., selection, projection, aggregation) down into independently acting CDPUs. Cost models trade per-device compute, network/queue overhead, and data movement to determine the optimal pushdown split (Montana et al., 2022).
Instruction-granular cost-based offloading: Runtime cost functions estimate per-instruction or per-kernel latency across heterogeneous in-drive substrates (controller, DRAM, and flash), scheduling each operation to the resource with minimum total latency (Conduit) (Nadig et al., 24 Jan 2026).
HLS and FPGA optimizations: Partitioning logic and loop-unrolling directives increase operational throughput but can degrade achievable clock rates; designers must measure the trade-off between clock rate and logic utilization and cap unrolling once the kernel frequency falls below drive-link limits (Shi et al., 12 Apr 2025).
Dual-pronged pipelines and dynamic scheduling: Coordinated host/CDPU (and GPU) pipelines (Weighted-Round-Robin, MTE) maximize overlap, load-balance, and mask variability in device or host-side throughput, particularly in learning and analytics tasks (Wei et al., 2024).
Compiler-level metadata annotation: Application loops are vectorized and annotated with resource allocation hints and schedule metadata for real-time offloading (Nadig et al., 24 Jan 2026).
Page- and zone-granular offload: ZNS-based CDPUs expose append/write/read and reset primitives with eBPF hooks for page-local UDF execution (Lukken et al., 2021).
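The instruction-granular, cost-based offloading item above amounts to evaluating a latency estimate per substrate and taking the minimum. The sketch below assumes a simple additive cost (queue backlog plus operand movement plus compute). The `Substrate` fields and all numbers are hypothetical and are not Conduit's actual cost function.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Substrate:
    name: str
    ops_per_us: float       # assumed sustained compute rate
    move_us_per_kib: float  # assumed cost to bring operands to this substrate
    queue_us: float = 0.0   # current backlog on this substrate

def estimated_latency(sub, n_ops, operand_kib):
    """Additive estimate in microseconds: backlog + data movement + compute."""
    return sub.queue_us + operand_kib * sub.move_us_per_kib + n_ops / sub.ops_per_us

def pick(substrates, n_ops, operand_kib):
    """Schedule the operation on the substrate with minimum estimated latency."""
    return min(substrates, key=lambda s: estimated_latency(s, n_ops, operand_kib))

# Illustrative in-drive substrates: controller cores, on-drive DRAM, and flash.
substrates = [
    Substrate("controller", ops_per_us=200.0, move_us_per_kib=0.50),
    Substrate("dram", ops_per_us=800.0, move_us_per_kib=0.20),
    Substrate("flash", ops_per_us=50.0, move_us_per_kib=0.0),
]

compute_heavy = pick(substrates, n_ops=1000, operand_kib=4)
data_heavy = pick(substrates, n_ops=100, operand_kib=64)

# The same compute-heavy operation when the DRAM substrate has a 30 us backlog.
busy = [replace(s, queue_us=30.0) if s.name == "dram" else s for s in substrates]
rerouted = pick(busy, n_ops=1000, operand_kib=4)
```

Including the backlog term is what makes the choice dynamic: the same operation can land on a different substrate as queues fill.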

6. Limitations, Trade-offs, and Open Challenges

Key limitations and trade-offs reported in the current-generation CDPU literature include:

Compute/memory/clock trade-offs: CDPU cores typically operate at lower frequencies (1–2.2 GHz) and have less DRAM than host CPUs, constraining their advantage on pure compute-bound (vs. I/O-limited) kernels.
Thermal and area constraints: Increased compute density in SSDs raises TDP, which must be managed within form-factor thermal envelopes (Lukken et al., 2021).
Scheduling complexity: Most reported systems use FIFO or host-driven queueing with limited device-side prioritization; effective scaling will necessitate QoS, priority, and multi-tenant scheduling mechanisms.
Semantic gap between block-level I/O and function-level compute: Advanced features require richer APIs, FTL and firmware bypasses, and zone- or namespace-level awareness (e.g., ZNS and Open-Channel paradigms).
Lack of standardization: The absence of unified APIs, ICDs, and function descriptor runtimes complicates application portability and cross-vendor deployment; Lukken et al. (2021) suggest OpenCSL and SNIA initiatives.
Energy/area/performance non-uniformity: Small-granularity operations, non-local workloads, or incompressible data can degrade expected speedup or energy ratios, mandating cost-aware and adaptive runtime substrate selection (Lu et al., 28 Sep 2025, Nadig et al., 24 Jan 2026).
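As one example of device-side prioritization beyond FIFO, the sketch below shows weighted round-robin arbitration across per-tenant command queues. It is a generic illustration of the policy, not the scheduler of any system cited here. The tenant names, weights, and command labels are made up.

```python
from collections import deque

def weighted_round_robin(queues, weights):
    """Yield (tenant, command) pairs, serving up to weights[tenant] per round."""
    if any(weights[tenant] < 1 for tenant in queues):
        raise ValueError("every tenant needs a weight of at least 1")
    pending = {tenant: deque(cmds) for tenant, cmds in queues.items()}
    while any(pending.values()):
        for tenant, queue in pending.items():
            for _ in range(weights[tenant]):
                if not queue:
                    break
                yield tenant, queue.popleft()

# Tenant "a" is given twice the service share of tenant "b".
queues = {"a": ["a1", "a2", "a3", "a4"], "b": ["b1", "b2"]}
weights = {"a": 2, "b": 1}
order = list(weighted_round_robin(queues, weights))
```

Unlike a single FIFO, a burst from one tenant cannot starve the others, because each round serves every non-empty queue.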

Open directions include certified on-device code verification, federated and privacy-preserving execution, cache-coherency via new memory fabrics (CXL, CCIX), and OS-level orchestration of CDPUs as compute/storage primitives.

7. Representative Use Cases and Impact Assessment

CDPUs support a broad spectrum of offloaded analytics and security-critical workloads, including:

Filtering, selection, and aggregation: TPC-H, SQL, and regex-driven scan queries achieve 2–6× throughput improvement and a 60–90% reduction in host/storage I/O (Lukken et al., 2021).
Compression and decompression: ASIC-based CDPUs achieve 4.7–9.4 GB/s compress/decompress rates, with latency and efficiency enabling operation in the storage I/O path (DPZip) (Lu et al., 28 Sep 2025).
Deep learning and analytics: End-to-end deep learning preprocessing and multi-device DNN training pipelines realize up to 2.7× speedup, 69% energy reduction, and privacy via federated storage-local training (Wei et al., 2024, HeydariGorji et al., 2020, Choe et al., 2016).
Genomics and scientific data: Array-unfriendly, partitionable workloads (single-cell expression matrices, selection/projection kernels) can leverage decomposable query pushdown, trading higher per-CDPU latency for aggregate bandwidth and compute scale-out (Montana et al., 2022).
Container/service orchestration: IP-native CDPUs support Kubernetes-managed pools for disaggregated, large-scale analytics and distributed ML serving (DockerSSD) (Kwon et al., 7 Jun 2025).
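The decomposable query pushdown used by the filtering and genomics cases above comes down to choosing how many leading operators of a plan to run on the device. The sketch below enumerates every split point under an assumed serial cost model for a single device. The `Op` fields, rates, and output ratios are hypothetical and do not reproduce the Skytether cost model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str
    out_ratio: float    # output bytes per input byte
    device_rate: float  # bytes/s when run on the CDPU (assumed)
    host_rate: float    # bytes/s when run on the host (assumed)

def plan_cost(ops, k, nbytes, link_bw):
    """Seconds when the first k operators run on the device and the rest on the host."""
    size, cost = float(nbytes), 0.0
    for i, op in enumerate(ops):
        if i == k:
            cost += size / link_bw       # ship the intermediate result to the host
        cost += size / (op.device_rate if i < k else op.host_rate)
        size *= op.out_ratio
    if k == len(ops):
        cost += size / link_bw           # full pushdown: ship only the final result
    return cost

def best_split(ops, nbytes, link_bw):
    """Return the k in 0..len(ops) with minimum estimated cost."""
    return min(range(len(ops) + 1), key=lambda k: plan_cost(ops, k, nbytes, link_bw))

plan = [
    Op("select", out_ratio=0.1, device_rate=4e9, host_rate=8e9),
    Op("project", out_ratio=0.5, device_rate=4e9, host_rate=8e9),
    Op("aggregate", out_ratio=0.001, device_rate=0.05e9, host_rate=4e9),
]
split = best_split(plan, nbytes=10e9, link_bw=2e9)
print("operators pushed down:", [op.name for op in plan[:split]])
```

With these made-up numbers the selection and projection are pushed down while the aggregation stays on the host, because the assumed device is slow at aggregation and the data is already small by that point.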

Collectively, these empirical results validate the CDPU paradigm as an enabler of both near-data and in-data processing, offering quantifiable reductions in system bottlenecks caused by data movement and enabling new, composable service models for future data-centric infrastructure (Lukken et al., 2021, Nadig et al., 24 Jan 2026, Shi et al., 12 Apr 2025).