ASIC-Based In-Storage CDPU
- ASIC-Based In-Storage CDPUs are specialized integrated circuits embedded in storage devices that process data in-line, minimizing data movement.
- They balance throughput, latency, and power by integrating multi-stage pipelines and optimized algorithm designs for tasks like compression and encryption.
- Their integration into CSDs leads to significant performance gains, lower energy consumption, and cost reductions in hyperscale data center applications.
An ASIC-based In-Storage Computational Data Processing Unit (CDPU) is a fixed-function, application-specific integrated circuit designed to execute compression, encryption, machine learning, database, or domain-specific processing tasks directly within the storage path of a computational storage device (CSD) or SSD. By embedding compute resources physically near the data, these architectures minimize data movement, reduce energy consumption, and enable real-time, scalable acceleration of storage- and I/O-intensive workloads. The design of modern in-storage CDPU systems spans microarchitectural advances, algorithm selection, placement strategies, security models, and cost optimization, rendering these units central to the evolution of hyperscale data center storage infrastructures.
1. Architectural Fundamentals
ASIC-based in-storage CDPUs are physically integrated within the SSD controller or directly as chiplets on the storage device. Unlike peripheral accelerators (e.g., PCIe cards) or host-embedded accelerators, the in-storage CDPU operates "in-line" with the flash I/O datapath. Architecturally, the CDPU can comprise multi-stage pipelines: for example, DPZip (Lu et al., 28 Sep 2025) embeds a pipelined LZ77 encoder/decoder, Huffman and FSE entropy coders, dual-port SRAM, and register-backed buffers to process 4KB flash pages with sub-10 µs latency.
A typical CDPU design balances area, throughput, and power by optimizing placement of compute logic near the flash controller, using AXI/SoC interconnects. Key trade-offs are made between match-processing policy (lazy vs. exhaustive), compression granularity (fixed page size), hash table design, and entropy encoding schedules. Deterministic, high-frequency pipelining (e.g., up to 1 GHz in a 12 nm process) and canonical Huffman trees scheduled over fixed cycle budgets enable predictable, high-throughput operation.
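To make the pipelining argument concrete, the following sketch models a single in-line engine as a sequence of stages with per-page cycle budgets. The stage names and cycle budgets are illustrative assumptions (only the ~1 GHz clock and the 4KB page size come from the description above), yet they reproduce the sub-10 µs single-page latency regime and show how steady-state throughput is set by the slowest stage:

```python
# Minimal model of an in-line compression datapath handling one 4 KiB flash
# page. Stage names and per-stage cycle budgets are illustrative assumptions,
# not DPZip's actual microarchitectural parameters; only the ~1 GHz clock and
# the 4 KiB page size come from the description above.

PAGE_BYTES = 4096
CLOCK_HZ = 1.0e9  # deterministic 1 GHz pipeline clock

# (stage, assumed cycles to stream one 4 KiB page through that stage)
PIPELINE_STAGES = [
    ("lz77_match",     4096),  # one input byte per cycle through the match engine
    ("entropy_encode", 4096),  # canonical Huffman / FSE pass
    ("output_buffer",   256),  # register-backed buffer drained toward flash
]

def first_page_latency_us(stages, clock_hz=CLOCK_HZ):
    """Fill latency: the first page must traverse every stage in sequence."""
    return sum(c for _, c in stages) / clock_hz * 1e6

def steady_throughput_gbs(stages, clock_hz=CLOCK_HZ):
    """Once full, the pipeline emits one page per slowest-stage interval."""
    pages_per_sec = clock_hz / max(c for _, c in stages)
    return pages_per_sec * PAGE_BYTES / 1e9

if __name__ == "__main__":
    print(f"first-page latency : {first_page_latency_us(PIPELINE_STAGES):.2f} us")
    print(f"steady throughput  : {steady_throughput_gbs(PIPELINE_STAGES):.2f} GB/s per engine")
```

Under these assumed budgets the model yields roughly 8.4 µs of fill latency and 1 GB/s per engine; reaching multi-GB/s device throughput then becomes a matter of processing multiple bytes per cycle or replicating engines, which is the area/throughput trade-off noted above.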
2. Placement Strategies and Performance Sensitivity
Placement of the CDPU profoundly impacts throughput, latency, and energy characteristics. Three regimes are recognized (Lu et al., 28 Sep 2025):
| Placement Regime | Throughput & Latency | Data Path Overhead |
|---|---|---|
| Peripheral (QAT 8970) | High throughput (5.1 GB/s), higher latency | Significant PCIe transfer overhead |
| On-chip (QAT 4xxx) | Low latency (down to 9 µs), moderate throughput | Limited bandwidth; tight CPU coupling |
| In-storage (DPZip/CDPU) | Highest throughput (5.6 GB/s compress, 9.4 GB/s decompress), lowest latency (4.7 µs / 2.6 µs) | Minimal data movement; direct flash interface |
In-storage placement eliminates host-SSD memory copies, enables line-rate I/O processing, and delivers superior scalability. DPZip, for example, scales throughput nearly linearly across devices (achieving over 98 GB/s with eight devices) and maintains application-level isolation (CV < 0.5%) under SR-IOV partitioning.
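A rough back-of-the-envelope model clarifies where the placement regimes differ. The per-request PCIe round-trip cost, copy bandwidth, and the peripheral engine latency below are illustrative assumptions; the on-chip and in-storage latencies are the measured figures from the table:

```python
# Rough end-to-end cost of compressing one 4 KiB page under the three
# placement regimes. The on-chip and in-storage compute latencies are the
# measured values quoted above; the peripheral compute latency and the
# per-request PCIe round-trip cost (DMA setup, doorbell, interrupt) are
# illustrative assumptions.

PAGE_BYTES = 4096
PCIE_ROUNDTRIP_US = 10.0   # assumption: fixed per-request host<->card cost
PCIE_BW_GBS = 16.0         # assumption: effective PCIe copy bandwidth

def copy_us(nbytes: int) -> float:
    return nbytes / (PCIE_BW_GBS * 1e9) * 1e6

placements = {
    #                          compute_us  round_trips  host<->device copies
    "peripheral (QAT 8970)":   (9.0,        1,           2),  # compute_us assumed
    "on-chip (QAT 4xxx)":      (9.0,        0,           0),  # measured: down to 9 us
    "in-storage (DPZip CDPU)": (4.7,        0,           0),  # measured: 4.7 us compress
}

for name, (compute_us, trips, copies) in placements.items():
    total = compute_us + trips * PCIE_ROUNDTRIP_US + copies * copy_us(PAGE_BYTES)
    print(f"{name:26s} ~{total:5.1f} us per 4 KiB page")
```

Even with generous bandwidth assumptions, the fixed per-request round trip dominates the peripheral path at 4KB granularity, which is why in-line placement on the flash datapath yields the lowest per-page latency.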
3. Algorithmic Design and Selection
Optimal algorithm selection involves a trade-off between compression ratio, computational resource requirement, and silicon efficiency. DPZip employs a resource-lean Zstd-like pipeline: a pipelined LZ77 variant with first-fit lazy matching, bounded hash tables, and canonical Huffman/FSE encoding. While a single optimized algorithm enables hardware simplicity and power efficiency, it may incur modest penalties in compression ratio relative to exhaustive software approaches.
The use of fixed 4KB granularity aligns with SSD page sizes but may limit redundancy detection across longer spans. Hardware area constraints preclude multi-algorithm configurability for most designs; thus, next-generation CDPUs may explore dictionary-based or multi-level adaptive schemes.
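The following simplified software model illustrates the kind of resource-lean LZ77 front end described above, combining a bounded (single-probe, overwrite-on-collision) hash table with first-fit candidate selection and one-step lazy matching. It is an expository sketch, not DPZip's actual encoder:

```python
# Simplified model of a hardware-friendly LZ77 front end: a bounded hash
# table keyed on 3-byte prefixes, first-fit candidate selection (one probe,
# no chain walking), and one-step lazy matching. Illustrative sketch only.

MIN_MATCH = 3
HASH_BITS = 12                      # assumption: 4096-entry bounded hash table
HASH_SIZE = 1 << HASH_BITS

def _hash3(data, i):
    return ((data[i] << 8) ^ (data[i + 1] << 4) ^ data[i + 2]) & (HASH_SIZE - 1)

def _match_len(data, a, b, limit):
    n = 0
    while b + n < limit and data[a + n] == data[b + n]:
        n += 1
    return n

def _find_match(data, i, table, limit):
    """First-fit: probe a single hash slot; accept if it yields >= MIN_MATCH bytes."""
    if i + MIN_MATCH > limit:
        return 0, 0
    cand = table[_hash3(data, i)]
    if cand is None or cand >= i:
        return 0, 0
    length = _match_len(data, cand, i, limit)
    return (length, i - cand) if length >= MIN_MATCH else (0, 0)

def lz77_lazy(data: bytes):
    """Return a token stream of literals (int) and (length, distance) pairs."""
    table = [None] * HASH_SIZE
    out, i, n = [], 0, len(data)
    while i < n:
        length, dist = _find_match(data, i, table, n)
        if length:
            # lazy step: a longer match starting one byte later wins
            nlen, ndist = _find_match(data, i + 1, table, n)
            if nlen > length:
                out.append(data[i])                    # emit deferred literal
                length, dist, i = nlen, ndist, i + 1
        if length:
            out.append((length, dist))
            for j in range(i, min(i + length, n - MIN_MATCH + 1)):
                table[_hash3(data, j)] = j             # bounded: overwrite slot
            i += length
        else:
            out.append(data[i])
            if i + MIN_MATCH <= n:
                table[_hash3(data, i)] = i
            i += 1
    return out

if __name__ == "__main__":
    print(lz77_lazy(b"abcabcabcabxabcabc"))
    # e.g. [97, 98, 99, (8, 3), 120, (5, 6), 99]
```

Bounding the hash table to a single slot per key keeps SRAM area and lookup latency fixed at the cost of occasionally missing longer matches, which is precisely the modest compression-ratio penalty relative to exhaustive software search noted above.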
4. Practical Applications and System-Level Implications
ASIC-based in-storage CDPUs find application across lossless compression (Lu et al., 28 Sep 2025), erasure coding, fault injection (FI), ransomware detection (Shi et al., 12 Apr 2025), database query acceleration (Montana et al., 2022), and serverless machine learning inference (Mahapatra et al., 2023). These tasks benefit from near-data acceleration, reducing system-level energy and latency.
System integration reveals that microbenchmark gains (throughput, latency) do not directly translate into application speedup unless I/O-stack bottlenecks and host memory amplification are addressed. For instance, coupling RocksDB SSTable writes to the in-storage compression path, or optimizing filesystem metadata management, is necessary to harness the full CDPU benefit. In Btrfs and ZFS, asynchronous block operations and read amplification modulate real-world acceleration.
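An Amdahl's-law style calculation makes this concrete; the offloaded fractions and the 20× stage speedup below are illustrative assumptions, not measured RocksDB or filesystem numbers:

```python
# Amdahl-style illustration of why a large microbenchmark gain shrinks at the
# application level when the IO stack, metadata management, and host memory
# amplification are not offloaded. Fractions and speedup are assumptions.

def end_to_end_speedup(offloaded_fraction: float, accel_speedup: float) -> float:
    """Classic Amdahl's law: only the offloaded fraction is accelerated."""
    return 1.0 / ((1.0 - offloaded_fraction) + offloaded_fraction / accel_speedup)

if __name__ == "__main__":
    accel = 20.0  # assume the CDPU makes the compression stage 20x faster
    for frac in (0.3, 0.6, 0.9):
        print(f"compression is {frac:.0%} of runtime -> "
              f"app speedup {end_to_end_speedup(frac, accel):.2f}x")
```

If compression accounts for only 30% of application runtime, a 20× stage speedup delivers barely 1.4× end to end; enlarging the offloaded fraction (e.g., by restructuring the write path) is what unlocks the CDPU's headline gains.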
Multi-tenant isolation is achieved through hardware partitioning (SR-IOV), with DPZip maintaining predictably low performance variation compared to peripheral or on-chip solutions. Standalone module power is cited at 2.5 W for DPZip versus ~132 W for CPU compression (a ~50× improvement in power efficiency); full-system savings are closer to 3.5× once all management overheads are accounted for.
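A quick arithmetic check of these figures (the host software compression throughput is an assumed placeholder; the power draws and DPZip throughput are the values quoted in this article):

```python
# Sanity check of the quoted power-efficiency figures. The CPU software
# compression throughput is an assumed placeholder for illustration.

cdpu_power_w, cdpu_tput_gbs = 2.5, 5.6   # standalone DPZip module (quoted)
cpu_power_w = 132.0                      # host CPU doing compression (quoted)
cpu_tput_gbs = 2.0                       # assumption for illustration only

print(f"raw power ratio    : {cpu_power_w / cdpu_power_w:.0f}x")              # ~53x
print(f"CDPU energy per GB : {cdpu_power_w / cdpu_tput_gbs:.2f} J/GB")
print(f"CPU  energy per GB : {cpu_power_w / cpu_tput_gbs:.1f} J/GB (assumed tput)")
```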
5. Security and Trusted Execution
Security models are shifting toward hardware-based Trusted Execution Environments (TEE) for CDPUs (Xue et al., 2022). Designs such as IceClave partition controller memory into normal, protected, and secure regions, utilize hybrid-counter memory encryption, and employ Bonsai Merkle Tree integrity checks—all with minimal (<8%) overhead.
Dedicated on-chip cipher engines protect data movement between compute and flash, complementing secure boot and inline cryptographic routines. These hardware features are critical for multi-tenant cloud environments, edge computing, and sensitive applications (e.g., ransomware detection, robust integrity checks). The attack surface is reduced by minimizing reliance on software layers and implementing hardware barriers.
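The sketch below shows a simplified Merkle-tree integrity check over the blocks of a protected memory region, in the spirit of the verification such designs perform in hardware. It is a software illustration, not IceClave's Bonsai Merkle Tree engine (which hashes per-block encryption counters rather than raw data):

```python
# Simplified Merkle-tree integrity check over the blocks of a protected
# memory region. Software illustration only; real designs verify counters
# and MACs in dedicated hardware alongside the cipher engines.

import hashlib

def _h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def build_tree(blocks):
    """Return the list of tree levels, leaves first, root last."""
    level = [_h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def root(levels) -> bytes:
    return levels[-1][0]

def verify_block(blocks, levels, index) -> bool:
    """Recompute the path from one block to the root and compare."""
    node = _h(blocks[index])
    for level in levels[:-1]:
        padded = level + [level[-1]] if len(level) % 2 else level
        sibling = padded[index ^ 1]
        node = _h(node + sibling) if index % 2 == 0 else _h(sibling + node)
        index //= 2
    return node == root(levels)

if __name__ == "__main__":
    region = [bytes([i]) * 64 for i in range(5)]     # five 64-byte blocks
    levels = build_tree(region)
    print(verify_block(region, levels, 3))           # True
    region[3] = b"tampered".ljust(64, b"\0")
    print(verify_block(region, levels, 3))           # False: integrity violation
```

Any modification of a protected block changes its leaf hash and therefore the recomputed root, so only the root needs to be held in tamper-proof storage, which is what keeps the hardware overhead of such schemes below the ~8% figure cited above.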
6. Capacity Planning, Cost, and Scalability
Analytical models such as CSDPlan (Byun et al., 2023) provide quantitative frameworks for capacity planning in CDPU deployments. These models express execution time for conventional SSD-host systems and for CSD arrays as functions of device count, host and in-storage compute throughput, and internal versus external bandwidth, and derive break-even point (BEP) conditions under which a CSD array matches the SSD-host baseline.
Higher ASIC-based CDPU compute throughput and internal I/O bandwidth lower the required device count and total cost of ownership (TCO). Real-world deployments indicate cost reductions of 32–55% over traditional SSD-host solutions when in-storage acceleration is leveraged.
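As a hedged illustration (the symbols and functional form below are assumptions for exposition, not the exact CSDPlan equations), let $D$ denote the data volume, $B_{\text{ext}}$ the per-device external bandwidth, $B_{\text{int}}$ the per-device internal bandwidth, $C_{\text{host}}$ the host compute throughput, and $C_{\text{cdpu}}$ the per-device CDPU throughput. Then

$$
T_{\text{ssd}}(n) \approx \frac{D}{n\,B_{\text{ext}}} + \frac{D}{C_{\text{host}}}, \qquad
T_{\text{csd}}(n) \approx \frac{D}{n\,B_{\text{int}}} + \frac{D}{n\,C_{\text{cdpu}}}, \qquad
n_{\text{BEP}} = \min\{\, n : T_{\text{csd}}(n) \le T_{\text{ssd}}(n_{\text{ssd}}) \,\}.
$$

In this form, raising $C_{\text{cdpu}}$ or $B_{\text{int}}$ shrinks $T_{\text{csd}}(n)$ for a given device count, lowering $n_{\text{BEP}}$ and hence TCO, which is the qualitative trend the reported 32–55% cost reductions reflect.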
7. Evolution, Limitations, and Future Directions
Historically, computational storage evolved from "active disks" with limited mechanical resources (Shi et al., 12 Apr 2025) to advanced SSD-based CSDs integrating FPGAs and ASICs. The modern trend favors hardwired, in-storage CDPUs for data-intensive and security-critical applications.
Recognized limitations include fixed compression granularity, lack of multi-algorithm configurability, and challenges in mapping microarchitectural gains to application-level acceleration. Future directions include integrating dictionary-based and multi-granularity compression (Lu et al., 28 Sep 2025), dynamic resource management for multi-tenant environments, standardized interfaces, and enhanced SoC interconnects to further reduce data movement bottlenecks.
A plausible implication is that increasing algorithm configurability, coupled with further SoC-level integration and standardized resource management, will position ASIC-based in-storage CDPUs as core infrastructure across cloud and edge datacenters, combining ultra-low latency, high-throughput, power efficiency, and security native to the storage path.