In-Storage Processing (ISP)
- In-Storage Processing (ISP) is a paradigm where compute resources like embedded CPUs, FPGAs, or ASICs are integrated within storage devices to process data in-place.
- ISP leverages specialized hardware architectures to minimize host-storage transfers, enhance throughput, and improve energy efficiency for tasks such as graph analytics and machine learning.
- ISP enables advanced use cases, including genomic analysis and containerized analytics, through tailored offload strategies that optimize data movement and computation.
In-Storage Processing (ISP) refers to the deployment of general-purpose or specialized compute resources inside storage devices (notably Solid-State Drives, or SSDs) to execute user-defined or system-assistive tasks directly where the data resides. ISP architectures leverage the internal parallelism and bandwidth of modern storage subsystems, aiming to eliminate or drastically reduce host–storage data transfers, thereby improving performance, energy efficiency, and often enabling new computational paradigms for data-intensive workloads. ISP has become integral in domains ranging from graph analytics and machine learning to large-scale retrieval, genomics, and containerized analytics.
1. Hardware Architectures and System Abstractions
ISP systems are typically built upon computational storage devices (CSDs) that integrate one or more compute resources (embedded CPUs such as ARM Cortex-A53/A9, FPGAs, or custom ASIC accelerators) alongside standard NAND flash, NVMe/PCIe interfaces, and local DRAM or SRAM page buffers. Architectures include:
- Firmware-based CSDs with lightweight ARM cores executing both flash management (FTL, wear-leveling, garbage collection) and offloaded kernels (e.g., SmartSAGE on Cosmos+ OpenSSD or Newport CSDs) (Lee et al., 2022, HeydariGorji et al., 2020).
- FPGA-augmented SSDs for domain-specific acceleration (e.g., PreSto for RecSys data preprocessing, BlueDBM for sparse pattern processing) (Lee et al., 2024, Jun et al., 2016).
- ASIC-integrated units (e.g., DPZip for on-drive lossless compression) (Lu et al., 28 Sep 2025).
- DRAM/flash crossbar and RCAM arrays for associative or parallel in-data operations (e.g., PRINS) (Yavits et al., 2018).
- Virtualized/containerized ISP enabling disaggregation via secure, managed, in-SSD microenvironments (e.g., DockerSSD) (Kwon et al., 7 Jun 2025).
The key architectural differentiators are the location and accessibility of the compute substrate (controller CPU, DRAM, flash periphery logic), the type and size of local memory available, and the device’s support for custom firmware, networked management, and resource isolation.
2. Programming Models, Software Co-Design, and Security
Programming models for ISP encompass both low-level firmware extensions and higher-level container or runtime systems:
- Firmware Co-Design: Direct offload of phases such as sampling, filtering, or feature computation to in-firmware logic, exposed to the host via custom NVMe admin commands (e.g., SmartSAGE’s subgraph generation command, PreSto’s admin channel for per-feature task configuration) (Lee et al., 2022, Lee et al., 2024).
- Virtualization and Containerization: Lightweight container-like infrastructure (mini-Docker, OS-level virtualization with syscall emulation) isolates workloads and exposes standard docker workflows to the host, supporting distributed management and resource throttling (Kwon et al., 7 Jun 2025).
- Security and Trust: Since co-tenant ISP workloads share in-SSD DRAM and may request flash access, sophisticated isolation mechanisms are essential. IceClave extends ARM TrustZone to partition controller memory and FTL tables, employing in-DRAM and flash-channel encryption, and per-entry access control (unique TEE ID check) (Xue et al., 2022). Overhead for such TEE enforcement is low: 7.6% in throughput compared to unsecured ISP, with up to 2.31× speedup over host-based TEEs.
- Languages and APIs: Current practice uses low-level kernel insertion or OpenCL/bitstream deployment for FPGAs, ARM ELF binaries for embedded CPUs, and, increasingly, Docker/REST APIs for managed, distributed environments.
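The per-entry access control mentioned under Security and Trust can be illustrated with a toy model. The following is a minimal sketch, not IceClave's implementation: it assumes a logical-to-physical mapping table whose entries carry an owner tag, and every name in it (`MappingEntry`, `GuardedMappingTable`, `AccessDenied`, `owner_tee_id`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingEntry:
    physical_page: int  # flash page backing this logical page
    owner_tee_id: int   # hypothetical per-entry owner tag


class AccessDenied(Exception):
    """Raised when a workload touches a page it does not own."""


class GuardedMappingTable:
    """Toy logical-to-physical table with a per-entry owner check."""

    def __init__(self):
        # logical page number -> MappingEntry
        self._entries = {}

    def map_page(self, logical_page, physical_page, tee_id):
        self._entries[logical_page] = MappingEntry(physical_page, tee_id)

    def translate(self, logical_page, tee_id):
        entry = self._entries.get(logical_page)
        if entry is None:
            raise KeyError(f"logical page {logical_page} is unmapped")
        # The ID comparison is the whole access-control decision in this sketch.
        if entry.owner_tee_id != tee_id:
            raise AccessDenied(
                f"TEE {tee_id} may not access logical page {logical_page}"
            )
        return entry.physical_page


table = GuardedMappingTable()
table.map_page(logical_page=0, physical_page=4096, tee_id=1)
table.map_page(logical_page=1, physical_page=8192, tee_id=2)

print(table.translate(0, tee_id=1))  # owner access succeeds
try:
    table.translate(1, tee_id=1)     # cross-tenant access is rejected
except AccessDenied as err:
    print("denied:", err)
```

A real controller would enforce this check in protected memory and combine it with the encryption described above; the sketch only shows where the ID comparison sits in the translation path.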
3. Task Partitioning, Offload Strategies, and Workload Mapping
Deciding what computation to offload is crucial. ISP excels when offloaded tasks are:
- High-latency, low-compute, memory-bound: e.g., neighbor sampling in GNN training (SmartSAGE), feature bucketing in RecSys (PreSto), or MapReduce pre-filtering (Lee et al., 2022, Lee et al., 2024, HeydariGorji et al., 2021).
- Sparse pattern processing and filtering: e.g., in genomics, hash-based or chain-based filtering of reads (GenStore, MegIS) (Ghiasi et al., 2022, Ghiasi et al., 2024).
- Vector and top-k/threshold pruning operations: e.g., retrieval-augmented generation (REIS), mass-spectrometry vector similarity (FeNOMS) (Chen et al., 19 Jun 2025, Pinge et al., 13 Oct 2025).
- Compression, page hashing, content scans: e.g., ASIC-based DPZip, BlueDBM, PRINS (Lu et al., 28 Sep 2025, Jun et al., 2016, Yavits et al., 2018).
Task partitioning is generally achieved by profiling runtime memory, bandwidth, and computational requirements. For example, in SmartSAGE, only the “neighbor sampling” stage of GNN training is offloaded, while feature-lookup and aggregation execute on the host’s GPU (Lee et al., 2022). PreSto offloads feature generation and normalization but leaves host–SSD decode and final assembly on host servers (Lee et al., 2024).
Offloading is also used for privacy (data never leaves the drive, as in STANNIS and Solana) and energy efficiency, with energy savings up to 69% (STANNIS) and 67% (Solana) for distributed ML and NLP workloads (HeydariGorji et al., 2020, HeydariGorji et al., 2021).
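The profiling-based partitioning described above can be sketched as a first-order time model. This is an illustrative sketch under stated assumptions, not a model taken from any of the cited systems: it assumes transfer and compute do not overlap, and all names (`Stage`, `Platform`, `should_offload`) and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    input_bytes: float   # data the stage reads from flash
    output_bytes: float  # data the stage hands to the host
    ops: float           # arithmetic work in the stage


@dataclass
class Platform:
    host_link_bw: float      # bytes/s across the host interface
    internal_bw: float       # bytes/s between flash and in-device compute
    host_ops_per_s: float    # host compute rate
    device_ops_per_s: float  # in-storage compute rate


def host_time(stage, p):
    # Host path: ship the full input across the link, then compute on the host.
    return stage.input_bytes / p.host_link_bw + stage.ops / p.host_ops_per_s


def isp_time(stage, p):
    # ISP path: read internally, compute on the device, ship only the output.
    return (stage.input_bytes / p.internal_bw
            + stage.ops / p.device_ops_per_s
            + stage.output_bytes / p.host_link_bw)


def should_offload(stage, p):
    return isp_time(stage, p) < host_time(stage, p)


# Hypothetical platform: fast host compute, slow device compute, wide internal path.
hw = Platform(host_link_bw=4e9, internal_bw=16e9,
              host_ops_per_s=1e12, device_ops_per_s=1e10)

# Hypothetical stages: one I/O-heavy with a small output, one compute-heavy.
sampling = Stage("sampling", input_bytes=100e9, output_bytes=5e9, ops=1e10)
aggregation = Stage("aggregation", input_bytes=5e9, output_bytes=1e9, ops=1e13)

for stage in (sampling, aggregation):
    print(stage.name, "offload" if should_offload(stage, hw) else "keep on host")
```

With these assumed numbers the I/O-heavy stage is offloaded and the compute-heavy stage stays on the host, which mirrors the qualitative split described for SmartSAGE; real decisions would also need to account for pipelining, contention with flash management, and device DRAM limits.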
4. Quantitative Performance Models and Experimental Outcomes
Performance analyses in the literature employ both analytical models and extensive microbenchmarks:
- Latency and Bandwidth: ISP can amplify sampling throughput by 10–20× (SmartSAGE), achieving near parity with in-memory operations (within 1.1–1.3× slowdown); capacity scaling is at terabyte scale for SSDs (vs. hundreds of GB for DRAM) (Lee et al., 2022).
- Effective Data Movement Reduction: Only filtered, sampled, or aggregated (compact) results cross the PCIe bus, reducing DRAM↔SSD movement by up to 20× (SmartSAGE), 71× (MegIS for metagenomics), or more (Lee et al., 2022, Ghiasi et al., 2024).
- Throughput and Scaling:
  - PreSto achieves 9.6× speedup in RecSys preprocessing and 11.3× better energy efficiency vs. large pools of CPU servers (Lee et al., 2024).
  - GenStore attains up to 33.6× speedup on error-prone long reads and 27× energy reduction in genomic filtering (Ghiasi et al., 2022).
  - Containerized ISP (DockerSSD) reaches up to 2× speedup for I/O-intensive analytics and 7.9× for distributed LLM inference (native flash KV caching) (Kwon et al., 7 Jun 2025).
  - Multi-CSD deployments exhibit near-linear scaling in throughput and energy savings (STANNIS, Solana) (HeydariGorji et al., 2020, HeydariGorji et al., 2021).
- Comparative Placement: In-storage CDPU (DPZip) offers 4.7 μs compression latency per 4 KB block, outperforming host/PCIe/on-chip placements (up to 2× read/write throughput and 0.5% CV in latency across VMs) (Lu et al., 28 Sep 2025).
- Application-specific ISP: Specialized architectures (e.g., FeNOMS with 3D FeNAND, PRINS with RCAM) break the memory bandwidth wall, achieving up to 224× (energy) and 10⁴× (throughput) advantage over off-storage/host approaches (Pinge et al., 13 Oct 2025, Yavits et al., 2018).
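To make the data-movement reduction listed above concrete, the following sketch counts the bytes that cross the host interface for one batch of neighbor sampling. It is a toy calculation and does not reproduce any reported figure. It assumes 4 KB flash pages, 8-byte node IDs, page-aligned neighbor lists, and a host that must fetch whole pages; all function names are hypothetical.

```python
import random

PAGE_BYTES = 4096  # assumed flash page size
ID_BYTES = 8       # assumed size of one node ID


def pages_touched(degree):
    # Pages needed to hold one node's neighbor list (ceiling division).
    return max(1, (degree * ID_BYTES + PAGE_BYTES - 1) // PAGE_BYTES)


def host_side_bytes(degrees, batch):
    # Host-side sampling: every page of every neighbor list crosses the bus.
    return sum(pages_touched(degrees[n]) * PAGE_BYTES for n in batch)


def in_storage_bytes(degrees, batch, fanout):
    # In-storage sampling: only the sampled neighbor IDs cross the bus.
    return sum(min(fanout, degrees[n]) * ID_BYTES for n in batch)


rng = random.Random(0)
degrees = [rng.randint(1, 2000) for _ in range(10_000)]  # synthetic graph
batch = rng.sample(range(len(degrees)), 1024)            # target nodes
fanout = 10                                              # neighbors kept per node

baseline = host_side_bytes(degrees, batch)
offloaded = in_storage_bytes(degrees, batch, fanout)
print(f"host-side: {baseline} B, in-storage: {offloaded} B, "
      f"reduction: {baseline / offloaded:.1f}x")
```

The ratio depends entirely on the assumed degree distribution, fanout, and page size, so it should be read as an illustration of why compact results reduce bus traffic rather than as a prediction for a particular device.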
5. Key Design Trade-Offs, Limitations, and Security
Trade-offs in ISP design include:
- Compute Power vs. Data Movement: Embedded cores are generally less powerful than host CPUs/GPUs, but ISP’s data reduction and parallelization can dominate for I/O-bound or memory-bound tasks (Lee et al., 2022, Kwon et al., 7 Jun 2025).
- Granularity of Offload: In-storage CDPUs operate at fixed block sizes (often 4 KB); this suits stream analytics but is less flexible for content-aware or variable-size workloads (Lu et al., 28 Sep 2025).
- Firmware and Programmability: Deep firmware modification can impede adaptability or vendor interoperability. DockerSSD minimizes this with OS-level containers and virtual firmware abstraction (Kwon et al., 7 Jun 2025).
- Security and Multi-Tenancy: Robust resource partitioning, DRAM encryption, and TEE enforcement (e.g., IceClave’s TrustZone partition, in-channel flash encryption) are necessary to prevent cross-tenant attacks or code injection (Xue et al., 2022).
- Wear, Endurance, Resource Limits: SLC/TLC partitions, coarse mapping, and wear-leveling are employed to balance endurance and performance. For complex or large models, limited DRAM on-device can be a constraint (Chen et al., 19 Jun 2025, HeydariGorji et al., 2021).
- Workload Suitability: ISP is least effective for compute-bound kernels with high floating-point requirements or workloads that exceed internal DRAM footprints and require significant multi-stage computation not easily mapped to local PEs (Lee et al., 2022, Yavits et al., 2018).
6. Broader Implications and Future Research Trajectories
ISP advances not only performance and efficiency but also reconfigures the architectural boundary between storage and compute for:
- Big Data and Scientific Analytics: Line-rate scan/filter/aggregate on petabyte-scale data, practical on commodity form factors (E1.S/U.2) (HeydariGorji et al., 2021, Byun et al., 2023).
- Scalable and Energy-Efficient ML Pipelines: Enabling terabyte-scale GNNs, LLM inference pools, federated privacy-preserving ML (Lee et al., 2022, Kwon et al., 7 Jun 2025, HeydariGorji et al., 2020).
- Cloud and Disaggregated Storage: Direct NVMe-over-Ethernet deployment of containerized, in-SSD code and dynamic analytics orchestration (Kwon et al., 7 Jun 2025).
- Capacity and TCO Planning: Closed-form models enable break-even analysis and cost optimization for CSD/ISP deployment vs. host-centric architectures (Byun et al., 2023).
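A break-even analysis of the kind mentioned under Capacity and TCO Planning can be sketched with a deliberately simple model. This is not the model of Byun et al.; it assumes the only benefit of ISP is lower average power draw and the only extra cost is a one-time hardware premium. The function names and all input values are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def annual_energy_cost(power_watts, price_per_kwh):
    # Energy cost of running continuously at the given power for one year.
    return power_watts / 1000.0 * HOURS_PER_YEAR * price_per_kwh


def break_even_years(extra_capex, baseline_power_w, isp_power_w, price_per_kwh):
    # Years until energy savings repay the hardware premium.
    savings = (annual_energy_cost(baseline_power_w, price_per_kwh)
               - annual_energy_cost(isp_power_w, price_per_kwh))
    if savings <= 0:
        return float("inf")  # the premium is never recovered
    return extra_capex / savings


# Hypothetical inputs: 2000 currency units of premium, 1000 W vs. 400 W,
# and an electricity price of 0.10 per kWh.
years = break_even_years(extra_capex=2000.0,
                         baseline_power_w=1000.0,
                         isp_power_w=400.0,
                         price_per_kwh=0.10)
print(f"break-even after about {years:.2f} years")
```

A fuller model would add terms for reduced host server count, throughput differences, and device endurance, each of which can shift the break-even point in either direction.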
Important open areas include adaptive and ML-guided offload partitioning (e.g., Conduit), extension to heterogeneous and accelerator-filled SSDs (virtual-GPU, direct FPGA), formal verification of firmware isolation, and advances in high-level programming frameworks for dynamic and multi-tenant ISP (Nadig et al., 24 Jan 2026, Kwon et al., 7 Jun 2025, Xue et al., 2022).
Key References:
- SmartSAGE: Training Large-scale Graph Neural Networks using In-Storage Processing Architectures (Lee et al., 2022)
- Containerized In-Storage Processing and Computing-Enabled SSD Disaggregation (Kwon et al., 7 Jun 2025)
- In-Storage Embedded Accelerator for Sparse Pattern Processing (Jun et al., 2016)
- STANNIS: Low-Power Acceleration of Deep Neural Network Training Using Computational Storage (HeydariGorji et al., 2020)
- PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models (Lee et al., 2024)
- In-storage Processing of I/O Intensive Applications on Computational Storage Drives (HeydariGorji et al., 2021)
- Conduit: Programmer-Transparent Near-Data Processing Using Multiple Compute-Capable Resources in Solid State Drives (Nadig et al., 24 Jan 2026)
- REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing (Chen et al., 19 Jun 2025)
- FeNOMS: Enhancing Open Modification Spectral Library Search with In-Storage Processing on Ferroelectric NAND (FeNAND) Flash (Pinge et al., 13 Oct 2025)
- ASIC-based Compression Accelerators for Storage Systems: Design, Placement, and Profiling Insights (Lu et al., 28 Sep 2025)
- PRINS: Resistive CAM Processing in Storage (Yavits et al., 2018)
- MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing (Ghiasi et al., 2024)
- Near-Data Processing for Differentiable Machine Learning Models (Choe et al., 2016)
- Building A Trusted Execution Environment for In-Storage Computing (Xue et al., 2022)
- GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis (Ghiasi et al., 2022)
- An Analytical Model-based Capacity Planning Approach for Building CSD-based Storage Systems (Byun et al., 2023)