MegIS: In-Storage Processing for Metagenomics
- MegIS is an in-storage processing system that accelerates end-to-end metagenomic analysis by relocating computation into the storage layer.
- It partitions tasks between a high-performance host and a custom NAND-flash SSD enriched with lightweight accelerators to optimize data flow.
- MegIS employs streaming algorithms and a tailored FTL to achieve significant speedups, enhanced energy efficiency, and lower system costs.
MegIS is a high-performance, energy-efficient, and low-cost in-storage processing (ISP) system specifically designed for accelerating end-to-end metagenomic analysis. Metagenomic analyses frequently encounter data movement bottlenecks, primarily due to the transfer of massive, low-reuse k-mer datasets between storage and host memory. MegIS addresses this challenge by distributing specific computation tasks between a conventional host and a commodity NAND-flash SSD, which is augmented with lightweight custom logic and firmware. By relocating computation over large, low-reuse data (e.g., k-mers from large reference databases) into the storage layer and transmitting only compact results to the host, MegIS achieves substantial improvements in performance, energy efficiency, accuracy, and system cost (Ghiasi et al., 2024).
1. System Architecture and Design Goals
MegIS is implemented as a drop-in extension to a standard NAND-flash SSD, maintaining standard SSD functionality when not engaged in ISP tasks. The architecture is divided into two distinct domains:
- Host Side: Composed of a large DRAM pool and high-performance CPU cores (e.g., AMD EPYC 7742), responsible for k-mer extraction, sorting, and exclusion.
- In-Storage Side: Utilizes a conventional SSD controller (8–16 channels), with per-channel SRAM register banks, 4 GB LPDDR4 DRAM, and specialized 65 nm logic accelerators (Intersect units, k-mer registers, index generators, and a lightweight FSM-based control unit).
The central design goal of MegIS is to eliminate the “last mile” data-movement bottleneck by maximizing in-storage computation for metagenomic workflows. MegIS incorporates five key design elements:
- Task partitioning between host and SSD.
- Pipelined, storage-aware data and computation flow.
- Storage technology–aware algorithmic optimizations.
- Data mapping tuned to exploit SSD parallelism.
- Lightweight, throughput-matched, and energy-efficient in-SSD accelerators (Ghiasi et al., 2024).
2. Task Partitioning and Data Flow Coordination
MegIS divides the metagenomic pipeline into three pipelined steps, orchestrated across the host and SSD:
- Step 1 (Host): Extracts k-mers from sequencing reads, partitions them lexicographically into up to 512 buckets, sorts them, and applies frequency-based pruning. Each bucket is sequentially written to the SSD via a dual-buffer DRAM scheme, overlapping transfers with subsequent extraction.
- Step 2 (In Storage):
  - Intersection Finding: Each channel uses two 120-bit registers to independently stream k-mers from flash (reference database) and DRAM (query buckets), performing real-time matches and recording intersection results in SSD DRAM.
  - TaxID Retrieval: MegIS introduces K-mer Sketch Streaming (KSS), a two-table streaming lookup mechanism that avoids pointer-chasing in tree structures. Per-channel Index Generators use prefix recognition to efficiently emit taxIDs.
- Step 3 (In Storage, Optional): For species abundance estimation, MegIS merges candidate per-species k-mer indices from flash into a unified index, updating genome offsets on the fly for efficient host-side read-mapping.
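The host-side work in Step 1 can be illustrated with a minimal Python sketch. It is not the MegIS implementation: the function names, the two-base bucket prefix, the `min_count`/`max_count` thresholds, and the assumption that reads contain only A, C, G, and T are all hypothetical choices made for illustration.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumption: reads contain only these four bases


def extract_kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def bucket_index(kmer, num_buckets, prefix_len=2):
    """Map a k-mer to a bucket by its leading bases.

    The mapping is monotonic in lexicographic order, so concatenating
    the sorted buckets in index order yields one globally sorted list.
    """
    value = 0
    for base in kmer[:prefix_len]:
        value = value * 4 + ALPHABET.index(base)
    return value * num_buckets // (4 ** prefix_len)


def build_buckets(reads, k, num_buckets=4, min_count=1, max_count=None):
    """Count k-mers, prune by frequency, and return sorted buckets."""
    counts = Counter(kmer for read in reads for kmer in extract_kmers(read, k))
    buckets = [[] for _ in range(num_buckets)]
    for kmer, count in counts.items():
        # Hypothetical frequency-based pruning: drop k-mers outside the range.
        if count < min_count or (max_count is not None and count > max_count):
            continue
        buckets[bucket_index(kmer, num_buckets)].append(kmer)
    return [sorted(bucket) for bucket in buckets]


if __name__ == "__main__":
    for i, bucket in enumerate(build_buckets(["ACGTAC", "CGTACG"], k=3)):
        print(i, bucket)
```

Because each bucket is sorted and the buckets themselves are ordered, a bucket can be handed to storage as soon as it is complete while the host continues with the next one, which is the overlap the dual-buffer scheme relies on.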
Coordination between host and SSD is facilitated by three new NVMe commands—MegIS_Init, MegIS_Step, and MegIS_Write. A custom MegIS FTL (flash translation layer) implements block-level mappings during ISP, minimizing mapping overhead and freeing DRAM for metagenomic task buffers (Ghiasi et al., 2024).
3. Storage-Aware Algorithmic Optimizations
MegIS achieves full internal SSD bandwidth utilization and reduced computation overhead via the following optimizations:
- Sorted Streaming: By aligning all k-mer and database data in sorted order, MegIS sidesteps random I/O and minimizes internal DRAM buffering.
- Intersection and Lookup Complexity: Both intersection finding and KSS taxID lookups operate in linear time, $O(N + M)$, where $N$ is the database size and $M$ is the query k-mer count, in contrast to the per-query random lookups of pointer-chasing approaches.
- Streaming Merge: For taxID retrieval, KSS merges table results in a streaming fashion, reducing redundant lookups. Because the matched k-mers and the sketch tables are each scanned once in sorted order, the number of reads scales with the number of matched k-mers plus the sketch table size, whereas pointer-chasing performs a separate tree traversal for each matched k-mer.
These optimizations eliminate expensive random accesses, reduce SSD-internal contention, and accelerate pipeline stages (Ghiasi et al., 2024).
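The sorted-streaming intersection can be sketched as a two-pointer merge. This is a software illustration only, assuming both inputs are sorted and free of duplicates; the function name `stream_intersect` is hypothetical, and the hardware performs the equivalent comparison per channel on registers rather than on Python iterators.

```python
def stream_intersect(database, queries):
    """Yield k-mers present in both sorted inputs, reading each input once.

    Assumes both iterables are sorted ascending and contain no duplicates.
    Every step advances at least one input, so the total work is
    O(N + M) for N database k-mers and M query k-mers.
    """
    db_iter, query_iter = iter(database), iter(queries)
    db_kmer = next(db_iter, None)
    query_kmer = next(query_iter, None)
    while db_kmer is not None and query_kmer is not None:
        if db_kmer == query_kmer:
            yield db_kmer
            db_kmer = next(db_iter, None)
            query_kmer = next(query_iter, None)
        elif db_kmer < query_kmer:
            db_kmer = next(db_iter, None)        # database is behind: advance it
        else:
            query_kmer = next(query_iter, None)  # query is behind: advance it


if __name__ == "__main__":
    reference = ["AAC", "ACG", "CGT", "TTA"]
    sample = ["ACG", "GGA", "TTA"]
    print(list(stream_intersect(reference, sample)))
```

Neither input is ever revisited, so only the current element of each stream needs to be held at any time, which is why a pair of registers per channel suffices.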
4. Data Mapping and SSD Parallelism
MegIS leverages a custom data layout to maximize SSD channel parallelism and minimize conflicts:
- The MegIS FTL stripes reference database pages evenly and sequentially across channels, grouping them into multiplane blocks at the same page offset. The physical page address (PPA) of each page is determined by its channel, block, and page indices.
- Query buckets are partitioned into stripes and pinned in both host and SSD DRAM, ensuring data transfers align with SSD channel activity.
This mapping fully exposes SSD parallelism and avoids inter-channel conflicts, supporting high-throughput pipelined operation (Ghiasi et al., 2024).
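As an illustration of channel striping in general, and not the MegIS FTL's actual PPA formula, the following sketch places consecutive logical pages on consecutive channels in round-robin order. The names `PhysicalPage` and `stripe_page`, and the parameters `num_channels` and `pages_per_block`, are assumptions introduced for this example; multiplane grouping is omitted.

```python
from typing import NamedTuple


class PhysicalPage(NamedTuple):
    channel: int
    block: int
    page: int


def stripe_page(logical_page, num_channels, pages_per_block):
    """Map a logical page index to a (channel, block, page) triple.

    Hypothetical round-robin layout: page i goes to channel i % C, and
    pages on the same channel fill blocks sequentially. Any C consecutive
    logical pages therefore sit at the same block and page offset on C
    different channels and can be read in parallel.
    """
    channel = logical_page % num_channels
    index_in_channel = logical_page // num_channels
    block = index_in_channel // pages_per_block
    page = index_in_channel % pages_per_block
    return PhysicalPage(channel, block, page)


if __name__ == "__main__":
    for logical_page in range(10):
        print(logical_page, stripe_page(logical_page, num_channels=8, pages_per_block=256))
```

With a layout of this kind, a sequential scan of the database keeps every channel busy at the same offset, so no channel waits on another.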
5. Lightweight In-SSD Accelerator Microarchitecture
MegIS integrates a set of simple yet efficient accelerator units, with a small silicon area reported at both 65 nm and 32 nm and power consumption in the milliwatt range at 300 MHz per SSD:
- Intersect Core: One 120-bit comparator per channel for streaming k-mer equality.
- K-mer Registers: Registers to track “current” and “next” k-mers, supporting flash-to-register streaming.
- Index Generator: 64-bit prefix-comparison FSM to drive efficient KSS lookups.
- Global Control Unit: Coordinates channel state machines and operation sequencing.
At 300 MHz, each comparator can process 120 bits per cycle (approx. 36 Gb/s, or 4.5 GB/s), exceeding the per-channel flash bandwidth (1.2 GB/s), so computation is bound by flash bandwidth and never bottlenecked by accelerator logic (Ghiasi et al., 2024).
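The throughput comparison can be checked with a few lines of arithmetic. The helper name `comparator_throughput_gbytes` is hypothetical; the inputs (120 bits per cycle, 300 MHz, 1.2 GB/s per channel) are the figures quoted above, and decimal gigabytes are assumed.

```python
def comparator_throughput_gbytes(bits_per_cycle, clock_hz):
    """Return comparator throughput in GB/s (decimal gigabytes)."""
    bits_per_second = bits_per_cycle * clock_hz
    return bits_per_second / 8 / 1e9


if __name__ == "__main__":
    comparator = comparator_throughput_gbytes(120, 300e6)  # 36 Gbit/s expressed in GB/s
    flash_channel = 1.2                                     # GB/s per channel
    print(f"comparator: {comparator} GB/s")
    print(f"headroom over one flash channel: {comparator / flash_channel:.2f}x")
```

Under these figures the comparator has several times the bandwidth of the channel it serves, which is consistent with the claim that the pipeline is limited by flash rather than by logic.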
6. Evaluation: Performance, Energy, and Cost
Comprehensive benchmarking using a cycle-accurate simulator (Ramulator + MQSim) and hardware workloads shows MegIS provides significant improvements over multiple baselines:
- Benchmarks: Evaluated on three 100M-read CAMI metagenomic sample datasets, using both a cost-optimized SATA SSD (SSD-C) and a performance-oriented PCIe Gen4 SSD (SSD-P). Reference databases include Kraken2 (293 GB) and Metalign (701 GB k-mers + 6.9 GB sketches).
- Baselines: P-Opt (Kraken2), A-Opt (Metalign), A-Opt+KSS, and hardware-accelerated PIM (Kraken2+Sieve).
- Speedup over software baselines:
  - On SSD-C, full MegIS (MS) achieves a 5.3–6.4× speedup over P-Opt and 12.4–18.2× over A-Opt.
  - On SSD-P, MS provides 2.7–6.5× and 6.9–20.4× speedups, respectively.
  - For multi-sample analyses (16 samples buffered), MS reaches up to 37.2× (SSD-C) and 20.5× (SSD-P) vs. P-Opt, and 100.2×/52.0× over A-Opt.
- Comparison with PIM solutions: MS is 4.8–5.1× faster than hardware-accelerated Sieve+Kraken2 on SSD-C and 1.5–2.7× faster on SSD-P, while matching or exceeding accuracy (4.8× higher F1, 13% lower L1 error).
- Energy efficiency: MS reduces system energy by 5.4× compared to P-Opt, 15.2× vs. A-Opt, and 1.9× vs. PIM on SSD-C (9.8×/25.7×/3.5× on SSD-P).
- Cost: On a low-cost SSD-C + 64 GB DRAM platform, MS outperforms P-Opt and A-Opt running on a much higher-end SSD-P + 1 TB DRAM platform by 2.4×/7.2× and matches accuracy, resulting in superior throughput per dollar.
- Unified-index merger for abundance estimation: Step 3 yields a 65% speedup over host-side index builds (Ghiasi et al., 2024).
7. Integration, Flexibility, and Implications
MegIS is designed as a flexible, extensible solution that is compatible with different metagenomic input datasets and adaptable to various analysis pipelines. As the first practical ISP accelerator for comprehensive metagenomic analysis, it demonstrates the value of careful hardware/software co-design, particularly in task placement, streaming algorithms, storage-aware data layouts, and flash-bandwidth-matched accelerators. With these in place, it is possible to reduce data movement, maximize SSD internal I/O, and realize step-function gains in both throughput and energy efficiency with no loss in analytical accuracy (Ghiasi et al., 2024).
A plausible implication is that the MegIS approach, which leverages commodity hardware with minimal OSS customization, could inform the design of future storage-accelerated computing systems for other bioinformatics and data-intensive scientific workloads, provided they exhibit similar patterns of large, low-reuse datasets and embarrassingly parallel streaming algorithms.