Flash-Compatible Depth-KV Layout
- Flash-compatible depth-KV layouts organize key-value data by aligning storage along sequence and depth axes to optimize page-level access on flash memory.
- They employ head-group parallelism and pipelined flash die partitioning to achieve up to 2× speedup in LLM inference and scalable performance at high token contexts.
- Quantization and compression methods reduce KV cache size by up to 86% with minimal accuracy loss. In a separate use of the term, depth-kV imaging layouts support rapid 3D imaging in medical workflows.
A flash-compatible depth-KV layout is a data organization and hardware-software strategy for storing and retrieving key-value (KV) states efficiently from non-volatile or high-latency memory, such as NAND flash used for LLM inference. Its goals are to maximize bandwidth, reduce memory footprint, and maintain performance. The term is also applied to kV x-ray acquisition layouts for volumetric imaging in real-time medical workflows. "Depth-KV" refers to layouts that align the storage and access of KV pairs (or kV-level medical imaging data) along both the sequence axis (token, spatial, or temporal) and the depth axis (layer, attention head, or detector angle), while being optimized for the block/page-based access required by flash memory or by streaming-attention algorithms such as FlashAttention.
1. Flash-Compatible Depth-KV Layout in LLMs
In resource-constrained edge settings, LLM inference bottlenecks stem from the size and bandwidth demands of KV caches, particularly at long context lengths. The KVNAND architecture (Deng et al., 3 Dec 2025) introduces a depth-KV page-level mapping strategy in 3D NAND flash to store the entire KV cache and all model weights, eliminating dependence on external DRAM and exploiting in-flash computing (IFC) primitives for efficient retrieval.
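To make the size pressure concrete, the following is a rough back-of-the-envelope sketch of KV cache growth with context length. The model shape used here (32 layers, 8 KV heads, head dimension 128, FP16) is an assumed 8B-class grouped-query configuration chosen for illustration; it is not taken from the KVNAND paper, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Total bytes needed to hold keys and values for every layer, head, and token."""
    # Factor 2 accounts for storing both K and V.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return values_per_token * num_tokens * bytes_per_value


# Assumed 8B-class GQA shape (illustrative only, not from the source).
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=100_000)
print(f"{size / 1e9:.1f} GB")  # prints "13.1 GB"
```

Under these assumptions each token costs 128 KiB of cache, so a 100 K-token context alone needs roughly 13 GB before any weights are counted.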
Flash page geometry defines the allocation granularity: typical pages are 4 KB, and each flash plane holds a local SRAM buffer. The KV entry for each (layer, head, token) has a size equal to the head dimension times the precision (e.g., 256 B for a 128-dimensional head at FP16). In KVNAND’s mapping, each page holds a fixed number of consecutive tokens, determined by the page size and the number of K or V heads. Each page stores the depth-major KV content for a specific (layer, head), improving locality for the block reads required in single-batch autoregressive decoding. By grouping KV cache accesses and updates into page-level transactions, and by grid-mapping different (layer, head) pairs to separate physical blocks, KVNAND reduces read amplification and mitigates read-disturb stress on flash blocks.
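The page-level mapping can be sketched as simple address arithmetic. The example below is a minimal illustration under stated assumptions, not KVNAND's actual mapping: it assumes a 4 KB page, a 256 B entry, one logical block region per (layer, head) pair, and a hypothetical `locate` helper. The exact tokens-per-page formula from the paper is not reproduced here.

```python
from dataclasses import dataclass

PAGE_BYTES = 4096    # typical page size quoted in the text
ENTRY_BYTES = 256    # one K (or V) entry: 128-dim head at FP16
TOKENS_PER_PAGE = PAGE_BYTES // ENTRY_BYTES  # 16 under these assumptions


@dataclass(frozen=True)
class PageAddress:
    block: int   # hypothetical logical block region, one per (layer, head)
    page: int    # page index inside that region
    offset: int  # byte offset of the token's entry inside the page


def locate(layer, head, token, num_heads):
    """Map a (layer, head, token) triple to a page-level address."""
    block = layer * num_heads + head              # grid-map (layer, head) pairs
    page, slot = divmod(token, TOKENS_PER_PAGE)   # consecutive tokens share a page
    return PageAddress(block, page, slot * ENTRY_BYTES)


def pages_for_context(context_len):
    """Pages read per (layer, head) to cover the whole context (ceiling division)."""
    return -(-context_len // TOKENS_PER_PAGE)


print(locate(layer=1, head=2, token=35, num_heads=8))
print(pages_for_context(1000))
```

Because consecutive tokens land in the same page, a decode step reads whole pages rather than scattered entries, which is the locality property the text attributes to the layout.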
2. Parallelism Strategies and Pipelined Access
To maximize throughput under the constraints of flash I/O and in-flash compute unit parallelism, KVNAND applies "head-group parallelism" (HG-parallelism). Head-groups are sets of (K,V) pairs and their associated Q heads, tailored to the architecture of Multi-Head Attention (MHA) and Grouped-Query Attention (GQA). In pipelined execution, flash dies are partitioned so that while one group (G1) reads weights and generates QKV, a second group (G2) concurrently runs attention over the stored K and V blocks, keeping the available in-flash compute bandwidth continuously utilized.
The effective compute resources per plane are provisioned to match the parallel reuse factor, which depends on the number of heads and the number of (K,V) pairs (relevant for GQA). This structure enables efficient, low-latency, and scalable inference at context lengths up to 100 K tokens, as demonstrated by empirical results showing 1.98–2.05 × speedups versus DRAM-equipped IFC at short and long contexts.
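As a minimal sketch of how head-groups could be formed, the example below assumes the reuse factor is the ratio of Q heads to (K,V) pairs, which is the usual GQA group size; the source's exact expression is not reproduced here, and `head_groups` is a hypothetical helper.

```python
def head_groups(num_q_heads, num_kv_heads):
    """Group each (K,V) pair with the Q heads that share it."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    reuse = num_q_heads // num_kv_heads  # assumed reuse factor: Q heads per (K,V) pair
    return [
        {"kv": g, "q_heads": list(range(g * reuse, (g + 1) * reuse))}
        for g in range(num_kv_heads)
    ]


# GQA: 32 Q heads sharing 8 (K,V) pairs gives 8 groups with reuse factor 4.
for group in head_groups(32, 8)[:2]:
    print(group)

# MHA is the special case where every Q head has its own (K,V) pair.
print(len(head_groups(4, 4)), head_groups(4, 4)[0])
```

Under this assumption, each K or V page fetched into a plane buffer is reused by every Q head in its group, so the compute provisioned per plane would scale with the group size rather than with the total head count.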
3. Quantization and Compression for Efficient GPU Attention
The MiniKV method (Sharma et al., 2024) complements flash-compatible layouts on GPUs by introducing a two-bit, layer-discriminative, depth-KV cache that is tightly packed and quantized to minimize memory footprint while remaining compatible with FlashAttention-style GPU kernels. Heavy-hitter token positions are selected per layer during the prefill phase, and only these entries are retained and compressed, using 2-bit quantization applied per group of 16 values.
Keys and values are packed into 32-bit words for efficient memory access, and per-token decoding fuses dequantization, unpacking, and matmul in a single memory pass. This design allows an 86 % reduction in KV cache size with over 98.5 % accuracy retention, and enables single-GPU inference with context windows exceeding 44 K tokens.
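The packing step can be illustrated with a small pure-Python sketch. It assumes asymmetric min/max quantization with one scale and zero point per group, which is a common scheme and not necessarily the exact formula MiniKV uses; `quantize_group` and `dequantize_group` are hypothetical names. Sixteen 2-bit codes fit exactly into one 32-bit word.

```python
GROUP = 16                 # values per quantization group, as stated in the text
BITS = 2                   # bits per code
LEVELS = (1 << BITS) - 1   # highest code value (3)


def quantize_group(values):
    """Quantize 16 floats to 2-bit codes and pack them into one 32-bit word."""
    if len(values) != GROUP:
        raise ValueError(f"expected {GROUP} values")
    lo, hi = min(values), max(values)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for a constant group
    word = 0
    for i, v in enumerate(values):
        code = min(LEVELS, max(0, round((v - lo) / scale)))
        word |= code << (BITS * i)     # code i occupies bits [2i, 2i + 2)
    return word, scale, lo


def dequantize_group(word, scale, zero):
    """Unpack the 16 codes from the word and map them back to floats."""
    return [((word >> (BITS * i)) & LEVELS) * scale + zero for i in range(GROUP)]


values = [i / 15 for i in range(GROUP)]
word, scale, zero = quantize_group(values)
restored = dequantize_group(word, scale, zero)
max_err = max(abs(a - b) for a, b in zip(values, restored))
print(f"word=0x{word:08x} max_err={max_err:.3f}")
```

In a fused GPU kernel the unpack and rescale would happen in registers right before the matmul; this sketch only shows the bit layout and the bounded rounding error (at most half a quantization step).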
4. Depth-KV Layouts for Cross-Layer Attention and Efficient Retrieval
Recent extensions, such as Mixture-of-Depths Attention (MoDA) (Zhu et al., 16 Mar 2026), require "depth-KV" layouts that provide fast retrieval of not just within-layer KV states but also preceding layer states at each token position (depth-wise attention). To achieve high efficiency, MoDA flattens all depth KV tensors into contiguous blocks:
- Key/value tensors: the per-layer depth states are flattened along a single row axis of length T · L (where T is sequence length, L is number of layers).
- For each token index t, its depth states occupy the contiguous rows [t · L, (t + 1) · L), supporting contiguous blockwise access in FlashAttention-compatible fused kernels.
- Integration with chunk-aware and group-aware packing improves locality, reduces global memory traffic, and sustains above 97 % of baseline FlashAttention-2 speed at sequence lengths up to 64K.
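The flattening described in the list above can be sketched as index arithmetic. This example assumes a token-major, layer-minor row order (each token owns one run of consecutive rows, one per layer); the helper names are hypothetical and the real MoDA kernels operate on GPU tensors rather than Python lists.

```python
def depth_row(token, layer, num_layers):
    """Row index of (token, layer) in the flattened depth-KV buffer."""
    return token * num_layers + layer


def depth_slice(token, num_layers):
    """Contiguous row range holding all depth states of one token."""
    start = token * num_layers
    return slice(start, start + num_layers)


def flatten_depth_kv(per_layer):
    """Reorder per_layer[layer][token] into token-major, layer-minor rows."""
    num_layers = len(per_layer)
    seq_len = len(per_layer[0])
    return [per_layer[l][t] for t in range(seq_len) for l in range(num_layers)]


# Toy example: 3 layers, 4 tokens, with string labels standing in for KV rows.
per_layer = [[f"L{l}T{t}" for t in range(4)] for l in range(3)]
flat = flatten_depth_kv(per_layer)
print(flat[depth_slice(2, 3)])  # ['L0T2', 'L1T2', 'L2T2']
```

With this ordering, depth-wise attention for one token reads a single contiguous slice instead of gathering one row from each layer's separate tensor.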
Comparison of asymptotic memory bandwidth:
| Layout | Complexity (dominant term) | Bandwidth efficiency (64K context) |
|---|---|---|
| FlashAttention-2 | | Baseline (100 %) |
| MoDA depth-KV | | 97.3 % of baseline |
This design enables cross-layer retrieval without sacrificing high-throughput chunked attention and without incurring strided gathers, which are prohibitively slow in device memory.
5. Flash-Compatible Depth-kV Layouts in Medical Imaging
Flash-compatible depth-kV layouts also appear in real-time, image-guided workflows for proton FLASH radiotherapy, as in (Chang et al., 2022). Here, "depth-kV" refers to capturing volumetric patient data via orthogonal, instantaneous kV x-ray projections at oblique gantry angles (135°/225°) that maximize both structural fidelity and WET accuracy while minimizing dose and acquisition time.
The deep-learning pipeline, InverseNet3D, reconstructs a 3D CT volume from these projections in under 0.5 s, guided by a composite loss with voxelwise, gradient, and SSIM terms. Accuracy is reported for 30 patients at 135°/225° in terms of mean absolute error (MAE, in HU), SSIM, and WET deviation (in mm). The full imaging protocol, including synchronized FLASH/non-FLASH exposure, detector collimation, and safety interlocks, is engineered for sub-second acquisition and volumetric guidance.
6. Design-Space Trade-offs, Extensibility, and Impact
Optimal flash-compatible depth-KV layouts require balancing parallelism, on-die buffer sizing, quantization precision, and the mapping of KV pages to flash plane/block topologies. Design-space exploration in KVNAND reveals that the optimal partitioning of dies between weights/QKV generation and KV retrieval (the G1:G2 split) shifts from G1-heavy at short context to G2-heavy at very long context, with the quantization scheme (W8A8 vs W4A16) further influencing the trade-off. Reliability is maintained by limiting page read frequency and distributing accesses to minimize read-disturb.
These patterns generalize to any compute-in-flash substrate capable of page/group-buffered read+MAC, as well as to alternative attention sparsity schemes and hybrid DRAM+flash deployments. In each case the depth-KV allocation strategy remains pivotal for memory bandwidth, latency, and energy efficiency. In medical imaging, the oblique dual-kV capture layout yields a favorable trade-off between volumetric reconstruction accuracy and workflow speed. Together, these depth-KV layouts are enablers for edge-scale LLMs and real-time medical systems that require high-bandwidth, low-latency retrieval from a page-based storage substrate.
7. Experimental Outcomes and Research Landscape
Across LLM and medical imaging domains, flash-compatible depth-KV layouts have concretely demonstrated:
- Sustained throughput and reliability at long context lengths where DRAM-only or naïve KV-in-flash solutions fail (e.g., KVNAND at 100 K tokens yields ≈10 tokens/s on an 8B model, with out-of-memory or severe slowdown for DRAM/naïve-flash baselines) (Deng et al., 3 Dec 2025).
- Empirical speedups of 1.94–2.05 × over prior DRAM-IFC at context lengths from 128 to 10 K tokens (Deng et al., 3 Dec 2025), and throughput gains up to 66 % with up to 55 % memory reduction in GPU-based attention (Sharma et al., 2024).
- Pareto-optimal trade-offs in cache size and retained accuracy (Sharma et al., 2024), and near-lossless long-context accuracy for compressed depth-KV layouts.
- Retrieval efficiency at 97.3 % of highly-optimized FlashAttention-2 in cross-layer settings (Zhu et al., 16 Mar 2026).
- In medical imaging, rapid and artifact-minimized 3D CT generation, with WET-based error detection for FLASH therapy, in under 1 s total latency (Chang et al., 2022).
These results position flash-compatible depth-KV layouts as a foundational strategy for bandwidth- and latency-bound inference and imaging applications under hardware constraints.