Flash-Compatible Depth-KV Layout

Updated 17 March 2026

Flash-compatible depth-KV layouts organize key-value data by aligning storage along sequence and depth axes to optimize page-level access on flash memory.
They employ head-group parallelism and pipelined flash die partitioning to achieve up to 2× speedup in LLM inference and scalable performance at high token contexts.
Quantization and compression methods reduce KV cache size by up to 86% with minimal accuracy loss; in medical workflows, an analogous depth-kV layout enables rapid 3D imaging.

A flash-compatible depth-KV layout is a data organization and hardware-software strategy designed to maximize bandwidth, reduce memory footprint, and maintain performance where key-value (KV) states must be stored in and retrieved efficiently from non-volatile or high-latency memory, such as NAND flash holding an LLM's KV cache. The same name is also applied to volumetric imaging in real-time medical workflows. The term "depth-KV" refers to layouts that align the storage and access of KV pairs, or of kV-level medical imaging data, along both the sequence axis (token, spatial, or temporal) and the depth axis (layer, attention head, or detector angle), while remaining optimized for the block/page-based access required by flash or by streaming-attention algorithms such as FlashAttention.

1. Flash-Compatible Depth-KV Layout in LLMs

In resource-constrained edge settings, LLM inference bottlenecks stem from the size and bandwidth demands of KV caches, particularly at long context lengths. The KVNAND architecture (Deng et al., 3 Dec 2025) introduces a depth-KV page-level mapping strategy in 3D NAND flash to store the entire KV cache and all model weights, eliminating dependence on external DRAM and exploiting in-flash computing (IFC) primitives for efficient retrieval.

Flash page geometry defines the allocation granularity: typical pages are 4 KB and each flash plane holds a local SRAM buffer. KV entries per (layer, head, token) are sized as $\mathrm{KV_{unit}} = (d/h) \cdot \mathrm{precision}$ (e.g., 256 B at $d/h=128$, FP16). In KVNAND’s mapping, each page holds $t$ consecutive tokens:

$t = \frac{P \cdot (N_{\mathrm{die}}\times M_{\mathrm{planes}}/(2k))}{\mathrm{KV_{unit}}}$

where $P$ is the page size, $N_{\mathrm{die}}$ and $M_{\mathrm{planes}}$ are the die and plane counts, and $k$ is the number of K or V heads. Each page stores the depth-major KV content for a specific (layer, head), improving locality for the block reads required in single-batch autoregressive decoding. By grouping KV cache accesses and updates into page-level transactions, and by grid-mapping different (layer, head) pairs to separate physical blocks, KVNAND reduces read amplification and mitigates read-disturb stress on flash blocks.
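The arithmetic behind the page mapping can be checked with a short calculation. The sketch below simply evaluates the two formulas above. The page size, head dimension, and FP16 precision are the values quoted in the text, while the die count, plane count, and number of KV heads are hypothetical values chosen only for illustration, and the function names are not from KVNAND.

```python
def kv_unit_bytes(head_dim: int, bytes_per_elem: int) -> int:
    """KV_unit = (d/h) * precision, in bytes."""
    return head_dim * bytes_per_elem


def tokens_per_page(page_bytes: int, n_die: int, m_planes: int,
                    k_heads: int, unit_bytes: int) -> float:
    """t = P * (N_die * M_planes / (2k)) / KV_unit."""
    return page_bytes * (n_die * m_planes / (2 * k_heads)) / unit_bytes


# Values from the text: d/h = 128, FP16 (2 bytes), 4 KB pages.
unit = kv_unit_bytes(head_dim=128, bytes_per_elem=2)

# Hypothetical geometry (assumption): 8 dies, 4 planes, 8 KV heads.
t = tokens_per_page(page_bytes=4096, n_die=8, m_planes=4,
                    k_heads=8, unit_bytes=unit)

print(unit, t)
```

With these assumed values, the KV unit is 256 B, matching the example in the text, and the formula gives 32 tokens.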

2. Parallelism Strategies and Pipelined Access

To maximize throughput under the constraints of flash I/O and in-flash compute unit parallelism, KVNAND applies "head-group parallelism" (HG-parallelism). Head-groups are sets of (K,V) pairs and their associated Q heads, tailored to the architecture of Multi-Head Attention (MHA) and Grouped-Query Attention (GQA). In pipelined execution, flash dies are partitioned so that while one group (G1) reads weights and generates QKV, a second (G2) concurrently runs attention using stored K, V blocks, allowing continuous utilization of available flash compute bandwidth.

The effective compute resources per plane are provisioned to match the parallel reuse factor:

$\text{PEs}_\text{plane} \propto \frac{h}{k}$

where $h$ is the number of heads and $k$ is the number of (K,V) pairs (relevant for GQA). This structure enables efficient, low-latency, and scalable inference at context lengths up to 100 K tokens, as demonstrated by empirical results showing 1.98–2.05× speedups versus DRAM-equipped IFC at short and long contexts.
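A small sketch can make the head-group bookkeeping concrete. It computes the reuse factor $h/k$, lists which Q heads share each (K,V) pair, and prints a two-stage schedule in which G1 prepares QKV for one head-group while G2 runs attention for the previous one. The pipeline granularity (one head-group per step) and all function names are assumptions made for illustration; the source does not specify how KVNAND orders the work.

```python
from typing import List, Optional, Tuple


def q_heads_per_kv(h: int, k: int) -> int:
    """Reuse factor h/k: Q heads that share one (K,V) pair."""
    if h % k != 0:
        raise ValueError("h must be a multiple of k")
    return h // k


def head_groups(h: int, k: int) -> List[Tuple[int, List[int]]]:
    """Pair each KV head index with the Q head indices that reuse it."""
    r = q_heads_per_kv(h, k)
    return [(kv, list(range(kv * r, (kv + 1) * r))) for kv in range(k)]


def two_stage_schedule(n_groups: int) -> List[Tuple[int, Optional[int], Optional[int]]]:
    """Return (step, g1_group, g2_group) tuples.

    Assumed schedule: at each step G1 generates QKV for group `step`
    while G2 runs attention for group `step - 1`.
    """
    steps = []
    for step in range(n_groups + 1):
        g1 = step if step < n_groups else None
        g2 = step - 1 if step >= 1 else None
        steps.append((step, g1, g2))
    return steps


if __name__ == "__main__":
    print(q_heads_per_kv(32, 32))   # MHA: every Q head has its own (K,V)
    print(q_heads_per_kv(32, 8))    # GQA: several Q heads share one (K,V)
    for row in two_stage_schedule(3):
        print(row)
```

After the first step both partitions are busy until the final drain step, which is the overlap the pipelined execution relies on.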

3. Quantization and Compression for Efficient GPU Attention

The MiniKV method (Sharma et al., 2024) complements flash-compatible layouts on GPUs by introducing a two-bit, layer-discriminative, depth-KV cache that is tightly packed and quantized to minimize memory footprint while maintaining compatibility with FlashAttention-style GPU kernels. After selecting heavy-hitter token positions per layer in the prefill phase, only these entries are compressed using 2-bit quantization per group of 16 values:

$s_{\ell,j,g} = \frac{x_{\max} - x_{\min}}{2^b-1}, \quad z_{\ell,j,g} = x_{\min}, \quad q = \mathrm{clip}\left( \mathrm{round}\left(\frac{x-z}{s}\right), 0, 3 \right)$

Keys and values are packed into 32-bit words for efficient memory access, and per-token decoding fuses dequantization, unpacking, and matmul in a single memory pass. This design allows an 86 % reduction in KV cache size with over 98.5 % accuracy retention, and enables single-GPU inference with context windows exceeding 44 K tokens.
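The quantization formula and the 32-bit packing can be illustrated in NumPy. This is a minimal sketch of asymmetric 2-bit group quantization with 16 values per group, so that one group fills exactly one 32-bit word. It is not MiniKV's fused GPU kernel, it omits heavy-hitter selection, and the function names and the low-bits-first packing order are assumptions.

```python
import numpy as np

GROUP = 16                   # values per quantization group (from the text)
BITS = 2                     # bits per value
LEVELS = (1 << BITS) - 1     # 2^b - 1 = 3


def quantize_pack(x):
    """Quantize groups of 16 floats to 2 bits and pack each group into a uint32."""
    x = np.asarray(x, dtype=np.float32).reshape(-1, GROUP)
    zero = x.min(axis=1, keepdims=True)                    # z = x_min
    scale = (x.max(axis=1, keepdims=True) - zero) / LEVELS  # s = (x_max - x_min) / (2^b - 1)
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)  # guard constant groups
    q = np.clip(np.round((x - zero) / scale), 0, LEVELS).astype(np.uint32)
    shifts = BITS * np.arange(GROUP, dtype=np.uint32)       # assumed order: value j at bits 2j..2j+1
    words = np.bitwise_or.reduce(q << shifts, axis=1)
    return words, scale[:, 0], zero[:, 0]


def unpack_dequantize(words, scale, zero):
    """Inverse of quantize_pack: unpack 2-bit codes and map back to floats."""
    shifts = BITS * np.arange(GROUP, dtype=np.uint32)
    q = (words[:, None] >> shifts) & np.uint32(LEVELS)
    return q.astype(np.float32) * scale[:, None] + zero[:, None]


if __name__ == "__main__":
    x = np.linspace(-1.0, 1.0, 2 * GROUP, dtype=np.float32)
    words, scale, zero = quantize_pack(x)
    x_hat = unpack_dequantize(words, scale, zero)
    print(words.dtype, words.shape)
    print(float(np.abs(x_hat - x.reshape(-1, GROUP)).max()))
```

Rounding to the nearest level bounds the per-value reconstruction error by half the group's scale. A production kernel would fuse the unpack and dequantize steps with the attention matmul rather than materializing `x_hat`.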

4. Depth-KV Layouts for Cross-Layer Attention and Efficient Retrieval

Recent extensions, such as Mixture-of-Depths Attention (MoDA) (Zhu et al., 16 Mar 2026), require "depth-KV" layouts that provide fast retrieval of not just within-layer KV states but also preceding layer states at each token position (depth-wise attention). To achieve high efficiency, MoDA flattens all depth KV tensors into contiguous blocks:

Key/value tensors: flattened to $T \cdot L$ rows (where $T$ is sequence length, $L$ is number of layers)
For each token index $i$, its $L$ depth states occupy contiguous rows $[iL, (i+1)L)$, supporting contiguous blockwise access in FlashAttention-compatible fused kernels.
Integration with chunk-aware and group-aware packing improves locality, reduces global memory traffic, and enables utilization above 97 % of baseline FlashAttention-2 speeds at up to $T = 64$K.
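The flattening idea can be sketched with a plain array reshape. Assume a per-head tensor of depth states with shape (layers, tokens, head dimension). Reordering it token-major places all depth states of one token in adjacent rows, so a kernel can fetch them with one contiguous block read instead of a strided gather. The symbols `T`, `L`, `i`, the input shape, and the function names are assumptions for illustration; this is not MoDA's kernel code.

```python
import numpy as np


def flatten_depth_kv(kv: np.ndarray) -> np.ndarray:
    """Reorder (L, T, d) depth states into a contiguous (T * L, d) array.

    Row i * L + l holds the state of token i at layer l.
    """
    L, T, d = kv.shape
    return np.ascontiguousarray(kv.transpose(1, 0, 2)).reshape(T * L, d)


def depth_rows(i: int, L: int) -> slice:
    """Contiguous row range holding all L depth states of token i."""
    return slice(i * L, (i + 1) * L)


if __name__ == "__main__":
    L, T, d = 3, 4, 2
    kv = np.arange(L * T * d, dtype=np.float32).reshape(L, T, d)
    flat = flatten_depth_kv(kv)
    block = flat[depth_rows(2, L)]   # one contiguous read for token 2
    print(flat.shape, np.array_equal(block, kv[:, 2, :]))
```

In the original (L, T, d) order the same read would touch `L` rows separated by a stride of `T * d` elements, which is the access pattern the flattened layout avoids.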

Comparison of asymptotic memory bandwidth:

Layout	Complexity (dominant term)	Bandwidth efficiency (64K context)
FlashAttention-2	…	Baseline (100 %)
MoDA depth-KV	…	97.3 % of baseline

This design enables cross-layer retrieval without sacrificing high-throughput chunked attention or incurring strided gathers that are intractable in device memory.

5. Flash-Compatible Depth-kV Layouts in Medical Imaging

Flash-compatible depth-kV layouts also appear in real-time, image-guided workflows for proton FLASH radiotherapy, as in Chang et al. (2022). Here, "depth-kV" refers to capturing volumetric patient data via orthogonal, instantaneous kV x-ray projections at oblique gantry angles (135°/225°) that maximize both structural fidelity and water-equivalent thickness (WET) accuracy while minimizing dose and acquisition time.

The deep-learning pipeline, InverseNet3D, reconstructs a 3D CT volume from these projections in under 0.5 s, guided by custom loss functions (voxelwise, gradient, SSIM). Results for 30 patients at 135°/225° are reported as mean absolute error (MAE, in HU), SSIM, and WET deviation (in mm). The full imaging protocol, including synchronized FLASH/non-FLASH exposure, detector collimation, and safety interlocks, is engineered for sub-second acquisition and volumetric guidance.

6. Design-Space Trade-offs, Extensibility, and Impact

Optimal flash-compatible depth-KV layouts require balancing parallelism, on-die buffer sizing, quantization precision, and the mapping of KV pages to flash plane/block topologies. Design space exploration in KVNAND reveals that the optimal partitioning of dies for weights/QKV versus KV retrieval (the G1:G2 split) shifts from G1-heavy at short context to G2-heavy at very long context, with quantization (W8A8 vs W4A16) further influencing the trade-off. Reliability is ensured by limiting page read frequency and distributing access to minimize read-disturb.

These patterns generalize to any compute-in-flash substrate capable of page/group-buffered read+MAC, as well as to alternative attention sparsity schemes or hybrid DRAM+flash deployments; in each case the depth-KV allocation strategy remains pivotal for memory bandwidth, latency, and energy efficiency. In medical imaging, the oblique dual-kV capture layout yields optimal trade-offs in volumetric reconstruction accuracy and workflow speed. Together, these depth-KV layouts are enablers for edge-scale LLMs and real-time medical systems that require high-bandwidth, low-latency retrieval from a page-based storage substrate.

7. Experimental Outcomes and Research Landscape

Across LLM and medical imaging domains, flash-compatible depth-KV layouts have concretely demonstrated:

Sustained throughput and reliability at long context lengths where DRAM-only or naïve KV-in-flash solutions fail (e.g., KVNAND at 100 K tokens yields ≈10 tokens/s on an 8B model, with out-of-memory or severe slowdown for DRAM/naïve-flash baselines) (Deng et al., 3 Dec 2025).
Empirical speedups of 1.94–2.05× over prior DRAM-IFC at context lengths from 128 to 10 K tokens (Deng et al., 3 Dec 2025), and throughput gains up to 66 % with up to 55 % memory reduction in GPU-based attention (Sharma et al., 2024).
Pareto-optimal trade-offs in cache size and retained accuracy (Sharma et al., 2024), and near-lossless long-context accuracy for compressed depth-KV layouts.
Retrieval efficiency at 97.3 % of highly-optimized FlashAttention-2 in cross-layer settings (Zhu et al., 16 Mar 2026).
In medical imaging, rapid and artifact-minimized 3D CT generation, with WET-based error detection for FLASH therapy, in under 1 s total latency (Chang et al., 2022).

These results position flash-compatible depth-KV layouts as a foundational strategy for bandwidth- and latency-bound inference and imaging applications under hardware constraints.


References (4)

KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing (2025)

MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache (2024)

Mixture-of-Depths Attention (2026)

Deep learning-based Fast Volumetric Image Generation for Image-guided Proton FLASH Radiotherapy (2022)
