
Flash-Compatible Depth-KV Layout

Updated 17 March 2026
  • Flash-compatible depth-KV layouts organize key-value data by aligning storage along sequence and depth axes to optimize page-level access on flash memory.
  • They employ head-group parallelism and pipelined flash die partitioning to achieve up to 2× speedup in LLM inference and scalable performance at high token contexts.
  • Quantization and compression methods reduce KV cache size by up to 86% with minimal accuracy loss; in a separate medical-imaging usage, dual-kV projection layouts enable rapid 3D imaging.

A flash-compatible depth-KV layout is a data organization and hardware-software strategy for storing and retrieving key-value (KV) states efficiently from non-volatile or high-latency memory, with the aim of maximizing bandwidth, reducing memory footprint, and maintaining performance. The main settings are KV caches for LLMs held in NAND flash and volumetric imaging in real-time medical workflows. The term "depth-KV" refers to layouts that align the storage and access of KV pairs, or of kV-level medical imaging data, along both the sequence axis (token, spatial, or temporal) and the depth axis (layer, attention head, or detector angle), while being optimized for the block- or page-based access required by flash memory and by streaming-attention algorithms such as FlashAttention.

1. Flash-Compatible Depth-KV Layout in LLMs

In resource-constrained edge settings, LLM inference bottlenecks stem from the size and bandwidth demands of KV caches, particularly at long context lengths. The KVNAND architecture (Deng et al., 3 Dec 2025) introduces a depth-KV page-level mapping strategy in 3D NAND flash to store the entire KV cache and all model weights, eliminating dependence on external DRAM and exploiting in-flash computing (IFC) primitives for efficient retrieval.

Flash page geometry defines the allocation granularity: typical pages are 4 KB and each flash plane holds a local SRAM buffer. KV entries per (layer, head, token) are sized as $\mathrm{KV_{unit}} = (d/h) \cdot \text{precision}$ (e.g., 256 B at $d/h = 128$, FP16). In KVNAND’s mapping, each page holds $t$ consecutive tokens:

$$t = \frac{P \cdot (N_{\mathrm{die}} \times M_{\mathrm{planes}} / (2k))}{\mathrm{KV_{unit}}}$$

where $P$ is the page size and $k$ is the number of K or V heads. Each page stores the depth-major KV content for a specific (layer, head), improving locality for block reads required in single-batch autoregressive decoding. By grouping KV cache access and update into page-level transactions, and by grid-mapping different (layer, head) pairs to separate physical blocks, KVNAND reduces read amplification and mitigates read-disturb stress on flash blocks.
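As a rough numerical check of the page-capacity formula above, the following sketch evaluates $\mathrm{KV_{unit}}$ and $t$ directly. The function names and the die, plane, and head counts are hypothetical values chosen for illustration, not figures from KVNAND; only the 4 KB page and the 256 B unit come from the text. Reading $N_{\mathrm{die}} \times M_{\mathrm{planes}} / (2k)$ as the share of planes per K or V head is an interpretation, not something the source states.

```python
def kv_unit_bytes(head_dim: int, bytes_per_element: int) -> int:
    """Bytes in one K (or V) vector for a single (layer, head, token)."""
    return head_dim * bytes_per_element


def tokens_per_page(page_bytes: int, n_die: int, m_planes: int,
                    k_heads: int, unit_bytes: int) -> float:
    """Evaluate t = P * (N_die * M_planes / (2k)) / KV_unit."""
    # Assumed interpretation: planes divided among the 2k K and V head streams.
    plane_share = (n_die * m_planes) / (2 * k_heads)
    return page_bytes * plane_share / unit_bytes


# d/h = 128 at FP16 (2 bytes per element), as in the text.
unit = kv_unit_bytes(head_dim=128, bytes_per_element=2)

# Hypothetical geometry: 8 dies, 4 planes per die, 8 KV heads, 4 KB pages.
t = tokens_per_page(page_bytes=4096, n_die=8, m_planes=4,
                    k_heads=8, unit_bytes=unit)

print(unit)  # 256
print(t)     # 32.0
```

With these assumed values, each K or V head receives two planes' worth of page space, so 32 tokens fit per page group.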

2. Parallelism Strategies and Pipelined Access

To maximize throughput under the constraints of flash I/O and in-flash compute unit parallelism, KVNAND applies "head-group parallelism" (HG-parallelism). A head-group is a (K,V) pair together with its associated Q heads, which suits both Multi-Head Attention (MHA) and Grouped-Query Attention (GQA). In pipelined execution, flash dies are partitioned so that while one group (G1) reads weights and generates QKV, a second (G2) concurrently runs attention using stored K, V blocks, allowing continuous utilization of available flash compute bandwidth.

The effective compute resources per plane are provisioned to match the parallel reuse factor:

$$\text{PEs}_\text{plane} \propto \frac{h}{k}$$

where $h$ is the number of heads and $k$ is the number of (K,V) pairs (relevant for GQA). This structure enables efficient, low-latency, and scalable inference at context lengths up to 100 K tokens, as demonstrated by empirical results showing 1.98–2.05× speedups versus DRAM-equipped IFC at short and long contexts.
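The head-group structure and the $h/k$ reuse factor can be illustrated with a small sketch. The function name `head_groups` and the contiguous assignment of Q heads to each (K,V) pair are assumptions that follow the common GQA convention; the source does not specify how KVNAND orders heads within a group.

```python
def head_groups(num_q_heads: int, num_kv_heads: int) -> list:
    """Group each (K,V) head with the Q heads that share it.

    Assumes contiguous Q heads map to the same KV head (common GQA convention).
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    reuse = num_q_heads // num_kv_heads  # the h/k reuse factor
    return [
        {"kv_head": g, "q_heads": list(range(g * reuse, (g + 1) * reuse))}
        for g in range(num_kv_heads)
    ]


# GQA example with hypothetical sizes: 32 Q heads sharing 8 (K,V) pairs.
groups = head_groups(num_q_heads=32, num_kv_heads=8)
print(groups[0])  # {'kv_head': 0, 'q_heads': [0, 1, 2, 3]}

# MHA is the special case h == k, where every group holds a single Q head.
mha_groups = head_groups(num_q_heads=4, num_kv_heads=4)
```

Under this reading, each K or V page fetched into a plane buffer is reused by $h/k$ query heads, which is why per-plane compute would scale with that ratio.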

3. Quantization and Compression for Efficient GPU Attention

The MiniKV method (Sharma et al., 2024) complements flash-compatible layouts on GPUs by introducing a two-bit, layer-discriminative, depth-KV cache that is tightly packed and quantized to minimize memory footprint while maintaining compatibility with FlashAttention-style GPU kernels. After selecting heavy-hitter token positions per layer in the prefill phase, only these entries are compressed using 2-bit quantization per group of 16 values:

$$s_{\ell,j,g} = \frac{x_{\max} - x_{\min}}{2^b - 1}, \quad z_{\ell,j,g} = x_{\min}, \quad q = \mathrm{clip}\left( \mathrm{round}\left( \frac{x - z}{s} \right), 0, 3 \right)$$

Keys and values are packed into 32-bit words for efficient memory access, and per-token decoding fuses dequantization, unpacking, and matmul in a single memory pass. This design allows an 86 % reduction in KV cache size with over 98.5 % accuracy retention, and enables single-GPU inference with context windows exceeding 44 K tokens.
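The quantization formula and the 32-bit packing can be sketched in plain Python. This is a minimal illustration rather than MiniKV's kernel: the function names are hypothetical, and aligning one group of 16 two-bit codes to exactly one 32-bit word is an assumption that happens to fit the stated sizes.

```python
def quantize_group(values, bits=2):
    """Asymmetric min/max quantization of one group; returns (codes, scale, zero)."""
    levels = (1 << bits) - 1              # 2^b - 1 = 3 for b = 2
    zero = min(values)
    scale = (max(values) - zero) / levels
    if scale == 0.0:
        scale = 1.0                       # constant group: all codes become 0
    # Python's round() rounds exact ties to the nearest even integer.
    codes = [min(max(round((x - zero) / scale), 0), levels) for x in values]
    return codes, scale, zero


def pack_word(codes, bits=2):
    """Pack codes into one integer, lowest-order bits first."""
    word = 0
    for i, q in enumerate(codes):
        word |= q << (i * bits)
    return word


def unpack_word(word, count=16, bits=2):
    """Inverse of pack_word."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]


def dequantize(codes, scale, zero):
    """Map integer codes back to approximate real values."""
    return [q * scale + zero for q in codes]


group = [i / 5.0 for i in range(16)]      # 16 illustrative values in [0, 3]
codes, scale, zero = quantize_group(group)
word = pack_word(codes)                   # 16 codes x 2 bits = 32 bits
restored = dequantize(unpack_word(word), scale, zero)
```

For values inside the group's range, the reconstruction error is at most half a quantization step, $s/2$. A fused kernel would perform the unpack and dequantize steps inside the attention matmul instead of materializing `restored`.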

4. Depth-KV Layouts for Cross-Layer Attention and Efficient Retrieval

Recent extensions, such as Mixture-of-Depths Attention (MoDA) (Zhu et al., 16 Mar 2026), require "depth-KV" layouts that provide fast retrieval of not just within-layer KV states but also preceding layer states at each token position (depth-wise attention). To achieve high efficiency, MoDA flattens all depth KV tensors into contiguous blocks:

  • Key/value tensors: $K_\mathrm{depth}, V_\mathrm{depth} \in \mathbb{R}^{T \cdot L \times H_k d}$ (where $T$ is sequence length, $L$ is number of layers)
  • For each token index $t$, its $L$ depth states occupy contiguous rows $[tL, \ldots, (t+1)L - 1]$, supporting contiguous blockwise access in FlashAttention-compatible fused kernels.
  • Integration with chunk-aware and group-aware packing improves locality, reduces global memory traffic, and enables utilization above 97 % of baseline FlashAttention-2 speeds at up to $T = 64$K.

Comparison of asymptotic memory bandwidth:

| Layout | Complexity (dominant term) | Bandwidth efficiency (64K context) |
| --- | --- | --- |
| FlashAttention-2 | $O(T^2 D)$ | Baseline (100 %) |
| MoDA depth-KV | $O(T^2 D + (T L D)/G)$ | 97.3 % of baseline |

This design enables cross-layer retrieval without sacrificing high-throughput chunked attention or incurring strided gathers that are intractable in device memory.
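The token-major flattening described in this section can be sketched with simple index arithmetic. The helper names are hypothetical and plain lists stand in for tensors; the only property taken from the text is that token $t$ owns rows $tL$ through $(t+1)L - 1$.

```python
def depth_row(t: int, layer: int, num_layers: int) -> int:
    """Row of (token t, layer) in the flattened depth-KV tensor."""
    return t * num_layers + layer


def depth_rows_for_token(t: int, num_layers: int) -> range:
    """Contiguous row range [t*L, (t+1)*L - 1] holding all depth states of token t."""
    start = t * num_layers
    return range(start, start + num_layers)


def flatten_depth_kv(per_layer):
    """Reorder per_layer[layer][t] into a flat token-major, layer-minor list."""
    num_layers = len(per_layer)
    seq_len = len(per_layer[0])
    return [per_layer[l][t] for t in range(seq_len) for l in range(num_layers)]


# Toy example: L = 3 layers, T = 4 tokens; each entry records its (layer, token).
L, T = 3, 4
per_layer = [[(l, t) for t in range(T)] for l in range(L)]
flat = flatten_depth_kv(per_layer)

rows = depth_rows_for_token(2, L)
print(list(rows))              # [6, 7, 8]
print([flat[r] for r in rows])  # [(0, 2), (1, 2), (2, 2)]
```

Because all depth states of a token are adjacent, a kernel can load them as one block instead of gathering one row per layer with a stride of $T$.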

5. Flash-Compatible Depth-kV Layouts in Medical Imaging

Flash-compatible depth-kV layouts also appear in real-time, image-guided workflows for proton FLASH radiotherapy, as in (Chang et al., 2022). Here, "depth-kV" refers to capturing volumetric patient data via orthogonal, instantaneous kV x-ray projections at oblique gantry angles (135°/225°) that maximize both structural fidelity and water-equivalent thickness (WET) accuracy while minimizing dose and acquisition time.

The deep-learning pipeline, InverseNet3D, reconstructs a 3D CT volume from these projections in under 0.5 s, guided by custom loss functions (voxelwise $L_1$, gradient, SSIM). Mean absolute errors (MAE) for 30 patients at 135°/225° reach $75.6 \pm 22.4$ HU, with SSIM of $0.938 \pm 0.044$ and WET deviation of $-1.3 \pm 3.7$ mm. The full imaging protocol, including synchronized FLASH/non-FLASH exposure, detector collimation, and safety interlocks, is engineered for sub-second acquisition and volumetric guidance.

6. Design-Space Trade-offs, Extensibility, and Impact

Optimal flash-compatible depth-KV layouts require balancing parallelism, on-die buffer sizing, quantization precision, and mapping of KV pages to flash plane/block topologies. Design space exploration in KVNAND reveals that optimal partitioning of dies for weights/QKV versus KV retrieval (G1:G2 split) shifts from G1-heavy at low context ($s \lesssim 1$K) to G2-heavy at very long context ($s \gg 10$K), with quantization (W8A8 vs W4A16) further influencing the trade-off. Reliability is ensured by limiting page read frequency and distributing access to minimize read-disturb.

These patterns generalize to any compute-in-flash substrate capable of page/group-buffered read+MAC, as well as to alternative attention sparsity or hybrid DRAM+flash deployments. In each case, the depth-KV allocation strategy remains pivotal for memory bandwidth, latency, and energy efficiency. In medical imaging, the oblique dual-kV capture layout yields favorable trade-offs between volumetric reconstruction accuracy and workflow speed. Together, these depth-KV layouts are enablers for edge-scale LLMs and real-time medical systems that require high-bandwidth, low-latency retrieval from a page-based storage substrate.

7. Experimental Outcomes and Research Landscape

Across LLM and medical imaging domains, flash-compatible depth-KV layouts have concretely demonstrated:

  • Sustained throughput and reliability at long context lengths where DRAM-only or naïve KV-in-flash solutions fail (e.g., KVNAND at 100 K tokens yields ≈10 tokens/s on an 8B model, with out-of-memory or severe slowdown for DRAM/naïve-flash baselines) (Deng et al., 3 Dec 2025).
  • Empirical speedups of 1.94–2.05× over prior DRAM-IFC at context lengths from 128 to 10 K tokens (Deng et al., 3 Dec 2025), and throughput gains up to 66 % with up to 55 % memory reduction in GPU-based attention (Sharma et al., 2024).
  • Pareto-optimal trade-offs in cache size and retained accuracy (Sharma et al., 2024), and near-lossless long-context accuracy for compressed depth-KV layouts.
  • Retrieval efficiency at 97.3 % of highly-optimized FlashAttention-2 in cross-layer settings (Zhu et al., 16 Mar 2026).
  • In medical imaging, rapid and artifact-minimized 3D CT generation, with WET-based error detection for FLASH therapy, in under 1 s total latency (Chang et al., 2022).

These results position flash-compatible depth-KV layouts as a foundational strategy for bandwidth- and latency-bound inference and imaging applications under hardware constraints.
