Flat GEMM Optimization via Double Buffering

Updated 1 March 2026

The cited papers report that double buffering raises effective flat-GEMM throughput by up to 50% on GPUs and reaches 77% of the compute-bound ceiling on AI NPUs.
They describe a methodology in which alternating 'ping-pong' buffers in shared memory or L1 overlap memory transfers with computation to hide latency.
The approach matters for efficient LLM inference in transformer decoders and adapts to other AI hardware with on-chip memory.

Flat GEMM optimization via double buffering encompasses a set of techniques designed to maximize resource utilization and throughput for general matrix-matrix multiplication (GEMM) when matrix shapes are "flat," i.e., where one dimension (typically the batch or sequence length) is much smaller than the others. This scenario is prevalent in transformer-based decoders for LLMs and on AI accelerators equipped with low-precision matrix units but lacking native support for high-precision GEMM. Double buffering, the alternating use of two on-chip memory regions so that memory transfers overlap with computation, is a common strategy for hiding memory latency and keeping compute units busy. Its implementation for flat GEMM has been studied on both GPUs (Hong et al., 2023) and AI NPUs (Xue et al., 31 Jul 2025), with reported gains of up to 50% in effective throughput on GPUs and about 77% of the compute-bound ceiling on NPUs.

1. Flat GEMM Regimes and Bottlenecks

In LLM inference, especially during the decode loop, the GEMM workloads depart from conventional dense regimes. The operation $C \leftarrow A B$ is performed where $A$ has shape $M\times K$ with small $M$ (often $M=1$ for GEMV, or $M \lesssim 8$ for small batch), while $B$ is $K\times N$ with large $N$ (e.g., 4096 or 11008) (Hong et al., 2023). Critical performance bottlenecks arise:

Small $M$ (<8): Insufficient parallel work per CUDA block; device occupancy is low.
Moderate $M$, moderate $N$: Blocks have enough computational work, yet the latency of loading columns of $B$ from global memory dominates. Arithmetic intensity (AI) drops, rendering the operation memory-bound.
Large $M$: Compute-bound regime where classical tiling (e.g., 64×64×32) and standard cuBLAS/CUTLASS routines suffice.

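The arithmetic intensity of one $BK \times BN$ tile step, counted in FLOPs per element moved, is:

$\mathrm{AI} = \frac{2 \cdot M \cdot BN \cdot BK}{M \cdot BK + BN \cdot BK + M \cdot BN}$

where the denominator counts the $A$ tile, the $B$ tile, and the $M \times BN$ output tile. The ratio is bounded above by $2M$, so it trends toward memory-bound for small $M$ and small $BN$ (Hong et al., 2023).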
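
To make the trend concrete, the ratio can be evaluated for a few tile shapes. The following is a minimal host-side C++ sketch, not code from the cited paper: the function name `tile_arithmetic_intensity` is hypothetical, intensity is counted in FLOPs per element moved rather than per byte, and the output term is assumed to be the $M \times BN$ tile of $C$.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: FLOPs per element moved for one BK x BN tile step.
// Assumed traffic: the A tile (M*BK), the B tile (BN*BK), and the M x BN
// output tile. Counts are in elements, not bytes.
double tile_arithmetic_intensity(double M, double BN, double BK) {
    double flops = 2.0 * M * BN * BK;                   // multiply + add
    double elements_moved = M * BK + BN * BK + M * BN;  // A + B + C tiles
    return flops / elements_moved;
}
```

Under these assumptions, $M=8$, $BK=32$, $BN=128$ gives roughly 12.2 FLOPs per element, while $M=1$ with the same tile gives roughly 1.9. The $B$-tile term alone caps the ratio at $2M$, which is why small $M$ leaves so little reuse to work with.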

2. Double Buffering Technique in Shared Memory

Double buffering enables overlapping the loading of data tiles with compute, thus hiding memory or I/O latencies. In the context of flat GEMM using NVIDIA GPUs, the method works as follows (Hong et al., 2023):

Block Tile Sizes: $BK=32$ (K-tile), $BN=128$ or $256$ (N-tile), $M$ is padded to $8$.
Buffer Allocation: Two regions, e.g., A_s[2][8][BK] and B_s[2][BK][BN], are allocated in shared memory, alternating between computation and asynchronous prefetch.
Kernel Structure: The kernel loop loads the next tile into one buffer ("ping" or "pong") while computing using the other, ensuring compute units and memory bandwidth are simultaneously utilized.

The prologue loads the first tile; the main loop alternates between loading and computing:

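// Prologue: stage tile 0 in buffer 0
load_global_to_shared(A, B, /*tile=*/0, /*buf=*/0);
__syncthreads();
// Steady state: prefetch tile t while computing on tile t-1
for (int t = 1; t < T; ++t) {
  int cur = (t - 1) & 1, nxt = t & 1;
  load_global_to_shared_async(A, B, /*tile=*/t, /*buf=*/nxt);
  compute_tensor_core_m8n8k32(A_s[cur], B_s[cur], accum);
  wait_for_async_loads();  // placeholder: the prefetch must land before the barrier
  __syncthreads();         // buffer nxt is now readable; buffer cur may be overwritten
}
// Epilogue: compute on the last staged tile
compute_tensor_core_m8n8k32(A_s[(T - 1) & 1], B_s[(T - 1) & 1], accum);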

Each warp loads coalesced chunks of $A$ and $B$ into shared memory.
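
The ping-pong indexing can be checked on the host before writing a kernel. The sketch below is an assumption-laden emulation, not the FlashDecoding++ kernel: `Tile`, `load_tile`, `compute_tile`, and `flat_gemm_double_buffered` are hypothetical names, plain copies stand in for global-to-shared loads, and nothing runs concurrently, so it verifies only that the prologue/loop/epilogue schedule touches every K-tile exactly once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One staging buffer: an M x BK slice of A and a BK x N slice of B.
struct Tile {
    std::vector<float> a;
    std::vector<float> b;
};

// Copy K-tile t of row-major A (M x K) and B (K x N) into dst.
// Rows past K are left at zero, which emulates zero-padding the tail.
void load_tile(const std::vector<float>& A, const std::vector<float>& B,
               int M, int N, int K, int BK, int t, Tile& dst) {
    dst.a.assign(static_cast<std::size_t>(M) * BK, 0.0f);
    dst.b.assign(static_cast<std::size_t>(BK) * N, 0.0f);
    for (int kk = 0; kk < BK; ++kk) {
        int k = t * BK + kk;
        if (k >= K) break;
        for (int m = 0; m < M; ++m) dst.a[m * BK + kk] = A[m * K + k];
        for (int n = 0; n < N; ++n) dst.b[kk * N + n] = B[k * N + n];
    }
}

// Accumulate the contribution of one staged tile into C (M x N).
void compute_tile(const Tile& s, int M, int N, int BK, std::vector<float>& C) {
    for (int m = 0; m < M; ++m)
        for (int kk = 0; kk < BK; ++kk)
            for (int n = 0; n < N; ++n)
                C[m * N + n] += s.a[m * BK + kk] * s.b[kk * N + n];
}

std::vector<float> flat_gemm_double_buffered(const std::vector<float>& A,
                                             const std::vector<float>& B,
                                             int M, int N, int K, int BK) {
    assert(K > 0 && BK > 0);
    int T = (K + BK - 1) / BK;  // number of K-tiles
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    Tile buf[2];
    load_tile(A, B, M, N, K, BK, 0, buf[0]);      // prologue
    for (int t = 1; t < T; ++t) {
        int cur = (t - 1) & 1, nxt = t & 1;
        load_tile(A, B, M, N, K, BK, t, buf[nxt]);  // asynchronous on a GPU
        compute_tile(buf[cur], M, N, BK, C);
    }
    compute_tile(buf[(T - 1) & 1], M, N, BK, C);  // epilogue
    return C;
}
```

Comparing the result against a naive triple loop for a shape where $K$ is not a multiple of $BK$ (for example $M=2$, $N=3$, $K=5$, $BK=2$) exercises both the buffer alternation and the padded tail.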

3. Cache-Aware Blocking and Double Buffering on NPUs

On AI NPUs without full-precision GEMM units, double buffering integrates into a broader cache-aware blocking strategy (Xue et al., 31 Jul 2025):

Matrix Partitioning: $A$ split into $b_m \times b_k$ tiles, $B$ into $b_k \times b_n$ tiles, $C$ into $b_m \times b_n$ blocks.
L1 Buffering: Multiple copies ($N_\text{fused}$) of $A$-tiles reside in L1; two buffers alternate for $B$-tiles.
Double-Buffered Pipeline: While one $B$-tile is consumed in compute, the next $B$-tile streams in through a GM → UB (global memory to unified buffer) → FP32-to-FP16 conversion → L1 pipeline.
Loop Structure:
- Load stage: Preload all $A$-tiles (and conversion residues) for a fused group into L1.
- Compute stage: For each $B$ block, double-buffer between two regions, ping-pong style, while executing the cube GEMM and accumulation.

This approach enables close to 77% of the theoretical FP32-equivalent compute ceiling on Ascend 910A NPU (Xue et al., 31 Jul 2025).
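
A quick capacity check helps when picking $b_m$, $b_k$, $b_n$, and $N_\text{fused}$. The sketch below is illustrative only: `l1_footprint_bytes` and `fits_in_l1` are hypothetical helpers, the L1 capacity is supplied by the caller rather than taken from any hardware specification, and the extra tiles needed for conversion residues are deliberately not counted.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sizing helper: bytes of L1 needed to keep n_fused A-tiles
// resident plus two (ping/pong) B-tiles. elem_bytes is the size of one
// stored element (2 for FP16). Residue tiles are not included.
std::size_t l1_footprint_bytes(std::size_t bm, std::size_t bk, std::size_t bn,
                               std::size_t n_fused, std::size_t elem_bytes) {
    std::size_t a_resident = n_fused * bm * bk;  // static operand
    std::size_t b_pingpong = 2 * bk * bn;        // double-buffered operand
    return (a_resident + b_pingpong) * elem_bytes;
}

// True if the blocking fits within a caller-supplied L1 capacity in bytes.
bool fits_in_l1(std::size_t bm, std::size_t bk, std::size_t bn,
                std::size_t n_fused, std::size_t elem_bytes,
                std::size_t l1_capacity_bytes) {
    return l1_footprint_bytes(bm, bk, bn, n_fused, elem_bytes) <= l1_capacity_bytes;
}
```

With made-up tile sizes $b_m=128$, $b_k=64$, $b_n=256$, $N_\text{fused}=4$ and FP16 storage, the footprint is 131072 bytes (128 KiB); these numbers are not taken from the paper and serve only to show the arithmetic.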

4. Overlap Model and Performance Analysis

The effectiveness of double buffering is governed by the ability to overlap compute and memory transfer. Given $L$ tiles in the K-dimension, compute time $T_\text{comp}$, and memory time $T_\text{mem}$ per tile:

Without overlap: Total time is $L \cdot (T_\text{comp} + T_\text{mem})$.
With ideal overlap (double buffering): Time becomes $T_\text{mem} + (L-1) \cdot \max(T_\text{comp}, T_\text{mem}) + T_\text{comp}$, i.e., one exposed load in the prologue, $L-1$ overlapped steady-state steps, and one final compute in the epilogue.

Choosing block sizes so that $T_\text{mem} \lesssim T_\text{comp}$ can switch the regime from memory- to compute-bound (Hong et al., 2023).
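
The two timing expressions are easy to compare numerically. The following is a rough sketch with hypothetical function names and unitless per-tile times; it assumes $L \ge 1$ and counts $L-1$ overlapped steady-state steps between the initial load and the final compute, which matches a prologue/loop/epilogue kernel structure.

```cpp
#include <algorithm>
#include <cassert>

// Serial schedule: every tile pays for its load and its compute in turn.
double time_without_overlap(int L, double t_comp, double t_mem) {
    return L * (t_comp + t_mem);
}

// Ideal double buffering: one exposed load, L-1 overlapped steps, and one
// final compute. Assumes L >= 1 and perfect overlap of load and compute.
double time_with_overlap(int L, double t_comp, double t_mem) {
    return t_mem + (L - 1) * std::max(t_comp, t_mem) + t_comp;
}

// Ratio of the two models; approaches 2 when t_comp == t_mem and L is large.
double overlap_speedup(int L, double t_comp, double t_mem) {
    return time_without_overlap(L, t_comp, t_mem) /
           time_with_overlap(L, t_comp, t_mem);
}
```

For $L=128$ (for instance $K=4096$ with $BK=32$) and balanced per-tile times, the model gives 256 versus 129 time units, close to the ideal 2×. When $T_\text{mem} = 3\,T_\text{comp}$ the gain falls to about 1.33×, because only the compute is hidden and the kernel stays memory-bound.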

Performance on NVIDIA A100 (M=8, K=4096, N=4096) demonstrates:

Standard CUTLASS (padded to M=64): ~1 TFLOPS (flat GEMM under-utilized).
M8-BK32-BN128 with double buffering: ~1.8 TFLOPS, reported as a speedup of up to 50% (Hong et al., 2023).

On Ascend 910A running H2SGEMM, 65.3 TFLOPS is achieved versus a compute-bound estimate of 85.3 TFLOPS (77% efficiency) (Xue et al., 31 Jul 2025).

5. Memory Hierarchy, Block Sizing, and Practical Lessons

Optimizing double buffering for flat GEMM requires careful block sizing and awareness of on-chip memory hierarchy:

Padding: $M$ is padded to match the micro-tile size supported by hardware primitives (e.g., $m8n8k32$ on NVIDIA Tensor Cores).
BK selection: Chosen so each warp independently loads K-tiles without atomic operations.
BN selection: Set to a multiple matching thread block parallelism and maximizing SM utilization.
Buffer Management: "Ping" and "pong" shared memory buffers (or NPU L1 regions) alternate between being loaded from DRAM and consumed in computation.
Tail Handling: For non-aligned $N$ or $K$, zero-padding brings dimensions to the nearest multiple; this overhead is negligible for small $M$.

General principles applicable across hardware types include sizing blocks so static operands fit in one region, dedicating two regions for double-buffering the bandwidth-limited operand, and ensuring register or local buffer capacity meets micro-tiling needs. Overlapping memory and compute phases conceals latency as long as the transfer duration does not dominate compute (Xue et al., 31 Jul 2025).
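
The padding and buffer-sizing rules above reduce to a little integer arithmetic. The sketch below uses hypothetical helper names and assumes the shared-memory layout A_s[2][M_pad][BK] and B_s[2][BK][BN] with a caller-chosen element size; it is meant as a sanity check on block sizes, not as a tuning procedure from either paper.

```cpp
#include <cassert>
#include <cstddef>

// Round x up to the next multiple of `multiple` (multiple > 0).
std::size_t round_up(std::size_t x, std::size_t multiple) {
    return (x + multiple - 1) / multiple * multiple;
}

// Bytes for the two ping/pong regions: A_s[2][m_pad][bk] + B_s[2][bk][bn].
// elem_bytes is an assumption supplied by the caller (2 for FP16).
std::size_t double_buffer_smem_bytes(std::size_t m_pad, std::size_t bk,
                                     std::size_t bn, std::size_t elem_bytes) {
    return 2 * (m_pad * bk + bk * bn) * elem_bytes;
}
```

For example, `round_up(5, 8)` pads $M=5$ to 8, and `round_up(11008, 128)` leaves $N=11008$ unchanged because it is already a multiple of 128. With $M$ padded to 8, $BK=32$, and FP16 elements, the two buffers take 17408 bytes for $BN=128$ and 33792 bytes for $BN=256$.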

6. Broader Implications and Applicability

These double buffering strategies directly address utilization gaps that arise in flat GEMM due to small $M$ or memory-bound behavior at moderate $N$. They are crucial for LLM inference engines such as FlashDecoding++, which demonstrates a 1.37× average speedup over prior baselines and up to 4.86× over standard Hugging Face PyTorch implementations (Hong et al., 2023). Moreover, the methodology extends to any hardware with scratchpad memory, including GPUs and AI NPUs, and is compatible with mixed-precision and emulated high-precision GEMM. Precision recovery through operand decomposition, as in H2SGEMM, ensures that double buffering does not compromise numerical stability, preserving or even surpassing the accuracy of classical high-precision implementations (Xue et al., 31 Jul 2025).

Device / Engine	Flat GEMM Shape	Kernel Structure	Reported Throughput
NVIDIA A100 (FlashDecoding++)	M=8, K=4096, N=4096	Double buffer (shared memory)	~1.8 TFLOPS (vs ~1 TFLOPS for CUTLASS)
Ascend 910A (H2SGEMM)	m=2816, k=8192, n=8192	Double buffer (L1)	65.3 of 85.3 TFLOPS (77%)

This suggests that effective flat GEMM double-buffering is hardware-agnostic and a prerequisite for efficient LLM inference and high-throughput GEMM on modern AI accelerators.


References

Hong et al. (2023). FlashDecoding++: Faster Large Language Model Inference on GPUs.

Xue et al. (2025). H2SGEMM: Emulating FP32 GEMM on Ascend NPUs using FP16 Units with Precision Recovery and Cache-Aware Optimization.
