
Asymmetric Tile Buffering (ATB) in GEMM

Updated 27 November 2025
  • Asymmetric Tile Buffering (ATB) is a tiling strategy for GEMM that decouples input and output tile dimensions to enhance arithmetic intensity and throughput.
  • Performance models demonstrate that ATB can achieve up to 4.54× speedup over symmetric tiling, as evidenced by a case study on AMD’s XDNA2 AI Engine.
  • ATB optimizes buffer utilization by balancing the trade-offs between higher arithmetic intensity and increased kernel switching overhead.

Asymmetric Tile Buffering (ATB) is a tiling strategy for general matrix multiplication (GEMM) that decouples the dimension of buffered input $A$ tiles along the $M$ axis from the dimension of output $C$ tiles, permitting significant enhancements in arithmetic intensity and overall throughput. Introduced in the context of accelerating AI workloads, ATB provides a methodical means to optimize buffer utilization in manycore and array architectures by exploiting a previously overlooked asymmetry in input/output operand buffering. Systematic performance modeling reveals that ATB can offer substantial real-world speedup over conventional symmetric tiling, as established in a detailed case study on AMD’s XDNA2 AI Engine (Wang et al., 20 Nov 2025).

1. Definition and Formulation

In symmetric GEMM tiling, a single set of tile dimensions $(T_M, T_K, T_N)$ governs the size of buffered blocks for $A$, $B$, and $C$:

  • $A$-tiles are $T_M \times T_K$,
  • $B$-tiles are $T_K \times T_N$,
  • $C$-tiles are $T_M \times T_N$.

Asymmetric Tile Buffering introduces four distinct tile parameters:

  • $T_{M_A}$: rows of $A$ buffered,
  • $T_{M_C}$: rows of $C$ buffered,
  • $T_K$: reduction-dimension tile size,
  • $T_N$: columns of $B$/$C$ buffered,

with the constraint $T_{M_C} \ge T_{M_A}$. The asymmetry ratio is defined as

$$\rho = \frac{T_{M_C}}{T_{M_A}} \ge 1,$$

which quantifies how many output rows are accumulated per input row loaded. Figure 1 in (Wang et al., 20 Nov 2025) illustrates the structural difference between symmetric tiling and ATB.

2. Performance Model and Analytical Framework

The mathematical performance model comprises several core metrics:

Arithmetic Intensity ($\mathrm{AI}_\rho$):

Let $a$, $b$, $c$ denote per-element bytes of $A$, $B$, $C$ (e.g., for BF16, $a=b=c=2$), and $K$ the global reduction length. For output-stationary scheduling, with $N_{\rm iter} = K/T_K$ reduction iterations per output tile:

$$|\text{Ops}| = 2\,K\,T_{M_C}\,T_N$$

$$|\text{Bytes}| = a\,(T_{M_A}T_K)\,N_{\rm iter} + b\,(T_K T_N)\,N_{\rm iter} + c\,(T_{M_C}T_N)$$

Substituting $T_{M_A} = T_{M_C}/\rho$ and simplifying gives:

$$\mathrm{AI}_\rho = \frac{2\,K\,T_{M_C}\,T_N}{a\,\tfrac{T_{M_C}}{\rho}\,K + b\,K\,T_N + c\,T_{M_C}\,T_N}$$

This is subject to the L1 buffer constraint (double-buffered $A$ and $B$ tiles plus a single $C$ tile within scratchpad capacity $S$):

$$\frac{2a}{\rho}\,T_{M_C}T_K + 2b\,T_K T_N + c\,T_{M_C}T_N \le S \tag{1}$$
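The two expressions above can be transcribed directly into a few lines of code. The sketch below is a model check, not a kernel; it assumes the BF16 case ($a=b=c=2$) as its defaults.

```python
def arithmetic_intensity(K, T_MC, T_N, rho, a=2, b=2, c=2):
    """AI_rho from the model: ops / bytes after substituting T_MA = T_MC / rho."""
    ops = 2 * K * T_MC * T_N
    bytes_moved = a * (T_MC / rho) * K + b * K * T_N + c * T_MC * T_N
    return ops / bytes_moved

def fits_l1(T_MC, T_K, T_N, rho, S, a=2, b=2, c=2):
    """Constraint (1): double-buffered A and B tiles plus one C tile within S bytes."""
    footprint = (2 * a / rho) * T_MC * T_K + 2 * b * T_K * T_N + c * T_MC * T_N
    return footprint <= S
```

Raising $\rho$ shrinks only the $A$-traffic term in the denominator, so $\mathrm{AI}_\rho$ increases monotonically with $\rho$ at fixed $T_{M_C}$, $T_N$.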

Kernel-Switching Overhead:

With a microkernel switching cost $\delta$,

$$T_{\rm switch} = \delta\,\rho\,\frac{K}{T_K}$$

Combined Throughput:

Let $\mathrm{Perf}^{\rm peak}_{\rm core}$ be the peak per-core MAC rate and $\mathrm{Eff}_{\rm micro}$ the microkernel efficiency.

$$T_{\rm asym} = \frac{2\,T_{M_C}\,K\,T_N}{\mathrm{Perf}^{\rm peak}_{\rm core}\,\mathrm{Eff}_{\rm micro}} + \delta\,\rho\,\frac{K}{T_K} \tag{9}$$

$$\mathrm{Eff}_{\rm core} = \frac{1}{\dfrac{1}{\mathrm{Eff}_{\rm micro}} + \dfrac{\delta\,\rho\,\mathrm{Perf}^{\rm peak}_{\rm core}}{2\,T_{M_C}\,T_N\,T_K}} \tag{11}$$

32-core throughput is bounded by

$$\mathrm{Perf}_{\rm array} = \min\!\left(\mathrm{AI}_{\rm array}\cdot BW_{\rm offchip},\ \mathrm{Perf}^{\rm peak}_{\rm core}\,\mathrm{Eff}_{\rm core}\times 32\right) \tag{12}$$

3. Trade-offs and Design Space

ATB’s principal trade-off is between maximized arithmetic intensity and increased kernel switching overhead:

  • High $\rho$, small $T_K$: favors greater arithmetic intensity and memory reuse but exacerbates switching overhead and reduces microkernel efficiency due to shorter steady phases.
  • Low $\rho$, large $T_K$: improves core efficiency by yielding longer microkernel chains and fewer invocations, but at the expense of attainable arithmetic intensity.

The optimal configuration lies where buffer, compute, and switching costs are jointly minimized. This point is quantitatively determined by jointly satisfying constraints and optimizations in equations (1), (9), (11), and (12) from the analytical model.
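The opposing trends are visible directly in the model's two expressions, $\mathrm{AI}_\rho$ and $\mathrm{Eff}_{\rm core}$. The sketch below sweeps $\rho$ with placeholder hardware constants ($\delta$, peak rate — assumptions, not measured values) purely to expose the monotonicity.

```python
def ai_rho(rho, K=2048, T_MC=128, T_N=128, a=2, b=2, c=2):
    # AI_rho: only the A-traffic term in the denominator depends on rho.
    return 2 * K * T_MC * T_N / (a * T_MC / rho * K + b * K * T_N + c * T_MC * T_N)

def eff_core(rho, eff_micro=0.5, delta=1e-7, peak=2e12, T_MC=128, T_N=128, T_K=32):
    # Eq. (11): the switching term grows linearly in rho.
    return 1.0 / (1.0 / eff_micro + delta * rho * peak / (2 * T_MC * T_N * T_K))

rhos = (1, 2, 4, 8)
ais = [ai_rho(r) for r in rhos]     # rises with rho
effs = [eff_core(r) for r in rhos]  # falls with rho
```

Arithmetic intensity rises with $\rho$ while core efficiency falls, so the best $\rho$ sits where the combined roofline of equation (12) peaks.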

4. Parameter Selection Guidelines

For effective deployment of ATB, the following process is recommended:

  1. Choose $T_K$: select a $T_K$ large enough that microkernel efficiency reaches $\mathrm{Eff}_{\rm micro} > 0.4$–$0.6$ (see Table 1).
  2. Increase $\rho$: grow the asymmetry ratio until switching overhead or total buffer capacity becomes the limiting factor.
  3. Buffer allocation: allocate available buffer to maximize $T_{M_C} T_N$, thus enlarging the “memory reuse volume” and boosting $\mathrm{AI}_\rho$.
  4. Full array evaluation: simulate array-level performance; if memory-bound, consider higher $\rho$; if compute-bound, adjust by increasing $T_K$ or reducing $\rho$.

Table 1. Microkernel/core performance under Config 1, $T_{M_A}=128$, $T_N=128$:

| $T_K$ | $\rho$ | $\mathrm{Perf}_{\rm micro}$ (TF) | $\mathrm{Eff}_{\rm core}$ |
|---|---|---|---|
| 8 | 1 | 0.36 | 0.156 |
| 8 | 4 | 0.36 | 0.134 |
| 32 | 4 | 0.75 | 0.312 |
| 64 | 4 | 1.16 | 0.511 |
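The four-step process above can be mechanized as a small grid search over $(T_K, \rho)$ driven by the analytical model. Everything numeric below — the L1 size, BF16 byte widths, switching cost, peak rate, bandwidth, and the stand-in $\mathrm{Eff}_{\rm micro}(T_K)$ curve — is an illustrative assumption, not a measured value.

```python
S = 64 * 1024                        # L1 scratchpad bytes per core (XDNA2-like)
a = b = c = 2                        # BF16 bytes per element (assumption)
K, T_MC, T_N = 2048, 128, 128
PEAK, BW, DELTA = 2e12, 65e9, 1e-7   # placeholder peak rate, bandwidth, switch cost

def eff_micro(T_K):
    # Stand-in for Table 1: longer reduction tiles amortize microkernel ramp-up.
    return min(0.9, 0.1 + 0.012 * T_K)

best = None
for T_K in (8, 16, 32, 64):
    for rho in (1, 2, 4, 8):
        # Steps 2-3: reject configurations that violate buffer constraint (1).
        if (2 * a / rho) * T_MC * T_K + 2 * b * T_K * T_N + c * T_MC * T_N > S:
            continue
        ai = 2 * K * T_MC * T_N / (a * T_MC / rho * K + b * K * T_N + c * T_MC * T_N)
        eff = 1.0 / (1 / eff_micro(T_K) + DELTA * rho * PEAK / (2 * T_MC * T_N * T_K))
        perf = min(ai * BW, PEAK * eff * 32)   # Step 4: Eq. (12) roofline
        if best is None or perf > best[0]:
            best = (perf, T_K, rho)
```

Under these placeholder constants the search settles on an asymmetric configuration ($\rho > 1$): with a $128 \times 128$ output tile, the largest $T_K$ values simply do not fit the 64 KB budget at $\rho = 1$.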

5. Practical Implementation and Architectural Case Study

ATB’s effectiveness was demonstrated on AMD’s XDNA2 AI Engine, comprising 32 compute cores (4 × 8), each with 64 KB of L1 and two input/output streams. The studied GEMM used mixed precision (BFP16/BF16) with the following configuration (Config 1, Table 3):

  • Problem size: $4096 \times 4096 \times 2048$
  • L1 tile: $128 \times 64 \times 128$, $\rho = 4$
  • Buffer used: ≈60 KB (vs. 91 KB if symmetric)
  • Measured throughput: 0.95 TFLOPS/core × 32 cores = 30.4 TFLOPS (compute limit)
  • Arithmetic intensity (array): 410 op/B (memory limit: 26.6 TFLOPS)
  • Final throughput: 24.3 TFLOPS
  • Speedup: ≈4.54× over the baseline MLIR-AIE symmetric tiling (4.8 TFLOPS)

Table 3. Impact of ATB on throughput:

| L1 tile $(M_C \times K \times N)$ | $\rho$ | $\mathrm{Perf}_{\rm array}$ (TF) | Speedup |
|---|---|---|---|
| $64 \times 88 \times 64$ (symmetric) | 1 | 4.8 | 1.00× |
| $64 \times 64 \times 128$ | 1 | 17.3 | 3.61× |
| $128 \times 64 \times 128$ | 4 | 24.3 | 4.54× |

These and additional configurations reported in (Wang et al., 20 Nov 2025) confirm 2–3× throughput gains across other precisions. ATB enlarges the feasible $T_{M_C} T_N$ memory reuse volume, often doubling or tripling arithmetic intensity within fixed scratchpad resources.

6. Practical Considerations and Implementation

ATB is a minor extension to standard tiling loops for GEMM: only $T_{M_A}$ rows of $A$ are buffered while accumulation into $T_{M_C}$ rows of $C$ proceeds. This increased reuse of the $A$ tile directly leverages buffer capacity for enhanced output accumulation, enabling a larger $T_{M_C} T_N$ without surpassing scratchpad constraints. ATB is particularly beneficial under tight buffer budgets or on architectures that can tolerate moderate kernel switching overheads.
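As a concrete, highly simplified illustration, the loop nest below mimics the ATB schedule in numpy: each $B$ tile is loaded once per reduction step and reused across $\rho$ sub-tiles of $A$, so only $T_{M_A}$ rows of $A$ are live at a time while a full $T_{M_C} \times T_N$ block of $C$ accumulates. Tile sizes are toy values, all dimensions are assumed divisible by the tile sizes, and a real kernel would stream tiles through scratchpad buffers rather than slice numpy arrays.

```python
import numpy as np

def atb_gemm(A, B, T_MA=2, T_K=4, T_N=4, rho=2):
    M, K = A.shape
    _, N = B.shape
    T_MC = rho * T_MA                  # output tile is rho times taller than the A buffer
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, T_MC):        # output-stationary over C tiles
        for j in range(0, N, T_N):
            for k in range(0, K, T_K):
                Bt = B[k:k + T_K, j:j + T_N]            # B tile fetched once per k step...
                for r in range(rho):                    # ...reused across rho A sub-tiles
                    ra = i + r * T_MA
                    At = A[ra:ra + T_MA, k:k + T_K]     # only T_MA rows of A "buffered"
                    C[ra:ra + T_MA, j:j + T_N] += At @ Bt   # microkernel invocation
    return C
```

The inner `r` loop is where the asymmetry lives: at $\rho = 1$ it degenerates to ordinary symmetric tiling, and the result is identical to a plain matrix product either way.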

A plausible implication is that architectural features such as hardware support for fast context switching and flexible buffer management can further amplify the benefits of asymmetric tile buffering in practice. However, performance gains are contingent on careful tuning of buffer, reduction, and output tile parameters as prescribed by the performance model (Wang et al., 20 Nov 2025).
