
Asymmetric Tile Buffering (ATB) in GEMM

Updated 27 November 2025
  • Asymmetric Tile Buffering (ATB) is a tiling strategy for GEMM that decouples input and output tile dimensions to enhance arithmetic intensity and throughput.
  • Performance models demonstrate that ATB can achieve up to 4.54× speedup over symmetric tiling, as evidenced by a case study on AMD’s XDNA2 AI Engine.
  • ATB optimizes buffer utilization by balancing the trade-offs between higher arithmetic intensity and increased kernel switching overhead.

Asymmetric Tile Buffering (ATB) is a tiling strategy for general matrix multiplication (GEMM) that decouples the dimension of buffered input $A$ tiles along the $M$ axis from the dimension of output $C$ tiles, permitting significant enhancements in arithmetic intensity and overall throughput. Introduced in the context of accelerating AI workloads, ATB provides a methodical means to optimize buffer utilization in manycore and array architectures by exploiting a previously overlooked asymmetry in input/output operand buffering. Systematic performance modeling reveals that ATB can offer substantial real-world speedup over conventional symmetric tiling, as established in a detailed case study on AMD’s XDNA2 AI Engine (Wang et al., 20 Nov 2025).

1. Definition and Formulation

In symmetric GEMM tiling, a single set of tile dimensions $(T_M, T_K, T_N)$ governs the size of buffered blocks for $A$, $B$, and $C$:

  • $A$-tiles are $T_M \times T_K$,
  • $B$-tiles are $T_K \times T_N$,
  • $C$-tiles are $T_M \times T_N$.

Asymmetric Tile Buffering introduces four distinct tile parameters:

  • $T_{M_A}$: rows of $A$ buffered,
  • $T_{M_C}$: rows of $C$ buffered,
  • $T_K$: reduction-dimension tile size,
  • $T_N$: columns of $B$/$C$ buffered,

with the constraint $T_{M_C} \ge T_{M_A}$. The asymmetry ratio is defined as

$$\rho = \frac{T_{M_C}}{T_{M_A}} \ge 1,$$

which quantifies how many output rows are accumulated per input row loaded. Figure 1 in (Wang et al., 20 Nov 2025) illustrates the structural difference between symmetric tiling and ATB.

2. Performance Model and Analytical Framework

The mathematical performance model comprises several core metrics:

Arithmetic Intensity ($\mathrm{AI}_\rho$):

Let $a$, $b$, $c$ denote per-element bytes of $A$, $B$, $C$ (e.g., for BF16, $a=b=c=2$), and $K$ the global reduction length. For output-stationary scheduling, with $N_{\rm iter} = K/T_K$ reduction iterations per output tile:

$$|\text{Ops}| = 2\,K\,T_{M_C}\,T_N$$

$$|\text{Bytes}| = a\,(T_{M_A}T_K)\,N_{\rm iter} + b\,(T_K T_N)\,N_{\rm iter} + c\,(T_{M_C}T_N)$$

Substituting $T_{M_A} = T_{M_C}/\rho$ and simplifying gives:

$$\mathrm{AI}_\rho = \frac{2\,K\,T_{M_C}\,T_N}{a\,\tfrac{T_{M_C}}{\rho}\,K + b\,K\,T_N + c\,T_{M_C}\,T_N}$$

This is subject to the L1 buffer constraint (double-buffered $A$ and $B$ tiles plus a single $C$ tile within scratchpad capacity $S$):

$$\frac{2a}{\rho}\,T_{M_C}T_K + 2b\,T_K T_N + c\,T_{M_C}T_N \le S \tag{1}$$
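The two expressions above can be transcribed directly into a few lines of code. The sketch below is a model check, not a kernel; it assumes the BF16 case ($a=b=c=2$) as its defaults.

```python
def arithmetic_intensity(K, T_MC, T_N, rho, a=2, b=2, c=2):
    """AI_rho from the model: ops / bytes after substituting T_MA = T_MC / rho."""
    ops = 2 * K * T_MC * T_N
    bytes_moved = a * (T_MC / rho) * K + b * K * T_N + c * T_MC * T_N
    return ops / bytes_moved

def fits_l1(T_MC, T_K, T_N, rho, S, a=2, b=2, c=2):
    """Constraint (1): double-buffered A and B tiles plus one C tile within S bytes."""
    footprint = (2 * a / rho) * T_MC * T_K + 2 * b * T_K * T_N + c * T_MC * T_N
    return footprint <= S
```

Raising $\rho$ shrinks only the $A$-traffic term in the denominator, so $\mathrm{AI}_\rho$ increases monotonically with $\rho$ at fixed $T_{M_C}$, $T_N$.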

Kernel-Switching Overhead:

With a microkernel switching cost $\delta$,

$$T_{\rm switch} = \delta\,\rho\,\frac{K}{T_K}$$

Combined Throughput:

Let $\mathrm{Perf}^{\rm peak}_{\rm core}$ be the peak per-core MAC rate and $\mathrm{Eff}_{\rm micro}$ the microkernel efficiency.

$$T_{\rm asym} = \frac{2\,T_{M_C}\,K\,T_N}{\mathrm{Perf}^{\rm peak}_{\rm core}\,\mathrm{Eff}_{\rm micro}} + \delta\,\rho\,\frac{K}{T_K} \tag{9}$$

$$\mathrm{Eff}_{\rm core} = \frac{1}{\dfrac{1}{\mathrm{Eff}_{\rm micro}} + \dfrac{\delta\,\rho\,\mathrm{Perf}^{\rm peak}_{\rm core}}{2\,T_{M_C}\,T_N\,T_K}} \tag{11}$$

32-core throughput is bounded by

$$\mathrm{Perf}_{\rm array} = \min\!\left(\mathrm{AI}_{\rm array}\cdot BW_{\rm offchip},\ \mathrm{Perf}^{\rm peak}_{\rm core}\,\mathrm{Eff}_{\rm core}\times 32\right) \tag{12}$$

3. Trade-offs and Design Space

ATB’s principal trade-off is between maximized arithmetic intensity and increased kernel switching overhead:

  • High $\rho$, small $T_K$: favors greater arithmetic intensity and memory reuse but exacerbates switching overhead and reduces microkernel efficiency due to shorter steady phases.
  • Low $\rho$, large $T_K$: improves core efficiency by yielding longer microkernel chains and fewer invocations, but at the expense of attainable arithmetic intensity.

The optimal configuration lies where buffer, compute, and switching costs are jointly minimized. This point is quantitatively determined by jointly satisfying constraints and optimizations in equations (1), (9), (11), and (12) from the analytical model.
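The opposing trends are visible directly in the model's two expressions, $\mathrm{AI}_\rho$ and $\mathrm{Eff}_{\rm core}$. The sketch below sweeps $\rho$ with placeholder hardware constants ($\delta$, peak rate — assumptions, not measured values) purely to expose the monotonicity.

```python
def ai_rho(rho, K=2048, T_MC=128, T_N=128, a=2, b=2, c=2):
    # AI_rho: only the A-traffic term in the denominator depends on rho.
    return 2 * K * T_MC * T_N / (a * T_MC / rho * K + b * K * T_N + c * T_MC * T_N)

def eff_core(rho, eff_micro=0.5, delta=1e-7, peak=2e12, T_MC=128, T_N=128, T_K=32):
    # Eq. (11): the switching term grows linearly in rho.
    return 1.0 / (1.0 / eff_micro + delta * rho * peak / (2 * T_MC * T_N * T_K))

rhos = (1, 2, 4, 8)
ais = [ai_rho(r) for r in rhos]     # rises with rho
effs = [eff_core(r) for r in rhos]  # falls with rho
```

Arithmetic intensity rises with $\rho$ while core efficiency falls, so the best $\rho$ sits where the combined roofline of equation (12) peaks.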

4. Parameter Selection Guidelines

For effective deployment of ATB, the following process is recommended:

  1. Choose $T_K$: select a $T_K$ large enough that microkernel efficiency reaches $\mathrm{Eff}_{\rm micro} > 0.4$–$0.6$ (see Table 1).
  2. Increase $\rho$: grow the asymmetry ratio until switching overhead or total buffer capacity becomes the limiting factor.
  3. Buffer allocation: allocate available buffer to maximize $T_{M_C} T_N$, thus enlarging the “memory reuse volume” and boosting $\mathrm{AI}_\rho$.
  4. Full array evaluation: simulate array-level performance; if memory-bound, consider higher $\rho$; if compute-bound, adjust by increasing $T_K$ or reducing $\rho$.

Table 1. Microkernel/core performance under Config 1, $T_{M_A}=128$, $T_N=128$:

| $T_K$ | $\rho$ | $\mathrm{Perf}_{\rm micro}$ (TF) | $\mathrm{Eff}_{\rm core}$ |
|---|---|---|---|
| 8 | 1 | 0.36 | 0.156 |
| 8 | 4 | 0.36 | 0.134 |
| 32 | 4 | 0.75 | 0.312 |
| 64 | 4 | 1.16 | 0.511 |
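The four-step process above can be mechanized as a small grid search over $(T_K, \rho)$ driven by the analytical model. Everything numeric below — the L1 size, BF16 byte widths, switching cost, peak rate, bandwidth, and the stand-in $\mathrm{Eff}_{\rm micro}(T_K)$ curve — is an illustrative assumption, not a measured value.

```python
S = 64 * 1024                        # L1 scratchpad bytes per core (XDNA2-like)
a = b = c = 2                        # BF16 bytes per element (assumption)
K, T_MC, T_N = 2048, 128, 128
PEAK, BW, DELTA = 2e12, 65e9, 1e-7   # placeholder peak rate, bandwidth, switch cost

def eff_micro(T_K):
    # Stand-in for Table 1: longer reduction tiles amortize microkernel ramp-up.
    return min(0.9, 0.1 + 0.012 * T_K)

best = None
for T_K in (8, 16, 32, 64):
    for rho in (1, 2, 4, 8):
        # Steps 2-3: reject configurations that violate buffer constraint (1).
        if (2 * a / rho) * T_MC * T_K + 2 * b * T_K * T_N + c * T_MC * T_N > S:
            continue
        ai = 2 * K * T_MC * T_N / (a * T_MC / rho * K + b * K * T_N + c * T_MC * T_N)
        eff = 1.0 / (1 / eff_micro(T_K) + DELTA * rho * PEAK / (2 * T_MC * T_N * T_K))
        perf = min(ai * BW, PEAK * eff * 32)   # Step 4: Eq. (12) roofline
        if best is None or perf > best[0]:
            best = (perf, T_K, rho)
```

Under these placeholder constants the search settles on an asymmetric configuration ($\rho > 1$): with a $128 \times 128$ output tile, the largest $T_K$ values simply do not fit the 64 KB budget at $\rho = 1$.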

5. Practical Implementation and Architectural Case Study

ATB’s effectiveness was demonstrated on AMD’s XDNA2 AI Engine, comprising 32 compute cores (4 × 8), each with 64 KB of L1 and two input/output streams. The studied GEMM used mixed precision (BFP16/BF16) with the following configuration (Config 1, Table 3):

  • Problem size: $4096 \times 4096 \times 2048$
  • L1 tile: $128 \times 64 \times 128$, $\rho = 4$
  • Buffer used: ≈60 KB (vs. 91 KB if symmetric)
  • Measured throughput: 0.95 TFLOPS/core × 32 cores = 30.4 TFLOPS (compute limit)
  • Arithmetic intensity (array): 410 op/B (memory limit: 26.6 TFLOPS)
  • Final throughput: 24.3 TFLOPS
  • Speedup: ≈4.54× over the baseline MLIR-AIE symmetric tiling (4.8 TFLOPS)

Table 3. Impact of ATB on throughput:

| L1 tile $(M_C \times K \times N)$ | $\rho$ | $\mathrm{Perf}_{\rm array}$ (TF) | Speedup |
|---|---|---|---|
| $64 \times 88 \times 64$ (symmetric) | 1 | 4.8 | 1.00× |
| $64 \times 64 \times 128$ | 1 | 17.3 | 3.61× |
| $128 \times 64 \times 128$ | 4 | 24.3 | 4.54× |

These and additional configurations reported in (Wang et al., 20 Nov 2025) confirm 2–3× throughput gains across other precisions. ATB enlarges the feasible $T_{M_C} T_N$ memory reuse volume, often doubling or tripling arithmetic intensity within fixed scratchpad resources.

6. Practical Considerations and Implementation

ATB is a minor extension to standard tiling loops for GEMM: only $T_{M_A}$ rows of $A$ are buffered while accumulation into $T_{M_C}$ rows of $C$ proceeds. This increased reuse of the $A$ tile directly leverages buffer capacity for enhanced output accumulation, enabling a larger $T_{M_C} T_N$ without surpassing scratchpad constraints. ATB is particularly beneficial under tight buffer budgets or on architectures that can tolerate moderate kernel switching overheads.
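As a concrete, highly simplified illustration, the loop nest below mimics the ATB schedule in numpy: each $B$ tile is loaded once per reduction step and reused across $\rho$ sub-tiles of $A$, so only $T_{M_A}$ rows of $A$ are live at a time while a full $T_{M_C} \times T_N$ block of $C$ accumulates. Tile sizes are toy values, all dimensions are assumed divisible by the tile sizes, and a real kernel would stream tiles through scratchpad buffers rather than slice numpy arrays.

```python
import numpy as np

def atb_gemm(A, B, T_MA=2, T_K=4, T_N=4, rho=2):
    M, K = A.shape
    _, N = B.shape
    T_MC = rho * T_MA                  # output tile is rho times taller than the A buffer
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, T_MC):        # output-stationary over C tiles
        for j in range(0, N, T_N):
            for k in range(0, K, T_K):
                Bt = B[k:k + T_K, j:j + T_N]            # B tile fetched once per k step...
                for r in range(rho):                    # ...reused across rho A sub-tiles
                    ra = i + r * T_MA
                    At = A[ra:ra + T_MA, k:k + T_K]     # only T_MA rows of A "buffered"
                    C[ra:ra + T_MA, j:j + T_N] += At @ Bt   # microkernel invocation
    return C
```

The inner `r` loop is where the asymmetry lives: at $\rho = 1$ it degenerates to ordinary symmetric tiling, and the result is identical to a plain matrix product either way.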

A plausible implication is that architectural features such as hardware support for fast context switching and flexible buffer management can further amplify the benefits of asymmetric tile buffering in practice. However, performance gains are contingent on careful tuning of buffer, reduction, and output tile parameters as prescribed by the performance model (Wang et al., 20 Nov 2025).
