Asymmetric Tile Buffering (ATB) in GEMM
- Asymmetric Tile Buffering (ATB) is a tiling strategy for GEMM that decouples input and output tile dimensions to enhance arithmetic intensity and throughput.
- Performance models demonstrate that ATB can achieve up to 4.54× speedup over symmetric tiling, as evidenced by case studies on AMD’s XDNA2 AI Engine.
- ATB optimizes buffer utilization by balancing the trade-offs between higher arithmetic intensity and increased kernel switching overhead.
Asymmetric Tile Buffering (ATB) is a tiling strategy for general matrix multiplication (GEMM) that decouples the number of input ($A$) rows buffered along the $M$ axis from the number of output ($C$) rows accumulated, permitting significant enhancements in arithmetic intensity and overall throughput. Introduced in the context of accelerating AI workloads, ATB provides a methodical means to optimize buffer utilization in manycore and array architectures by exploiting a previously overlooked asymmetry in input/output operand buffering. Systematic performance modeling shows that ATB can offer substantial real-world speedup over conventional symmetric tiling, as established in a detailed case study on AMD’s XDNA2 AI Engine (Wang et al., 20 Nov 2025).
1. Definition and Formulation
In symmetric GEMM tiling, a single set of tile dimensions $(M_t, K_t, N_t)$ governs the size of buffered blocks for $A$, $B$, and $C$:
- $A$-tiles are $M_t \times K_t$,
- $B$-tiles are $K_t \times N_t$,
- $C$-tiles are $M_t \times N_t$.
Asymmetric Tile Buffering introduces four distinct tile parameters:
- $M_A$: rows of $A$ buffered,
- $M_C$: rows of $C$ buffered,
- $K_t$: reduction-dimension tile size,
- $N_t$: columns of $B$/$C$ buffered,
with the constraint $M_A \le M_C$. The asymmetry ratio is defined as
$$r_{\mathrm{ATB}} = \frac{M_C}{M_A},$$
which quantifies how many output rows are accumulated per input row loaded. Figure 1 in (Wang et al., 20 Nov 2025) illustrates the structural difference between symmetric tiling and ATB.
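These definitions can be captured in a small sketch; the `ATBConfig` class and the divisibility form of the constraint ($M_A$ dividing $M_C$) are illustrative assumptions, not constructs from the paper:

```python
from dataclasses import dataclass

@dataclass
class ATBConfig:
    """Hypothetical container for the four ATB tile parameters."""
    M_A: int  # rows of A buffered at a time
    M_C: int  # rows of C accumulated in the output buffer
    K_t: int  # reduction-dimension tile size
    N_t: int  # columns of B / C buffered

    def __post_init__(self):
        # Assumed constraint: buffered A rows evenly cover the output rows.
        assert self.M_C % self.M_A == 0, "M_A must divide M_C"

    @property
    def asymmetry_ratio(self) -> float:
        # r_ATB = M_C / M_A: output rows accumulated per input row loaded
        return self.M_C / self.M_A

cfg = ATBConfig(M_A=8, M_C=32, K_t=64, N_t=64)
print(cfg.asymmetry_ratio)
```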
2. Performance Model and Analytical Framework
The mathematical performance model comprises several core metrics:
Arithmetic Intensity ($I_{\mathrm{ATB}}$):
Let $b_A$, $b_B$, $b_C$ denote the per-element bytes of $A$, $B$, $C$ (e.g., for BF16, $b_A = b_B = b_C = 2$), and let $K$ be the global reduction length. For output-stationary scheduling, each $M_C \times N_t$ output tile performs $2 M_C N_t K$ operations while moving $M_C K$ elements of $A$, $K N_t$ elements of $B$, and $M_C N_t$ elements of $C$:
$$I_{\mathrm{ATB}} = \frac{2 M_C N_t K}{M_C K\, b_A + K N_t\, b_B + M_C N_t\, b_C}.$$
Dividing through by $M_C N_t K$ and dropping the $C$ writeback term (negligible for large $K$) gives:
$$I_{\mathrm{ATB}} \approx \frac{2}{b_A / N_t + b_B / M_C}.$$
This is subject to the L1 buffer constraint:
$$M_A K_t\, b_A + K_t N_t\, b_B + M_C N_t\, b_C \le S_{\mathrm{L1}}.$$
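As a numerical sanity check on the intensity formula and the buffer constraint, a short sketch (function names and BF16 byte-size defaults are assumptions):

```python
def arithmetic_intensity(M_C, N_t, K, b_A=2, b_B=2, b_C=2):
    """Output-stationary arithmetic intensity (ops/byte) for one
    M_C x N_t output tile; BF16 operand sizes assumed by default."""
    ops = 2 * M_C * N_t * K  # each MAC counts as 2 ops
    bytes_moved = M_C * K * b_A + K * N_t * b_B + M_C * N_t * b_C
    return ops / bytes_moved

def fits_in_l1(M_A, M_C, K_t, N_t, b_A=2, b_B=2, b_C=2, l1_bytes=64 * 1024):
    """L1 buffer constraint: A sub-tile + B tile + C accumulator must fit."""
    used = M_A * K_t * b_A + K_t * N_t * b_B + M_C * N_t * b_C
    return used <= l1_bytes

# Growing M_C (while keeping M_A small) raises intensity within the same L1 budget.
print(arithmetic_intensity(8, 64, 4096))   # small, symmetric-like output tile
print(arithmetic_intensity(64, 64, 4096))  # larger asymmetric output tile
```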
Kernel-Switching Overhead:
Because only $M_A$ rows of $A$ are resident at a time, each $M_C \times N_t$ output tile requires $(M_C / M_A)(K / K_t)$ microkernel invocations. With a microkernel switching cost $t_s$ per invocation, the switching time per output tile is
$$T_{\mathrm{switch}} = \frac{M_C}{M_A} \cdot \frac{K}{K_t} \cdot t_s,$$
which grows linearly with the asymmetry ratio.
Combined Throughput:
Let $P_{\mathrm{peak}}$ be the peak per-core MAC rate, $\eta$ the microkernel efficiency, and $B_{\mathrm{mem}}$ the sustained memory bandwidth. The 32-core throughput is bounded by
$$P_{\mathrm{array}} \le \min\bigl(32\,\eta\,P_{\mathrm{peak}},\; I_{\mathrm{ATB}} \cdot B_{\mathrm{mem}}\bigr).$$
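This roofline-style bound is easy to evaluate; in the sketch below the bandwidth value is back-derived from the case-study numbers in Section 5 ($26.6$ TF at $410$ op/B) and is an assumption, not a figure from the paper:

```python
def array_throughput(eta, p_peak_core, intensity, mem_bw, n_cores=32):
    """Roofline-style bound on array throughput (TFLOP/s):
    the lesser of the compute limit and the memory limit."""
    compute_limit = n_cores * eta * p_peak_core  # TFLOP/s
    memory_limit = intensity * mem_bw            # (op/B) * (TB/s) = TFLOP/s
    return min(compute_limit, memory_limit)

# Case-study numbers: 0.95 TF/core effective rate (eta folded in), 410 op/B;
# mem_bw ~ 26.6 TF / 410 op/B ~ 0.065 TB/s (back-derived assumption).
print(array_throughput(eta=1.0, p_peak_core=0.95, intensity=410, mem_bw=0.065))
```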
3. Trade-offs and Design Space
ATB’s principal trade-off is between maximized arithmetic intensity and increased kernel-switching overhead:
- High $r_{\mathrm{ATB}}$, small $M_A$: favors greater arithmetic intensity and memory reuse but exacerbates switching overhead and reduces microkernel efficiency due to shorter steady phases.
- Low $r_{\mathrm{ATB}}$, large $M_A$: improves core efficiency by yielding longer microkernel chains and fewer invocations, but at the expense of attainable arithmetic intensity.
The optimal configuration lies where buffer, compute, and switching costs are jointly minimized. This point is quantitatively determined by jointly satisfying constraints and optimizations in equations (1), (9), (11), and (12) from the analytical model.
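The existence of an interior optimum can be illustrated with a toy sweep. The efficiency model $\eta(M_A) = M_A/(M_A + c)$, the startup cost $c$, the bandwidth, and all tile values below are illustrative assumptions, not parameters from the analytical model:

```python
def sweep_tradeoff(l1_elems=32 * 1024, K_t=32, N_t=64, b=2,
                   p_peak_core=2.3, mem_bw=1.0, startup=16, n_cores=32):
    """Toy AI-vs-efficiency trade-off: for each M_A, spend the remaining
    L1 budget (in elements) on M_C, then take the roofline minimum of
    compute and memory limits. Returns {M_A: modeled TFLOP/s}."""
    results = {}
    for M_A in (8, 16, 32, 64, 128, 256):
        # Element budget left for the C accumulator after A and B tiles.
        M_C = (l1_elems - M_A * K_t - K_t * N_t) // N_t
        intensity = 2 / (b / N_t + b / M_C)  # op/B, large-K limit
        eta = M_A / (M_A + startup)          # shorter steady phase -> lower eta
        compute_limit = n_cores * eta * p_peak_core
        memory_limit = intensity * mem_bw
        results[M_A] = min(compute_limit, memory_limit)
    return results

res = sweep_tradeoff()
best_M_A = max(res, key=res.get)  # neither extreme wins in this toy model
```

Small $M_A$ is compute-limited by poor $\eta$; large $M_A$ steals buffer from $M_C$ and becomes memory-limited, so the modeled optimum sits in between.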
4. Parameter Selection Guidelines
For effective deployment of ATB, the following process is recommended:
- Choose $M_A$: Select an $M_A$ large enough that microkernel efficiency reaches $\eta \approx 0.5$–$0.6$ (see Table 1).
- Increase $r_{\mathrm{ATB}}$: Grow the asymmetry ratio until switching overhead or total buffer capacity becomes the limiting factor.
- Buffer Allocation: Allocate the available buffer to maximize $M_C \times N_t$, thus enlarging the “memory reuse volume” and boosting $I_{\mathrm{ATB}}$.
- Full Array Evaluation: Simulate array-level performance; if memory-bound, consider a higher $r_{\mathrm{ATB}}$; if compute-bound, increase $M_A$ or reduce $r_{\mathrm{ATB}}$.
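The steps above can be condensed into a brute-force search sketch; the candidate ranges, the divisibility assumption $M_C = r \cdot M_A$, and the use of arithmetic intensity as the sole ranking objective (ignoring switching overhead) are simplifying assumptions:

```python
from itertools import product

def best_atb_config(l1_bytes=64 * 1024, K=4096, b_A=2, b_B=2, b_C=2):
    """Enumerate illustrative tile parameters, keep configurations that
    satisfy the L1 buffer constraint, and rank them by output-stationary
    arithmetic intensity. Returns (AI, M_A, r, K_t, N_t) or None."""
    best = None
    for M_A, r, K_t, N_t in product((8, 16, 32, 64), (1, 2, 4, 8),
                                    (32, 64, 128), (16, 32, 64, 128)):
        M_C = M_A * r
        used = M_A * K_t * b_A + K_t * N_t * b_B + M_C * N_t * b_C
        if used > l1_bytes:
            continue  # violates the scratchpad budget
        ai = (2 * M_C * N_t * K) / (M_C * K * b_A + K * N_t * b_B
                                    + M_C * N_t * b_C)
        if best is None or ai > best[0]:
            best = (ai, M_A, r, K_t, N_t)
    return best
```

In this toy search the winning configuration is asymmetric ($r > 1$): shrinking the buffered $A$ sub-tile frees capacity for a larger $M_C \times N_t$ accumulator.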
Table 1. Microkernel/core performance under Config 1 (remaining tile parameters held fixed):

| $M_A$ | $r_{\mathrm{ATB}}$ | Throughput (TF) | $\eta$ |
|---|---|---|---|
| 8 | 1 | 0.36 | 0.156 |
| 8 | 4 | 0.36 | 0.134 |
| 32 | 4 | 0.75 | 0.312 |
| 64 | 4 | 1.16 | 0.511 |
5. Practical Implementation and Architectural Case Study
ATB’s effectiveness was demonstrated on AMD’s XDNA2 AI Engine comprising 32 compute cores (4 × 8), each core with 64 KB L1 and two input/output streams. The studied GEMM used mixed-precision (BFP16/BF16) with the following configuration (Config 1, Table 3):
- Problem Size:
- L1 Tile: ,
- Buffer Used: KB (vs. 91 KB if symmetric)
- Measured Throughput: $0.95$ TFLOPS/core × 32 cores = $30.4$ TFLOPS (compute limit)
- Arithmetic Intensity (array): $410$ op/B (memory limit: $26.6$ TFLOPS)
- Final Throughput: $24.3$ TFLOPS
- Speedup: $4.54\times$ (baseline MLIR-AIE symmetric tiling achieves $4.8$ TFLOPS)
Table 3. Impact of ATB on throughput:

| L1 tile | $r_{\mathrm{ATB}}$ | Throughput (TF) | Speedup |
|---|---|---|---|
| (symmetric) | 1 | 4.8 | 1.00× |
| | 1 | 17.3 | 3.61× |
| | 4 | 24.3 | 4.54× |
Table 3 and additional configurations reported in (Wang et al., 20 Nov 2025) confirm 2–3× throughput gains across other precisions. ATB enlarges the feasible memory reuse volume, often doubling or tripling arithmetic intensity within fixed scratchpad resources.
6. Practical Considerations and Implementation
ATB is a minor extension to standard tiling loops for GEMM: only $M_A$ rows of $A$ are buffered at a time while accumulation into $M_C$ rows of $C$ proceeds. The buffer capacity freed by staging $A$ in smaller sub-tiles is devoted to output accumulation, enabling a larger $M_C$, and hence higher arithmetic intensity, without surpassing scratchpad constraints. ATB is particularly beneficial under tight buffer budgets or on architectures that can tolerate moderate kernel-switching overheads.
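Such a loop nest can be sketched in NumPy. Only the loop structure is illustrated: array slices stand in for the hardware L1 buffers, `a @ b` stands in for the microkernel, and the default tile sizes are illustrative:

```python
import numpy as np

def gemm_atb(A, B, M_A=8, M_C=32, K_t=64, N_t=64):
    """ATB-style GEMM loop nest: C is accumulated in M_C x N_t output
    tiles (output-stationary) while A is staged in smaller M_A-row
    sub-tiles. Functionally equivalent to A @ B."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for mc in range(0, M, M_C):        # output tile rows (stationary)
        for nt in range(0, N, N_t):    # output tile columns
            acc = np.zeros((min(M_C, M - mc), min(N_t, N - nt)), A.dtype)
            for kt in range(0, K, K_t):                 # reduction tiles
                b = B[kt: kt + K_t, nt: nt + N_t]       # B tile stays resident
                for ma in range(0, acc.shape[0], M_A):  # small A sub-tiles
                    a = A[mc + ma: mc + min(ma + M_A, acc.shape[0]),
                          kt: kt + K_t]
                    acc[ma: ma + a.shape[0]] += a @ b   # "microkernel"
            C[mc: mc + acc.shape[0], nt: nt + acc.shape[1]] = acc
    return C
```

Note that the $B$ tile is reused across every $A$ sub-tile within the output block, which is exactly where the extra arithmetic intensity comes from.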
A plausible implication is that architectural features such as hardware support for fast context switching and flexible buffer management can further amplify the benefits of asymmetric tile buffering in practice. However, performance gains are contingent on careful tuning of buffer, reduction, and output tile parameters as prescribed by the performance model (Wang et al., 20 Nov 2025).