Zen-Attention: A Compiler Framework for Dynamic Attention Folding on AMD NPUs (2508.17593v1)

Published 25 Aug 2025 in cs.DC

Abstract: Transformer-based deep learning models are increasingly deployed on energy- and DRAM-bandwidth-constrained devices such as laptops and gaming consoles, which presents significant challenges in meeting the latency requirements of these models. The industry is turning to neural processing units (NPUs) for superior performance per watt (perf/watt); however, efficiently mapping dynamic attention layers to NPUs remains a challenging task. To optimize perf/watt, AMD XDNA NPUs employ software-managed caches and share system memory with the host. This requires substantial engineering effort to unlock efficient tiling, buffer allocation, and data movement and to extract maximum efficiency from the device. This paper introduces Zen-Attention, a framework that optimizes DRAM bandwidth utilization in the attention layers of models by systematically exploring the complex design space of layer folding, tiling, data movement on the interconnect, and tensor layouts to arrive at an optimal solution. Our evaluation includes comparative analysis of end-to-end model latency and attention-specific latency in each model. We demonstrate how the framework enhances mapping capabilities by varying input dimensions, which require padding and masking in the attention block. For representative transformer models, the Zen-Attention framework achieves up to a 4x improvement in attention-block latency and up to a 32% improvement in end-to-end network latency compared to baseline unfolded approaches.

Summary

  • The paper presents a compiler framework that dynamically folds transformer attention, reducing DRAM roundtrips and latency.
  • It combines hardware-aware graph analysis, tiling, and data transposition to efficiently utilize AMD XDNA NPU resources.
  • Experimental results demonstrate up to 4x reduction in latency and a 32% speedup in end-to-end model performance on transformer models.

Zen-Attention: Compiler-Driven Dynamic Attention Folding for AMD NPUs

Introduction

The paper presents Zen-Attention, a compiler framework designed to optimize the execution of transformer attention mechanisms on AMD XDNA NPUs. The work addresses the acute memory-bound nature of attention layers, which are a primary bottleneck in transformer models deployed on energy- and bandwidth-constrained devices. By leveraging explicit memory management and hardware-aware graph optimizations, Zen-Attention systematically reduces DRAM bandwidth utilization and improves both attention block and end-to-end model latency.

Motivation and Problem Analysis

Transformer attention layers, particularly in multi-head attention (MHA), are characterized by large activation tensors and reduction operations, resulting in low operational intensity and a memory-bound profile. On AMD XDNA NPUs, which utilize software-managed scratchpad memories (L1/L2) and share DRAM (L3) with the host, the challenge is exacerbated by limited memory bandwidth (e.g., 60 GB/s for NPU vs. 130 GB/s for GPU on Ryzen AI 9 HX platforms). Traditional layer-by-layer mapping strategies incur excessive DRAM roundtrips, especially for activation data, and are further hampered by the need for data transformations (e.g., transpose, padding) that are not efficiently handled on NPUs.
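
A back-of-the-envelope calculation makes the memory-bound profile concrete. The sketch below is illustrative only: the sequence length, head dimension, and bf16 element size are assumptions rather than figures from the paper.

```python
# Rough operational intensity of the score computation A = Q @ K^T for one head.
# All dimensions and the bf16 element size are illustrative assumptions.
seq_len, head_dim = 1024, 64
elem_bytes = 2  # bf16

flops = 2 * seq_len * seq_len * head_dim            # multiply-accumulates
dram_bytes = elem_bytes * (2 * seq_len * head_dim   # read Q and K
                           + seq_len * seq_len)     # spill A to DRAM (unfolded)
print(f"{flops / dram_bytes:.0f} FLOPs/byte")       # ~57 FLOPs/byte

# A 50-TOPS NPU fed at 60 GB/s needs on the order of 800 FLOPs/byte to stay
# compute-bound, so streaming intermediates through DRAM leaves it starved;
# folding keeps the score matrix and softmax output on-chip instead.
```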

The paper identifies that existing approaches, such as kernel folding for individual operators, are limited by on-chip memory size and do not fully exploit the spatial reduction and data reuse opportunities available in the NPU architecture. Furthermore, the diversity of attention variants (MHA, MQA, GQA) and the presence of masks and biases complicate kernel design and memory allocation.

Zen-Attention Framework Architecture

Hardware-Aware Graph Optimization

Zen-Attention begins by analyzing the ONNX computation graph to identify attention blocks amenable to folding. The framework attempts to fuse the canonical attention chain:

  1. $A = Q \cdot K^T$
  2. $A = A + B + M$ (optional bias and mask)
  3. $SM_{out} = \mathrm{SoftMax}(A)$
  4. $Z = SM_{out} \cdot V$

The optimizer determines the maximal folding level achievable given L1 memory constraints, classifies attention variants (MHA, GQA, MQA), and batches compatible nodes to maximize hardware utilization. The folding level metric quantifies the depth of fusion, with level 3 indicating full folding of the attention block.
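
For reference, the computation being folded can be written as a plain NumPy function. The sketch below only restates the four steps above and is not the framework's kernel code; in a layer-by-layer mapping each intermediate would round-trip through DRAM, whereas a level-3 fold keeps them in on-chip memory.

```python
import numpy as np

def attention_block(Q, K, V, B=None, M=None):
    """Unfolded reference for the canonical attention chain (steps 1-4).

    Q: (S_q, d), K: (S_k, d), V: (S_k, d_v); B and M broadcast to (S_q, S_k).
    The intermediates A and SM_out are exactly what full folding keeps on-chip.
    """
    A = Q @ K.T                                       # step 1: scores
    if B is not None:
        A = A + B                                     # step 2: optional bias
    if M is not None:
        A = A + M                                     # step 2: optional mask
    A = A - A.max(axis=-1, keepdims=True)             # numerically stable softmax
    SM_out = np.exp(A)
    SM_out /= SM_out.sum(axis=-1, keepdims=True)      # step 3: SoftMax
    return SM_out @ V                                 # step 4: output Z
```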

Tiling and Buffer Allocation

The tiler module computes valid subvolume tilings for Q, K, V, B, and M tensors, constrained by L1 buffer sizes and kernel requirements. The tiling strategy prioritizes maximizing the Q dimension and exploits spatial unrolling across NPU cores. When K and V are pinned to L1, Q is unrolled spatially; otherwise, batch size is temporally iterated with spatial reduction across columns. The tiler returns the optimal folding configuration, balancing memory usage and compute efficiency.
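
The sketch below illustrates the shape of this search under a simplified working-set model. The 64 KB L1 budget matches the evaluation platform, but the tile granularity, bf16 element size, and cost model are assumptions, not the framework's actual heuristics.

```python
def pick_tiles(seq_q, seq_k, head_dim, l1_bytes=64 * 1024,
               elem_bytes=2, step=8):
    """Pick a (q_tile, k_tile) pair whose per-core working set fits in L1.

    Working set: Q tile + K tile + V tile + score tile + output tile.
    A larger Q tile is preferred first, mirroring the tiler's priority;
    the granularity and working-set model are illustrative assumptions.
    """
    def fits(q_tile, k_tile):
        elems = (q_tile * head_dim          # Q tile
                 + 2 * k_tile * head_dim    # K and V tiles
                 + q_tile * k_tile          # score tile A
                 + q_tile * head_dim)       # output tile Z
        return elems * elem_bytes <= l1_bytes

    for q_tile in range(seq_q, 0, -step):           # maximize the Q dimension
        for k_tile in range(seq_k, 0, -step):       # then the K/V dimension
            if fits(q_tile, k_tile):
                return q_tile, k_tile
    return None  # no feasible fold at this granularity; fall back to a lower level
```

Under this toy model, `pick_tiles(1024, 1024, 64)` returns `(232, 8)`: a large Q tile streamed across small, reused K/V tiles.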

Folding-Preserving Transpose

Transpose operations, required for K inputs, are expensive in terms of DRAM access and buffer usage. Zen-Attention introduces a hybrid Folding-Preserving Transpose mechanism: block-level transposes are performed via DMA on L2, while intra-block transposes are handled in L1 using SHUFFLE intrinsics within a specialized MatMul kernel. This approach eliminates the need for a separate transpose kernel and additional L1 buffers, maintaining folding integrity and minimizing DRAM traffic.
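
The decomposition can be illustrated in NumPy: the outer, block-level permutation stands in for the DMA reordering done while staging tiles in L2, and the inner per-block transpose stands in for the SHUFFLE-based handling inside the MatMul kernel. The block size and function name below are assumptions for illustration.

```python
import numpy as np

def folding_preserving_transpose(K, block=8):
    """Split K -> K^T into a block-level and an intra-block transpose.

    Block-level step: performed by DMA descriptors between L2 buffers.
    Intra-block step: absorbed into the MatMul kernel via SHUFFLE intrinsics,
    so no standalone transpose kernel or extra L1 buffer is needed.
    """
    rows, cols = K.shape
    assert rows % block == 0 and cols % block == 0
    tiles = K.reshape(rows // block, block, cols // block, block)
    tiles = tiles.transpose(2, 1, 0, 3)    # 1) swap tile-grid axes (DMA on L2)
    tiles = tiles.transpose(0, 3, 2, 1)    # 2) transpose inside each tile (L1)
    return tiles.reshape(cols, rows)

# Sanity check: the two-stage path matches a direct transpose.
K = np.arange(16 * 24, dtype=np.float32).reshape(16, 24)
assert np.array_equal(folding_preserving_transpose(K), K.T)
```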

Padding Strategy

To satisfy kernel granularity constraints, tensors are padded to the next multiple of the required dimensions. The framework leverages DMA-based padding on L2 channels, and, when infeasible, delegates padding to the preceding layer, obviating the need for explicit Pad operators.
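
A minimal sketch of the round-up rule follows, with assumed per-dimension granularities and an illustrative ViT-style shape.

```python
def padded_shape(shape, granularity):
    """Round each dimension up to the kernel's required multiple."""
    return tuple(-(-dim // g) * g for dim, g in zip(shape, granularity))

# e.g. a (197, 64) per-head Q tensor (ViT-base: 196 patches + CLS token)
# with an assumed (8, 8) kernel granularity pads to (200, 64). Zen-Attention
# realizes such padding via DMA on L2 when possible, otherwise it pushes the
# padding into the producing layer instead of emitting an explicit Pad op.
print(padded_shape((197, 64), (8, 8)))   # -> (200, 64)
```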

Experimental Results

Zen-Attention was evaluated on a Ryzen AI 9 HX 370 platform with a 32-core NPU (4x8 grid, 50 TOPs, 64KB L1, 512KB L2 per tile). Five transformer models, including ViT-base-patch-16, CLIP variants, and BERT, were tested. All models required K-input transposes, and some required input padding.

Key results include:

  • Up to 4x reduction in attention block latency compared to unfolded (layer-by-layer) execution, particularly for models with large sequence and context lengths.
  • Up to 32% reduction in end-to-end model latency (ViT-base-patch-16), directly attributable to reduced DRAM roundtrips in bandwidth-bound scenarios.
  • For compute-bound cases (e.g., small input dimensions), folded attention still yields an ~8% latency reduction and lower DRAM bandwidth utilization.
  • The framework demonstrates robust handling of diverse attention variants and input shapes, with dynamic adaptation to hardware constraints.

Implications and Future Directions

Zen-Attention demonstrates that compiler-driven, hardware-aware folding of attention blocks can substantially mitigate the memory bottlenecks inherent in transformer models on NPUs. The explicit management of data movement, tiling, and operator fusion is essential for extracting maximal performance from architectures with limited memory bandwidth and scratchpad-based hierarchies.

Practically, this enables more efficient deployment of large language and vision models on client devices, reducing energy consumption and improving responsiveness. The approach is extensible to other memory-bound operators and could inform the design of future NPU architectures and compiler toolchains.

Theoretically, the work underscores the importance of co-design between model architecture, compiler optimizations, and hardware capabilities. Future research may explore automated search for optimal folding strategies, integration with dynamic runtime systems, and generalization to other accelerator platforms.

Conclusion

Zen-Attention provides a comprehensive compiler framework for dynamic attention folding on AMD NPUs, achieving significant reductions in both attention block and end-to-end model latency. By systematically optimizing graph structure, tiling, data movement, and operator fusion, the framework addresses the core memory-bound challenges of transformer attention on bandwidth-constrained hardware. The results highlight the necessity of hardware-aware compiler optimizations for efficient on-device AI inference and suggest promising avenues for further research in compiler-accelerator co-design.
