Co-design of tile-based accelerator templates and dataflow for efficient LLM mapping

Establish a co-design methodology that couples tile-based accelerator template design with dataflow selection to efficiently map large language model workloads on tile-based many-PE architectures.

Background

The paper highlights that emerging many-PE, tile-based accelerators rely on on-chip interconnects and collective communication to reduce HBM traffic, making dataflow management crucial for performance. Mapping LLM workloads onto these systems requires balancing matrix engine utilization and off-chip memory traffic through careful dataflow choices.

The authors emphasize that designing an accelerator template in tandem with dataflow selection is a key unresolved challenge: architectural choices (e.g., memory hierarchy, NoC capabilities, and collective primitives) and the chosen dataflow jointly determine whether the system can fully exploit on-chip reuse and achieve high utilization.
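
To make the trade-off between on-chip reuse and HBM traffic concrete, the following is a minimal sketch of a first-order traffic model for a single tiled GEMM. It is not the paper's model. It assumes an output-stationary dataflow in which each output tile stays on chip while the matching panels of A and B are streamed from HBM once per tile, and it ignores collectives, NoC bandwidth, and on-chip capacity limits. The function names and the 2-byte element size are illustrative assumptions.

```python
from math import ceil


def gemm_hbm_traffic_bytes(M, N, K, tile_m, tile_n, bytes_per_elem=2):
    """First-order HBM traffic for C[M,N] = A[M,K] @ B[K,N].

    Assumption (hypothetical model): each (tile_m x tile_n) output tile is
    held on chip; its A row-panel and B column-panel are read from HBM once.
    """
    tiles_m = ceil(M / tile_m)
    tiles_n = ceil(N / tile_n)
    a_reads = tiles_n * M * K  # every A row-panel is re-read per column of tiles
    b_reads = tiles_m * K * N  # every B column-panel is re-read per row of tiles
    c_writes = M * N           # each output element is written once
    return (a_reads + b_reads + c_writes) * bytes_per_elem


def arithmetic_intensity(M, N, K, tile_m, tile_n, bytes_per_elem=2):
    """FLOPs per HBM byte under the traffic model above."""
    flops = 2 * M * N * K  # one multiply and one add per MAC
    return flops / gemm_hbm_traffic_bytes(M, N, K, tile_m, tile_n, bytes_per_elem)


if __name__ == "__main__":
    for tile in (128, 512):
        ai = arithmetic_intensity(4096, 4096, 4096, tile, tile)
        print(f"tile={tile}: {ai:.1f} FLOPs/byte")
```

Under these assumptions, growing the output tile from 128 to 512 on a 4096-cubed GEMM raises arithmetic intensity from roughly 63 to roughly 241 FLOPs per byte. Whether a tile that large can actually be held on chip, or spread across a group of tiles, depends on the memory hierarchy and interconnect, which is why the template and the dataflow have to be chosen together.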

References

Furthermore, co-designing a tile-based accelerator template that can efficiently map LLM workloads remains an open architectural problem which is tightly coupled with dataflow selection.

— FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Large Attention-Based Model Inference on Tile-Based Accelerators (2604.02110 - Zhang et al., 2 Apr 2026) in Section 1, Introduction
