CGRAs: Programmable Accelerator Architectures
- CGRAs are spatially programmable accelerator architectures featuring a 2D grid of processing elements interconnected by configurable routing fabrics, bridging the gap between ASICs and FPGAs.
- They employ advanced mapping methodologies, including ILP and SAT-based scheduling, to optimize dataflow and resource constraints in both regular and irregular workloads.
- Innovations such as motif-based mapping, dynamic cache reconfiguration, and ultra-low-power scheduling enable significant performance gains and energy efficiency improvements in CGRA deployments.
Coarse-Grained Reconfigurable Arrays (CGRAs) are spatially programmable accelerator architectures characterized by an array of word-sized processing elements interconnected via a configurable network. CGRAs occupy a design space between ASICs and FPGAs, offering higher-level programmability than fine-grained logic-based fabrics and much lower development costs than fully custom silicon. Their architectures, programming models, mapping toolchains, and application domains have evolved rapidly, resulting in a diverse landscape of architectural specializations and sophisticated software infrastructure.
1. Architectural Principles and Key Components
CGRAs consist of a two-dimensional grid of processing elements (PEs) capable of executing arithmetic or logic operations at word granularity. The grid is interconnected by a spatially programmable routing fabric, often with hardwired crossbars, configurable multiplexers, or specialized code-driven routers between neighboring or globally distributed nodes (Kulkarni et al., 2017, Walker et al., 2019, Walter et al., 26 Feb 2025).
Each PE typically contains:
- A functional unit for ALU-like operations (add, subtract, multiply, compare, etc.).
- Small local register files or pipeline registers for buffering operands and results across cycles.
- Configuration memory or state registers for defining the operation of the PE on a per-cycle or per-context basis.
The overall network may also include border-accessible scratchpad memory (SPM), memory operation blocks (MOBs) for optimized data movement, or even tightly integrated SRAM or external DRAM interfaces, especially when targeting larger, data-intensive workloads (Prasad, 17 Jul 2025, Liu et al., 13 Aug 2025).
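As a rough illustration of this organization, the following Python sketch models a PE and a 2D array at a very abstract level; all field names and sizes are assumptions introduced for exposition, not taken from any of the cited architectures.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PEContext:
    """One configuration context: what a PE does in a given cycle/slot."""
    opcode: str   # word-granularity ALU operation, e.g. "add", "mul", "cmp"
    src_a: str    # operand source: local register, neighbor port, or constant
    src_b: str
    dst: str      # destination register or output port

@dataclass
class ProcessingElement:
    """Minimal PE model: functional unit state, local registers, config contexts."""
    registers: List[int] = field(default_factory=lambda: [0] * 4)  # small local register file
    contexts: List[PEContext] = field(default_factory=list)        # per-cycle configuration memory

@dataclass
class CGRAFabric:
    """A 2D grid of PEs; the routing fabric itself is abstracted away here."""
    rows: int
    cols: int
    def __post_init__(self):
        self.grid = [[ProcessingElement() for _ in range(self.cols)]
                     for _ in range(self.rows)]

fabric = CGRAFabric(rows=4, cols=4)   # a 4x4 array of identical PEs
```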
Distinctive features compared to FPGAs include coarse, word-sized configurability, much faster compilation flows, and reduced reconfiguration overhead. Compared to ASICs, CGRAs retain post-fabrication programmability and flexibility through word-level reconfiguration (Kulkarni et al., 2017, Walker et al., 2019).
2. Programming, Compilation, and Mapping Methodologies
CGRAs are typically programmed at a high level of abstraction, via C/C++ source code with kernel annotations, dataflow graph (DFG) extraction, or domain-specific languages. The dominant mapping strategies fall into two classes:
- Operation-centric mapping: The application kernel's DFG is extracted, and each node (operation) is mapped individually to a PE. Dependencies are statically routed through the interconnect. Scheduling respects operation latency, resource availability, and routing delays using constraints such as $t_j \geq t_i + \lambda_i + d_{i,j}$, where $t_i$ is the schedule time of operation $i$, $\lambda_i$ its latency, and $d_{i,j}$ the pipeline delay between dependent nodes $i$ and $j$ (Walter et al., 17 Feb 2025). A minimal schedule-validity check along these lines is sketched after this list.
- Iteration-centric mapping: The nested loop’s iteration space is partitioned into tiles, and each tile (comprising many iterations) is assigned to a PE. The PE executes the kernel locally, exploiting spatial and temporal locality (Walter et al., 17 Feb 2025). This style is prevalent in Tightly-Coupled Processor Arrays (TCPAs), which architecturally diverge from traditional CGRAs in their support of local memories, multi-FU PEs, and orthogonal instruction processing.
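The operation-centric constraint above can be read as a simple predicate over a candidate schedule. The following hedged sketch checks it for a DFG given as a list of dependence edges; the tuple layout and names are illustrative placeholders, not the representation of any particular mapper.

```python
def schedule_is_valid(schedule, edges):
    """Check t_j >= t_i + lambda_i + d_ij for every dependence edge (i -> j).

    schedule: dict mapping operation id -> scheduled time step t_i
    edges:    iterable of (i, j, latency_i, delay_ij) tuples, where latency_i
              is the latency of operation i and delay_ij the pipeline/routing
              delay between i and j.
    """
    return all(schedule[j] >= schedule[i] + lat_i + d_ij
               for (i, j, lat_i, d_ij) in edges)

# Operation 0 (latency 1) feeds operation 1 through one pipeline register (delay 1),
# so scheduling op 1 at time step 2 satisfies the constraint.
assert schedule_is_valid({0: 0, 1: 2}, [(0, 1, 1, 1)])
```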
Mapping is typically carried out using:
- Integer Linear Programming (ILP) formulations, which encode resource, routing, and schedule constraints precisely (Walker et al., 2019, Walter et al., 26 Feb 2025); a toy placement formulation in this spirit is sketched after this list.
- SAT-based exact scheduling (with modulo scheduling and kernel mobility schedules) for tightly resource-constrained or timing-critical deployments (Tirelli et al., 20 Feb 2024).
- Hierarchical or motif-based mapping, where application DFGs are decomposed into recurring subgraphs (motifs), enabling more efficient mapping and routing (Li et al., 11 Dec 2024).
- Compiler transformations for control/data flow fusion and liveness analysis to address arbitrary control-flow applications (e.g., with branches and loops) (Wang et al., 4 Aug 2025).
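As a toy illustration of the ILP style (not the formulation used by CGRA-ME or any of the cited works), the following sketch places a three-node DFG onto a 2x2 grid while minimizing Manhattan routing distance, using the open-source PuLP modeling library. The instance, variable names, and cost model are assumptions for exposition.

```python
from itertools import product
from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum

# Hypothetical instance: three operations, two dependence edges, a 2x2 PE grid.
ops, edges = [0, 1, 2], [(0, 2), (1, 2)]
pes = list(product(range(2), range(2)))                   # PE coordinates (row, col)
dist = lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])   # Manhattan routing cost

prob = LpProblem("toy_cgra_placement", LpMinimize)
x = {(o, p): LpVariable(f"x_{o}_{p[0]}{p[1]}", cat=LpBinary) for o in ops for p in pes}
y = {(e, p, q): LpVariable(f"y_{e}_{p[0]}{p[1]}_{q[0]}{q[1]}", cat=LpBinary)
     for e in range(len(edges)) for p in pes for q in pes}

prob += lpSum(dist(p, q) * y[e, p, q] for (e, p, q) in y)  # minimize total routing distance
for o in ops:                                              # each operation placed on exactly one PE
    prob += lpSum(x[o, p] for p in pes) == 1
for p in pes:                                              # at most one operation per PE (purely spatial toy)
    prob += lpSum(x[o, p] for o in ops) <= 1
for e, (src, dst) in enumerate(edges):                     # tie edge-routing variables to placements
    for p in pes:
        prob += lpSum(y[e, p, q] for q in pes) == x[src, p]
        prob += lpSum(y[e, q, p] for q in pes) == x[dst, p]

prob.solve()                                               # bundled CBC solver by default
print({o: p for (o, p), v in x.items() if v.value() == 1})
```

The linearized `y` variables stand in for routing cost; real formulations additionally model time steps, track capacity, and register usage.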
Compiler toolchains such as Morpher (Wijerathne et al., 2023), CGRA-ME (Walker et al., 2019, Walter et al., 26 Feb 2025), CGRA-Flow, and Pillars (Walter et al., 26 Feb 2025) provide DFG extraction, schedule assignment, placement, and routing. Some employ custom dialects in MLIR for IR representations spanning multiple abstraction levels (Wang et al., 4 Aug 2025).
3. Interconnect and Memory Subsystem Design
The interconnect fabric in CGRA architectures is a pivotal factor for both flexibility and efficiency. Several models exist:
- Static interconnects: Hardwired connections with dedicated multiplexers, optimized using topology-specific tools (e.g., Canal (Melchert et al., 2022)) to minimize area and routing overhead.
- Hybrid interconnects: Incorporate ready–valid handshaking and dynamic control, supporting more runtime pipeline control at the expense of some area and design complexity (Melchert et al., 2022).
Design space parameters include switch box topology (Wilton, Disjoint, etc.), the number of routing tracks, inter-tile connections, and the degree of buffer/FIFO deployment for dynamic pipelining. Frameworks like Canal are used for interconnect IR definition and rapid design space exploration.
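A minimal way to represent such a design space for exploration is sketched below; the parameter names and value ranges are assumptions and do not reflect Canal's actual interconnect IR.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class InterconnectConfig:
    """One point in the interconnect design space (fields are illustrative)."""
    switch_box: str       # e.g. "wilton" or "disjoint" topology
    routing_tracks: int   # parallel routing tracks per channel
    fifo_depth: int       # 0 = fully static; >0 enables ready-valid dynamic pipelining

# Enumerate a small design space; a real flow would run place-and-route per point
# and collect area/timing/power rather than just listing the configurations.
design_space = [InterconnectConfig(sb, tracks, depth)
                for sb, tracks, depth in product(("wilton", "disjoint"), (2, 4, 8), (0, 2))]
print(len(design_space), "candidate interconnect configurations")
```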
Memory architectures have evolved from SPM-only models to hybrid hierarchies that incorporate low-latency SPM, L1/L2 caches, and dynamic, per-PE cache reconfiguration (Liu et al., 13 Aug 2025). Irregular workloads, such as graph analytics, expose the limitation of SPM-centric models. Recent designs employ:
- Runahead execution: Proactive prefetching and speculative computation during cache misses, with state backup and re-execution on miss resolution, leading to up to 6.91x speedups for irregular memory access patterns (Liu et al., 13 Aug 2025).
- Dynamic cache reconfiguration: Per-L1 cache miss monitoring and reallocation of cache capacity/associativity and line size via a log-maximization linear program of the form $\max \sum_i \log h_i(a_i)$, where $h_i(a_i)$ is the maximal hit rate PE $i$ achieves for allocation $a_i$ (Liu et al., 13 Aug 2025). A brute-force sketch of this allocation problem is given below.
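The following sketch brute-forces a tiny instance of the log-maximization objective above; the capacity-budget constraint and the hit-rate curves are assumptions introduced for illustration, not the constraints of the cited design.

```python
from itertools import product
from math import log

def allocate_cache(hit_rate, n_pes, total_ways):
    """Brute-force the log-maximization objective on a tiny instance.

    hit_rate(i, ways): the maximal hit rate h_i(a_i) that PE i's L1 achieves
    when given `ways` ways; the shared capacity budget (sum of allocations)
    is an assumed constraint for this sketch.
    """
    best, best_alloc = float("-inf"), None
    for alloc in product(range(1, total_ways + 1), repeat=n_pes):
        if sum(alloc) > total_ways:
            continue
        score = sum(log(hit_rate(i, a)) for i, a in enumerate(alloc))
        if score > best:
            best, best_alloc = score, alloc
    return best_alloc

# Hypothetical diminishing-returns hit-rate curves for two PEs sharing 8 ways.
curves = [lambda w: 1 - 0.5 / (w + 1), lambda w: 1 - 0.3 / (w + 0.5)]
print(allocate_cache(lambda i, w: curves[i](w), n_pes=2, total_ways=8))
```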
4. Innovations in Compute, Communication, and Energy Efficiency
Recent CGRA architectures have introduced several innovations to increase performance and efficiency:
- Collective execution and motif-based mapping: Architectures such as Plaid group multiple DFG nodes into motifs executed with local collective routers, reducing the communication network’s area/power without compromising generality (Li et al., 11 Dec 2024).
- Parameterized and virtual CGRA overlays: “Pixie” overlays implement generic PEs and virtual channels on FPGAs using parameterized tool flows, achieving substantial resource reduction (24% for PEs and 82% for VCs) and fast compilation/instantiation (Kulkarni et al., 2017).
- Ultra-low-power scheduling and dataflow: Architectures targeting transformer acceleration at the edge utilize heterogeneous PE and MOB arrays with switchless mesh/torus interconnects, ensuring sub-milliwatt operation and dataflow-optimized compute for matrix multiplications (Prasad, 17 Jul 2025).
- Dynamic and proactive aging mitigation: Periodic rotation of configurations across the CGRA fabric balances FU usage, thus reducing NBTI-induced threshold voltage shifts and extending device lifetime by up to 2.2x (Brandalero et al., 2020); a minimal rotation sketch follows this list.
- Approximate computing and voltage islands: Integration of approximate multipliers (e.g., DRUM) and per-channel mapping guided by mean squared error (MSE) enables significant energy savings (30% reduction, less than 2% area overhead) via static voltage islands (Alexandris et al., 29 May 2025).
- Compiler-assisted control flow: MLIR-based compilation frameworks now manage global control/data flow in the compiler, eliminating the need for control-dedicated hardware and achieving up to 2.1x speedups through block merging, loop-head fusion, and modulo scheduling adaptation (Wang et al., 4 Aug 2025).
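A minimal sketch of such a configuration-rotation policy is shown below, assuming the mapping is stored as a 2D grid of per-PE configuration words; the rotation granularity and trigger policy of the cited work are not modeled.

```python
def rotate_configuration(config_grid, shift=1):
    """Rotate a CGRA configuration by `shift` columns (illustrative policy only).

    config_grid: list of rows, each a list of per-PE configurations. Rotating the
    whole mapping periodically spreads stress (e.g. NBTI) across physical FUs
    instead of concentrating it on the PEs hosting the hottest operations.
    """
    return [row[-shift:] + row[:-shift] for row in config_grid]

# After one rotation, the heavily used "mul" configuration sits on a different
# physical column than before.
cfg = [["mul", "add", "nop"],
       ["mul", "add", "nop"]]
print(rotate_configuration(cfg))   # [['nop', 'mul', 'add'], ['nop', 'mul', 'add']]
```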
5. Performance, Scalability, and Toolchain Evaluation
Performance metrics for CGRAs are nuanced and highly dependent on mapping quality, workload structure, and memory system efficiency. Key observations include:
- For regular kernels (e.g., GEMM), operation-centric CGRAs often underutilize available PEs due to routing limitations and recurrence constraints, resulting in higher-than-optimal initiation intervals (IIs), which are bounded by $\mathrm{II} \geq \max(\mathrm{ResMII}, \mathrm{RecMII})$, where ResMII is the resource-constrained minimum initiation interval and RecMII the recurrence minimum initiation interval (Walter et al., 26 Feb 2025, Walter et al., 17 Feb 2025). A small calculation of this bound is sketched after this list.
- Iteration-centric TCPA architectures, utilizing tiling and local data buffers, can achieve order-of-magnitude improvements in throughput (up to 19x for GEMM), albeit at higher area costs (Walter et al., 17 Feb 2025).
- Energy efficiency is not solely a function of technology but arises from architectural alignment (such as collective routing (Li et al., 11 Dec 2024)), per-tile voltage scaling (Alexandris et al., 29 May 2025), and memory hierarchy adaptation (Liu et al., 13 Aug 2025).
- Toolchain evaluations demonstrate that mapping complex loops is limited by both routing fabric constraints and insufficient PE utilization. Enhancements such as multi-hop interconnects or improved placement/routing heuristics (e.g., connectivity-pruned ILP (Walker et al., 2019)) are critical for scalable tool performance.
- Early estimation frameworks have emerged to provide rapid, cycle-level power and timing forecasts using kernel characterization and Python-based simulation, enabling fast design space exploration prior to time-consuming post-synthesis runs (Aspros et al., 2 Apr 2025).
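The II lower bound referenced above can be computed directly from the loop's resource and recurrence characteristics. The sketch below assumes a single homogeneous resource class (all PEs can execute any operation), which is a simplification of real CGRA resource models.

```python
from math import ceil

def min_initiation_interval(n_ops, n_pes, recurrences):
    """Lower bound on the II of a modulo schedule: max(ResMII, RecMII).

    n_ops / n_pes: operations in the loop body vs. available PEs (one shared
    resource class assumed). recurrences: iterable of (cycle_latency,
    dependence_distance) pairs, one per recurrence cycle in the DFG.
    """
    res_mii = ceil(n_ops / n_pes)
    rec_mii = max((ceil(lat / dist) for lat, dist in recurrences), default=1)
    return max(res_mii, rec_mii)

# 10 operations on a 2x2 CGRA with a loop-carried accumulation of latency 2 and
# distance 1: ResMII = 3, RecMII = 2, so II >= 3.
print(min_initiation_interval(10, 4, [(2, 1)]))
```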
| Architecture or Tool | Main Innovation | Metric/Result |
|---|---|---|
| Pixie VCGRA | Parameterized config, TLUT/TCON | 24% PE, 82% VC resource reduction |
| Plaid | Motif detection, collective execution | 43% power, 46% area savings over spatio-temporal CGRA |
| Morpher | Architecture-adaptive, DFG extraction | Up to 25x speedup for optimized CONV kernels |
| DR-CGRA | Inter-thread dataflow, no spill | 2.1–4.5x speedup on SPEC2017 |
| MLIR-based CFG compilation | Compiler-driven control management | 2.1x speedup over previous software approaches |
| Approximate CGRA | Per-channel DRUM, voltage islands | 30% power reduction, <2% area overhead |
6. Application Domains, Flexibility, and Future Directions
CGRAs have been applied to a wide range of workloads, including image processing kernels (filters, Sobel), dense and sparse matrix multiplications, convolutional neural networks (CNNs), transformer inference, graph analytics, and even general-purpose loop acceleration.
Compiler advances such as equality saturation for dataflow rewriting (FlexC) have significantly expanded the domain flexibility of CGRAs, enabling accelerators to support kernels outside their original native operation set and increasing mapping coverage by up to 2.2x (Woodruff et al., 2023).
Emergent architectural directions include:
- Dual-mode (operation/data-centric) designs for irregular and graph-centric applications (Wu et al., 2023).
- Modular, motif-based hierarchical execution units for better scaling and reduced area overhead (Li et al., 11 Dec 2024).
- Fine-grained hardware abstraction and partial dynamic reconfiguration for multi-tasked/heterogeneous workloads (Kong et al., 2023).
A plausible implication is that as architectural abstraction and compiler capabilities grow (e.g., hardware-agnostic IRs, support for complex control/data flows, rapid power/timing estimators), CGRAs will become increasingly preferred for energy-constrained, adaptive edge and cloud deployments, bridging the gap between domain-specific accelerators and general-purpose spatial computing fabrics.