CGLA Accelerator: Coarse-Grained Linear Array
- CGLA Accelerators are hardware architectures that interleave processing elements with local scratchpad memories in linear compute lanes to balance programmability and efficiency.
- They leverage streaming dataflows and burst DMA transfers to optimize performance on AI workloads like LLM inference, ASR, and image generation.
- Careful tuning of system-level trade-offs such as lane count, local memory (LMM) size, and host-accelerator balance yields significant energy-efficiency improvements over high-end GPUs.
A Coarse-Grained Linear Array (CGLA) Accelerator is a class of hardware architecture within the broader family of Coarse-Grained Reconfigurable Arrays (CGRAs). CGLAs are designed to deliver a balance between the high programmability of GPUs and the dedicated efficiency of ASICs by arranging processing elements (PEs) in linear “compute lanes” interleaved with local scratchpad memories. This topology, leveraged in systems such as IMAX and IMAX3, sustains high energy efficiency on computationally intensive workloads—including LLM inference, automatic speech recognition, and image generation—while remaining general-purpose and task-agnostic. The architecture’s utility has been demonstrated on diverse workloads spanning quantized matrix operations central to contemporary AI models (Ando et al., 29 Nov 2025, Ando et al., 4 Nov 2025, Ando et al., 4 Nov 2025).
1. Architectural Organization and Microarchitecture
At the core of a CGLA, exemplified by the IMAX3 system, is a one-dimensional array of compute lanes, each consisting of an alternating sequence of Coarse-Grained Processing Elements (PEs) and Local Memory Modules (LMMs). Each PE integrates multiple ALUs (integer, logical, shift), address-generation units, a double-buffered LMM, and, for certain configurations, a pipelined FMA unit capable of inline FP16↔FP32 conversion. This configuration allows for decoupling of compute and data movement paths, enabling concurrent data-loading and execution phases.
Lanes communicate with a host processor (commonly an ARM Cortex-A72) via a high-bandwidth network-on-chip (NoC) and dedicated DMA engines. The pipeline is deeply optimized for streaming: weights, activations, and scaling factors are transferred in large burst transactions to maximize throughput. Each LMM typically provides 32–64 KB (or more, e.g., 512 KB per lane for certain applications), balancing static power against the achievable offload ratio for computational tiles.
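The sketch below is a minimal Python model of this organization, assuming illustrative class and field names (double-buffered LMMs, PEs with an optional FMA unit, lanes built from both); it is not the IMAX3 register map, only a structural aid.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LocalMemoryModule:
    """Double-buffered scratchpad: one bank feeds the PE while DMA fills the other."""
    size_bytes: int = 64 * 1024
    banks: List[bytearray] = field(default_factory=lambda: [bytearray(), bytearray()])
    active: int = 0  # index of the bank currently read by the PE

    def swap(self) -> None:
        """Ping-pong the buffers once a DMA burst has landed in the shadow bank."""
        self.active ^= 1

@dataclass
class ProcessingElement:
    """Coarse-grained PE: ALUs, address generators, and an attached scratchpad."""
    lmm: LocalMemoryModule
    has_fma: bool = True  # pipelined FMA with inline FP16<->FP32 conversion (some configs)

@dataclass
class ComputeLane:
    """Linear chain of PEs interleaved with LMMs, fed by a burst-DMA engine."""
    pes: List[ProcessingElement]

def build_lane(num_pes: int = 8, lmm_kb: int = 64) -> ComputeLane:
    """Assemble one lane with the given PE count and per-PE LMM capacity."""
    return ComputeLane(pes=[ProcessingElement(LocalMemoryModule(lmm_kb * 1024))
                            for _ in range(num_pes)])
```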
A summary of typical architectural parameters is provided below:
| Platform | PEs/lane | Lanes | LMM size/PE | Clock (FPGA/ASIC) |
|---|---|---|---|---|
| IMAX3/FPGA | 8/64 | 8 | 32–64 KB | 140–145 MHz |
| IMAX3/ASIC (28 nm) | 64 | ≥8 | 64 KB | 800–840 MHz |
The instruction set architecture (ISA) is rich and CISC-like, containing task-agnostic arithmetic, addressing, and bit-manipulation primitives, with domain-specific extensions such as OP_SML8 (int8 SIMD MAC), OP_AD24 (24-bit integer addition), bit-packing (OP_CVT53), SIMD dot-product, and quantized operator support.
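As an illustration of the quantized-MAC semantics, the sketch below models a 2-way int8 SIMD multiply-accumulate into a 24-bit accumulator, in the spirit of OP_SML8 and OP_AD24; the exact opcode behavior is an assumption inferred from the description above, not a cycle-accurate specification.

```python
def sml8(acc24: int, a_pair: tuple, b_pair: tuple) -> int:
    """2-way int8 MAC: acc24 += a0*b0 + a1*b1, wrapped to a signed 24-bit result.

    Inputs are signed int8 lanes; the accumulator width models OP_AD24's 24-bit adder.
    """
    for a, b in zip(a_pair, b_pair):
        assert -128 <= a <= 127 and -128 <= b <= 127, "operands must be int8"
        acc24 += a * b
    # Wrap to a signed 24-bit value, as a 24-bit hardware accumulator would.
    acc24 &= (1 << 24) - 1
    if acc24 >= (1 << 23):
        acc24 -= (1 << 24)
    return acc24
```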
2. Kernel Mapping and Dataflow Strategies
Mapping AI kernels onto a CGLA involves decomposing the principal workloads—matrix–vector products, FFTs, convolutions, and quantized dot-products—into contiguous, burst-sized tiles that can be streamed to the PEs. The compiler analyzes high-level loops and statically maps them to the linear PE chain without any need for 2D routing or complex placement heuristics. For LLM inference, frameworks such as llama.cpp are adapted in a hybrid model: control-intensive tasks (tokenization, normalization, softmax, KV-cache management) are retained on the host, while large dot-product kernels (attention projections, feed-forward layers) are offloaded.
At runtime, the host schedules “CONF” (configuration) and “EXEC” (execution) packets to each lane, describing loop bounds, strides, and immediate constants. Each PE, executing in lock-step, streams its assigned burst from LMM and computes in a pipelined dataflow, often with four-lane parallelism enabled by custom SIMD instructions.
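As a hedged illustration of this host-side protocol, the following Python sketch models CONF/EXEC packet construction; the packet fields and the schedule() helper are assumptions chosen for exposition, not the actual IMAX3 packet format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ConfPacket:
    """Per-lane configuration descriptor (field names are illustrative)."""
    lane_id: int
    loop_bounds: Tuple[int, ...]   # nested loop trip counts
    strides: Tuple[int, ...]       # address-generation strides per loop level
    immediates: Tuple[int, ...]    # scaling factors and other inline constants

@dataclass
class ExecPacket:
    """Execution trigger: stream the configured burst through the lane in lock-step."""
    lane_id: int
    burst_bytes: int

def schedule(lane_ids: List[int], loop_bounds, strides, immediates, burst_bytes: int):
    """Host-side scheduling: one CONF packet per lane, then EXEC triggers for all lanes."""
    confs = [ConfPacket(l, loop_bounds, strides, immediates) for l in lane_ids]
    execs = [ExecPacket(l, burst_bytes) for l in lane_ids]
    return confs + execs  # in hardware these would be pushed over the NoC / DMA path
```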
Data reuse is a central optimization: vectors and weights are partitioned into blocks sized to the LMM capacity, maximizing on-chip arithmetic intensity (number of MACs per LMM load) and exploiting spatial reuse across the PE chain for dot-products. This allows the architecture to approach near-ASIC levels of arithmetic utilization and energy efficiency for dot-product–dominated kernels.
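A back-of-the-envelope model of this blocking step is sketched below; it assumes one weight vector and one activation vector of equal length share the LMM (int8 elements by default) and ignores the additional spatial reuse across PEs, so the numbers are illustrative lower bounds.

```python
def dot_product_tiling(n: int, lmm_bytes: int, elem_bytes: int = 1) -> dict:
    """Split a length-n dot product into LMM-sized blocks and report arithmetic intensity.

    Each block holds a weight slice and an activation slice of equal length;
    intensity is counted as MACs per byte loaded into the LMM.
    """
    block = lmm_bytes // (2 * elem_bytes)      # elements per operand that fit on chip
    full_blocks, residual = divmod(n, block)
    macs_per_load = block                      # one MAC per loaded weight/activation pair
    bytes_per_load = 2 * block * elem_bytes
    return {
        "block": block,
        "full_blocks": full_blocks,
        "residual": residual,
        "macs_per_byte": macs_per_load / bytes_per_load,
    }

# Example: a 4096-element int8 dot product against a 64 KB LMM.
print(dot_product_tiling(4096, 64 * 1024))
```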
3. Performance, Energy Metrics, and Comparative Analysis
CGLA accelerators are evaluated using the Power-Delay Product (PDP), PDP = P_avg × T (joules), and the Energy-Delay Product (EDP), EDP = P_avg × T² (joule-seconds), where P_avg is average power and T is end-to-end latency; these metrics quantify the trade-off between inference energy cost and performance. End-to-end system analysis is emphasized, encompassing kernel execution, DMA transfer, and host orchestration.
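These metrics reduce to simple products of average power and latency; a minimal helper makes the definitions concrete:

```python
def pdp(avg_power_w: float, latency_s: float) -> float:
    """Power-Delay Product: PDP = P_avg * T, in joules."""
    return avg_power_w * latency_s

def edp(avg_power_w: float, latency_s: float) -> float:
    """Energy-Delay Product: EDP = P_avg * T^2, in joule-seconds."""
    return avg_power_w * latency_s ** 2
```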
On LLM inference, ASIC-projected IMAX achieves up to 44.4× lower PDP and 11.5× lower EDP relative to a high-end GPU (NVIDIA RTX 4090), with representative PDP values such as 15.5 J (IMAX, Qwen3-1.7B Q8_0) vs. 28.4 J (RTX 4090). Edge comparisons show a 13.6× PDP improvement over Jetson AGX Orin. In ASR workloads (Whisper), PDPs of 12.6 J (IMAX/ASIC Q8_0) outpace both Jetson AGX Orin (24 J) and RTX 4090 (120 J), with 1.9× and 9.83× energy ratios, respectively (Ando et al., 29 Nov 2025, Ando et al., 4 Nov 2025).
For image generation workloads (Stable Diffusion) in kernel isolation, IMAX3/ASIC matches or exceeds Xeon CPU PDP and approaches GPU efficiency in quantized kernels, even though end-to-end latency remains higher.
Latency characteristics reflect the linear streaming dataflow: CGLAs are not competitive with top-tier GPUs on small or low-latency workloads due to host–accelerator transfer and host resource saturation, but excel in throughput-oriented, energy-limited settings.
4. System-Level Bottlenecks and Scaling Constraints
End-to-end analysis consistently identifies host–accelerator data transfer as the primary performance bottleneck, especially for long-sequence or long-context workloads where DMA loads (e.g., of the KV cache) dominate iteration time. For Qwen3-0.6B Q3_K_S [32:16], the breakdown is 27.4% kernel EXEC, 32.6% DMA LOAD, 33.3% host CPU tasks, and only 1.9% DMA DRAIN, with the remaining time spent in configuration overhead.
Multi-lane scalability is limited by host compute: dual-core ARM hosts saturate at two lanes, necessitating high-core-count or PCIe Gen4/5 hosts for datacenter use. Increasing LMM size above 64 KB yields diminishing returns, as static power increases disproportionately and offload gains plateau.
Pseudocode for kernel offload demonstrates the mixed-execution paradigm: full-burst blocks are processed on the CGLA, with residual elements handled by the host:

```
for i in 1..M:                        # one output per matrix row
    offload_chunks = floor(N / B)     # B = burst size (e.g., 16 elements)
    for k in 0..offload_chunks-1:
        DMA_push(W[i, k*B:(k+1)*B], x[k*B:(k+1)*B])
        CGLA_compute_dot()            # one B-element burst on the accelerator
    R = N mod B                       # residual elements that do not fill a burst
    CPU_dot(W[i, N-R:N], x[N-R:N])    # handled on the host
```
5. ISA Extensions, Quantization, and Kernel Specialization
CGLA architectures incorporate extensible ISAs supporting a broad set of AI quantization and activation formats. The IMAX platform includes:
- OP_SML8: 2-way SIMD int8 MAC (output 24-bit)
- OP_AD24: 24-bit adder for INT8 accumulations
- OP_CVT86, OP_CVT53: SIMD bit-packing/unpacking, supporting mixed-precision (2/4/6-bit) quantization formats
- SML16: 16-bit SIMD dot-product
A unified “decompress to INT8” frontend enables a single kernel backend to process 8-bit and mixed low-bit formats. For image generation (Stable Diffusion), additional opcodes support complex quantized accumulations with GGML-style 5-bit scale/3-bit weight kernels and inter-PE reduction sweeps (Ando et al., 4 Nov 2025).
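To make the "decompress to INT8" idea concrete, the sketch below unpacks 4-bit packed weights (two nibbles per byte) into signed int8 values; it deliberately omits the per-block scales and super-block structure of the real GGML formats, so treat it as an assumption-laden illustration rather than the production kernel.

```python
import numpy as np

def unpack_int4_to_int8(packed: np.ndarray) -> np.ndarray:
    """Decompress 4-bit packed weights (two nibbles per byte) to signed int8.

    Generic illustration of a decompress-to-INT8 frontend: the low and high
    nibbles of each byte become consecutive int8 weights.
    """
    packed = packed.astype(np.uint8)
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    # Re-center unsigned nibbles [0, 15] to signed [-8, 7].
    lo -= 8
    hi -= 8
    return np.stack([lo, hi], axis=-1).reshape(*packed.shape[:-1], -1)

# Example: 4 packed bytes expand to 8 int8 weights.
print(unpack_int4_to_int8(np.array([0x21, 0x43, 0x65, 0x87], dtype=np.uint8)))
```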
At the microarchitectural level, wider SIMD datapaths (2-way/4-way), nested hardware microcode loops, and hardware-implemented fused multiply-accumulate-activation stages increase quantized kernel throughput and area efficiency.
6. Design Trade-offs, Recommendations, and Prospects
Key system guidelines derived from empirical evaluation include:
- LMM sizing: 64 KB/PE for LLM/ASR tasks, 256–512 KB/PE for dense convolutions, balancing performance and power.
- Lane count: Prefer increasing the number of lanes (16–32+) to raise tile throughput and meet real-time constraints.
- Host–accelerator balance: Core count and interconnect should be matched to array parallelism; dual-core hosts are a bottleneck for multi-lane arrays.
- Transfer coalescing: Single-burst DMA transactions improve effective memory bandwidth by up to 1.2× (LOAD) and 4.8× (DRAIN); a toy cost model after this list illustrates why coalescing helps.
- ISA/microcode: Enhanced support for mixed-precision and fused operators reduces mapping overhead and boosts code density.
- Memory hierarchy: On-PE scratchpads and double-buffered DMA engines minimize external DDR traffic.
- Programmability vs. efficiency: CGLA platforms deliver roughly 2× lower raw throughput than fixed-function ASICs but far greater algorithmic flexibility, supporting multiple AI tasks (LLMs, ASR, k-NN, CNNs).
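The sketch below is the toy DMA cost model referenced in the transfer-coalescing guideline; the per-burst overhead and bandwidth figures are placeholder assumptions, not measured IMAX3 numbers, and the point is only that fewer, larger bursts amortize the fixed per-transaction cost.

```python
def transfer_time_us(total_bytes: int, bytes_per_burst: int,
                     per_burst_overhead_us: float = 1.0,
                     bandwidth_gb_s: float = 4.0) -> float:
    """Toy model: each DMA burst pays a fixed setup cost plus payload time.

    Overhead and bandwidth are placeholders; only the trend matters.
    """
    bursts = -(-total_bytes // bytes_per_burst)          # ceiling division
    payload_us = total_bytes / (bandwidth_gb_s * 1e3)    # GB/s -> bytes per microsecond
    return bursts * per_burst_overhead_us + payload_us

# Coalescing example: one 64 KB burst vs. sixteen 4 KB bursts for the same payload.
coalesced = transfer_time_us(64 * 1024, 64 * 1024)
scattered = transfer_time_us(64 * 1024, 4 * 1024)
print(f"coalesced: {coalesced:.1f} us, scattered: {scattered:.1f} us")
```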
By harmonizing scalable, linear PE-rich fabrics with quantization-aware ISAs, tight burst scheduling, and host-aligned execution, CGLA designs close the efficiency gap to both GPUs and ASICs in power-sensitive inference while preserving full programmability. Future advances are expected in interface integration (PCIe/host cache), deeper quantization, and on-chip network topologies (e.g., 2D torus for PEs), further advancing the scope and impact of CGLA accelerators (Ando et al., 29 Nov 2025, Ando et al., 4 Nov 2025, Ando et al., 4 Nov 2025).
References
- "Efficient Kernel Mapping and Comprehensive System Evaluation of LLM Acceleration on a CGLA" (Ando et al., 29 Nov 2025)
- "Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA" (Ando et al., 4 Nov 2025)
- "Implementation and Evaluation of Stable Diffusion on a General-Purpose CGLA Accelerator" (Ando et al., 4 Nov 2025)