CGLAs: Coarse-Grained Linear Arrays
- CGLAs are one-dimensional coarse-grained reconfigurable arrays that use a linear fabric of processing elements and local memory to streamline dataflow.
- The IMAX accelerator evaluation demonstrates a design that leverages deterministic streaming and specialized instructions to optimize dot-product computations for LLM inference.
- Quantitative results indicate that CGLAs offer significant energy-efficiency improvements over GPUs, accepting higher latency in exchange for lower power-delay and energy-delay products.
Coarse-Grained Linear Arrays (CGLAs) are a one-dimensional form of Coarse-Grained Reconfigurable Array (CGRA) in which processing and memory resources are organized as a linear fabric rather than a two-dimensional mesh. In the evaluation reported for the IMAX accelerator, a CGLA is presented as a general-purpose, task-agnostic architecture intended to balance the flexibility of a programmable fabric with the energy efficiency of fixed-function units, while remaining adaptable to domain-specific workloads such as LLM inference (Ando et al., 29 Nov 2025). The cited study positions CGLAs as an alternative to GPUs for power-constrained deployment, and reports what it describes as the first comprehensive, end-to-end evaluation of a non-AI-specialized CGLA accelerator for the Qwen LLM family.
1. Architectural definition and system embedding
In the reported implementation, the Coarse-Grained Linear Array is embodied by IMAX, a one-dimensional CGRA organized as a multi-lane, linear array of Processing Elements (PEs) interleaved with Local Memory Modules (LMMs) (Ando et al., 29 Nov 2025). At the system level, a host Processing System (PS), a dual-core Arm Cortex-A72 running Linux, interfaces through a high-bandwidth Network-on-Chip and a DMA controller to the Programmable Logic (PL), which contains up to eight independent compute lanes. In the prototype configuration, only two lanes are activated to avoid saturating the dual-core host.
The lane-level organization is strictly linear. Within each lane, PEs and LMMs alternate in a one-dimensional chain, and neighboring PEs are connected by point-to-point FIFOs that stream data in fine-grained tokens. The stated effect of this organization is to eliminate complex 2D routing and preserve a deterministic pipeline. Each LMM is a double-buffered, hardware-managed bank, with 64 KB in the reported evaluation, and its purpose is to hide load/store latency by overlapping DMA transfers with computation.
This organization gives the CGLA a distinctive systems profile. It is neither a fixed-function AI accelerator nor a conventional GPU-like programmable array. Instead, the architecture emphasizes deterministic streaming, explicit dataflow between adjacent PEs, and a local-memory hierarchy sized to support tiled kernels. A plausible implication is that the linear topology is valuable when predictable offload behavior and constrained power budgets are more important than minimizing absolute latency.
2. Processing element composition and instruction support
A single PE in the IMAX CGLA comprises three arithmetic units, two address generation units, a double-buffered LMM interface, and a small register file with control logic (Ando et al., 29 Nov 2025). The arithmetic units are heterogeneous: ALU1 is integer, ALU2 is logical, and ALU3 is shift. The two address generation units, AG1 and AG2, decouple memory-access address calculations from arithmetic. This mix is described as supporting both general-purpose control and domain-specific data-parallel operations.
To exploit this heterogeneity, IMAX adopts a CISC-style instruction set extended with custom SIMD and conversion primitives. The compiler emits these instructions directly from C/C++ kernels, and the packet format is described as compact enough to enable single-cycle execution of complex dot-product primitives.
| Instruction | Function | Reported role |
|---|---|---|
| OP_SML8 | two-way SIMD multiply-accumulate on signed 8-bit lanes, yielding 24-bit partial sums | low-bit dot-product primitive |
| OP_AD24 | two-way 24-bit integer addition for accumulation | accumulation |
| SML16 | 16-bit multiply-accumulate after front-end decompression | mixed-bit back-end reuse |
| CVT86 | low-bit unpacking instruction converting mixed-precision weight formats into wider integers | front-end decode |
| OP_CVT53 | low-bit unpacking instruction converting mixed-precision weight formats into wider integers | front-end decode |
The architectural significance of these instructions lies in the unification of multiple quantized execution paths. The paper describes CVT86 and OP_CVT53 as mechanisms for unpacking mixed-precision weight formats into wider integers for a unified back-end. This suggests that the CGLA is designed not only around arithmetic throughput, but also around reducing the control and data-format overhead that accompanies modern quantized LLM inference.
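The arithmetic described for OP_SML8 and OP_AD24 in the table can be illustrated with a small Python emulation. This is a sketch under stated assumptions, not the IMAX definition: the function names `sml8`, `ad24`, and `dot_q8`, the operand layout (one 8-bit pair per SIMD way), and the wraparound behavior on 24-bit overflow are all hypothetical choices made for illustration.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer into the two's-complement range of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def sml8(acc, a, b):
    """Two-way multiply-accumulate on signed 8-bit inputs (assumed semantics).

    acc, a and b each hold two values; every way computes acc + a * b and
    wraps the result to 24 bits.
    """
    return tuple(
        wrap_signed(acc_i + wrap_signed(a_i, 8) * wrap_signed(b_i, 8), 24)
        for acc_i, a_i, b_i in zip(acc, a, b)
    )


def ad24(x, y):
    """Two-way 24-bit integer addition (wraparound on overflow is assumed)."""
    return tuple(wrap_signed(x_i + y_i, 24) for x_i, y_i in zip(x, y))


def dot_q8(weights, activations):
    """Dot product of two equal-length int8 vectors built from sml8 and ad24."""
    assert len(weights) == len(activations) and len(weights) % 4 == 0
    partial_a = (0, 0)
    partial_b = (0, 0)
    for i in range(0, len(weights), 4):
        # Two independent accumulation chains, each two ways wide.
        partial_a = sml8(partial_a, weights[i:i + 2], activations[i:i + 2])
        partial_b = sml8(partial_b, weights[i + 2:i + 4], activations[i + 2:i + 4])
    merged = ad24(partial_a, partial_b)            # combine the two chains
    return wrap_signed(merged[0] + merged[1], 24)  # final horizontal add
```

For a 32-element block, each way accumulates at most eight products of magnitude up to 128 × 128, which stays far below the 24-bit signed limit, so the wraparound path is never exercised at that block size.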
3. Kernel mapping for LLM inference
The reported software stack builds on the widely adopted llama.cpp inference framework and partitions Transformer-style execution between CPU and IMAX (Ando et al., 29 Nov 2025). Complex control and non-linear operations remain on the host, while all large-matrix dot-products—specifically attention projections and SwiGLU feed-forward layers—are offloaded to the CGLA. The compiler extracts innermost loops, identified as dot-products, and tiles them into burst lengths that fit the 64 KB LMM.
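The tiling step can be sketched as a burst planner. The function name `plan_bursts`, the even split of the LMM between two buffers, and the idea that `bytes_per_elem` is the combined footprint of one dot-product element are assumptions for illustration; the source only states that bursts are sized to fit the 64 KB LMM.

```python
def plan_bursts(total_elems: int, bytes_per_elem: int,
                lmm_bytes: int = 64 * 1024, buffers: int = 2):
    """Split a dot-product of total_elems into (start, length) bursts.

    Assumes the LMM capacity is divided evenly among `buffers` regions and
    that one burst must fit in a single region (illustrative assumptions).
    """
    if total_elems < 0 or bytes_per_elem <= 0 or buffers <= 0:
        raise ValueError("sizes must be positive")
    per_buffer = lmm_bytes // buffers
    max_elems = per_buffer // bytes_per_elem
    if max_elems == 0:
        raise ValueError("a single element does not fit in one buffer")
    return [
        (start, min(max_elems, total_elems - start))
        for start in range(0, total_elems, max_elems)
    ]
```

Under these assumptions, a 100,000-element vector at one byte per element is split into three full bursts of 32,768 elements and one remainder burst.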
The mapping strategy is quantization-specific. For FP16 kernels, each PE uses a LUT to convert incoming half-precision values into internal FP32, then employs a combination of 2×32-bit SIMD FMA and time-multiplexed micro-threads to hide latency, processing 16 multiplications per cycle. For Q8_0 kernels, OP_SML8 and OP_AD24 are pipelined across twelve PEs. Four parallel lanes of this 12-PE pipeline compute 32-element inner products in two bursts, consuming 46 arithmetic units per lane. For Q6_K and Q3_K mixed-bit kernels, the front end decodes packed 2/4-bit or 1/2-bit weights via CVT86 or OP_CVT53, emits 8-bit or 5-bit representations, and then reuses the same SML16/SML8 back-end; Q6_K uses 64 units and Q3_K uses 51.
DMA transfers of weights, activations, and scales are coalesced into a single contiguous block per kernel in order to amortize setup overhead. Double-buffered LMMs allow the next tile’s data to stream in while computation proceeds on the current tile. In effect, the kernel mapping methodology treats the CGLA as a streaming dot-product engine whose efficiency depends on burst shaping, format conversion, and overlap of transfer and compute.
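The benefit of overlapping transfer and compute can be estimated with a simple timing model. This is an idealized lockstep sketch, not a model of the IMAX hardware: the function names and the assumption that the load of tile i+1 runs exactly in parallel with the execution of tile i are hypothetical, and DMA setup cost is ignored.

```python
def serial_time(load, exec_):
    """Total time when every tile is loaded and then executed back to back."""
    return sum(load) + sum(exec_)


def overlapped_time(load, exec_):
    """Total time for a lockstep double-buffered schedule.

    Tile i+1 is loaded while tile i executes, so each steady-state step
    costs the slower of the two activities.
    """
    assert len(load) == len(exec_) and len(load) > 0
    total = load[0]                            # the first load cannot be hidden
    for i in range(len(load) - 1):
        total += max(exec_[i], load[i + 1])    # overlap compute with next load
    return total + exec_[-1]                   # the last tile only executes
```

With four tiles that each take 3 time units to load and 2 to execute, the serial schedule costs 20 units and the overlapped one 14. Because loads are slower than compute in this example, the schedule stays load-bound, which matches the text's point that overlap alone does not remove the transfer bottleneck.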
4. Quantitative performance and energy characterization
The evaluation uses an FPGA prototype running at 145 MHz with two lanes, and projects a 28 nm ASIC at 840 MHz, described as approximately a 6× speedup, consistent with the clock ratio 840 / 145 ≈ 5.8 (Ando et al., 29 Nov 2025). The benchmarks cover three Qwen3 model sizes (0.6 B, 1.7 B, and 8 B parameters) under Q3_K_S and Q8_0 quantizations, with varying prompt and output lengths.
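The reported metrics are end-to-end rather than kernel-local. They include E2E latency from host prompt to first token, the Power-Delay Product,
$$\mathrm{PDP} = P \times T,$$
and the Energy-Delay Product,
$$\mathrm{EDP} = P \times T^{2} = \mathrm{PDP} \times T,$$
where $P$ is the average power and $T$ is the E2E latency, so that PDP is expressed in joules and EDP in joule-seconds.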
For the ASIC, power is obtained from synthesis at 10% switching activity and includes per-kernel dynamic power plus measured host idle power. GPU and edge-device power are modeled at TDP.
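The two compound metrics can be computed as follows, assuming the standard definitions (power times delay, and power times delay squared). The device figures in the usage example are hypothetical and are not taken from the paper; they only show how a slower, lower-power device can still come out ahead on both metrics.

```python
def pdp(power_w: float, latency_s: float) -> float:
    """Power-Delay Product in joules: average power times E2E latency."""
    return power_w * latency_s


def edp(power_w: float, latency_s: float) -> float:
    """Energy-Delay Product in joule-seconds: PDP times E2E latency."""
    return pdp(power_w, latency_s) * latency_s


# Hypothetical devices, for illustration only.
fast_hot = {"power_w": 300, "latency_s": 1}    # low latency, high power
slow_cool = {"power_w": 5, "latency_s": 6}     # high latency, low power

fast_metrics = (pdp(**fast_hot), edp(**fast_hot))
slow_metrics = (pdp(**slow_cool), edp(**slow_cool))
```

EDP penalizes latency twice, so a slow device needs a proportionally larger power advantage to win on EDP than on PDP.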
A representative Q8_0 [16:4] case illustrates the trade-off. The RTX 4090 achieves the lowest latency, approximately 0.8 s, whereas IMAX-ASIC records 5.6 s. On the same workload, the paper reports IMAX-ASIC PDP at 15.5 J, compared with 28.4 J for the RTX 4090 and 22.1 J for NVIDIA Jetson AGX Orin, corresponding in the paper’s wording to a 44.4× improvement over the RTX 4090 and a 13.6× improvement over Orin. For EDP, IMAX-ASIC records 118.9 J·s versus 216.8 J·s for the RTX 4090, which the paper reports as an 11.5× gain. The absolute values quoted here give ratios of only about 1.8× against the RTX 4090 (for both PDP and EDP) and 1.4× against Orin, so the larger headline factors presumably refer to a different configuration or normalization than this single case. The general conclusion stated by the study is that GPUs exhibit lower latency, whereas the non-AI-specific accelerator achieves higher energy efficiency.
These measurements are important because they frame the CGLA not as a universal latency winner, but as a platform offering a favorable performance-energy trade-off. The paper’s emphasis on PDP and EDP also indicates that accelerator assessment is being extended beyond raw time-to-solution toward compound metrics that better capture deployment constraints.
5. System bottlenecks and end-to-end execution limits
The study identifies host-accelerator data transfer as the primary performance bottleneck, and explicitly notes that this factor is often overlooked in kernel-level studies (Ando et al., 29 Nov 2025). A fine-grained runtime breakdown attributes 33% of E2E time to DMA LOAD from host to LMM, 33% to host CPU scheduling and PIO configurations, and 27% to kernel EXEC time. This is a system-level result rather than a compute-kernel result.
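The reported breakdown bounds what kernel-only optimization can achieve, which an Amdahl-style calculation makes concrete. The sketch assumes the three components run strictly one after another and that the fractions apply to a single representative run; the dictionary and function names are hypothetical.

```python
# Fractions of E2E time quoted in the text; the remaining share is unattributed.
BREAKDOWN = {"load": 0.33, "host": 0.33, "exec": 0.27}


def speedup_if_accelerated(fraction: float, factor: float) -> float:
    """E2E speedup when a component taking `fraction` of the run time is
    made `factor` times faster and everything else is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Upper bound from an infinitely fast kernel (EXEC time goes to zero).
kernel_only_bound = speedup_if_accelerated(BREAKDOWN["exec"], float("inf"))

# Halving both DMA LOAD and host overhead instead.
data_path_gain = speedup_if_accelerated(
    BREAKDOWN["load"] + BREAKDOWN["host"], 2.0
)
```

Under these assumptions, even an infinitely fast kernel yields at most about a 1.37× end-to-end speedup, whereas merely halving load and host overhead yields about 1.49×.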
The bottleneck becomes sharper during decoding. In the decode phase, the entire growing KV cache must be reloaded for every generated token, making LOAD the primary limiter. Even though the LMMs are double-buffered and overlap transfer with compute, DMA setup and host software overhead remain critical. This distinction is central to understanding CGLA behavior: the array itself may sustain efficient kernel execution, while end-to-end throughput is constrained by orchestration and data movement.
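Reloading the whole cache at every step means that cumulative transfer volume grows quadratically with the number of generated tokens. The sketch below shows this with a generic KV-cache size formula; the function names and all shape parameters are hypothetical and are not the Qwen3 configurations from the paper.

```python
def kv_bytes_per_position(layers: int, kv_heads: int,
                          head_dim: int, bytes_per_value: int) -> int:
    """Bytes of K and V stored for one token position (factor 2 = K plus V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def decode_reload_traffic(prompt_len: int, new_tokens: int,
                          bytes_per_position: int) -> int:
    """Total bytes moved if the full cache is re-sent at every decode step.

    At decode step i (0-based) the cache is assumed to hold
    prompt_len + i positions.
    """
    return sum(
        (prompt_len + step) * bytes_per_position
        for step in range(new_tokens)
    )
```

The sum has the closed form `bytes_per_position * (new_tokens * prompt_len + new_tokens * (new_tokens - 1) // 2)`, which makes the quadratic term in `new_tokens` explicit.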
A common misconception addressed by these findings is that accelerator quality can be inferred from kernel execution alone. The reported breakdown shows that for LLM inference, especially autoregressive decode, host scheduling and memory traffic can dominate the observable runtime. In that sense, the paper argues for co-design of the accelerator, host interface, and software runtime rather than isolated optimization of the PE datapath.
6. Design implications for future CGLAs
The paper states that a linear CGLA with a carefully chosen LMM size—64 KB in the evaluated cases—reaches an “energy sweet spot,” achieving greater than 85% offload ratios without incurring excessive static power (Ando et al., 29 Nov 2025). Beyond that point, increasing LMM size yields diminishing PDP returns. This is one of the clearest design rules extracted from the study: local memory capacity is beneficial up to the point where added storage no longer compensates for its static-power cost.
The reported mitigation strategies for current bottlenecks are also concrete. The study recommends a higher-performance host, such as a many-core CPU, or off-chip PCIe interfacing to sustain multi-lane DMA bandwidth; hardware-accelerated slice schedulers to reduce PIO and programming overhead; and on-chip compression of KV caches or bidirectional streaming links to reduce data-movement cost. For next-generation CGLAs, the stated requirements are scalable, wide interconnects such as PCIe Gen5 or CCIX, hardware support for emerging ultra-low-bit quantization at 1–2 bit with on-the-fly decompression, and PE augmentation with microcoded primitives for kernel dispatch to minimize host intervention.
Taken together, these implications define CGLAs less as isolated compute fabrics than as system architectures whose viability depends on memory movement, interface bandwidth, quantization support, and runtime control. The reported IMAX results validate, within the study’s scope, that a general-purpose, non-AI-specialized fabric can deliver ASIC-class energy efficiency for modern LLM inference, provided that data movement and host-interface challenges are addressed jointly with accelerator design.