IMAX3 System: High-Efficiency AI Accelerator
- IMAX3 is a general-purpose accelerator featuring a scalable, systolic-optimized CGLA architecture designed for high-throughput AI computation.
- It integrates a linear array of processing elements with local memory and custom SIMD instructions designed for efficient dense linear algebra operations.
- Evaluations on FPGA and projected ASIC platforms demonstrate IMAX3's potential for energy-efficient, on-device AI inference with improved throughput.
IMAX3 is a general-purpose accelerator system based on a Coarse-Grained Linear Array (CGLA) architecture. Designed to support high-throughput, energy-efficient execution of key computational kernels in modern AI workloads, IMAX3 employs a scalable, systolic-optimized network of processing elements (PEs) tailored for dense linear algebra operations and fine-grained parallelism. An in-depth evaluation of IMAX3 using the stable-diffusion.cpp image generation framework, on an FPGA prototype and in projected ASIC implementations, demonstrates its potential for AI-specialized, on-device computing and establishes guidelines for further architectural refinement (Ando et al., 4 Nov 2025).
1. Architectural Organization
The core of the IMAX3 system is a linear array of PEs grouped into lanes. Each PE incorporates:
- A small-scale ALU pipeline capable of integer and floating-point multiply–accumulate (MAC) operations and custom SIMD instructions.
- A local Register File (RF) consisting of 32 × 32-bit registers for temporaries and loop indices.
- A Local Memory Module (LMM) of 8 KB per PE (configurable to 32 KB), functioning as a software-managed scratchpad for storing activation tiles, weights, and partial sums.
PEs are arrayed in lanes, with 64 PEs per lane in the FPGA prototype and scalability up to 8 lanes (512 PEs total). Linear inter-PE connections in both directions enable nearest-neighbor communication, supporting systolic "shift-and-sum" execution for dot-product and convolution kernels. Each PE features four point-to-point links for east/west neighbor transfer, vertical communication to special REP registers, and downward configuration control.
Configuration employs a lightweight command-stream protocol: host CPUs deliver configuration words (CONF) to an on-chip FIFO via DMA, specifying PE function units (e.g., OP_SML8, OP_AD24, OP_CVT53), address generators, and loop control. Once loaded, PEs execute aligned micro-programs autonomously, minimizing host interaction and associated control overhead (Ando et al., 4 Nov 2025).
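The command-stream protocol above can be illustrated with a small encoder. The field widths and bit layout below are purely illustrative assumptions (the paper does not specify the CONF word format); only the opcode names (OP_SML8, OP_AD24, OP_CVT53) and the scale of the array come from the source.

```python
# Hypothetical CONF-word encoder for the IMAX3 command-stream protocol.
# Field widths and layout are illustrative assumptions, not the real format:
# [pe_id:9 | opcode:5 | addr_base:10 | loop_count:8]
OPCODES = {"OP_SML8": 0x1, "OP_AD24": 0x2, "OP_CVT53": 0x3}

def encode_conf(pe_id: int, opcode: str, addr_base: int, loop_count: int) -> int:
    """Pack one configuration word for a single PE."""
    assert 0 <= pe_id < 512              # up to 8 lanes x 64 PEs
    word = pe_id
    word = (word << 5) | OPCODES[opcode]
    word = (word << 10) | (addr_base & 0x3FF)
    word = (word << 8) | (loop_count & 0xFF)
    return word

# Example: configure PE 7 to run OP_SML8 over 64 loop iterations.
conf = encode_conf(7, "OP_SML8", addr_base=0x40, loop_count=64)
```

In the real system a stream of such words would be pushed through the on-chip FIFO via DMA, after which the PEs run autonomously.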
2. Memory Hierarchy and Off-Chip Interfaces
Each PE's LMM (8 KB, expandable to 32 KB) is double-buffered to hide memory latency. LMM storage is software-managed, supporting all weight, activation, and partial sum fragments arising in matrix multiplication and convolution kernels. On-chip buffer per lane is 512 KB.
External connectivity is via an AXI-4 interface to DDR4 DRAM (8 GB for OS, 4 GB for DMA). Measured off-chip bandwidth on the Versal platform is approximately 25 GB/s. Data and configuration traffic use dedicated DMA channels between host and programmable logic (PL) regions, providing parallel, low-latency command and data transmission (Ando et al., 4 Nov 2025).
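The double-buffered LMM discipline can be sketched as a ping-pong loop: while the PE computes on one half of the scratchpad, DMA fills the other. The function names below are stand-ins; the actual hardware sequences this inside the PE micro-program.

```python
# Sketch of double-buffered (ping-pong) scratchpad use, as in the per-PE LMM.
# dma_load and compute are illustrative stand-ins for an AXI/DMA transfer
# and the PE's MAC loop, respectively.
def dma_load(tile):
    return list(tile)          # stand-in: copy a tile into the scratchpad

def compute(buf):
    return sum(buf)            # stand-in: reduce the buffered tile

def process_tiles(tiles):
    results = []
    buffers = [None, None]               # two halves of the LMM
    buffers[0] = dma_load(tiles[0])      # prefetch the first tile
    for i in range(len(tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(tiles):
            # Overlap: fetch the next tile while computing on the current one.
            buffers[nxt] = dma_load(tiles[i + 1])
        results.append(compute(buffers[cur]))
    return results
```

The overlap hides the latency of each `dma_load` behind the preceding `compute`, which is the point of the double-buffered design.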
3. Performance Metrics and Key Parameters
Salient parameters and measurable outcomes for IMAX3 are summarized:
| Parameter | Value (FPGA Prototype) | Projected ASIC |
|---|---|---|
| PEs per lane | 64 | 64 |
| Lanes per system | up to 8 | up to 8 |
| LMM per PE | 8 KB | 8 KB-32 KB |
| On-chip buffer per lane | 512 KB | 512 KB |
| Clock frequency | 145 MHz | 840 MHz |
| Peak lane throughput | 18.56 GOPS | 107.52 GOPS |
| System throughput (8 lanes) | 148.5 GOPS | 860.2 GOPS |
| Off-chip bandwidth | 25 GB/s | projected higher |
| Energy per int8 MAC (ASIC proj.) | — | — |
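The peak-throughput entries are consistent with each PE retiring one MAC (counted as 2 operations) per cycle, so the table values can be reproduced directly:

```python
# Peak throughput = n_PEs x 2 ops/MAC x clock. This reproduces the table's
# lane and system figures for both the FPGA prototype and the ASIC projection.
def peak_gops(n_pes: int, f_mhz: float) -> float:
    return n_pes * 2 * f_mhz / 1e3   # GOPS

fpga_lane = peak_gops(64, 145)       # 18.56 GOPS
asic_lane = peak_gops(64, 840)       # 107.52 GOPS
fpga_sys = 8 * fpga_lane             # 148.48 ~ 148.5 GOPS
asic_sys = 8 * asic_lane             # 860.16 ~ 860.2 GOPS
```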
End-to-end latency and power-delay product (PDP) for one step of 512×512 SD-Turbo inference place the projected IMAX3 ASIC at 754.5 s / 558.0 s and 39,853 J / 26,620 J (for Q3_K and Q8_0 quantization, respectively). In kernel-only execution, a single-lane array outperforms the ARM Cortex-A72 host by ≈1.1×, and at the projected 840 MHz clock it is competitive with a CPU core. Multi-lane scaling is near-ideal up to 2 lanes, after which the host's DMA bandwidth limits further performance gains (Ando et al., 4 Nov 2025).
4. Workload Mapping, Quantization, and Instruction Fusion
IMAX3 achieves high arithmetic intensity and efficiency via domain-specific mapping of stable-diffusion.cpp dot-product and convolution kernels. Supported quantization schemes include:
- Q8_0: 8-bit weight × 8-bit activation, yielding 24-bit partial sums and FP32 accumulations.
- Q3_K: 3-bit weights with 6-bit scaling, restructured as fused 5-bit scale and 3-bit weights via OP_CVT53.
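A minimal sketch of the Q8_0 format helps make the quantization scheme concrete. The block size and rounding follow the common ggml convention (32-element blocks, one floating-point scale per block, int8 values); the kernels described in the paper consume data in this family of formats, but the exact packing details here are an assumption.

```python
# Minimal sketch of Q8_0 block quantization (ggml-style convention, assumed):
# each block stores one FP scale plus int8 quantized values.
def quantize_q8_0(block):
    amax = max(abs(x) for x in block)
    scale = amax / 127.0 if amax else 1.0
    q = [max(-128, min(127, round(x / scale))) for x in block]
    return scale, q

def dequantize_q8_0(scale, q):
    return [scale * v for v in q]

scale, q = quantize_q8_0([0.5, -1.0, 0.25, 1.0])
approx = dequantize_q8_0(scale, q)   # close to the original values
```

On the accelerator, the 8-bit products of such blocks accumulate into 24-bit partial sums before the final FP32 accumulation, matching the Q8_0 datapath described above.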
Dot-product loops are tiled into fixed-size fragments, with groups of 12 PEs cooperating synchronously per tile. For a given dot-product dimension, the resulting tiles are distributed across 46 PEs (Q8_0) or 51 PEs (Q3_K), with multi-lane parallelism harnessed for independent dot-products (e.g., multiple output channels) (Ando et al., 4 Nov 2025).
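The tiled mapping can be sketched in scalar form: a long dot product is split into fixed-size tiles, each tile's partial sum is computed independently (one PE group per tile), and the partials are combined, which the hardware performs systolically via shift-and-sum. The tile size is a free parameter here, since the exact fragment length is not reproduced in this text.

```python
# Sketch of the tiled dot-product mapping. Each tile corresponds to the work
# of one PE group; the final sum models the systolic shift-and-sum reduction.
# The tile size is an illustrative parameter, not the hardware's actual value.
def tiled_dot(a, b, tile=12):
    assert len(a) == len(b)
    partials = []
    for start in range(0, len(a), tile):
        end = start + tile
        partials.append(sum(x * y for x, y in zip(a[start:end], b[start:end])))
    return sum(partials)   # shift-and-sum reduction across PE groups
```

Independent dot-products (e.g., different output channels) would run on different lanes, which is where the multi-lane parallelism enters.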
Kernel fusion is implemented through:
- OP_SML8: 2-way SIMD 8-bit multiply-add.
- OP_AD24: SIMD 24-bit addition.
- OP_CVT53: fused operation for 5-bit scale and 3-bit weight multiply–accumulate.
By merging the original 3-stage process (load, multiply, convert) into a 2-stage pipeline (OP_SML8+OP_AD24, OP_CVT53), IMAX3 reduces LMM data movement by 25% and eliminates conversion-stage latency (Ando et al., 4 Nov 2025).
5. Experimental Evaluation and Resource Utilization
FPGA prototyping confirms functional and quantitative viability of the architectural approach. A single lane (64 PEs) occupies ≈128 DSP48E2s, ≈16 K LUTs, and ≈4 MB URAM. The four-board FPGA system (up to 8 lanes) serves as an empirical baseline for ASIC projections.
Performance comparison across platforms for a single SD-Turbo step:
| Device | Q3_K Latency [s] | Q8_0 Latency [s] | Power [W] | PDP [J] |
|---|---|---|---|---|
| ARM Cortex-A72 (host) | 809.7 | 625.1 | 1.5 | 1,214.6 |
| IMAX3 (FPGA) | 790.3 | 654.7 | 180 | 142,254 |
| IMAX3 (ASIC proj.) | 754.5 | 558.0 | 52.8 (Q3_K) / 47.7 (Q8_0) | 39,853 / 26,620 |
| Intel Xeon w5-2465X | 59.3 | – | 200 | 11,860 |
| NVIDIA GTX 1080 Ti | 16.2 | – | 250 | 4,050 |
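The PDP column is simply latency multiplied by power, so the table rows can be cross-checked directly (using the Q3_K latencies, since those are listed for every device; the ASIC figures round to the reported values):

```python
# Cross-check of the PDP column: PDP = latency x power (joules).
def pdp(latency_s: float, power_w: float) -> float:
    return latency_s * power_w

a72 = pdp(809.7, 1.5)      # ~1,214.6 J
fpga = pdp(790.3, 180.0)   # ~142,254 J
xeon = pdp(59.3, 200.0)    # ~11,860 J
gpu = pdp(16.2, 250.0)     # ~4,050 J
```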
The data show that the projected IMAX3 ASIC achieves notable kernel-level energy efficiency per int8 MAC. However, for end-to-end workloads, GPUs remain superior due to more mature host–accelerator integration and higher off-chip bandwidth (Ando et al., 4 Nov 2025).
6. Design Insights and Future Architectural Directions
Guidelines for evolving the IMAX3 family include:
- Increasing the offload ratio: implementing FP16/FP32 dot-product kernels to shift a greater share of inner-loop computation from the host to the array, and fusing backward/activation kernels to better exploit PE pipeline locality.
- Host–accelerator parallelism: adopting a manycore CPU or hardware-managed DMA to support multi-lane operation beyond two lanes without bandwidth saturation, alongside integration of a dedicated on-chip scheduler.
- Memory hierarchy: expanding LMM per PE and adopting high-bandwidth memory (≥400 GB/s) to mitigate off-chip stalls, and enabling register-file broadcast for efficient sharing of common sub-expressions.
- ISA and reconfigurability: pruning underutilized PE functional units, compressing the ISA through kernel profiling, and introducing programmable macro-operations for frequently encountered AI primitives such as 1×1 convolution and attention mechanisms.
- Energy efficiency: integrating fine-grained power gating in idle PE lanes, enabling dynamic voltage/frequency scaling (DVFS) tuned by workload, and supporting mixed-precision accumulate chains (8, 16, or 32 bits per kernel) to minimize switching energy (Ando et al., 4 Nov 2025).
This suggests that while IMAX3 demonstrates robust kernel-level efficiency and architectural scalability, further system-level integration, richer on-chip memory, and improved host communication are required to close the end-to-end performance gap with contemporary GPUs. Continued refinement is projected to advance the platform as a foundation for next-generation, energy-efficient, on-device AI computation.